# GPT-2 Finetuning

Made by [Artem Konevskikh](https://aiculedssul.net/) 

Based on a notebook by [Max Woolf](http://minimaxir.com). To learn more about `gpt-2-simple`, visit [its GitHub repository](https://github.com/minimaxir/gpt-2-simple).

## Installation

In [None]:
#@title Imports
#@markdown Running this cell installs and imports the libraries needed to work with GPT-2.
!pip install git+https://github.com/minimaxir/gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import shutil

In [None]:
#@title Downloading GPT-2
#@markdown If you're retraining a model on new text, you need to download the GPT-2 model first. 

#@markdown There are two released sizes of GPT-2 that we can work with in Colab:

#@markdown * `124M` (default): the "small" model, 500MB on disk.
#@markdown * `355M`: the "medium" model, 1.5GB on disk.

#@markdown A larger model has more knowledge, but takes longer to finetune and longer to generate text. You can specify which base model to use by changing `model_name`.

model_name = "124M" #@param ["124M", "355M"] {allow-input: true}

#@markdown This cell downloads the model from Google Cloud Storage and saves it in the Colaboratory VM at `/models/<model_name>`.

#@markdown This model isn't permanently saved in the Colaboratory VM; you'll have to redownload it if you want to retrain it at a later time.


gpt2.download_gpt2(model_name=model_name)


In [None]:
#@title Mounting Google Drive

#@markdown Colab notebooks run on virtual machines, so any data stored in them vanishes as soon as the VM is closed or reset. The best way to keep your input data and trained models is to mount your Google Drive and store them there.

#@markdown After running this cell you will get a link. Follow it, grant access to your Drive, and copy the auth token. Paste the token into the input field below and press Enter.

gpt2.mount_gdrive()

### Dataset Text File

To finetune GPT-2 on your own texts, prepare a single plain-text file: collect your books and articles and copy all of them into that one file.
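
If your texts are spread over several `.txt` files, merging them can be scripted instead of done by hand. The following is a minimal sketch, not part of `gpt-2-simple`: the helper name `combine_text_files` and the blank-line separator are assumptions made for illustration.

```python
from pathlib import Path


def combine_text_files(source_dir, output_path, separator="\n\n"):
    """Concatenate every .txt file in source_dir into one training file.

    Hypothetical helper; returns the number of files that were merged.
    """
    output = Path(output_path).resolve()
    # Sort for a reproducible order and skip the output file itself,
    # in case it is written into the same directory as the sources.
    sources = sorted(
        p for p in Path(source_dir).glob("*.txt") if p.resolve() != output
    )
    # Ignore undecodable bytes so one badly encoded file does not stop the merge.
    texts = [p.read_text(encoding="utf-8", errors="ignore").strip() for p in sources]
    output.write_text(separator.join(texts) + "\n", encoding="utf-8")
    return len(sources)
```

For example, `combine_text_files('/content/texts', '/content/dataset.txt')` would merge the files in a (hypothetical) `/content/texts` folder; the resulting path can then be used as `file_name` in the upload cell.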

This notebook expects a plain `.txt` file, so if your books are in PDF format, a converter such as https://www.pdf2go.com/pdf-to-text might be useful.

There are two ways to load training data: directly into this notebook, or from Google Drive.
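
If you are unsure which way fits your file, you can check its size first. This is a rough sketch that uses the 10 MB threshold mentioned in the upload cells; `suggest_upload_method` is a hypothetical helper, not a `gpt-2-simple` or Colab function.

```python
import os

SIZE_LIMIT_MB = 10  # threshold taken from the upload cells in this notebook


def suggest_upload_method(path, limit_mb=SIZE_LIMIT_MB):
    """Return 'direct' for files under limit_mb megabytes, otherwise 'drive'."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return "direct" if size_mb < limit_mb else "drive"
```

The threshold is a rule of thumb rather than a hard limit, so treat the result as a suggestion only.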

In [None]:
#@title Uploading a Text File directly to Colaboratory

#@markdown The first way works well if your file is smaller than 10 MB. To upload it, open the Colaboratory sidebar on the left of the screen, select *Files* (folder icon), and use the *Upload* button or drag and drop the file directly into the file list.

#@markdown Then copy the file path into the input below and run this cell.

file_name = '/content/bb-vs-ooo.txt' #@param {type: "string"}

In [None]:
#@title Uploading a Text File from Google Drive

#@markdown The second way is preferable, especially if your file is larger than 10 MB.

#@markdown Upload the file to your Drive, find it in the *Files* sidebar, copy its path into the input below, and run this cell.

gdrive_file_name = '/content/drive/MyDrive/medialab/bb-vs-ooo.txt' #@param {type: "string"}
# Copy the file from Drive into the working directory under its base name
file_name = gdrive_file_name.split('/')[-1]
shutil.copyfile(gdrive_file_name, file_name)


## Finetune GPT-2

The next cells start the actual finetuning of GPT-2. They create a persistent TensorFlow session which stores the training config, then run the training for the specified number of `steps`. To have the finetuning run indefinitely, set `steps = -1`.

The model checkpoints are saved in `/checkpoint/<run_name>` (`/checkpoint/run1` by default). A checkpoint is written every `save_every` steps (set in the parameters cell below) and when the cell is stopped.

The training might time out after about 4 hours; make sure you stop training and save the results before then so you don't lose them!
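
Before the session times out, it can be useful to confirm that a checkpoint was actually written. The sketch below assumes that checkpoints appear as TensorFlow files named `model-<step>.index` inside `checkpoint/<run_name>`; that file naming is an assumption about the on-disk layout, and `latest_checkpoint_step` is a hypothetical helper, not part of `gpt-2-simple`.

```python
import re
from pathlib import Path


def latest_checkpoint_step(run_name, checkpoint_dir="checkpoint"):
    """Return the highest step that has a saved checkpoint, or None if none exist."""
    run_dir = Path(checkpoint_dir) / run_name
    if not run_dir.is_dir():
        return None
    steps = []
    # Assumed naming scheme: model-<step>.index (one index file per checkpoint)
    for path in run_dir.glob("model-*.index"):
        match = re.fullmatch(r"model-(\d+)\.index", path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

If `latest_checkpoint_step(run_name)` returns `None`, nothing has been saved yet for that run, so stopping the VM at that point would lose the training progress.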

In [None]:
#@title Set parameters

#@markdown **Model**

#@markdown Choose the name of the model you downloaded earlier.
model_name="124M" #@param ["124M", "355M"] {allow-input: true}

#@markdown Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
restore_from='fresh' #@param ["fresh", "latest"] {allow-input: true}

#@markdown Run name: the subfolder within `checkpoint` where the model is saved. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
run_name='gpt-medialab-bbooo' #@param {type: "string"}

#@markdown **Steps**

#@markdown Number of steps to train the model.
steps=200 #@param {type: "number"}

#@markdown Number of steps between printing example outputs.
sample_every=30 #@param {type: "number"}

#@markdown Number of steps between saving model checkpoints.
save_every=50 #@param {type: "number"}

#@markdown Number of steps between printing training progress.
print_every=10 #@param {type: "number"}

#@markdown **Other**

#@markdown Learning rate for training (default `1e-4`; can be lowered to `1e-5` if you have less than 1 MB of input data).
learning_rate=0.000055 #@param {type:"slider", min:1e-5, max:1e-4, step:4.5e-5}

#@markdown Check this box if you want to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies.
overwrite=False #@param {type: "boolean"}

In [None]:
#@title Run finetuning
#@markdown **IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports and settings but not recopy files.
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              steps=steps,
              restore_from=restore_from,
              run_name=run_name,
              print_every=print_every,
              sample_every=sample_every,
              save_every=save_every,
              learning_rate=learning_rate,
              overwrite=overwrite,
              )

In [None]:
#@title Save results

#@markdown Running this cell copies the checkpoint folder to your own Google Drive as a *.tar* archive.

#@markdown You can use this file later to generate text.

gpt2.copy_checkpoint_to_gdrive(run_name=run_name)

## That's all! Now we can generate the text!