# Finetuning GPT-2 using the gpt-2-simple library

*Based on the notebook of [Max Woolf](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce)*

In the following, we will finetune a pre-trained GPT-2 model on a corpus of children's books using the `gpt-2-simple` library.

## 1. Getting ready

### 1.1. Install Libraries

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



### 1.2. GPU

(Text from Max Woolf): Colaboratory uses either an Nvidia T4 GPU or an Nvidia K80 GPU. The T4 is slightly faster than the old K80 for training GPT-2, and has more memory, allowing you to train the larger GPT-2 models and generate more text.

You can verify which GPU is active by running the cell below. In the run recorded here, Colab assigned a Tesla P100 with about 16 GB of memory.

In [1]:
!nvidia-smi

Sat May  8 08:31:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### 1.3. Downloading GPT-2

For fine-tuning and retraining, we will need to download the GPT-2 model first.

We will use the smallest of the four available versions, the one with `124M` parameters; it takes about 500 MB on disk (versus the medium model with `355M` parameters and 1.5 GB on disk). The large model (with `774M` parameters) and the x-large model (with `1558M` parameters) cannot be finetuned on Colab.

The larger the model, the more knowledge it has, but the longer it takes to train and to generate text, and the more disk space it needs.

The next cell downloads the model from Google Cloud Storage and saves it in the Colaboratory VM at `/models/<model_name>`.

In [3]:
# Download small gpt2 model
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 243Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 2.41Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 661Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [01:16, 6.55Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 318Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 3.64Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 2.79Mit/s]
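
As a quick sanity check after the download, one could confirm that all files named in the log above are present. The sketch below is not part of the original notebook: `missing_model_files` is a hypothetical helper, and the relative path `models/124M` is an assumption based on the checkpoint path that appears in the training log.

```python
from pathlib import Path

# File names taken from the download log above.
EXPECTED_MODEL_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def missing_model_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_MODEL_FILES
            if not (model_dir / name).is_file()]


# Assumption: the model was saved to the relative path models/124M.
missing = missing_model_files(Path("models") / "124M")
print("Missing files:", missing if missing else "none")
```

An empty result suggests the download completed; any listed name points to a file worth fetching again.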


### 1.4. Mounting Google Drive

Mount your Google Drive in the VM and upload the training text file there. We train our model on a hand-selected collection of children's books from Project Gutenberg.

In [5]:
# Mount Google Drive
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# Give file name of the training text (file available at: "...data/stories/input_stories_toddlerpluschildren.txt")
file_name = "input_stories_toddlerpluschildren.txt"

In [7]:
gpt2.copy_file_from_gdrive(file_name)
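
Once the file is copied, it can help to look at its size, because the parameter notes in the finetuning section suggest lowering `learning_rate` from `1e-4` to `1e-5` for inputs under 1 MB. The following is a minimal sketch of that rule of thumb; `suggest_learning_rate` is a hypothetical helper (not a `gpt-2-simple` function), and reading "1 MB" as 10^6 bytes is an assumption.

```python
import os

ONE_MB = 1_000_000  # assumption: "1MB" is read as 10**6 bytes


def suggest_learning_rate(size_bytes):
    """Hypothetical helper: pick a learning rate from the input size."""
    return 1e-5 if size_bytes < ONE_MB else 1e-4


path = "input_stories_toddlerpluschildren.txt"
if os.path.exists(path):
    size = os.path.getsize(path)
    print(f"{path}: {size / ONE_MB:.2f} MB, "
          f"suggested learning_rate={suggest_learning_rate(size)}")
else:
    print(f"{path} not found; copy it from Google Drive first.")
```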

## 2. Finetuning GPT-2

(Text from Max Woolf): The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints will be saved in `/checkpoint/run1` by default; the cell below sets `run_name='run2'`, so they end up in `checkpoint/run2`. Checkpoints are saved every 500 steps (configurable via `save_every`) and when the cell is stopped.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printing example output.
* **`print_every`**: Number of steps between printing training progress.
* **`learning_rate`**: Learning rate for the training (default `1e-4`, can lower to `1e-5` if you have <1MB input data).
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.
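
To picture how `steps`, `print_every`, `sample_every`, and `save_every` interact, the sketch below lists which steps would trigger which action for the values used in this notebook. It is an illustrative model that assumes each action fires whenever the step count is a multiple of its interval; it is not the library's source code, and `training_events` is a hypothetical name. It also does not cover `steps = -1`.

```python
def training_events(steps, print_every, sample_every, save_every):
    """Map each step to the actions scheduled on it."""
    events = {}
    for step in range(1, steps + 1):
        actions = []
        if step % print_every == 0:
            actions.append("print")
        if step % sample_every == 0:
            actions.append("sample")
        if step % save_every == 0:
            actions.append("save")
        if actions:
            events[step] = actions
    return events


events = training_events(steps=1000, print_every=10,
                         sample_every=200, save_every=500)
save_steps = [s for s, a in events.items() if "save" in a]
sample_steps = [s for s, a in events.items() if "sample" in a]
print("checkpoints at steps:", save_steps)
print("samples at steps:", sample_steps)
```

Under these assumptions, the run writes two checkpoints and prints five batches of sample text, which is worth keeping in mind when choosing intervals for longer runs.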

In [8]:
%time

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run2',
              print_every=10,
              sample_every=200,
              save_every=500
              )

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:04<00:00,  4.57s/it]


dataset has 951927 tokens
Training...
[10 | 18.93] loss=3.24 avg=3.24
[20 | 31.42] loss=2.98 avg=3.11
[30 | 43.91] loss=2.98 avg=3.07
[40 | 56.39] loss=2.81 avg=3.00
[50 | 68.88] loss=2.76 avg=2.95
[60 | 81.35] loss=2.66 avg=2.90
[70 | 93.85] loss=2.90 avg=2.90
[80 | 106.35] loss=2.66 avg=2.87
[90 | 118.83] loss=2.95 avg=2.88
[100 | 131.34] loss=2.95 avg=2.89
[110 | 143.83] loss=2.84 avg=2.88
[120 | 156.31] loss=2.61 avg=2.86
[130 | 168.80] loss=2.69 avg=2.84
[140 | 181.27] loss=2.87 avg=2.85
[150 | 193.74] loss=2.70 avg=2.84
[160 | 206.22] loss=2.51 avg=2.81
[170 | 218.70] loss=2.66 avg=2.80
[180 | 231.19] loss=2.68 avg=2.80
[190 | 243.68] loss=2.77 avg=2.80
[200 | 256.16] loss=2.72 avg=2.79
 I have a very odd feeling about them. 
 There is only one way to go, and that is to take away the one-party-party that's only half the fun. 
 There's only one way for me to live, and that is the other way. 
' 'I wish I wouldn't,' said Jane as she started to walk past. 
 'The world's a long way fr
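
The progress lines in the log above follow the pattern `[step | time] loss=… avg=…`. A minimal sketch for pulling those numbers out of a saved log is shown below, for example to estimate training speed. It assumes the second field is elapsed seconds since training started; `parse_progress` is a hypothetical helper, and the three sample lines are copied from the output above.

```python
import re

LOG_LINE = re.compile(
    r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)"
)


def parse_progress(line):
    """Return (step, seconds, loss, avg) or None for non-progress lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    step, seconds, loss, avg = match.groups()
    return int(step), float(seconds), float(loss), float(avg)


log = """\
[10 | 18.93] loss=3.24 avg=3.24
[100 | 131.34] loss=2.95 avg=2.89
[200 | 256.16] loss=2.72 avg=2.79
"""
rows = [r for r in map(parse_progress, log.splitlines()) if r]
first, last = rows[0], rows[-1]
# Average time per step between the first and last parsed lines.
sec_per_step = (last[1] - first[1]) / (last[0] - first[0])
print(f"{sec_per_step:.2f} s/step, avg loss {first[3]} -> {last[3]}")
```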

After training, we copy the checkpoint folder to our Google Drive.

In [9]:
gpt2.copy_checkpoint_to_gdrive(run_name='run2')

## 3. Generating Text

### 3.1. Load the Model Checkpoint

The next cell copies the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [10]:
# Copy checkpoint file into Colab
gpt2.copy_checkpoint_from_gdrive(run_name='run2')

In Colab, we then load the retrained model checkpoint and the metadata necessary to generate text.

**IMPORTANT NOTE:** To rerun this cell, **restart the VM first**, then rerun the installs and imports.

In [2]:
# Loading the re-trained model checkpoint and metadata
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run2')

Loading checkpoint checkpoint/run2/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run2/model-1000


### 3.2. Generate Text From The Trained Model

We use `generate` to generate text from the loaded model. Optional parameters for `gpt2.generate` include:

* **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
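
To make `temperature`, `top_k`, and `top_p` more concrete, here is a toy sketch of how such settings reshape a next-token distribution. It is a generic illustration in plain Python under simplifying assumptions, not the code `gpt-2-simple` runs internally; the function names and the four example logits are made up for the demonstration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise (k <= 0 keeps all)."""
    if k <= 0:
        return list(probs)
    keep = set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])
    total = sum(probs[i] for i in keep)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probs)]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [q / total if i in keep else 0.0 for i, q in enumerate(probs)]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
for t in (0.7, 1.0):
    print(t, [round(q, 3) for q in softmax(logits, t)])
print("top_k=2:", [round(q, 3) for q in top_k_filter(softmax(logits), 2)])
print("top_p=0.9:", [round(q, 3) for q in top_p_filter(softmax(logits), 0.9)])
```

A lower temperature concentrates probability on the top-scoring token, while `top_k` and `top_p` remove the unlikely tail before sampling, which is why they tend to tame "crazy" output.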

In [3]:
gpt2.generate(sess, run_name='run2')

To understand the Games well, you must first understand what the good people of that country are like. 
 They have very generous and kind-hearted parents, and a very strong and determined and prosperous guild. 
 They are very industrious, and they are all highly respected. 
 The children of these people are the most valuable and the bravest. 
 They have the most splendid hopes—though they must always be hopeful at the worst possible moments. 
 They are most industrious in their preparations, and their hearts are most fervent in the pursuit of their hopes. 
 They are said to be the guardians of mankind—a most unvarying and reliable regard. 
 They are very good-hearted and patient. 
 They have the most splendid minds, and their minds are always ready to work for the good of mankind. 
 And now it is time we began to understand what they are like. 
                                                                                                                                Below is a list

In [7]:
gpt2.generate(sess,
              length=100,
              temperature=0.7,
              prefix="Once upon a time",
              nsamples=5,
              batch_size=5
              )

Once upon a time there was a king, and his kings were fair and upright. 
 The children could not tell whether he was a king-builder or a king-builder's son. 
 But the king-builder knew a king, and he made him a king-builder's son. 
 'Then all of a sudden,' said the child, 'there was a king, and he took away all the children of the land. 
 Then he placed in a cave some ice, and he
Once upon a time vast numbers of people were able to live in the great cities of the world, and millions of them were citizens of those great nations. 
 And these people were rude, ignorant, stupid, uncaring, stupidly happy, uncaring, uncaring, stupidly wicked. 
 The cities were like dream-changes in history. 
       'I was born in the great city and I built this city,' said the King of Babylon, 'to keep the
Once upon a time the gods and goddesses of that city and its people had their share in the flourishing of the country. 
 For through their presence all became aperçus, and the local people became a people.

In [5]:
seed = "Our story begins with an king that lived in his castle with his queen and their two children and they ruled over a large kingdom of happy people."
max_len = 350

In [6]:
gpt2.generate(sess,
              length=250,
              temperature=0.7,
              prefix=seed,
              nsamples=5,
              batch_size=5,
              top_k=40
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Our story begins with an king that lived in his castle with his queen and their two children and they ruled over a large kingdom of happy people. 
 Then the queen sent men to take the children as captives and to make them strong enough to become strong enough to fight, and she sent men of that tribe to dig for treasure and to carry the captives up the tower. 
 And then the king took the children with him to seek refuge in the mountains. 
 And the children were strong men, and the king took them with him to seek the land of the nymphs, the nymphs of the wood where the charm is, and he built a great city there. 
' 'And now,' said Jane, 'the children are strong men, and the king took them with him to seek the land of the blue-fish, the blue-fish of the sea, and he built a great city there. 
' 'And so we are told,' said Anthea, 'and so are all your ancestors, you and your baby brother and your bab

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.