<a href="https://colab.research.google.com/github/addadda023/GPT-2-text-generation/blob/master/Train_a_GPT_2_Text_Generating_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Text generation using [GPT-2-simple](https://github.com/minimaxir/gpt-2-simple), a Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model.

Let's install the packages. Note that TensorFlow 1.x is selected because the gpt-2-simple package doesn't support TensorFlow 2.0 yet. Keep this in mind if you later want to deploy the model as a Docker image.

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



### Check GPU status

Since a GPU is strongly recommended, check its status first. Remember to select GPU under Runtime -> Change runtime type.

In [2]:
# Check which GPU is being run 
!nvidia-smi

Tue Nov 12 23:45:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru
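
The same check can be made from Python, which is convenient if later cells should stop early when no GPU is attached. This is a minimal sketch, assuming `nvidia-smi` is on the PATH whenever a GPU runtime is selected; `gpu_summary` is a name made up for this example.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one 'name, total memory' string per visible GPU, or [] if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not installed on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
print(gpus if gpus else "No GPU detected: select one under Runtime -> Change runtime type.")
```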

### GPT-2 Model download

To train the model on new text, we need to download the GPT-2 model first. 
There are four released sizes of GPT-2:

1. 124M (default): the "small" model, 500MB on disk.
2. 355M: the "medium" model, 1.5GB on disk.
3. 774M: the "large" model, cannot currently be finetuned with Colaboratory but can be used to generate text from the pretrained model.
4. 1558M: the "extra large", true model. Will not work if a K80 GPU is attached to the notebook. (like 774M, it cannot be finetuned).

This next cell downloads it from Google Cloud Storage and saves it in the Colaboratory VM at /models/<model_name>.

In [0]:
gpt2.download_gpt2(model_name="124M")
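
The constraints in the size list above can be captured in a small lookup so the notebook fails fast on an unsuitable choice. This is only a sketch: `GPT2_SIZES` and `check_finetunable` are hypothetical names, not part of gpt-2-simple, and the values simply restate the list.

```python
# Restates the size list above; this is not an API of gpt-2-simple.
GPT2_SIZES = {
    "124M": {"label": "small", "finetune_on_colab": True},
    "355M": {"label": "medium", "finetune_on_colab": True},
    "774M": {"label": "large", "finetune_on_colab": False},
    "1558M": {"label": "extra large", "finetune_on_colab": False},
}


def check_finetunable(model_name):
    """Return model_name, or raise ValueError if it is unknown or generation-only in Colab."""
    if model_name not in GPT2_SIZES:
        raise ValueError("Unknown model %r; expected one of %s" % (model_name, sorted(GPT2_SIZES)))
    if not GPT2_SIZES[model_name]["finetune_on_colab"]:
        raise ValueError("%s can generate text but cannot be finetuned in Colaboratory" % model_name)
    return model_name


model_name = check_finetunable("124M")
print(model_name)
```

The validated name could then be passed to `gpt2.download_gpt2(model_name=...)`.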

### Uploading/loading your input text file.

The best way to get the training text into the Colaboratory VM, and to get the trained model out of Colaboratory, is to route both through Google Drive.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in/out. (it will ask for an auth code; that auth is not saved anywhere)

Alternatively, you can upload the text file directly through the notebook sidebar (top left) if it's smaller than **10MB**.

In [0]:
gpt2.mount_gdrive()

In [0]:
# Check contents of google drive
!ls "/content/drive/My Drive"

In [0]:
file_name = 'YTA_comments.txt'
gpt2.copy_file_from_gdrive(file_name)
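
Before training, it can help to confirm that the file arrived and to see how large it is, since the size decides whether the direct sidebar upload is an option. A small sketch using only the standard library; `describe_text_file` is a hypothetical helper and the 10MB constant just mirrors the limit mentioned above.

```python
import os

SIDEBAR_UPLOAD_LIMIT_MB = 10  # limit quoted in the text above


def describe_text_file(path):
    """Return (size_in_mb, fits_sidebar_upload) for a local file."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb, size_mb < SIDEBAR_UPLOAD_LIMIT_MB


# Only report when the file has actually been copied into the working directory.
if os.path.exists("YTA_comments.txt"):
    size_mb, small_enough = describe_text_file("YTA_comments.txt")
    print("%.1f MB, sidebar upload possible: %s" % (size_mb, small_enough))
```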

### Fine tuning GPT2

The next cell will start the finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of steps (to have the finetuning run indefinitely, set steps = -1).

The model checkpoints will be saved in `/checkpoint/run1` by default. Make sure to change the `run_name` variable if you're training different versions. The checkpoints are saved every 500 steps (can be changed) and when the cell is stopped.
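
To see which checkpoints a fixed-length run should leave behind, the schedule can be worked out ahead of time. This sketch assumes a checkpoint is written every `save_every` steps and once more when training finishes; `expected_checkpoint_steps` is a hypothetical helper, not a gpt-2-simple function, and it does not cover the extra save triggered by stopping the cell.

```python
def expected_checkpoint_steps(steps, save_every=500):
    """List the step counts at which a checkpoint is expected for a fixed-length run."""
    if steps <= 0 or save_every <= 0:
        raise ValueError("steps and save_every must be positive for a fixed-length run")
    saved = list(range(save_every, steps + 1, save_every))
    if not saved or saved[-1] != steps:
        saved.append(steps)  # assumed final save when training ends
    return saved


print(expected_checkpoint_steps(1000, save_every=200))  # [200, 400, 600, 800, 1000]
```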

The training might time out after a few hours, so make sure you end training and save the results so you don't lose them! You can simply stop the cell and it will auto-store the last checkpoint data. The model will serve from that last checkpoint.

**NOTE:** If you want to rerun this cell, restart the VM first (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Parameters for gpt2.finetune:

* **restore_from:** Set to fresh to start training from the base GPT-2, or set to latest to restart training from an existing checkpoint.
* **sample_every:** Number of steps between printed example outputs.
* **print_every:** Number of steps between printed training progress lines.
* **learning_rate:** Learning rate for the training (default 1e-4; can be lowered to 1e-5 if you have less than 1MB of input data).
* **run_name:** Subfolder within checkpoint to save the model. This is useful if you want to work with multiple models (will also need to specify run_name when loading the model).
* **overwrite:** Set to True if you want to continue finetuning an existing model (w/ restore_from='latest') without creating duplicate copies.
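
The learning-rate guidance above can be turned into a simple rule of thumb. A sketch only: `suggest_learning_rate` is a hypothetical helper, and the 1MB threshold and both rates are taken directly from the **learning_rate** bullet.

```python
ONE_MB = 1024 * 1024


def suggest_learning_rate(dataset_size_bytes):
    """Return 1e-5 for inputs under 1MB, otherwise the 1e-4 default."""
    return 1e-5 if dataset_size_bytes < ONE_MB else 1e-4


# A 300KB file gets the lower rate; a 30MB file keeps the default.
print(suggest_learning_rate(300 * 1024))
print(suggest_learning_rate(30 * ONE_MB))
```

The result could be passed as the `learning_rate` argument of `gpt2.finetune`.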

The input used to finetune this model is from the first 6 months of the subreddit [AITA](https://www.reddit.com/r/AmItheAsshole/). The texts are used purely for educational purposes, and the author doesn't endorse any of the writings.

In [5]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=200
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:33<00:00, 33.59s/it]


dataset has 7201019 tokens
Training...
[10 | 17.95] loss=3.47 avg=3.47
[20 | 30.27] loss=3.50 avg=3.48
[30 | 42.58] loss=3.38 avg=3.45
[40 | 54.91] loss=3.47 avg=3.45
[50 | 67.25] loss=3.26 avg=3.42
[60 | 79.56] loss=3.43 avg=3.42
[70 | 91.85] loss=3.38 avg=3.41
[80 | 104.21] loss=3.38 avg=3.41
[90 | 116.53] loss=3.22 avg=3.39
[100 | 128.83] loss=3.22 avg=3.37
[110 | 141.15] loss=3.24 avg=3.36
[120 | 153.47] loss=3.17 avg=3.34
[130 | 165.79] loss=3.21 avg=3.33
[140 | 178.12] loss=3.16 avg=3.32
[150 | 190.43] loss=3.29 avg=3.32
[160 | 202.76] loss=3.21 avg=3.31
[170 | 215.07] loss=3.17 avg=3.30
[180 | 227.41] loss=3.18 avg=3.29
[190 | 239.74] loss=3.29 avg=3.29
[200 | 252.05] loss=3.17 avg=3.28
Saving checkpoint/run_1000000_comments_final/model-200
 rights if his life is going to suffer if he finds out the truth. 
I have a story that a woman is dating an actress whose name I recognize and my entire life I would have to live like I'm her friend and it's not even worth it.
The reason she 
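
The progress lines printed during training are regular enough to parse, which makes it easy to estimate speed or track the loss. The sketch below assumes each line has the form `[step | elapsed] loss=... avg=...` and that the second field is elapsed seconds; `parse_training_log` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_training_log(text):
    """Return a list of (step, elapsed_seconds, loss, avg_loss) tuples."""
    return [
        (int(step), float(elapsed), float(loss), float(avg))
        for step, elapsed, loss, avg in LOG_LINE.findall(text)
    ]


sample_log = """
[10 | 17.95] loss=3.47 avg=3.47
[100 | 128.83] loss=3.22 avg=3.37
[200 | 252.05] loss=3.17 avg=3.28
"""

rows = parse_training_log(sample_log)
first, last = rows[0], rows[-1]
seconds_per_step = (last[1] - first[1]) / (last[0] - first[0])
print("%.2f s/step, avg loss %.2f -> %.2f" % (seconds_per_step, first[3], last[3]))
```

Under that reading of the log, the 1000 steps requested above would take roughly 20 minutes on this GPU.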

Remember to copy the last checkpoint to Google Drive. You can then download the model from Google Drive.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

### Generate text from trained model

Use the generate command to generate a sample output. 

Helpful parameters for gpt2.generate:

* **length:** Number of tokens to generate (default 1023, the maximum)
* **temperature:** The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **top_k:** Limits the generated guesses to the top k guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set top_k=40)
* **top_p:** Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with top_p=0.9)
* **truncate:** Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first <|endoftext|>). It may be useful to combine this with a smaller length if the input texts are short. You can also use `'\n'` to generate only 1 line of output.
* **include_prefix:** If using truncate and `include_prefix=False`, the specified prefix will not be included in the returned text.
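
To make `temperature`, `top_k` and `top_p` more concrete, here is a toy version of the filtering applied to a four-word vocabulary. It is a plain-Python illustration of the general technique, not the code gpt-2-simple runs; `filtered_distribution` and the example scores are invented for this sketch.

```python
import math


def filtered_distribution(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Return {token: probability} after temperature scaling, top-k and top-p filtering."""
    # Temperature: dividing the scores by a value below 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p > 0.0:
        kept, cumulative = [], 0.0
        for prob, tok in ranked:
            kept.append((prob, tok))
            cumulative += prob
            if cumulative >= top_p:
                break  # smallest set whose probability mass reaches top_p
        ranked = kept

    norm = sum(prob for prob, _ in ranked)
    return {tok: prob / norm for prob, tok in ranked}


toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.0, "zebra": -3.0}  # invented scores
print(filtered_distribution(toy_logits, temperature=1.0))
print(filtered_distribution(toy_logits, temperature=0.5))
print(filtered_distribution(toy_logits, temperature=1.0, top_k=2))
```

Lowering the temperature shifts probability toward "the", while `top_k=2` removes "cat" and "zebra" from consideration entirely.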

In [17]:
gpt2.generate(sess, run_name='run1',
              length=100,
              prefix='YTA.',
              truncate='\n')

YTA. If he makes you pay for the seat you can change seats in there so you can recline.
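
The effect of `truncate` and `include_prefix` on the returned text can be mimicked with ordinary string operations. This is a simplified sketch of the behaviour described in the parameter list, not the library's own implementation; `truncate_output` and the sample string are made up for illustration.

```python
def truncate_output(text, truncate, prefix=None, include_prefix=True):
    """Cut text at the first occurrence of `truncate`, optionally dropping a leading prefix."""
    if prefix and not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]
    cut = text.find(truncate)
    # The truncate sequence itself is excluded from the result.
    return text if cut == -1 else text[:cut]


raw = "YTA. First line.\nSecond line."  # invented sample, not model output
print(repr(truncate_output(raw, "\n", prefix="YTA.")))
print(repr(truncate_output(raw, "\n", prefix="YTA.", include_prefix=False)))
```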


### Loading a pretrained model

You can also load a previously finetuned model from its checkpoint and generate text with it. The example below assumes a checkpoint saved under `run_name='run2'`.

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run2')
gpt2.generate(sess, run_name='run2',
              length=100,
              prefix='Coca cola',
              truncate='\n')
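
Loading fails if the requested run does not exist, so it can be worth listing the available runs first. A minimal sketch, assuming checkpoints live in a `checkpoint/<run_name>` folder relative to the working directory as described in the fine-tuning section; `available_runs` is a hypothetical helper.

```python
import os


def available_runs(checkpoint_dir="checkpoint"):
    """Return the sorted names of run subfolders found under the checkpoint directory."""
    if not os.path.isdir(checkpoint_dir):
        return []
    return sorted(
        name for name in os.listdir(checkpoint_dir)
        if os.path.isdir(os.path.join(checkpoint_dir, name))
    )


runs = available_runs()
print(runs)
if "run2" not in runs:
    print("No 'run2' checkpoint found; finetune with run_name='run2' or copy it from Google Drive first.")
```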