# Train a GPT-2 Text-Generating Model w/ GPU For Free

credit: [Max Woolf](http://minimaxir.com)


Retrain an advanced text-generating neural network on any text dataset **for free on a GPU in Colaboratory** with `gpt-2-simple`!

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## GPU

Colaboratory uses either an Nvidia T4 GPU or an Nvidia K80 GPU. The T4 is slightly faster than the older K80 for training GPT-2, and it has more memory, which lets you train the larger GPT-2 models and generate more text.

You can verify which GPU is active by running the cell below.

In [2]:
!nvidia-smi

Wed Nov 25 07:37:53 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    12W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
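
If you would rather read the GPU name and memory in code than scan the table, `nvidia-smi` can print selected fields as CSV. The sketch below is one way to do that; `parse_gpu_query` and `query_gpus` are hypothetical helper names, not part of `gpt-2-simple`, and `query_gpus` only works on a runtime that actually has a GPU.

```python
import subprocess


def parse_gpu_query(csv_text):
    """Parse `name, memory.total` CSV rows into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = [field.strip() for field in line.split(",")]
        # memory looks like "15079 MiB"; keep only the number
        gpus.append((name, int(memory.split()[0])))
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU name and total memory (needs a GPU runtime)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_gpu_query(out)


# Example row matching the T4 reported in the table above
print(parse_gpu_query("Tesla T4, 15079 MiB"))
```

On a GPU runtime, `query_gpus()` returns the same kind of list for the active device.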

## Downloading GPT-2

In [3]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 311Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 118Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 329Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:04, 119Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 302Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 151Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 187Mit/s]                                                       


## Mounting Google Drive

In [4]:
gpt2.mount_gdrive()

Mounted at /content/drive


In [5]:
file_name = "for_train.txt"

If your text file is larger than 10MB, it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

In [6]:
gpt2.copy_file_from_gdrive(file_name)
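
To decide which upload route applies, you can check the file size on your own machine before uploading. This is a minimal sketch; `should_use_drive` is a hypothetical helper, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
import os

# The 10MB guideline from the text, interpreted as mebibytes (an assumption)
DRIVE_THRESHOLD_BYTES = 10 * 1024 * 1024


def should_use_drive(path, threshold=DRIVE_THRESHOLD_BYTES):
    """Return True when the file is large enough that going through Drive is advised."""
    return os.path.getsize(path) > threshold
```

For example, `should_use_drive("for_train.txt")` returns `True` only when the training file exceeds the threshold.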

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. To have the finetuning run indefinitely, set `steps = -1`.

The model checkpoints are saved in `/checkpoint/run1` by default; the cell below sets `run_name='run2'`, so this run writes to `checkpoint/run2`. Checkpoints are saved every 500 steps (configurable via `save_every`) and when the cell is stopped.

In [7]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='latest',
              run_name='run2',
              overwrite=True,
              print_every=10,
              sample_every=200,
              save_every=500)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:05<00:00,  5.24s/it]


dataset has 1090522 tokens
Training...
Saving checkpoint/run2/model-0
[10 | 28.88] loss=2.70 avg=2.70
[20 | 53.18] loss=2.78 avg=2.74
[30 | 76.55] loss=2.58 avg=2.69
[40 | 99.89] loss=2.39 avg=2.61
[50 | 123.53] loss=2.50 avg=2.59
[60 | 147.13] loss=2.58 avg=2.59
[70 | 170.60] loss=2.51 avg=2.58
[80 | 194.08] loss=2.62 avg=2.58
[90 | 217.60] loss=2.46 avg=2.57
[100 | 241.20] loss=2.46 avg=2.56
[110 | 264.75] loss=2.39 avg=2.54
[120 | 288.26] loss=2.45 avg=2.53
[130 | 311.75] loss=2.63 avg=2.54
[140 | 335.22] loss=2.51 avg=2.54
[150 | 358.70] loss=2.53 avg=2.54
[160 | 382.23] loss=2.25 avg=2.52
[170 | 405.81] loss=2.55 avg=2.52
[180 | 429.36] loss=2.36 avg=2.51
[190 | 452.88] loss=2.24 avg=2.50
[200 | 476.37] loss=2.47 avg=2.49
 talented on a bike. He's a perfectionist. Always trying out new things.
KEITH: No problem.
JERRY: You don' think he's gonna drive, or something?
JERRY: No, I don't. (Jerry's apartment) (Jerry is sitting on the couch)
JERRY: You know this is very flattering to he
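
Each progress line above appears to follow the pattern `[step | elapsed] loss=... avg=...`. Reading the second field as elapsed seconds is an assumption based on how steadily it grows. Under that assumption, a few lines of Python give a rough estimate of training speed; `parse_log` and `seconds_per_step` are hypothetical helpers.

```python
import re

# Matches lines such as "[10 | 28.88] loss=2.70 avg=2.70"
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_log(text):
    """Return (step, elapsed_seconds, loss, avg) tuples from training output."""
    return [
        (int(step), float(elapsed), float(loss), float(avg))
        for step, elapsed, loss, avg in LOG_LINE.findall(text)
    ]


def seconds_per_step(rows):
    """Average seconds per step between the first and last parsed rows."""
    first_step, first_time = rows[0][0], rows[0][1]
    last_step, last_time = rows[-1][0], rows[-1][1]
    return (last_time - first_time) / (last_step - first_step)


# Three lines copied from the output above
log = """
[10 | 28.88] loss=2.70 avg=2.70
[100 | 241.20] loss=2.46 avg=2.56
[200 | 476.37] loss=2.47 avg=2.49
"""
rows = parse_log(log)
rate = seconds_per_step(rows)
print(f"{rate:.2f} s/step, about {rate * 1000 / 60:.0f} min for 1000 steps")
```

For this run that works out to roughly 2.4 seconds per step, so 1000 steps take on the order of 40 minutes on the T4.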

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first, then download it from there. The checkpoint folder is copied as a `.tar` archive; you can download it and extract it locally.

In [8]:
gpt2.copy_checkpoint_to_gdrive(run_name='run2')
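
Once the archive is on your computer, Python's standard `tarfile` module can unpack it, provided it is a tar file; the sketch below checks that first rather than assuming it. `extract_checkpoint` is a hypothetical helper, and the archive name in the usage note is an assumption about how the file is named in your Drive.

```python
import tarfile


def extract_checkpoint(archive_path, dest_dir="."):
    """Extract a downloaded checkpoint archive if it is a readable tar file."""
    if not tarfile.is_tarfile(archive_path):
        raise ValueError(f"{archive_path} is not a tar archive")
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()  # member paths, useful for a quick sanity check
        tar.extractall(dest_dir)
    return names
```

A call such as `extract_checkpoint("checkpoint_run2.tar")` would recreate the checkpoint folder in the current directory and return the list of extracted paths.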

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Load a Trained Model Checkpoint

Running the next cell will copy the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [37]:
gpt2.copy_checkpoint_from_gdrive(run_name='run2')

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [None]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run2')

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

In [9]:
gpt2.generate(sess, run_name='run2')

KRAMER: I'm still a virgin.
GEORGE: You're the one who did it!
JERRY: I'm not a virgin!
KRAMER: (Picks up a match from the table) This is the guy who told me to meet George at the bookstore and I told him to meet me there.
GEORGE: He's gonna see you after I meet him!
JERRY: That's why I'm writing this letter! I'm writing from my home in Vermont, where I'm living with my mother and three cats.
KRAMER: I'll be right there. (They leave)
JERRY: What did he say?
KRAMER: He said that I should see him again and that I should stay in business.
JERRY: You should stay in business.
KRAMER: Yeah, I should do the same.
JERRY: Well why don't you stick around. (Scene ends) Elaine's office) (Elaine is reading a newspaper while looking through a magazine. She notices a Polaroid photo of a man and woman kissing)
ELAINE: I can't believe this. (Shuts the door) (Scene ends) Jerry's apartment) (Elaine's at his apartment with a Polaroid shot of a man and a woman kissing)
ELAINE: (Shuts the door) (Scene ends)

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, run_name='run2', return_as_list=True)[0]`.

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
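
As far as I know, `gpt-2-simple` expects `nsamples` to be a multiple of `batch_size`; treat that as an assumption and check your installed version. Under it, a small hypothetical helper can pick the largest batch size that fits within the Colaboratory limit:

```python
def pick_batch_size(nsamples, max_batch=20):
    """Largest batch size <= max_batch that divides nsamples evenly."""
    for size in range(min(nsamples, max_batch), 0, -1):
        if nsamples % size == 0:
            return size
    return 1  # fallback for nsamples < 1


print(pick_batch_size(100))  # 20
print(pick_batch_size(5))    # 5
print(pick_batch_size(30))   # 15
```

You could then pass `batch_size=pick_batch_size(nsamples)` alongside `nsamples` when generating.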

Other optional-but-helpful parameters for `gpt2.generate` and friends:

* **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
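
To build some intuition for `temperature`, `top_k`, and `top_p`, here is a conceptual sketch on a toy four-token vocabulary. It is not how `gpt-2-simple` implements sampling internally; the function names and the example scores are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; k=0 keeps everything."""
    if k <= 0:
        return probs
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores, not real model output
for t in (0.7, 1.0):
    print("temperature", t, [round(q, 3) for q in softmax(logits, t)])
print("top_k=2  ", [round(q, 3) for q in top_k_filter(softmax(logits), 2)])
print("top_p=0.9", [round(q, 3) for q in top_p_filter(softmax(logits), 0.9)])
```

Lower temperatures concentrate probability on the most likely token, while `top_k` and `top_p` zero out the unlikely tail before sampling, which is why they tame "super crazy" output.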

In [11]:
gpt2.generate(sess,
              run_name='run2',  # match the run name used for finetuning
              length=250,
              temperature=0.9,
              prefix="FRANK",
              nsamples=5,
              batch_size=5
              )

FRANK: He thinks I'm a phony.
GEORGE: What are you gonna do now? You're gonna fall for my face? (Kramer comes back with Jerry)
JERRY: I'm gonna be like a burrito. I'm gonna go to his apartment and use the shower. I'll pretend he's calling from my purse.
KRAMER: Sounds like a good idea. (Monk's)
GEORGE: Kramer, you're fiending.
ELAINE: No, I'm never going to a party.
KRAMER: But if you're going to a party, I'm not going to be a phony.
ELAINE: Yeah, but Kramer, you know how you incorporate fake relationships into your acting career, 'cos they usually end up turning heads.
KRAMER: Neither am I. Opening Night. George and Jerry are there.
GEORGE: Hey look, what's the difference. I'm dressing!
JERRY: You're not dressing.
GEORGE: I am dressed!
JERRY: Really? Look at this, you've got like a beautiful face!
GEORGE: You might wanna see.
JER
FRANK: That was wonderful!
ELAINE: Oh, great... Full story here: "The Great Gatski"... And it's a beautiful lesson in how to make a normal conversation.
GEOR
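
The `truncate` and `include_prefix` options from the parameter list can be pictured as a post-processing step on each raw sample. The sketch below mimics the described behavior in plain Python; `truncate_sample` is a hypothetical function, not the library's code, and the example string is invented.

```python
def truncate_sample(text, truncate, prefix=None, include_prefix=True):
    """Cut text at the first `truncate` marker, optionally dropping the prefix."""
    end = text.find(truncate)
    if end == -1:
        return text  # marker not found: leave the sample unchanged
    result = text[:end]
    if prefix and not include_prefix and result.startswith(prefix):
        result = result[len(prefix):]
    return result


raw = "FRANK: That was wonderful!<|endoftext|>ELAINE: Oh, great..."
print(truncate_sample(raw, "<|endoftext|>"))
print(truncate_sample(raw, "<|endoftext|>", prefix="FRANK", include_prefix=False))
```

The first call keeps everything before the marker; the second also strips the leading `FRANK`.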

For bulk generation, you can write a large amount of generated text to a file and sort out the samples locally on your computer. The next cell will create a text file whose name includes a unique timestamp.

You can rerun the cells as many times as you want for even more generated texts!

In [29]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      run_name='run2',  # match the run name used for finetuning
                      destination_path=gen_file,
                      length=500,
                      prefix="KRAMER",
                      temperature=1,
                      nsamples=100,
                      batch_size=20
                      )

In [30]:
# may have to run twice to get file to download
files.download(gen_file)
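
Once the file is on your computer, you can split it back into individual samples. This sketch assumes the samples are separated by a line of twenty `=` characters; check your file and adjust `SAMPLE_DELIM` if it differs. `split_samples` and `load_samples` are hypothetical helpers.

```python
SAMPLE_DELIM = "=" * 20 + "\n"  # assumed separator between samples


def split_samples(text, delim=SAMPLE_DELIM):
    """Split generated output into a list of non-empty, stripped samples."""
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]


def load_samples(path, delim=SAMPLE_DELIM):
    """Read a generated text file and return its samples as a list."""
    with open(path, encoding="utf-8") as f:
        return split_samples(f.read(), delim)


# Tiny made-up example standing in for a downloaded file
demo = "KRAMER: Hey.\n" + SAMPLE_DELIM + "KRAMER: Giddy up!\n" + SAMPLE_DELIM
samples = split_samples(demo)
print(len(samples), "samples; longest has", max(map(len, samples)), "characters")
```

From there you can sort or filter `samples` however you like, for example by length or by keyword.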


# Etcetera

If the notebook has errors (e.g. GPU Sync Fail), force-kill the Colaboratory virtual machine and restart it with the command below:

In [None]:
!kill -9 -1