<a href="https://colab.research.google.com/github/YorkU-Cameroon/ml_labs/blob/main/Lab10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 Example

## Here we will run through the popular GPT-2:

GPT-2 is a neural-network-powered language model. A language model estimates how likely a piece of text is to occur in the world. For example, a language model can label the sentence “I take my dog for a walk” as more probable to exist (e.g., on the Internet) than the sentence “I take my banana for a walk.” This is true for sentences as well as phrases and, more generally, any sequence of characters.
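To make the dog/banana comparison concrete, here is a minimal sketch of a much simpler language model: a bigram model with add-one smoothing. The tiny corpus and the helper names (`tokenize`, `log_prob`) are made up for illustration; GPT-2 is a far larger neural network, but it serves the same purpose of scoring text.

```python
from collections import Counter
import math

# Tiny toy corpus (made up for illustration; a real model sees far more text).
corpus = [
    "i take my dog for a walk",
    "i take my dog to the park",
    "i eat my banana for breakfast",
    "we take the dog for a walk",
]

def tokenize(sentence):
    # Add start/end markers so sentence boundaries are modelled too.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

unigrams = Counter()  # how often each word appears as a context
bigrams = Counter()   # how often each (previous word, word) pair appears
for line in corpus:
    tokens = tokenize(line)
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len({w for line in corpus for w in tokenize(line)})

def log_prob(sentence):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    tokens = tokenize(sentence)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

dog = log_prob("I take my dog for a walk")
banana = log_prob("I take my banana for a walk")
print("dog sentence is more probable:", dog > banana)
```

Higher (less negative) log-probability means the model considers the sentence more likely.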

Like most language models, GPT-2 is trained on an unlabeled text dataset (in this case WebText, roughly 40 GB of text scraped from web pages linked on Reddit). The model reads text from left to right and must learn to predict each next token using only the tokens before it as context. It’s a simple training task that results in a powerful and generalizable model.
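To see how unlabeled text can supply its own training signal: GPT-2 is an autoregressive model, so each training example pairs the preceding tokens with the token that follows them. The sketch below is illustrative only; `next_token_examples` is a hypothetical helper, and a whitespace split stands in for GPT-2's byte-pair encoding.

```python
def next_token_examples(tokens, context_size):
    """Build (context, target) pairs for next-token prediction."""
    examples = []
    for i in range(1, len(tokens)):
        # The context is at most `context_size` tokens immediately before position i.
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples

text = "it was the best of times"
tokens = text.split()  # stand-in tokenizer, not GPT-2's real one
pairs = next_token_examples(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

No labels are needed: the target for every position is simply the next token already present in the text.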


---




Setup:

1) Make sure the GPU is enabled: go to Edit -> Notebook settings -> Hardware accelerator and select GPU.

2) Make a copy in your Google Drive: click "Copy to Drive" in the panel.

Note: Colab will reset after 12 hours. Make sure to save your model checkpoints to Google Drive at around the 10-11 hour mark or before, then go to Runtime -> Reset all runtimes. After that, copy your trained model back into Colab and resume training from the previous checkpoint.

Clone the repo and cd into it.




In [None]:
!git clone https://github.com/MissCrispenCakes/GPT2_example_CAMEROON.git

In [None]:
%cd GPT2_example_CAMEROON

Install requirements.

We will use an earlier version of TensorFlow than we have been using for our labs so far. This will make it easier for you to play around with the introductory content that exists online. (If time permits, this lab will be updated to a newer TF. Don't worry about the 'incompatible' warnings that pip prints; they get resolved.)

In [None]:
!pip3 install tensorflow==1.15.0rc3
!pip3 install -r requirements.txt

Mount Google Drive so you can save and load checkpoints later. You will have to log in to your Google account. You may already be connected from Lab 9; run this anyway to check.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Download the model data. We will only consider the small and medium pre-trained models (due to time and space constraints):

*   117M
*   345M



In [None]:
!python3 download_model.py 117M

In [None]:
!python3 download_model.py 345M

Set the Python I/O encoding to UTF-8.

In [None]:
%env PYTHONIOENCODING=UTF-8

Fetch checkpoints if you have them saved in Google Drive. If you don't have a saved checkpoint yet, DON'T worry! The following command will still run and check the drive; it just won't find anything yet and will print a "No such file or directory" message that you can ignore.

In [None]:
!cp -r /content/drive/My\ Drive/checkpoint/ /content/GPT2_example_CAMEROON/


Let's get our train on! In this case the file is A Tale of Two Cities (Charles Dickens) from Project Gutenberg. To change the dataset that GPT-2 fine-tunes on, change this URL to another .txt file and change the matching --dataset path in the training cell further down. You can use small datasets if you want, but be sure not to run the fine-tuning for too long or you will overfit badly. Roughly, expect interesting results within minutes to hours for datasets in the ballpark of 1 to tens of megabytes; below this you may want to stop the run early, as fine-tuning can be very fast.

In [None]:
!wget https://www.gutenberg.org/files/98/98-0.txt
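Since the advice above depends on dataset size, a quick size check after the download can help decide how long to train. This is a rough sketch: `dataset_size_mb` and `size_hint` are hypothetical helpers, and the 1 MB threshold is only an illustration of the "1-10s of megabytes" guidance, not a tuned value.

```python
import os

def dataset_size_mb(path):
    """Return the file size in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000

def size_hint(size_mb):
    # Illustrative threshold based on the guidance in the text above.
    if size_mb < 1:
        return "small: stop fine-tuning early to limit overfitting"
    return "1 MB or more: expect interesting results within minutes to hours"

path = "98-0.txt"  # the file fetched by the wget cell
if os.path.exists(path):
    mb = dataset_size_mb(path)
    print(f"{path}: {mb:.2f} MB ({size_hint(mb)})")
else:
    print(f"{path} not found; run the download cell first")
```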


Start training. Pass --model_name '345M' (as in the cell below) to use the 345M model, or '117M' for the smaller model. Again, as we are using an older version of TensorFlow, there will be some 'deprecated' warnings; we can ignore these. You can stop the training at any point with CTRL + C, _OR_ you can right click on the three dots to the left of the output below and select "interrupt execution" (the three dots appear directly below the spinning stop/play button beside the code). Training will continue indefinitely and will populate the screen with samples every 100 steps. To get a feel for the system, stop the training after the first 200 steps and move on. You can always come back, as we have set up checkpoints!

In [None]:
!chmod u+rwx train.py
!PYTHONPATH=src ./train.py --dataset /content/GPT2_example_CAMEROON/98-0.txt --model_name '345M'

Save our checkpoints to Google Drive so we can resume training later.

In [None]:
!cp -r /content/GPT2_example_CAMEROON/checkpoint /content/drive/My\ Drive/
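If you prefer to do the backup from Python, the sketch below copies a checkpoint directory and quietly reports when there is nothing to copy yet. `copy_checkpoints` is a hypothetical helper, and the two paths are assumptions taken from the cp commands in this notebook; adjust them if your Drive is mounted elsewhere.

```python
import os
import shutil

def copy_checkpoints(src, dst):
    """Copy the directory tree at src into dst; return False if src is missing."""
    if not os.path.isdir(src):
        print(f"No checkpoint directory at {src}; nothing to copy")
        return False
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            # copy2 overwrites older copies and keeps file timestamps.
            shutil.copy2(os.path.join(root, name), os.path.join(target_dir, name))
    return True

# Assumed locations, mirroring the cp commands in this notebook.
REPO_CKPT = "/content/GPT2_example_CAMEROON/checkpoint"
DRIVE_CKPT = "/content/drive/My Drive/checkpoint"

copy_checkpoints(REPO_CKPT, DRIVE_CKPT)    # back up to Drive
# copy_checkpoints(DRIVE_CKPT, REPO_CKPT)  # restore after a runtime reset
```

Swapping the two arguments gives the restore direction, so the same helper covers both ends of a training session.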

Load your trained model for use in sampling below (117M or 345M). If you don't have a trained model yet, DON'T worry! The following commands will still run and check the checkpoint folder but won't find anything yet. If you have trained only 117M or only 345M, run only the matching cell below, not both.

In [None]:
!cp -r /content/GPT2_example_CAMEROON/checkpoint/run1/* /content/GPT2_example_CAMEROON/models/117M/

In [None]:
!cp -r /content/GPT2_example_CAMEROON/checkpoint/run1/* /content/GPT2_example_CAMEROON/models/345M/

Now we will generate conditional samples from the model given a prompt you provide. Change the top-k hyperparameter if desired (default is 40). If you're using 345M, add "--model_name 345M".

# Have fun with the model inputs! Try words, sentences, paragraphs!
## You can enter whatever comes to mind! Ask a question. Give a comment. Paste in a passage from a book. Get creative!

The prompt will appear at the bottom of the output section. Wait for it, enter your text, and hit 'enter' on your keyboard to 'communicate' with your trained model. As before, you can 'interrupt execution' to move on. You can return at any time by running this section of code again.

In [None]:
!chmod u+rwx src/interactive_conditional_samples.py
!python3 src/interactive_conditional_samples.py --top_k 40 --model_name "345M"
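For intuition about what the `--top_k 40` flag above controls: top-k sampling keeps only the k highest-scoring candidate tokens, renormalizes their probabilities, and samples from that reduced set. The sketch below is a plain-Python illustration, not the repository's implementation; the vocabulary, the scores, and the helper names are made up.

```python
import math
import random

def top_k_probs(logits, k):
    """Keep the k highest logits, then apply softmax over only those."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtracting the peak keeps exp() numerically stable.
    weights = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def sample_top_k(logits, k, rng=random):
    """Draw one token index from the top-k distribution."""
    probs = top_k_probs(logits, k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]

vocab = ["walk", "run", "banana", "nap", "swim"]
logits = [3.0, 2.5, -1.0, 1.0, 0.5]  # made-up next-token scores
probs = top_k_probs(logits, k=2)
print({vocab[i]: round(p, 3) for i, p in probs.items()})
print("sampled:", vocab[sample_top_k(logits, k=2)])
```

A smaller k makes the output more predictable; a larger k allows rarer words through and makes the text more varied.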

To check flag descriptions, use:

In [None]:
!python3 src/interactive_conditional_samples.py -- --help

An alternative to interactive sample generation is the following. Again, "interrupt execution" when you are ready to move on.

Generate unconditional samples from the model. If you're using 345M, add "--model_name 345M".

In [None]:
!chmod u+rwx src/generate_unconditional_samples.py
!python3 src/generate_unconditional_samples.py --model_name "345M" | tee /tmp/samples
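Because the command above also writes its output to /tmp/samples, the samples can be split apart afterwards for closer reading. This sketch assumes each sample is preceded by a header line of the form `===== SAMPLE 1 =====`; that format is an assumption about the script's output, so adjust the pattern if your output looks different. `split_samples` is a hypothetical helper.

```python
import re

# Assumed header format, e.g. "======== SAMPLE 1 ========".
HEADER = re.compile(r"^=+ SAMPLE \d+ =+$", re.MULTILINE)

def split_samples(text):
    """Split captured output into individual samples, dropping empty chunks."""
    return [chunk.strip() for chunk in HEADER.split(text) if chunk.strip()]

# Small stand-in for the contents of /tmp/samples.
demo = (
    "======== SAMPLE 1 ========\n"
    "It was the best of times.\n"
    "======== SAMPLE 2 ========\n"
    "It was the worst of times.\n"
)
samples = split_samples(demo)
print(len(samples), "samples found")

# On Colab, after generating: samples = split_samples(open("/tmp/samples").read())
```

Any log lines printed before the first header would show up as an extra leading chunk, so check the first entry before using the list.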

To check flag descriptions, use:

In [None]:
!python3 src/generate_unconditional_samples.py -- --help