<a href="https://colab.research.google.com/github/bhaveshprajapat/dissertation-colab-gpt2/blob/main/Satire%20Language%20Modelling%20%5BColab%20with%20GPU%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modelling in Google Colab (Pro) with *GPUs*
Bhavesh Prajapat,
adapted from [Train a GPT-2 Text-Generating Model w/ GPU](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce) by [Max Woolf](http://minimaxir.com)

This Colab notebook shows the process used to generate fine-tuned models from the satire dataset produced as part of my dissertation project, 'Investigating Language Modelling Suitability for Originating Satire'.


## Fine-tuning stage
Sets up the Colab VM, downloads a GPT-2 model, and readies the dataset to fine-tune with.

### Setup
Sets TF version, and displays GPU info.

The size of the GPU Memory is important for determining which models you can run. Use Colab Pro for reliable access to higher-memory GPUs.
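
As a rough illustration of reading the memory figure programmatically, the sketch below pulls the total memory out of `nvidia-smi` text. The function name `total_gpu_memory_mib` is hypothetical, and the regular expression assumes the `used MiB / total MiB` layout that `nvidia-smi` prints in its summary table.

```python
import re

def total_gpu_memory_mib(nvidia_smi_text):
    """Return the total memory (MiB) of the first GPU listed, or None if absent."""
    # Assumes the "<used>MiB / <total>MiB" layout of the nvidia-smi summary table.
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_text)
    return int(match.group(2)) if match else None

# One row copied from the nvidia-smi output recorded in this notebook.
sample = "| N/A   39C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |"
print(total_gpu_memory_mib(sample))
```

In the notebook itself, the `gpu_info` string built in the setup cell could be passed in place of `sample`.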

In [None]:
%tensorflow_version 1.x

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

TensorFlow 1.x selected.
Sun May 10 16:10:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                         

Install pip requirements and mount Google Drive.

In [None]:
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [None]:
# Skip if Google Drive is already mounted
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


### Runtime Constants
`model_size` is one of:
*   `124M` (S, 1.5GB)
*   `355M` (M)
*   `774M` (L)
*   `1558M` (XL, 6.5GB)

L and XL are unlikely to work well with lower-memory GPUs.

Optionally, copy an existing checkpoint from Google Drive.
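
The run names used in this notebook (for example `GPT2S-A-SATIRE-500`) appear to combine the model size label, the dataset letter, and the step count. That reading is an assumption drawn from the names themselves. Under that assumption, a small hypothetical helper (`make_run_name` is not part of gpt-2-simple) can keep `model_size` and the run name in step:

```python
# Size labels as listed above: S, M, L, XL.
SIZE_LABELS = {"124M": "S", "355M": "M", "774M": "L", "1558M": "XL"}

def make_run_name(model_size, dataset_tag, steps):
    """Build a run name such as GPT2S-A-SATIRE-500 (assumed naming scheme)."""
    label = SIZE_LABELS[model_size]  # raises KeyError for an unknown size
    return "GPT2{}-{}-SATIRE-{}".format(label, dataset_tag, steps)

print(make_run_name("124M", "A", 500))
```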


In [None]:
model_size = "124M" #@param ["124M", "355M", "774M", "1558M"]
sess_run_name = "GPT2S-A-SATIRE-500" 

In [None]:
# Comment out if the checkpoint is not needed or does not exist on Google Drive
gpt2.copy_checkpoint_from_gdrive(run_name=sess_run_name)

### Load Dataset
The satire training dataset `DatasetA.zip` must be loaded into the Colab runtime.

In [None]:
# Comment and uncomment as necessary
!rm -rf DatasetA
!cp "/content/drive/My Drive/DatasetA.zip" .
!unzip -q DatasetA.zip -d DatasetA
# Set the dataset folder name
folder_name = "/content/DatasetA"

cp: cannot stat '/content/drive/My Drive/DatasetA.zip': No such file or directory
unzip:  cannot find or open DatasetA.zip, DatasetA.zip.zip or DatasetA.zip.ZIP.
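
The errors above appear when the archive is missing from Google Drive. As a sketch of an alternative, the same copy-and-unzip step can be done in Python with an explicit check, so a missing file fails with a clear message. The helper name `extract_dataset` is hypothetical.

```python
import os
import shutil
import zipfile

def extract_dataset(zip_path, dest_dir):
    """Unpack zip_path into a fresh dest_dir and return the extracted top-level names."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError("Dataset archive not found: " + zip_path)
    # Mirror `rm -rf DatasetA`: start from an empty destination folder.
    shutil.rmtree(dest_dir, ignore_errors=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

With the paths used in this notebook, the call would be `extract_dataset("/content/drive/My Drive/DatasetA.zip", "/content/DatasetA")`.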


### Finetune GPT-2

The next cell starts the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. To have the finetuning run indefinitely, set `steps = -1`.

The model checkpoints are saved in `checkpoint/[sess_run_name]` (relative to the working directory) by default. A checkpoint is written every `save_every` steps (1000 in the cell below) and whenever the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

Other optional-but-helpful parameters for `gpt2.finetune`:

* **`learning_rate`**: Learning rate for the training (default `1e-4`, can lower to `1e-5` if you have <1MB input data).
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.
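
The learning-rate hint above can be sketched as a small helper that measures the dataset folder and picks a rate accordingly. The function names are hypothetical, and treating 1MB as 1024 × 1024 bytes is an assumption.

```python
import os

ONE_MB = 1024 * 1024  # assumption: 1MB taken as 1024 * 1024 bytes

def dataset_size_bytes(folder):
    """Sum the sizes of all files under folder."""
    total = 0
    for root, _dirs, filenames in os.walk(folder):
        for name in filenames:
            total += os.path.getsize(os.path.join(root, name))
    return total

def suggested_learning_rate(size_bytes):
    """Return 1e-5 for datasets under 1MB, otherwise the 1e-4 default."""
    return 1e-5 if size_bytes < ONE_MB else 1e-4
```

The result could then be passed to `gpt2.finetune` as `learning_rate=suggested_learning_rate(dataset_size_bytes(folder_name))`.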

In [None]:
#@title Fine-tuning parameters

#@markdown Not all of the parameters listed above are included in the form.

gpt2.download_gpt2(model_name=model_size)

import tensorflow as tf # Import tf library directly
tf.reset_default_graph() # Allows this cell to be re-run without VM restart
sess = gpt2.start_tf_sess()

# gpt2.load_gpt2(sess, run_name=sess_run_name) # Load a backed-up checkpoint from Google Drive
dataset=folder_name
#@markdown Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
restore_from = "fresh" #@param ["fresh", "latest"] {allow-input: true}

#@markdown Number of fine-tuning steps to take
steps = 500 #@param {type:"integer"}

#@markdown Number of steps between training progress printouts
print_every = 10 #@param {type:"integer"}

#@markdown Number of steps between example outputs
sample_every = 500 #@param {type:"integer"}

#@markdown Number of steps between checkpoint saves
save_every = 1000 #@param {type:"integer"}
gpt2.finetune(sess,
              dataset=folder_name,
              steps=steps,
              restore_from=restore_from,
              run_name=sess_run_name,
              print_every=print_every,
              model_name=model_size,
              sample_every=sample_every,
              save_every=save_every
              )
gpt2.copy_checkpoint_to_gdrive(run_name=sess_run_name)

---
## Text-generation Stage

This immediately follows on from the previous Fine-tuning stage. However, `sess_run_name` can be the name of any model which has been backed up to Google Drive, not necessarily a model that has just been constructed in the above steps.
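
Before loading a backed-up model, it can help to confirm that its checkpoint folder actually reached the runtime. The sketch below is a minimal check under the assumption that checkpoints live in `checkpoint/<run name>`; the helper name `checkpoint_ready` is hypothetical.

```python
import os

def checkpoint_ready(run_name, checkpoint_dir="checkpoint"):
    """Return True if checkpoint_dir/run_name exists and contains at least one file."""
    path = os.path.join(checkpoint_dir, run_name)
    return os.path.isdir(path) and len(os.listdir(path)) > 0
```

For example, `checkpoint_ready(sess_run_name)` returning `False` suggests the copy from Google Drive did not succeed.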

### Setup, and set text-generation parameters

In [None]:
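# Copy a checkpoint from Google Drive to the Colab runtime.
# {sess_run_name} is expanded by IPython to the value of the Python variable.
!mkdir -p checkpoint
!cp -r "drive/My Drive/Saved Colab Checkpoints/{sess_run_name}" "checkpoint/{sess_run_name}"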
import tensorflow as tf
tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=sess_run_name)

#@markdown Length of each sample in tokens
text_length = 300 #@param {type:"integer"}
#@markdown The higher the temperature, the 'crazier' the text.
gen_temp = 0.7 #@param {type:"slider", min:0.7, max:1.0, step:0.01}
#@markdown Makes text sample generations conditional on a set prefix.
text_prefix = "Insert text here" #@param {type:"string"}
#@markdown Number of samples to generate (should be a multiple of the batch size)
nsamp = 5 #@param {type:"integer"}
#@markdown Batch size (samples generated in parallel; larger batches can speed up generation)
batch_s = 5 #@param {type:"integer"}

### Generate text

In [None]:
gpt2.generate(sess,
              run_name=sess_run_name,
              length=text_length,  
              temperature=gen_temp, 
              prefix=text_prefix,
              nsamples=nsamp,
              batch_size=batch_s 
              )
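
gpt-2-simple generates samples in batches, and to my knowledge `gpt2.generate` expects `nsamples` to be a multiple of `batch_size`. A minimal sketch of rounding the sample count up to fit, using a hypothetical helper name:

```python
def round_up_to_batch(nsamples, batch_size):
    """Return the smallest multiple of batch_size that is >= nsamples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    remainder = nsamples % batch_size
    return nsamples if remainder == 0 else nsamples + batch_size - remainder

print(round_up_to_batch(7, 5))
```

The rounded value could be used for `nsamp`, with any surplus samples discarded afterwards.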

### Text generation in bulk

You can generate a large amount of text to a file and sort through the samples locally on your computer. The next cell writes the generated text to a file with a unique timestamp in its name.

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
#@title Text-generation parameters

gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

#@markdown Length of each sample in tokens
text_length = 300 #@param {type:"integer"}
#@markdown The higher the temperature, the 'crazier' the text.
gen_temp = 0.7 #@param {type:"slider", min:0.7, max:1.0, step:0.01}
#@markdown Number of samples to generate (should be a multiple of the batch size)
nsamp = 5 #@param {type:"integer"}
#@markdown Batch size (samples generated in parallel; larger batches can speed up generation)
batch_s = 5 #@param {type:"integer"}
gpt2.generate_to_file(sess,
                      run_name=sess_run_name,  # use the loaded model's checkpoint folder
                      destination_path=gen_file,
                      length=text_length,
                      temperature=gen_temp,
                      nsamples=nsamp,
                      batch_size=batch_s
                      )

In [None]:
# You may have to run this cell twice to get the file to download
files.download(gen_file)
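
Once downloaded, the file can be split back into individual samples for sorting. The sketch below assumes the samples are separated by gpt-2-simple's default delimiter, a line of 20 `=` characters; if a different `sample_delim` was passed to `generate_to_file`, adjust `SAMPLE_DELIM` to match. The helper name `split_samples` is hypothetical.

```python
# Assumed default delimiter written between samples by generate_to_file.
SAMPLE_DELIM = "=" * 20 + "\n"

def split_samples(text, delim=SAMPLE_DELIM):
    """Split generated text into a list of non-empty, stripped samples."""
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]

example = "first sample\n" + SAMPLE_DELIM + "second sample\n" + SAMPLE_DELIM
print(split_samples(example))
```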