*GPT-2 Training Notebook for DiTTo YouTube Predictor*

Original by [Max Woolf](http://minimaxir.com), modified by [Greg Raiz](http://gregraiz.com), and used by the DiTTo team at Stevens Institute of Technology (SSW695A, Spring 2021).

For more about `gpt-2-simple`, see [its GitHub repository](https://github.com/minimaxir/gpt-2-simple).
For more information on how to use this notebook, see Max Woolf's [blog post](https://minimaxir.com/2019/09/howto-gpt2/) on GPT-2.
Max Woolf's original example notebook: [GPT2 Notebook](http://bit.ly/graiz_colab)
Greg Raiz's [YouTube tutorial](https://www.youtube.com/watch?v=R6KoIp1ETpM&t=247s)


In [1]:
model = "355M"  #@param ['124M', '355M', '774M', '1558M']
# Model size in millions of parameters.
# The 774M model is 3GB,
# the 1558M model is 6GB. Start small before going big.

iterations = 226  #@param {type: "number"}
# Number of training iterations (steps) to run.

trainingName = 'views_predictor'  #@param {type: "string"}
# Name for this run. Give each new model you train its own name.

file_name = 'init.csv'  #@param {type: "string"}
# Name of the training file in your Google Drive.


# Select the older TensorFlow 1.x runtime. Keep this comment on its own line:
# anything after the version on the magic line is read as part of the argument.
%tensorflow_version 1.x
# You will get warnings during the install, but that is OK.
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files


TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [2]:
!nvidia-smi

Tue Apr  6 15:34:32 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# download selected model
gpt2.download_gpt2(model_name=model)

Fetching checkpoint: 1.05Mit [00:00, 312Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 6.35Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 322Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [02:47, 8.46Mit/s]
Fetching model.ckpt.index: 1.05Mit [00:00, 714Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 4.91Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 7.51Mit/s]


In [4]:
# Mount Google Drive for model storage
gpt2.mount_gdrive()

Mounted at /content/drive


In [5]:
# copy training file to notebook working directory
gpt2.copy_file_from_gdrive(file_name) 
print(file_name)

init.csv
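
The contents of `init.csv` are not shown in this notebook. As a rough sketch of how such a file could be prepared, the code below assumes that gpt-2-simple treats a `.csv` dataset as one text per row, taken from the first column, with the first line skipped as a header. Under that assumption, each row should hold the full text in the same shape as the prompt used in the generation cell later on. The helper names, the header name `text`, and the sample values are all hypothetical.

```python
import csv
import os
import tempfile

# Tone names in the order used by the generation prompt in this notebook.
TONES = ["ANGER", "DISGUST", "FEAR", "JOY", "SADNESS",
         "TENTATIVE", "ANALYTICAL", "CONFIDENT"]

def format_row(scores, views):
    """Render one training text in the same shape as the generation prompt."""
    pairs = ", ".join(f"{name}={scores[name]}" for name in TONES)
    return f"Given [{pairs}], Views={views}"

def write_single_column_csv(rows, path):
    """Write one text per row in a single column, preceded by a header line."""
    with open(path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # header name is arbitrary (assumption)
        for scores, views in rows:
            writer.writerow([format_row(scores, views)])

# Hypothetical sample data, not taken from the real dataset.
example_scores = {name: 0 for name in TONES}
example_scores["JOY"] = 0.880435
rows = [(example_scores, 1000)]  # 1000 is a placeholder view count

out_path = os.path.join(tempfile.gettempdir(), "example_training.csv")
write_single_column_csv(rows, out_path)
print(format_row(example_scores, 1000))
```

Keeping the whole text in the first column matters under the stated assumption: if the first column were a row index, the model would only ever see the index values.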


In [None]:
# Fine-tune the model and save it to Google Drive.
# Parameters for gpt2.finetune:
# restore_from: Set to 'fresh' to start training from the base GPT-2, or to 'latest' to resume training from an existing checkpoint.
# sample_every: Number of steps between printed example outputs.
# print_every: Number of steps between training progress messages.
# save_every: Number of steps between checkpoint saves.
# learning_rate: Learning rate for the training (default 1e-4; can be lowered to 1e-5 if you have <1MB of input data).
# run_name: Subfolder within checkpoint to save the model. Useful if you want to work with multiple models (you will also need to specify run_name when loading the model).
# overwrite: Set to True to continue fine-tuning an existing model (with restore_from='latest') without creating duplicate copies.

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,    # file_name, model and trainingName are set in the first cell
              model_name=model,
              steps=iterations,
              restore_from='latest',
              overwrite=True,
              run_name=trainingName,
              print_every=10,
              sample_every=10,
              save_every=10,
              )

gpt2.copy_checkpoint_to_gdrive(run_name=trainingName)
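
The `learning_rate` note in the comments above (default 1e-4, lower to 1e-5 for less than 1MB of input) can be turned into a small helper. This is only a sketch of that rule of thumb: the function name `pick_learning_rate` and the exact 1 MB cutoff in bytes are assumptions, and the demo runs on a throwaway file so it does not need the real dataset.

```python
import os
import tempfile

ONE_MB = 1024 * 1024  # cutoff in bytes (assumption: 1 MB = 1024 * 1024 bytes)

def pick_learning_rate(path, default=1e-4, small_data=1e-5):
    """Return the lower rate for inputs under 1 MB, otherwise the default."""
    return small_data if os.path.getsize(path) < ONE_MB else default

# Demo on a small temporary file standing in for the training data.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"text\n" * 10)  # 50 bytes, well under 1 MB
demo_rate = pick_learning_rate(tmp.name)
os.remove(tmp.name)
print(demo_rate)

# In the notebook, the result could be passed to the training call, e.g.
#   gpt2.finetune(sess, dataset=file_name, ...,
#                 learning_rate=pick_learning_rate(file_name))
```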

In [2]:
# copy trained model checkpoint to notebook working directory and load

gpt2.copy_checkpoint_from_gdrive(run_name=trainingName)
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=trainingName)

Loading checkpoint checkpoint/views_predictor/model-226
INFO:tensorflow:Restoring parameters from checkpoint/views_predictor/model-226


In [3]:
# Generate text from the trained model

text_creativity = 70  #@param {type: "slider", min: 50, max: 100}
# Sampling temperature as a percentage; higher values give more random text.
gpt2.generate(sess, run_name=trainingName, temperature=text_creativity / 100)

<|startoftext|>13<|endoftext|>
<|startoftext|>14<|endoftext|>
<|startoftext|>15<|endoftext|>
<|startoftext|>16<|endoftext|>
<|startoftext|>17<|endoftext|>
<|startoftext|>18<|endoftext|>
<|startoftext|>19<|endoftext|>
<|startoftext|>20<|endoftext|>
<|startoftext|>21<|endoftext|>
<|startoftext|>22<|endoftext|>
<|startoftext|>23<|endoftext|>
<|startoftext|>24<|endoftext|>
<|startoftext|>25<|endoftext|>
<|startoftext|>26<|endoftext|>
<|startoftext|>27<|endoftext|>
<|startoftext|>28<|endoftext|>
<|startoftext|>29<|endoftext|>
<|startoftext|>30<|endoftext|>
<|startoftext|>31<|endoftext|>
<|startoftext|>32<|endoftext|>
<|startoftext|>33<|endoftext|>
<|startoftext|>34<|endoftext|>
<|startoftext|>35<|endoftext|>
<|startoftext|>36<|endoftext|>
<|startoftext|>37<|endoftext|>
<|startoftext|>38<|endoftext|>
<|startoftext|>39<|endoftext|>
<|startoftext|>40<|endoftext|>
<|startoftext|>41<|endoftext|>
<|startoftext|>42<|endoftext|>
<|startoftext|>43<|endoftext|>
<|startoftext|>44<|endoftext|>
<|starto
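
To make the effect of `temperature` concrete: in temperature sampling, the model's scores (logits) are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The following is a minimal standalone sketch with made-up logits. It does not call gpt-2-simple and should not be read as its implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At lower temperatures the most likely token takes a larger share of the probability, which is why the slider above makes the output more or less predictable.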

In [13]:
# Parameters for gpt2.generate:
# length: Number of tokens to generate (default 1023, the maximum).
# temperature: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0).
# top_k: Limits the generated guesses to the top k guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set top_k=40).
# top_p: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with top_p=0.9).
# truncate: Truncates the generated text at a given sequence, excluding that sequence (e.g. if truncate='<|endoftext|>', the returned text will include everything before the first <|endoftext|>). It may be useful to combine this with a smaller length if the texts are short.
# include_prefix: If using truncate and include_prefix=False, the specified prefix will not be included in the returned text.

gpt2.generate(sess, 
              run_name=trainingName,
              length=10,
              prefix="Given [ANGER=0, DISGUST=0, FEAR=0, JOY=0.880435, SADNESS=0, TENTATIVE=0.8821536, ANALYTICAL=0.589295 , CONFIDENT=0.775702], Views=",
              nsamples=1,
              batch_size=1
              )

Given [ANGER=0, DISGUST=0, FEAR=0, JOY=0.880435, SADNESS=0, TENTATIVE=0.8821536, ANALYTICAL=0.589295 , CONFIDENT=0.775702], Views=<|endoftexttexttexttexttexttext
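
Since the prediction is whatever follows `Views=`, it can help to pull the number out of the generated string in code rather than by eye. The sketch below assumes the text has already been obtained as a Python string, for example by passing `return_as_list=True` to `gpt2.generate`. The helper `extract_views` and the two sample strings are hypothetical; for a malformed completion like the one above, the helper returns `None`.

```python
import re

# A number after "Views=", optionally with thousands separators or decimals.
VIEWS_PATTERN = re.compile(r"Views=\s*([0-9][0-9,]*(?:\.[0-9]+)?)")

def extract_views(generated_text):
    """Return the number following 'Views=' as a float, or None if absent."""
    match = VIEWS_PATTERN.search(generated_text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

# In the notebook the text would come from something like:
#   texts = gpt2.generate(sess, run_name=trainingName, prefix=prompt,
#                         length=10, truncate="<|endoftext|>",
#                         return_as_list=True)
# Hypothetical strings are used here so the sketch runs on its own.
print(extract_views("Given [JOY=0.88], Views=15230<|endoftext|>"))
print(extract_views("Given [JOY=0.88], Views=<|endoftexttexttext"))
```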


In [14]:
# save generated text to a file

gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess, 
              run_name=trainingName,
              destination_path=gen_file,
              temperature=0.7,
              length=10,
              nsamples=1,
              batch_size=1
              )


In [15]:
# Download the file to the local machine. You may have to run this cell twice to get the file to download.

files.download(gen_file)


In [16]:
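# Save the run name to a local file.
# Note: this pickles only the string in trainingName, not the model weights.
# The fine-tuned weights are in checkpoint/<run name> and were copied to
# Google Drive by gpt2.copy_checkpoint_to_gdrive after training.

import pickle

model_filename = 'youtubePredictor_gpt2_finetuned_355M.sav'
with open(model_filename, 'wb') as f:
    pickle.dump(trainingName, f)
files.download(model_filename)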

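
The cell above pickles `trainingName`, which is a string, so the downloaded `.sav` file does not contain the fine-tuned weights. Those live in the `checkpoint/<run name>` folder (the load cell earlier reads `checkpoint/views_predictor/model-226`). One possible way to get a local copy is to zip that folder with the standard library and download the archive. This is a sketch: the helper name `archive_run` is hypothetical, and the demo runs against a throwaway directory so it also works outside Colab.

```python
import os
import shutil
import tempfile

def archive_run(run_name, checkpoint_dir="checkpoint", out_dir="."):
    """Zip checkpoint_dir/run_name and return the path of the archive."""
    run_path = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(run_path):
        raise FileNotFoundError(f"No checkpoint folder at {run_path}")
    base = os.path.join(out_dir, f"checkpoint_{run_name}")
    # make_archive appends the ".zip" extension itself.
    return shutil.make_archive(base, "zip",
                               root_dir=checkpoint_dir, base_dir=run_name)

# Demo with a throwaway folder standing in for a real checkpoint.
demo_root = tempfile.mkdtemp()
demo_run_dir = os.path.join(demo_root, "checkpoint", "demo_run")
os.makedirs(demo_run_dir)
with open(os.path.join(demo_run_dir, "model-0.index"), "w") as f:
    f.write("placeholder")
demo_zip = archive_run("demo_run",
                       checkpoint_dir=os.path.join(demo_root, "checkpoint"),
                       out_dir=demo_root)
print(os.path.basename(demo_zip))

# In Colab this could be used as: files.download(archive_run(trainingName))
```

For the 355M model the archive will be large, so the copy already stored in Google Drive may be the more practical backup.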