# Training a GPT-2 model on Donald Trump's tweets

[GPT-2](https://openai.com/blog/better-language-models/) refers to a series of [transformer models](https://towardsdatascience.com/transformers-141e32e69591) developed by [OpenAI](https://openai.com/) for automated text generation.


GPT-2 comes pre-trained on text scraped from roughly eight million web pages linked from Reddit. However, we can take this one step further and "finetune" a model with extra input from another source. This nudges the model to produce output more similar to the new text. For example, if you finetuned a GPT-2 model on "The Great Gatsby", it would pick up on the novel's common grammatical structures and might even start to wax longingly about Daisy Buchanan.

**In this demo we train the medium-sized GPT-2 model (355 million parameters) on over 28,000 tweets from the [@realDonaldTrump](https://twitter.com/realDonaldTrump) Twitter account.**

The majority of the code for this demo is lifted (with much thanks) from a [colab notebook](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce?authuser=1#scrollTo=H7LoMj4GA4n_) put together by [Max Woolf](https://github.com/minimaxir), a data scientist at BuzzFeed.

Max Woolf is also responsible for the [GPT-2-simple](https://github.com/minimaxir/gpt-2-simple) Python library used in this demo.

This Python notebook is intended for [Google Colab](https://colab.research.google.com/). Some commands may not work elsewhere.

## Set up coding environment

First, to speed up training, make sure your Colab runtime is using a GPU as its hardware accelerator. This will allow the model to finetune much faster.

To tell Colab to use a GPU, navigate to the dropdown menu above labeled "Runtime." Selecting "Change runtime type" opens a window where you can select "GPU" as your "Hardware accelerator." Hit "SAVE" and you are good to go.

Now, onto the code!

In [0]:
!pip install gpt-2-simple #Installs gpt-2-simple library in your colab python environment
import gpt_2_simple as gpt2 #imports the library for subsequent method calls
from datetime import datetime 
from google.colab import files #module for uploading trump_tweets.csv

Collecting gpt-2-simple
  Downloading https://files.pythonhosted.org/packages/75/2f/4b2d933decca7f79e3ae2eb3859e2b30bb1f572634d2c84f925d765e3b8e/gpt_2_simple-0.6.tar.gz
Collecting regex
  Downloading https://files.pythonhosted.org/packages/e3/8e/cbf2295643d7265e7883326fb4654e643bfc93b3a8a8274d8010a39d8804/regex-2019.11.1-cp36-cp36m-manylinux1_x86_64.whl (643kB)
     |████████████████████████████████| 645kB 8.1MB/s
Collecting toposort
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: gpt-2-simple
  Building wheel for gpt-2-simple (setup.py) ... done
  Created wheel for gpt-2-simple: filename=gpt_2_simple-0.6-cp36-none-any.whl size=25388 sha256=0a9bae74df1dc871e134c59c8a206431a1cec0e9dd684377fdfd4fcd42b629aa
  Stored in directory: /root/.cache/pip/wheels/cc/e7/21/4cb10bcf085ff791a08bbd03aa3fd860f6e730f37b5dbbea28
Successful

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Because gpt-2-simple depends on TensorFlow 1.x, we have to force Colab to use a release older than TensorFlow 2.0.

In [0]:
%tensorflow_version 1.x
import tensorflow as tf
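assert tf.__version__.startswith("1."), "gpt-2-simple needs TensorFlow 1.x" # check the major version rather than comparing version strings lexically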

## Upload Trump tweets

Before we do anything with GPT-2, let's upload all of Trump's tweets into our local directory. 

Thankfully, gpt-2-simple can read a ***single-column*** csv directly when finetuning.


So, before making this notebook, I downloaded over 40,000 tweets from the @realDonaldTrump Twitter account from [this website](http://www.trumptwitterarchive.com/archive). The site even has an option to download only the text of the tweets. I then removed all retweets (with the [remove_RT.py](https://github.com/aaronbrezel/GPT-2_Demo/blob/master/bot/remove_RT.py) file) from the corpus to avoid muddling our training data. We only want tweets written by the account.

You can either collect and isolate the tweets yourself or use the csv from [this repository](https://github.com/aaronbrezel/GPT-2_Demo). 
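
If you collect the tweets yourself, the retweet filter can be quite small. The sketch below is not the repository's `remove_RT.py`, whose logic may differ. It assumes that retweets in your export begin with `RT @`, and `remove_retweets` is a hypothetical helper name.

```python
def remove_retweets(tweets):
    """Return only the tweets that do not start with the 'RT @' retweet marker.

    Assumption: retweets in the export begin with 'RT @'. Adjust the marker
    if your data labels retweets differently.
    """
    return [tweet for tweet in tweets if not tweet.lstrip().startswith("RT @")]


# Hypothetical sample data for illustration only
sample = ["RT @someone: quoted text", "Great day!", "  RT @other: more text"]
print(remove_retweets(sample))
```

Each surviving tweet can then be written to its own row of a single-column csv.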

Download csv from GitHub repo

In [0]:
!git clone https://github.com/aaronbrezel/GPT-2_Demo.git

Cloning into 'GPT-2_Demo'...
remote: Enumerating objects: 2416, done.
remote: Counting objects: 100% (2416/2416), done.
remote: Compressing objects: 100% (2018/2018), done.
remote: Total 2433 (delta 397), reused 2404 (delta 389), pack-reused 17
Receiving objects: 100% (2433/2433), 60.12 MiB | 9.96 MiB/s, done.
Resolving deltas: 100% (402/402), done.


In [0]:
#Set variable that can easily find the csv of the text of Trump's tweets
file_path_to_csv = '/content/GPT-2_Demo/bot/tweet_text_minus_rt.csv'

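Alternatively, upload the csv from your local file system. The cell below is commented out; uncomment it if you are not using the GitHub copy, and point `file_path_to_csv` at the uploaded file.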

In [0]:
# uploaded = files.upload()

# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

## GPT-2 model

Fetch the medium gpt-2 model.

In [0]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 391Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 130Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 530Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:14, 97.2Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 254Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 129Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 187Mit/s]                                                       


As of Nov. 26, 2019, GPT-2-simple can only finetune the small and medium GPT-2 models.

To load the small model instead, change `model_name` from "355M" to "124M".

You can also load the "774M" and "1558M" models, but you cannot finetune them.

### Start finetuning

The following text cell is lifted directly from Max Woolf's GPT-2 [colab tutorial](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce#scrollTo=LdpZQXknFNY3). Honestly, it was so short and to the point that I didn't see a reason to change it.



---


`gpt2.finetune()` will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints will be saved in `/checkpoint/run1` by default. The checkpoints are saved every 500 steps (can be changed) and when the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between printed training progress updates.
* **`learning_rate`**: Learning rate for the training (default `1e-4`; can lower to `1e-5` if you have <1MB input data).
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies.


---
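
As an illustration of how `restore_from`, `run_name`, and `overwrite` fit together, here is a minimal sketch of the keyword arguments for continuing an earlier run. The step counts are placeholder values I picked for the example, and the run name and dataset path mirror the ones used in this notebook. The `gpt2.finetune()` call itself is left commented out, because rerunning finetuning requires a restarted runtime.

```python
# Keyword arguments for a follow-up gpt2.finetune() call that picks up
# from the latest checkpoint of an existing run.
resume_kwargs = dict(
    dataset='/content/GPT-2_Demo/bot/tweet_text_minus_rt.csv',
    model_name='355M',
    steps=200,                    # placeholder value, not a recommendation
    restore_from='latest',        # continue from the existing checkpoint
    run_name='trump_tune_small',  # must match the run being continued
    overwrite=True,               # avoid creating duplicate checkpoint copies
    print_every=10,
    sample_every=100,             # placeholder value
)

# In a freshly restarted runtime, after re-running the imports:
# sess = gpt2.start_tf_sess()
# gpt2.finetune(sess, **resume_kwargs)
```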



Before training the model, make sure your csv has a text entry in every row. If you take the csv from the GitHub repo, you should not have this problem.

In [0]:
import csv

with open(file_path_to_csv, newline='') as fp:
  reader = csv.reader(fp)
  for row_number, row in enumerate(reader, start=1):
    # csv.reader always yields strings, so only check that the row exists and its first cell is not blank
    if not row or not row[0].strip():
      print(row_number)  # Print the offending row number in the csv
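
If the check does print row numbers, one way to repair the file is to write a cleaned copy without the offending rows. This is a minimal sketch, not something from the original workflow: `drop_blank_rows` is a hypothetical helper, and the output path in the usage comment is a made-up example.

```python
import csv


def drop_blank_rows(source_path, cleaned_path):
    """Copy a single-column csv, skipping rows whose first cell is missing or blank.

    Returns the number of rows written to cleaned_path.
    """
    kept = 0
    with open(source_path, newline='') as src, \
         open(cleaned_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[0].strip():
                writer.writerow([row[0]])  # keep only the first column
                kept += 1
    return kept


# Example usage (hypothetical output path):
# drop_blank_rows(file_path_to_csv, '/content/tweets_cleaned.csv')
```

You would then pass the cleaned file as `dataset` when finetuning.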



Now, we train. The more steps you pass, the more tightly the model will fit @realDonaldTrump's tweets. I usually err on the side of fewer steps, as I find it makes the model more unpredictable.

In [0]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_path_to_csv,
              model_name='355M',
              steps=400,
              restore_from='fresh',
              run_name='trump_tune_small',
              print_every=10,
              sample_every=200,
              save_every=500,
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


100%|██████████| 1/1 [00:00<00:00, 12.27it/s]

Loading dataset...





dataset has 1242687 tokens
Training...
[10 | 17.36] loss=2.65 avg=2.65
[20 | 26.12] loss=2.44 avg=2.55
[30 | 34.91] loss=2.86 avg=2.65
[40 | 43.67] loss=2.55 avg=2.63
[50 | 52.44] loss=2.18 avg=2.53
[60 | 61.20] loss=1.93 avg=2.43
[70 | 69.97] loss=2.28 avg=2.41
[80 | 78.73] loss=1.71 avg=2.32
[90 | 87.50] loss=1.94 avg=2.27
[100 | 96.25] loss=2.01 avg=2.25
[110 | 105.00] loss=2.06 avg=2.23
[120 | 113.76] loss=1.80 avg=2.19
[130 | 122.50] loss=1.99 avg=2.18
[140 | 131.25] loss=2.13 avg=2.17
[150 | 140.01] loss=2.64 avg=2.21
[160 | 148.76] loss=2.13 avg=2.20
[170 | 157.51] loss=1.72 avg=2.17
[180 | 166.26] loss=2.00 avg=2.16
[190 | 175.01] loss=2.58 avg=2.18
[200 | 183.76] loss=2.77 avg=2.22
24|<|startoftext|>@Wes_Haley Thanks. It is worth the wait.<|endoftext|>
<|startoftext|>@D_G_Stellar    Thank you.<|endoftext|>
<|startoftext|>On my way @JaredFlynn and I discussed a few of the issues facing our great nation. Jared is the best. We agree. We want a great &amp; powerful @USNavy. Thanks
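
To judge how long a given number of steps will take, you can read the timing off the training log. The sketch below assumes each progress line has the form `[step | elapsed seconds] loss=... avg=...`. That reading of the fields is my interpretation of the output above, and `parse_log_line` is a hypothetical helper.

```python
import re

# Matches lines such as "[200 | 183.76] loss=2.77 avg=2.22"
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_log_line(line):
    """Return (step, elapsed_seconds, loss, avg) or None if the line is not a progress line."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    step, elapsed, loss, avg = match.groups()
    return int(step), float(elapsed), float(loss), float(avg)


# Three lines copied from the training output above
log = """[10 | 17.36] loss=2.65 avg=2.65
[100 | 96.25] loss=2.01 avg=2.25
[200 | 183.76] loss=2.77 avg=2.22"""

records = [r for r in map(parse_log_line, log.splitlines()) if r]
first, last = records[0], records[-1]
seconds_per_step = (last[1] - first[1]) / (last[0] - first[0])
print(f"{seconds_per_step:.2f} seconds per step")  # prints "0.88 seconds per step"
```

At that pace, the 400 steps requested here should take roughly six minutes.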

Test it out!

In [0]:
gpt2.generate(sess, length=50, temperature=0.9, prefix="Nancy Pelosi is cool!", run_name='trump_tune_small')

Nancy Pelosi is cool!<|endoftext|>
<|startoftext|>Thank you South Florida for the group of gentlemen in front of the statue of Thomas Jefferson. They all deserve a good (very) sincere reply.<|endoftext|>


Pretty neat, although a few text artifacts from the training are left over (the `<|startoftext|>` and `<|endoftext|>` tokens). We can clean these up quite a bit in the next tutorial, where we create a simple bot loop for prompting tweets.
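
In the meantime, here is a rough way to strip the leftover `<|startoftext|>` and `<|endoftext|>` markers from generated text. `split_generated_tweets` is a hypothetical helper, not part of gpt-2-simple. It assumes you already have the generated text as a string, for example by passing `return_as_list=True` to `gpt2.generate`.

```python
START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def split_generated_tweets(raw_text):
    """Split raw generated text into individual tweets with the marker tokens removed."""
    tweets = []
    for chunk in raw_text.split(END_TOKEN):
        cleaned = chunk.replace(START_TOKEN, "").strip()
        if cleaned:  # skip empty pieces left over after the final end token
            tweets.append(cleaned)
    return tweets


# Shortened sample in the same shape as the output above
raw = "Nancy Pelosi is cool!<|endoftext|>\n<|startoftext|>Thank you.<|endoftext|>\n"
print(split_generated_tweets(raw))
```

Note that text cut off by the `length` limit will still come through as a final, incomplete tweet.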


### Mount gdrive and save model

Finetuning a GPT-2 model creates a pretty big folder. You can save it on your local machine, but since we're in Colab, we might as well save it on gdrive. In the code snippets below, we "mount" your gdrive, which allows you to read and write files from your drive directly in Colab. We then use a handy method from the gpt-2-simple library to copy our new model into gdrive. The .tar file we save is roughly 1.3GB.

We can then access the finetuned model anytime from this file, without having to retrain. Very convenient.  


In [0]:
from google.colab import drive
drive.mount('/content/drive') #For the record, gpt-2-simple also provides gpt2.mount_gdrive(), which performs the same action

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name='trump_tune_small')
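
To reuse the saved model later without retraining, the copy can be reversed in a fresh Colab runtime. This is a minimal sketch under a few assumptions: gpt-2-simple is already installed, the checkpoint was saved under the same `run_name`, and `restore_finetuned_model` is a hypothetical wrapper I added for illustration.

```python
def restore_finetuned_model(run_name='trump_tune_small'):
    """Copy a saved checkpoint back from gdrive and load it into a new session."""
    # Imported inside the function so it can be defined before the library is installed
    import gpt_2_simple as gpt2

    gpt2.mount_gdrive()  # same effect as drive.mount('/content/drive')
    gpt2.copy_checkpoint_from_gdrive(run_name=run_name)
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name=run_name)
    return sess


# In a fresh Colab runtime:
# import gpt_2_simple as gpt2
# sess = restore_finetuned_model()
# gpt2.generate(sess, run_name='trump_tune_small')
```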

### Check out the trump_demo.ipynb file in the repository for the next step.