# GPT-2 News Headline Generation

## tl;dr
1. `Connect` or `Reconnect`
2. Upload your Kaggle API key (instructions included later)
3. `Runtime` -> `Restart and run all`
4. Wait about an hour for training to finish
5. Laugh at weird computer-generated headlines


by Brian Lechthaler, 
*based on [aitextgen](https://github.com/minimaxir/aitextgen)*

In [1]:
from datetime import datetime
def mktimestamp():
  timestamp = datetime.now()
  msg = "Last Updated: " + str(timestamp)
  return msg
print(mktimestamp())

Last Updated: 2020-11-12 07:49:38.854455


# Dependencies
Download and install all necessary dependencies with `pip`, then `import` what we need.

In [2]:
# Freeze versions of dependencies for now
!pip install -q transformers==2.9.1
!pip install -q pytorch-lightning==0.7.6

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer



11/12/2020 07:50:02 — INFO — transformers.file_utils — PyTorch version 1.7.0+cu101 available.
11/12/2020 07:50:03 — INFO — transformers.file_utils — TensorFlow version 2.3.0 available.


# Mount Google Drive
Because any data in the VM this notebook is running on will be nuked once the Jupyter kernel stops running, it's helpful to mount your Google Drive to the Colab VM to persist some files that we'll use in this notebook.

*Note:* your data will not be shared with anyone who does not have direct access to the VM running this Colab notebook.

In [3]:
#mount_gdrive()

# Download Dataset from Kaggle
This step downloads the `million-headlines` dataset contributed by Kaggle user `therohk`.

1.   Sign into Kaggle in a separate tab
2.   Click [this link](https://kaggle.com/me/account) to go to your Kaggle account settings
3. Under the `API` section, click/tap `Create new API token`. If this is not the first time you have followed this step, consider clicking `Expire API Token` prior to generating a new token.
4. In the Colab file browser, upload the `kaggle.json` API token you just downloaded in step 3.
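Before running the download cell, it can help to confirm the token actually made it into the working directory. The following is a minimal sketch, not part of the original notebook: `check_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields, which is the layout Kaggle uses for `kaggle.json`.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems with the token file (an empty list means it looks usable)."""
    token = Path(path)
    if not token.is_file():
        return [f"{path} not found; upload it with the Colab file browser"]
    try:
        data = json.loads(token.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    # Assumption: a usable token carries non-empty "username" and "key" fields.
    return [f"missing field: {field}" for field in ("username", "key") if not data.get(field)]


print(check_kaggle_token())
```

An empty list means the file is in place; anything else tells you what to fix before the `kaggle datasets download` command runs.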



In [4]:
!pip install -q kaggle

In [5]:
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!rm -rf million-headlines.zip
!kaggle datasets download -d therohk/million-headlines

Downloading million-headlines.zip to /content
 79% 16.0M/20.2M [00:00<00:00, 32.5MB/s]
100% 20.2M/20.2M [00:00<00:00, 58.3MB/s]


In [6]:
!rm -rf abcnews-date-text.csv
!unzip million-headlines.zip

Archive:  million-headlines.zip
  inflating: abcnews-date-text.csv   


# Transform Dataset
We need to define a couple of functions to write the `headline_text` column of the dataset to a plain text file, one headline per line.

In [7]:
import pandas as pd

In [8]:
def writeln(line, path):
  line = line + "\n"
  with open(path, 'a') as saveto:
    saveto.write(line)

In [9]:
def finalsave(df, colname, filename):
  print("Transforming dataset...")
  for index, row in df.iterrows():
    line = row[colname]
    writeln(line, filename)
  print("Done!")
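`writeln` reopens the output file for every row, which adds up over roughly 1.2 million headlines. A possible alternative, sketched here under the assumption that overwriting the output file is acceptable, opens the file once and loops over the values. `write_lines` is a hypothetical name, not something the notebook or aitextgen provides.

```python
import os
import tempfile


def write_lines(values, path):
    """Write each value on its own line, opening the file only once.

    Unlike writeln(), this overwrites `path` instead of appending to it.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for value in values:
            out.write(f"{value}\n")
            count += 1
    return count


# Small demo on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
print(write_lines(["police probe school stabbings", "wednesday weather"], demo_path))  # prints 2
```

Any iterable of strings works, so a pandas column can be passed directly, for example `write_lines(csvingest['headline_text'], file_name)`.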

# Create a Ramdisk for our Dataset
This is a little-known trick for Linux systems: create a temporary file store in memory (`tmpfs`) and mount it at `/media/ramdisk`. Writing to memory instead of disk speeds up the dataset transform we're about to run, and it also speeds up copying the dataset into GPU memory later.

In [10]:
!sudo mkdir -p /media/ramdisk
!sudo mount -t tmpfs -o size=128M tmpfs /media/ramdisk
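The ramdisk is capped at 128 MB, so it's worth checking that the mount succeeded and has room before writing to it. This is a rough sketch using only the standard library; `ramdisk_report` is a hypothetical helper, and the 64 MB figure in the usage lines is an illustrative guess, not a measured size of the transformed dataset.

```python
import os
import shutil


def ramdisk_report(path="/media/ramdisk", needed_bytes=0):
    """Report whether `path` is a mount point and has at least `needed_bytes` free."""
    usage = shutil.disk_usage(path)
    return {
        "is_mount": os.path.ismount(path),
        "free_bytes": usage.free,
        "enough_room": usage.free >= needed_bytes,
    }


# Only run the check where the directory exists (i.e. inside the Colab VM).
if os.path.isdir("/media/ramdisk"):
    print(ramdisk_report(needed_bytes=64 * 1024 * 1024))  # 64 MB is an illustrative guess
```

If `is_mount` comes back `False`, the directory exists but the `tmpfs` mount did not take, and files written there land on the regular disk.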

In [11]:
dataset_csv = '/content/abcnews-date-text.csv'
csvingest = pd.read_csv(dataset_csv)

*Important Note:* If the next cell crashes your Colab runtime, you probably ran out of memory. Sorry, but if the problem persists you may need to shell out $10 to Google for Colab Pro and change the runtime to GPU High RAM.
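If memory is the bottleneck, one option is to skip the DataFrame entirely and stream the CSV row by row. The sketch below uses Python's built-in `csv` module; `stream_column` is a hypothetical helper and is not how the notebook performs the transform.

```python
import csv


def stream_column(csv_path, column, out_path):
    """Copy one CSV column to a text file, one value per line, without loading the whole file."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # DictReader yields one row at a time, so memory use stays small.
        for row in csv.DictReader(src):
            dst.write(row[column] + "\n")
            count += 1
    return count
```

In this notebook the call would look like `stream_column('/content/abcnews-date-text.csv', 'headline_text', '/media/ramdisk/dataset.csv')`.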

In [12]:
file_name = '/media/ramdisk/dataset.csv'
# writeln() appends, so remove any output left over from a previous run
# to avoid duplicating every headline.
!rm -f /media/ramdisk/dataset.csv
finalsave(csvingest, 'headline_text', file_name)

Transforming dataset...
Done!


# Train the Tokenizer
This runs on the CPU and may take a few minutes.


In [13]:
train_tokenizer(file_name)

11/12/2020 07:54:09 — INFO — aitextgen.tokenizers — Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
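The tokenizer's vocabulary size has to agree with the `vocab_size` passed to the model config. A quick way to check is to count the entries in the saved vocab file. This sketch assumes `aitextgen-vocab.json` is a JSON object mapping tokens to ids (the usual GPT-2 vocab layout); `vocab_size_from_file` is a hypothetical helper.

```python
import json


def vocab_size_from_file(vocab_path="aitextgen-vocab.json"):
    """Count the tokens in a GPT-2 style vocab file (a JSON object of token -> id)."""
    with open(vocab_path, encoding="utf-8") as f:
        return len(json.load(f))
```

Calling `vocab_size_from_file()` after the tokenizer finishes should return the same number that is later used as `vocab_size`.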


# Configure GPT-2

In [14]:
config = build_gpt2_config(vocab_size=5000, max_length=16, dropout=0.0, n_embd=256, n_layer=8, n_head=8)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 16,
  "n_embd": 256,
  "n_head": 8,
  "n_layer": 8,
  "n_positions": 16,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 5000
}
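To get a feel for how small this model is, the parameter count can be estimated from the config values alone. The arithmetic below is a sketch that assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, biases everywhere, two layer norms per block) with the output layer sharing weights with the token embeddings; `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters for a GPT-2 style model with tied output embeddings."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d        # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)        # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)          # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                            # two layer norms per block
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d
    return embeddings + n_layer * per_layer + final_norm


print(gpt2_param_count(vocab_size=5000, n_positions=16, n_embd=256, n_layer=8))  # prints 7602688
```

That is roughly 7.6 million parameters, far below the 124 million of the smallest released GPT-2. `n_head` does not appear because splitting attention into heads does not change the number of weights.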

In [15]:
ai = aitextgen(config=config,
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

11/12/2020 07:54:10 — INFO — aitextgen — Constructing GPT-2 model from provided config.
11/12/2020 07:54:10 — INFO — aitextgen — Using a custom tokenizer.


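# Train GPT-2 on the Dataset
The model is built from the config above rather than loaded from pretrained weights, so this step trains it from scratch. Training should take about an hour on an NVIDIA Tesla P100 GPU. Text generated from the model should get progressively better as training goes on.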

In [16]:
ai.train(file_name,
         line_by_line=True,
         num_steps=25000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-4,
         batch_size=256,
         )


11/12/2020 07:54:25 — INFO — aitextgen.TokenDataset — Encoding 1,186,017 rows from /media/ramdisk/dataset.csv.
GPU available: True, used: True
11/12/2020 07:55:05 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
11/12/2020 07:55:05 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]






1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
last of the pampy
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
labor warns dunlot
3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
cairns police seek to be unite
4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
twire to be stops in the australian
5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.
police investigate hit and run
6,000 steps reached: saving model to /trained_model
6,000 steps reached: generating sample texts.
rspca to face trial after fatal kemp fire
7,000 steps reached: saving model to /trained_model
7,000 steps reached: generating sample texts.
wednesday weather
8,

11/12/2020 08:47:57 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model
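As a sanity check on the training settings, the number of passes over the data can be worked out from `num_steps`, `batch_size`, and the row count reported in the log. This back-of-the-envelope sketch assumes each step consumes exactly one batch of 256 headlines (no gradient accumulation) and that every row is one training sample.

```python
num_steps = 25000
batch_size = 256
rows = 1186017  # "Encoding 1,186,017 rows" from the training log

samples_seen = num_steps * batch_size
epochs = samples_seen / rows

print(f"{samples_seen:,} samples, about {epochs:.1f} passes over the data")
# prints: 6,400,000 samples, about 5.4 passes over the data
```

Under those assumptions the model sees each headline between five and six times.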


# Generate a few Samples
Now, for the fun part! Before I continue, I want to be really clear: all headlines you see in this notebook are 100% fake and generated by GPT-2. Please use this code responsibly (that means never use this to intentionally deceive people, generate clickbait, or do anything else blackhatty).

In [17]:
ai.generate(n=15,
            batch_size=1024,
            temperature=1.0,
            top_p=0.999)

police probe school stabbings
rann announces new indigenous plan
sternone to sue for umpiring
jericho our economy and is what happening this summer
former detective sinks nietrack
heart tips for lnp paedophiles in
pearson wins golden gifts cup
man charged over fathers stabbing
extra money for drought declarations in sa
simplot management
rural tas nsw oadamante
choppy season for loncy
cabelle beck to remain as tamarine
newman calls for more indigenous intervention
toronto battery barkly lanter
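The `temperature` and `top_p` arguments control how adventurous the sampling is. The sketch below shows the general idea of temperature scaling followed by top-p (nucleus) filtering in plain Python. It is a generic illustration, not the code aitextgen or transformers actually runs, and `top_p_filter` is a hypothetical name.

```python
import math


def top_p_filter(logits, temperature=1.0, top_p=0.999):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature rescales the logits: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their combined probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


print(sorted(top_p_filter([2.0, 1.0, 0.0, -10.0], top_p=0.9)))  # prints [0, 1]
```

With `top_p=0.999`, as in the cell above, almost nothing is filtered out, which helps explain why some generated headlines contain invented words.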


# Credits

This project was made possible by the cumulative efforts of the following parties:

Brian Lechthaler *author of this notebook*
* https://github.com/brianlechthaler
* https://twitter.com/brianlechthaler

Max Woolf *author of [aitextgen](https://github.com/minimaxir/aitextgen), the training code this notebook is based on.*
* https://minimaxir.com/
* https://github.com/minimaxir

Rohit Kulkarni *author of [million-headlines](https://www.kaggle.com/therohk/million-headlines) dataset*
* https://www.linkedin.com/in/rohit-kulkarni-21b0724a/
* https://kaggle.com/therohk

OpenAI *creators of [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2) model*
* https://openai.com 
* https://openai.com/blog/tags/gpt-2/
