# GPT-2 arXiv Title Generator

## tl;dr
1. `Connect` or `Reconnect`
2. Upload your Kaggle API key (instructions included later)
3. `Runtime` -> `Restart and run all`
4. Wait for about 90 minutes
5. Laugh at weird computer-generated headlines


by Brian Lechthaler, 
*based on [aitextgen](https://github.com/minimaxir/aitextgen)*

In [1]:
from datetime import datetime
def mktimestamp():
  timestamp = datetime.now()
  msg = "Last Updated: " + str(timestamp)
  return msg
print(mktimestamp())

Last Updated: 2020-11-13 22:07:03.662976


In [2]:
!nvidia-smi

Fri Nov 13 22:07:04 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Dependencies
Download and install all necessary dependencies with `pip`, then `import` what we need.

In [3]:
# Freeze versions of dependencies for now
!pip install -q transformers==2.9.1
!pip install -q pytorch-lightning==0.7.6

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer

11/13/2020 22:07:12 — INFO — transformers.file_utils — PyTorch version 1.7.0+cu101 available.
11/13/2020 22:07:13 — INFO — transformers.file_utils — TensorFlow version 2.3.0 available.
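Since two of the three packages are pinned, it can help to confirm what actually got installed before going further. A minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); `installed_versions` is a hypothetical helper, not part of aitextgen:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Report the three packages this notebook installs with pip.
for name, ver in installed_versions(
    ["transformers", "pytorch-lightning", "aitextgen"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```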


# Mount Google Drive
Because any data in the VM this notebook is running on will be nuked once the Jupyter kernel stops running, it's helpful to mount your Google Drive to the Colab VM to persist some files that we'll use in this notebook.

*Note:* your data will not be shared with anyone who does not have direct access to the VM running this Colab notebook.

In [4]:
mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Download Dataset from Kaggle
Downloads the 'arxiv' dataset contributed by Kaggle user `Cornell-University`.

1. Sign in to Kaggle in a separate tab.
2. Click [this link](https://kaggle.com/me/account) to go to your Kaggle account settings.
3. Under the `API` section, click/tap `Create new API token`. If this is not the first time you have followed this step, consider clicking `Expire API Token` before generating a new token.
4. In the Colab file browser, upload the `kaggle.json` API token you downloaded in step 3.
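If the upload in step 4 is skipped or the file is malformed, the shell commands below fail quietly. A minimal sketch of a Python alternative that validates the token first, assuming `kaggle.json` is a JSON object with `username` and `key` fields; `install_kaggle_token` is a hypothetical helper, and unlike `mv` it leaves the uploaded file in place:

```python
import json
from pathlib import Path

def install_kaggle_token(src="kaggle.json", dest_dir=None):
    """Validate an uploaded Kaggle token and store it with owner-only permissions.

    Returns the destination path, or None when no new token was uploaded.
    """
    src = Path(src)
    # On Colab the home directory is /root, matching the paths used below.
    dest_dir = Path(dest_dir) if dest_dir else Path.home() / ".kaggle"
    if not src.exists():
        return None  # nothing uploaded; a previously installed token is left alone
    token = json.loads(src.read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    dest.write_text(json.dumps(token))
    dest.chmod(0o600)  # same effect as `chmod 600`
    return dest
```

A `None` return corresponds to the case where no fresh `kaggle.json` was uploaded for this run.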



In [5]:
!pip install -q kaggle

In [6]:
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!rm -rf arxiv.zip
!kaggle datasets download -d Cornell-University/arxiv

mv: cannot stat 'kaggle.json': No such file or directory
Downloading arxiv.zip to /content
 97% 874M/902M [00:10<00:00, 115MB/s] 
100% 902M/902M [00:10<00:00, 93.1MB/s]


In [7]:
!rm -rf arxiv-metadata-oai-snapshot.json
!unzip arxiv.zip
dataset_original = 'arxiv-metadata-oai-snapshot.json'

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


# Transform Dataset
We need to define a couple of functions that keep only the column we need before using the data to train GPT-2. In addition to dropping every other column, we randomly sample 45% of the rows so that our dataset will reliably fit in GPU memory.

In [8]:
import pandas as pd

In [9]:
def writeln(line, path):
  line = line + "\n"
  with open(path, 'a') as saveto:
    saveto.write(line)

In [10]:
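def finalsave(df, colname, filename, sample_frac):
  # Randomly sample a fraction of the rows, then append one value per line.
  print("Transforming dataset...")
  sample = df[colname].sample(frac=sample_frac)
  # Open the output file once instead of once per row.
  with open(filename, 'a') as saveto:
    for line in sample:
      saveto.write(line + "\n")
  print("Done!")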
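The sample output later in this notebook includes fragments that start with two spaces, which is consistent with titles that wrap onto a second line in the snapshot: with one training example per line, each half of a wrapped title becomes its own example. A minimal sketch of collapsing that whitespace before writing; `normalize_title` is a hypothetical helper, the example string is made up, and the wrapped-title format is an assumption drawn from that output rather than something verified here:

```python
def normalize_title(title):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(title.split())

# Made-up example of a title wrapped across two lines.
wrapped = "A title that wraps onto\n  a second line"
print(normalize_title(wrapped))
```

Using it would mean writing `normalize_title(line)` instead of `line` when saving; the outputs shown in this notebook were produced without it.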

*Important Note:* If any cell below this message crashes your Colab runtime, you probably ran out of memory. Sorry, but if the problem persists you may need to shell out $10 to Google for Colab Pro and change the runtime to GPU High RAM.
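If loading the whole file is what exhausts memory, one alternative is to stream it instead. A minimal sketch using only the standard library, assuming the snapshot is JSON Lines with a `title` field per record (as the `lines=True` call and the `'title'` column below suggest); `stream_titles` is a hypothetical helper. Note that it keeps each row independently with probability `sample_frac`, so the sample is only approximately 45% of the rows, unlike `DataFrame.sample(frac=...)`:

```python
import json
import random

def stream_titles(src_path, dest_path, sample_frac, seed=None):
    """Write a random subset of titles from a JSON Lines file, one per line.

    Only one record is held in memory at a time. Returns the number written.
    """
    rng = random.Random(seed)
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for raw in src:
            if not raw.strip():
                continue  # skip blank lines
            if rng.random() >= sample_frac:
                continue  # row not selected for the sample
            title = json.loads(raw).get("title")
            if title:
                dest.write(title + "\n")
                written += 1
    return written
```

A call such as `stream_titles(dataset_original, '/content/dataset.csv', 0.45)` would stand in for the `read_json` and `finalsave` cells; the run recorded in this notebook used the pandas path.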

In [12]:
jsoningest = pd.read_json(dataset_original, lines=True)

11/13/2020 22:09:01 — INFO — numexpr.utils — NumExpr defaulting to 4 threads.


In [13]:
!rm -rf /content/dataset.csv
!touch /content/dataset.csv
file_name = '/content/dataset.csv'
finalsave(jsoningest, 
          'title', 
          file_name,
          0.45)

Transforming dataset...
Done!


In [14]:
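import gc
def cleardf(df):
  # Drop rows in place; rebinding `df` here would not affect the caller's DataFrame.
  print('Emptying Pandas DataFrame...')
  df.drop(df.index, inplace=True)
  gc.collect()
  print('Done!')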

In [15]:
cleardf(jsoningest)

Emptying Pandas DataFrame...
Done!


# Train the Tokenizer
This runs on the CPU and will take a while.


In [16]:
file_name = '/content/dataset.csv'
!rm -rf aitextgen-merges.txt
!rm -rf aitextgen-vocab.json
!rm -rf trained_model
train_tokenizer(file_name)

11/13/2020 22:11:31 — INFO — aitextgen.tokenizers — Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
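The config in the next section sets `vocab_size=60000`, so it may be worth checking how many tokens the trained tokenizer actually contains. A minimal sketch, assuming `aitextgen-vocab.json` is a JSON object mapping each token to its integer id; `vocab_stats` is a hypothetical helper:

```python
import json

def vocab_stats(vocab_path):
    """Return (number of tokens, largest token id) from a vocab JSON file."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    return len(vocab), max(vocab.values())

# Example use after train_tokenizer() has written the file:
# size, max_id = vocab_stats("aitextgen-vocab.json")
# assert max_id < 60000, "token ids must fit inside the model's vocab_size"
```

Every token id has to be smaller than the model's `vocab_size`; a `vocab_size` much larger than the tokenizer's only adds embedding rows that never correspond to a real token.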


# Configure GPT-2

In [17]:
config = build_gpt2_config(vocab_size=60000, 
                           max_length=20,
                           dropout=0.0, 
                           n_embd=256, 
                           n_layer=16, 
                           n_head=16)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 20,
  "n_embd": 256,
  "n_head": 16,
  "n_layer": 16,
  "n_positions": 20,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 60000
}
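To get a feel for how small this model is, the parameter count can be worked out from the config values. A sketch assuming the standard GPT-2 layout (token and position embeddings, then per block two layer norms, a fused QKV projection, an attention output projection, and a 4x-wide MLP, with the output layer tied to the token embeddings); `gpt2_param_count` is a hypothetical helper, not an aitextgen function:

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count trainable parameters for a GPT-2 style model with tied output weights."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # expand to 4d, contract back to d
    layer_norms = 2 * (2 * d)                      # two norms per block, weight + bias
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d
    return embeddings + n_layer * per_block + final_norm

total = gpt2_param_count(vocab_size=60000, n_positions=20, n_embd=256, n_layer=16)
print(f"{total:,}")  # 28,001,792
```

Under these assumptions, about 15.4 million of the roughly 28 million parameters sit in the token embedding table, and each of the 16 attention heads works on 256 / 16 = 16 dimensions.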

In [18]:
ai = aitextgen(config=config,
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

11/13/2020 22:11:31 — INFO — aitextgen — Constructing GPT-2 model from provided config.
11/13/2020 22:11:32 — INFO — aitextgen — Using a custom tokenizer.


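# Train GPT-2 on the Dataset
Because the model is constructed from the config above instead of pretrained weights, this step trains it from scratch. The run logged below took about 85 minutes on an NVIDIA Tesla V100 GPU. Text generated from the model should get progressively better as training proceeds.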

In [19]:
!rm -rf trained_model
!nvidia-smi
ai.train(file_name,
         line_by_line=True,
         num_steps=20000,
         generate_every=500,
         save_every=100,
         save_gdrive=True,
         learning_rate=1e-4,
         batch_size=256)

Fri Nov 13 22:11:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    37W / 300W |   1463MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

HBox(children=(FloatProgress(value=0.0, layout=Layout(flex='2'), max=1168181.0), HTML(value='')), layout=Layou…

11/13/2020 22:11:39 — INFO — aitextgen.TokenDataset — Encoding 1,168,181 rows from /content/dataset.csv.





GPU available: True, used: True
11/13/2020 22:12:22 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
11/13/2020 22:12:22 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]


HBox(children=(FloatProgress(value=0.0, layout=Layout(flex='2'), max=20000.0), HTML(value='')), layout=Layout(…

100 steps reached: saving model to /trained_model
200 steps reached: saving model to /trained_model
300 steps reached: saving model to /trained_model
400 steps reached: saving model to /trained_model
500 steps reached: saving model to /trained_model
500 steps reached: generating sample texts.
  SEPI7
600 steps reached: saving model to /trained_model
700 steps reached: saving model to /trained_model
800 steps reached: saving model to /trained_model
900 steps reached: saving model to /trained_model
1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
The $G(T)_2$ and 2)
1,100 steps reached: saving model to /trained_model
1,200 steps reached: saving model to /trained_model
1,300 steps reached: saving model to /trained_model
1,400 steps reached: saving model to /trained_model
1,500 steps reached: saving model to /tr

11/13/2020 23:37:34 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model
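The numbers in the log above allow a rough sanity check of the training run. A small worked calculation, assuming each step consumes `batch_size` rows with one row per example, and taking the Lightning start-up message as the approximate start of training:

```python
from datetime import datetime

rows = 1_168_181   # from the TokenDataset log line
num_steps = 20_000
batch_size = 256

examples_seen = num_steps * batch_size
epochs = examples_seen / rows

fmt = "%m/%d/%Y %H:%M:%S"
started = datetime.strptime("11/13/2020 22:12:22", fmt)   # Lightning start-up log
finished = datetime.strptime("11/13/2020 23:37:34", fmt)  # final model save
elapsed = (finished - started).total_seconds()

print(f"{examples_seen:,} examples, about {epochs:.1f} passes over the data")
print(f"{elapsed / 60:.0f} minutes, about {num_steps / elapsed:.1f} steps per second")
```

Under those assumptions the model sees each sampled title a little over four times in roughly 85 minutes.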


# Generate a few Samples
Now for the fun part! Before I continue, I want to be really clear: all headlines you see in this notebook are 100% fake and generated by GPT-2. Please use this code responsibly: never use it to intentionally deceive people, generate clickbait, or do anything else blackhatty.

In [20]:
ai.generate(n=25,
            batch_size=512,
            temperature=1.0,
            top_p=0.999)

  from muonium in Pb-Pb collisions at LDA
Sparticle formation from fragmentation of a single and dutiny-parity
Unimodality of multiplicative nash
Antiproton cross sections at high energy in the T1 picture and
  in $\dot{\mathrm{P}^2$
New explicit and exactly blow-up to the 2D XY and Ahar
Accurate estimation of the power spectrum of the Hubble Frontier: the
  \leq e^\cy \ln \infty$-Borel
Autonomous quantum dots in strong laser fields via bending of laser
  model
  Networks
  campus labeled with the Parkes cloud
Communication Latent Variable Network Models for Medical Image Sequences
  with non-constant Ricci curvature
  (Leptofire)
Antiferromagnetism in the Hubbard model
Change and Stopping Properties of the CDF2 Inspired by Wi-
Public Key Supports in Smart People using Health M
  supersingular domains
Loss of the first kinematics of Galactic bulge
  CeO$_{5-\delta}$
  Gravitated Geometry
Precise Prediction of Wheeled Carbon Abundance with XMM-Newton X
The Kinetics of Folded Black Hole 
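The `temperature` and `top_p` arguments control how each next token is drawn. The sketch below illustrates the general idea of temperature scaling followed by nucleus (top-p) sampling on a plain list of logits; it is an illustration only, not the implementation used by aitextgen or transformers, and `top_p_sample` is a hypothetical name:

```python
import math
import random

def top_p_sample(logits, temperature=1.0, top_p=0.999, rng=random):
    """Draw one token index using temperature scaling and nucleus filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `temperature=1.0` and `top_p=0.999`, as in the cell above, almost nothing is filtered out, which fits the wide variety in the samples.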

In [21]:
!nvidia-smi

Fri Nov 13 23:37:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0    38W / 300W |   9241MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Credits

This project was made possible by the cumulative efforts of the following parties:

Brian Lechthaler *author of this notebook*
* https://github.com/brianlechthaler
* https://twitter.com/brianlechthaler

Max Woolf *author of [aitextgen](https://github.com/minimaxir/aitextgen), the training code this notebook is based on.*
* https://minimaxir.com/
* https://github.com/minimaxir

Cornell University *author of [arxiv](https://www.kaggle.com/Cornell-University/arxiv) dataset*
* https://www.cornell.edu/
* https://www.kaggle.com/Cornell-University

OpenAI *creators of [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2) model*
* https://openai.com 
* https://openai.com/blog/tags/gpt-2/
