# GPT-2 Anime Subtitle Generation

## tl;dr
1. `Connect` or `Reconnect`
2. Upload your Kaggle API key (instructions included later)
3. `Runtime` -> `Restart and run all`
4. Wait for training to finish (just under two hours on an NVIDIA V100)
5. Laugh at weird computer-generated anime subtitles


by Brian Lechthaler, 
*based on [aitextgen](https://github.com/minimaxir/aitextgen)*

# Dependencies
Download and install all necessary dependencies with `pip`, then `import` what we need.

In [1]:
!pip install -q kaggle
# Freeze versions of dependencies for now
!pip install -q transformers==2.9.1
!pip install -q pytorch-lightning==0.7.6

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer

11/14/2020 07:18:27 — INFO — transformers.file_utils — PyTorch version 1.7.0+cu101 available.
11/14/2020 07:18:28 — INFO — transformers.file_utils — TensorFlow version 2.3.0 available.


# Mount Google Drive
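Because files on the VM running this notebook are deleted once the Colab runtime is recycled, it's helpful to mount your Google Drive to the VM to persist some files that we'll use in this notebook. The final backup cell moves the model archive to `drive/My Drive/`, so uncomment the call below if you want that step to reach your Drive.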

*Note:* your data will not be shared with anyone who does not have direct access to the VM running this Colab notebook.

In [2]:
#mount_gdrive()

# Download Dataset from Kaggle
Downloads the `anime-subtitles` dataset contributed by Kaggle user `jef1056`.

1.   Sign into Kaggle in a separate tab
2.   Click [this link](https://kaggle.com/me/account) to go to your Kaggle account settings
3. Under the `API` section, click/tap `Create new API token`. If this is not the first time you have followed this step, consider clicking `Expire API Token` prior to generating a new token.
4. In the Colab file browser, upload the `kaggle.json` API token you just downloaded in step 3.
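
The steps above can also be checked from Python before the token is moved into place. The following is a minimal sketch, not part of the `kaggle` package: `install_kaggle_token` is a hypothetical helper name, and its default paths simply mirror the shell cell in this notebook. It assumes the token is a JSON object with `username` and `key` fields.

```python
import json
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir="/root/.kaggle"):
    """Validate an uploaded Kaggle API token and copy it into place.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(
            f"{src} not found: upload it in the Colab file browser first"
        )

    # A Kaggle API token is a small JSON object with "username" and "key" fields.
    token = json.loads(src_path.read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"{src} is missing fields: {sorted(missing)}")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / "kaggle.json"
    shutil.copy(src_path, target)
    # Restrict the token to the current user, like `chmod 600` in the shell cell.
    os.chmod(target, 0o600)
    return target
```

Calling `install_kaggle_token()` with no arguments would fail with a clear message when `kaggle.json` has not been uploaded yet, instead of the `cannot stat` error from `mv`.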



In [3]:
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
#!rm -rf anime-subtitles.zip
#!kaggle datasets download -d jef1056/anime-subtitles
#!rm -rf 'Anime Datasets V3.zip'
#!rm -rf 'input (Cleaned).txt'
#!unzip anime-subtitles.zip
#!wc -l 'input (Cleaned).txt'

mv: cannot stat 'kaggle.json': No such file or directory


# Train Tokenizer on Dataset
This step is CPU-bound and may take a few minutes.

In [4]:
file_name = 'input (Cleaned).txt'


In [5]:
!rm -rf aitextgen-merges.txt
!rm -rf aitextgen-vocab.json

In [6]:
train_tokenizer(file_name)

11/14/2020 07:19:03 — INFO — aitextgen.tokenizers — Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
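
The vocab and merges files mentioned in the log are the two artifacts of a byte-pair-encoding (BPE) tokenizer. As a rough illustration of how merges are learned, here is a toy sketch that works on characters rather than bytes. It is not aitextgen's implementation, and `learn_bpe_merges` is a hypothetical name used only for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy, character-level)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


print(learn_bpe_merges(["huh", "huh", "hub"], 2))
```

The ordered list of merges plays the role of the merges file, and the set of symbols that exist after all merges plays the role of the vocabulary.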


# Configure GPT-2 Training
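Set the configuration that defines the GPT-2 model we will train on our data: a 30,000-token vocabulary, a 64-token context window, 256-dimensional embeddings, and 8 layers with 8 attention heads each. The model is constructed from this config rather than loaded from pretrained weights.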

In [7]:
config = build_gpt2_config(vocab_size=30000, 
                           max_length=64, 
                           dropout=0.0, 
                           n_embd=256, 
                           n_layer=8, 
                           n_head=8)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 64,
  "n_embd": 256,
  "n_head": 8,
  "n_layer": 8,
  "n_positions": 64,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 30000
}
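
To get a feel for how small this model is, the parameter count can be estimated from the config values alone. The sketch below assumes the standard GPT-2 layout: a feed-forward width of four times `n_embd`, biases on every projection, and an output layer that shares weights with the token embeddings. `gpt2_param_count` is a hypothetical helper, not an aitextgen or transformers function.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters of a GPT-2 style model with tied embeddings."""
    d = n_embd
    token_embeddings = vocab_size * d
    position_embeddings = n_positions * d
    per_layer = (
        2 * d                # first layer norm (weight + bias)
        + 3 * d * d + 3 * d  # fused query/key/value projection
        + d * d + d          # attention output projection
        + 2 * d              # second layer norm
        + 4 * d * d + 4 * d  # feed-forward expansion (d -> 4d)
        + 4 * d * d + d      # feed-forward contraction (4d -> d)
    )
    final_layer_norm = 2 * d
    return token_embeddings + position_embeddings + n_layer * per_layer + final_layer_norm


total = gpt2_param_count(vocab_size=30000, n_positions=64, n_embd=256, n_layer=8)
print(f"{total:,} parameters")  # 14,014,976 parameters
```

Under these assumptions, about 7.7 million of the roughly 14 million parameters sit in the token embedding table. The number of heads does not change the count, because the heads split the same projection matrices.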

In [8]:
ai = aitextgen(config=config,
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

11/14/2020 07:19:03 — INFO — aitextgen — Constructing GPT-2 model from provided config.
11/14/2020 07:19:04 — INFO — aitextgen — Using a custom tokenizer.


# Re-train GPT-2 to Dataset

Training is GPU-bound and should take just under two hours on an NVIDIA V100.



In [9]:
!rm -rf trained_model

In [10]:
ai.train(file_name,
         line_by_line=True,
         num_steps=25000,
         generate_every=1000,
         save_every=500,
         save_gdrive=False,
         learning_rate=1e-4,
         batch_size=256)


11/14/2020 07:19:10 — INFO — aitextgen.TokenDataset — Encoding 1,248,751 sets of tokens from input (Cleaned).txt.





GPU available: True, used: True
11/14/2020 07:19:48 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
11/14/2020 07:19:48 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]



500 steps reached: saving model to /trained_model
1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
>I'll take some cetes.

1,500 steps reached: saving model to /trained_model
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
>I've heard of your mom's a bit.

2,500 steps reached: saving model to /trained_model
3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
>No way.

3,500 steps reached: saving model to /trained_model
4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
>It's always been a while...

4,500 steps reached: saving model to /trained_model
5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.
>What the heck are you so strong?


11/14/2020 09:06:31 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model
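
A quick back-of-the-envelope calculation shows how the training settings relate to the dataset size. This sketch assumes each line of the file is one training sample (as `line_by_line=True` suggests) and that no gradient accumulation is used; the numbers are taken from the `ai.train(...)` call and the encoder log above.

```python
import math

num_lines = 1_248_751  # "Encoding 1,248,751 sets of tokens" in the training log
batch_size = 256       # from the ai.train(...) call
num_steps = 25_000     # from the ai.train(...) call

steps_per_epoch = math.ceil(num_lines / batch_size)
samples_seen = num_steps * batch_size
epochs = samples_seen / num_lines

print(f"steps per epoch: {steps_per_epoch:,}")
print(f"samples seen:    {samples_seen:,}")
print(f"approx. epochs:  {epochs:.2f}")
```

Under these assumptions, 25,000 steps correspond to a little over five passes through the subtitle lines.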


# Generate Samples
Finally, the fun part! Have the model generate 25 samples. As you can see, many of the results read like plausible subtitle lines. Please use this code responsibly, and never to intentionally deceive or harm anyone.

In [11]:
ai.generate(n=25,
            batch_size=16384,
            prompt=">",
            temperature=1,
            top_p=0.99999)

[1m>[0mSynchronizing, huh?

[1m>[0mThat's all it was for. - Let's eat.

[1m>[0mHot... Ahiru-san, I just got it!

[1m>[0mthe "Disciple"?! (That guy is from the Public Safety Bureau!

[1m>[0mTest.

[1m>[0mThis isn't right...

[1m>[0mI just happened to be at a  place where I run a little.

[1m>[0mHuh?

[1m>[0mThe moment you fall,  it really razor the same piece as it gets.

[1m>[0mI would rather feel bad to  tell with each other.

[1m>[0mOh, I know!

[1m>[0mI think I could say that I saw it.

[1m>[0mYou have no right to say that.

[1m>[0mAoba Johsai Shopping Dawn

[1m>[0mAssuming shell.

[1m>[0mYeah...

[1m>[0mI see.

[1m>[0mWhy? Why did Usui-kun do that?

[1m>[0mWhat?

[1m>[0mHuh?

[1m>[0mAre you hungry?

[1m>[0mLet me see those two back then!

[1m>[0mOkay!

[1m>[0mI don't know if I was being sucked in by a human face,

[1m>[0mAnd then, at least it has to be Takahashi in grade school.



In [12]:
!export model_archive="anime_subtitlegen_$(date +%d_%b_%Y_%H_%M_%S)" && mkdir "$model_archive" && mv aitextgen-* "$model_archive" && mv trained_model "$model_archive" && tar -cvf "$model_archive.tar" "$model_archive" && mv "$model_archive.tar" "drive/My Drive/" && echo "Model successfully backed up to Google Drive. Feel free to factory reset the runtime."

anime_subtitlegen_14_Nov_2020_09_06_32/
anime_subtitlegen_14_Nov_2020_09_06_32/aitextgen-merges.txt
anime_subtitlegen_14_Nov_2020_09_06_32/trained_model/
anime_subtitlegen_14_Nov_2020_09_06_32/trained_model/pytorch_model.bin
anime_subtitlegen_14_Nov_2020_09_06_32/trained_model/config.json
anime_subtitlegen_14_Nov_2020_09_06_32/aitextgen-vocab.json
Model successfully backed up to Google Drive. Feel free to factory reset the runtime.
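
The same backup can be written in Python, which makes it easier to fail early when Google Drive is not mounted. This is a minimal sketch using only the standard library; `archive_model` is a hypothetical helper, and unlike the shell cell it copies the files into the archive instead of moving them.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def archive_model(paths, dest_dir, prefix="anime_subtitlegen"):
    """Bundle model files into a timestamped .tar inside dest_dir and return its path."""
    dest = Path(dest_dir)
    if not dest.is_dir():
        raise FileNotFoundError(f"{dest} does not exist: is Google Drive mounted?")

    # Zero-padded day (%d) keeps the archive name free of spaces.
    stamp = datetime.now().strftime("%d_%b_%Y_%H_%M_%S")
    name = f"{prefix}_{stamp}"
    archive_path = dest / f"{name}.tar"
    with tarfile.open(archive_path, "w") as tar:
        for path in map(Path, paths):
            # Store each entry under a single top-level folder named after the archive.
            tar.add(path, arcname=f"{name}/{path.name}")
    return archive_path
```

In this notebook it could be called as `archive_model(["aitextgen-vocab.json", "aitextgen-merges.txt", "trained_model"], "drive/My Drive")`, assuming Drive has been mounted under the working directory.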


# Credits

This project was made possible by the cumulative efforts of the following parties:

Brian Lechthaler *author of this notebook*
* https://github.com/brianlechthaler
* https://twitter.com/brianlechthaler

Max Woolf *author of [aitextgen](https://github.com/minimaxir/aitextgen), the training code this notebook is based on.*
* https://minimaxir.com/
* https://github.com/minimaxir

Jess Fan [author](https://www.kaggle.com/jef1056) of [anime-subtitles](https://www.kaggle.com/jef1056/anime-subtitles) dataset
* https://github.com/JEF1056
* https://www.linkedin.com/in/jess-fan-677177196/

OpenAI *creators of [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2) model*
* https://openai.com 
* https://openai.com/blog/tags/gpt-2/
