# GPT-2 Anime Subtitle Generation

## tl;dr
1. `Connect` or `Reconnect`
2. Upload your Kaggle API key (instructions included later)
3. `Runtime` -> `Restart and run all`
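4. Wait for training to finish (15-30 minutes is optimistic; the full run logged in this notebook took about seven hours)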
5. Laugh at weird computer-generated Anime Subtitles


by Brian Lechthaler, 
*based on [aitextgen](https://github.com/minimaxir/aitextgen)*

# Dependencies
Download and install all necessary dependencies with `pip`, then `import` what we need.

In [1]:
!pip install -q kaggle
# Freeze versions of dependencies for now
!pip install -q transformers==2.9.1
!pip install -q pytorch-lightning==0.7.6

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer

11/15/2020 00:28:40 — INFO — transformers.file_utils — PyTorch version 1.7.0+cu101 available.
11/15/2020 00:28:42 — INFO — transformers.file_utils — TensorFlow version 2.3.0 available.


# Mount Google Drive
Because any data in the VM this notebook is running on will be nuked once the Jupyter kernel stops running, it's helpful to mount your Google Drive to the Colab VM to persist some files that we'll use in this notebook.

*Note:* your data will not be shared with anyone who does not have direct access to the VM running this Colab notebook.

In [2]:
mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Download Dataset from Kaggle
Downloads the `anime-subtitles` dataset contributed by Kaggle user `jef1056`.

1.   Sign into Kaggle in a separate tab
2.   Click [this link](https://kaggle.com/me/account) to go to your Kaggle account settings
3. Under the `API` section, click/tap `Create new API token`. If this is not the first time you have followed this step, consider clicking `Expire API Token` prior to generating a new token.
4. In the Colab file browser, upload the `kaggle.json` API token you just downloaded in step 3.
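
Before moving the token into place, it can help to confirm that the upload actually worked. The sketch below is a hypothetical helper (`check_kaggle_token` is not part of the Kaggle CLI); it assumes the token is a JSON file containing `username` and `key` fields, which is the layout Kaggle uses for `kaggle.json`.

```python
import json
from pathlib import Path

def check_kaggle_token(path="kaggle.json"):
    """Return True if `path` looks like a usable Kaggle API token (hypothetical helper)."""
    token_path = Path(path)
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False
    # Both fields must be present and non-empty.
    return isinstance(token, dict) and all(token.get(field) for field in ("username", "key"))
```

Calling `check_kaggle_token()` in a cell right after the upload returns `False` if the file is missing, is not valid JSON, or lacks either field, which is easier to diagnose than a failed `kaggle datasets download`.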



In [3]:
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!rm -rf anime-subtitles.zip
!kaggle datasets download -d jef1056/anime-subtitles
!rm -rf 'Anime Datasets V3.zip'
!rm -rf 'input (Cleaned).txt'
!unzip anime-subtitles.zip
!wc -l 'input (Cleaned).txt'

Downloading anime-subtitles.zip to /content
 92% 129M/140M [00:01<00:00, 50.6MB/s] 
100% 140M/140M [00:02<00:00, 70.9MB/s]
Archive:  anime-subtitles.zip
  inflating: Anime Datasets V3.zip   
  inflating: input (Cleaned).txt     
1248751 input (Cleaned).txt


# Train Tokenizer on Dataset
Tokenizer training is CPU-bound and may take a few minutes.

In [4]:
file_name = 'input (Cleaned).txt'


In [5]:
def cleandir(rm_model):
  """Delete the tokenizer files and, if rm_model is True, the trained model."""
  print('cleaning working directory...')
  # Remove tokenizer artifacts left over from a previous run.
  !rm -rf aitextgen-merges.txt
  !rm -rf aitextgen-vocab.json
  if rm_model is True:
    !rm -rf /content/trained_model
  elif rm_model is False:
    print('note: rm_model set to False, skipping model deletion.')
  else:
    print('note: rm_model not set to True or False, skipping model deletion.')

In [None]:
cleandir(False)

In [6]:
train_tokenizer(file_name)

11/15/2020 00:29:21 — INFO — aitextgen.tokenizers — Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.


# Configure GPT-2 Training
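Set the configuration values that define the GPT-2 model we train on the dataset. Because the model is constructed from this config rather than loaded from pretrained weights, it starts out untrained: a small 8-layer network with 256-dimensional embeddings, a 30,000-token vocabulary, and a maximum length of 64 tokens.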

In [7]:
config = build_gpt2_config(vocab_size=30000, 
                           max_length=64, 
                           dropout=0.0, 
                           n_embd=256, 
                           n_layer=8, 
                           n_head=8)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 64,
  "n_embd": 256,
  "n_head": 8,
  "n_layer": 8,
  "n_positions": 64,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 30000
}
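
To get a feel for how small this model is, the sketch below estimates its parameter count from the config values above. It assumes the standard GPT-2 block layout (fused QKV projection, a feed-forward layer four times the embedding width, two LayerNorms per block) with the output layer tied to the token embeddings; `gpt2_param_count` is a hypothetical helper, not part of aitextgen, and the real count may differ slightly.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Approximate trainable parameters of a GPT-2 style model with tied output embeddings."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # token + position tables
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # two feed-forward layers (4x hidden width)
    layer_norms = 2 * (2 * d)                        # two LayerNorms per block (weight + bias)
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d                               # LayerNorm after the last block
    return embeddings + n_layer * per_block + final_norm

total = gpt2_param_count(vocab_size=30000, n_positions=64, n_embd=256, n_layer=8)
print(f"{total:,} parameters")  # 14,014,976 parameters
```

Under these assumptions the model has roughly 14 million parameters, more than half of them in the token embedding table. Note that `n_head` does not change the count; it only controls how the 256 embedding dimensions are split across attention heads.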

In [8]:
ai = aitextgen(config=config,
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

11/15/2020 00:29:21 — INFO — aitextgen — Constructing GPT-2 model from provided config.
11/15/2020 00:29:22 — INFO — aitextgen — Using a custom tokenizer.


# Re-train GPT-2 to Dataset

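This task is GPU-bound. The original estimate was just under two hours on an NVIDIA V100 GPU, but going by the log timestamps, the run recorded below (`num_steps=100000`) took about seven hours, from 00:30 to 07:33.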



In [9]:
!rm -rf trained_model

In [10]:
ai.train(file_name,
         line_by_line=True,
         num_steps=100000,
         generate_every=1000,
         save_every=500,
         save_gdrive=False,
         learning_rate=1e-4,
         batch_size=256)


11/15/2020 00:29:36 — INFO — aitextgen.TokenDataset — Encoding 1,248,751 sets of tokens from input (Cleaned).txt.





GPU available: True, used: True
11/15/2020 00:30:14 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
11/15/2020 00:30:14 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]



500 steps reached: saving model to /trained_model
1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
>And that it makes the pula is the  way I'll take her to that.

1,500 steps reached: saving model to /trained_model
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
>I just want to use a little like this!

2,500 steps reached: saving model to /trained_model
3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
>My turn!

3,500 steps reached: saving model to /trained_model
4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
>The truth is, you're just a good person.

4,500 steps reached: saving model to /trained_model
5,000 steps reached: saving model to /trained_model
5,000 steps reached: ge

11/15/2020 07:33:07 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model
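
The training arguments above imply how much data the model sees. The sketch below works through that arithmetic; it assumes each subtitle line is one training sample (as `line_by_line=True` suggests) and that each step consumes exactly one batch with no gradient accumulation. `training_summary` is a hypothetical helper, not part of aitextgen.

```python
import math

def training_summary(num_lines, num_steps, batch_size, save_every, generate_every):
    """Rough bookkeeping for a training run; assumes one batch per step."""
    samples_seen = num_steps * batch_size
    return {
        "samples_seen": samples_seen,
        "approx_epochs": samples_seen / num_lines,
        "steps_per_epoch": math.ceil(num_lines / batch_size),
        "checkpoints": num_steps // save_every,
        "sample_generations": num_steps // generate_every,
    }

summary = training_summary(
    num_lines=1_248_751,   # line count reported by `wc -l`
    num_steps=100_000,
    batch_size=256,
    save_every=500,
    generate_every=1000,
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under those assumptions the run passes over the dataset about 20.5 times, saving 200 checkpoints and printing 100 sample generations along the way. Lowering `num_steps` is the most direct way to shorten the run.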


# Generate Samples
Finally, the fun part! Have the model generate 25 samples. As you can see, the results are quite believable. Certain nuances specific to Japanese-to-English translation are learned and replicated in the generated output, such as honorifics like (name)-chan (an affectionate diminutive, often used for children, girls, and close friends) or (name)-kun (typically used for boys and young men); the output below includes `Subaru-sama`, for example. This is made extra impressive by the fact that the model learned only from raw subtitle text, with no labels or hand-written rules: *we never told the robot what to do, it figured it out 100% on its own!* Misuse may seem far-fetched for AI-generated anime subtitles, but please use this code responsibly: never use it to intentionally deceive or cause harm.

In [11]:
ai.generate(n=25,
            batch_size=16384,
            prompt=">",
            temperature=1,
            top_p=0.99999)

>You must have heard the rumor that the voices of the weretiger appeared near the street.

>What's the matter, Yananza?

>Well then, I want to eat you.

>You should be grateful to me.

>The truth is that none  answer Diana to your existence

>I had to live up here  just to recover the order.

>If you try out those  things and just learn how...

>We've got to come up with a story about we do!

>Subaru-sama...

>They need to make  sense of how dangerous it is.

>You okay?

>Huh?

>It's a real problem!

>This is quite the unforgettable rained

>What the hell was that, r-rel jerk?!

>Is it because there's anyone I can  really go to find something?

>But it's a good thing that's...

>and my family allies

>I-It's all right. Mm.

>There's nothing wrong with me\nin this many secrets!

>Yeah. We don't have any to make the first 
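
The `temperature` and `top_p` arguments in the generation call control how adventurous the sampling is. The sketch below is a simplified, pure-Python illustration of temperature scaling and nucleus (top-p) filtering over a toy distribution. It is not aitextgen's or transformers' actual implementation, and `softmax` and `top_p_filter` are hypothetical helper names.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Indices of the smallest set of tokens whose probabilities sum to at least top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.7))      # [0, 1]
print(top_p_filter(probs, 0.99999))  # [0, 1, 2, 3]
```

With `top_p=0.99999`, as in the cell above, almost nothing is filtered out, and `temperature=1` leaves the model's probabilities unscaled, so the samples reflect the model's raw distribution. Lowering either value should make the output more conservative and repetitive.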

In [12]:
!export model_archive="anime_subtitlegen_$(date +%d_%b_%Y_%H_%M_%S)" && mkdir "$model_archive" && mv aitextgen-* "$model_archive" && mv trained_model "$model_archive" && tar -cvf "$model_archive.tar" "$model_archive" && mv "$model_archive.tar" "drive/My Drive/" && echo "Model successfully backed up to Google Drive. Feel free to factory reset the runtime."

anime_subtitlegen_15_Nov_2020_07_33_09/
anime_subtitlegen_15_Nov_2020_07_33_09/aitextgen-merges.txt
anime_subtitlegen_15_Nov_2020_07_33_09/trained_model/
anime_subtitlegen_15_Nov_2020_07_33_09/trained_model/pytorch_model.bin
anime_subtitlegen_15_Nov_2020_07_33_09/trained_model/config.json
anime_subtitlegen_15_Nov_2020_07_33_09/aitextgen-vocab.json
Model successfully backed up to Google Drive. Feel free to factory reset the runtime.
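
The same backup can also be sketched in Python with the standard library's `tarfile` module, which avoids shell quoting issues with paths like `drive/My Drive/`. `archive_model` is a hypothetical helper, and unlike the shell cell above it copies files into the archive rather than moving them.

```python
import tarfile
from datetime import datetime
from pathlib import Path

def archive_model(paths, dest_dir, prefix="anime_subtitlegen", now=None):
    """Bundle `paths` into <dest_dir>/<prefix>_<timestamp>.tar and return the archive path."""
    stamp = (now or datetime.now()).strftime("%d_%b_%Y_%H_%M_%S")
    name = f"{prefix}_{stamp}"
    archive_path = Path(dest_dir) / f"{name}.tar"
    with tarfile.open(str(archive_path), "w") as tar:
        for path in map(Path, paths):
            if path.exists():  # skip anything that was never created
                # Store each item under a top-level folder named after the archive.
                tar.add(str(path), arcname=f"{name}/{path.name}")
    return archive_path
```

In this notebook it could be called as `archive_model(["aitextgen-vocab.json", "aitextgen-merges.txt", "trained_model"], "drive/My Drive")`, assuming Google Drive is mounted and the working directory is `/content`.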


Last updated:

In [14]:
import datetime as dt
print(dt.datetime.now())

2020-11-15 07:34:14.823167


# Credits

This project was made possible by the cumulative efforts of the following parties:

Brian Lechthaler *author of this notebook*
* https://github.com/brianlechthaler
* https://twitter.com/brianlechthaler

Max Woolf *author of [aitextgen](https://github.com/minimaxir/aitextgen), the training code this notebook is based on.*
* https://minimaxir.com/
* https://github.com/minimaxir

Jess Fan *[author](https://www.kaggle.com/jef1056) of the [anime-subtitles](https://www.kaggle.com/jef1056/anime-subtitles) dataset*
* https://github.com/JEF1056
* https://www.linkedin.com/in/jess-fan-677177196/

OpenAI *creators of the [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2) model*
* https://openai.com 
* https://openai.com/blog/tags/gpt-2/
