# JokePT2
This started as a Colab notebook, which I pulled down for posterity.

In it, I purposely fine-tune GPT-2 through `aitextgen`, a GPT-2 wrapper, for fewer iterations than would be sufficient, on:
* A transcript of a Charles Manson interview
* Yo momma jokes

The training process was essentially the same for both, except that the yo momma jokes model was trained with the flag `line_by_line=True`.

Results are in the "training output" folder.

In [1]:
# install packages

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive


In [2]:
# check what GPU I've got
!nvidia-smi

Mon Sep 20 00:18:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

### Download/load 124M GPT-2 model

In [11]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

09/20/2021 00:55:07 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.
09/20/2021 00:55:10 — INFO — aitextgen — GPT2 loaded with 124M parameters.
09/20/2021 00:55:10 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


### Mount Google drive
Mounts your personal Google Drive in the Colab VM, so files can be copied in and out.

In [4]:
mount_gdrive()

Mounted at /content/drive


### Load in training files

In [12]:
# filename = "all_ym_jokes.txt"
filename = "manson.txt"
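Since the two runs differ only in the input file and the `line_by_line` flag, one way to avoid toggling comments in two separate cells is to keep both settings together. This is a minimal sketch, not part of the original notebook: `DATASETS` and `pick_dataset` are hypothetical names, and the flag values are taken from the description at the top of this README.

```python
# Hypothetical helper: pair each corpus with the flag it was trained with,
# so switching datasets means changing one name instead of editing two cells.
DATASETS = {
    "manson": {"filename": "manson.txt", "line_by_line": False},
    "yo_momma": {"filename": "all_ym_jokes.txt", "line_by_line": True},
}


def pick_dataset(name):
    """Return (filename, line_by_line) for a named corpus."""
    settings = DATASETS[name]
    return settings["filename"], settings["line_by_line"]


filename, line_by_line = pick_dataset("manson")
print(filename, line_by_line)
```

The returned `line_by_line` value could then be passed straight to `ai.train()` instead of being edited by hand.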

### Tokenize the text

In [13]:
from aitextgen.tokenizers import train_tokenizer

# train custom tokenizer
train_tokenizer(filename)
tokenizer_file = "aitextgen.tokenizer.json"

In [14]:
from aitextgen.TokenDataset import TokenDataset

ai = aitextgen(tf_gpt2="124M", to_gpu=True, tokenizer_file=tokenizer_file)
data = TokenDataset(filename, tokenizer_file=tokenizer_file, block_size=64)

09/20/2021 00:56:03 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.
09/20/2021 00:56:05 — INFO — aitextgen — GPT2 loaded with 124M parameters.
09/20/2021 00:56:05 — INFO — aitextgen — Using a custom tokenizer.



09/20/2021 00:56:05 — INFO — aitextgen.TokenDataset — Encoding 373 sets of tokens from manson.txt.


### Finetune GPT-2
Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.
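The `line_by_line` flag is the one setting that differed between the two runs. As a rough illustration of the idea, the sketch below shows how the same file contents can be viewed either as one continuous text or as one record per line. This is not aitextgen's internal code, and `split_records` is a hypothetical helper written only for this example.

```python
def split_records(text, line_by_line):
    """Illustrative only: how the two modes view the same file contents."""
    if line_by_line:
        # One training record per non-empty line (e.g. one joke per row).
        return [line.strip() for line in text.splitlines() if line.strip()]
    # Otherwise the whole file is treated as a single continuous text.
    return [text]


sample = "first joke\nsecond joke\n\nthird joke\n"
print(len(split_records(sample, line_by_line=True)))   # 3
print(len(split_records(sample, line_by_line=False)))  # 1
```

A jokes file suits the first mode because each line is a complete, independent item, while an interview transcript reads as one running text.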

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
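To see what the interval parameters mean for a run, the sketch below lists the steps at which a periodic action would fire. It assumes the action triggers whenever the step count is a multiple of the interval, which matches the "1,000 steps reached" message in the training output but is otherwise an assumption. `checkpoint_steps` is a hypothetical helper, not an aitextgen function.

```python
def checkpoint_steps(num_steps, every):
    """Steps at which a periodic action fires, assuming it triggers on multiples of `every`."""
    if every <= 0:
        # Treat a non-positive interval as "never" in this sketch.
        return []
    return list(range(every, num_steps + 1, every))


# Values used in this notebook: num_steps=3000, generate_every=1000, save_every=1000.
print(checkpoint_steps(3000, 1000))  # [1000, 2000, 3000]
```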

In [15]:
ai.train(data,
         # line_by_line=True,  # used instead for the yo momma jokes run
         line_by_line=False,
         from_cache=False,
         num_steps=3000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1)

09/20/2021 00:56:12 — INFO — pytorch_lightning.utilities.distributed — GPU available: True, used: True
09/20/2021 00:56:12 — INFO — pytorch_lightning.utilities.distributed — TPU available: False, using: 0 TPU cores
09/20/2021 00:56:12 — INFO — pytorch_lightning.utilities.distributed — IPU available: False, using: 0 IPUs
09/20/2021 00:56:12 — INFO — pytorch_lightning.accelerators.gpu — LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
, and you mad at me and you are all so you are all projected him with your children. You are too deaf, and you are.
It is mad at me, and you canday, and that you can say. You are a stolezf, and you are too deaf, but you are too deaf, dumb and blind to stand and you have any of the girls, and blind to believe in the money, and project, but you are all so come at me. You are doing that you. You have got got me. You are too deaf, and live with your children that you. You are all going to sits, and I have done, "You are all crsed. You are all going to you are what you are too deaf, but you are all going to you have done them." I say, and blind to get really, because that is what that you have going to you are too deaf, and blind to testifrible, and that your brain with this back and pral deaf, and blind to only only live with You only learned, "You are mad attention, but when y

09/20/2021 01:05:09 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model


### Generating multiple responses

You can pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
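The three sampling knobs above can be illustrated on a toy distribution. The sketch below is a simplified, standard-library version of temperature scaling, top-k filtering, and nucleus (top-p) filtering. It is not the code aitextgen or its backend runs, and the function name and toy logits are made up for the example.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Return token probabilities after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax; higher values flatten the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens; every other token gets probability 0.
    kept_total = sum(probs[i] for i in kept)
    return [probs[i] / kept_total if i in kept else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
print(sampling_distribution(toy_logits, temperature=0.7, top_p=0.9))
print(sampling_distribution(toy_logits, temperature=1.2, top_k=2))
```

Raising the temperature, as the 1000-sample run does with `temperature=1.2`, moves probability away from the most likely token, which is one reason those samples read as more erratic.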

In [16]:
ai.generate(n=5,
            batch_size=5,
            max_length=100,
            temperature=0.7,
            top_p=0.9)

a case. I don't know why they know why I know what I don't know what I am forgetfen day it is forgetfit day it is or any way, the money that monthstiurch counts in your world ga is the Thirteenth undergry. You only su hippierestustorses is now that lieenth undergry outor, and you from now. You're all any
e. You're going blind until you thought. You know why, they are dying. You thought was afr thought afr thoughts sen away afr thought a know pubfe.
Well, oh sold sons everything that only give what they have made you think that coun the now.

If I don't have any liest of you want someone in jail cell and any kind of you like you
a Yanke about someone about things. You could have pict a woman would do: the highomobile as if woman was supposedct Actk reforneood in jaughteah so I have public opinion and actually, "Charlie think sold a year say, "Charaid of him." And do that arre did not give them." And do you only know why, but do
en the different. You, but I am forgetfie one day. I for

Generate 1000 samples of output text

In [17]:
ai.generate_to_file(n=1000, 
                    batch_size=5, 
                    max_length=50,   
                    top_p=0.9, 
                    temperature=1.2)

09/20/2021 01:05:25 — INFO — aitextgen — Generating 1,000 texts to ATG_20210920_010525_81724029.txt

