<a href="https://colab.research.google.com/github/pszemraj/ai-msgbot/blob/main/colab-notebooks/%5BGPT2_normal%5D_aitextgen_text_generation_%2B_training_on_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU using Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
!nvidia-smi

Fri Oct 15 04:17:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## formatting

In [None]:
from IPython.display import HTML, display

# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

# setup

In [None]:
# update torch in case using an A100 GPU
# (-q and -f are separate flags; -f must be followed directly by the find-links URL)
!pip3 install -q torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# cudatoolkit is a conda package and is not published on PyPI, so it cannot be installed with pip
# !pip3 install cudatoolkit==11.1


In [None]:
!pip install -q aitextgen

import logging

logging.basicConfig(
    format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

  Building wheel for aitextgen (setup.py) ... done
  Building wheel for fire (setup.py) ... done
  Building wheel for fu

In [None]:
mount_gdrive()

Mounted at /content/drive


## GPU

Colaboratory uses an Nvidia P4, an Nvidia T4, an Nvidia P100, or an Nvidia V100. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster, more memory-efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Fri Oct 15 04:19:01 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
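
The fp16 guidance above can be turned into a small check. This is only a sketch: the helper name `should_enable_fp16` and the example GPU names are assumptions for illustration, and the rule simply encodes this section's "T4 or V100" statement rather than querying the hardware.

```python
def should_enable_fp16(gpu_name: str) -> bool:
    """Return True when the GPU name matches the T4/V100 guidance in this section."""
    name = gpu_name.upper()
    return "T4" in name or "V100" in name


# Illustrative names (assumed), similar to what nvidia-smi reports.
for name in ["Tesla V100-SXM2-16GB", "Tesla T4", "Tesla P100-PCIE-16GB", "Tesla P4"]:
    print(f"{name}: fp16={should_enable_fp16(name)}")
```

In a live runtime, the name could come from `torch.cuda.get_device_name(0)` and the result could be passed as `fp16=` to `train()`.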

## Loading GPT-2 or GPT Neo

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also finetune a GPT Neo model instead. GPT Neo is more suitable for longer texts, and its base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.
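
As a rough sketch of how the choice between the two model families maps onto `aitextgen(...)` arguments: GPT-2 sizes go through `tf_gpt2`, while GPT Neo is loaded by model name through `model`. The helper `model_load_kwargs` is hypothetical, and only the 125M GPT Neo name is included because it is the one assumed here to be available on the Hugging Face Hub.

```python
def model_load_kwargs(family: str, size: str) -> dict:
    """Build keyword arguments for aitextgen(...) from a model family and size."""
    if family == "gpt2":
        if size not in {"124M", "355M", "774M"}:
            raise ValueError(f"Unknown GPT-2 size: {size}")
        return {"tf_gpt2": size}
    if family == "gpt-neo":
        # Only one Hub name is listed; other sizes are deliberately left out.
        neo_names = {"125M": "EleutherAI/gpt-neo-125M"}
        if size not in neo_names:
            raise ValueError(f"No Hub name listed for GPT Neo {size}")
        return {"model": neo_names[size]}
    raise ValueError(f"Unknown model family: {family}")


print(model_load_kwargs("gpt2", "355M"))
print(model_load_kwargs("gpt-neo", "125M"))
```

The result could then be unpacked into the constructor, for example `aitextgen(**model_load_kwargs("gpt2", "355M"), to_gpu=True)`.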

In [None]:
model_size = "355M"  # @param ["355M", "774M"]
load_from_folder = True  # @param {type:"boolean"}
load_folder_dir = "/content/drive/MyDrive/Programming/AI_peter/gpt2_std_gpu_355M"  # @param {type:"string"}

In [None]:
if load_from_folder:
    ai = aitextgen(
        model_folder=load_folder_dir, to_gpu=True, gradient_checkpointing=True
    )
else:
    ai = aitextgen(tf_gpt2=model_size, to_gpu=True, gradient_checkpointing=True)
# Other ways to load a model:
# ai = aitextgen(tf_gpt2="124M", to_gpu=True)
# A Hugging Face model name also works, e.g. distilgpt2
# (https://huggingface.co/distilgpt2) or gpt2-medium:
# ai = aitextgen(model='gpt2-medium',
#                to_gpu=True,
#                gradient_checkpointing=True)
# To use GPT Neo instead, pass a GPT Neo model name:
# ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

10/15/2021 04:19:34 — INFO — aitextgen — Loading model from provided weights and config in //content/drive/MyDrive/Programming/AI_peter/gpt2_std_gpu_355M.
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
10/15/2021 04:19:48 — INFO — aitextgen — GPT2 loaded with 354M parameters.
10/15/2021 04:19:48 — INFO — aitextgen — Gradient checkpointing enabled for model training.
10/15/2021 04:19:48 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


# load training data

In [None]:
dl_link = "https://www.dropbox.com/s/ni17vlcp2urxnci/apple_and_whatsapp_msgs.txt?dl=1"  # @param {type:"string"}

In [None]:
# download the training text file
from urllib import request
from os.path import join
import os

vm_wd = os.getcwd()
local_name = join(vm_wd, "appleANDwhatsapp.txt")
request.urlretrieve(dl_link, local_name)

('/content/appleANDwhatsapp.txt', <http.client.HTTPMessage at 0x7f4377bfe910>)

In [None]:
file_name = "appleANDwhatsapp.txt"

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting the `file_name` in the previous cell to that.
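
The 10MB rule of thumb above can be checked before deciding how to get the file into the VM. A minimal sketch using only the standard library; the helper name `is_large_dataset` is hypothetical, and treating 1MB as 1024 × 1024 bytes is an assumption.

```python
import os


def is_large_dataset(path: str, threshold_mb: float = 10.0) -> bool:
    """Return True when the file at `path` is larger than `threshold_mb` megabytes."""
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb > threshold_mb
```

For example, `is_large_dataset(file_name)` returning `True` would suggest uploading the file to Google Drive first and copying it from there.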

In [None]:
# copy_file_from_gdrive(file_name)

# Train / Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the loss, the better the model), and average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
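
To get a feel for how these parameters interact, the sketch below works out how many periodic saves and sample generations a run produces, plus the effective batch size. The helper `training_plan` is hypothetical; `gradient_accumulation_steps` is taken from this notebook's own `train()` call, and the effective batch size uses the usual `batch_size * gradient_accumulation_steps` rule as an assumption about how the trainer accumulates gradients.

```python
def training_plan(num_steps, save_every, generate_every,
                  batch_size, gradient_accumulation_steps=1):
    """Summarize a training configuration without running any training."""
    return {
        # Periodic saves only; the final save when training completes is extra.
        "periodic_saves": num_steps // save_every,
        "sample_rounds": num_steps // generate_every,
        "effective_batch_size": batch_size * gradient_accumulation_steps,
    }


plan = training_plan(
    num_steps=60000,
    save_every=500,
    generate_every=500,
    batch_size=1,
    gradient_accumulation_steps=1,
)
print(plan)
```

With the values used in this notebook, that is 120 periodic saves and 120 rounds of sample text at an effective batch size of 1.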

In [None]:
import gc, os
from os.path import join

base_dir = "/content/drive/MyDrive/Programming/AI_peter"
temp_gpu_path = join(base_dir, "gp2_std_{}".format(model_size))
os.makedirs(temp_gpu_path, exist_ok=True)
gc.collect()

239

In [None]:
ai.train(
    file_name,
    output_dir=temp_gpu_path,
    line_by_line=False,
    from_cache=False,
    num_steps=60000,  # takes about 5 hours on 16 gb GPU for 75000
    generate_every=500,
    max_grad_norm=0.5,
    save_every=500,
    gradient_accumulation_steps=1,
    save_gdrive=True,
    #  learning_rate=1e-4,
    learning_rate=1e-3,
    fp16=True,
    batch_size=1,
    fp16_opt_level="O1",
    warmup_steps=500,
)

10/15/2021 04:20:21 — INFO — aitextgen — Loading text from appleANDwhatsapp.txt with generation length of 1024.


  0%|          | 0/239726 [00:00<?, ?it/s]

10/15/2021 04:20:21 — INFO — aitextgen.TokenDataset — Encoding 239,726 sets of tokens from appleANDwhatsapp.txt.
10/15/2021 04:20:27 — INFO — pytorch_lightning.trainer.connectors.accelerator_connector — Using native 16bit precision.
10/15/2021 04:20:27 — INFO — pytorch_lightning.utilities.distributed — GPU available: True, used: True
10/15/2021 04:20:27 — INFO — pytorch_lightning.utilities.distributed — TPU available: False, using: 0 TPU cores
10/15/2021 04:20:27 — INFO — pytorch_lightning.utilities.distributed — IPU available: False, using: 0 IPUs
10/15/2021 04:20:27 — INFO — pytorch_lightning.accelerators.gpu — LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/60000 [00:00<?, ?it/s]

500 steps reached: saving model to //content/drive/MyDrive/Programming/AI_peter/gp2_std_355M
500 steps reached: generating sample texts.
 lol

peter szemraj:
i'm all for one hour and m mins

peter szemraj:
let's play cod tonight

jonathan:
yes

jonathan:
but i won't be able to watch on tues

jonathan:
then i will go back to us

peter szemraj:
https://huggingface.co/transformers/model-doc/gpt-ne trained-quick.html

peter szemraj:
get to speed on the ram. it's easy

jonathan:
why it does not work?

jonathan:
idk why it's so nice

peter szemraj:
the ram probably works better one

peter szemraj:
it sends in cpu and i already trained it so i will let you know if you never can ski down

peter szemraj:
https://:(


  torch.nn.utils.clip_grad_norm_(parameters, clip_val)


1,000 steps reached: saving model to //content/drive/MyDrive/Programming/AI_peter/gp2_std_355M
1,000 steps reached: generating sample texts.
 to all bring back

peter szemraj:
so this is the official guide

peter szemraj:
 bouldering is bouldering which you can access with if you like that one, i don't need to take a swiss bank account but they want to charge you for a wire transfer to and are all fine

calvin miller:
i can make just go to den ass off the phone with the best way to me

peter szemraj:
i know you want to be able to attend

peter szemraj:
which one of the most different boulders you don't even have on, so on to wait for me to show up front of the island

peter szemraj:
i'm going to try and do it a tad earlier

peter szemraj:
it's not as hard as houston so there are down to be much of a sbb, but there is still a place open tomap

calvin miller:
i can just
1,500 steps reached: saving model to //content/drive/MyDrive/Programming/AI_peter/gp2_std_355M


10/15/2021 08:39:51 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/MyDrive/Programming/AI_peter/gp2_std_355M


In [None]:
save_path = "/content/drive/MyDrive/Programming/AI_peter/gpt2_std_gpu_{}".format(
    model_size
)

In [None]:
import os

os.makedirs(save_path, exist_ok=True)
ai.save(save_path)

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


# Use a Trained Model for Generation

If you already have a trained model from this notebook, running the cells below will copy the `pytorch_model.bin` and the `config.json` file from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)
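
Before loading, it can help to confirm that the folder really contains both files. A minimal sketch using only the standard library; the helper name `missing_model_files` is hypothetical.

```python
import os

# The two files this section says are needed to reload a trained model.
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(folder: str) -> list:
    """Return the required model files that are not present in `folder`."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

An empty list from `missing_model_files(from_folder)` means both files are in place; anything else names what still has to be copied or saved.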

In [None]:
!nvidia-smi

Fri Oct 15 08:40:27 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |   4779MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# best model thus far @ 1.3B parameters and tuned for 50k steps
# from_folder = "/content/drive/MyDrive/Programming/AI_peter/GPT-Neo-1B-V1"

from_folder = save_path

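if len(from_folder) > 2:
    # Copy the weights and config from the Drive folder into the VM,
    # then load the model straight from that folder.
    for file in ["pytorch_model.bin", "config.json"]:
        copy_file_from_gdrive(file, from_folder)

    ai = aitextgen(model_folder=from_folder, to_gpu=True)
else:
    # No folder given: copy both files from the root of Google Drive
    # into the VM, then load them from the current directory.
    for file in ["pytorch_model.bin", "config.json"]:
        copy_file_from_gdrive(file)

    ai = aitextgen(model_folder=".", to_gpu=True)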

10/15/2021 08:40:41 — INFO — aitextgen — Loading model from provided weights and config in //content/drive/MyDrive/Programming/AI_peter/gpt2_std_gpu_355M.
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
10/15/2021 08:40:47 — INFO — aitextgen — GPT2 loaded with 354M parameters.
10/15/2021 08:40:47 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Generate Text From The Trained Model


`generate()` without any parameters generates a single text from the loaded model to the console.

In [None]:
ai.generate(n=3, max_length=256, temperature=1.0, top_p=0.9)

the way but i'd estimate now that i think if i don't respond you often i'm more likely back (sp?) it's a series of little i think i don't have it yet

peter szemraj:
it's just that big of a) i'd have you named that it being a good friendship with a person before or that you don't care about it you can be silent (and still doing so). you have to apologize for what you do, it's not weird)

peter szemraj:
(and i tell you that there is no close option that is not your comment)

peter szemraj:
you have 90%7 (k), but this is how i feel) i can't tell you a time in advance so i don't have to respond on it?

peter szemraj:
was way too busy then my memory is locked and i'm tired as hell lol

nika schrauf:
dude i'm in uni on give this to you... at least you know i'd be ready at home and it's


In [None]:
ai.generate(
    prompt="give me a good pickup line! peter szemraj:",
    temperature=1,
    min_length=10,
    batch_size=20,
)

give me a good pickup line! peter szemraj:
will do!

peter szemraj:
my flight for july 25

peter szemraj:
i saw met some portuguese come i haven't said anything in me so soon

lillie szemraj:
my plane got u_

peter szemraj:
https://www.instagram.com/p/b2w5dvbxwielmjk9cdreea?si=-cvgd7pusgwxsgc8tfzeaq

peter szemraj:
https://www.instagram.com/p/b9dznfhdwv6/?igshid=dda32akjyjj

peter szemraj:
try this out

lillie szemraj:
https://www.reddit.com/r/dreddit/comments/b9dox/happy_early_decision_applic


If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
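
To make `temperature`, `top_k`, and `top_p` more concrete, here is a plain-Python sketch of what each one does to a next-token distribution. It is an illustration under simplifying assumptions, not the code aitextgen runs internally; the function names and the toy numbers are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the allowed token ids and renormalize their probabilities.

    top_k=0 disables the top-k cut; top_p=1.0 disables nucleus filtering.
    """
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Nucleus step: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(toy_probs, top_p=0.9))
print(filter_top_k_top_p(toy_probs, top_k=2))
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0))
```

Sampling then draws the next token from the renormalized dictionary, which is why a low `top_k` or `top_p` reins in "crazy" output: unlikely tokens are removed before the draw.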

In [None]:
ai.generate(
    n=3,
    batch_size=25,
    prompt="lillie szemraj: i just",
    max_length=256,
    temperature=1.0,
    top_p=0.9,
)

lillie szemraj: i just went to bed at 10 am with mom and dad

lillie szemraj:
check ur iron bro

lillie szemraj:
too high it too

peter szemraj:
doesn't wait to go out to wait

peter szemraj:
also hiking shoes and winter jackets, and much time we spending in hammock. i'm pretty comfortable!! i just have to wear this for when i have in the place to hang the trail to houston

peter szemraj:
hey it's only like 4 years oldies which are in the terrace and it's a series of little longer and different color

lillie szemraj:
wow that's a hell

lillie szemraj:
the purple one

peter szemraj:
yes

peter szemraj:
so yes i thought it was this one that suits and i had


For bulk generation, you can generate a large number of texts to files and sort out the samples locally on your computer. The next cells write one file per prompt, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!
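
Prompts work best when they follow the same chat layout as the sample outputs above: a speaker name followed by a colon, with the message on the next line. The sketch below builds a prompt that ends right where a chosen speaker would start replying; the helper name `build_prompt` is hypothetical.

```python
def build_prompt(message: str, responder: str) -> str:
    """Format an incoming message so the model continues as `responder`."""
    # Blank line after the message, then the responder's name and a colon,
    # then a newline so the generated text starts on the reply line.
    return f"{message}\n\n{responder}:\n"


print(repr(build_prompt("how are you doing?", "peter szemraj")))
```

Each string built this way can be passed as `prompt=` to `generate()` or `generate_to_file()`.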

In [None]:
save_loc = (
    "/content/drive/MyDrive/Programming/AI_peter/output_files"  # @param {type:"string"}
)

In [None]:
p_list = [
    ["how are you doing?" + "\n", "\n", "peter szemraj:" + "\n"],
    ["christopher szemraj:" + "\n", "why is "],
    ["can you help me with my homework?" + "\n", "\n", "peter szemraj:" + "\n"],
    ["peter szemraj:" + "\n"],
    [
        "Hey I’m meeting the astrophysics professor via zoom after school any tips?"
        + "\n",
        "\n",
        "peter szemraj:" + "\n",
    ],
]


prompts = ["".join(line) for line in p_list]

In [None]:
from datetime import datetime
import pprint as pp

ds_date_time = datetime.now().strftime("%m.%d.%Y")

base_header = "gpt-neo-textgen-{}".format(ds_date_time)
prompt_IDs = [
    base_header + "_file-{}.txt".format(i + 1) for i in range(5, len(prompts) + 11)
]

prompt_mng = {}
for pid, text in zip(prompt_IDs, prompts):
    prompt_mng[pid] = text
pp.pprint(prompt_mng)

{'gpt-neo-textgen-10.15.2021_file-10.txt': 'Hey I’m meeting the astrophysics '
                                           'professor via zoom after school '
                                           'any tips?\n'
                                           '\n'
                                           'peter szemraj:\n',
 'gpt-neo-textgen-10.15.2021_file-6.txt': 'how are you doing?\n'
                                          '\n'
                                          'peter szemraj:\n',
 'gpt-neo-textgen-10.15.2021_file-7.txt': 'christopher szemraj:\nwhy is ',
 'gpt-neo-textgen-10.15.2021_file-8.txt': 'can you help me with my homework?\n'
                                          '\n'
                                          'peter szemraj:\n',
 'gpt-neo-textgen-10.15.2021_file-9.txt': 'peter szemraj:\n'}


In [None]:
from os.path import join

for pfile, my_prompt in prompt_mng.items():
    ai.generate_to_file(
        n=50,
        batch_size=5,
        prompt=my_prompt,
        max_length=1024,
        temperature=0.8,
        top_p=0.9,
        destination_path=join(save_loc, pfile),
    )

10/15/2021 09:31:10 — INFO — aitextgen — Generating 50 texts to /content/drive/MyDrive/Programming/AI_peter/output_files/gpt-neo-textgen-10.15.2021_file-6.txt


  0%|          | 0/50 [00:00<?, ?it/s]

10/15/2021 09:35:42 — INFO — aitextgen — Generating 50 texts to /content/drive/MyDrive/Programming/AI_peter/output_files/gpt-neo-textgen-10.15.2021_file-7.txt


  0%|          | 0/50 [00:00<?, ?it/s]

10/15/2021 09:40:13 — INFO — aitextgen — Generating 50 texts to /content/drive/MyDrive/Programming/AI_peter/output_files/gpt-neo-textgen-10.15.2021_file-8.txt


  0%|          | 0/50 [00:00<?, ?it/s]

10/15/2021 09:44:41 — INFO — aitextgen — Generating 50 texts to /content/drive/MyDrive/Programming/AI_peter/output_files/gpt-neo-textgen-10.15.2021_file-9.txt


  0%|          | 0/50 [00:00<?, ?it/s]

10/15/2021 09:49:06 — INFO — aitextgen — Generating 50 texts to /content/drive/MyDrive/Programming/AI_peter/output_files/gpt-neo-textgen-10.15.2021_file-10.txt


  0%|          | 0/50 [00:00<?, ?it/s]