#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU



In [1]:
!pip install git+https://github.com/scorixear/aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )


Collecting git+https://github.com/scorixear/aitextgen
  Cloning https://github.com/scorixear/aitextgen to /tmp/pip-req-build-irexmaij
  Running command git clone --filter=blob:none --quiet https://github.com/scorixear/aitextgen /tmp/pip-req-build-irexmaij
  Resolved https://github.com/scorixear/aitextgen to commit 86e4be70a41a8740526f560d4ef0ba7647ed0f36
  Preparing metadata (setup.py) ... done
Collecting fire~=0.5.0 (from aitextgen==0.6.1)
  Downloading fire-0.5.0.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting pytorch-lightning~=2.0.0 (from aitextgen==0.6.1)
  Downloading pytorch_lightning-2.0.9.post0-py3-none-any.whl.metadata (23 kB)
Collecting transformers~=4.26.0 (from aitextgen==0.6.1)
  Downloading transformers-4.26.1-py3-none-any.whl.metadata (100 kB)

In [2]:
import torch
torch.cuda.is_available()

True

In [3]:

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

## GPU

In [3]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [4]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Comment out the above line and uncomment the below line to use GPT Neo instead.
# ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

INFO:aitextgen:Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

Fetching hparams.json:   0%|          | 0.00/90.0 [00:00<?, ?it/s]

Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/498M [00:00<?, ?it/s]

Fetching model.ckpt.index:   0%|          | 0.00/5.21k [00:00<?, ?it/s]

Fetching model.ckpt.meta:   0%|          | 0.00/471k [00:00<?, ?it/s]

INFO:aitextgen:Converting the 124M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/124M
Loading TF weight model/h0/attn/c_attn/b with shape [2304]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h0/attn/c_proj/b with shape [768]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 768, 768]
Loading TF weight model/h0/ln_1/b with shape [768]
Loading TF weight model/h0/ln_1/g with shape [768]
Loading TF weight model/h0/ln_2/b with shape [768]
Loading TF weight model/h0/ln_2/g with shape [768]
Loading TF weight model/h0/mlp/c_fc/b with shape [3072]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 768, 3072]
Loading TF weight model/h0/mlp/c_proj/b with shape [768]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 3072, 768]
Loading TF weight model/h1/attn/c_attn/b with shape [2304]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h1/attn/c_proj/b wi

Save PyTorch model to aitextgen/pytorch_model.bin


INFO:aitextgen:Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
INFO:aitextgen:GPT2 loaded with 124M parameters.
INFO:aitextgen:Using the default GPT-2 Tokenizer.
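The weight shapes printed in the conversion log above are enough to sanity-check the "124M" label. The following is a rough sketch, not part of aitextgen: the shapes 768, 2304, and 3072 come from the log, while the layer count, vocabulary size, context length, and the sharing of input and output embeddings are assumptions about the standard GPT-2 124M configuration.

```python
# Shapes taken from the conversion log above; the remaining values are
# assumptions about the standard GPT-2 124M configuration.
D_MODEL = 768        # hidden size, e.g. ln_1/g has shape [768]
D_ATTN = 2304        # fused query/key/value projection: 3 * 768
D_MLP = 3072         # feed-forward width, e.g. mlp/c_fc/b has shape [3072]
N_LAYERS = 12        # assumption: blocks h0 .. h11
VOCAB_SIZE = 50257   # assumption: eos_token_id 50256 is the last token id
N_POSITIONS = 1024   # assumption: maximum context length


def block_params() -> int:
    """Parameters in one transformer block (weights plus biases)."""
    layer_norms = 2 * (2 * D_MODEL)              # ln_1 and ln_2: gain + bias
    c_attn = D_MODEL * D_ATTN + D_ATTN           # attn/c_attn w + b
    attn_proj = D_MODEL * D_MODEL + D_MODEL      # attn/c_proj w + b
    c_fc = D_MODEL * D_MLP + D_MLP               # mlp/c_fc w + b
    mlp_proj = D_MLP * D_MODEL + D_MODEL         # mlp/c_proj w + b
    return layer_norms + c_attn + attn_proj + c_fc + mlp_proj


def total_params() -> int:
    """Token + position embeddings, all blocks, and the final layer norm.

    Assumes the output layer reuses the token embedding matrix.
    """
    embeddings = VOCAB_SIZE * D_MODEL + N_POSITIONS * D_MODEL
    final_layer_norm = 2 * D_MODEL
    return embeddings + N_LAYERS * block_params() + final_layer_norm


n = total_params()
print(f"{n:,} parameters")
print(f"{n * 4 / 1e6:.3f} MB at 4 bytes per parameter (fp32)")
print(f"{n * 2 / 1e6:.3f} MB at 2 bytes per parameter (fp16)")
```

Under these assumptions the fp32 figure agrees with the `497.759` MB "Total estimated model params size" that the training summary reports later in this notebook. The fp16 line only halves the weight storage; it says nothing about activations or optimizer state.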


## Mounting Google Drive



In [6]:
mount_gdrive()

Mounted at /content/drive


In [7]:
file_name = "tale.txt"  # Cinderella story

In [8]:
copy_file_from_gdrive(file_name)

## Finetune GPT-2

The next cell starts the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar shows training progress, the current loss (the lower the loss, the better the model fits the training text), and the average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

Training might time out after roughly 4 hours. If you did not mount Google Drive, make sure you end training and save the results so you don't lose them! If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).
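If you prefer to copy the results yourself rather than rely on `save_gdrive`, a manual backup could look roughly like the sketch below. The helper `backup_model` is hypothetical (not part of aitextgen), and the destination `/content/drive/MyDrive` is an assumption about where Colab exposes your Drive after mounting at `/content/drive`.

```python
import shutil
import time
from pathlib import Path


def backup_model(src="trained_model", dst_root="/content/drive/MyDrive"):
    """Copy `src` into a new timestamped folder under `dst_root`.

    Returns the path of the copy. Raises if `src` does not exist, or if a
    folder with the same timestamp already exists under `dst_root`.
    """
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"No model folder at {src_path}")
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dst_path = Path(dst_root) / f"{src_path.name}_{stamp}"
    shutil.copytree(src_path, dst_path)
    return dst_path


# Example (run after training, with Google Drive mounted):
# print(backup_model())
```

The timestamp in the folder name keeps successive backups from overwriting each other.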

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.
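For `line_by_line`, the input has to be a single-column CSV with one record per row. A minimal sketch of producing such a file with Python's standard `csv` module follows; the helper names and the records are hypothetical. Whether the first row is treated as a header depends on the library's loader, so check that before relying on this layout.

```python
import csv
import os
import tempfile


def write_single_column_csv(records, path):
    """Write one record per row; csv.writer quotes commas, quotes, and newlines."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for record in records:
            writer.writerow([record])


def read_single_column_csv(path):
    """Read the file back as a list of strings, skipping empty rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]


# Hypothetical records, for illustration only.
records = ["Once upon a time, there was a girl.", 'She said, "I must go."']
path = os.path.join(tempfile.gettempdir(), "records.csv")
write_single_column_csv(records, path)
print(read_single_column_csv(path))
```

Going through `csv.writer` instead of joining strings by hand matters because records that contain commas or quotation marks would otherwise be split into several columns.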

Here are other parameters for `train()` that are useful but that you likely do not need to change:

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
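Of the parameters listed above, `num_steps`, `generate_every`, and `save_every` together decide when checkpoints and sample texts appear during a run. The toy function below sketches that schedule under the assumption that an event fires whenever the step number is a multiple of the interval; it is an illustration, not aitextgen's training loop.

```python
def training_events(num_steps, generate_every, save_every):
    """Return {step: [events]} for the steps where a save or generation fires.

    Assumes an event fires when the step is a multiple of its interval and
    that an interval of 0 disables the event.
    """
    events = {}
    for step in range(1, num_steps + 1):
        fired = []
        if save_every > 0 and step % save_every == 0:
            fired.append("save")
        if generate_every > 0 and step % generate_every == 0:
            fired.append("generate")
        if fired:
            events[step] = fired
    return events


print(training_events(num_steps=100, generate_every=50, save_every=50))
```

With `num_steps=100`, `generate_every=50`, and `save_every=50`, the sketch yields a save and a generation at steps 50 and 100, which is the pattern visible in the training output of this notebook.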

In [9]:
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=100,
         generate_every=50,
         save_every=50,
         save_gdrive=False,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1,
         )

INFO:aitextgen:Loading text from tale.txt with generation length of 1024.


  0%|          | 0/279 [00:00<?, ?it/s]

INFO:aitextgen.TokenDataset:Encoding 279 sets of tokens from tale.txt.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type            | Params
------------------------------------------
0 | model | GPT2LMHeadModel | 124 M 
------------------------------------------
124 M     Trainable params
0         Non-trainable params
124 M     Total params
497.759   Total estimated model params size (MB)


  0%|          | 0/100 [00:00<?, ?it/s]

Configuration saved in trained_model/generation_config.json


50 steps reached: saving model to /trained_model


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


50 steps reached: generating sample texts.

And she did.

First, Cinderella went to the garden and returned with a large, ripe pumpkin. The fairy godmother tapped it once with her wand and—in an instant—it transformed into a majestic, golden carriage.

Then, she went to a trap in the kitchen and returned with a rat.

“Excellent, dear,” said the fairy godmother, “now, if you wouldn’t mind just placing it on the coachman’s seat for me.”

Cinderella obliged.
Another swoosh of the wand and, this time, a resplendent coachman appeared. The two mice—which Cinderella found in the pantry—were turned into two smartly attired footmen. And the four grasshoppers into four magnificent golden carriage.

Cinderella's beautiful golden carriage with coachman white horses and footmen

“And now,” said the fairy godmother, “for my final touch.”

She tapped Cinderella’s shoulder with her wand.

A spark flew up.

The ragged clothes in which she stood a moment earlier were replaced with a picture of s

Configuration saved in trained_model/generation_config.json


100 steps reached: saving model to /trained_model


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


100 steps reached: generating sample texts.


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=100` reached.


,” he said, bowing his head slowly and offering his hand, “would you be so kind as to marry my precious,” said Cinderella.

She would be delighted,” said, shortly, “would you be so kind as to take my first dance?”

“I would be delighted,” said Cinderella.

She could feel her cheeks beginning to blush as she accompanied the Prince to the dancefloor.

The orchestra began to play.


Cinderella in beautiful gown happy dancing with the Prince

Every eye in the ballroom was upon them—including those of her step-sisters. But, such was her transformation, they simply didn’t recognise her. Instead, they stared in envy, longing to take her place before the Prince.

But they never did.

From that moment forth, the Prince had only eyes for Cinderella. The two of them share their eyes widened.

“It is my duty to announce,” he said, formally, “that his majesty the King sends forth an invitation. To all the young ladies of this fine kingdom to attend The Royal Ball at the palace.”




INFO:aitextgen:Saving trained model pytorch_model.bin to /trained_model
Configuration saved in trained_model/generation_config.json


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model from the `trained_model` folder before generating.

`generate()` without any parameters generates a single text from the loaded model to the console.
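Reloading from the `trained_model` folder and then generating could be sketched as follows. The two helper functions are hypothetical, and the list of required files is an assumption: the training log mentions `pytorch_model.bin`, and a `config.json` is assumed to sit next to it. The `aitextgen(model_folder=..., to_gpu=True)` constructor and `generate(n=..., prompt=...)` are used here as the library documents them; the import is deferred so the file check works even where aitextgen is not installed.

```python
from pathlib import Path

# Assumption: files a saved model folder must contain before it can be reloaded.
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(model_folder="trained_model"):
    """Return the names of required files that are absent from `model_folder`."""
    folder = Path(model_folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


def reload_and_generate(prompt, model_folder="trained_model", n=1):
    """Reload a saved model onto the GPU and print `n` generated texts."""
    missing = missing_model_files(model_folder)
    if missing:
        raise FileNotFoundError(f"{model_folder} is missing: {', '.join(missing)}")
    # Imported lazily so the check above runs without the package installed.
    from aitextgen import aitextgen

    ai = aitextgen(model_folder=model_folder, to_gpu=True)
    ai.generate(n=n, prompt=prompt)
    return ai


# Example (run in the notebook after training has finished):
# ai = reload_and_generate("cindrella went to play")
```

Checking for the files first turns a half-finished or interrupted save into a clear error message instead of a failure deep inside model loading.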

In [14]:
ai.generate(prompt="cindrella went to play")

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}



cindrella went to play.

The orchestra began to play.m.

Cinderella in beautiful gown happy dancing with happy prince

Every eye in the ballroom was upon them—including those of her step-sisters. But, such was her transformation, they simply didn’t recognise her. Instead, they stared in envy, longing to take her place before the Prince.

But they never did.

From that moment forth, the Prince had only eyes for Cinderella. The two of them spent the whole evening dancing in each other’s arms and greatly enjoying each other’s company.

Many hours drifted by.

9 p.m.

10 p.m.

11 p.m.


m.

11 p.m.

Suddenly, Cinderella looked up at the clock—it was 11:59 p.m and she only had one minute before the magic spell was broken.

“Oh, I must go,” she said to the Prince.

Without delay, she darted out of the room, past the guards and through the palace. But—as she reached the hall—she left it. And—
