# Model Controller Tutorial: Training a GPT2 Language Model

> This notebook walks through an end-to-end process: preprocessing and tokenizing your text, then building language models based on the GPT architecture.

- skip_showdoc: true
- skip_exec: true

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
import os

In [None]:
# Specify the GPU (or a comma-separated list of GPUs) to use for training
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main_lm import *
from that_nlp_library.utils import seed_everything

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from transformers import AutoTokenizer, AutoConfig, AutoModelForMaskedLM, AutoModelForCausalLM
from datasets import load_dataset
import pandas as pd
import numpy as np
from transformers import DataCollatorForLanguageModeling
from tokenizers import processors
from that_nlp_library.model_lm_main import *
from that_nlp_library.utils import resize_model_embeddings



# Train a GPT2 Language Model From Scratch (with token concatenation)

This is how GPT2 was originally trained: tokenized texts are concatenated into one stream, which is then split into fixed-size blocks.

## Create a TextDataLMController object

We will reuse the data and the preprocessing steps from [this tutorial](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html).

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         metadatas='Title',
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False
                        )

Define our tokenizer for GPT2

In [None]:
_tokenizer = AutoTokenizer.from_pretrained('gpt2')

If you concatenate tokens and want your causal LM to tell sentences apart, you can add a special token that separates them, as follows:

In [None]:
_tokenizer._tokenizer.post_processor = processors.TemplateProcessing(
    single="$A " + _tokenizer.eos_token,
    special_tokens=[(_tokenizer.eos_token, _tokenizer.eos_token_id)],
)
_tokenizer.pad_token = _tokenizer.eos_token

In [None]:
_tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)

Process and tokenize our dataset

In [None]:
block_size=112
tdc.process_and_tokenize(_tokenizer,line_by_line=False,max_length=block_size)

In [None]:
tdc.main_ddict

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 12741
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 3235
    })
})
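
To make `line_by_line=False` concrete, here is a minimal pure-Python sketch of token concatenation. It is not the library's implementation: the function name `concat_and_chunk` and the choice to drop the incomplete tail are assumptions made for illustration, and the toy ids stand in for real GPT2 token ids.

```python
from itertools import chain

def concat_and_chunk(tokenized_texts, block_size, eos_id):
    """Append eos_id to every text, concatenate everything, cut into full blocks."""
    stream = list(chain.from_iterable(ids + [eos_id] for ids in tokenized_texts))
    # Assumption of this sketch: the tail that cannot fill a whole block is dropped.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Toy token ids; 0 plays the role of <|endoftext|>.
reviews = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = concat_and_chunk(reviews, block_size=4, eos_id=0)
print(blocks)
```

Every block has exactly `block_size` tokens, and a block may span more than one review, which is why the separator token matters.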

And set the data collator

In [None]:
tdc.set_data_collator(is_mlm=False)
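
For a causal LM there is no masking step: the labels are a copy of the inputs, and the model shifts them internally to predict the next token. The sketch below assumes that `set_data_collator(is_mlm=False)` behaves like `DataCollatorForLanguageModeling(mlm=False)` from `transformers`; the function `collate_causal_lm` is a hypothetical pure-Python stand-in, not the library's code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips

def collate_causal_lm(batch, pad_id):
    """Pad a batch to its longest sequence and derive labels from the inputs."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Every position holding pad_id is excluded from the loss.
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy ids; 0 plays the role of the shared pad/eos token.
out = collate_causal_lm([[5, 6, 7, 0], [8, 9]], pad_id=0)
print(out["labels"])
```

Because the pad token is set to the eos token in this tutorial, a collator that masks by token id would also exclude the eos separators from the loss, as the first row of the toy batch shows.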

## Initialize and train GPT2 Model from scratch

In [None]:
len(_tokenizer)

50257

In [None]:
_tokenizer.bos_token_id,_tokenizer.eos_token_id

(50256, 50256)

In [None]:
_config = AutoConfig.from_pretrained('gpt2',
                                     n_ctx=block_size,
                                     # explicitly keep vocab size and special-token ids in sync with the tokenizer
                                     vocab_size=len(_tokenizer),
                                     bos_token_id=_tokenizer.bos_token_id,
                                     eos_token_id=_tokenizer.eos_token_id,
                                     )
_config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 112,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 50257
}

In [None]:
_model = language_model_init(AutoModelForCausalLM,
                             config=_config,
                             cpoint_path=None, # leave this as None to get a non-pretrained model
                             seed=42
                            )

Initiate a new language model from scratch
Total parameters: 124439808
Total trainable parameters: 124439808


In [None]:
_model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
_model = resize_model_embeddings(_model,_tokenizer)

Create a model controller

In [None]:
controller = ModelLMController(_model,data_store=tdc,seed=42)

And we can start training our model

In [None]:
lr = 1e-4
bs=32
wd=0.01
epochs= 4
warmup_ratio=0.25
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
              )

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | No log | 4.697953 | 0.200376 |
| 2 | 5.551800 | 4.097337 | 0.24722 |
| 2 | 5.551800 | 3.833579 | 0.275745 |
| 3 | 3.821300 | 3.779644 | 0.282777 |


Perplexity on validation set: 43.800
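
The reported perplexity is consistent with the usual definition, the exponential of the mean cross-entropy loss. The short check below uses the final validation loss from the table; the helper name `perplexity` is ours, not part of the library.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

final_val_loss = 3.779644  # last validation loss reported above
ppl = perplexity(final_val_loss)
print(f"{ppl:.2f}")
```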


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/lm_model')

## Generate text using model

In [None]:
sentence1 = 'major problem . this is by far one of the '
sentence2 = 'flattering . this is by far one of the '

In [None]:
from transformers import pipeline

In [None]:
text_gen = pipeline("text-generation",model='./sample_weights/lm_model', config = _config, tokenizer=_tokenizer)

In [None]:
# reference: https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/text_generation#transformers.GenerationMixin.generate
preds = text_gen(sentence1, num_return_sequences=3,max_new_tokens=50,num_beams=1,do_sample=True)
for pred in preds:
    print(f">>> {pred['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


>>> major problem . this is by far one of the irl and was no extra room which i received many different colors but as other colors are not my hips, but the knit color is beautiful and it drapes really cute! i also know it's not a light weight but i'll make this because
>>> major problem . this is by far one of the a style, it has definitely super flattering and it is well made! that it runs large. i normally wear a 0 or 8 in dresses and received the medium in this one, but this fits and it looks adorable, but i really excited to hide
>>> major problem . this is by far one of the ________ida, it does not show your bust area. my top in the front portion looks like a tiny bit loose and loose -- especially about the back of the jacket. also, the pockets are a perfect size, the fit, which's stretchy


In [None]:
preds = text_gen(sentence2, num_return_sequences=3,max_new_tokens=50,num_beams=1,do_sample=True)
for pred in preds:
    print(f">>> {pred['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


>>> flattering . this is by far one of the ________ess and makes it a good cut to be. i like the pockets, and i want to find it feel that i don't feel like the fabric was not a little more flattering in. i'm keeping it because i think how it has been
>>> flattering . this is by far one of the ________h. and the fit is perfectly. i tried it on in the store and was true to size and it was well made and comfy. i could return it at an xl, but the skirt was very soft, so the sizing was
>>> flattering . this is by far one of the ________et, and the fabric is very flowy. it would be a bit more like a top for me. on me, because it is so very pretty, but very forgiving at the model, and flowy, and looks great with my jeans
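
With `do_sample=True` and `num_beams=1`, each next token is drawn at random from the model's predicted distribution instead of always taking the most likely one, which is why the three outputs differ. The sketch below illustrates one such draw on made-up logits; `sample_next_token` is a hypothetical helper, and it ignores any extra filtering (such as top-k) that the generation settings may also apply.

```python
import math
import random

def sample_next_token(logits, rng):
    """Draw one token index with probability proportional to softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
draws = [sample_next_token(toy_logits, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(toy_logits))]
print(counts)
```

Tokens with higher logits are drawn more often, but lower-scored tokens still appear, which gives sampled text its variety.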


# Finetune GPT2 Language Model (with token concatenation)

## Create a TextDataLMController object

We will reuse the data and the preprocessing steps from [this tutorial](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html).

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         metadatas='Title',
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False
                        )

Define our tokenizer for GPT2

In [None]:
_tokenizer = AutoTokenizer.from_pretrained('gpt2')

If you concatenate tokens and want your causal LM to tell sentences apart, you can add a special token that separates them, as follows:

In [None]:
_tokenizer._tokenizer.post_processor = processors.TemplateProcessing(
    single="$A " + _tokenizer.eos_token,
    special_tokens=[(_tokenizer.eos_token, _tokenizer.eos_token_id)],
)
_tokenizer.pad_token = _tokenizer.eos_token

In [None]:
_tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)

Process and tokenize our dataset

In [None]:
block_size=112
tdc.process_and_tokenize(_tokenizer,line_by_line=False,max_length=block_size)

In [None]:
tdc.main_ddict

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 12741
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 3235
    })
})

And set the data collator

In [None]:
tdc.set_data_collator(is_mlm=False)

## Initialize and train GPT2 Model

In [None]:
len(_tokenizer)

50257

In [None]:
_tokenizer.bos_token_id,_tokenizer.eos_token_id

(50256, 50256)

In [None]:
_config = AutoConfig.from_pretrained('gpt2',
                                     n_ctx=block_size,
                                     # explicitly keep vocab size and special-token ids in sync with the tokenizer
                                     vocab_size=len(_tokenizer),
                                     bos_token_id=_tokenizer.bos_token_id,
                                     eos_token_id=_tokenizer.eos_token_id,
                                     )
_config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 112,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 50257
}

In [None]:
_model = language_model_init(AutoModelForCausalLM,
                             config=_config,
                             cpoint_path='gpt2',
                             seed=42
                            )

Total parameters: 124439808
Total trainable parameters: 124439808


In [None]:
_model = resize_model_embeddings(_model,_tokenizer)

Create a model controller

In [None]:
controller = ModelLMController(_model,data_store=tdc,seed=42)

And we can start training our model

In [None]:
lr = 1e-4
bs=32
wd=0.01
epochs= 4
warmup_ratio=0.25
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
              )

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | No log | 3.113538 | 0.351026 |
| 2 | 3.303700 | 2.983198 | 0.365869 |
| 2 | 3.303700 | 2.939725 | 0.370865 |
| 3 | 2.886300 | 2.934201 | 0.37126 |


Perplexity on validation set: 18.806
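
To get a feel for what `warmup_ratio=0.25` means in the `fit` call, the sketch below computes a linear warmup followed by a linear decay. This is an assumption: it mirrors the default schedule of the Hugging Face `Trainer`, and we have not confirmed that `controller.fit` uses it. The step counts also assume a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given optimizer step: ramp up linearly, then decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

train_rows, batch_size, epochs = 12741, 32, 4  # values used in this section
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * 0.25)

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_warmup_decay(step, total_steps, warmup_steps, 1e-4))
```

Under these assumptions the learning rate climbs during the first of the four epochs, peaks at `1e-4`, and then decays to zero by the end of training.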


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/lm_model')

## Generate text using model

In [None]:
sentence1 = 'major problem . this is by far one of the '
sentence2 = 'flattering . this is by far one of the '

In [None]:
from transformers import pipeline

In [None]:
text_gen = pipeline("text-generation",model='./sample_weights/lm_model', config = _config, tokenizer=_tokenizer)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [None]:
# reference: https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/text_generation#transformers.GenerationMixin.generate
preds = text_gen(sentence1, num_return_sequences=3,max_new_tokens=50,num_beams=1,do_sample=True)
for pred in preds:
    print(f">>> {pred['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


>>> major problem . this is by far one of the erynpranka's most comfortable dress and i don't know why. it is a light and airy fabric and the cut of the shoulders is perfect. the colors are gorgeous and i get tons of compliments every time i wear it. i agree
>>> major problem . this is by far one of the iphone tops i have purchased from retailer. it was supposed to be a more fitted fit, but i was disappointed that it was a tad snug on top and tight in back. i am 5'1 " and 135 lbs and ordered the m.
>>> major problem . this is by far one of the irls that i am buying these next season. i just found the other one online and i really wanted to be able boasts about it to other customers. they had not even washed it! the blue is a very gorgeous blueish-floral color


In [None]:
preds = text_gen(sentence2, num_return_sequences=3,max_new_tokens=50,num_beams=1,do_sample=True)
for pred in preds:
    print(f">>> {pred['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


>>> flattering . this is by far one of the ichroniest tops i've received. you can dress it up or dress down. it's so light in color and soft. i'm a bit on the big side so it may show a bit if you are bustier but it's so cute
>>> flattering . this is by far one of the irls with my closet. as soon as i saw this dress i knew i needed it! i got all my usual size small ( xs ) and it is so loose and flowy that i might need any length. i just need to wear
>>> flattering . this is by far one of the irl dresses i have purchased in the past 3 years. it is great quality, i received many compliments, there's an adorable lining throughout the dress! it's unique, the print is stunning and easy to dress up or dress down. if i'm


# Finetune GPT2 Language Model (line-by-line, without token concatenation)

## Create a TextDataLMController object

We will reuse the data and the preprocessing steps from [this tutorial](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html).

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         metadatas='Title',
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False
                        )

Define our tokenizer for GPT2

In [None]:
_tokenizer = AutoTokenizer.from_pretrained('gpt2')
_tokenizer.pad_token = _tokenizer.eos_token

In [None]:
_tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)

Process and tokenize our dataset

In [None]:
block_size=112
tdc.process_and_tokenize(_tokenizer,line_by_line=True,max_length=block_size)

In [None]:
tdc.main_ddict

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 18112
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 4529
    })
})
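
With `line_by_line=True` the row count changes because each text stays in its own row instead of being merged into one token stream. The sketch below is a simplified picture, not the library's code: `line_by_line_rows` is a hypothetical helper, and truncating to `max_length` without padding at this stage is an assumption.

```python
def line_by_line_rows(tokenized_texts, max_length):
    """Keep one row per text, cutting off anything beyond max_length tokens."""
    return [ids[:max_length] for ids in tokenized_texts]

# Toy token ids standing in for three tokenized reviews.
reviews = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35, 36]]
rows = line_by_line_rows(reviews, max_length=4)
print(rows)
```

Rows now have different lengths, so the shorter ones need padding when they are batched, and no row ever mixes tokens from two reviews.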

And set the data collator

In [None]:
tdc.set_data_collator(is_mlm=False)

## Initialize and train GPT2 Model

In [None]:
len(_tokenizer)

50257

In [None]:
_tokenizer.bos_token_id,_tokenizer.eos_token_id

(50256, 50256)

In [None]:
_config = AutoConfig.from_pretrained('gpt2',
                                     n_ctx=block_size,
                                     # explicitly keep vocab size and special-token ids in sync with the tokenizer
                                     vocab_size=len(_tokenizer),
                                     bos_token_id=_tokenizer.bos_token_id,
                                     eos_token_id=_tokenizer.eos_token_id,
                                     )
_config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 112,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 50257
}

In [None]:
_model = language_model_init(AutoModelForCausalLM,
                             config=_config,
                             cpoint_path='gpt2',
                             seed=42
                            )

Total parameters: 124439808
Total trainable parameters: 124439808


In [None]:
_model = resize_model_embeddings(_model,_tokenizer)

Create a model controller

In [None]:
controller = ModelLMController(_model,data_store=tdc,seed=42)

And we can start training our model

In [None]:
lr = 1e-4
bs=32
wd=0.01
epochs= 4
warmup_ratio=0.25
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
              )

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 3.01341 | 0.24856 |
| 2 | 3.194200 | 2.878801 | 0.259095 |
| 3 | 3.194200 | 2.830053 | 0.263616 |
| 4 | 2.764700 | 2.824914 | 0.264219 |


Perplexity on validation set: 16.859


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/lm_model')

## Generate text using model

In [None]:
sentence1 = 'major problem . this is by far one of the '
sentence2 = 'flattering . this is by far one of the '

In [None]:
from transformers import pipeline

In [None]:
text_gen = pipeline("text-generation",model='./sample_weights/lm_model', config = _config, tokenizer=_tokenizer)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [None]:
# reference: https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/text_generation#transformers.GenerationMixin.generate
preds = text_gen(sentence1, num_return_sequences=3,max_new_tokens=50,num_beams=1,do_sample=True)
for pred in preds:
    print(f">>> {pred['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


>>> major problem . this is by far one of the eryn regnier clothes i have purchased lately. very thin, loose knit and not flattering on an hourglass shape. if you are a 32 d, this is a very pretty top. however, there is no way to wear it with the
>>> major problem . this is by far one of the iphone's most comfortable pants i know. it was supposed to be a more straight fit, but i had a very hard time with the stretch. the fabric feels like it might be too long for an 8 th st, but the crotch was so
>>> major problem . this is by far one of the irls that i am buying these weeks, i normally buy one size up in maeve bottoms. when i boasts about having never had a " problem " with a large bust, i am being grossly misleading. i know this would " make


In [None]:
preds = text_gen(sentence2, num_return_sequences=3,max_new_tokens=50,num_beams=1,do_sample=True)
for pred in preds:
    print(f">>> {pred['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


>>> flattering . this is by far one of the erynflovskas best pieces i have ever purchased! i bought it in both colors. the color is just beautiful. i got a size medium. the material is soft and comfortable. i really can't tell if there's " shimmer or
>>> flattering . this is by far one of the irls with my best reviews as it's comfortable and flattering without the baggy poufy feel of what other reviewers rave about. it's flattering even if you have boobs. i'm 5'2 ", and it goes just below the hips
>>> flattering . this is by far one of the iphone 5 th dresses i have bought this season. it is great quality, i received many compliments, there's an elastic waist and the fabric is thick. this dress has great movement and is easy to dress up or down. i ordered the black
