## Transformers for Text Generation, Training and Fine-Tuning
In this section we train and evaluate Transformer models for text generation.

Connect to Google Drive (if needed), set the debugging environment variables, and load the dataset from storage.

In [1]:
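# `!` runs the command in a subshell, so this does not affect the notebook kernel;
# the next cell sets the variable for the current process.
!export CUDA_LAUNCH_BLOCKING=1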

In [2]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

In [3]:
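import sys

try:
  from google.colab import drive
  drive.mount('/content/drive')
  path_to_project = '/content/drive/MyDrive/NLP_Project'
  sys.path.append(path_to_project)
  IN_COLAB = True
except ImportError:
  # not running in Colab: fall back to the current directory so later cells still find a path
  path_to_project = '.'
  IN_COLAB = False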

Mounted at /content/drive


In [4]:
import pandas as pd
import torch
import numpy as np

In [5]:
dataset_path = path_to_project + '/final_ds.csv' if IN_COLAB else './final_ds.csv'
df = pd.read_csv(dataset_path)

In [6]:
df.head(3)

Unnamed: 0,problem_description,solution_id,solution_code,problem_name,time_complexity_inferred,space_complexity_inferred
0,Xenia has a set of weights and pan scales. Eac...,0_0,__author__ = 'ratnesh.mishra'\n\nweights = map...,339_C. Xenia and Weights,O(1),O(n**2)
1,Xenia has a set of weights and pan scales. Eac...,0_2,import sys\nsys.setrecursionlimit (1000000)\n\...,339_C. Xenia and Weights,O(1),O(1)
2,Xenia has a set of weights and pan scales. Eac...,0_4,import sys\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,339_C. Xenia and Weights,O(1),O(1)


### Train T5-Base

The developers of the Text-To-Text Transfer Transformer (T5) write:

'With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.'

The model has 223M parameters.

Import libraries that are going to be used in the following subsections

In [None]:
!pip install -q transformers
!pip install datasets

In [None]:
from datasets import Dataset, DatasetDict

#### Dataset Pre-processing



For T5-Base we keep only 8000 samples from the dataset because of our limited computational resources. We capped the training (and fine-tuning) time of each model in this section at approximately 2 hours, and this sample size allowed us to stay within that constraint.

In [9]:
## only in initial part: reduce ds to allow faster testing of the code
df = df.head(8000).copy()
df.shape

(8000, 6)

Split the data into training, validation, and test sets. We hold out 10% of the dataset for testing and split the remainder 80/20 into training and validation, which gives roughly 72%/18%/10% overall.

In [10]:
from sklearn.model_selection import train_test_split

train_val, test = train_test_split(df, test_size=0.1)
train, val = train_test_split(train_val, test_size=0.2)

In [11]:
print('# train instances: ', train.shape[0])
print('# test instances:  ', test.shape[0])
print('# val instances:   ', val.shape[0])

# train instances:  5760
# test instances:   800
# val instances:    1440
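
The instance counts printed above follow directly from the two split fractions. The sketch below reproduces that arithmetic; the helper name `two_stage_split_sizes` is hypothetical, and it assumes scikit-learn's behaviour of sizing the held-out part as the ceiling of `fraction * n` when only `test_size` is given.

```python
import math

def two_stage_split_sizes(n_samples, test_size, val_size):
    """Return (train, val, test) sizes for a test split followed by a validation split."""
    n_test = math.ceil(test_size * n_samples)      # held-out test part
    n_train_val = n_samples - n_test               # what remains after the first split
    n_val = math.ceil(val_size * n_train_val)      # validation part of the remainder
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = two_stage_split_sizes(8000, test_size=0.1, val_size=0.2)
print(n_train, n_val, n_test)                                      # 5760 1440 800
print([round(100 * k / 8000) for k in (n_train, n_val, n_test)])   # [72, 18, 10]
```

Note that the split in this notebook is not seeded; passing a fixed `random_state` to `train_test_split` would make it reproducible across runs.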


We check the device

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


#### Model and Tokenizer

Retrieve the model and the tokenizer

In [17]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [18]:
model_name = 'google-t5/t5-base'

Load the tokenizer

In [19]:
tokenizer = T5Tokenizer.from_pretrained(model_name)


In [20]:
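# Note: T5 already ships with a dedicated '<pad>' token. Reusing the EOS token as padding
# means the language-modeling collator used below also masks real EOS positions in the labels.
tokenizer.pad_token = tokenizer.eos_token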

Print some useful information on the tokenizer

In [21]:
print("vocabulary size: ", tokenizer.vocab_size)

vocabulary size:  32000


In [22]:
list(tokenizer.get_vocab().items())[600:610]

[('▁big', 600),
 ('▁God', 601),
 ('▁dass', 602),
 ('im', 603),
 ('▁30', 604),
 ('▁event', 605),
 ('▁development', 606),
 ('▁form', 607),
 ('▁read', 608),
 ('▁hand', 609)]

In [23]:
text = "You have an array in input, order the elements in it in O(n) time complexity. Add a wrong wordd"
encoded_input = tokenizer._tokenize(text)
print(encoded_input)

['▁You', '▁have', '▁an', '▁array', '▁in', '▁input', ',', '▁order', '▁the', '▁elements', '▁in', '▁it', '▁in', '▁O', '(', 'n', ')', '▁time', '▁complexity', '.', '▁Add', '▁', 'a', '▁wrong', '▁word', 'd']


In [24]:
encoded_ids = tokenizer(text)['input_ids']
print(encoded_ids)

[148, 43, 46, 5590, 16, 3785, 6, 455, 8, 2479, 16, 34, 16, 411, 599, 29, 61, 97, 11641, 5, 2334, 3, 9, 1786, 1448, 26, 1]


The model is loaded and placed on the device

In [25]:
t5 = T5ForConditionalGeneration.from_pretrained(model_name, device_map=device)
t5.resize_token_embeddings(len(tokenizer))

Embedding(32100, 768)

In [26]:
print(t5)

T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

Consider now the number of parameters of the model

In [27]:
n_params = sum(param.numel() for param in t5.parameters())
n_params

222882048
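
The parameter count can be cross-checked by hand from the layer shapes printed above. The sketch below is a back-of-the-envelope calculation, not part of the original notebook: it assumes 12 encoder and 12 decoder blocks (the T5-Base configuration), a decoder block with both self- and cross-attention, one relative-attention-bias table per stack, and an output head tied to the shared embedding.

```python
d_model, d_ff, vocab = 768, 3072, 32100   # sizes visible in the printed model
n_heads, n_buckets = 12, 32               # shape of relative_attention_bias
n_blocks = 12                             # assumption: 12 encoder and 12 decoder blocks

attention = 4 * d_model * d_model         # q, k, v, o projections (no biases)
feed_forward = 2 * d_model * d_ff         # wi and wo (no biases)
layer_norm = d_model                      # T5LayerNorm holds a scale vector only

embedding = vocab * d_model               # shared by encoder, decoder and output head
encoder_block = attention + feed_forward + 2 * layer_norm
decoder_block = 2 * attention + feed_forward + 3 * layer_norm   # self- and cross-attention
relative_bias = n_buckets * n_heads       # stored once per stack, in its first block

encoder = n_blocks * encoder_block + relative_bias + layer_norm  # plus the final layer norm
decoder = n_blocks * decoder_block + relative_bias + layer_norm

total = embedding + encoder + decoder
print(f"{total:,}")   # 222,882,048
```

Under these assumptions the total matches the value returned by `t5.parameters()`, which is where the rounded figure of 223M comes from.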

In [28]:
for name, param in t5.named_parameters():
    print(f"Parameter name: {name}")
    print(f"Parameter shape: {param.size()}")
    print(f"Is trainable: {param.requires_grad}")
    print()

Parameter name: shared.weight
Parameter shape: torch.Size([32100, 768])
Is trainable: True

Parameter name: encoder.block.0.layer.0.SelfAttention.q.weight
Parameter shape: torch.Size([768, 768])
Is trainable: True

Parameter name: encoder.block.0.layer.0.SelfAttention.k.weight
Parameter shape: torch.Size([768, 768])
Is trainable: True

Parameter name: encoder.block.0.layer.0.SelfAttention.v.weight
Parameter shape: torch.Size([768, 768])
Is trainable: True

Parameter name: encoder.block.0.layer.0.SelfAttention.o.weight
Parameter shape: torch.Size([768, 768])
Is trainable: True

Parameter name: encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight
Parameter shape: torch.Size([32, 12])
Is trainable: True

Parameter name: encoder.block.0.layer.0.layer_norm.weight
Parameter shape: torch.Size([768])
Is trainable: True

Parameter name: encoder.block.0.layer.1.DenseReluDense.wi.weight
Parameter shape: torch.Size([3072, 768])
Is trainable: True

Parameter name: encoder.block.0.la

#### Training


Now that we have both the tokenizer and the model, we can tokenize the dataset and train the model.

We build a single string for each conversation (problem description and solution), in which the user input and the assistant response are each terminated by an EOS token.

In [29]:
def apply_eos_token(idx, df, eos_token):
    # build the user input including the problem description, the time complexity and the space complexity
    chat_string = 'User:' + df.loc[idx, 'problem_description'] + ' Time complexity: ' + df.loc[idx, 'time_complexity_inferred'] + '; Space complexity: ' + df.loc[idx, 'space_complexity_inferred']
    # now add the eos token and the response from the assistant, the code solution for the problem
    chat_string = chat_string + eos_token + 'Assistant: ' + df.loc[idx, 'solution_code'] + eos_token
    return chat_string

In [30]:
train_str = [apply_eos_token(idx, train, tokenizer.eos_token) for idx in train.index]
test_str = [apply_eos_token(idx, test, tokenizer.eos_token) for idx in test.index]
val_str = [apply_eos_token(idx, val, tokenizer.eos_token) for idx in val.index]

In [31]:
train_str[0]

'User:It is known that the area of a regular dodecagon inscribed in a circle of radius a is 3a^2.\n\nGiven an integer r, find the area of a regular dodecagon inscribed in a circle of radius r.\n\nConstraints\n\n* 1 \\leq r \\leq 100\n* r is an integer.\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nr\n\n\nOutput\n\nPrint an integer representing the area of the regular dodecagon.\n\nExamples\n\nInput\n\n4\n\n\nOutput\n\n48\n\n\nInput\n\n15\n\n\nOutput\n\n675\n\n\nInput\n\n80\n\n\nOutput\n\n19200 Time complexity: O(1); Space complexity: O(1)</s>Assistant: N = int(input())\n\nprint(3*(N**2))</s>'

The EOS-delimited strings are now wrapped in Dataset objects and collected in a DatasetDict

In [None]:
train_data = Dataset.from_dict({'chat': train_str})
test_data = Dataset.from_dict({'chat': test_str})
val_data = Dataset.from_dict({'chat': val_str})

In [None]:
data = DatasetDict()
data['train'] = train_data
data['val'] = val_data
data['test'] = test_data

The data is tokenized: every sequence is truncated or padded to 512 tokens, and the labels are a copy of the input ids

In [None]:
def tokenize_function(examples):
    input_encodings = tokenizer(examples["chat"],
        truncation=True,
        padding="max_length",
        max_length=512)
    sample = {
        'input_ids': input_encodings.input_ids,
        'attention_mask': input_encodings.attention_mask,
        'labels': input_encodings.input_ids.copy()
    }
    return sample

tokenized_data = data.map(tokenize_function, batched=True)

Map:   0%|          | 0/5760 [00:00<?, ? examples/s]



Map:   0%|          | 0/1440 [00:00<?, ? examples/s]

Map:   0%|          | 0/800 [00:00<?, ? examples/s]
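
The tokenization above feeds the whole conversation to the model as both input and label. An encoder-decoder model such as T5 is more commonly fine-tuned with the prompt as encoder input and the solution as the label sequence. The sketch below shows one way the chat strings could be split for that purpose; it is not what this notebook does, the helper names `split_chat` and `mask_padding` are hypothetical, and the example conversation is made up.

```python
EOS = "</s>"          # T5's end-of-sequence token, as seen in the strings above
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default

def split_chat(chat, eos_token=EOS):
    """Split 'User: ...</s>Assistant: ...</s>' into (source, target) strings."""
    source, target = chat.split(eos_token)[:2]
    return source.strip(), target.strip()

def mask_padding(label_ids, pad_id):
    """Replace padding ids so that padded positions do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]

chat = ("User: Print 2 + 2. Time complexity: O(1); Space complexity: O(1)</s>"
        "Assistant: print(2 + 2)</s>")
source, target = split_chat(chat)
print(source)   # User: Print 2 + 2. Time complexity: O(1); Space complexity: O(1)
print(target)   # Assistant: print(2 + 2)

# With a Hugging Face tokenizer, the two parts could then be encoded separately, e.g.
#   enc = tokenizer(source, text_target=target, truncation=True, max_length=512)
# so that enc["input_ids"] holds the prompt and enc["labels"] holds the solution.
print(mask_padding([71, 1, 0, 0], pad_id=0))   # [71, 1, -100, -100]
```

With this layout the loss is computed only on the solution tokens, so the model is rewarded for producing the code rather than for repeating the prompt.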

In [None]:
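# batch the examples; with mlm=False the collator rebuilds the labels from input_ids
# and sets the positions equal to the pad token id to -100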
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Now the training starts

In [None]:
from transformers import TrainingArguments, Trainer
import os

In [None]:
os.environ["WANDB_DISABLED"] = "true"

In [None]:
training_args = TrainingArguments(
    "t5_trainer",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    save_steps=500,
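    eval_steps=500,           # only used when an evaluation strategy of "steps" is also set
    learning_rate=1e-4,       # t5 needs a higher lr than other models
    lr_scheduler_type="linear",
    bf16=True,
    report_to="none",         # the string "none" disables the logging integrations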
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
trainer = Trainer(
    model= t5,
    args= training_args,
    train_dataset= tokenized_data['train'],
    eval_dataset= tokenized_data['val'],
    data_collator= data_collator
)

Measure the time taken by the model to finish training

In [None]:
import time
begin = time.time()

In [None]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
500,0.1532
1000,0.0042


TrainOutput(global_step=1080, training_loss=0.07308270597347506, metrics={'train_runtime': 7709.3416, 'train_samples_per_second': 2.241, 'train_steps_per_second': 0.14, 'total_flos': 1.05227923488768e+16, 'train_loss': 0.07308270597347506, 'epoch': 3.0})

In [None]:
end = time.time()
print("Training time: ", end - begin)

Training time:  7709.833685159683
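
The step count and throughput reported by the trainer can be checked against the training arguments. The sketch below redoes that arithmetic; it assumes a single GPU, which is not stated explicitly in the notebook.

```python
import math

n_train = 5760                 # training instances, from the split above
per_device_batch = 2           # per_device_train_batch_size
grad_accum = 8                 # gradient_accumulation_steps
epochs = 3                     # num_train_epochs
n_devices = 1                  # assumption: a single GPU

effective_batch = per_device_batch * grad_accum * n_devices
batches_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs

train_runtime = 7709.3416      # seconds, from TrainOutput
print(effective_batch)                              # 16
print(total_updates)                                # 1080
print(round(n_train * epochs / train_runtime, 3))   # 2.241
print(round(train_runtime / 3600, 2))               # 2.14
```

The 1080 optimizer updates and 2.241 samples per second agree with the `TrainOutput` above, and the run took a little over the 2-hour budget set for this section.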


In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(t5.named_parameters())

print('The T5 model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

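# T5 has one shared embedding matrix (params[0]); params[1] already belongs to the first encoder block
for p in params[0:2]: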
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[2:14]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-2:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The T5 model has 257 different named parameters.

==== Embedding Layer ====

shared.weight                                           (32100, 768)
encoder.block.0.layer.0.SelfAttention.q.weight            (768, 768)

==== First Transformer ====

encoder.block.0.layer.0.SelfAttention.k.weight            (768, 768)
encoder.block.0.layer.0.SelfAttention.v.weight            (768, 768)
encoder.block.0.layer.0.SelfAttention.o.weight            (768, 768)
encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight     (32, 12)
encoder.block.0.layer.0.layer_norm.weight                     (768,)
encoder.block.0.layer.1.DenseReluDense.wi.weight         (3072, 768)
encoder.block.0.layer.1.DenseReluDense.wo.weight         (768, 3072)
encoder.block.0.layer.1.layer_norm.weight                     (768,)
encoder.block.1.layer.0.SelfAttention.q.weight            (768, 768)
encoder.block.1.layer.0.SelfAttention.k.weight            (768, 768)
encoder.block.1.layer.0.SelfAttention.v.weight      

In [None]:
t5.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

Save model's parameters

In [None]:
from datetime import datetime
t5_training_path = path_to_project + '/Transformer-trained-models/' + f"t5_train_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
tokenizer.save_pretrained(t5_training_path)
t5.save_pretrained(t5_training_path)
print(f"Checkpoint saved at: \'{t5_training_path}\'")

Checkpoint saved at: '/content/drive/MyDrive/NLP_Project/Transformer-trained-models/t5_train_2025_05_18_12_35_46'


#### Testing

Retrieve the trained model

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

#t5_training_path = '/content/drive/MyDrive/NLP_Project/Transformer-trained-models/t5_train_2025_05_18_12_35_46'
tokenizer = T5Tokenizer.from_pretrained(t5_training_path)
t5 = T5ForConditionalGeneration.from_pretrained(t5_training_path, device_map=device)

To test the model, we randomly pick one conversation from the test data, feed its user part to the model, and inspect the response

In [None]:
import random

random.seed(43)

idx = random.choice(range(len(test_data))) # select a random conversation
print(idx)
dialogue = test_data['chat'][idx]
print(dialogue)

39
User: Vlad likes to eat in cafes very much. During his life, he has visited cafes n times. Unfortunately, Vlad started to feel that his last visits are not any different from each other. To fix that Vlad had a small research.

First of all, Vlad assigned individual indices to all cafes. Then, he wrote down indices of cafes he visited in a row, in order of visiting them. Now, Vlad wants to find such a cafe that his last visit to that cafe was before his last visits to every other cafe. In other words, he wants to find such a cafe that he hasn't been there for as long as possible. Help Vlad to find that cafe.

Input

In first line there is one integer n (1 ≤ n ≤ 2·105) — number of cafes indices written by Vlad.

In second line, n numbers a1, a2, ..., an (0 ≤ ai ≤ 2·105) are written — indices of cafes in order of being visited by Vlad. Vlad could visit some cafes more than once. Note that in numeration, some indices could be omitted.

Output

Print one integer — index of the cafe that 

In [None]:
# split the conversation into the 'input part' and the 'output part';
# the last element of the split is the empty string after the final eos token
test_input, test_output, _ = dialogue.split('</s>')

# add eos token at the end of test_input and test_output
test_input = test_input + tokenizer.eos_token + ' Assistant: '
test_output = test_output + tokenizer.eos_token

print('User input: \n', test_input)
print('User input length: ', len(test_input))
print('####')
print('Correct solution output: \n', test_output)

User input: 
 User: Vlad likes to eat in cafes very much. During his life, he has visited cafes n times. Unfortunately, Vlad started to feel that his last visits are not any different from each other. To fix that Vlad had a small research.

First of all, Vlad assigned individual indices to all cafes. Then, he wrote down indices of cafes he visited in a row, in order of visiting them. Now, Vlad wants to find such a cafe that his last visit to that cafe was before his last visits to every other cafe. In other words, he wants to find such a cafe that he hasn't been there for as long as possible. Help Vlad to find that cafe.

Input

In first line there is one integer n (1 ≤ n ≤ 2·105) — number of cafes indices written by Vlad.

In second line, n numbers a1, a2, ..., an (0 ≤ ai ≤ 2·105) are written — indices of cafes in order of being visited by Vlad. Vlad could visit some cafes more than once. Note that in numeration, some indices could be omitted.

Output

Print one integer — index of the

We generate the response from the model

In [None]:
# truncation=True makes the max_length limit explicit and avoids the tokenizer warning
input_ids = tokenizer(test_input, return_tensors="pt", truncation=True, max_length=2000).to(device)
output = t5.generate(**input_ids, max_new_tokens=800)
gen_text = tokenizer.decode(output[0])

print(gen_text)
print(gen_text.split('Assistant:')[-1])


<pad> User: Vlad likes to eat in cafes very much. During his life, he has visited cafes n times. Unfortunately, Vlad started to feel that his last visits are not any different from each other. To fix that Vlad had a small research. First of all, Vlad assigned individual indices to all cafes. Then, he wrote down indices of cafes he visited in a row, in order of visiting them. Now, Vlad wants to find such a cafe that his last visit to that cafe was before his last visits to every other cafe. In other words, he wants to find such a cafe that he hasn't been there for as long as possible. Help Vlad to find that cafe. Input In first line there is one integer n (1 <unk> n <unk> 2<unk> 105) — number of cafes indices written by Vlad. In second line, n numbers a1, a2, ..., an (0 <unk> ai <unk> 2<unk> 105) are written — indices of cafes in order of being visited by Vlad. Vlad could visit some cafes more than once. Note that in numeration, some indices could be omitted. Output Print one integer — 

The model does not perform well: it struggles to produce code and to maintain the conversation format given to it. In the portion of the output shown above it reproduces the problem statement, and more generally it tends to describe the solution process without explicitly generating code. One likely contributing factor is the training setup, in which the labels are a copy of the input ids, so the objective rewards reproducing the input rather than mapping a problem to its solution.

We expected this kind of result: T5 performs very well on tasks such as translation and summarization, but it was not originally trained as a chatbot that produces code. Moreover, given our limited resources, 8000 samples are too few to train such a large model (223M parameters) effectively.

#### Evaluation Metrics

We evaluate the model with three metrics: perplexity, BLEU, and F1.

**Perplexity**

We score only the first three samples of the test set because of limits on the RAM available to us

In [None]:
# get inputs from test_data
test_input = [dialogue.split('</s>')[0] + tokenizer.eos_token + 'Assistant: ' for dialogue in test_data['chat'][:3]]

# get outputs from test data
test_output = [dialogue.split('</s>')[1] + tokenizer.eos_token for dialogue in test_data['chat'][:3]]

In [None]:
!pip install lmppl

In [None]:
import lmppl

scorer = lmppl.EncoderDecoderLM(t5_training_path)

ppl = scorer.get_perplexity(input_texts= test_input, output_texts=test_output)
print(list( ppl))
print(f"average perplexity: {sum(ppl)/len(ppl)}")

[23387.28944204467, 1260.3922689555861, 22569.483405842497]
average perplexity: 15739.055038947583
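
For reference, perplexity is conventionally defined as the exponential of the average negative log-likelihood of the target tokens, so lower is better and a value of 1 would mean the model assigns probability 1 to every target token. The sketch below illustrates that definition with a hypothetical helper; whether `lmppl` normalizes in exactly this way is an assumption.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log probability of the target tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that gives every target token probability 0.5 has a perplexity of about 2.
print(perplexity([math.log(0.5)] * 4))

# Averaging the three per-sample values reported above reproduces the printed mean.
scores = [23387.28944204467, 1260.3922689555861, 22569.483405842497]
print(sum(scores) / len(scores))
```

Values in the thousands, as obtained here, mean that the model finds the reference solutions almost as surprising as a uniform guess over a large part of the vocabulary.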





In [None]:
print(f"user input: \n{test_input[ppl.index(min(ppl))]}")
print('#################')
print(f"reference solution: \n{test_output[ppl.index(min(ppl))]}")

user input: 
User: Gleb ordered pizza home. When the courier delivered the pizza, he was very upset, because several pieces of sausage lay on the crust, and he does not really like the crust.

The pizza is a circle of radius r and center at the origin. Pizza consists of the main part — circle of radius r - d with center at the origin, and crust around the main part of the width d. Pieces of sausage are also circles. The radius of the i -th piece of the sausage is ri, and the center is given as a pair (xi, yi).

Gleb asks you to help determine the number of pieces of sausage caught on the crust. A piece of sausage got on the crust, if it completely lies on the crust.

Input

First string contains two integer numbers r and d (0 ≤ d < r ≤ 500) — the radius of pizza and the width of crust.

Next line contains one integer number n — the number of pieces of sausage (1 ≤ n ≤ 105).

Each of next n lines contains three integer numbers xi, yi and ri ( - 500 ≤ xi, yi ≤ 500, 0 ≤ ri ≤ 500), where x

**BLEU**

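BLEU is a precision-oriented metric: it counts the n-grams of the generated text that also appear in the reference. Here it is computed on the first test sample only.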

In [None]:
!pip -q install parlai

In [None]:
from parlai.core.metrics import BleuMetric

input_ids = tokenizer(test_input[0], return_tensors="pt", truncation=True, max_length=2000).to(device)
output = t5.generate(**input_ids, max_new_tokens=800)
gen_text = tokenizer.decode(output[0])

bleu = BleuMetric.compute(gen_text, [test_output[0]])
print(f"BLEU: {bleu}")

BLEU: 3.804e-09


The BLEU score is very low, indicating almost no n-gram overlap between the generated text and the reference solution of the test sample.
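
To make the n-gram counting concrete, the sketch below computes the clipped ("modified") n-gram precision that BLEU is built on, using whitespace tokens. It is a simplified illustration with hypothetical helper names: full BLEU combines the precisions for several n-gram orders and applies a brevity penalty, and ParlAI's `BleuMetric` may differ in tokenization and smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    if not cand_counts:
        return 0.0
    # each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Repeating a matching word does not help: only one 'the' is credited.
print(modified_precision("the the the", "the cat", 1))
# One of the three candidate bigrams appears in the reference.
print(modified_precision("print ( x )", "print ( y )", 2))
```

Because the orders are combined multiplicatively, a candidate that shares almost no longer n-grams with the reference ends up with a score close to zero, as observed above.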

**F1 Score**

In [None]:
from parlai.core.metrics import F1Metric

f1_score = F1Metric.compute(gen_text, [test_output[0]])
print(f"F1: {f1_score}")

F1: 0.03226
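
The F1 score used here is a token-overlap measure: the harmonic mean of precision (share of generated tokens found in the reference) and recall (share of reference tokens found in the generation). The sketch below is a simplified version with a hypothetical helper name; it lowercases and splits on whitespace only, whereas ParlAI's `F1Metric` applies its own text normalization, so the numbers are not directly comparable.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall (bag-of-tokens overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # multiset intersection: a token counts as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# All three predicted tokens occur in the four-token reference:
# precision 1.0, recall 0.75, F1 about 0.857.
print(token_f1("n = int(input())", "N = int(input()) print(3*(N**2))"))
```

An F1 of about 0.03, as measured above, therefore means that only a small fraction of tokens is shared between the generated text and the reference solution.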
