### Fine-tuning gpt2-CodeParrot

This model uses the base GPT-2 architecture with 124M parameters. It was trained on the huggingface-course/codeparrot-ds-valid dataset, a small subset of the CodeParrot corpus of Python code collected from GitHub.

Connect to Google Drive and load the dataset

In [None]:
try:
  from google.colab import drive
  drive.mount('/content/drive')
  import sys
  path_to_project = '/content/drive/MyDrive/NLP_Project'
  sys.path.append(path_to_project)
  IN_COLAB = True
except:
  IN_COLAB = False

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
import pandas as pd
import torch
import numpy as np

In [10]:
dataset_path = path_to_project + '/final_ds.csv' if IN_COLAB else '/final_ds.csv'
df = pd.read_csv(dataset_path)

Import libraries that are going to be used in the following subsections

In [None]:
!pip install -q transformers
!pip install datasets

In [None]:
from datasets import Dataset, DatasetDict

#### Dataset Pre-processing



Considering the small size of the model and the limited data used during its training, we decided to reduce the number of samples used for fine-tuning, to avoid overwriting parameters unnecessarily and degrading the model's performance.

In [None]:
df = df.head(5000).copy()
df.shape

Split the data into training, validation, and test sets. We use 10% of the dataset for testing and split the remaining 90% into training and validation sets (80% / 20%).

In [15]:
from sklearn.model_selection import train_test_split

train_val, test = train_test_split(df, test_size=0.1)
train, val = train_test_split(train_val, test_size=0.2)

In [16]:
print('# train instances: ', train.shape[0])
print('# test instances:  ', test.shape[0])
print('# val instances:   ', val.shape[0])

# train instances:  3600
# test instances:   500
# val instances:    900
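
The sizes printed above follow from applying the two splits in sequence: 10% of the 5000 rows go to the test set, and 20% of the remaining 4500 rows go to validation. A minimal sketch of that arithmetic, assuming plain rounding (the helper `split_sizes` is hypothetical, and scikit-learn's own rounding may differ when the sizes do not divide evenly):

```python
def split_sizes(n_rows, test_frac, val_frac):
    """Return (train, val, test) sizes for two successive splits."""
    n_test = round(n_rows * test_frac)       # first split: hold out the test set
    n_train_val = n_rows - n_test
    n_val = round(n_train_val * val_frac)    # second split: validation from the rest
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


sizes = split_sizes(5000, test_frac=0.1, val_frac=0.2)
print(sizes)
```

Overall this corresponds to a 72% / 18% / 10% split of the 5000 rows.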


Now we check the device

In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


#### Model and Tokenizer

Retrieve the model and the tokenizer

In [22]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [23]:
model_name = 'kailasps/GPT2-codeparrot'

Load the tokenizer

In [24]:
tokenizer_gpt2 = AutoTokenizer.from_pretrained(model_name)


In [25]:
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token

In [26]:
print("vocabulary size: ", tokenizer_gpt2.vocab_size)

vocabulary size:  50000


In [27]:
list(tokenizer_gpt2.get_vocab().items())[600:610]

[('ĠSources', 46176),
 ('Ġcac', 33927),
 ('Ġsegm', 31074),
 ('electric', 30949),
 ('Ġuntagged', 43874),
 ('ellipse', 22936),
 ('ĠSubscription', 26798),
 ('Ġlooping', 24598),
 ('`', 64),
 ('HierarchySession', 26069)]

In [28]:
text = "You have an array in input, order the elements in it in O(n) time complexity. Add a wrong wordd"
encoded_input = tokenizer_gpt2.tokenize(text)
print(encoded_input)

['You', 'Ġhave', 'Ġan', 'Ġarray', 'Ġin', 'Ġinput', ',', 'Ġorder', 'Ġthe', 'Ġelements', 'Ġin', 'Ġit', 'Ġin', 'ĠO', '(', 'n', ')', 'Ġtime', 'Ġcomplexity', '.', 'ĠAdd', 'Ġa', 'Ġwrong', 'Ġword', 'd']
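
In GPT-2 style byte-level BPE vocabularies, the `Ġ` prefix marks a token that begins with a space. As a rough illustration (plain string handling, not the tokenizer's own decoder, and only covering the space marker), the token list printed above can be mapped back to the input text. It also shows that the misspelled `wordd` was split into `Ġword` and `d`:

```python
tokens = ['You', 'Ġhave', 'Ġan', 'Ġarray', 'Ġin', 'Ġinput', ',', 'Ġorder',
          'Ġthe', 'Ġelements', 'Ġin', 'Ġit', 'Ġin', 'ĠO', '(', 'n', ')',
          'Ġtime', 'Ġcomplexity', '.', 'ĠAdd', 'Ġa', 'Ġwrong', 'Ġword', 'd']


def detokenize(tokens):
    """Join tokens, turning the leading-space marker back into a space."""
    return "".join(tok.replace("Ġ", " ") for tok in tokens)


restored = detokenize(tokens)
print(restored)
```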


In [29]:
encoded_ids = tokenizer_gpt2(text)['input_ids']
print(encoded_ids)

[6147, 1054, 309, 960, 253, 935, 12, 1444, 256, 2347, 253, 577, 253, 690, 8, 78, 9, 626, 16989, 14, 1822, 231, 6401, 1975, 68]


The model is loaded and placed on the device


In [31]:
gpt2 = AutoModelForCausalLM.from_pretrained(model_name, device_map=device)
gpt2.resize_token_embeddings(len(tokenizer_gpt2))

Embedding(50000, 768)

In [32]:
print(gpt2)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50000, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50000, bias=False)
)


Consider now the number of parameters of the model

In [33]:
n_params = sum(param.numel() for param in gpt2.parameters())
n_params

124242432
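
The total can be reproduced from the layer shapes in the printed architecture. The sketch below is plain arithmetic over those shapes; it assumes that `lm_head` shares its weight matrix with `wte` (weight tying, the GPT-2 default), so that matrix is counted only once.

```python
vocab, n_pos, d, n_layers = 50000, 1024, 768, 12

embeddings = vocab * d + n_pos * d               # wte + wpe
layer_norm = 2 * d                               # weight + bias
attention = (d * 3 * d + 3 * d) + (d * d + d)    # c_attn + c_proj (weights + biases)
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # c_fc + c_proj (weights + biases)
block = 2 * layer_norm + attention + mlp         # ln_1, ln_2, attention, MLP

total = embeddings + n_layers * block + layer_norm   # last term is ln_f
print(block, total)
```

Roughly 31% of the parameters (38.4M of 124.2M) sit in the token embedding matrix alone.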

In [34]:
for name, param in gpt2.named_parameters():
    print(f"Parameter name: {name}")
    print(f"Parameter shape: {param.size()}")
    print(f"Is trainable: {param.requires_grad}")
    print()

Parameter name: transformer.wte.weight
Parameter shape: torch.Size([50000, 768])
Is trainable: True

Parameter name: transformer.wpe.weight
Parameter shape: torch.Size([1024, 768])
Is trainable: True

Parameter name: transformer.h.0.ln_1.weight
Parameter shape: torch.Size([768])
Is trainable: True

Parameter name: transformer.h.0.ln_1.bias
Parameter shape: torch.Size([768])
Is trainable: True

Parameter name: transformer.h.0.attn.c_attn.weight
Parameter shape: torch.Size([768, 2304])
Is trainable: True

Parameter name: transformer.h.0.attn.c_attn.bias
Parameter shape: torch.Size([2304])
Is trainable: True

Parameter name: transformer.h.0.attn.c_proj.weight
Parameter shape: torch.Size([768, 768])
Is trainable: True

Parameter name: transformer.h.0.attn.c_proj.bias
Parameter shape: torch.Size([768])
Is trainable: True

Parameter name: transformer.h.0.ln_2.weight
Parameter shape: torch.Size([768])
Is trainable: True

Parameter name: transformer.h.0.ln_2.bias
Parameter shape: torch.Size([7

#### Training


Now that we have both Tokenizer and Model we can tokenize the dataset and train the model

In [None]:
def apply_eos_token(idx, df, eos_token):
    # build the user input including the problem description, the time complexity and the space complexity
    chat_string = 'User:' + df.loc[idx, 'problem_description'] + ' Time complexity: ' + df.loc[idx, 'time_complexity_inferred'] + '; Space complexity: ' + df.loc[idx, 'space_complexity_inferred']
    # now add the eos token and the response from the assistant, the code solution for the problem
    chat_string = chat_string + eos_token + 'Assistant: ' + df.loc[idx, 'solution_code'] + eos_token
    return chat_string
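
To make the resulting format concrete, the sketch below builds one chat string from a made-up row and then splits it back into prompt and answer. The row contents are hypothetical, and the EOS string is assumed to be `<|endoftext|>`, which is the marker the later parsing cells split on.

```python
EOS = "<|endoftext|>"  # assumed EOS string

# Hypothetical row; the real values come from the dataframe columns.
row = {
    "problem_description": "Sum two integers.",
    "time_complexity_inferred": "O(1)",
    "space_complexity_inferred": "O(1)",
    "solution_code": "print(sum(map(int, input().split())))",
}

chat = (
    "User:" + row["problem_description"]
    + " Time complexity: " + row["time_complexity_inferred"]
    + "; Space complexity: " + row["space_complexity_inferred"]
    + EOS + "Assistant: " + row["solution_code"] + EOS
)

# Splitting on EOS yields the prompt, the answer, and an empty tail.
prompt, answer, tail = chat.split(EOS)
print(prompt)
print(answer)
```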

In [None]:
train_str = [apply_eos_token(idx, train, tokenizer_gpt2.eos_token) for idx in train.index]
test_str = [apply_eos_token(idx, test, tokenizer_gpt2.eos_token) for idx in test.index]
val_str = [apply_eos_token(idx, val, tokenizer_gpt2.eos_token) for idx in val.index]

In [37]:
train_str[0]

'User: Manao has invented a new mathematical term — a beautiful set of points. He calls a set of points on a plane beautiful if it meets the following conditions:\n\n  1. The coordinates of each point in the set are integers. \n  2. For any two points from the set, the distance between them is a non-integer. \n\n\n\nConsider all points (x, y) which satisfy the inequations: 0 ≤ x ≤ n; 0 ≤ y ≤ m; x + y > 0. Choose their subset of maximum size such that it is also a beautiful set of points.\n\nInput\n\nThe single line contains two space-separated integers n and m (1 ≤ n, m ≤ 100).\n\nOutput\n\nIn the first line print a single integer — the size k of the found beautiful set. In each of the next k lines print a pair of space-separated integers — the x- and y- coordinates, respectively, of a point from the set.\n\nIf there are several optimal solutions, you may print any of them.\n\nExamples\n\nInput\n\n2 2\n\n\nOutput\n\n3\n0 1\n1 2\n2 0\n\n\nInput\n\n4 3\n\n\nOutput\n\n4\n0 3\n2 1\n3 0\n4 

The strings with the EOS token applied are now wrapped in `Dataset` objects and grouped into a `DatasetDict`

In [38]:
train_data = Dataset.from_dict({'chat': train_str})
test_data = Dataset.from_dict({'chat': test_str})
val_data = Dataset.from_dict({'chat': val_str})

In [39]:
data = DatasetDict()
data['train'] = train_data
data['val'] = val_data
data['test'] = test_data

Tokenize the data

In [40]:
def tokenize_function(examples):
    input_encodings = tokenizer_gpt2(examples["chat"],
        truncation=True,
        padding="max_length",
        max_length=512)
    sample = {
        'input_ids': input_encodings.input_ids,
        'attention_mask': input_encodings.attention_mask,
        'labels': input_encodings.input_ids.copy()
    }
    return sample

tokenized_data = data.map(tokenize_function, batched=True)

In [41]:
# get all sequences in same batch
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer_gpt2, mlm=False)
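
With `mlm=False`, the collator derives the labels for causal language modelling from `input_ids` and sets padding positions to `-100`, the index that the loss ignores. Because `pad_token` was set to `eos_token` earlier, genuine EOS tokens end up masked as well. A plain-Python sketch of that behaviour with made-up token ids (`EOS_ID = 0` and both helper functions are assumptions for illustration, not values read from the tokenizer):

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss
EOS_ID = 0           # hypothetical id; padding and EOS share it in this setup


def pad_to_length(ids, max_length, pad_id):
    """Truncate or right-pad a list of token ids to exactly max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


def causal_lm_labels(input_ids, pad_id):
    """Copy input_ids, replacing every pad_id position with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


# prompt tokens, EOS, answer tokens, EOS, then padding up to max_length
input_ids = pad_to_length([11, 12, EOS_ID, 21, 22, EOS_ID], max_length=8, pad_id=EOS_ID)
labels = causal_lm_labels(input_ids, pad_id=EOS_ID)
print(input_ids)
print(labels)
```

In this sketch the two real EOS positions are ignored by the loss just like the padding, so the model gets no direct training signal for emitting EOS at the end of a solution.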

Before training starts, we measure how the pre-trained model behaves on the test data

##### Calculate Perplexity before fine-tuning.

Get a random conversation from the test set

In [42]:
import random

random.seed(43)

idx = random.choice(range(len(test_data))) # select a random conversation
print(idx)
dialogue = test_data['chat'][idx]
print(dialogue)

19
User: Bob is playing with 6-sided dice. A net of such standard cube is shown below.

<image>

He has an unlimited supply of these dice and wants to build a tower by stacking multiple dice on top of each other, while choosing the orientation of each dice. Then he counts the number of visible pips on the faces of the dice.

For example, the number of visible pips on the tower below is 29 — the number visible on the top is 1, from the south 5 and 3, from the west 4 and 2, from the north 2 and 4 and from the east 3 and 5.

<image>

The one at the bottom and the two sixes by which the dice are touching are not visible, so they are not counted towards total.

Bob also has t favourite integers x_i, and for every such integer his goal is to build such a tower that the number of visible pips is exactly x_i. For each of Bob's favourite integers determine whether it is possible to build a tower that has exactly that many visible pips.

Input

The first line contains a single integer t (1 ≤ t ≤

In [43]:
# now take only the 'input part' and the 'output part'
# parse string
test_input, test_output, end = dialogue.split('<|endoftext|>')

test_input = test_input + tokenizer_gpt2.eos_token + ' Assistant: '
test_output = test_output + tokenizer_gpt2.eos_token

print('User input: ', test_input)
print('####')
print('Correct solution output: ', test_output)

User input:  User: Bob is playing with 6-sided dice. A net of such standard cube is shown below.

<image>

He has an unlimited supply of these dice and wants to build a tower by stacking multiple dice on top of each other, while choosing the orientation of each dice. Then he counts the number of visible pips on the faces of the dice.

For example, the number of visible pips on the tower below is 29 — the number visible on the top is 1, from the south 5 and 3, from the west 4 and 2, from the north 2 and 4 and from the east 3 and 5.

<image>

The one at the bottom and the two sixes by which the dice are touching are not visible, so they are not counted towards total.

Bob also has t favourite integers x_i, and for every such integer his goal is to build such a tower that the number of visible pips is exactly x_i. For each of Bob's favourite integers determine whether it is possible to build a tower that has exactly that many visible pips.

Input

The first line contains a single integer 

In [44]:
!pip install -q transformers evaluate  # `pipeline` is part of transformers, not a separate package
from transformers import pipeline
import evaluate


In [45]:
pipe = pipeline("text-generation", model= gpt2, tokenizer=tokenizer_gpt2)

Device set to use cuda


In [46]:
gen_output = pipe(test_input, max_new_tokens=256)
print("Generated text:\n", gen_output[0]["generated_text"])

Generated text:
 User: Bob is playing with 6-sided dice. A net of such standard cube is shown below.

<image>

He has an unlimited supply of these dice and wants to build a tower by stacking multiple dice on top of each other, while choosing the orientation of each dice. Then he counts the number of visible pips on the faces of the dice.

For example, the number of visible pips on the tower below is 29 — the number visible on the top is 1, from the south 5 and 3, from the west 4 and 2, from the north 2 and 4 and from the east 3 and 5.

<image>

The one at the bottom and the two sixes by which the dice are touching are not visible, so they are not counted towards total.

Bob also has t favourite integers x_i, and for every such integer his goal is to build such a tower that the number of visible pips is exactly x_i. For each of Bob's favourite integers determine whether it is possible to build a tower that has exactly that many visible pips.

Input

The first line contains a single inte

Before fine-tuning, the generated text does not include any solution code.

**Perplexity**

In [47]:
# get inputs from test_data
test_input = [dialogue.split('<|endoftext|>')[0] + tokenizer_gpt2.eos_token for dialogue in test_data['chat'][:3]]

# get outputs from test data
test_output = [dialogue.split('<|endoftext|>')[1] + tokenizer_gpt2.eos_token for dialogue in test_data['chat'][:3]]

In [48]:
perplexity_metric = evaluate.load("perplexity", module_type="metric")
perplexity_before_finetuning = perplexity_metric.compute(
    predictions= test_input,
    model_id= model_name
)
print("Perplexity before fine-tuning:", perplexity_before_finetuning)

Perplexity before fine-tuning: {'perplexities': [653.0138549804688, 570.7035522460938, 424.50201416015625], 'mean_perplexity': np.float64(549.4064737955729)}
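
Perplexity is the exponential of the average negative log-likelihood that the model assigns to each token, so a value close to 1 means nearly every token is predicted with high confidence. A minimal sketch with hypothetical per-token probabilities (not taken from the model):

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the given token probabilities."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # guessing among 4 equally likely tokens
confident = perplexity([0.9, 0.95, 0.92])        # high probability on every token
print(uniform, confident)
```

Read this way, a perplexity in the hundreds means the model is about as uncertain as if it were choosing uniformly among several hundred tokens at each position.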


#### Fine-Tuning
Now we can start the fine-tuning

In [None]:
from transformers import TrainingArguments, Trainer
import os

In [None]:
os.environ["WANDB_DISABLED"] = "true"

In [None]:
training_args = TrainingArguments(
    "gpt2-parrot_trainer",
    # evaluation_strategy="steps",  # left disabled, so eval_steps below has no effect
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size: 2 * 8 = 16
    num_train_epochs=3,
    save_steps=500,
    eval_steps=500,
    learning_rate=6.25e-5,
    lr_scheduler_type="linear",
    bf16=True,
    report_to="none",  # explicitly disable logging integrations such as W&B
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
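
With `lr_scheduler_type="linear"` and no warmup configured, the learning rate decays linearly from its initial value to zero over the run. A sketch of that schedule (the function `linear_lr` is illustrative, and `total_steps=675` assumes the `global_step` that the training run reports):

```python
def linear_lr(step, base_lr=6.25e-5, total_steps=675, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 225, 450, 675):   # start of each epoch and end of training
    print(step, linear_lr(step))
```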


In [None]:
trainer = Trainer(
    model= gpt2,
    args= training_args,
    train_dataset= tokenized_data['train'],
    eval_dataset= tokenized_data['val'],
    data_collator= data_collator
)

Measure the time taken by the model to finish training

In [None]:
import time
begin = time.time()

In [None]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 500  | 1.4587        |


TrainOutput(global_step=675, training_loss=1.1649107078269676, metrics={'train_runtime': 2385.8294, 'train_samples_per_second': 4.527, 'train_steps_per_second': 0.283, 'total_flos': 2821953945600000.0, 'train_loss': 1.1649107078269676, 'epoch': 3.0})
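
The reported `global_step` and throughput can be cross-checked from the training configuration. The sketch below assumes a single GPU, so that the per-device batch size is the whole micro-batch:

```python
import math

train_rows = 3600
per_device_batch = 2
grad_accum = 8
epochs = 3

effective_batch = per_device_batch * grad_accum             # sequences per optimizer step
steps_per_epoch = math.ceil(train_rows / effective_batch)   # optimizer steps in one epoch
total_steps = steps_per_epoch * epochs

# train_runtime is copied from the TrainOutput above
samples_per_second = train_rows * epochs / 2385.8294
print(effective_batch, steps_per_epoch, total_steps, round(samples_per_second, 3))
```

Since the run has 675 steps and `save_steps=500`, only one intermediate checkpoint (at step 500) is written during training.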

In [None]:
end = time.time()
print("Training time: ", end - begin)

Training time:  2386.2620153427124


In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(gpt2.named_parameters())

print('The GPT-2 model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:2]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[2:14]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-2:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The GPT-2 model has 148 different named parameters.

==== Embedding Layer ====

transformer.wte.weight                                  (50000, 768)
transformer.wpe.weight                                   (1024, 768)

==== First Transformer ====

transformer.h.0.ln_1.weight                                   (768,)
transformer.h.0.ln_1.bias                                     (768,)
transformer.h.0.attn.c_attn.weight                       (768, 2304)
transformer.h.0.attn.c_attn.bias                             (2304,)
transformer.h.0.attn.c_proj.weight                        (768, 768)
transformer.h.0.attn.c_proj.bias                              (768,)
transformer.h.0.ln_2.weight                                   (768,)
transformer.h.0.ln_2.bias                                     (768,)
transformer.h.0.mlp.c_fc.weight                          (768, 3072)
transformer.h.0.mlp.c_fc.bias                                (3072,)
transformer.h.0.mlp.c_proj.weight                        (3072

In [None]:
gpt2.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50000, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50000, bias=False)
)

Save the fine-tuned model and the tokenizer

In [None]:
from datetime import datetime

gpt2_finetune_path = path_to_project + '/Transformer-trained-models/' + f"gpt2-parrot_fine_tuning_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
tokenizer_gpt2.save_pretrained(gpt2_finetune_path)
gpt2.save_pretrained(gpt2_finetune_path)
print(f"Checkpoint saved at: \'{gpt2_finetune_path}\'")

Checkpoint saved at: '/content/drive/MyDrive/SFX/NLP_Project/Transformer-trained-models/gpt2-parrot_fine_tuning_2025_05_16_14_58_31'


#### Testing

Retrieve the trained model

In [None]:
!pip install evaluate
import evaluate

In [None]:
device = 'cuda'

#gpt2_finetune_path = '/content/drive/MyDrive/NLP_Project/Transformer-trained-models/gpt2-parrot_fine_tuning_2025_05_16_14_58_31'
tokenizer_gpt2 = AutoTokenizer.from_pretrained(gpt2_finetune_path)
gpt2 = AutoModelForCausalLM.from_pretrained(gpt2_finetune_path, device_map = device)

To test the model, we pick one chat from the test data at random, pass its prompt to the model, and inspect the response

In [53]:
import random

random.seed(43)

idx = random.choice(range(len(test_data))) # select a random conversation
print(idx)
dialogue = test_data['chat'][idx]
print(dialogue)

19
User: Bob is playing with 6-sided dice. A net of such standard cube is shown below.

<image>

He has an unlimited supply of these dice and wants to build a tower by stacking multiple dice on top of each other, while choosing the orientation of each dice. Then he counts the number of visible pips on the faces of the dice.

For example, the number of visible pips on the tower below is 29 — the number visible on the top is 1, from the south 5 and 3, from the west 4 and 2, from the north 2 and 4 and from the east 3 and 5.

<image>

The one at the bottom and the two sixes by which the dice are touching are not visible, so they are not counted towards total.

Bob also has t favourite integers x_i, and for every such integer his goal is to build such a tower that the number of visible pips is exactly x_i. For each of Bob's favourite integers determine whether it is possible to build a tower that has exactly that many visible pips.

Input

The first line contains a single integer t (1 ≤ t ≤

In [54]:
# now take only the 'input part' and the 'output part'
# parse string
test_input, test_output, end = dialogue.split('<|endoftext|>')

# add eos token at the end of test_input and test_output
test_input = test_input + tokenizer_gpt2.eos_token
test_output = test_output + tokenizer_gpt2.eos_token

print('User input: \n', test_input)
print('####')
print('Correct solution output: \n', test_output)

User input: 
 User: Bob is playing with 6-sided dice. A net of such standard cube is shown below.

<image>

He has an unlimited supply of these dice and wants to build a tower by stacking multiple dice on top of each other, while choosing the orientation of each dice. Then he counts the number of visible pips on the faces of the dice.

For example, the number of visible pips on the tower below is 29 — the number visible on the top is 1, from the south 5 and 3, from the west 4 and 2, from the north 2 and 4 and from the east 3 and 5.

<image>

The one at the bottom and the two sixes by which the dice are touching are not visible, so they are not counted towards total.

Bob also has t favourite integers x_i, and for every such integer his goal is to build such a tower that the number of visible pips is exactly x_i. For each of Bob's favourite integers determine whether it is possible to build a tower that has exactly that many visible pips.

Input

The first line contains a single integer

In [None]:
!pip install -q transformers evaluate  # `pipeline` is part of transformers, not a separate package
from transformers import pipeline
import evaluate
pipe_finetuned = pipeline("text-generation", model=gpt2, tokenizer=tokenizer_gpt2)

In [56]:
gen_output = pipe_finetuned(test_input, max_new_tokens=256)
print("Generated text after the fine-tuning: \n", gen_output[0]["generated_text"])

Generated text after the fine-tuning: 
 User: Bob is playing with 6-sided dice. A net of such standard cube is shown below.

<image>

He has an unlimited supply of these dice and wants to build a tower by stacking multiple dice on top of each other, while choosing the orientation of each dice. Then he counts the number of visible pips on the faces of the dice.

For example, the number of visible pips on the tower below is 29 — the number visible on the top is 1, from the south 5 and 3, from the west 4 and 2, from the north 2 and 4 and from the east 3 and 5.

<image>

The one at the bottom and the two sixes by which the dice are touching are not visible, so they are not counted towards total.

Bob also has t favourite integers x_i, and for every such integer his goal is to build such a tower that the number of visible pips is exactly x_i. For each of Bob's favourite integers determine whether it is possible to build a tower that has exactly that many visible pips.

Input

The first line

After fine-tuning, the generated text now includes Python code. The code is still not correct, especially syntactically, but the model maintains the expected response format ('User:' and 'Assistant:').

#### Evaluation Metrics

We consider two metrics: perplexity and BLEU.

**Perplexity**

In [57]:
# get inputs from test_data
test_input = [dialogue.split('<|endoftext|>')[0] + tokenizer_gpt2.eos_token for dialogue in test_data['chat'][:3]]

# get outputs from test data
test_output = [dialogue.split('<|endoftext|>')[1] + tokenizer_gpt2.eos_token for dialogue in test_data['chat'][:3]]

In [58]:
perplexity_metric = evaluate.load("perplexity", module_type="metric")
perplexity = perplexity_metric.compute(
    predictions= test_input,
    add_start_token=False,
    model_id= gpt2_finetune_path
)
print("Perplexity post fine-tuning:", perplexity)

Perplexity post fine-tuning: {'perplexities': [1.080601692199707, 1.1230411529541016, 1.0684187412261963], 'mean_perplexity': np.float64(1.0906871954600017)}


Before fine-tuning, the perplexities on the same three prompts were [653.01, 570.70, 424.50], with a mean of 549.41.

After fine-tuning the values are much lower (mean 1.09), similar to those achieved by the gpt2-small model, which indicates that fine-tuning adapted the model to this kind of text. The comparison should be read with some caution: it covers only three test prompts, and the two measurements use different `add_start_token` settings.

**BLEU**

BLEU focuses on precision by counting matching n-grams.
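
As an illustration of what "matching n-grams" means, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on, for a toy candidate and reference. The token sequences and helper names are made up for the example; the `evaluate` implementation additionally handles tokenization, multiple references, and the brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


reference = "for i in range ( n ) :".split()
candidate = "for i in range ( m ) :".split()

p1 = modified_precision(candidate, reference, 1)   # 7 of 8 unigrams match
p2 = modified_precision(candidate, reference, 2)   # 5 of 7 bigrams match
print(p1, p2)
```

A single wrong token lowers the unigram precision only slightly but breaks two bigrams, which is why the higher-order precisions drop quickly for code that is only approximately right.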

In [None]:
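# BLEU Score
# Note: the pipeline's generated_text includes the prompt by default,
# so the prediction scored here is the prompt plus the model's completion.
bleu_metric = evaluate.load("bleu")
bleu_results = bleu_metric.compute(
    predictions=[pipe_finetuned(test_input[0], max_new_tokens=256)[0]['generated_text']],
    references=[test_output[0]]
)
print("BLEU Score:", bleu_results)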

BLEU Score: {'bleu': 0.05315658078526119, 'precisions': [0.3118081180811808, 0.10351201478743069, 0.03333333333333333, 0.0074211502782931356], 'brevity_penalty': 1.0, 'length_ratio': 1.7828947368421053, 'translation_length': 542, 'reference_length': 304}
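
The overall score can be recomputed from the components in the output above: BLEU is the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The sketch below uses only the reported numbers and the standard BLEU formula; it does not rerun the model.

```python
import math

# Values copied from the metric output above
precisions = [0.3118081180811808, 0.10351201478743069,
              0.03333333333333333, 0.0074211502782931356]
translation_length, reference_length = 542, 304

# The brevity penalty only applies when the prediction is shorter than the reference
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = brevity_penalty * geo_mean
print(bleu)
```

The 4-gram precision of about 0.007 dominates the result: very few four-token sequences in the prediction also appear in the reference solution.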


The BLEU score, although higher than the scores achieved by the T5 models, is still low, indicating that the generated text remains far from the reference text.