# Supervised Fine-Tuning of GPT Using Hugging Face Tools

## Part One: Task-Specific Fine-Tuning

see https://medium.com/@rupaak/how-to-fine-tune-gpt-2-for-a-domain-specific-chatbot-46e9ca64bc86

## Part Two: Domain Adaptation

see 

In [20]:
import math
import json

from datasets import load_dataset, Dataset

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling, DataCollatorWithPadding
#from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling


## Loading the SQuAD dataset

In [21]:
dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [22]:
## Build custom Dataset from loaded squad dataset
path = "./data.jsonl"

def _build_dataset(dataset):
    custom_dataset = []
    for item in dataset['train']:
        item['train'] = True
        custom_dataset.append(json.dumps(item))
    
    for item in dataset['validation']:
        item['train'] = False
        custom_dataset.append(json.dumps(item))
    
    with open(path, "w") as f:
        f.write("\n".join(custom_dataset))

_build_dataset(dataset)
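
The `train` flag written into each record above makes it possible to recover either split from the single JSONL file later on. The following is a minimal sketch of such a reader; `read_split` is a hypothetical helper that is not part of the original notebook, and it assumes one JSON object per line as produced by `_build_dataset`.

```python
import json


def read_split(file_path, train):
    """Return the records of a JSONL file whose 'train' flag equals `train`."""
    records = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            item = json.loads(line)
            if item["train"] == train:
                records.append(item)
    return records
```

Calling `read_split("./data.jsonl", train=True)` and `read_split("./data.jsonl", train=False)` would then give back the training and validation records separately.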

In [33]:
## Load custom Dataset
path = "./data.jsonl"

def _load_dataset(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    print(len(lines))
    print(lines[0])
    data = [json.loads(line) for line in lines]
    
    formatted_data = []
    for item in data:
        question = item['question']
        context = item['context']
        answers = item['answers']
        
        input_text = f"Question: {question}\nContext: {context}\nAnswer:"
        formatted_data.append({
            'input_text': input_text,
            'target_text': str(context), #answers['text'][0]
        })
    
    return Dataset.from_dict({'text': [item['input_text'] for item in formatted_data],
                              'labels': [item['target_text'] for item in formatted_data]})


dataset = _load_dataset(path)  # Our data file name is 'data.jsonl'

98169
{"id": "5733be284776f41900661182", "title": "University_of_Notre_Dame", "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.", "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?", "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}, "train": true}
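
The loader above stores the prompt in `text` and a copy of the context in `labels`, while the commented-out `answers['text'][0]` hints at using the reference answer instead. For a causal language model, one common alternative is to fold the answer into a single training string. Below is a minimal sketch under that assumption; `format_example` is a hypothetical helper and the sample context is shortened for illustration.

```python
def format_example(item):
    """Build one training string: the prompt followed by the first reference answer."""
    prompt = (
        f"Question: {item['question']}\n"
        f"Context: {item['context']}\n"
        "Answer:"
    )
    # SQuAD stores answers as {"text": [...], "answer_start": [...]}
    answer = item["answers"]["text"][0]
    return f"{prompt} {answer}"


# Record in the same shape as the line printed above (context shortened)
sample = {
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "context": "It is a replica of the grotto at Lourdes, France ...",
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]},
}
print(format_example(sample))
```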



In [34]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
#tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
#model = GPT2LMHeadModel.from_pretrained('gpt2')

In [39]:
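## Tokenize the dataset
# GPT-2 has no padding token: add one and resize the embedding matrix to match
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))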

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

SyntaxError: keyword argument repeated: padding (509376440.py, line 5)

In [36]:
## Use a data collator to prepare each batch
## see https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForLanguageModeling
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
#data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [37]:
## Use Huggingface Trainer

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="no",  # no eval_dataset is passed to the Trainer below
    learning_rate=2e-5,
    #per_device_train_batch_size=4,
    #per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=10_000,
    save_total_limit=2,
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets,
)

In [38]:
trainer.train()

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
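
The `ValueError` above is consistent with the `labels` column still holding raw strings (the copied context), which the collator cannot turn into a tensor. For causal language modeling, labels are normally token ids. The following is a minimal sketch of a tokenize function that derives `labels` from `input_ids` and sets padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. `tokenize_with_labels` is a hypothetical helper, and its `tokenizer` argument is assumed to behave like the Hugging Face tokenizer loaded earlier.

```python
def tokenize_with_labels(examples, tokenizer, max_length=128):
    """Tokenize a batch of texts and add integer labels for causal LM training."""
    encoded = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    # Labels are the input ids themselves; padded positions become -100 so
    # that the loss skips them.
    encoded["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"])
    ]
    return encoded
```

With the objects defined earlier, this could be applied as `dataset.map(lambda batch: tokenize_with_labels(batch, tokenizer), batched=True, remove_columns=dataset.column_names)`, so that the string-valued `text` and `labels` columns are replaced by token ids.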

## Part 2 - Domain Adaptation

In [1]:
import json
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorWithPadding
from torch.utils.data import Dataset

In [2]:
## Use the 'context' field in data.jsonl for domain adaptation
input_file = "./data.jsonl"
train_output_file = "./train_data.txt"

with open(input_file, 'r') as file:
    lines = file.readlines()

with open(train_output_file, "w") as outfile:
    for line in lines:
        data = json.loads(line)
        if data['train']:
            context = data['context']
            context = context.strip()
            outfile.write(context)

eval_output_file = "./eval_data.txt"
with open(eval_output_file, "w") as outfile:
    for line in lines:
        data = json.loads(line)
        if not data['train']:
            context = data['context']
            context = context.strip()
            outfile.write(context)
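
The cell above writes the contexts back to back without a separator, so a later `splitlines()` only splits where a context itself contains a newline. A possible variant that writes exactly one context per line, and skips the repeats that arise because SQuAD stores the same context once per question, is sketched below. `write_contexts` is a hypothetical helper; using it would change the number of training lines compared with the files produced by the original cell.

```python
import json


def write_contexts(jsonl_path, out_path, train):
    """Write each distinct context of one split on its own line; return the count."""
    seen = set()
    with open(jsonl_path, "r", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            item = json.loads(line)
            if item["train"] != train:
                continue
            # Collapse internal whitespace (including newlines) so that one
            # context always occupies exactly one line.
            context = " ".join(item["context"].split())
            if context in seen:
                continue  # same paragraph, different question
            seen.add(context)
            dst.write(context + "\n")
    return len(seen)
```

For example, `write_contexts("./data.jsonl", "./train_data.txt", train=True)` would rebuild the training file, and `train=False` the evaluation file.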

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorWithPadding

In [5]:
## Load the pre-trained GPT-2 model
# Available checkpoints:
#   "gpt2"         124 million parameters
#   "gpt2-medium"  355 million parameters
#   "gpt2-large"   774 million parameters
#   "gpt2-xl"      1.5 billion parameters
model_id = "gpt2-medium"


tokenizer = GPT2Tokenizer.from_pretrained(model_id, clean_up_tokenization_spaces=True)
model = GPT2LMHeadModel.from_pretrained(model_id)

In [6]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token # tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    tokenizer.pad_token_id = tokenizer.eos_token_id
    

In [7]:
###### Prepare Dataset
class CustomDataset(Dataset): # from torch.utils.data
    def __init__(self, tokenizer, file_path, block_size):
        self.tokenizer = tokenizer
        with open(file_path, "r") as f:
            self.text = f.read().splitlines()
        self.block_size = block_size

    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, idx):
        tokenized_inputs = self.tokenizer(self.text[idx], 
                                          truncation=True, 
                                          padding="max_length", 
                                          max_length=self.block_size, 
                                          return_tensors="pt")
        
        tokenized_inputs['labels'] = tokenized_inputs['input_ids']
        return tokenized_inputs
        

train_dataset = CustomDataset(tokenizer, train_output_file, 128)
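
`CustomDataset` truncates every line to `block_size` tokens and pads shorter lines, so any text beyond the first 128 tokens of a line is never seen during training. A common alternative for domain adaptation is to concatenate all token ids and cut the stream into fixed-size blocks. The following is a minimal pure-Python sketch of that idea; `group_into_blocks` is a hypothetical helper and operates on plain lists of token ids rather than on the tokenizer output used above.

```python
def group_into_blocks(token_id_lists, block_size):
    """Concatenate token id lists and cut the stream into full blocks."""
    stream = [tok for ids in token_id_lists for tok in ids]
    usable = (len(stream) // block_size) * block_size  # drop the short tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


blocks = group_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every block is then fully populated, so no padding is needed and the labels can simply be a copy of each block.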

In [4]:
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print("Using device:", device)

Using device: mps


In [8]:
# Create a data collator that dynamically pads the sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [9]:
## Use Huggingface Trainer

training_args = TrainingArguments(
    #per_device_train_batch_size=2,
    output_dir='./results_2',
    logging_dir='./logs_2',
    num_train_epochs=4,  # large dataset -> fewer epochs, small dataset -> more epochs
    learning_rate=1e-4,
    logging_steps=10,
    load_best_model_at_end=False,
    eval_strategy="no",    
    remove_unused_columns=False,
    push_to_hub=False,
    #per_device_eval_batch_size=4,
    #weight_decay=0.01,
    #save_steps=10_000,
    #save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=None,
)

In [10]:
trainer.train()

Step,Training Loss
10,2.4701
20,1.4205
30,0.9279
40,0.9885
50,0.7606
60,0.4954
70,0.369
80,0.5349
90,0.4294
100,0.2537


TrainOutput(global_step=200, training_loss=0.5171388733386993, metrics={'train_runtime': 220.6057, 'train_samples_per_second': 7.198, 'train_steps_per_second': 0.907, 'total_flos': 368694175727616.0, 'train_loss': 0.5171388733386993, 'epoch': 4.0})
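
The metrics in the `TrainOutput` above allow a quick sanity check of the dataset size and batch size, and the mean loss can be converted into a perplexity with `exp(loss)`. The sketch below only redoes that arithmetic with the numbers copied from the output; the variable names are illustrative.

```python
import math

# Values copied from the TrainOutput above
train_runtime = 220.6057        # seconds
samples_per_second = 7.198
epochs = 4.0
global_step = 200
train_loss = 0.5171388733386993

samples_per_epoch = train_runtime * samples_per_second / epochs
steps_per_epoch = global_step / epochs
perplexity = math.exp(train_loss)

print(f"examples per epoch ~ {samples_per_epoch:.0f}")
print(f"steps per epoch    = {steps_per_epoch:.0f}")
print(f"examples per step  ~ {samples_per_epoch / steps_per_epoch:.1f}")
print(f"train perplexity   ~ {perplexity:.2f}")
```

Roughly eight examples per step would match the default `per_device_train_batch_size` of 8 on a single device. Because the labels in `CustomDataset` include the padded positions, the perplexity also reflects how well the model predicts padding and should be read with that caveat.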

In [11]:
#### Test / Eval the file
eval_dataset = CustomDataset(tokenizer, eval_output_file, 128)
len(eval_dataset)
eval_dataset[0]

{'input_ids': tensor([[12442,  8693,  2026,   373,   281,  1605,  4346,   983,   284,  5004,
           262,  8783,   286,   262,  2351,  9957,  4041,   357, 32078,     8,
           329,   262,  1853,  1622,    13,   383,  1605,  9957,  8785,   357,
            32,  4851,     8,  8783, 10656, 16582,  9772,   262,  2351,  9957,
          8785,   357,    45,  4851,     8,  8783,  5913, 17782,  1987,  1906,
           940,   284,  5160,   511,  2368,  3115,  8693,  3670,    13,   383,
           983,   373,  2826,   319,  3945,   767,    11,  1584,    11,   379,
         20196,   338, 10499,   287,   262,  2986,  6033,  4696,  9498,   379,
          8909, 27443,    11,  3442,    13,  1081,   428,   373,   262,  2026,
           400,  3115,  8693,    11,   262,  4652, 20047,   262,   366, 24267,
           268, 11162,     1,   351,  2972,  3869,    12, 26966, 15446,    11,
           355,   880,   355, 13413, 47499,   262,  6761,   286, 19264,  1123,
          3115,  8693,   983,   351,  

In [14]:
prompt = "What is Super Bowl 50?"
inputs = tokenizer(prompt, return_tensors="pt")  # tokenize once, reuse both tensors

input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)


model.eval()
model.to(device)  # use the device selected earlier instead of hard-coding "mps"
output = model.generate(input_ids=input_ids, 
                        attention_mask=attention_mask,
                        pad_token_id=tokenizer.pad_token_id,
                        max_length= 100,
                        num_beams=5,
                        temperature=1.5,
                        top_k=50,
                        do_sample=True)
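
The call above combines `temperature=1.5`, `top_k=50`, and `do_sample=True`. As a rough illustration of what temperature scaling and top-k filtering do to a next-token distribution, here is a pure-Python sketch on made-up logits. It is not the library's implementation, and `top_k_probs` is a hypothetical helper.

```python
import math


def top_k_probs(logits, temperature=1.0, top_k=None):
    """Softmax over temperature-scaled logits, keeping only the top_k largest."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Tokens below the k-th largest score get probability zero
        # (ties at the cutoff are all kept)
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores for four candidate tokens
print(top_k_probs(logits, temperature=1.0, top_k=2))
print(top_k_probs(logits, temperature=1.5, top_k=2))
```

A temperature above 1 flattens the distribution among the tokens that survive the top-k cut, which makes less likely continuations more probable during sampling.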

In [17]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

What is Super Bowl 50? It is the 50th Super Bowl between the Atlanta Falcons and the New England Patriots. It was played on January 3, 1994, at Mercedes-Benz Stadium in East Rutherford, New Jersey. The Falcons defeated the Patriots 27-14. The victory was the highest-scoring game in Super Bowl history with an average of 28 points per game. The victory also marked the beginning of the end for the Patriots, who had been without a victory in the postseason since 2001. The


In [19]:
## Compare with the pre-trained model
## Reload the original pre-trained GPT-2 weights
pre_trained_model = GPT2LMHeadModel.from_pretrained(model_id)

pre_trained_model.eval()
pre_trained_model.to(device)

output = pre_trained_model.generate(input_ids=input_ids, 
                        attention_mask=attention_mask,
                        pad_token_id=tokenizer.pad_token_id,
                        max_length= 100,
                        num_beams=5,
                        temperature=1.5,
                        top_k=50,
                        do_sample=True)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)

What is Super Bowl 50?

The Super Bowl is an annual event that takes place in New Orleans, Louisiana, on February 5, 2016. It is the oldest sporting event in the United States, having been held every year since 1920.

The Super Bowl is an annual event that takes place in New Orleans, Louisiana, on February 5, 2016. It is the oldest sporting event in the United States, having been held every year since 1920.

What is Super Bowl XLIX
