## Fine-tune GPT-2 for Q-A related tasks
We use our own dataset to fine-tune GPT-2.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments





We load the dataset into a pandas DataFrame with two columns, `question` and `answer`. Each row holds one question and the answer to that question.
We then combine each question and its answer into a single string with a ` [SEP] ` separator and store the results in a list, `texts`.

In [2]:
df = pd.read_excel("training_data_assignment.xlsx")
df.columns = ['question', 'answer']

# Combine questions and answers into a single string
df['input_text'] = df['question'] + " [SEP] " + df['answer']
texts = df['input_text'].tolist()
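The combination step can also be sketched without pandas, which makes the handling of incomplete rows explicit. This is a minimal sketch: the helper name `format_pair`, the policy of skipping rows with a missing question or answer, and the two example rows are all assumptions for illustration and are not taken from the dataset. Note that `[SEP]` is not a special token in the GPT-2 vocabulary, so the tokenizer treats it as ordinary text.

```python
SEPARATOR = " [SEP] "  # same separator string as in the cell above


def format_pair(question, answer, sep=SEPARATOR):
    """Join one question and one answer into a single training string.

    Returns None when either side is missing or blank (hypothetical policy).
    """
    if not isinstance(question, str) or not isinstance(answer, str):
        return None  # e.g. NaN read from an empty spreadsheet cell
    question, answer = question.strip(), answer.strip()
    if not question or not answer:
        return None
    return f"{question}{sep}{answer}"


# Illustrative rows only; the second one has a missing answer.
example_pairs = [
    ("What is a stock?", "A share of ownership in a company."),
    ("What is a bond?", float("nan")),
]
example_texts = [
    text
    for text in (format_pair(q, a) for q, a in example_pairs)
    if text is not None
]
print(example_texts)
```

Skipping incomplete rows up front avoids writing a non-string value to the training file later on.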

We initialise the `GPT2` tokenizer, imported earlier from Hugging Face's `transformers` library. It is used to encode the text into tokens.
We also save the texts to a text file, one example per line, with `utf-8` encoding.

In [3]:
# Load the pretrained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Write one "question [SEP] answer" example per line
with open("training_data_assignment.txt", "w", encoding="utf-8") as file:
    for text in texts:
        file.write(text + "\n")

We use `TextDataset` to put our text data into a format suitable for training the GPT-2 model. It tokenizes the file (using the tokenizer we initialised earlier) and splits the resulting tokens into fixed-length blocks of `block_size` tokens.
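The block splitting can be illustrated in plain Python. This is a minimal sketch, assuming `TextDataset` tokenizes the whole file as one stream and drops a trailing block shorter than `block_size`; `chunk_into_blocks` is a hypothetical helper, not a library function, and the integers stand in for real token ids.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size.

    A trailing remainder shorter than block_size is dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(10))  # stand-in for real token ids
blocks = chunk_into_blocks(token_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under this assumption a block boundary does not respect line boundaries, so with `block_size=128` a single training example may contain the end of one question-answer pair and the start of the next.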

We use `DataCollatorForLanguageModeling` to collate these examples into batches for the language model during training. Setting `mlm=False` selects causal (next-token) language modeling instead of masked language modeling.

`TextDataset` and `DataCollatorForLanguageModeling` are both important tools for fine-tuning an LLM on custom data. Both are provided by Hugging Face.

In [4]:
# Create a TextDataset
text_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="training_data_assignment.txt",  
    block_size=128
)

# Create a DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
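What the collator does with `mlm=False` can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the labels are a copy of the input ids, padding positions (if a pad token is defined) are replaced by `-100` so the loss ignores them, and the shift by one position happens inside the model. `make_causal_lm_labels` and `next_token_pairs` are hypothetical helper names.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def make_causal_lm_labels(input_ids, pad_token_id=None):
    """Copy input ids to labels, masking padding positions."""
    labels = list(input_ids)
    if pad_token_id is not None:
        labels = [IGNORE_INDEX if t == pad_token_id else t for t in labels]
    return labels


def next_token_pairs(input_ids, labels):
    """Pair each input token with the label of the following position."""
    return [
        (token, target)
        for token, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]


input_ids = [10, 11, 12, 0, 0]  # toy ids, with 0 acting as padding
labels = make_causal_lm_labels(input_ids, pad_token_id=0)
print(labels)                               # [10, 11, 12, -100, -100]
print(next_token_pairs(input_ids, labels))  # [(10, 11), (11, 12)]
```

The GPT-2 tokenizer defines no pad token by default, and every block here has the same length, so in this notebook the labels are simply a copy of the inputs.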

We use `TrainingArguments`, part of the Hugging Face `transformers` library, to define the settings and hyperparameters for the training process.

We also load the pretrained GPT-2 model using `GPT2LMHeadModel`.

We use `Trainer` to run the training process. It takes the model, training arguments, data collator, and training dataset, and trains the model accordingly.

To `Trainer` we pass our `GPT-2` model along with our dataset and training arguments. In `training_args` we set the number of epochs, the batch size, a checkpoint save every 10,000 steps, and a limit on the total number of saved checkpoints. Because this run has only 2,550 steps in total, no intermediate checkpoint is written.

In [5]:
training_args = TrainingArguments(
    output_dir="./fine-tuned-gpt2",
    overwrite_output_dir=True,
    num_train_epochs=5, 
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
)

model = GPT2LMHeadModel.from_pretrained("gpt2")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=text_dataset
)

trainer.train()

 20%|█▉        | 500/2550 [38:01<2:35:44,  4.56s/it]

{'loss': 3.01, 'learning_rate': 4.0196078431372555e-05, 'epoch': 0.98}


 39%|███▉      | 1000/2550 [1:19:30<2:17:53,  5.34s/it]

{'loss': 2.2061, 'learning_rate': 3.0392156862745097e-05, 'epoch': 1.96}


 59%|█████▉    | 1500/2550 [1:55:42<1:09:50,  3.99s/it]

{'loss': 1.9055, 'learning_rate': 2.058823529411765e-05, 'epoch': 2.94}


 78%|███████▊  | 2000/2550 [2:28:57<31:34,  3.44s/it]  

{'loss': 1.7083, 'learning_rate': 1.0784313725490197e-05, 'epoch': 3.92}


 98%|█████████▊| 2500/2550 [3:01:52<03:01,  3.63s/it]  

{'loss': 1.6043, 'learning_rate': 9.80392156862745e-07, 'epoch': 4.9}


100%|██████████| 2550/2550 [3:04:57<00:00,  4.35s/it]

{'train_runtime': 11097.5723, 'train_samples_per_second': 0.459, 'train_steps_per_second': 0.23, 'train_loss': 2.078109023150276, 'epoch': 5.0}





TrainOutput(global_step=2550, training_loss=2.078109023150276, metrics={'train_runtime': 11097.5723, 'train_samples_per_second': 0.459, 'train_steps_per_second': 0.23, 'train_loss': 2.078109023150276, 'epoch': 5.0})
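The `learning_rate` and `epoch` values in the log can be reproduced by hand. The sketch below assumes the `TrainingArguments` defaults that the cell above leaves untouched: a base learning rate of `5e-5`, a linear decay schedule, and no warmup. `linear_decay_lr` is a hypothetical helper written for this check, not a library function.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 2550  # from the progress bar above
EPOCHS = 5          # num_train_epochs
steps_per_epoch = TOTAL_STEPS // EPOCHS

for step in (500, 1000, 1500, 2000, 2500):
    epoch = round(step / steps_per_epoch, 2)
    print(step, epoch, linear_decay_lr(step, TOTAL_STEPS))
```

At step 500 this gives an epoch of 0.98 and a learning rate of about 4.02e-05, matching the first logged line; the later log lines follow the same pattern.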

We save our fine-tuned model and tokenizer locally using `save_pretrained`.
We can later reuse the fine-tuned model and tokenizer by loading them from there.

In [6]:
model.save_pretrained("fine-tuned-gpt2")
tokenizer.save_pretrained("fine-tuned-gpt2")

('fine-tuned-gpt2\\tokenizer_config.json',
 'fine-tuned-gpt2\\special_tokens_map.json',
 'fine-tuned-gpt2\\vocab.json',
 'fine-tuned-gpt2\\merges.txt',
 'fine-tuned-gpt2\\added_tokens.json')
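Loading the saved model back in a later session could look like the sketch below. It assumes the `fine-tuned-gpt2` directory written above still exists; `load_fine_tuned` is a hypothetical helper name, and the early directory check is an added convenience rather than something the library requires.

```python
from pathlib import Path


def load_fine_tuned(directory="fine-tuned-gpt2"):
    """Load the fine-tuned model and tokenizer from a local directory."""
    path = Path(directory)
    if not path.is_dir():
        raise FileNotFoundError(f"No saved model directory at {path}")

    # Imported here so the directory check works without loading the library.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained(str(path))
    tokenizer = GPT2Tokenizer.from_pretrained(str(path))
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

A call such as `model, tokenizer = load_fine_tuned()` would then replace the in-memory objects from the training session.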

We send a question to the model to check whether it produces a sensible response.

In [7]:
prompt = "What is the stocks?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the stocks? [SEP] The stock market is a place of investment where investors can
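The warnings above say that no attention mask or pad token id was passed to `generate`. A minimal sketch that passes both explicitly, and then separates the answer from the echoed prompt, might look like this. `generate_answer` and `split_answer` are hypothetical helpers, and `max_new_tokens=50` is an arbitrary choice; the cell above relies on the default generation length, which is probably why the answer stops mid-sentence.

```python
def generate_answer(model, tokenizer, prompt, max_new_tokens=50):
    """Generate a continuation of `prompt` with an explicit attention mask."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


def split_answer(generated_text, sep="[SEP]"):
    """Split generated text into (question, answer) at the first separator."""
    question, found, answer = generated_text.partition(sep)
    if not found:
        return generated_text.strip(), ""
    return question.strip(), answer.strip()


# Applied to the output printed above:
sample = ("What is the stocks? [SEP] The stock market is a place of "
          "investment where investors can")
question, answer = split_answer(sample)
print(answer)
```

With the fine-tuned objects in memory, `split_answer(generate_answer(model, tokenizer, "What is the stocks?"))` would return the question and the generated answer as separate strings.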


## END