## LLM Fine-Tuning on SQuAD-Format JSON with Data on Beyoncé

REF https://stackabuse.com/guide-to-fine-tuning-open-source-llms-on-custom-data



In [None]:
# Install and import the modules
!pip install torch
!pip install transformers

import json
import torch
from transformers import BertTokenizer, BertForQuestionAnswering
from torch.utils.data import DataLoader, Dataset




In [None]:
#Input file path
file_path = 'drive/MyDrive/LLM_data/beyonce.json'

## Standalone code to read the JSON file and extract the data

In [None]:
# Read the Json file
with open(file_path, 'r') as f:
     data = json.load(f)

# get the data
paragraphs = data['data'][0]['paragraphs']

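# Loop through all the paragraphs. For each paragraph take its context,
# then loop through its 'qas' entries to get 1) the question, 2) the first
# answer's text and 3) that answer's start position (a character offset
# into the context), and store them in a dict.

extracted_data = []
for para in paragraphs:
    context = para['context']
    for qa in para['qas']:
        question = qa['question']
        answer = qa['answers'][0]['text']
        start_pos = qa['answers'][0]['answer_start']
    # NOTE: this append sits outside the inner loop, so only the last
    # question of each paragraph is kept (one entry per paragraph).
    # The Dataset class later in this notebook appends inside the loop.
    extracted_data.append({
        'context': context,
        'question': question,
        'answer': answer,
        'start_pos': start_pos,
    })
# data creation ends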

# check
extracted_data[0]

{'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': "What was the name of Beyoncé's first solo album?",
 'answer': 'Dangerously in Love',
 'start_pos': 505}

In [None]:
# Check number of data points in extracted data
len(extracted_data)

66
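
The nesting that the extraction loop walks through can be hard to picture. As a minimal sketch, the snippet below builds a tiny hand-written dict in the same SQuAD layout (its text is made up for illustration and is not taken from beyonce.json) and flattens it into one record per question. The helper name `flatten_squad` is hypothetical.

```python
# A tiny, hand-made SQuAD-style structure (illustrative content only).
sample = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Ada wrote the first program. She worked with Babbage.",
                    "qas": [
                        {
                            "question": "Who wrote the first program?",
                            "answers": [{"text": "Ada", "answer_start": 0}],
                        },
                        {
                            "question": "Who did she work with?",
                            "answers": [{"text": "Babbage", "answer_start": 45}],
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(squad):
    """Return one dict per question, using the first listed answer."""
    records = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                first_answer = qa["answers"][0]
                records.append({
                    "context": para["context"],
                    "question": qa["question"],
                    "answer": first_answer["text"],
                    "start_pos": first_answer["answer_start"],
                })
    return records


records = flatten_squad(sample)
for r in records:
    # answer_start is a character offset: slicing the context recovers the answer
    span = r["context"][r["start_pos"]:r["start_pos"] + len(r["answer"])]
    print(r["question"], "->", span)
```

Slicing the context with `start_pos` is a quick sanity check that the offsets in a file really point at the answer text.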

## Check the tokenizer separately



In [None]:
# Set Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Set index
index = 5

# Display data at Set index pos.
Example = extracted_data[index]

# get the context
context = (Example['context'])
print(Example['context'])

# get the question
print(Example['question'])
question = Example['question']

# get the answer
print(Example['answer'])
answer = Example['answer']

# get the startpos
print(Example['start_pos'])
startpos = Example['start_pos']

At age eight, Beyoncé and childhood friend Kelly Rowland met LaTavia Roberson while in an audition for an all-girl entertainment group. They were placed into a group with three other girls as Girl's Tyme, and rapped and danced on the talent show circuit in Houston. After seeing the group, R&B producer Arne Frager brought them to his Northern California studio and placed them in Star Search, the largest talent show on national TV at the time. Girl's Tyme failed to win, and Beyoncé later said the song they performed was not good. In 1995 Beyoncé's father resigned from his job to manage the group. The move reduced Beyoncé's family's income by half, and her parents were forced to move into separated apartments. Mathew cut the original line-up to four and the group continued performing as an opening act for other established R&B girl groups. The girls auditioned before record labels and were finally signed to Elektra Records, moving to Atlanta Records briefly to work on their first recordin

## Explanation of the tokenizer

REF https://www.analyticsvidhya.com/blog/2021/09/an-explanatory-guide-to-bert-tokenizer/


We will encode a sample question and context with the tokenizer.

**Question**
q1 = 'Who was Tony Stark?'

**Context**
c1 = 'Anthony Edward Stark known as Tony Stark is a fictional character in Avengers'


encoding = tokenizer.encode_plus( q1, c1)

## What does encode_plus do?

- input_ids: the token IDs for the text. Here the question and context are combined into one sequence: [CLS] question [SEP] context [SEP].

- token_type_ids: 0 for tokens of the first sentence (the question), 1 for tokens of the second sentence (the context).

- attention_mask: 1 for tokens the model should attend to, 0 for padding tokens.




In [None]:
## Example

# question
q1 = 'Who was Tony Stark?'

# context
c1 = 'Anthony Edward Stark known as Tony Stark is a fictional character in Avengers'


# encoding
encoding = tokenizer.encode_plus( q1, c1)


# print encoding
for key, value in encoding.items():
    print( '{} : {}'.format( key, value ) )

input_ids : [101, 2040, 2001, 4116, 9762, 1029, 102, 4938, 3487, 9762, 2124, 2004, 4116, 9762, 2003, 1037, 7214, 2839, 1999, 14936, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
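
To see where the seven 0s and fourteen 1s in token_type_ids come from, here is a hand-written sketch of the sequence layout. It is an illustration only, not the library's implementation: the words are pre-split by hand (the real tokenizer uses WordPiece and returns integer IDs), and the helper name `build_pair_encoding` is hypothetical.

```python
def build_pair_encoding(question_tokens, context_tokens, max_length=None):
    """Lay out a question/context pair the way BERT expects it."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # Every real token gets attention.
    attention_mask = [1] * len(tokens)
    # Optional padding up to max_length: padded positions get mask 0.
    if max_length is not None and len(tokens) < max_length:
        pad = max_length - len(tokens)
        tokens = tokens + ["[PAD]"] * pad
        token_type_ids = token_type_ids + [0] * pad
        attention_mask = attention_mask + [0] * pad
    return tokens, token_type_ids, attention_mask


question_tokens = "who was tony stark ?".split()
context_tokens = ("anthony edward stark known as tony stark "
                  "is a fictional character in avengers").split()

tokens, token_type_ids, attention_mask = build_pair_encoding(question_tokens, context_tokens)
print(len(tokens))            # 21
print(token_type_ids)         # seven 0s followed by fourteen 1s
print(attention_mask)         # twenty-one 1s (no padding requested)

# With padding, the extra positions are masked out.
_, _, padded_mask = build_pair_encoding(question_tokens, context_tokens, max_length=24)
print(padded_mask[-3:])       # [0, 0, 0]
```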


## Now let us try the same on our data for any one index

In [None]:
## call tokenizer on one index pos of the data
inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, padding='max_length', max_length=512, truncation=True, return_tensors='pt')


# print and display
for key, value in inputs.items():
    print( '{} : {}'.format( key, value ) )

input_ids : tensor([[  101,  2040,  2772,  1996,  2611,  2177,  2006,  2255,  1019,  1010,
          2786,  1029,   102,  2012,  2287,  2809,  1010, 20773,  1998,  5593,
          2767,  5163, 20539,  2777,  2474,  2696,  9035, 11111, 17753,  2096,
          1999,  2019, 14597,  2005,  2019,  2035,  1011,  2611,  4024,  2177,
          1012,  2027,  2020,  2872,  2046,  1037,  2177,  2007,  2093,  2060,
          3057,  2004,  2611,  1005,  1055,  5939,  4168,  1010,  1998,  9680,
          5669,  1998, 10948,  2006,  1996,  5848,  2265,  4984,  1999,  5395,
          1012,  2044,  3773,  1996,  2177,  1010,  1054,  1004,  1038,  3135,
         12098,  2638, 25312,  4590,  2716,  2068,  2000,  2010,  2642,  2662,
          2996,  1998,  2872,  2068,  1999,  2732,  3945,  1010,  1996,  2922,
          5848,  2265,  2006,  2120,  2694,  2012,  1996,  2051,  1012,  2611,
          1005,  1055,  5939,  4168,  3478,  2000,  2663,  1010,  1998, 20773,
          2101,  2056,  1996,  2299,  20

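## Post-process the tokenizer output

With return_tensors='pt' the tokenizer returns each output as a tensor with a leading batch dimension of size 1, i.e. shape (1, 512) here.

We use the .squeeze() method to remove dimensions of size 1, which leaves shape (512,).

**A NumPy example of the same operation is shown below**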

In [None]:
import numpy as np

in_arr = np.array([[[2, 2, 2], [2, 2, 2]]])

print("Input array : ", in_arr)
print("Shape of input array : ", in_arr.shape)

# squeeze drops every axis of length 1
out_arr = np.squeeze(in_arr)

print("output squeezed array : ", out_arr)
print("Shape of output array : ", out_arr.shape)

Input array :  [[[2 2 2]
  [2 2 2]]]
Shape of input array :  (1, 2, 3)
output squeezed array :  [[2 2 2]
 [2 2 2]]
Shape of output array :  (2, 3)


## Now create a class for the same

**Note** We are going to inherit from the Dataset class in PyTorch

https://pytorch.org/tutorials/beginner/basics/data_tutorial.html


To create a custom 'Dataset' class we need the following methods
- `__init__`
- `__len__`
- `__getitem__`


## Init

- The `__init__` method is called when an instance is created. We pass it the input file path, and it calls a custom JSON-reading method (load_data) that extracts the context, question, answer and start position of each example into a dict.

- The list of these dicts is stored as our "data". The method also loads the BERT tokenizer.


## len
Here we just return the length of our 'data'.


## get item
Here we look up the example at the given index, tokenize its question and context, and return the input_ids, the attention_mask and the start position.


In [None]:
class beyonce(Dataset):
    def __init__(self, file_path):
        self.data = self.load_data(file_path)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def load_data(self, file_path):
        with open(file_path, 'r') as f:
            data = json.load(f)
        paragraphs = data['data'][0]['paragraphs']
        extracted_data = []
        for paragraph in paragraphs:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
                answer = qa['answers'][0]['text']
                start_pos = qa['answers'][0]['answer_start']
                extracted_data.append({
                    'context': context,
                    'question': question,
                    'answer': answer,
                    'start_pos': start_pos,
                })
        return extracted_data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        example = self.data[index]
        question = example['question']
        context = example['context']
        answer = example['answer']
        inputs = self.tokenizer.encode_plus(question, context, add_special_tokens=True, padding='max_length', max_length=512, truncation=True, return_tensors='pt')
        input_ids = inputs['input_ids'].squeeze()
        attention_mask = inputs['attention_mask'].squeeze()
        start_pos = torch.tensor(example['start_pos'])
        return input_ids, attention_mask, start_pos
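
One caveat about the class above: `start_pos` is the character offset stored in the JSON, whereas BertForQuestionAnswering expects `start_positions` (and `end_positions`) as token indices. The sketch below illustrates the conversion with a toy whitespace tokenizer, which is an assumption made to keep the example self-contained; both helper names are hypothetical. With a real BERT tokenizer the (start, end) character offsets per token would typically come from a fast tokenizer called with `return_offsets_mapping=True`.

```python
def whitespace_tokens_with_offsets(text):
    """Toy tokenizer: split on whitespace and record (start, end) character offsets."""
    tokens, offsets, pos = [], [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(word)
        offsets.append((start, end))
        pos = end
    return tokens, offsets


def char_span_to_token_span(offsets, answer_start, answer_text):
    """Return (first, last) index of the tokens that overlap the answer characters."""
    answer_end = answer_start + len(answer_text)
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        # Keep tokens whose character range overlaps the answer's range.
        if tok_start < answer_end and tok_end > answer_start:
            if start_token is None:
                start_token = i
            end_token = i
    return start_token, end_token


context = "Her debut album Dangerously in Love was released in 2003"
answer = "Dangerously in Love"
answer_start = context.index(answer)  # character offset, as stored in SQuAD

tokens, offsets = whitespace_tokens_with_offsets(context)
start_token, end_token = char_span_to_token_span(offsets, answer_start, answer)
print(answer_start)                          # 16
print(start_token, end_token)                # 3 5
print(tokens[start_token:end_token + 1])     # ['Dangerously', 'in', 'Love']
```

In the real pipeline the context tokens sit after [CLS], the question and the first [SEP], so the token indices would also need to be shifted by the number of tokens in front of the context.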

## DataLoader concepts

REF https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

DataLoader is an iterable that abstracts this complexity for us in an easy API.


## Iterate through the DataLoader

Once the dataset is loaded into a DataLoader, we can iterate through it as needed. Each iteration returns one batch of batch_size samples; for our class that is a tuple of input_ids, attention_mask and start_pos tensors. Because we specify shuffle=True, the data is reshuffled after we have iterated over all batches (for finer-grained control over the data loading order, take a look at Samplers).
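
As a rough mental model of what batching and shuffling mean, the plain-Python sketch below yields shuffled index batches. It is an illustration under simplifying assumptions, not PyTorch's implementation; the function name `iterate_minibatches` and the sample count of 20 are made up for the example.

```python
import math
import random


def iterate_minibatches(num_samples, batch_size, shuffle=True, seed=None):
    """Yield lists of sample indices, one list per minibatch."""
    indices = list(range(num_samples))
    if shuffle:
        # A fresh permutation of the indices (done once per epoch in training).
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


batches = list(iterate_minibatches(num_samples=20, batch_size=8, seed=0))
print(len(batches), math.ceil(20 / 8))   # 3 3
print([len(b) for b in batches])         # [8, 8, 4]
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size; every sample still appears exactly once per epoch.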


## Create an instance of the beyonce class

In [None]:
file_path = 'drive/MyDrive/LLM_data/beyonce.json'
dataset = beyonce(file_path)

## Set up the model for training


**We are using the following here**

- AdamW optimizer and cross-entropy loss function.

- The PyTorch DataLoader class to load the data in batches and shuffle it every epoch, so that the model does not see the examples in a fixed order.
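
For question answering, cross-entropy treats "which token starts the answer" as a classification over the positions in the sequence. The sketch below computes it in plain Python for a single example; it is meant only to show the arithmetic (torch.nn.CrossEntropyLoss applies the same formula and, by default, averages over the batch). The function name and the toy logits are assumptions for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the target position."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


# Four positions with equal scores: the model is guessing uniformly,
# so the loss is log(4) whichever position is correct.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
print(round(uniform_loss, 4))  # 1.3863

# A high score on the correct position gives a small loss,
# the same score on a wrong position gives a large one.
confident_right = cross_entropy([0.0, 5.0, 0.0], target_index=1)
confident_wrong = cross_entropy([0.0, 5.0, 0.0], target_index=0)
print(confident_right < confident_wrong)  # True
```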


**Notes**

We are using the BERT model variant for question answering, BertForQuestionAnswering.

REF
https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt




In [None]:
# Set device (CPU or GPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize the BERT model for question answering
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()
batch_size = 8
num_epochs = 50

# Create data loader
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)



Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




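In [None]:
# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in data_loader:
        # Move batch tensors to the device
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        start_positions = batch[2].to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass. The model only returns a loss when both
        # start_positions and end_positions are passed, and the dataset
        # yields start positions only, so we compute the loss on the
        # start logits with loss_fn.
        outputs = model(input_ids, attention_mask=attention_mask)
        start_logits = outputs.start_logits  # shape: (batch, seq_len)

        # CAVEAT: start_pos comes from the JSON as a character offset in
        # the context, not a token index. Clamping keeps the target in
        # range, but the offsets should be mapped to token positions
        # for the model to learn real answer spans.
        targets = start_positions.clamp(0, start_logits.size(1) - 1)
        loss = loss_fn(start_logits, targets)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Average the per-batch losses over the epoch
    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1}/{num_epochs} - Average Loss: {avg_loss:.4f}")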