### Installing the Necessary Package

In [1]:
#!pip install huggingface_hub
#!pip install langchain
#!pip install python-dotenv
#! pip install sentencepiece
#! pip install transformers torch
#! pip install accelerate -U

| Package           | License         | Purpose                                                                                          |
|-------------------|-----------------|--------------------------------------------------------------------------------------------------|
| huggingface_hub   | Apache License 2.0 | A client library to download and publish models on the Hugging Face Hub.                           |
| langchain         | MIT License     | A framework for chaining language models together to build applications.                          |
| python-dotenv     | BSD 2-Clause "Simplified" License | Reads key-value pairs from a `.env` file and can set them as environment variables.               |
| sentencepiece     | Apache License 2.0 | A library for unsupervised text tokenization, primarily used for neural network-based text processing. |
| transformers      | Apache License 2.0 | A state-of-the-art natural language processing library for training, converting, and using models from Hugging Face. |
| torch             | BSD License     | An open-source machine learning library developed by the Facebook AI Research lab.                 |


### API Key For  Hugging Face

To access the model from Hugging Face, an API key is required

### Importing the Necessary Package

In [2]:
import os
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np 
import torch
import logging
from dotenv import load_dotenv ,find_dotenv

### Loading API Token from .env File 

In [3]:
# Set up logging
logging.basicConfig(level=logging.INFO)

def load_api_token():
    try:
        # Load the .env file
        load_dotenv()
        # Retrieve the API token
        api_token = os.getenv('HUGGINGFACEHUB_API_TOKEN')
        if api_token is None:
            raise ValueError("API key not found in .env file")
        return api_token
    except Exception as e:
        logging.error(f"Error loading API token: {e}")
        return None


In [4]:
api_token = load_api_token()
if api_token:
    logging.info("API token loaded successfully")

INFO:root:API token loaded successfully


For the sake of testing, I am including an API key in the code, but it is generally advised against embedding API keys directly in the codebase.

In [5]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "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"

### Solution Part-1

###### 1: Use a pre-trained google/flan-t5-small as the model
##### 2. Verify if the summariza'on task works.
##### 3. Verify if the Q&A task works.
##### 4. Verify if English to French transla'on task works

Google/FLAN-T5-small is a model developed by Google, combining two significant concepts in AI and machine learning: FLAN and T5.
T5:Text-to-Text Transfer Transformer
FLAN :Fine-tuned Language Net

### Multiple methods to load the Flan-T5 Model

Although there are multiple methods to load the Flan-T5 Model, for demonstration purposes, I am opting to use Hugging Face hub using langchain and transformers.

### Approch 1: Using Transformers package

In [6]:
import logging
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

In [7]:
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

### 1: Use a pre-trained google/flan-t5-small as the model

In [8]:
def load_model(model_name="google/flan-t5-small"):
    try:
        logger.info(f"Loading model: {model_name}")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        logger.info("Model loaded successfully")
        return tokenizer, model
    except Exception as e:
        logger.error(f"An error occurred while loading the model: {e}")
        return None, None

tokenizer, model = load_model()
if tokenizer is not None and model is not None:
    # Model usage here
    pass


INFO:__main__:Loading model: google/flan-t5-small
INFO:__main__:Model loaded successfully


### 2. Verify if the summariza'on task works.
### 3. Verify if the Q&A task works.
### 4. Verify if English to French transla'on task works

In [9]:
#model

In [10]:
class Verify_flan_model:
    def __init__(self, model_name="google/flan-t5-small"):
        self.logger = logger
        try:
            logger.info(f"Loading model: {model_name}")
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
            self.logger.info("Model and tokenizer loaded successfully.")
        except Exception as e:
            self.logger.error(f"An error occurred while loading the model and tokenizer: {e}")

    def summarize(self, text_to_summarize):
        try:
            summarizer = pipeline("summarization", model=self.model, tokenizer=self.tokenizer)
            summary = summarizer(text_to_summarize)
            self.logger.info(f"Summarization: {summary}")
            return summary
        except Exception as e:
            self.logger.error(f"An error occurred in summarization: {e}")
            return None

    def answer_question(self, question, context):
        try:
            qa_input = f"question: {question} context: {context}"
            answer = self.model.generate(**self.tokenizer(qa_input, return_tensors="pt"))
            decoded_answer = self.tokenizer.decode(answer[0])
            self.logger.info(f"Q&A: {decoded_answer}")
            return decoded_answer
        except Exception as e:
            self.logger.error(f"An error occurred in Q&A: {e}")
            return None

    def translate_to_french(self, text_to_translate):
        try:
            translator = pipeline("translation_en_to_fr", model=self.model, tokenizer=self.tokenizer)
            translation = translator(text_to_translate)
            self.logger.info(f"Translation: {translation}")
            return translation
        except Exception as e:
            self.logger.error(f"An error occurred in translation: {e}")
            return None


### Verification : Executing the Function 

In [11]:
flan_model = Verify_flan_model()
# Summarization example
text_to_summarize = "India, a country that conjures up a vivid kaleidoscope of images,is as diverse as it is vast. It is a land where ancient traditions blend seamlessly with modern advancements. "
print("Summarization result:", flan_model.summarize(text_to_summarize))
# Question Answering example
question = "where is head office of ServiceNow"
context = "ServiceNow, Inc. is an American software company based in Santa Clara, California that develops a cloud computing platform to help companies manage digital workflows for enterprise operations"
print("Q&A result:", flan_model.answer_question(question, context))
# Translation example
text_to_translate = "How are you today?"
print("Translation result:", flan_model.translate_to_french(text_to_translate))

INFO:__main__:Loading model: google/flan-t5-small
INFO:__main__:Model and tokenizer loaded successfully.
Your max_length is set to 200, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
INFO:__main__:Summarization: [{'summary_text': "India is one of the world's most diverse and diverse regions, and it is a country where ancient traditions blend seamlessly with modern advancements."}]


Summarization result: [{'summary_text': "India is one of the world's most diverse and diverse regions, and it is a country where ancient traditions blend seamlessly with modern advancements."}]


INFO:__main__:Q&A: <pad> Santa Clara, California, is the head office of ServiceNow, Inc., an American software company that develops a cloud computing platform to help companies manage digital workflows for enterprise operations.</s>


Q&A result: <pad> Santa Clara, California, is the head office of ServiceNow, Inc., an American software company that develops a cloud computing platform to help companies manage digital workflows for enterprise operations.</s>


INFO:__main__:Translation: [{'translation_text': " c'est aujourd'hui d'aujourd'hui?  quoi est à l'heure actuelle? - et à la fois?"}]


Translation result: [{'translation_text': " c'est aujourd'hui d'aujourd'hui?  quoi est à l'heure actuelle? - et à la fois?"}]


### Approch 2:Using  langchain and huggingface hub

In [12]:
from langchain.llms import HuggingFaceHub
from langchain.prompts import PromptTemplate

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [13]:
class LanguageModelHandler:
    def __init__(self, model_repo_id):
        # Initialize HuggingFace Language Model
        self.llm = HuggingFaceHub(repo_id=model_repo_id)
    
    def get_prompt(self, task, input_text, context=None):
        if task == "summarization":
            return f"Summarize the following text: {input_text}"
        elif task == "q&a":
            return f"Question: {input_text}\nContext: {context}\nAnswer:"
        elif task == "translation":
            return f"Translate the following English text to French: {input_text}"
        else:
            raise ValueError("Unsupported task")
    
    def execute_task(self, task, input_text, context=None):
        try:
            prompt = self.get_prompt(task, input_text, context)
            result = self.llm(prompt)
            return result
        except Exception as e:
            logger.error(f"An error occurred: {e}")
            return None




#### Verification : Executing the Function

In [14]:
# Create an instance of the class
lm_handler = LanguageModelHandler(model_repo_id="google/flan-t5-small")
# Summarization Example
summarization_input = "India, a country that conjures up a vivid kaleidoscope of images,is as diverse as it is vast. It is a land where ancient traditions blend seamlessly with modern advancements. "
summary = lm_handler.execute_task("summarization", summarization_input)
print(f"Summarization: {summary}")
# Q&A Example
qa_question = "where is head office of ServiceNow"
qa_context = "ServiceNow, Inc. is an American software company based in Santa Clara, California that develops a cloud computing platform to help companies manage digital workflows for enterprise operations"
answer = lm_handler.execute_task("q&a", qa_question, qa_context)
print(f"Q&A: {answer}")

# Translation Example
translation_input = "How are you today?"
translation = lm_handler.execute_task("translation", translation_input)
print(f"Translation: {translation}")

Summarization: India is a country that has a thriving culture and culture, and is a
Q&A: Santa Clara, California
Translation: Comment est aujourd'hui?


## Solution  Part :2

### 5.Print the names of all the model layers and their dimensions.
### 6.Print the total number of parameters/weights in this model.

In [15]:
def load_model_hf_tf(model_name="transformers"):
    try:
        if model_name == "huggingface":
            # Assuming HuggingFaceHub is a custom function or class you have defined
            # Replace 'HuggingFaceHub' with the actual function/class you're using to load the model
            model = HuggingFaceHub(repo_id="google/flan-t5-large")
        elif model_name == "transformers":
            tokenizer, model = load_model()
        else:
            raise ValueError("Unsupported model type specified")
        
        logger.info(f"Model loaded successfully: {model_name}")
        return model

    except Exception as e:
        logger.error(f"An error occurred while loading the model: {e}")
        return None

In [16]:
#model = load_model_hf_tf("huggingface")
#model = load_model_hf_tf("transformers")

In [17]:
def get_model_details(model):
    try:
        # Initialize a list to store model details
        model_details_list = []

        # Iterate through model parameters and add to the list
        for name, param in model.named_parameters():
            model_details_list.append({'Layer Name': name, 'Dimension': str(param.size())})

        # Convert list to DataFrame
        model_details_df = pd.DataFrame(model_details_list)

        # Calculate total parameters
        total_params = sum(p.numel() for p in model.parameters())
        logger.info(f"Total number of parameters: {total_params}")

        return model_details_df, total_params

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        return pd.DataFrame(), 0


In [18]:
model = load_model_hf_tf("transformers")
# Get model details
model_info_df, total_parameters = get_model_details(model)

# Display DataFrame
model_info_df

INFO:__main__:Loading model: google/flan-t5-small
INFO:__main__:Model loaded successfully
INFO:__main__:Model loaded successfully: transformers
INFO:__main__:Total number of parameters: 76961152


Unnamed: 0,Layer Name,Dimension
0,shared.weight,"torch.Size([32128, 512])"
1,encoder.block.0.layer.0.SelfAttention.q.weight,"torch.Size([384, 512])"
2,encoder.block.0.layer.0.SelfAttention.k.weight,"torch.Size([384, 512])"
3,encoder.block.0.layer.0.SelfAttention.v.weight,"torch.Size([384, 512])"
4,encoder.block.0.layer.0.SelfAttention.o.weight,"torch.Size([512, 384])"
...,...,...
185,decoder.block.7.layer.2.DenseReluDense.wi_1.we...,"torch.Size([1024, 512])"
186,decoder.block.7.layer.2.DenseReluDense.wo.weight,"torch.Size([512, 1024])"
187,decoder.block.7.layer.2.layer_norm.weight,torch.Size([512])
188,decoder.final_layer_norm.weight,torch.Size([512])


## Solution Part :3 

### 7. Set the tensor in final layer (decoder.final_layer_norm.weight) to all zeros.
### 8.Verify if the Q&A task works after resetting the weights of the above layer
### 9.Replace the decoder.final_layer_norm.weight with a layer of smaller dimensions and adjust all the dependent layers to match the dimension

In [19]:
# Load the model
model = load_model_hf_tf("transformers")
#Task 7: Set the tensor in the final layer to all zeros
model.decoder.final_layer_norm.weight.data.zero_()

INFO:__main__:Loading model: google/flan-t5-small
INFO:__main__:Model loaded successfully
INFO:__main__:Model loaded successfully: transformers


tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

In [20]:
# Task 8: Test the model's Q&A functionality

In [21]:
def test_qa_functionality(model, question, context):
    try:
        qa_input = f"question: {question} context: {context}"
        answer = model.generate(**tokenizer(qa_input, return_tensors="pt"))
        decoded_answer =tokenizer.decode(answer[0])
        logger.info(f"Q&A: {decoded_answer}")
        return decoded_answer
    except Exception as e:
        logger.error(f"An error occurred in Q&A: {e}")
        return None

In [22]:
qa_question = "where is head office of ServiceNow"
qa_context = "ServiceNow, Inc. is an American software company based in Santa Clara, California that develops a cloud computing platform to help companies manage digital workflows for enterprise operations"

In [23]:
test_qa_functionality(model, qa_question, qa_context)

INFO:__main__:Q&A: <pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


'<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

#### Impact of Zeroing Weights

By setting the weights of this normalization layer to zero, effectively nullify the normalization process. This drastic change disrupts the balance and scaling that the model relies on for generating coherent outputs.
Without proper normalization, the subsequent layers (like the output linear layer) receive unnormalized or improperly scaled inputs, leading to unpredictable and likely meaningless outputs.

### Pad 

The output  observing, a series of '<pad>' tokens, indicates that the model's ability to generate meaningful text is severely impaired. In T5 and similar models, the '<pad>' token is used to fill in sequences to a uniform length.
When the model's internals are significantly disrupted (as is the case here), it might default to outputting '<pad>' tokens, essentially indicating that it cannot generate any meaningful text based on the given input.

### Summary :model.decoder.final_layer_norm.weight.data.zero_()

In summary, setting the final layer normalization weights to zero disrupts the delicate balance and scaling within the model, leading to outputs that are essentially meaningless pad tokens. This demonstrates how critical each component of a transformer model is to its overall functionality and language generation capabilities.

Task 9:Replace the decoder.final_layer_norm.weight with a layer of smaller dimensions and 
adjust all the dependent layers to match the dimension

In [24]:
# Inspect the last block of the decoder
last_block = model.decoder.block[-1]

# Print the components of the last block
for name, module in last_block.named_children():
    print(f"{name}: {module}")

layer: ModuleList(
  (0): T5LayerSelfAttention(
    (SelfAttention): T5Attention(
      (q): Linear(in_features=512, out_features=384, bias=False)
      (k): Linear(in_features=512, out_features=384, bias=False)
      (v): Linear(in_features=512, out_features=384, bias=False)
      (o): Linear(in_features=384, out_features=512, bias=False)
    )
    (layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (1): T5LayerCrossAttention(
    (EncDecAttention): T5Attention(
      (q): Linear(in_features=512, out_features=384, bias=False)
      (k): Linear(in_features=512, out_features=384, bias=False)
      (v): Linear(in_features=512, out_features=384, bias=False)
      (o): Linear(in_features=384, out_features=512, bias=False)
    )
    (layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (2): T5LayerFF(
    (DenseReluDense): T5DenseGatedActDense(
      (wi_0): Linear(in_features=512, out_features=1024, bias=False)
      (wi_1): Linear(in_fea

In [25]:
layer = last_block.layer[2].layer_norm
layer

T5LayerNorm()

In [26]:
import torch.nn as nn

def ModifyLayerDimensions(layer, new_size):
    """
    Modify the dimensions of a given neural network layer.

    Parameters:
    layer (nn.Module): The layer to be modified.
    new_size (int): The new dimension size.

    Returns:
    nn.Module: The modified layer.
    """

    if isinstance(layer, nn.Linear):
        # For a linear layer, change the number of output features
        new_layer = nn.Linear(in_features=layer.in_features, out_features=new_size)
        
        # Optionally, copy the existing weights and biases, up to the new size
        with torch.no_grad():
            new_layer.weight[:layer.out_features, :new_size].copy_(layer.weight.data)
            if layer.bias is not None:
                new_layer.bias[:new_size].copy_(layer.bias.data)

        return new_layer

    else:
        raise TypeError("Layer modification for this layer type is not implemented")


In [27]:
# Define a function to modify the model
def modify_model(model, new_size):
    try:
        # Replace the final layer normalization in the decoder
        model.decoder.final_layer_norm = torch.nn.LayerNorm(new_size)
        # Adjust dependent layers
        decoder_block = model.decoder.block[-1]
        layer = decoder_block.layer[2].layer_norm
        #model.decoder.some_layer = ModifyLayerDimensions(layer, new_size)
        logger.info("Model modification successful.")

    except Exception as e:
        logger.error(f"An error occurred during model modification: {e}")


In [28]:
# Define the new size
new_size = 256 
# Call the function to modify the model
modify_model(model, new_size)

INFO:__main__:Model modification successful.


### Solution : Part 4

### 10. Reload the original google/flan-t5-small model.
#### 11. Train the model for a Q&A task that takes a context as addi'onal input along with the ques'on. You can use SQuAD dataset (h_ps://rajpurkar.github.io/SQuAD-explorer/ ) or the smaller Topioca dataset (h_ps://mcgill-nlp.github.io/topiocqa/) . Choose an appropriate task prefix/trigger word and jus'fy the choice.
#### 12. Evaluate the quality of the model

##### 10. Reload the original google/flan-t5-small model 

In [29]:
tokenizer, model = load_model()

INFO:__main__:Loading model: google/flan-t5-small
INFO:__main__:Model loaded successfully


In [30]:
#model

In [31]:
model.config

T5Config {
  "_name_or_path": "google/flan-t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "ma

### Fine tuning step 

#### 1. Dataset Download
#### 2.Data Preparation
#### 3.Model:Choose the FLAN T5 :Load the Pre-trained Model:
#### 4.Fine-Tuning
#### 5.Evaluation


#### Dataset

https://mcgill-nlp.github.io/topiocqa/

#### About Topioca Dataset

Format: Each entry in the dataset consists of a question and its corresponding answer. The dataset may also include a context paragraph providing background or related information for the question, but the answer does not have to be a direct excerpt from this context.
Diversity: The dataset includes a variety of question types and subjects, making it a comprehensive resource for training generalized QA model.

Example

Context: "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."

Question: "Where is the Eiffel Tower located?"

Answer: "Paris, France"

In [32]:
import json
import os
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader, RandomSampler
# Configure basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')



In [33]:
def load_json_file_formatting(path, file_name):
    """
    Loads a JSON file from the given path and file name.

    Args:
    path (str): The path to the directory containing the JSON file.
    file_name (str): The name of the JSON file.

    Returns:
    dict: The content of the JSON file, or None if an error occurs.
    """
    full_path = os.path.join(path, file_name)

    try:
        with open(full_path, 'r') as f:
            data = json.load(f)
            logging.info(f"Successfully loaded file: {full_path}")
            return data
    except FileNotFoundError:
        logging.error(f"File not found: {full_path}")
    except json.JSONDecodeError as e:
        logging.error(f"Error decoding JSON from file {file_name}: {e}")
    except Exception as e:
        logging.error(f"An unexpected error occurred while reading {file_name}: {e}")

    return None

#### reading the training file  

In [34]:
# Get the current working directory
current_working_directory = os.getcwd()

In [35]:
# Example usage
folder_path = current_working_directory
qa_train = 'topiocqa_train.json'
qa_dev="topiocqa_dev.json"

In [36]:
data_train = load_json_file_formatting(folder_path, qa_train)

INFO:root:Successfully loaded file: C:\Users\verma\OneDrive\Desktop\LLM_TUTORIAL\topiocqa_train.json


In [37]:
#data_train["data"]

In [38]:
data_dev = load_json_file_formatting(folder_path, qa_dev)

INFO:root:Successfully loaded file: C:\Users\verma\OneDrive\Desktop\LLM_TUTORIAL\topiocqa_dev.json


#### Data Preprocessing and formating

In [39]:
def json_to_dataframe(json_file_path):
    # Read the JSON file
    with open(json_file_path, 'r') as file:
        data = json.load(file)
    
    # Extracting the relevant fields
    df_data = []
    for item in data:
        context = item.get('Context', '')
        question = item.get('Question', '')
        answer = item.get('Answer', '')

        # Add to the list as a tuple
        df_data.append((context, question, answer))

    # Create a DataFrame
    df = pd.DataFrame(df_data, columns=['Context', 'Question', 'Answer'])
    return df


In [40]:
topioca_data_frame= json_to_dataframe(qa_train)
topioca_data_frame_dev= json_to_dataframe(qa_dev)

### saving data in csv file format 

In [41]:
#topioca_data_frame.to_csv("Topioca_data.csv",index=False)
#topioca_data_frame_dev.to_csv("Topioca_data_dev.csv",index=False)

In [42]:
#reading the csv file

In [43]:
topica_csv=pd.read_csv("Topioca_data.csv")

In [44]:
dev_topica_csv=pd.read_csv("Topioca_data_dev.csv")

In [45]:
topica_csv.head()

Unnamed: 0,Context,Question,Answer
0,[],what was australia's contribution to the battl...,The army personnel and thousands of Australian...
1,"[""what was australia's contribution to the bat...",was the battle fought in australia?,UNANSWERABLE
2,"[""what was australia's contribution to the bat...",when was the battle fought?,1944
3,"[""what was australia's contribution to the bat...",who fought in this battle?,Australians and British
4,"[""what was australia's contribution to the bat...",was this battle part of a bigger war?,UNANSWERABLE


In [46]:
dev_topica_csv.head()

Unnamed: 0,Context,Question,Answer
0,[],when will the new dunkirk film be released on dvd,18 December 2017
1,['when will the new dunkirk film be released o...,what is this film about?,Dunkirk evacuation of World War II
2,['when will the new dunkirk film be released o...,can you mention a few members of the cast?,"Fionn Whitehead, Tom Glynn-Carney, Jack Lowden..."
3,['when will the new dunkirk film be released o...,where was it shot?,"Dunkirk and Urk, Netherlands"
4,['when will the new dunkirk film be released o...,can you tell me anything about the latter plac...,Until 1475 the High and Low Lordship of Urk an...


In [47]:
topica_csv.isna().sum()

Context     0
Question    0
Answer      0
dtype: int64

#### For training, a smaller subset of data is being selected due to limitations in CPU and GPU resources.

In [48]:
topica_csv_sample=topica_csv.iloc[0:4000,:]

### Data Prepration Train Val set 

In [49]:
def prepare_dataset(csv_data):
    """
    Prepares the dataset by creating input and target text columns, and splitting the data into training and validation sets.

    Parameters:
    csv_data (DataFrame): A pandas DataFrame containing the dataset with 'Context', 'Question', and 'Answer' columns.

    Returns:
    tuple: A tuple containing two DataFrames (train_df, val_df) for training and validation sets.
    """
    try:
        # Add input_text and target_text columns
        csv_data['input_text'] = 'context: ' + csv_data['Context'] + ' question: ' + csv_data['Question']
        csv_data['target_text'] = csv_data['Answer']

        # Split the dataset into training and validation sets
        train_df = csv_data.sample(frac=0.8, random_state=200)
        val_df = csv_data.drop(train_df.index)

        logging.info("Dataset preparation complete.")
        return train_df, val_df

    except Exception as e:
        logging.error(f"Error occurred during dataset preparation: {e}")
        raise


In [50]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments

INFO:datasets:PyTorch version 2.1.2 available.


In [51]:
# Initialize the tokenizer
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-small')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [52]:
train_df,val_df=prepare_dataset(topica_csv_sample)

INFO:root:Dataset preparation complete.


In [53]:
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-small')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [54]:
train_df.head()

Unnamed: 0,Context,Question,Answer,input_text,target_text
501,"['what is durian?', 'It is the fruit of severa...",what kind of places does the first one live in?,Lowland and hill forests,"context: ['what is durian?', 'It is the fruit ...",Lowland and hill forests
3118,"['where did the field of fire take place', 'UN...",has he been referred to in popular culture or ...,"John Betjeman's poem ""A Shropshire Lad"" (1940)...",context: ['where did the field of fire take pl...,"John Betjeman's poem ""A Shropshire Lad"" (1940)..."
228,"[""who's directing the new blade runner movie"",...",are there concerns over the environment?,A federal environmental programme was establis...,"context: [""who's directing the new blade runne...",A federal environmental programme was establis...
2879,[],who is calvin coolidge?,American politician and lawyer who served as t...,context: [] question: who is calvin coolidge?,American politician and lawyer who served as t...
1220,['when does the punisher make an appearance in...,who is the writer?,UNANSWERABLE,context: ['when does the punisher make an appe...,UNANSWERABLE


In [55]:
# Custom Dataset class
class QADataset(Dataset):
    def __init__(self, tokenizer, data):
        self.tokenizer = tokenizer
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        inputs = self.tokenizer.encode_plus(
            row['input_text'], 
            None, 
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=False,
            return_attention_mask=True,
            truncation=True
        )
        outputs = self.tokenizer.encode(
            row['target_text'],
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True
        )

        return {
            'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
            'labels': torch.tensor(outputs, dtype=torch.long)
        }

In [56]:
# Prepare datasets
train_dataset = QADataset(tokenizer, train_df)
val_dataset = QADataset(tokenizer, val_df)

In [57]:
train_dataset.data

Unnamed: 0,Context,Question,Answer,input_text,target_text
501,"['what is durian?', 'It is the fruit of severa...",what kind of places does the first one live in?,Lowland and hill forests,"context: ['what is durian?', 'It is the fruit ...",Lowland and hill forests
3118,"['where did the field of fire take place', 'UN...",has he been referred to in popular culture or ...,"John Betjeman's poem ""A Shropshire Lad"" (1940)...",context: ['where did the field of fire take pl...,"John Betjeman's poem ""A Shropshire Lad"" (1940)..."
228,"[""who's directing the new blade runner movie"",...",are there concerns over the environment?,A federal environmental programme was establis...,"context: [""who's directing the new blade runne...",A federal environmental programme was establis...
2879,[],who is calvin coolidge?,American politician and lawyer who served as t...,context: [] question: who is calvin coolidge?,American politician and lawyer who served as t...
1220,['when does the punisher make an appearance in...,who is the writer?,UNANSWERABLE,context: ['when does the punisher make an appe...,UNANSWERABLE
...,...,...,...,...,...
3877,"['is the simpsons a type of animation?', 'Yes,...",what else did the first voice artist do after ...,"He made guest appearances in TV shows, includi...",context: ['is the simpsons a type of animation...,"He made guest appearances in TV shows, includi..."
2169,[],when did immigrants stop coming to ellis island,1924,context: [] question: when did immigrants stop...,1924
2196,"[""what is 'the wrong goodbye (gossip girl)'?"",...",is there any controversy related to that chara...,"Yes, Chuck's public humiliation of Blair, his ...","context: [""what is 'the wrong goodbye (gossip ...","Yes, Chuck's public humiliation of Blair, his ..."
2537,[],when do they leave the farm walking dead,in the 13th and final episode of the second se...,context: [] question: when do they leave the f...,in the 13th and final episode of the second se...


In [58]:
# Model
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-small')

In [59]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

In [60]:
# Train the model
trainer.train()

Step,Training Loss
10,43.5069
20,41.5737
30,38.7892
40,34.8326
50,30.5375
60,24.9973
70,18.1373
80,11.313
90,7.4454
100,5.7637


TrainOutput(global_step=200, training_loss=14.906954383850097, metrics={'train_runtime': 15924.7566, 'train_samples_per_second': 0.201, 'train_steps_per_second': 0.013, 'total_flos': 594849615052800.0, 'train_loss': 14.906954383850097, 'epoch': 1.0})

In [61]:
model.save_pretrained('./flan-T5-QA-Finetune')

### Evaluating a trained model

Evaluating a trained model for a task like question-answering can be done using various metrics such as ROUGE, BLEU, and perplexity

In [64]:
import nltk
import rouge_score


In [69]:
#!pip install sacrebleu

In [70]:
#nltk.download('all')

In [72]:
from datasets import load_metric
# Load the trained model and tokenizer
model = T5ForConditionalGeneration.from_pretrained('./flan-T5-QA-Finetune')
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-small')

# Load your validation dataset
val_df = val_df

# Metrics
rouge = load_metric('rouge')
bleu = load_metric('sacrebleu')

# Function to generate answers from the model
def generate_answer(input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids)[0]
    return tokenizer.decode(output_ids, skip_special_tokens=True)

# Evaluate ROUGE and BLEU
for index, row in val_df.iterrows():
    context = row['Context']
    question = row['Question']
    answer = row['Answer']
    input_text = f"context: {context} question: {question}"
    
    predicted_answer = generate_answer(input_text)
    rouge.add(prediction=predicted_answer, reference=answer)
    bleu.add(prediction=[predicted_answer.split()], reference=[[answer.split()]])

# Calculate final scores
rouge_score = rouge.compute()
bleu_score = bleu.compute()

print("ROUGE Score:", rouge_score)
print("BLEU Score:", bleu_score)

# Evaluate Perplexity
def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs[0])

# Example: Calculate perplexity for a sample text
sample_text = "Your sample text here."
perplexity = calculate_perplexity(sample_text)
print("Perplexity:", perplexity.item())


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (608 > 512). Running this sequence through the model will result in indexing errors
INFO:absl:Using default tokenizer.


ROUGE Score: {'rouge1': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.00375, recall=0.0006547619047619048, fmeasure=0.0008914728682170542), high=Score(precision=0.00875, recall=0.0021138392857142853, fmeasure=0.0029543298653610752)), 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.0, recall=0.0, fmeasure=0.0), high=Score(precision=0.0, recall=0.0, fmeasure=0.0)), 'rougeL': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.00375, recall=0.0006944444444444445, fmeasure=0.0009649122807017543), high=Score(precision=0.007531249999999971, recall=0.0021436011904761897, fmeasure=0.0030124694002447963)), 'rougeLsum': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.00375, recall=0.0006845238095238096, fmeasure=0.000949612403100775), high=Score(precision=0.00875, recall=0.002113095238095238, fmeasure=0.0029528763769889835))}


### Next Step For Fine Tuning : Retrieval-Augmented Generation (RAG) model  