<a href="https://colab.research.google.com/github/anne6808/NLP-Project/blob/main/T5_Small.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before continuing, make sure the working directory contains the following files, which can be found in the drive linked on GitHub: \
modified_squad_data.json \
train-v2.0.json

These files need to be in the same directory as the sample_data folder, so that the directory looks like the following:

| .. \
| > sample_data \
|  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sample_data_contents \
| train-v2.0.json \
| modified_squad_data.json

This layout matters when running on Colab. When running locally, the datasets can be stored anywhere, but you will need to change the file paths in the code.
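
As a quick sanity check before running the rest of the notebook, the sketch below lists which of the two required files are missing. The helper name `missing_files` is hypothetical, and `/content` is assumed to be the working directory because that is the path the loading cells use; adjust it for a local run.

```python
from pathlib import Path

# The two dataset files named in the instructions above
REQUIRED_FILES = ["train-v2.0.json", "modified_squad_data.json"]

def missing_files(base_dir, names=REQUIRED_FILES):
    """Return the names from `names` that are not present in `base_dir`."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]

# An empty list means both files are in place
print("Missing files:", missing_files("/content"))
```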

Run the following cell to install and update all dependencies

In [2]:
!pip install -U "transformers[torch]" accelerate
!pip install datasets evaluate
!pip install Cython
!pip install rouge

Installing collected packages: rouge
Successfully installed rouge-1.0.1


Implementation heavily based on the following article: https://medium.com/@ajazturki10/simplifying-language-understanding-a-beginners-guide-to-question-answering-with-t5-and-pytorch-253e0d6aac54

Import all required libraries

In [3]:
import torch
import json
import random
import torch.nn as nn
import nltk
import spacy
import string
import evaluate  # Bleu
import pandas as pd
import numpy as np
import transformers
import matplotlib.pyplot as plt


from tqdm import tqdm
from torch.optim import Adam
from torch.utils.data import Dataset, DataLoader, RandomSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration, T5TokenizerFast, pipeline
from nltk.corpus import wordnet as wn
from evaluate import load
from datasets import load_dataset

# Download WordNet if not already downloaded
nltk.download('wordnet')
nltk.download('punkt')



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Define the tokenizer, model, optimizer, and the question and target lengths
TOKENIZER = T5TokenizerFast.from_pretrained("t5-small")
MODEL = T5ForConditionalGeneration.from_pretrained("t5-small", return_dict=True)
OPTIMIZER = Adam(MODEL.parameters(), lr=0.00001)
Q_LEN = 256   # Question Length
T_LEN = 32    # Target Length
BATCH_SIZE = 4
DEVICE = "cpu"


In [5]:
# Loading the data
# If running on local change the route to where the dataset is located
with open('/content/train-v2.0.json') as f:
    data_as_json = json.load(f)

Example of some of the data found in the JSON file. In this nested form it is not easily accessible.

In [6]:
print('context: ' + data_as_json['data'][0]['paragraphs'][0]['context'])
print('question: '+ data_as_json['data'][0]['paragraphs'][0]['qas'][0]['question'])
print('answer: ' + data_as_json['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['text'])

context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
question: When did Beyonce start becoming popular?
answer: in the late 1990s


In [7]:
# Extract context, question, and answer from the nested SQuAD JSON
def prepare_data(data):
    articles = []

    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD v2.0 marks unanswerable questions with is_impossible and gives them
                # no answers. Skip them so that no row reuses the previous question's answer.
                if qa["is_impossible"] or not qa["answers"]:
                    continue

                inputs = {
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answer": qa["answers"][0]["text"],
                }
                articles.append(inputs)

    return articles
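
To illustrate how the nested SQuAD v2.0 layout flattens into rows, here is a minimal sketch on invented toy data. The function name `flatten_squad` and the toy record are assumptions made for illustration; the sketch skips unanswerable questions, which carry an empty `answers` list.

```python
def flatten_squad(data):
    """Flatten SQuAD-style JSON into a list of context/question/answer rows."""
    rows = []
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions carry no answer text, so skip them
                if qa.get("is_impossible", False) or not qa["answers"]:
                    continue
                rows.append({
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answer": qa["answers"][0]["text"],
                })
    return rows

# Toy data in the SQuAD v2.0 layout (invented for illustration)
toy = {"data": [{"paragraphs": [{
    "context": "The sky is blue.",
    "qas": [
        {"question": "What color is the sky?", "is_impossible": False,
         "answers": [{"text": "blue", "answer_start": 11}]},
        {"question": "What color is the sea?", "is_impossible": True,
         "answers": []},
    ],
}]}]}

rows = flatten_squad(toy)
print(len(rows), rows[0]["answer"])  # 1 blue
```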

In [8]:
data = prepare_data(data_as_json)

# Create a Dataframe
data_as_pd = pd.DataFrame(data)

# Limit the data to 125 rows. The 80/20 split later gives 100 training rows and 25 validation rows
shuffled_data = data_as_pd.sample(frac=1).reset_index(drop=True)[:125]

# Print some of the data. Much easier to access and manipulate now
shuffled_data.head()

| | context | question | answer |
|---|---|---|---|
| 0 | In additive color devices such as computer dis... | At what wavelength is green on computer displays? | ~550 nm |
| 1 | If a more consistent report with the genetic g... | How much of the population of Brazil reported ... | 42.4% |
| 2 | It is estimated that in 480 BC, 50 million peo... | How many people lived in the Archaemenid Empir... | 50 million people |
| 3 | Lie groups are of fundamental importance in mo... | What are the translations in a group of Minkow... | Poincaré group |
| 4 | It is generally considered that the Pacific Wa... | Has Japan ever attacked Thailand? | Japan invaded Thailand |


Heuristic 1: Changing where the answer is \
An extractive QA model can learn that the answer tends to appear early in the context, which skews its effectiveness. Generative models are speculated to suffer less from this, as mentioned in [this](https://arxiv.org/pdf/2004.14602.pdf) article. We can test this by adjusting our contexts: by using only contexts and questions whose answer lies in the first half of the context, we can see whether the model is affected by this position bias.

In [9]:
# This function is meant to limit the data to questions whose answer is in the first half of the context.
# As written, it keeps the first question of each unique context among the first 2000 rows
# and does not check where the answer occurs.
def in_first_half(data):
  first_per_context = data[:2000].drop_duplicates(subset='context')
  return first_per_context[:125].reset_index(drop=True)
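
One way to select rows by answer position is sketched below. It is an assumption-laden illustration, not the notebook's method: the helper names are hypothetical, and because the flattened rows do not keep SQuAD's `answer_start` field, the position is approximated by the first occurrence of the answer string in the context.

```python
def answer_in_first_half(context, answer):
    """True if the first occurrence of `answer` starts in the first half of `context`."""
    start = context.find(answer)
    return start != -1 and start < len(context) / 2

def filter_first_half(rows):
    """Keep only the rows whose answer starts in the first half of the context."""
    return [r for r in rows if answer_in_first_half(r["context"], r["answer"])]

# Invented examples: same answer, early versus late in the context
rows = [
    {"context": "Paris is the capital of France.", "question": "What is the capital?", "answer": "Paris"},
    {"context": "The capital of France is Paris.", "question": "What is the capital?", "answer": "Paris"},
]
kept = filter_first_half(rows)
print(len(kept), kept[0]["context"])  # 1 Paris is the capital of France.
```

On a DataFrame, the same predicate could be applied row-wise with `DataFrame.apply(..., axis=1)` to build a boolean mask.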

Heuristic 2: Using synonyms to vary context wording

In QA systems, the wording of the context can significantly affect the model's performance. By substituting synonyms from a lexical resource such as WordNet, we vary the wording of the context sentences. This approach aims to help the model handle variations in language and still identify the relevant information. By experimenting with synonym substitution, we can assess how different word choices influence the model's ability to answer questions accurately.

In [10]:
# Replace words with synonyms using WordNet.
# Note: tokens with no WordNet synset (e.g. "This", "with", punctuation) are dropped from the output.
def replace_with_synonyms(text):
    tokens = nltk.word_tokenize(text)
    synonyms = []
    for token in tokens:
        synsets = wn.synsets(token)
        if synsets:
            # Take the first lemma of the first synset
            synonyms.append(synsets[0].lemmas()[0].name())
    return ' '.join(synonyms)



In [11]:
sample_text = "This is a test sentence with some words."

# Apply the function to the sample text
replaced_text = replace_with_synonyms(sample_text)

# Print out the original and replaced text for comparison
print("Original Text:")
print(sample_text)
print("\nReplaced Text:")
print(replaced_text)


Original Text:
This is a test sentence with some words.

Replaced Text:
be angstrom trial sentence some words
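
The output above shows that tokens without a WordNet entry ("This", "with", the final period) disappear from the text. A possible variant that keeps such tokens is sketched here. To stay runnable without downloads it uses a small hypothetical dictionary in place of WordNet, so the names `TOY_SYNONYMS` and `replace_keep_unmatched` are illustrative assumptions only.

```python
# Toy lookup standing in for WordNet (hypothetical; real code would call wn.synsets)
TOY_SYNONYMS = {"is": "be", "test": "trial"}

def replace_keep_unmatched(tokens, lookup=TOY_SYNONYMS):
    """Swap tokens found in `lookup`; leave every other token unchanged."""
    return [lookup.get(token.lower(), token) for token in tokens]

tokens = "This is a test sentence with some words .".split()
print(" ".join(replace_keep_unmatched(tokens)))
# This be a trial sentence with some words .
```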


In [12]:
# This function is used to generate the synonyms dataset
def get_synonyms(data):
  replaced_with_synonyms=data[:1000].copy()
  # Replace the words in the 'context' column of the first 1000 rows with synonyms
  new_context=np.array([replace_with_synonyms(i) for i in data['context'][:1000]])
  replaced_with_synonyms['context']=new_context

  return replaced_with_synonyms.sample(frac=1).reset_index(drop=True)[:125]

In [13]:
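# This function paraphrases a question by rewording its opening (e.g. 'What is' -> 'What exactly is').
# One option from the first matching rule is picked at random; an option whose pattern is absent leaves the question unchanged.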
def paraphrase_question(question):
    transformations = [
        {
            'condition': lambda q: q.startswith('What'),
            'options': [
                lambda q: q.replace('What is', 'What exactly is'),
                lambda q: q.replace('What', 'Could you tell me what'),
                lambda q: q.replace('What are', 'What exactly are')
            ]
        },
        {
            'condition': lambda q: q.startswith('Where'),
            'options': [
                lambda q: q.replace('Where', 'Where exactly'),
                lambda q: q.replace('Where', 'Could you specify where'),
                lambda q: q.replace('Where can', 'Where is it possible to')
            ]
        },
        {
            'condition': lambda q: q.startswith('When'),
            'options': [
                lambda q: q.replace('When', 'When exactly'),
                lambda q: q.replace('When did', 'On what date did'),
                lambda q: q.replace('When', 'Can you specify when')
            ]
        },
        {
            'condition': lambda q: q.startswith('Who'),
            'options': [
                lambda q: q.replace('Who', 'Who exactly'),
                lambda q: q.replace('Who was', 'Who was the person who was'),
                lambda q: q.replace('Who', 'Can you tell me who')
            ]
        },
        {
            'condition': lambda q: q.startswith('Why'),
            'options': [
                lambda q: q.replace('Why', 'Can you explain why'),
                lambda q: q.replace('Why', 'What are the reasons why'),
                lambda q: q.replace('Why did', 'What prompted')
            ]
        },
        {
            'condition': lambda q: q.startswith('How'),
            'options': [
                lambda q: q.replace('How', 'How exactly'),
                lambda q: q.replace('How can', 'In what way can'),
                lambda q: q.replace('How do', 'What methods do')
            ]
        },
        {
            'condition': lambda q: 'do you' in q.lower(),
            'options': [
                lambda q: q.replace('do you', 'does one'),
                lambda q: q.replace('Do you', 'Does someone'),
            ]
        }
    ]

    for rule in transformations:
        if rule['condition'](question):
            question = random.choice(rule['options'])(question)
            break

    return question
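
Because `str.replace` rewrites every occurrence of a pattern, a question that repeats its opening word later on would be changed in more than one place. The sketch below shows one way to rewrite only the leading words and to seed the random choice for repeatable runs. The rule table and the name `paraphrase_prefix` are hypothetical and cover just two prefixes.

```python
import random

# Hypothetical rule table: question prefix -> candidate replacements
PREFIX_RULES = {
    "When did": ["On what date did", "When exactly did"],
    "What is": ["What exactly is"],
}

def paraphrase_prefix(question, rules=PREFIX_RULES, rng=random):
    """Rewrite only the leading words of `question`, leaving the rest untouched."""
    for prefix, options in rules.items():
        if question.startswith(prefix):
            return rng.choice(options) + question[len(prefix):]
    return question

rng = random.Random(0)  # seeded so repeated runs give the same paraphrases
print(paraphrase_prefix("What is known about What Cheer, Iowa?", rng=rng))
# What exactly is known about What Cheer, Iowa?
```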

In [14]:
# This takes the JSON dataset and paraphrases all of the questions (squad_data is modified in place)
def modify_questions(squad_data):
    for article in tqdm(squad_data['data']):
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                original_question = qa['question']
                qa['question'] = paraphrase_question(original_question)
    return squad_data

In [15]:
# This allows us to convert any dataset into a tokenized dataset that can be trained
class QA_Dataset(Dataset):
    def __init__(self, tokenizer, dataframe, q_len, t_len):
        self.tokenizer = tokenizer
        self.q_len = q_len
        self.t_len = t_len
        self.data = dataframe
        self.questions = self.data["question"]
        self.context = self.data["context"]
        self.answer = self.data['answer']

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        context = self.context[idx]
        answer = self.answer[idx]

        # padding="max_length" already pads to max_length, so the deprecated pad_to_max_length flag is not needed
        question_tokenized = self.tokenizer(question, context, max_length=self.q_len, padding="max_length",
                                            truncation=True, add_special_tokens=True)
        answer_tokenized = self.tokenizer(answer, max_length=self.t_len, padding="max_length",
                                          truncation=True, add_special_tokens=True)

        labels = torch.tensor(answer_tokenized["input_ids"], dtype=torch.long)
        # Padding tokens (id 0 for T5) are set to -100 so that the loss ignores them
        labels[labels == 0] = -100

        return {
            "input_ids": torch.tensor(question_tokenized["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(question_tokenized["attention_mask"], dtype=torch.long),
            "labels": labels,
            "decoder_attention_mask": torch.tensor(answer_tokenized["attention_mask"], dtype=torch.long)
        }
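
The label masking step can be shown in isolation. In this minimal sketch, plain lists stand in for tensors and the token ids are made up; it relies on two facts: T5 uses id 0 for padding, and -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the loss.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids with the ignore index so the loss skips them."""
    return [ignore_index if token == pad_id else token for token in label_ids]

# Made-up ids: two answer tokens, end-of-sequence (1), then padding
labels = [363, 19, 1, 0, 0]
print(mask_padding(labels))  # [363, 19, 1, -100, -100]
```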

In [16]:
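# Here we are creating the training and validation splits of the data
from torch.utils.data import SubsetRandomSampler

def create_train(data):
  train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

  # SubsetRandomSampler draws only from the given row indices, so the two loaders never share rows.
  # RandomSampler(train_data.index) would instead yield 0..len-1 and overlap with the validation rows.
  # This relies on `data` having a fresh 0..n-1 index (the callers use reset_index).
  train_sampler = SubsetRandomSampler(train_data.index.tolist())
  val_sampler = SubsetRandomSampler(val_data.index.tolist())

  qa_dataset = QA_Dataset(TOKENIZER, data, Q_LEN, T_LEN)

  train_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=train_sampler)
  val_loader = DataLoader(qa_dataset, batch_size=BATCH_SIZE, sampler=val_sampler)

  return train_loader, val_loader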
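
A validation set only measures generalization if the two loaders draw from disjoint row indices. The sketch below illustrates that property with the standard library alone; `split_indices` is a hypothetical helper, not part of the notebook. In PyTorch, index lists like these can be passed to `SubsetRandomSampler`.

```python
import random

def split_indices(n_rows, val_fraction=0.2, seed=42):
    """Shuffle row indices and split them into disjoint train and validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = round(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(125)
# 100 training rows, 25 validation rows, and no index shared between them
print(len(train_idx), len(val_idx), set(train_idx) & set(val_idx))  # 100 25 set()
```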

In [17]:
# Here we will train the model
def train_model(train_loader, val_loader):
  # We refresh the tokenizer, model, and optimizer so that each dataset starts from the
  # pretrained weights and earlier runs cannot inflate the F1 score.
  # They are declared global so that predict_answer and evaluate_hf_model use the newly trained model.
  global TOKENIZER, MODEL, OPTIMIZER
  TOKENIZER = T5TokenizerFast.from_pretrained("t5-small")
  MODEL = T5ForConditionalGeneration.from_pretrained("t5-small", return_dict=True).to(DEVICE)
  OPTIMIZER = Adam(MODEL.parameters(), lr=0.00001)

  train_loss = 0
  val_loss = 0
  train_batch_count = 0
  val_batch_count = 0
  epochs=10

  for epoch in range(epochs):
      # Training
      MODEL.train()
      for batch in tqdm(train_loader, desc="Training batches"):
          input_ids = batch["input_ids"].to(DEVICE)
          attention_mask = batch["attention_mask"].to(DEVICE)
          labels = batch["labels"].to(DEVICE)
          decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

          outputs = MODEL(
              input_ids=input_ids,
              attention_mask=attention_mask,
              labels=labels,
              decoder_attention_mask=decoder_attention_mask
          )

          OPTIMIZER.zero_grad()
          outputs.loss.backward()
          OPTIMIZER.step()
          train_loss += outputs.loss.item()
          train_batch_count += 1

      # Evaluation: no gradients and no optimizer steps, so the model never trains on the validation data
      MODEL.eval()
      with torch.no_grad():
          for batch in tqdm(val_loader, desc="Validation batches"):
              input_ids = batch["input_ids"].to(DEVICE)
              attention_mask = batch["attention_mask"].to(DEVICE)
              labels = batch["labels"].to(DEVICE)
              decoder_attention_mask = batch["decoder_attention_mask"].to(DEVICE)

              outputs = MODEL(
                  input_ids=input_ids,
                  attention_mask=attention_mask,
                  labels=labels,
                  decoder_attention_mask=decoder_attention_mask
              )

              val_loss += outputs.loss.item()
              val_batch_count += 1

      # We also print the loss (a running average over all batches so far) since it indicates how the model is doing
      print(f"{epoch+1}/{epochs} -> Train loss: {train_loss / train_batch_count}\tValidation loss: {val_loss/val_batch_count}")
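
Since the loss totals and batch counts are never reset between epochs, the printed values are running averages over every batch seen so far, not per-epoch averages. The following sketch, with made-up loss values and a hypothetical helper name, shows how the two numbers diverge after the first epoch.

```python
def running_and_epoch_means(epoch_losses):
    """For each epoch, return (running mean over all batches so far, mean of that epoch only).

    epoch_losses: list of lists of per-batch losses, one inner list per epoch.
    """
    total, count = 0.0, 0
    report = []
    for losses in epoch_losses:
        total += sum(losses)
        count += len(losses)
        report.append((total / count, sum(losses) / len(losses)))
    return report

# Made-up losses for two epochs of two batches each
print(running_and_epoch_means([[4.0, 2.0], [1.0, 1.0]]))
# [(3.0, 3.0), (2.0, 1.0)]
```

The running average lags behind the latest epoch, so a per-epoch mean is the more direct signal of recent progress.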

Here is the evaluation process: \
We use the predict_answer function to generate answers, because the model in its current state does not work well with the processor shown in homework 3. \
The predict_answer function can also be used on its own to get a BLEU score, as shown in the article that the rest of the model is based on. \
With predict_answer, we can adapt the code given in homework 3 to generate answers with our model and test them against the SQuAD validation set. This also prints out the F1 score.
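
For intuition about the F1 score reported here, the sketch below approximates the token-level F1 used for SQuAD: both strings are normalized, the overlap of their tokens gives precision and recall, and the best score over all reference answers is kept. It is an illustrative reimplementation under those assumptions, with hypothetical function names; the notebook itself relies on the `squad` metric from `evaluate`, which reports scores on a 0 to 100 scale.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, references):
    """Score a prediction against every reference answer and keep the best."""
    return max(token_f1(prediction, ref) for ref in references)

print(best_f1("the late 1990s", ["in the late 1990s", "late 1990s"]))  # 1.0
```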


In [18]:
def predict_answer(context, question, ref_answer=None):
    # The tokenizer converts the question and context into token ids
    inputs = TOKENIZER(question, context, max_length=Q_LEN, padding="max_length", truncation=True, add_special_tokens=True)

    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).to(DEVICE).unsqueeze(0)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).to(DEVICE).unsqueeze(0)

    # The model generates an answer from the question and context, and the tokenizer decodes it
    outputs = MODEL.generate(input_ids=input_ids, attention_mask=attention_mask)
    predicted_answer = TOKENIZER.decode(outputs.flatten(), skip_special_tokens=True)

    # If a reference answer is given, compare it to the predicted answer.
    # Otherwise, return the predicted answer as text.
    if ref_answer:
        # Load the BLEU metric
        bleu = evaluate.load("google_bleu")
        score = bleu.compute(predictions=[predicted_answer],
                             references=[ref_answer])

        print("Context: \n", context)
        print("\n")
        print("Question: \n", question)
        return {
            "Reference Answer: ": ref_answer,
            "Predicted Answer: ": predicted_answer,
            "BLEU Score: ": score
        }
    else:
        return predicted_answer

In [19]:
squad_dataset = load_dataset('squad', split='validation') # Makes the process of loading datasets much easier than before
# Evaluate on 1000 distinct validation examples (random.sample draws without replacement)
squad_dataset = squad_dataset.select(random.sample(range(len(squad_dataset)), k=1000))
squad_evaluate = load('squad')

def evaluate_hf_model(MODEL, model_name):
    predictions = []
    references = []

    # For the prediction text we no longer use the pipeline but instead use our predict_answer function,
    # which generates with the global MODEL and TOKENIZER
    for ex in tqdm(squad_dataset, total=len(squad_dataset)):
        predictions.append({
            'id' : ex['id'],
            'prediction_text' : predict_answer(ex['context'], ex['question'])
        })

        # Each example has multiple possible answers to compare against. Here we convert them from the datasets format to the one expected by the evaluation metric.
        references.append({
            'id' : ex['id'],
            'answers' : [{'text' : text, 'answer_start' : start} for text, start in zip(ex['answers']['text'], ex['answers']['answer_start'])]
        })

    # Compute the metrics (exact match and F1) once and report them
    results = squad_evaluate.compute(predictions=predictions, references=references)
    print('Performance of {} : {}'.format(model_name, results))



In [None]:
# This trains a model on the baseline data and evaluates it, printing the F1 score
train, val = create_train(shuffled_data)
train_model(train, val)
evaluate_hf_model(MODEL, 'Baseline')

In [None]:
# This trains a model on the synonyms dataset and evaluates it, printing the F1 score
synonyms=get_synonyms(data_as_pd)
train, val = create_train(synonyms)
train_model(train, val)
evaluate_hf_model(MODEL, 'Synonyms')

In [22]:
# Here we alter the questions of the SQuAD dataset for the prompting heuristic (data_as_json is modified in place)
question_changed_data=modify_questions(data_as_json)
question_changed_pd=prepare_data(question_changed_data)
question_changed_pd=pd.DataFrame(question_changed_pd)
question_changed_pd = question_changed_pd.sample(frac=1).reset_index(drop=True)[:125]

100%|██████████| 442/442 [00:00<00:00, 722.35it/s]


In [None]:
# We train and evaluate the model on the new prompting dataset
train, val = create_train(question_changed_pd)
train_model(train, val)
evaluate_hf_model(MODEL, 'Prompting')

In [24]:
# We load the altered entities dataset and prepare it in the same way as the baseline
# If running on local change the route to where the dataset is located
with open('/content/modified_squad_data.json') as f:
    altered_entities = json.load(f)

In [25]:
altered = prepare_data(altered_entities)

# Create a Dataframe
altered_pd = pd.DataFrame(altered)

# Limit the data to 125 rows. The 80/20 split later gives 100 training rows and 25 validation rows
altered_pd = altered_pd.sample(frac=1).reset_index(drop=True)[:125]


In [None]:
# Train and evaluate the model on the altered entities dataset
train, val = create_train(altered_pd)
train_model(train, val)
evaluate_hf_model(MODEL, 'Altered Entities')