<a href="https://colab.research.google.com/github/arinakosovskaia/SQuAD2.0/blob/main/Kosovskaia_Nethercott_Preda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🥇 *The* *Squad*: Nate Nethercott (10815538), Arina Kosovskaia, Daria-Maria Preda (10855501)


# **SQuAD2.0: The Stanford Question Answering Dataset**

Website with data: https://rajpurkar.github.io/SQuAD-explorer/
- Dataset: a set of questions paired with Wikipedia articles; the answer, when one exists, is a span of the article.
- Task: find the answer to each question, or respond that the question is unanswerable given the information available.

- ✨ Our approach ✨: choose one pretrained model from HuggingFace and fine-tune it on the SQuAD2.0 dataset (plus an attempt, mostly failed, to compare multiple models and a naive approach)

[1] https://huggingface.co/

[2] https://huggingface.co/docs/transformers/tasks/question_answering 


# 💪 Bert 🧠
Transformer-based approach leveraging pretraining of MLMs

Goals: 
- compare zero-shot performance of bert-base models with fine tuning on squad data
- see how the fine tuned model compares to available alternatives on hf 

<img src="https://pytorch.org/tutorials/_images/bert.png" height=700>


In [None]:
# link google drive to access model weights 
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
#!pip install -q transformers datasets


In [None]:
import torch 
import torch.nn.functional as F
from torch import nn 
import numpy as np
from datasets import load_dataset
from pprint import pprint


# 🛠 Preliminary analysis:
SQuAD2.0 combines questions in SQuAD1.1 with unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

NB: followed the ideas from here https://towardsdatascience.com/use-the-datasets-library-of-hugging-face-in-your-next-nlp-project-94e300cca850

We load the data using the `datasets` library of Hugging Face, which provides a clean interface for loading train and validation splits. The first call below loads `squad` (SQuAD1.1, 87,599 training questions); the SQuAD2.0 splits (`squad_v2`) are loaded right after it.

In [None]:
squad_dataset = load_dataset('squad')


In [None]:
print(squad_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [None]:
test = load_dataset("squad_v2", split="validation")
train = load_dataset("squad_v2", split="train")
print(train.shape, test.shape)



(130319, 5) (11873, 5)


Let's take a look at an entry in the dataset.
It is organised as `title`,`id`, `question`, `context`, `answers`:

In [None]:
print("Features: ")
pprint(train.features)
print("Column names: ", train.column_names)

Features: 
{'answers': Sequence(feature={'answer_start': Value(dtype='int32', id=None),
                              'text': Value(dtype='string', id=None)},
                     length=-1,
                     id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
Column names:  ['id', 'title', 'context', 'question', 'answers']


In [None]:
train[0]

In [None]:
print("Another two examples from the dataset using slice operation: \n")
pprint(train[14:16])

Now some metrics...

In [None]:
print("Length of the validation set: ", len(test))
print("Length of the training set: ", len(train))
print("Number of rows: ", train.num_rows)
print("Number of columns: ", train.num_columns)
print("Shape: ", train.shape)

For a given context we can have multiple questions associated with it:

In [None]:
print("A column slice from the dataset: \n")
pprint(train['question'][:5])

A column slice from the dataset: 

['When did Beyonce start becoming popular?',
 'What areas did Beyonce compete in when she was growing up?',
 "When did Beyonce leave Destiny's Child and become a solo singer?",
 'In what city and state did Beyonce  grow up? ',
 'In which decade did Beyonce become famous?']


SQuAD2.0 is known for putting both ***people and models*** 👨 🤝 💻  in difficulty by throwing in ***unanswerable*** questions, so let's find the monsters:

In [None]:
import random
import pandas as pd
from IPython.display import display, HTML

def display_random_examples(train, num_examples=15):
    assert num_examples < len(train)
    
    # despite the name, this collects the first `num_examples` unanswerable examples in dataset order
    random_picks = []
    count = 0
    for i in range(len(train)):
        example = train[i]
        if len(example['answers']['text']) == 0:
            random_picks.append(i)
            count += 1
            if count == num_examples:
                break
    
    if len(random_picks) == 0:
        print("No examples found with empty answers.")
        return
    
    df = pd.DataFrame(train[random_picks])
    display(HTML(df.to_html()))

display_random_examples(train, 3)

Unnamed: 0,id,title,context,question,answers
0,5a8d7bf7df8bba001a0f9ab1,The_Legend_of_Zelda:_Twilight_Princess,"The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]",What category of game is Legend of Zelda: Australia Twilight?,"{'text': [], 'answer_start': []}"
1,5a8d7bf7df8bba001a0f9ab2,The_Legend_of_Zelda:_Twilight_Princess,"The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]",What consoles can be used to play Australia Twilight?,"{'text': [], 'answer_start': []}"
2,5a8d7bf7df8bba001a0f9ab3,The_Legend_of_Zelda:_Twilight_Princess,"The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]",When was Australia Twilight launched in North America?,"{'text': [], 'answer_start': []}"


In [None]:
count = 0

# Iterate through the dataset
for example in train:
    question = example["question"]
    answer = example["answers"]["text"]

    # Check if the answer is empty
    if len(answer) == 0:
        print(f"Empty answer detected for question: {question}")
        count += 1

    # Check if we have found 5 questions without an answer
    if count == 5:
        break

Empty answer detected for question: What category of game is Legend of Zelda: Australia Twilight?
Empty answer detected for question: What consoles can be used to play Australia Twilight?
Empty answer detected for question: When was Australia Twilight launched in North America?
Empty answer detected for question: When could GameCube owners purchase Australian Princess?
Empty answer detected for question: What year was the Legend of Zelda: Australian Princess originally planned for release?


### 💭 Question exploration


In [None]:
squad_train = pd.DataFrame(train)
squad_test = pd.DataFrame(test)

🍳 Let's do preprocessing of the text. First we will delete stopwords and do lemmatization and tokenization.
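
The `preproc_nltk` helper applied to the questions further down is not shown in this notebook. As a rough idea of what such a step does, here is a minimal sketch under stated assumptions: `preproc_sketch`, its regex tokenizer and the toy stopword list are hypothetical stand-ins, not the function that produced the numbers below.

```python
import re

def preproc_sketch(text, stop_words, lemmatize=lambda w: w):
    """Lowercase, tokenize on letters/digits, drop stopwords, lemmatize, re-join."""
    # hypothetical tokenizer: runs of lowercase letters or digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [lemmatize(t) for t in tokens if t not in stop_words]
    return " ".join(kept)

# toy stopword list, for illustration only
toy_stop_words = {"when", "did", "in", "what", "and", "she", "was", "up"}
print(preproc_sketch("When did Beyonce start becoming popular?", toy_stop_words))
# prints: beyonce start becoming popular
```

With the NLTK objects created in the next cell, the same sketch could presumably be called as `preproc_sketch(x, stopWords, wnl.lemmatize)`.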

In [None]:
import re
import string
import collections
from collections import Counter

import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# download the NLTK resources needed for stopword removal, tokenization and lemmatization
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stopWords = set(stopwords.words('english'))
wnl = nltk.WordNetLemmatizer()

In [None]:
squad_train['question_prep'] = squad_train['question'].apply(lambda x: preproc_nltk(x))
squad_test['question_prep'] = squad_test['question'].apply(lambda x: preproc_nltk(x))

In [None]:
print("Average number of words in question, train:", (squad_train['question_prep'].apply(lambda x: x.split()).apply(len)-1).mean())
print("Average number of words in question, validation:", (squad_test['question_prep'].apply(lambda x: x.split()).apply(len)-1).mean())

Average number of words in question, train: 5.65711937350883
Average number of words in question, validation: 5.695553453169347


So, after preprocessing, a question contains roughly 6 words on average.

In [None]:
questions_train = ' '.join(squad_train['question_prep'])
questions_test = ' '.join(squad_test['question_prep'])
all_questions = questions_train + ' ' + questions_test

In [None]:
print("The size of vocabulary:", len(set(all_questions.split())))

The size of vocabulary: 38008


🧘 Let's try to group the questions into 10 clusters.

In order to cluster the documents, we need to first convert them into a vector format. We will use the TfidfVectorizer from Scikit-Learn to do this.

We won't take into account words occurring in more than 80% of the questions (`max_df=0.8`) or in fewer than 5 questions (`min_df=5`).
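
To make the effect of the two thresholds concrete, here is a small pure-Python sketch of document-frequency filtering. It only mimics the vocabulary pruning step, not the TF-IDF weighting, and `df_filtered_vocab` and the toy corpus are made up for illustration.

```python
from collections import Counter

def df_filtered_vocab(docs, max_df=0.8, min_df=5):
    """Keep terms found in at least `min_df` documents and in at most a `max_df` fraction of them."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# toy corpus: 'city' is in every document, 'year' in 4, 'rare' in 1, 'name' in 5
docs = ["city name"] * 5 + ["city year"] * 4 + ["city rare"]
print(df_filtered_vocab(docs))
# prints: ['name']
```

Here `city` is dropped for being too common, while `year` and `rare` are dropped for being too rare.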

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=5)

In [None]:
vectorizer.fit(squad_train['question_prep'])

In [None]:
vocab = vectorizer.get_feature_names_out()

print(f"New length of vocabulary: {len(vocab)}")

New length of vocabulary: 9446


In [None]:
vector_questions_train = vectorizer.transform(squad_train['question_prep'])
vector_questions_test = vectorizer.transform(squad_test['question_prep'])

📚 Clustering with k-Means

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, max_iter=100, n_init=2, verbose=True, random_state=2307)
kmeans.fit(vector_questions_train)

Initialization complete
Iteration 0, inertia 168574.79633215984.
Iteration 1, inertia 85873.84358247.
Iteration 2, inertia 85779.94989331017.
Iteration 3, inertia 85758.45481846304.
Iteration 4, inertia 85743.0719158977.
Iteration 5, inertia 85740.86941577483.
Iteration 6, inertia 85740.86672151393.
Converged at iteration 6: strict convergence.
Initialization complete
Iteration 0, inertia 168826.4874926858.
Iteration 1, inertia 85812.31778904032.
Iteration 2, inertia 85686.42385761911.
Iteration 3, inertia 85627.67079252863.
Iteration 4, inertia 85605.46079935489.
Iteration 5, inertia 85587.73097726377.
Iteration 6, inertia 85577.06745063604.
Iteration 7, inertia 85573.71985483702.
Iteration 8, inertia 85573.54945780986.
Iteration 9, inertia 85573.5479995172.
Iteration 10, inertia 85573.54681193116.
Converged at iteration 10: strict convergence.


In [None]:
print("Top terms per cluster:")
vocab = vectorizer.get_feature_names_out()

for i in range(kmeans.n_clusters):
    centroid = kmeans.cluster_centers_[i]    
    sorted_terms = centroid.argsort()[::-1]
    print(f"Cluster {i}:\t{[vocab[j] for j in sorted_terms[:10]]}")

Top terms per cluster:
Cluster 0:	['state', 'united', 'many', 'city', 'year', 'treaty', 'law', 'first', 'government', 'name']
Cluster 1:	['part', 'considered', 'important', 'world', 'many', 'language', 'body', 'city', 'become', 'located']
Cluster 2:	['called', 'group', 'people', 'name', 'also', 'two', 'ethnic', 'first', 'one', 'form']
Cluster 3:	['name', 'first', 'used', 'city', 'country', 'located', 'time', 'term', 'one', 'two']
Cluster 4:	['year', 'first', 'many', 'take', 'die', 'begin', 'war', 'become', 'founded', 'place']
Cluster 5:	['type', 'used', 'music', 'system', 'two', 'one', 'found', 'antenna', 'art', 'climate']
Cluster 6:	['make', 'used', 'much', 'many', 'use', 'wood', 'difficult', 'type', 'population', 'year']
Cluster 7:	['use', 'term', 'system', 'type', 'first', 'method', 'time', 'language', 'people', 'describe']
Cluster 8:	['many', 'people', 'member', 'time', 'live', 'school', 'season', 'city', 'died', 'day']
Cluster 9:	['new', 'york', 'city', 'delhi', 'many', 'name', 'y

In [None]:
print('Number of questions in: ')

for i in range(kmeans.n_clusters):
    print(f"Cluster {i}: {np.sum(kmeans.labels_ == i)}")

Number of questions in: 
Cluster 0: 1619
Cluster 1: 1466
Cluster 2: 2374
Cluster 3: 67573
Cluster 4: 3720
Cluster 5: 2243
Cluster 6: 747
Cluster 7: 1198
Cluster 8: 4841
Cluster 9: 1818




---



Now let's have fun with this ❓***questionable***❓ dataset of ours by using the mighty 💪 ***Bert***: 

This class of transformer model can be used for classification tasks by training an additional set of weights mapping the output context vectors to logits predicting whether or not the answer span starts/ends at particular locations.

The data given provides info concerning `question`, `context`, `id`, `answers`, but in order to fine tune our predictive model we also need to determine where our answer starts within the context in the tokenized representation. 
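
The core of that step is mapping a character span onto token indices. A minimal sketch, assuming we already have one `(char_start, char_end)` pair per context token; the offsets below are written by hand for whitespace tokens and are not real tokenizer output, and `char_span_to_token_span` is a hypothetical helper.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    Returns None when the tokens do not cover the whole answer
    (for instance because the context was truncated).
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # last token that starts at or before the answer start
    start_tok = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # first token that ends at or after the answer end
    end_tok = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_tok, end_tok

context = "Paris is the capital of France"
offsets = [(0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30)]
answer = "the capital"
start_char = context.index(answer)
end_char = start_char + len(answer)
print(char_span_to_token_span(offsets, start_char, end_char))
# prints: (2, 3)
```

The preprocessing function below does the same search with two `while` loops over the tokenizer's offset mapping, restricted to the context part of the input.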

In [None]:
# we slightly modify the preprocessing function found at https://huggingface.co/docs/transformers/tasks/question_answering

def preprocess_function(examples, tokenizer):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]

        if len(answer["answer_start"])>0:
          start_char = answer["answer_start"][0]
          end_char = answer["answer_start"][0] + len(answer["text"][0])
          sequence_ids = inputs.sequence_ids(i)

          # Find the start and end of the context
          idx = 0
          while sequence_ids[idx] != 1:
              idx += 1
          context_start = idx
          while sequence_ids[idx] == 1:
              idx += 1
          context_end = idx - 1

          # If the answer lies entirely outside the (possibly truncated) context, label it (0, 0)
          if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
              start_positions.append(0)
              end_positions.append(0)
          else:
              # Otherwise it's the start and end token positions
              idx = context_start
              while idx <= context_end and offset[idx][0] <= start_char:
                  idx += 1
              start_positions.append(idx - 1)

              idx = context_end
              while idx >= context_start and offset[idx][1] >= end_char:
                  idx -= 1
              end_positions.append(idx + 1)
        else:
          start_positions.append(0)
          end_positions.append(0)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    
    return inputs

🧚 One final consideration: we use AutoTokenizer and AutoModelForQuestionAnswering in order to a) quickly extract a tokenizer from a model path, and b) nicely wrap the forward method of the underlying Bert model.

We use `bert-base-cased` going off the logic that having capital letters helps distinguish named entities which often show up as answers to questions.

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-cased")

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

In [None]:
text = "let me check to see the tokenization is predictable"   
tokenizer.decode(tokenizer(text, text).input_ids)

#[CLS] token is our context aware vector; relays information concerning the sequence as a whole. Used for stuff like NSP 

'[CLS] let me check to see the tokenization is predictable [SEP] let me check to see the tokenization is predictable [SEP]'

## 💅 Data prep

- Map preprocessing function over splits 
- Dataloaders for base PyTorch training & resource management 

In [None]:
from functools import partial 

train = train.map(partial(preprocess_function, tokenizer=tokenizer), batched=True)
test = test.map(partial(preprocess_function, tokenizer=tokenizer), batched=True)

train

Map:   0%|          | 0/130319 [00:00<?, ? examples/s]

Map:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 130319
})

👑 We provide an interface to the SQuAD dataset compatible with the torch dataloader (just specify `__len__` and `__getitem__` dunders).  Additionally we provide a custom collating function to let us access dict fields instead of individual training instances 

In [None]:
from torch.utils.data import DataLoader

class TorchDataset:
  def __init__(self, data):
    self.data = data

  def __len__(self):
    return len(self.data)

  def __getitem__(self, i):
      return self.data[i]

# turn list of dicts into dict of lists 
def custom_collate(data): 
    return {k:[d[k] for d in data] for k in data[0].keys()}

train_data = TorchDataset(train)
train_loader = DataLoader(train_data, batch_size = 8, shuffle = True, collate_fn=custom_collate)

test_data = TorchDataset(test)
test_loader = DataLoader(test_data, batch_size = 8, collate_fn=custom_collate)

print(next(iter(train_loader)).keys())

dict_keys(['id', 'title', 'context', 'question', 'answers'])


## 🧗 Training

- instantiate an optimizer and learning rate scheduler
- define the looped gradient descent procedure 

- used hyperparams in the [distilbert model](https://huggingface.co/distilbert-base-uncased-distilled-squad) for qa as a reference for informing the below training (lr = 3e-05, linear decay for lr_scheduler with no warmup, 2 epochs; coulda tried cosine annealing but was braindead when training)

   - roughly 1.5 hours on undisclosed HPC

- model weights after our training can be found [here](https://drive.google.com/file/d/1O1HDtRccTsbkzNiEyyENz9zqfqIVRTM7/view?usp=sharing)
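
For intuition about the `linear` schedule, here is a plain-Python sketch of the learning rate as a function of the step: linear warmup (none in our case) followed by linear decay to zero. `linear_schedule_lr` and `total_steps` are illustrative names and values, not part of the training code.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

total_steps = 1000  # hypothetical; the notebook uses num_epochs * len(train_loader)
for step in (0, 500, 1000):
    print(step, linear_schedule_lr(step, 3e-5, total_steps))
```

With zero warmup steps the rate starts at the full 3e-5, is halved midway through training, and reaches zero at the last step.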

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm import tqdm 
import torch 

device = "cuda" if torch.cuda.is_available() else "cpu"

optimizer = AdamW(model.parameters(), lr=3e-5)

num_epochs = 2
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

model.to(device)
model.train()

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elem

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    for batch in train_loader:
        try:
            input_ids = torch.tensor(batch['input_ids'], device = device)
            attention_mask = torch.tensor(batch['attention_mask'], device=device)
            start_positions = torch.tensor(batch['start_positions'], device = device)
            end_positions = torch.tensor(batch['end_positions'], device = device)

            outputs = model(input_ids=input_ids, 
                            attention_mask = attention_mask,
                            start_positions = start_positions,
                            end_positions = end_positions,
                            )

            loss = outputs.loss
            loss.backward()

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

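        except RuntimeError as err:
            # skip a failing batch (e.g. CUDA out of memory) but report it instead of silently swallowing every error
            print(f"skipping batch: {err}")
            optimizer.zero_grad()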


In [None]:
#Note: in previous model trainings I already saved some predictions at different points of training
!ls gdrive/MyDrive/nlp\ project

bert-again.ipynb  gpt-squad.ipynb      preds_dummy.json
bert-squad.ipynb  preds_baseline.json  preds_large.json


⌛ load the scripts used in official submission evaluations
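
The `exact` and `f1` numbers reported below are per-question string comparisons averaged over the dev set. A simplified sketch in the spirit of the official script (a re-implementation for illustration, not the downloaded `evaluation.py`; it also ignores that the real script takes the best score over several reference answers):

```python
import re
import string
from collections import Counter

PUNCT = set(string.punctuation)

def normalize_answer(s):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, truth):
    return int(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    if not pred_tokens or not true_tokens:
        # for unanswerable questions, only an empty prediction counts as correct
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "gelatinous projections edged with cilia that produce water currents"
pred = "gelatinous projections edged with cilia"
print(exact_match(pred, truth), round(f1_score(pred, truth), 3))
# prints: 0 0.714
```

A prediction that is a strict sub-span of the reference therefore scores 0 on exact match but still earns partial F1 credit.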



In [None]:
!wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ -O evaluation.py
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O dev-v2.0.json


In [None]:
#baseline prediction accuracy
!python3 evaluation.py ./dev-v2.0.json gdrive/MyDrive/nlp\ project/preds_baseline.json

{
  "exact": 25.191611218731577,
  "f1": 27.015494626847186,
  "total": 11873,
  "HasAns_exact": 0.0,
  "HasAns_f1": 3.652997251106355,
  "HasAns_total": 5928,
  "NoAns_exact": 50.31118587047939,
  "NoAns_f1": 50.31118587047939,
  "NoAns_total": 5945
}


In [None]:
#prediction after training 
!python3 evaluation.py ./dev-v2.0.json gdrive/MyDrive/nlp\ project/preds_large.json

{
  "exact": 67.59033100311632,
  "f1": 71.36014963533503,
  "total": 11873,
  "HasAns_exact": 63.46153846153846,
  "HasAns_f1": 71.01198660936785,
  "HasAns_total": 5928,
  "NoAns_exact": 71.70731707317073,
  "NoAns_f1": 71.70731707317073,
  "NoAns_total": 5945
}


🔍 Let's see the types of responses we get; load trained weights and do some predictions
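
The cell further down takes the argmax of the start and end logits independently, which can produce an end before the start. One possible alternative, sketched here on toy numbers and not what was used for the reported scores, is to search for the best-scoring valid pair; `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e.

    Index 0 is treated as the [CLS] / "no answer" position, mirroring the
    (0, 0) label that preprocessing assigns to unanswerable questions.
    """
    best, best_score = (0, 0), start_logits[0] + end_logits[0]
    for s in range(1, len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# toy logits, not model output: independent argmax would give start=2, end=1
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.1, 2.5, 0.3, 2.0, 0.2]
print(best_span(start_logits, end_logits))
# prints: (2, 3)
```

The double loop is quadratic in the sequence length, so in practice one would likely restrict it to the top few start and end candidates.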

In [None]:
weights = 'gdrive/MyDrive/nlp project/bert-again-large'
model.load_state_dict(torch.load(weights))
model.eval()

In [None]:
num_samples = 5
samples = test[np.random.choice(range(10000), num_samples)]

for i in range(num_samples): 
  #pass through model (inputs must live on the same device as the model)
  input_ids = torch.tensor(samples['input_ids'][i], device=device).unsqueeze(0)
  attention_mask = torch.tensor(samples['attention_mask'][i], device=device).unsqueeze(0)

  # no gradients are needed for inference
  with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask = attention_mask)

  #recover prediction span as plain integers
  start = out.start_logits.argmax(dim=1).item()
  end = out.end_logits.argmax(dim=1).item()

  #print info 
  print(f'question: {samples["question"][i]}')
  if len(samples["answers"][i]["text"])>0:
    print(f'real answer: {samples["answers"][i]["text"][0]}')
  else:
    print(f'real answer: {samples["answers"][i]["text"]}')
  print(f'predicted span: {tokenizer.decode(input_ids[0, start:end+1])}')
  print('\n')

question: What is one function that prime numbers have that 1 does not?
real answer: the sum of divisors function
predicted span: sum of divisors function


question: Who funds the IPCC's Secretary?
real answer: World Meteorological Organization
predicted span: [CLS]


question: What are auricles?
real answer: gelatinous projections edged with cilia that produce water currents
predicted span: gelatinous projections edged with cilia


question: What is the Dutch word for the Amazon rainforest?
real answer: Amazoneregenwoud
predicted span: Amazoneregenwoud


question: What is issued once construction is complete and a final inspection has been passed?
real answer: an occupancy permit
predicted span: an occupancy permit




## pre-trained from hf 🤗

Chose [this model](https://huggingface.co/mrm8488/bert-medium-finetuned-squadv2) from the available ones for the SQuADv2 task 

In [None]:
hf_tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert-medium-finetuned-squadv2")
hf_model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/bert-medium-finetuned-squadv2")

Downloading (…)okenizer_config.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/463 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/166M [00:00<?, ?B/s]

In [None]:
#this is why we added the tokenizer arg to the preprocessing function :p
val_hf = test.map(partial(preprocess_function, tokenizer=hf_tokenizer), batched = True)
val_data = TorchDataset(val_hf)
val_loader = DataLoader(val_data, batch_size = 8, collate_fn=custom_collate)

Map:   0%|          | 0/11873 [00:00<?, ? examples/s]

In [None]:
from tqdm import tqdm 
import torch 

device = "cuda" if torch.cuda.is_available() else "cpu"
hf_model.to(device)

hf_model.eval()

preds = {}
for batch in tqdm(val_loader):
    input_ids = torch.tensor(batch['input_ids'], device = device)
    attention_mask = torch.tensor(batch['attention_mask'], device=device)

    # labels are not needed for inference, and no_grad avoids building the autograd graph
    with torch.no_grad():
        out = hf_model(input_ids=input_ids, attention_mask = attention_mask)

    #recover prediction span 
    start = out.start_logits.argmax(dim=1).cpu()
    end = out.end_logits.argmax(dim=1).cpu()

    for i, p in enumerate(list(zip(start, end))):
        decoded = hf_tokenizer.decode(input_ids[i,p[0]:p[1]+1])
        preds[batch['id'][i]] = decoded if decoded != '[CLS]' else ''

100%|██████████| 1485/1485 [01:46<00:00, 13.89it/s]


In [None]:
import json 
with open("preds_hf.json", "w") as f:
    json.dump(preds, f)

#prediction for hugging face model  -- should be close to 70 in F1 
!python3 evaluation.py ./dev-v2.0.json preds_hf.json

{
  "exact": 53.00261096605744,
  "f1": 57.030657390000464,
  "total": 11873,
  "HasAns_exact": 46.18758434547908,
  "HasAns_f1": 54.25522860854819,
  "HasAns_total": 5928,
  "NoAns_exact": 59.79814970563499,
  "NoAns_f1": 59.79814970563499,
  "NoAns_total": 5945
}
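
For reading the numbers above: the evaluation script reports exact match and a token-overlap F1. The sketch below is a simplified version of that F1, under the assumption that answers are only lowercased and split on whitespace (the official script also normalizes punctuation and articles and takes the best score over several gold answers). `token_f1` is a name made up for this illustration. It also suggests why `NoAns_exact` and `NoAns_f1` coincide: with an empty gold answer the score can only be 0 or 1.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()

    # unanswerable case: the score is 1 only if both sides are empty, else 0
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # multiset intersection counts the tokens shared by both answers
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("10th and 11th centuries", "10th and 11th centuries"))  # full match
print(token_f1("11th centuries", "10th and 11th centuries"))           # partial match
print(token_f1("", ""))                                                # correct "no answer"
print(token_f1("france", ""))                                          # answered an unanswerable question
```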


# 🥲 Ideas that didn't work out aka it is what it is (or it is not)

## Concept
- use a tf-idf representation to isolate the relevant sentence from the context provided with the question (larger margin for error)
- train a transformer-like architecture **without** trainable embeddings; instead, see if we can benefit from pre-training through fixed GloVe embeddings, for instance
  - a couple of multi-head self-attention layers plus a QA head for the start/end logits
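
A minimal sketch of the sentence-retrieval idea, not necessarily the implementation behind the results reported in this notebook: each sentence of the context is treated as a document, weighted with a smoothed tf-idf, and the sentence with the highest cosine similarity to the question is kept. The helper names (`tokenize`, `retrieve_sentence`) and the toy context are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

def retrieve_sentence(question, context):
    """Return the context sentence with the highest tf-idf cosine similarity to the question."""
    sentences = [s.strip() for s in context.split('.') if s.strip()]
    docs = [tokenize(s) for s in sentences]

    # smoothed inverse document frequency, one "document" per sentence
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

    def tfidf(tokens):
        # words that never occur in the context get weight 0
        return {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = tfidf(tokenize(question))
    scores = [cosine(q, tfidf(d)) for d in docs]
    return sentences[scores.index(max(scores))]

context = ("The Normans were the people who gave their name to Normandy. "
           "They were descended from Norse raiders. "
           "The Normans were in Normandy in the 10th and 11th centuries.")
print(retrieve_sentence("When were the Normans in Normandy?", context))
```

Retrieval is only as good as the word overlap between question and sentence, so a miss here leaves the downstream model with a sentence that does not contain the answer.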


## Results
- best loss obtained was around 5.8, summed over the start and end cross-entropies, i.e. about 2.9 per position; this corresponds to giving the correct start and end word locations a probability of $e^{-2.9} \approx 1/18$ (not great). 
- contexts were 196 words long, so a totally random model would incur a summed loss of $-2\log(1/196) \approx 10.6$
- the tf-idf approach for sentence retrieval was 76% accurate over *answerable* questions. 
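
To put the loss values in perspective, a cross-entropy loss can be converted into the probability assigned to the correct position and compared with uniform guessing. A minimal sketch, assuming natural logarithms and that the reported loss is the sum of the start and end cross-entropies (as in `loss = loss_start + loss_end` in the training loop):

```python
import math

context_len = 196   # context length quoted in the results
best_loss = 5.8     # best summed (start + end) loss quoted in the results

# per-position loss and the probability it implies for the correct index
per_position = best_loss / 2
p_correct = math.exp(-per_position)
print(f"p(correct position) ~ 1/{1 / p_correct:.0f}")      # 1/18

# uniform guessing over all positions, summed over start and end
uniform_loss = 2 * math.log(context_len)
print(f"uniform baseline loss ~ {uniform_loss:.1f}")        # 10.6
```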

🗑 Not very honorable mention: we tried to fine-tune, or at least use, gpt2 in the same manner as bert, but failed to prepare the data for the transformer and to work with the hf api; maybe another time

In [None]:
import gensim.downloader as api
model_wiki = api.load("glove-wiki-gigaword-50")



In [None]:
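import numpy as np

# borrow the vectors of two ordinary words to act as padding and unknown-word markers
PAD = model_wiki.get_vector('omit')
UNK = model_wiki.get_vector('unknown')

# shape (1, 50) so it can be tiled along axis 0 with np.repeat
PAD = np.expand_dims(PAD, axis=0)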

In [None]:
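def bert_like_input(question, sentence, emb = model_wiki, max_length = 192):
    """
    Embed a tokenized question/sentence pair as one fixed-size matrix.

    returns:
        - array of shape (max_length, emb_dim) laid out as: sentence [PAD] question [PAD] ... [PAD]
    """

    #start from all PAD rows, then overwrite the word positions
    vec = np.repeat(PAD, max_length, axis=0)

    #sentence (truncated so we never write past max_length)
    idx = 0
    for w in sentence[:max_length]:
        vec[idx] = emb.get_vector(w) if emb.has_index_for(w) else UNK
        idx += 1

    #sep: the row at idx is already PAD, so just skip it
    idx += 1

    #question, truncated to the remaining rows
    for w in question[:max(0, max_length - idx)]:
        vec[idx] = emb.get_vector(w) if emb.has_index_for(w) else UNK
        idx += 1

    return vec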

In [None]:
vec = bert_like_input('this is a question'.split(' '), 'this is context'.split(' '), max_length=32)
print(f'embedded question-context pair shape: {vec.shape}')

#decode 
decoded = [model_wiki.most_similar(w)[0][0] for w in vec]
decoded[0:10]

embedded question-context pair shape: (32, 50)


['this',
 'is',
 'context',
 'omit',
 'this',
 'is',
 'a',
 'question',
 'omit',
 'omit']

##  🤠 fastest transformer implementation in the west 


In [None]:
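import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class SA(nn.Module):
    def __init__(self, dim_word, dim_attn):
        super().__init__()
        #let's just keep all hidden states same dimension for simplicity 
        self.key = nn.Linear(dim_word, dim_attn)
        self.query = nn.Linear(dim_word, dim_attn)
        self.value = nn.Linear(dim_word, dim_attn)
    
    def forward(self, x):
        #projections, each of shape (batch, seq_len, dim_attn)
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        
        #scaled dot-product self attention: scores[b, i, j] = q_i . k_j / sqrt(dim_attn)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        #normalize over the keys (last dim) so each row of weights sums to 1
        attn = F.softmax(scores, dim=-1)
        
        return attn @ v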

class MultiHeadSA(nn.Module):
    def __init__(self, num_heads, dim_word, dim_attn):
        # choose dim attn and num heads to factorize the word dimension 
        super().__init__()
        
        self.heads = nn.ModuleList([SA(dim_word, dim_attn) for i in range(num_heads)])
        self.linear = nn.Linear(num_heads*dim_attn, num_heads*dim_attn)

    def forward(self, x):
        x = x + torch.cat([h(x) for h in self.heads], dim=-1)
        x = x + self.linear(x)        
        return x


class Net(nn.Module):
    def __init__(self, num_heads, dw, dh):
        super().__init__()
        
        #block of 3 multi-head attentions (needs num_heads*dh == dw for the residual connections)
        self.multi_attn_block = nn.Sequential(
            MultiHeadSA(num_heads, dw, dh),
            nn.ReLU(),
            MultiHeadSA(num_heads, dw, dh),
            nn.ReLU(),
            MultiHeadSA(num_heads, dw, dh),
        )
        
        #provide logits for start and end positions 
        self.qa_head = nn.Linear(dw, 2)
        
    def forward(self, x):
        #can we use the same concept as masked self attention to omit question from logit pred?
        x = self.multi_attn_block(x)
        return self.qa_head(x)
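
On the question raised in the `forward` comment: one way to keep the question out of the span prediction is the same trick used for masked attention, i.e. setting the logits of disallowed positions to `-inf` before the softmax so that they receive zero probability. A minimal pure-Python sketch under that assumption (with tensors the same idea would typically be a `masked_fill` on the logits); `mask_logits` and the toy numbers are illustrative only and were not part of the experiments.

```python
import math

def mask_logits(logits, allowed):
    """Set the logits of disallowed positions to -inf."""
    return [x if ok else float('-inf') for x, ok in zip(logits, allowed)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# layout: 3 context words, 1 separator, 2 question words
start_logits = [0.2, 1.5, -0.3, 0.0, 2.0, 0.7]
allowed = [True, True, True, False, False, False]

probs = softmax(mask_logits(start_logits, allowed))
print([round(p, 3) for p in probs])
```

Without the mask the largest logit sits on a question word (index 4); with it, all probability mass stays on the context positions.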

## data prep 
- we need to do some pseudo tokenization at the word level 
- retrieve the vector representation of each word before passing inputs to the network during training 
- we need to manually reimplement the offsets field that the hugging face tokenizers provide
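
A minimal sketch of the last point, assuming the same `\b\w+\b` word pattern used in the preprocessing below: `re.finditer` gives the character span of every word, which plays the role of a tokenizer's offset mapping, and an answer's character span can then be mapped to word indices. The helper names `word_offsets` and `char_span_to_word_span` are made up for this illustration.

```python
import re

def word_offsets(text):
    """Return (word, start_char, end_char) for every word, end exclusive."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r'\b\w+\b', text)]

def char_span_to_word_span(text, answer_start, answer_text):
    """Map a character-level answer span to (first_word, last_word + 1) indices."""
    answer_end = answer_start + len(answer_text)
    # keep every word that overlaps the answer's character range
    hits = [i for i, (_, s, e) in enumerate(word_offsets(text))
            if s < answer_end and e > answer_start]
    if not hits:
        return (-1, -1)  # answer not found at the word level
    return (hits[0], hits[-1] + 1)

sentence = "the normans were in normandy in the 10th and 11th centuries"
answer = "10th and 11th centuries"
start, end = char_span_to_word_span(sentence, sentence.index(answer), answer)
words = [w for w, _, _ in word_offsets(sentence)]
print(words[start:end])  # ['10th', 'and', '11th', 'centuries']
```

Working from character offsets avoids ambiguity when the first word of the answer also appears earlier in the sentence.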

In [None]:
import re 

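def find_ans_in_sentence(sentence, ans):
    #word-level index where the answer's words first occur in the sentence, or -1
    #O(n*m) but it dont matter 
    ans_split = re.findall(r'\b\w+\b', ans)
    sentence_split = re.findall(r'\b\w+\b', sentence)
    n = len(ans_split)
    if n == 0:
        return -1
    #match the whole answer, not just its first word
    for i in range(len(sentence_split) - n + 1):
        if sentence_split[i:i+n] == ans_split:
            return i
    return -1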

def locate_context(context, ans_obj):
    ans_start = ans_obj['answer_start'][0]
    ans_text = ans_obj['text'][0].lower()

    # splitting on '.' drops the period, so add 1 back to each length when accumulating offsets 
    c = context.split('.')
    
    s_lengths = [0]+[len(s)+1 for s in c]
    cumsum = np.cumsum(s_lengths)
    
    # pick the sentence whose character range contains the answer start
    reduced_context = c[0].strip()  # fallback so the name is always bound
    for i in range(1,len(cumsum)):
        if ans_start>=cumsum[i-1] and ans_start<cumsum[i]:
            reduced_context =  c[i-1].strip()
            break
        
    #now find word level position of the answer start in this
    i = find_ans_in_sentence(reduced_context, ans_text)

    # end index is exclusive; count answer words the same way the sentence is tokenized
    return reduced_context, (i, i+len(re.findall(r'\b\w+\b', ans_text)))

In [None]:
# the new new preprocessing function 
def preprocess_function(examples):
    #basic 
    questions = [q.lower() for q in examples["question"]]
    contexts = [c.lower() for c in examples["context"]]
    answers = examples["answers"]
    
    #store position for logits to inform model training later 
    start_positions = []
    end_positions = []
    conjoined = []
    reduced_contexts = []
    inputs = {}
    
    #util 
    re_split = lambda x: re.findall(r'\b\w+\b', x)
    
    #iterate over data
    for i in range(len(questions)):
        answer = answers[i]

        if len(answer["answer_start"])>0:
            #1. get relevant sentence from full context 
            context = contexts[i]
            
            rc, start_end = locate_context(context, answer)
            
            if start_end[0] == -1:
                start_end = (0,0)
                
            reduced_contexts.append(rc)
            
            start_positions.append(start_end[0])
            end_positions.append(start_end[1])
            
            #2. prep the question-context pair for training later 
            conjoined.append(bert_like_input(re_split(questions[i]), re_split(rc)))
            
        else:
            # unanswerable question: ad hoc choice of the first sentence, with the span at position 0 
            rc = contexts[i].split('.')[0] 
            reduced_contexts.append(rc)
            
            start_positions.append(0)
            end_positions.append(0)
            
            conjoined.append(bert_like_input(re_split(questions[i]), re_split(rc)))

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    inputs["reduced_context"] = reduced_contexts
    inputs["pairs"] = conjoined 
    
    return inputs

In [None]:
inputs = test.map(preprocess_function, batched=True)


In [None]:
train_data = TorchDataset(inputs)
train_loader = DataLoader(train_data, batch_size = 8, shuffle = True, collate_fn=custom_collate)

In [None]:
#make sure im not crazy 
i = 1

sample = inputs[i]
s = sample['start_positions']
e = sample['end_positions']

#decode the pairs 
pairs = sample['pairs']

decoded = [model_wiki.most_similar(np.array(v))[0][0] for v in pairs]

#check the answer is where it should be 
print(sample['question'])
print(sample['answers']['text'][0])
print(decoded[s:e])

When were the Normans in Normandy?
10th and 11th centuries
['10th', 'and', '11th', 'centuries']


In [None]:
model = Net(5, 50, 10)

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm import tqdm 
import torch 

device = "cuda" if torch.cuda.is_available() else "cpu"

optimizer = AdamW(model.parameters(), lr=3e-5)

num_epochs = 2
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

model.to(device)
model.train()

Net(
  (multi_attn_block): Sequential(
    (0): MultiHeadSA(
      (heads): ModuleList(
        (0-4): 5 x SA(
          (key): Linear(in_features=50, out_features=10, bias=True)
          (query): Linear(in_features=50, out_features=10, bias=True)
          (value): Linear(in_features=50, out_features=10, bias=True)
        )
      )
      (linear): Linear(in_features=50, out_features=50, bias=True)
    )
    (1): ReLU()
    (2): MultiHeadSA(
      (heads): ModuleList(
        (0-4): 5 x SA(
          (key): Linear(in_features=50, out_features=10, bias=True)
          (query): Linear(in_features=50, out_features=10, bias=True)
          (value): Linear(in_features=50, out_features=10, bias=True)
        )
      )
      (linear): Linear(in_features=50, out_features=50, bias=True)
    )
    (3): ReLU()
    (4): MultiHeadSA(
      (heads): ModuleList(
        (0-4): 5 x SA(
          (key): Linear(in_features=50, out_features=10, bias=True)
          (query): Linear(in_features=50, out_f

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    for i,batch in enumerate(train_loader):
        input_embeddings = torch.tensor(batch['pairs'], device = device)
        start_positions = torch.tensor(batch['start_positions'], device = device)
        end_positions = torch.tensor(batch['end_positions'], device = device)

        #get the logits for answer start and end spans 
        out = model(input_embeddings)
        start_logits = out[:,:,0]
        end_logits = out[:,:,1]

        #compute loss (seems like it works??)
        loss_start = F.cross_entropy(start_logits, start_positions)
        loss_end = F.cross_entropy(end_logits, end_positions)

        loss = loss_start + loss_end
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        
        #every 50 steps, report the loss averaged over the start and end heads
        if i%50 == 0:
            print(loss.item()/2)


  0%|          | 0/32580 [00:00<?, ?it/s]

4.866694450378418
4.79029655456543
4.122103214263916
4.429520606994629
4.16649866104126
3.8428258895874023
3.825674533843994
4.068503379821777
3.477278232574463
3.812932014465332
3.6742162704467773
3.451906442642212
4.063449382781982
3.4353177547454834
3.7646450996398926
3.953007221221924
3.5493903160095215
3.292992115020752
3.622032642364502
3.7955923080444336
3.728724956512451
3.854045867919922
3.7056798934936523
3.718621015548706
3.7689075469970703
3.4656453132629395


KeyboardInterrupt: ignored