## **BERT Fine-tuning and Prediction on SQuAD 2.0**

---



In [1]:
!pip install transformers==4.25.1

Collecting transformers==4.25.1
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.25.1)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.41.2
    Uninstalling transformers-4.41.2:
      Successfully uninstalled transformers-4.41.2
Successfully installed tokenizers-0.13.3 transformers-4.25.1


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

**Download the SQuAD 2.0 Dataset**

In [3]:

SQuAD = pd.read_json('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
SQuAD.head()

Unnamed: 0,version,data
0,v2.0,"{'title': 'Beyoncé', 'paragraphs': [{'qas': [{..."
1,v2.0,"{'title': 'Frédéric_Chopin', 'paragraphs': [{'..."
2,v2.0,{'title': 'Sino-Tibetan_relations_during_the_M...
3,v2.0,"{'title': 'IPod', 'paragraphs': [{'qas': [{'qu..."
4,v2.0,{'title': 'The_Legend_of_Zelda:_Twilight_Princ...


### Data cleaning

In [4]:
del SQuAD["version"]

In [5]:
cols = ["text", "question", "answer"]

comp_list = []
for index, row in SQuAD.iterrows():
    for paragraph in row["data"]["paragraphs"]:
        for qa in paragraph["qas"]:
            # Unanswerable SQuAD 2.0 questions have an empty "answers" list
            answer = qa["answers"][0]["text"] if qa["answers"] else ""
            temp_list = [paragraph["context"], qa["question"], answer]
        # NOTE: this append runs once per paragraph, so only the last
        # question of each paragraph is kept. Indent it into the inner
        # loop to keep every question.
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols)
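
The loop above calls `comp_list.append` once per paragraph, so each paragraph contributes only its last question. If the goal is one row per question, a minimal sketch could look like the following. The helper name `flatten_squad` and the tiny `sample` structure are illustrative assumptions, not part of the original notebook or of the real dataset.

```python
def flatten_squad(articles):
    """Return one (context, question, answer) tuple per question."""
    rows = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable SQuAD 2.0 questions have an empty "answers" list
                answer = qa["answers"][0]["text"] if qa["answers"] else ""
                # Appending inside the innermost loop keeps every question
                rows.append((paragraph["context"], qa["question"], answer))
    return rows


# Hand-made sample in the SQuAD layout (hypothetical, not real data)
sample = [{
    "title": "Example",
    "paragraphs": [{
        "context": "The sky is blue.",
        "qas": [
            {"question": "What colour is the sky?",
             "answers": [{"text": "blue", "answer_start": 11}]},
            {"question": "What colour is the sea?", "answers": []},
        ],
    }],
}]

rows = flatten_squad(sample)
print(len(rows))
```

With the real data, something like `pd.DataFrame(flatten_squad(SQuAD["data"]), columns=cols)` should produce the full table, which will be considerably larger than the per-paragraph version.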

In [6]:
new_df.to_csv("SQuAD_data.csv", index=False)

In [8]:
data = pd.read_csv("SQuAD_data.csv")
data.head()

Unnamed: 0,text,question,answer
0,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,What was the name of Beyoncé's first solo album?,Dangerously in Love
1,Following the disbandment of Destiny's Child i...,What is the name of Beyoncé's alter-ego?,Sasha Fierce
2,"A self-described ""modern-day feminist"", Beyonc...",What magazine named Beyoncé as the most powerf...,Forbes
3,"Beyoncé Giselle Knowles was born in Houston, T...",Beyoncé was raised in what religion?,Methodist
4,Beyoncé attended St. Mary's Elementary School ...,What choir did Beyoncé sing in for two years?,St. John's United Methodist Church


In [9]:
print("Number of question-answer pairs: ", len(data))

Number of question-answer pairs:  19035


### Loading the model and tokenizer


In [10]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


In [11]:
random_num = np.random.randint(0,len(data))

question = data["question"][random_num]
text = data["text"][random_num]

In [12]:
print(question, "\n", text)

What notable accidental fossil fuel discharge occurred at Newtown Creek? 
 Newtown Creek, a 3.5-mile (6-kilometer) a long estuary that forms part of the border between the boroughs of Brooklyn and Queens, has been designated a Superfund site for environmental clean-up and remediation of the waterway's recreational and economic resources for many communities. One of the most heavily used bodies of water in the Port of New York and New Jersey, it had been one of the most contaminated industrial sites in the country, containing years of discarded toxins, an estimated 30 million US gallons (110,000 m3) of spilled oil, including the Greenpoint oil spill, raw sewage from New York City's sewer system, and other accumulation.


In [13]:
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 153 tokens.


In [14]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)

for token, token_id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token, token_id))

[CLS]        101
what       2,054
notable    3,862
accidental  17,128
fossil    10,725
fuel       4,762
discharge  11,889
occurred   4,158
at         2,012
newtown   26,382
creek      3,636
?          1,029
[SEP]        102
newtown   26,382
creek      3,636
,          1,010
a          1,037
3          1,017
.          1,012
5          1,019
-          1,011
mile       3,542
(          1,006
6          1,020
-          1,011
kilometer  20,595
)          1,007
a          1,037
long       2,146
estuary   18,056
that       2,008
forms      3,596
part       2,112
of         1,997
the        1,996
border     3,675
between    2,090
the        1,996
boroughs  21,413
of         1,997
brooklyn   6,613
and        1,998
queens     8,603
,          1,010
has        2,038
been       2,042
designated   4,351
a          1,037
super      3,565
##fu      11,263
##nd       4,859
site       2,609
for        2,005
environmental   4,483
clean      4,550
-          1,011
up         2,039
and        1,998
re 

In [15]:
#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print(sep_idx)
#number of tokens in segment A - question
num_seg_a = sep_idx+1
print(num_seg_a)
#number of tokens in segment B - text
num_seg_b = len(input_ids) - num_seg_a
print(num_seg_b)
segment_ids = [0]*num_seg_a + [1]*num_seg_b
print(segment_ids)
assert len(segment_ids) == len(input_ids)

12
13
140
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
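
The segment-ID construction above can be packaged as a small pure-Python helper. This is only a sketch: the name `build_segment_ids` is hypothetical, the default `sep_token_id=102` is taken from the token listing earlier in this notebook, and the ids in `example_ids` other than 101, 102, 2054 and 1029 are placeholders.

```python
def build_segment_ids(input_ids, sep_token_id=102):
    """Return 0 for the question (through the first [SEP]) and 1 for the rest."""
    sep_idx = input_ids.index(sep_token_id)   # position of the first [SEP]
    num_seg_a = sep_idx + 1                   # question tokens, [CLS] and [SEP] included
    num_seg_b = len(input_ids) - num_seg_a    # context tokens and the final [SEP]
    return [0] * num_seg_a + [1] * num_seg_b


# Layout: [CLS] question tokens [SEP] context tokens [SEP]
example_ids = [101, 2054, 1029, 102, 2070, 3793, 102]
segment_ids = build_segment_ids(example_ids)
print(segment_ids)
```

Calling the tokenizer directly, as in `tokenizer(question, text)`, also returns a `token_type_ids` entry, so building the list by hand is mainly useful for seeing what the model receives.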


In [16]:
#input_ids represent the tokenized question and text
#segment_ids (passed as token_type_ids) distinguish the question from the text
output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
#print(output.start_logits, output.end_logits)

In [17]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
#print(answer_start, answer_end)
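
Taking the two `argmax` values independently can yield an end position that comes before the start. One common alternative is to search for the highest-scoring pair with `start <= end`. The sketch below works on plain Python lists; the name `best_span`, the `max_answer_len` cap and the toy scores are assumptions made for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start + end score with start <= end."""
    best_start, best_end = 0, 0
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_start, best_end, best_score = start, end, score
    return best_start, best_end, best_score


# Toy scores where the independent argmaxes (start=2, end=1) are inconsistent
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores = [0.0, 6.0, 0.2, 4.0, 0.1]
print(best_span(start_scores, end_scores))
```

With the model output, the lists could be obtained via `output.start_logits[0].tolist()` and `output.end_logits[0].tolist()`.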

In [18]:
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
else:
    #assign a fallback so the print below never sees an undefined answer
    answer = "I am unable to find the answer to this question. Can you please ask another question?"

print("Text:\n{}".format(text.capitalize()))
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))

Text:
Newtown creek, a 3.5-mile (6-kilometer) a long estuary that forms part of the border between the boroughs of brooklyn and queens, has been designated a superfund site for environmental clean-up and remediation of the waterway's recreational and economic resources for many communities. one of the most heavily used bodies of water in the port of new york and new jersey, it had been one of the most contaminated industrial sites in the country, containing years of discarded toxins, an estimated 30 million us gallons (110,000 m3) of spilled oil, including the greenpoint oil spill, raw sewage from new york city's sewer system, and other accumulation.

Question:
What notable accidental fossil fuel discharge occurred at newtown creek?

Answer:
Green ##point oil spill.


### Code to join the broken words

In [None]:
answer = tokens[answer_start]

for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]
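
The same joining logic can be written as a standalone function over any list of WordPiece tokens, which makes it easy to check without loading the model. The name `join_wordpieces` is a hypothetical helper, not part of the `transformers` API.

```python
def join_wordpieces(tokens):
    """Merge '##'-prefixed continuation pieces into the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


print(join_wordpieces(["green", "##point", "oil", "spill"]))
print(join_wordpieces(["super", "##fu", "##nd", "site"]))
```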

### Wrapping the steps in a function

In [None]:
def question_answer(question, text):

    #tokenize question and text in ids as a pair
    input_ids = tokenizer.encode(question, text)

    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)

    #number of tokens in segment A - question
    num_seg_a = sep_idx+1

    #number of tokens in segment B - text
    num_seg_b = len(input_ids) - num_seg_a

    #list of 0s and 1s
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    assert len(segment_ids) == len(input_ids)

    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

    #reconstructing the answer
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)

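    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                #subword piece: glue it onto the previous token
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
    else:
        #end predicted before start: no valid span
        answer = "Unable to find the answer to your question."

    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."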

#     print("Text:\n{}".format(text.capitalize()))
#     print("\nQuestion:\n{}".format(question.capitalize()))
    print("\nAnswer:\n{}".format(answer.capitalize()))

In [None]:
text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""
question = "Where was the Auction held?"

question_answer(question, text)


Answer:
Hard rock cafe in new york ' s times square
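
Because the answer is rebuilt by joining tokens with spaces, punctuation ends up detached, as in `york ' s` above. A rough clean-up heuristic is sketched below. The function name `tidy_answer` and the two regular expressions are assumptions for illustration and only cover common English cases; `tokenizer.decode` on the answer's slice of `input_ids` is another option.

```python
import re


def tidy_answer(text):
    """Heuristically re-attach punctuation that token joining separated."""
    # Re-attach common contractions and possessives: "york ' s" -> "york's"
    text = re.sub(r" ' (s|t|re|ve|ll|d|m)\b", r"'\1", text)
    # Remove the space before closing punctuation: "square ." -> "square."
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text


print(tidy_answer("hard rock cafe in new york ' s times square"))
```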
