In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer



In [2]:
coqa = pd.read_json('G:/ajaykumaretw/QuestionAnswerUsingNPLBert/coqa-train-v1.0.json')
coqa.head()

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."


Data Cleaning

In [3]:
del coqa["version"]

We attach the linked story to every question-answer pair, so that each row carries its own context.

In [4]:
#required columns in our dataframe
cols = ["text","question","answer"]
#list of lists to create our dataframe
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols) 
#saving the dataframe to csv file for further loading
new_df.to_csv("CoQA_data.csv", index=False)
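The flattening step above can be tried on a tiny hand-made record without loading the full dataset. This is a minimal sketch: `sample_records` and `flatten_records` are hypothetical names, and the story and question strings are invented. It only assumes what the loop above assumes, namely that "questions" and "answers" are parallel lists.

```python
# Hypothetical miniature record shaped like one entry of coqa["data"];
# the story, questions and answers are made up for illustration.
sample_records = [
    {
        "story": "Tom has a red bike. He rides it to school.",
        "questions": [
            {"input_text": "What does Tom have?"},
            {"input_text": "What color is it?"},
        ],
        "answers": [
            {"input_text": "a bike"},
            {"input_text": "red"},
        ],
    }
]


def flatten_records(records):
    """Return one (text, question, answer) tuple per question-answer pair."""
    rows = []
    for record in records:
        # questions and answers are parallel lists, so zip pairs them up
        for q, a in zip(record["questions"], record["answers"]):
            rows.append((record["story"], q["input_text"], a["input_text"]))
    return rows


rows = flatten_records(sample_records)
for story, q_text, a_text in rows:
    print(q_text, "->", a_text)
```

Each tuple repeats the story, which is why the resulting dataframe has many more rows than the original file has stories.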

Data Loading from Local CSV File

In [5]:
data = pd.read_csv("CoQA_data.csv")
data.head()

Unnamed: 0,text,question,answer
0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project


In [6]:
print("Number of question and answers: ", len(data))

Number of question and answers:  108647


Building the Chatbot

The best part about using these pre-trained models is that you can load the model and its tokenizer in just two lines of code. For tasks like text classification, we need to fine-tune BERT on our own dataset. For question answering, however, we can use an already fine-tuned model as-is and still get decent results, even when our text comes from a completely different domain. Here we use a BERT model that has been fine-tuned on the SQuAD benchmark.

In [7]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Asking a Question

In [8]:
random_num = np.random.randint(0,len(data))
question = data["question"][random_num]
text = data["text"][random_num]

In [9]:
question

'Where did she pass?'

In [10]:
text

'NEW YORK (CNN) -- After spending nearly 28 years in an irreversible coma, heiress and socialite Martha "Sunny" von Bulow died Saturday in a New York nursing home, according to a family statement. She was 76. \n\nSunny von Bulow is pictured during her 1957 wedding to Prince Alfred von Auersperg. \n\nVon Bulow was subject of one of the nation\'s most sensational criminal cases during the 1980s. \n\nHer husband, Claus, was accused of trying to kill her with an overdose of insulin, which prosecutors alleged sent her into the coma. \n\nHe was convicted of making two attempts on her life, but the conviction was overturned on appeal. He was acquitted in a second trial. \n\nHis retrial in 1985 received national attention. \n\n"We were blessed to have an extraordinarily loving and caring mother," said the statement from Von Bulow\'s three children -- Annie Laurie "Ala" Isham, Alexander von Auersperg and Cosima Pavoncelli -- released by a spokeswoman. "She was especially devoted to her many fri

Let’s tokenize the question and text as a pair.

In [11]:
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 354 tokens.


To look at what our tokenizer is doing, let’s just print out the tokens and their IDs.

In [12]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)
#token_id instead of id, so that we do not shadow the Python builtin id()
for token, token_id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token, token_id))

[CLS]        101
where      2,073
did        2,106
she        2,016
pass       3,413
?          1,029
[SEP]        102
new        2,047
york       2,259
(          1,006
cnn       13,229
)          1,007
-          1,011
-          1,011
after      2,044
spending   5,938
nearly     3,053
28         2,654
years      2,086
in         1,999
an         2,019
ir        20,868
##re       2,890
##vers    14,028
##ible     7,028
coma      16,571
,          1,010
heiress   20,020
and        1,998
social     2,591
##ite      4,221
martha     9,246
"          1,000
sunny     11,559
"          1,000
von        3,854
bu        20,934
##low      8,261
died       2,351
saturday   5,095
in         1,999
a          1,037
new        2,047
york       2,259
nursing    8,329
home       2,188
,          1,010
according   2,429
to         2,000
a          1,037
family     2,155
statement   4,861
.          1,012
she        2,016
was        2,001
76         6,146
.          1,012
sunny     11,559
von        3

BERT has a unique way of processing the tokenized inputs. In the output above we can see two special tokens, [CLS] and [SEP]:
 - [CLS] stands for classification. It represents the whole input for sentence-level classification and is used when we are classifying.
 - [SEP] separates the two pieces of text.

There are two [SEP] tokens in the full input, one after the question and another after the text.
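To make the layout concrete, here is a minimal sketch that assembles a question/text pair by hand. `build_pair` is a hypothetical helper, not part of the Transformers library, and the token lists are typed in manually instead of coming from the tokenizer.

```python
def build_pair(question_tokens, text_tokens):
    """Lay out a question/text pair as [CLS] question [SEP] text [SEP]."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]


# Hand-typed tokens; a real run would get these from the tokenizer
question_tokens = ["where", "did", "she", "pass", "?"]
text_tokens = ["she", "died", "in", "a", "nursing", "home", "."]

pair = build_pair(question_tokens, text_tokens)
first_sep = pair.index("[SEP]")  # position of the [SEP] that closes the question
print(pair)
print("First [SEP] at index", first_sep)
```

With a five-token question, the first [SEP] lands at index 6: one [CLS] token plus five question tokens come before it.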

Apart from the "Token Embeddings", BERT internally also uses "Segment Embeddings" and "Position Embeddings". Segment embeddings help BERT in differentiating a question from the text. 

In practice, we use a vector of 0s for tokens that belong to sentence 1 (the question) and a vector of 1s for tokens that belong to sentence 2 (the text).

Position embeddings help in specifying the position of words in the sequence. All these embeddings are fed to the input layer.
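In BERT the three embeddings are added element-wise for each token before they enter the first layer. The sketch below illustrates that sum with toy two-dimensional lookup tables. All the numbers and the names `token_table`, `segment_table`, `position_table` and `input_embeddings` are invented for illustration; the real model uses learned tables with far more dimensions and also applies layer normalization after the sum, which is omitted here.

```python
# Toy 2-dimensional lookup tables; every number here is made up
token_table = {
    "[CLS]": [0.1, 0.0],
    "hi": [0.5, 0.2],
    "[SEP]": [0.0, 0.1],
    "there": [0.3, 0.4],
}
segment_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
position_table = [[0.01 * p, 0.02 * p] for p in range(8)]


def input_embeddings(tokens, segment_ids):
    """Sum token, segment and position vectors for every input position."""
    rows = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        tok = token_table[token]
        seg = segment_table[segment]
        pos = position_table[position]
        # element-wise sum of the three vectors
        rows.append([t + s + p for t, s, p in zip(tok, seg, pos)])
    return rows


toy_tokens = ["[CLS]", "hi", "[SEP]", "there", "[SEP]"]
toy_segments = [0, 0, 0, 1, 1]
vectors = input_embeddings(toy_tokens, toy_segments)
for token, vector in zip(toy_tokens, vectors):
    print(token, vector)
```

Note that the two [SEP] tokens end up with different vectors even though they share a token embedding, because their segment and position differ.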

The Transformers library can create the segment IDs (token_type_ids) on its own using PreTrainedTokenizer.encode_plus(). But we can also create our own. For that, we just need to specify a 0 or 1 for each token.

In [13]:
#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)

SEP token index:  6


In [14]:
#number of tokens in segment A (question) - this will be one more than the sep_idx as the index in Python starts from 0
num_seg_a = sep_idx+1
print("Number of tokens in segment A: ", num_seg_a)
#number of tokens in segment B (text)
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B: ", num_seg_b)
#creating the segment ids
segment_ids = [0]*num_seg_a + [1]*num_seg_b
#making sure that every input token has a segment id
assert len(segment_ids) == len(input_ids)

Number of tokens in segment A:  7
Number of tokens in segment B:  347


Let’s now feed this to our model.

In [15]:
#token input_ids to represent the input and token segment_ids to differentiate our segments - question and text
#torch.no_grad() skips gradient tracking, which we do not need for inference
with torch.no_grad():
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

In [16]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)

print("\nQuestion:\n{}".format(question.capitalize()))
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
    print("\nAnswer:\n{}.".format(answer.capitalize()))
else:
    #answer is only printed when a valid span exists, so it is never used undefined
    print("I am unable to find the answer to this question. Can you please ask another question?")


Question:
Where did she pass?

Answer:
New york nursing home.
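Taking the two argmaxes independently, as in the cell above, can occasionally put the end before the start. One common alternative is to search for the best pair with start <= end. The following is a sketch of that idea on plain Python lists, not what the notebook itself does; `best_span` and its `max_answer_len` parameter are hypothetical names, and the logits are made up.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with start <= end and the highest summed score."""
    best = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # only consider ends at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up logits where the independent argmaxes disagree on order:
# the best start is index 2, but the best end is index 1
toy_start_logits = [0.1, 0.2, 3.0, 0.5, 2.5]
toy_end_logits = [0.3, 4.0, 0.2, 2.0, 0.1]

span = best_span(toy_start_logits, toy_end_logits)
print(span)
```

With real model output, the logits would first be converted to lists, for example with output.start_logits[0].tolist().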


In [17]:
answer = tokens[answer_start]
for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]
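The loop above glues WordPiece continuation tokens (the ones starting with "##") back onto the previous token. The same logic can be checked on a hand-typed token list taken from the tokenizer output shown earlier; `merge_wordpieces` is a hypothetical helper name used only for this sketch.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words, attaching '##' pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # continuation piece: strip the marker and append to the last word
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


merged = merge_wordpieces(["an", "ir", "##re", "##vers", "##ible", "coma"])
print(merged)
```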

In [18]:
def question_answer(question, text):
    
    #tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)    
    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    #print(tokens)
    
    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    #number of tokens in segment A (question)
    num_seg_a = sep_idx+1
    #number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a
    
    #list of 0s and 1s for segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)
    
    #model output using input_ids and segment_ids
    #torch.no_grad() skips gradient tracking, which inference does not need
    with torch.no_grad():
        output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    
    #reconstructing the answer
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)
    #default message, so that answer is defined even when the end comes before the start
    answer = "Unable to find the answer to your question."
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
                
    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."
    
    print("\nPredicted answer:\n{}".format(answer.capitalize()))

In [19]:
text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""

In [20]:
text

'New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star\'s famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York\'s Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson\'s career, came from more than 30 fans, associates and family members, who contacted Julien\'s Auctions to sell their gifts and mementos of the singer. Jackson\'s flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during "Motown 25," an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter "Clyde" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson\'s autograph at the time, but Jackson gave h

In [21]:
question = "Where was the Auction held?"

In [22]:
question_answer(question, text)


Predicted answer:
Hard rock cafe in new york ' s times square


Not bad at all. In fact, our BERT model gave a detailed response: it returned the venue together with its location.
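The predicted answer still carries tokenizer spacing, such as the " ' s" in "new york ' s". A small post-processing step can tidy this up. The sketch below is a heuristic with a hypothetical helper name, `tidy_answer`; it only handles apostrophes between word characters and spaces before closing punctuation, and it does not try to fix quoted phrases.

```python
import re


def tidy_answer(answer):
    """Remove spacing that token-by-token joining leaves around punctuation."""
    # "york ' s" -> "york's": drop the spaces around an apostrophe between word characters
    answer = re.sub(r"(\w) ' (\w)", r"\1'\2", answer)
    # "history , and law" -> "history, and law": drop the space before closing punctuation
    answer = re.sub(r" ([,.!?;:])", r"\1", answer)
    return answer


tidied = tidy_answer("hard rock cafe in new york ' s times square")
print(tidied)
```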

In [23]:
text='''My house
My family lives in a house.
Our house is very nice.
It has two bedrooms.
We cook and eat in the kitchen.
We watch TV in the living room.
My favorite room is my bedroom.
I like to read books and do my homework in my bedroom.
'''

In [24]:
question='My favorite room is?'

In [25]:
question_answer(question, text)


Predicted answer:
My bedroom


In [26]:
question='My family lives in..'

In [27]:
question_answer(question, text)


Predicted answer:
A house


In [28]:
question=' Our house has two'

In [29]:
question_answer(question, text)


Predicted answer:
It has two bedrooms


In [30]:
question='We cook and eat in the'

In [31]:
question_answer(question, text)


Predicted answer:
The kitchen


In [37]:
question='I like to ____________ and __________ in my bedroom'

In [38]:
question_answer(question, text)


Predicted answer:
Read books and do my homework in my bedroom


In [34]:
question='We watch TV in the'

In [35]:
question_answer(question, text)


Predicted answer:
The living room
