###  Name - Vijay Iyer (vsiyer)
###  Course - DSCI 590 (INTRO TO NLP FOR DATA SCIENCE)
### Title  - Question Answering Chatbot


## Importing required libraries

In [1]:
import requests
import json
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForQuestionAnswering
import random as rnd
import colorama
from colorama import Fore, Style, Back
from termcolor import colored, cprint
import os
import warnings
warnings.filterwarnings('ignore')
import torch

To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html


# PART 1 - Preprocessing SQUAD Dataset
 - We will be working with the SQuAD v1.1 dataset. SQuAD is a collection of more than 100,000 crowdsourced question-answer pairs, and it is widely used as a benchmark for evaluating model performance on the question answering NLP task
 - The dataset comes in 2 files: a ***train .json*** file and a ***dev .json*** file
 - The train dataset has **87,599 question-answer pairs while the dev dataset has 10,570**
 - The train dataset contains **442 different articles** (from which the 87,599 context-question-answer combinations have been extracted). Each article has the form: title, [list of **'context'** paragraphs, each with a list of question-answer pairs for that 'context'] (the dev set can list several acceptable answers per question)
 - The 'source' is the pair of context and question, and the answer is the 'target' for that pair. A machine learning model trains on context and question data with the answer as target, so that it can then answer questions about unseen articles/paragraphs
 - To train a chatbot, we need each context-question-answer combination as a separate record
 - For this assignment, however, we will do the preprocessing and transforms on the **'context'** column, since that is the one holding unstructured text data
 - We will download the json content using the requests module
 - After downloading the json files, we will convert the content to a suitable dataframe format from which training and evaluation can be done. ---> **url = https://rajpurkar.github.io/SQuAD-explorer/dataset/**
 - The steps for Part 1 are shown below
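
As a rough illustration of the "one record per context-question-answer combination" idea, here is a minimal sketch that flattens the nested layout with plain loops. The `sample` dictionary is hand-written to mimic the SQuAD v1.1 structure (it is not real data), and `flatten_squad` is a hypothetical helper name, not part of this notebook.

```python
# Hand-written sample that mimics the SQuAD v1.1 layout (not real data).
sample = {
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "Paris is the capital of which country?",
                            "answers": [{"text": "France", "answer_start": 24}],
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(squad_json):
    """Return one dict per (context, question, answer) combination."""
    rows = []
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    rows.append(
                        {
                            "id": qa["id"],
                            "title": article["title"],
                            "question": qa["question"],
                            "context": paragraph["context"],
                            "answer_start": ans["answer_start"],
                            "text": ans["text"],
                        }
                    )
    return rows


rows = flatten_squad(sample)
for row in rows:
    # answer_start is a character offset into the context
    start = row["answer_start"]
    print(row["id"], "->", row["context"][start:start + len(row["text"])])
```

Because `answer_start` is a character offset, slicing the context with it is a quick sanity check that a record was flattened correctly.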
 

In [2]:
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
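# the base url already ends with "/", so plain concatenation is enough
# (os.path.join would insert a backslash on Windows and break the URL)
train_url = url + "train-v1.1.json"
dev_url = url + "dev-v1.1.json"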

In [3]:
r = requests.get(train_url)
train_json = json.loads(r.content)

In [4]:
r = requests.get(dev_url)
dev_json = json.loads(r.content)

### Exploratory data analysis on train and dev json files

In [5]:
print(len(train_json['data'])) #  -- number of titles in train dataset

442


In [6]:
print(len(dev_json['data'])) # -- number of titles in dev dataset

48


#### Print a sample record to see the structure of the data

In [7]:
#print(train_json['data'][0])

There are 442 records (articles) in total in the train set.
Each record has a title and a list of paragraphs. Each paragraph consists of:
1. Context - the article text on which the questions will be asked
2. Questions about that context
3. One or more answers for each question - these are the ground-truth (correct) answers for the question

## Defining Functions to convert the text to dataframe with format
 - id, title, Question, Context, answer_start (since answer can be found within context)

In [8]:
def squad_json_to_dataframe_train(train_json, record_path = ['data','paragraphs','qas','answers'],
                           verbose = 1):
    """
    train_json: parsed SQuAD train json (a dict, as returned by json.loads).
    record_path: path to the deepest level in the json; default value is
    ['data','paragraphs','qas','answers']
    verbose: 0 to suppress progress messages; default is 1
    """
      
    file = train_json
    if verbose:
        print("processing...")
    # parsing different level's in the json file
    js = pd.json_normalize(file , record_path )
    m = pd.json_normalize(file, record_path[:-1] )
    r = pd.json_normalize(file,record_path[:-2])
    
    #combining it into single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
    ndx  = np.repeat(m['id'].values,m['answers'].str.len())
    m['context'] = idx
    js['q_idx'] = ndx
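    # pass axis by keyword: a positional axis is deprecated and rejected by recent pandas versions
    main = pd.concat([m[['id','question','context']].set_index('id'), js.set_index('q_idx')], axis=1, sort=False).reset_index()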
    main['c_id'] = main['context'].factorize()[0]
    if verbose:
        print("shape of the dataframe is {}".format(main.shape))
        print("Done")
    return main

In [9]:
def squad_json_to_dataframe_dev(dev_json, record_path = ['data','paragraphs','qas','answers'],
                           verbose = 1):
    """
    dev_json: parsed SQuAD dev json (a dict, as returned by json.loads).
    record_path: path to the deepest level in the json; default value is
    ['data','paragraphs','qas','answers']
    verbose: 0 to suppress progress messages; default is 1
    """
       
    file = dev_json
    if verbose:
        print("processing...")
    # parsing different level's in the json file
    js = pd.json_normalize(file , record_path )
    m = pd.json_normalize(file, record_path[:-1] )
    r = pd.json_normalize(file,record_path[:-2])
    
    #combining it into single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
#     ndx  = np.repeat(m['id'].values,m['answers'].str.len())
    m['context'] = idx
#     js['q_idx'] = ndx
    main = m[['id','question','context','answers']].set_index('id').reset_index()
    main['c_id'] = main['context'].factorize()[0]
    if verbose:
        print("shape of the dataframe is {}".format(main.shape))
        print("Done")
    return main
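
The key step in both functions is the repeat: each context is repeated once per question it owns, so that the flat list of questions lines up row by row with its context. The following is a minimal pure-Python sketch of that alignment, intended as an analogue of calling `np.repeat` with an array of counts. The helper `repeat_each` and the toy values are hypothetical and only for illustration.

```python
from itertools import chain, repeat


def repeat_each(values, counts):
    """Repeat values[i] counts[i] times, keeping the original order."""
    return list(chain.from_iterable(repeat(v, n) for v, n in zip(values, counts)))


# Two paragraphs: the first has 2 questions, the second has 3.
contexts = ["context A", "context B"]
questions_per_context = [2, 3]

aligned = repeat_each(contexts, questions_per_context)
print(aligned)
# ['context A', 'context A', 'context B', 'context B', 'context B']
```

The result has one entry per question, which is why it can be assigned directly as a new column of the question-level dataframe.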

### Convert json to dataframe

In [10]:
record_path = ['data','paragraphs','qas','answers']
train_df = squad_json_to_dataframe_train(train_json,record_path=record_path)
dev_df = squad_json_to_dataframe_dev(dev_json,record_path=record_path)

processing...
shape of the dataframe is (87599, 6)
Done
processing...
shape of the dataframe is (10570, 5)
Done


In [11]:
train_df.head(3)

Unnamed: 0,index,question,context,answer_start,text,c_id
0,5733be284776f41900661182,To whom did the Virgin Mary allegedly appear i...,"Architecturally, the school has a Catholic cha...",515,Saint Bernadette Soubirous,0
1,5733be284776f4190066117f,What is in front of the Notre Dame Main Building?,"Architecturally, the school has a Catholic cha...",188,a copper statue of Christ,0
2,5733be284776f41900661180,The Basilica of the Sacred heart at Notre Dame...,"Architecturally, the school has a Catholic cha...",279,the Main Building,0


In [12]:
dev_df.head(3)

Unnamed: 0,id,question,context,answers,c_id
0,56be4db0acb8001400a502ec,Which NFL team represented the AFC at Super Bo...,Super Bowl 50 was an American football game to...,"[{'answer_start': 177, 'text': 'Denver Broncos...",0
1,56be4db0acb8001400a502ed,Which NFL team represented the NFC at Super Bo...,Super Bowl 50 was an American football game to...,"[{'answer_start': 249, 'text': 'Carolina Panth...",0
2,56be4db0acb8001400a502ee,Where did Super Bowl 50 take place?,Super Bowl 50 was an American football game to...,"[{'answer_start': 403, 'text': 'Santa Clara, C...",0


**Getting the pre-trained BertTokenizer that belongs to the BERT model fine-tuned on the SQuAD dataset**

In [13]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

#### We can see how the input text is tokenized

In [14]:
random_num = np.random.randint(0,len(dev_df))
question = dev_df["question"][random_num]
context = dev_df["context"][random_num]

We can use the **encode_plus** function to tokenize the question-context pair. BERT needs the input as token ids (**input_ids**) together with segment ids (**token_type_ids**); the segment ids let it tell apart the two parts of a pair, such as two sentences or, as here, a question and its context.
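
To make the role of the segment ids concrete, here is a minimal sketch of the standard BERT pair layout when special tokens are added: `[CLS] question [SEP] context [SEP]`, with segment id 0 for the first part and 1 for the second. `build_pair_inputs` is a hypothetical helper that works on already-split tokens; it illustrates the layout only and is not the tokenizer's actual implementation.

```python
def build_pair_inputs(question_tokens, context_tokens):
    """Lay out a question/context pair the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids


pair_tokens, pair_segments = build_pair_inputs(
    ["where", "is", "paris", "?"],
    ["paris", "is", "in", "france", "."],
)
for tok, seg in zip(pair_tokens, pair_segments):
    print(f"{tok:8} {seg}")
```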

In [15]:
input_ids = tokenizer.encode_plus(question, context)
print(input_ids.keys())
print("The input has a total of {} tokens.".format(len(input_ids['input_ids'])))

dict_keys(['input_ids', 'token_type_ids'])
The input has a total of 258 tokens.


Tokens with their ids. **Some of the tokens are printed out with *##* at the beginning**. This is because BERT applies wordpiece tokenization: words are broken into sub-words, which keeps the vocabulary small and allows unseen words to be represented as a sequence of known pieces.
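
The sub-word splitting can be sketched as a greedy longest-match-first search over a vocabulary. The toy vocabulary below is made up for illustration (the real BERT vocabulary has roughly 30,000 entries), and `wordpiece_tokenize` is a hypothetical simplified helper, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split a single word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, an assumption for this example only.
toy_vocab = {"the", "jews", "b", "##las", "##ph", "##eme", "##rs", "liar", "##s"}

print(wordpiece_tokenize("blasphemers", toy_vocab))  # ['b', '##las', '##ph', '##eme', '##rs']
print(wordpiece_tokenize("liars", toy_vocab))        # ['liar', '##s']
```

Only continuation pieces receive the `##` prefix, so the first piece of a word never starts with it.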

In [16]:
tokens = tokenizer.convert_ids_to_tokens(input_ids['input_ids'])
print(' ,'.join(tokens))
## uncomment below line to see all tokens and their ids
# for token, id in zip(tokens, input_ids['input_ids']):
#     print('{:8}{:8,}'.format(token,id))

what ,did ,luther ,try ,to ,do ,for ,the ,jews ,? ,luther ,wrote ,about ,the ,jews ,throughout ,his ,career ,, ,though ,only ,a ,few ,of ,his ,works ,dealt ,with ,them ,directly ,. ,luther ,rarely ,encountered ,jews ,during ,his ,life ,, ,but ,his ,attitudes ,reflected ,a ,theological ,and ,cultural ,tradition ,which ,saw ,jews ,as ,a ,rejected ,people ,guilty ,of ,the ,murder ,of ,christ ,, ,and ,he ,lived ,within ,a ,local ,community ,that ,had ,expelled ,jews ,some ,ninety ,years ,earlier ,. ,he ,considered ,the ,jews ,b ,##las ,##ph ,##eme ,##rs ,and ,liar ,##s ,because ,they ,rejected ,the ,divinity ,of ,jesus ,, ,whereas ,christians ,believed ,jesus ,was ,the ,messiah ,. ,but ,luther ,believed ,that ,all ,human ,beings ,who ,set ,themselves ,against ,god ,were ,equally ,guilty ,. ,as ,early ,as ,151 ,##6 ,, ,he ,wrote ,that ,many ,people ," ,are ,proud ,with ,marvelous ,stupidity ,when ,they ,call ,the ,jews ,dogs ,, ,evil ,##do ,##ers ,, ,or ,whatever ,they ,like ,, ,while ,they

## Model
 - We will be using the HuggingFace Transformers library
 - It provides a PyTorch implementation of BERT
 - It provides pre-trained models for NLP tasks such as sentence classification, question answering, text summarization, etc.

### getting pre-trained models fine-tuned on SQuAD dataset

In [17]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Since the model is **pre-trained** and already fine-tuned on SQuAD, we can use it directly to predict the answer, i.e. to predict the start and end of the answer span within the given context.
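
Taking the best start token and the best end token independently can give an end that comes before the start. One common alternative, sketched here under assumptions, is to search for the pair with the highest combined score subject to start <= end and a maximum answer length. The scores below are made-up numbers, and `best_span` and `max_answer_len` are hypothetical names; this is not what the notebook cells do.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_score + end_score with start <= end."""
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider ends at or after the start, within the length limit
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up scores for a 5-token input.
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores = [4.0, 0.1, 0.2, 3.0, 0.5]

naive = (start_scores.index(max(start_scores)), end_scores.index(max(end_scores)))
print("independent argmax:", naive)                       # (2, 0), end before start
print("joint search:", best_span(start_scores, end_scores))  # (2, 3)
```

The double loop is quadratic in the number of tokens, which is acceptable for inputs of a few hundred tokens; the length limit keeps it cheaper still.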

In [18]:
output = model(torch.tensor([input_ids['input_ids']]),  token_type_ids=torch.tensor([input_ids['token_type_ids']]))

In [19]:
#tokens with highest start and end scores
answer_start = torch.argmax(output[0]) # the token with highest probability of being start of the answer
answer_end = torch.argmax(output[1]) # the token with highest probability of being end of the answer
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
else:
    # assign the message to answer so the print below still works when no valid span is found
    answer = "I am unable to find the answer to this question. Can you please ask another question?"
    
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))


Question:
What did luther try to do for the jews?

Answer:
Convert them to christianity.


A minor modification to the above code is required for better readability. If one of the returned tokens has the form ***##{word}***, we can append the text after the ***##*** to the previous token, since in the wordpiece tokenization performed by BERT a word can never start with a ***##*** token.

In [20]:
answer = tokens[answer_start]
for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]

In [21]:
answer

'convert them to christianity'

# Evaluation
 - We will wrap the above code into a function which takes a passage and a corresponding question about the passage.
 - We will also see how the model does for questions which cannot be answered from the text
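
To go beyond eyeballing individual predictions, a predicted answer can be scored against the ground truth with exact match and token-level F1. The sketch below follows the general recipe used for SQuAD-style evaluation (lowercase, strip punctuation and articles, then compare tokens); the function names are chosen here for illustration and the code is an approximation, not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # multiset intersection counts each shared token at most as often as it appears in both
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("convert them to christianity", "Convert them to Christianity."))
print(round(f1_score("windows explorer to file explorer", "Windows Explorer"), 3))
```

When a question has several reference answers, as in the dev set, a typical choice is to take the maximum score over the references.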

In [22]:
def question_answer(question, passage):
    
    #tokenize question and text as a pair
    inputs = tokenizer.encode_plus(question, passage)
    
    input_ids, token_type_ids = inputs['input_ids'], inputs['token_type_ids']
    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    
    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
    
    #reconstructing the answer
    answer_start = int(torch.argmax(output[0]))
    answer_end = int(torch.argmax(output[1]))
    # default reply; replaced only if a valid span is found
    answer = "Unable to find the answer to your question."
    if answer_end >= answer_start:
        span = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                span += tokens[i][2:]
            else:
                span += " " + tokens[i]
        # a span that starts at [CLS] means the model found no answer in the passage
        if not span.startswith("[CLS]"):
            answer = span

    return answer

### Run the cell below repeatedly to see different context, question, answer combinations


In [23]:
ind = rnd.randrange(len(train_df))
passage = train_df.iloc[ind]['context']
question = train_df.iloc[ind]['question']
ground_truth_ans = train_df.iloc[ind]['text']

answer = question_answer(question = question, passage = passage)
cprint("\n"+question, attrs = ['bold'])
cprint("\n"+passage)
print(colored("\nPredicted answer:", attrs=['bold'])+colored(answer, 'green',attrs=['bold']))
#original answer from the dataset
print(colored("\nOriginal answer:", attrs = ['bold'])+colored(ground_truth_ans, 'green',attrs=['bold']))


How many things did Microsoft alter after Release Preview?

Relatively few changes were made from the Release Preview to the final version; these included updated versions of its pre-loaded apps, the renaming of Windows Explorer to File Explorer, the replacement of the Aero Glass theme from Windows Vista and 7 with a new flat and solid-colored theme, and the addition of new background options for the Start screen, lock screen, and desktop. Prior to its general availability on October 26, 2012, updates were released for some of Windows 8's bundled apps, and a "General Availability Cumulative Update" (which included fixes to improve performance, compatibility, and battery life) was released on Tuesday, October 9, 2012. Microsoft indicated that due to improvements to its testing infrastructure, general improvements of this nature are to be released more frequently through Windows Update instead of being relegated to OEMs and service packs only.

Predicted answer:

### Or use the cell below to supply your own passage and questions

In [24]:
context = input("Please enter your text: \n")
question = input("\nPlease enter your question: \n")
while True:
    answer = question_answer(question, context)
    cprint("\n"+answer, attrs=['bold'])

    # keep asking until we get a valid Y/N reply (case-insensitive; empty input is re-prompted)
    response = ""
    while response not in ("Y", "N"):
        response = input("\nDo you want to ask another question based on this text (Y/N)? ").strip().upper()[:1]

    if response == "N":
        print("\nBye!")
        break
    question = input("\nPlease enter your question: \n")

Please enter your text: 
Relatively few changes were made from the Release Preview to the final version; these included updated versions of its pre-loaded apps, the renaming of Windows Explorer to File Explorer, the replacement of the Aero Glass theme from Windows Vista and 7 with a new flat and solid-colored theme, and the addition of new background options for the Start screen, lock screen, and desktop. Prior to its general availability on October 26, 2012, updates were released for some of Windows 8's bundled apps, and a "General Availability Cumulative Update" (which included fixes to improve performance, compatibility, and battery life) was released on Tuesday, October 9, 2012. Microsoft indicated that due to improvements to its testing infrastructure, general improvements of this nature are to be released more frequently through Windows Update instead of being relegated to OEMs and service packs only.

Please enter your question: 
What was renamed?
windows explorer to file e