Code
----

In [1]:
# install package
%pip install evaluate
%pip install bert_score
%pip install gradio

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0
[0mNote: you may need to restart the kernel to use updated packages.
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13
[0mNote: you may need to restart the kernel to use updated packages.
Collecting gradio
  Downloading gradio-3.27.0-py3-none-any.whl (17.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.3/17.3 MB[0m [31m51.6 MB/s[0m eta [36m0:00:00[0m
Collecting gradio-client>=0.1.3
  Downloading gradio_client-0.1.3-py3-none-any.whl (286 kB)
[2K 

In [2]:
# import necessary packages 
import torch, requests, json
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import numpy as np   
import pandas as pd
from tqdm.notebook import tqdm  
tqdm.pandas()
from evaluate import load
import gradio as gr

In [3]:
# prepare tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = AutoModelForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

In [4]:
# method to convert SQuAD-V2 train dataset from json to dataframe
def squad_json_to_dataframe(input_file_path, record_path = ['data','paragraphs','qas','answers'], verbose = 1):
    """
    input_file_path: path to the squad json file.
    record_path: path to deepest level in json file default value is
    ['data','paragraphs','qas','answers']
    verbose: 0 to suppress it default is 1
    """
    if verbose:
        print("Reading the json file")    
    file = json.loads(open(input_file_path).read())
    if verbose:
        print("processing...")
    # parsing different level's in the json file
    js = pd.json_normalize(file , record_path )
    m = pd.json_normalize(file, record_path[:-1] )
    r = pd.json_normalize(file,record_path[:-2])
    
    #combining it into single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
    ndx  = np.repeat(m['id'].values,m['answers'].str.len())
    m['context'] = idx
    js['q_idx'] = ndx
    main = pd.concat([ m[['id','question','context']].set_index('id'),js.set_index('q_idx')],axis=1,sort=False).reset_index()
    main['c_id'] = main['context'].factorize()[0]
    if verbose:
        print("shape of the dataframe is {}".format(main.shape))
        print("Done")
    return main

In [5]:
# convert SQuAD-V2 train dataset from json to dataframe
input_file_path = '/kaggle/input/squad-2-dataset/train-v2.0.json'
record_path = ['data','paragraphs','qas','answers']
train = squad_json_to_dataframe(input_file_path=input_file_path,record_path=record_path)

Reading the json file
processing...
shape of the dataframe is (130319, 6)
Done


In [6]:
train.head()

Unnamed: 0,index,question,context,text,answer_start,c_id
0,56be85543aeaaa14008c9063,When did Beyonce start becoming popular?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,in the late 1990s,269.0,0
1,56be85543aeaaa14008c9065,What areas did Beyonce compete in when she was...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,singing and dancing,207.0,0
2,56be85543aeaaa14008c9066,When did Beyonce leave Destiny's Child and bec...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,2003,526.0,0
3,56bf6b0f3aeaaa14008c9601,In what city and state did Beyonce grow up?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,"Houston, Texas",166.0,0
4,56bf6b0f3aeaaa14008c9602,In which decade did Beyonce become famous?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,late 1990s,276.0,0


In [7]:
# data cleanup
train = train.drop(['index','c_id','answer_start'],axis=1)
train.head()

Unnamed: 0,question,context,text
0,When did Beyonce start becoming popular?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,in the late 1990s
1,What areas did Beyonce compete in when she was...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,singing and dancing
2,When did Beyonce leave Destiny's Child and bec...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,2003
3,In what city and state did Beyonce grow up?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,"Houston, Texas"
4,In which decade did Beyonce become famous?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,late 1990s


In [8]:
# method to generate answer for a given question from a given context
def predict_answer(question,context):
  encoded_text = tokenizer.encode_plus(text=question,text_pair=context)
  inputs = encoded_text['input_ids'] 
  sentence_embeddings = encoded_text['token_type_ids']
  tokens = tokenizer.convert_ids_to_tokens(inputs)
  outputs = model(input_ids=torch.tensor([inputs]),token_type_ids=torch.tensor([sentence_embeddings]))
  start_index = torch.argmax(outputs.start_logits).item()
  end_index = torch.argmax(outputs.end_logits).item()
  answer = ' '.join(tokens[start_index:end_index+1])
  corrected_ans = ''

  for word in answer.split():
    if word[:2] == '##':
      corrected_ans += word[2:]
    else:
      corrected_ans += ' ' + word
  return corrected_ans.strip()

In [9]:
question = "Cristiano Ronaldo currently playing at which club?"
context = "Cristiano Ronaldo dos Santos Aveiro (born 5 February 1985), better known as Cristiano Ronaldo, is a Portuguese professional footballer who plays as a forward. He is the captain of the Portuguese national team and he is currently playing at Saudi Arabian football club Al Nassr. He is considered to be one of the greatest footballers of all time, and, by some, as the greatest ever. Ronaldo began his professional career with Sporting CP at age 17 in 2002, and signed for Manchester United a year later. He won three back-to-back Premier League titles: in 2006-07, 2007-08, and 2008-09. In 2007-08, Ronaldo, helped United win the UEFA Champions League. In 2008-09, he won his first FIFA Club World Cup in December 2008, and he also won his first Ballon d'Or. At one point Ronaldo was the most expensive professional footballer of all time, after moving from Manchester United to Real Madrid for approximately £80 m in July 2009."
predict_answer(question,context)

'al nassr'

In [10]:
# generate predicted answers from training sample
train_sample = train.sample(n=100).dropna()
predicted_answers = []

for i in tqdm(range(len(train_sample))):
  ques = train_sample.question.iloc[i]
  ans = train_sample.text.iloc[i]
  predicted_answers.append(predict_answer(ques,ans))

  0%|          | 0/57 [00:00<?, ?it/s]

In [11]:
# calculate bert scores (precision, recall, f1, hashcode)
bert_score = load("bertscore")
results = bert_score.compute(predictions=predicted_answers,references=train_sample.text.values,lang='en')

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.43G [00:00<?, ?B/s]



In [12]:
precision, recall, f1_scores, hashcode = results['precision'], results['recall'], results['f1'], results['hashcode']

In [13]:
train_sample['answer'] = predicted_answers
train_sample['f1_score'] = f1_scores
train_sample.head(10)

Unnamed: 0,question,context,text,answer,f1_score
67573,How do the ingredients of traditional Swiss cu...,The cuisine of Switzerland is multifaceted. Wh...,similar,similar,1.0
87965,What are giant axons?,"As in arthropods, each muscle fiber (cell) is ...",the output signal lines of nerve cells,nerve cells,0.885365
55952,What might happen during interviews?,Professional wrestling in the U.S. tends to ha...,heels may even attack the faces,heels may even attack the faces,1.0
111828,In what year did Rome claim victory against th...,The first Roman republican wars were wars of b...,477 BC,477 bc,0.970673
68648,Where was the Treaty of Paris signed?,Peace negotiations at the Congress of Paris re...,the Congress of Paris,the congress of paris,0.938957
40231,What brought better multi display support to Mac?,"In 2001, Apple introduced Mac OS X, based on D...",Mavericks,,0.0
27231,What type of wood is it possible the bacteria ...,Little is known about the bacteria that degrad...,sunken,sunken,1.0
104548,What year was the Polaris Missile Facility Atl...,"During this period, the Weapons Station was th...",1996,1996,1.0
16468,What was achieved by launching 9 additional sa...,The first satellite of the second-generation s...,functional regional coverage,functional regional coverage,1.0
94034,What is Palermo's climate classification?,Palermo experiences a hot-summer Mediterranean...,hot-summer Mediterranean climate (Köppen clima...,hot - summer mediterranean climate ( koppen cl...,0.922139


In [14]:
# render an user interface
app = gr.Interface(fn=predict_answer, inputs = ['textbox', 'text'], outputs = 'textbox', title = 'Question Answering bot', description = 'Input context and question, then get answers!')
app.launch(inline=True)

Kaggle notebooks require sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Running on local URL:  http://127.0.0.1:7860
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Running on public URL: https://abfbf026cf73308086.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




Conclusion
----------
Here I have used AutoTokenizer and AutoModelForQuestionAnswering along with uncased version of Bert Large pretrained model from Huggingface. In the final result, the model is able to predict the answers for even those questions which are not directly mentioned in the context, but some related statement is present in the context. This way it is different from any normal Text search algorithm.