 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
    <h1 style="color: white;"> <b>Intelligent Search Demo</b> </h1>
 </div>

<img src="alice_in_wonderland.jpg"
     alt="Alice in wonderland"
     style="float: left; margin-right: 10px; max-width:80%;" />

> Alice's Adventures in Wonderland (commonly shortened to Alice in Wonderland) is an 1865 novel written by English author Charles Lutwidge Dodgson under the pseudonym Lewis Carroll. It tells of a young girl named Alice falling through a rabbit hole into a fantasy world populated by peculiar, anthropomorphic creatures. The tale plays with logic, giving the story lasting popularity with adults as well as with children. It is considered to be one of the best examples of the literary nonsense genre.


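This demo applies an Intelligent Search capability to the text of "Alice in Wonderland". The notebook is structured as follows:
- First, we get the text of the book from an NLTK corpus and process it for consumption by a ranker.
- Next, we implement a ranker that selects the paragraphs most likely to contain the answer to the question.
- Finally, we use question-answering models to extract the answer from the top-ranked paragraphs.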

 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:10px 0;"> 
    <h2 style="color: white;"> <b>Question</b> </h2>
 </div>
<br/>



In [1]:
# question = 'What is the best way to explain?'
# question = 'Why should we not be going back to yesterday?'
question = 'What did the rabbit took out of its waistcoat-pocket?'

 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:10px 0;"> 
    <h2 style="color: white;"> <b>Get the book text</b> </h2>
 </div>
<br/>
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which we use in this exercise.


In [3]:
from nltk.corpus import gutenberg
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display, HTML
from deeppavlov import build_model, configs
import torch

In [4]:
book_name = 'carroll-alice.txt'

In [5]:
display(HTML('<h2> Number of paragraphs: ' + str(len(gutenberg.paras(book_name))) + '</h2>'))

In [6]:
display(HTML('<h2> Number of sentences: ' + str(len(gutenberg.sents(book_name))) + '</h2>'))

In [7]:
display(HTML('<h2> Number of words: ' + str(len(gutenberg.words(book_name))) + '</h2>'))

In [8]:
gutenberg.paras(book_name)[4]

[['There',
  'was',
  'nothing',
  'so',
  'VERY',
  'remarkable',
  'in',
  'that',
  ';',
  'nor',
  'did',
  'Alice',
  'think',
  'it',
  'so',
  'VERY',
  'much',
  'out',
  'of',
  'the',
  'way',
  'to',
  'hear',
  'the',
  'Rabbit',
  'say',
  'to',
  'itself',
  ',',
  "'",
  'Oh',
  'dear',
  '!'],
 ['Oh', 'dear', '!'],
 ['I', 'shall', 'be', 'late', "!'"],
 ['(',
  'when',
  'she',
  'thought',
  'it',
  'over',
  'afterwards',
  ',',
  'it',
  'occurred',
  'to',
  'her',
  'that',
  'she',
  'ought',
  'to',
  'have',
  'wondered',
  'at',
  'this',
  ',',
  'but',
  'at',
  'the',
  'time',
  'it',
  'all',
  'seemed',
  'quite',
  'natural',
  ');',
  'but',
  'when',
  'the',
  'Rabbit',
  'actually',
  'TOOK',
  'A',
  'WATCH',
  'OUT',
  'OF',
  'ITS',
  'WAISTCOAT',
  '-',
  'POCKET',
  ',',
  'and',
  'looked',
  'at',
  'it',
  ',',
  'and',
  'then',
  'hurried',
  'on',
  ',',
  'Alice',
  'started',
  'to',
  'her',
  'feet',
  ',',
  'for',
  'it',
  'flashed',

 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:10px 0;"> 
    <h2 style="color: white;"> <b>Pre-process the text</b> </h2>
 </div>
<br/>

In this NLTK corpus, each sentence is represented as a list of words and each paragraph as a list of sentences. In the preprocessing step we detokenize the words of each sentence to reconstruct the sentence, then join the sentences to reconstruct the paragraph.

In [9]:
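detokenizer = TreebankWordDetokenizer()  # create once and reuse for every sentence
para_list = []
for para in gutenberg.paras(book_name):
    # Rebuild each sentence from its tokens, then join the sentences into one paragraph string
    sentence_list = [detokenizer.detokenize(sentence) for sentence in para]
    para_list.append(" ".join(sentence_list))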
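
Detokenizing does not undo every tokenization choice: as the reconstructed paragraph printed in the next cell shows, hyphenated words come back as `WAISTCOAT - POCKET` and parentheses keep a stray inner space. A minimal cleanup sketch follows, assuming simple regular expressions are good enough for this text. `tidy_paragraph` is a hypothetical helper, not part of NLTK or of the original notebook.

```python
import re

def tidy_paragraph(text):
    """Remove common detokenization artifacts (hypothetical helper, not part of NLTK)."""
    # Re-join hyphenated words: "waistcoat - pocket" -> "waistcoat-pocket"
    text = re.sub(r"(\w) - (\w)", r"\1-\2", text)
    # Drop the space after "(" and the space before ")"
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    return text

sample = "( when she thought it over ); TOOK A WATCH OUT OF ITS WAISTCOAT - POCKET"
print(tidy_paragraph(sample))
```

One caveat: the first rule would also merge a single spaced hyphen that was meant as punctuation between two words, so it is worth checking on a few paragraphs before applying it to the whole book.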
    

In [10]:
para_list[4]

"There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself ,' Oh dear! Oh dear! I shall be late!' ( when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural ); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT - POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat - pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit - hole under the hedge."

 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:10px 0;"> 
    <h2 style="color: white;"> <b>Part 1: Implement the ranker</b> </h2>
 </div>
<br/>
The ranker goes through the corpus of documents to identify the passages most likely to contain the answer to the query. In this demo we use a simple bag-of-words ranker built on unigram and bigram counts; a TF-IDF vectorizer is left commented out as an alternative.
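
To show what TF-IDF weighting would change relative to raw counts, here is a small pure-Python sketch. It is an illustration only, not the notebook's ranker: it assumes whitespace tokenization, uses the smoothed idf form `ln((1 + n) / (1 + df)) + 1`, and skips the length normalization a full implementation would normally apply. `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (toy TF-IDF, whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    # Weight each term count by the idf of that term
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]

docs = ["the rabbit took a watch", "the queen shouted", "the cat grinned"]
weights = tfidf_vectors(docs)
print(weights[0])
```

A term that occurs in every document, such as "the", keeps a weight of 1.0, while a term that occurs in a single document, such as "rabbit", is weighted higher, so rare words count for more when passages are compared.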


## Convert the book text and question into count vectors

In [11]:
#tfidf_vectorizer = TfidfVectorizer(stop_words='english')
count_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))

In [12]:
count_vec_total_text = count_vectorizer.fit_transform(para_list)

In [13]:
count_vec_question = count_vectorizer.transform([question])

In [14]:
print(np.shape(count_vec_total_text))

(817, 10658)


In [15]:
print(np.shape(count_vec_question))

(1, 10658)


## Compute the cosine similarity between the question vector and the paragraph vectors

In [16]:
distance_array = cosine_similarity(count_vec_question, count_vec_total_text)
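
As a rough illustration of what this call computes for each paragraph, the sketch below evaluates cosine similarity on toy count vectors in plain Python. The helper name `cosine` and the toy sentences are assumptions made for the example. Note that a higher value means a closer match, so despite its name `distance_array` holds similarities rather than distances.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

query = Counter("rabbit took watch".split())
passage_a = Counter("the rabbit took a watch".split())
passage_b = Counter("the queen shouted".split())
print(cosine(query, passage_a), cosine(query, passage_b))
```

The first passage shares three terms with the query and scores about 0.77; the second shares none and scores 0.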

In [17]:
distance_array = distance_array[0]

In [18]:
len(distance_array)

817

In [19]:
distance_array

array([0.        , 0.18257419, 0.        , 0.06085806, 0.42802583,
       0.        , 0.07106691, 0.0804518 , 0.        , 0.        ,
       0.        , 0.0255655 , 0.0789337 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.0758098 , 0.        , 0.        ,
       0.09774528, 0.04188539, 0.02879561, 0.        , 0.        ,
       0.        , 0.04778185, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.03178209, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.05227084, 0.        , 0.        ,
       0.10540926, 0.        , 0.07332356, 0.        , 0.        ,
       0.08908708, 0.05832118, 0.        , 0.        , 0.     

## Identify the top n similar passages

In [20]:
top_n = 5

In [21]:
top_passages = []

In [22]:
for index in distance_array.argsort()[::-1][:top_n]:
    print(index, ' ', para_list[index])
    print('-' * 20)
    top_passages.append(para_list[index])

4   There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself ,' Oh dear! Oh dear! I shall be late!' ( when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural ); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT - POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat - pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit - hole under the hedge.
--------------------
456   ' Did you say " What a pity!"?' the Rabbit asked.
--------------------
383   ' What did they live on?' said Alice, who always took a great interest in questions of eating and drinking.
--------------------
712   ' You did!' sa
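
Only the first of these passages is clearly relevant; the others match on a word or two. One possible refinement, sketched here in plain Python, is to drop candidates whose similarity falls below a cutoff before they reach the question-answering model. `top_indices` and its `min_score` parameter are hypothetical, and the cutoff of 0.1 is an arbitrary value chosen for illustration.

```python
def top_indices(scores, n, min_score=0.0):
    """Indices of the n highest scores, best first, skipping scores <= min_score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:n] if scores[i] > min_score]

# First six similarity values from the array printed earlier
scores = [0.0, 0.18257419, 0.0, 0.06085806, 0.42802583, 0.0]
print(top_indices(scores, n=3, min_score=0.1))  # [4, 1]
```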

 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:10px 0;"> 
    <h2 style="color: white;"> <b>Part 2 : Use question answer model to extract the answer </b> </h2>
 </div>
<br/>

In this part we demonstrate two off-the-shelf question-answering models trained on the open-source SQuAD dataset. Both build on pretrained BERT representations, so transfer learning lets them answer questions on text from domains they were not fine-tuned on.

### Part 2a : Use DeepPavlov API to extract the answer 

In [24]:
# Doing the heavyweight initialization of deeppavlov library here
from deeppavlov import build_model, configs
model = build_model(configs.squad.squad_bert, download=True)

2021-01-07 14:26:19.11 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/deeppavlov_data/bert/cased_L-12_H-768_A-12.zip download because of matching hashes
2021-01-07 14:26:20.340 INFO in 'deeppavlov.download'['download'] at line 138: Skipped http://files.deeppavlov.ai/deeppavlov_data/squad_bert.tar.gz download because of matching hashes














2021-01-07 14:27:19.861 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 51: [loading model from /Users/asingh/.deeppavlov/models/squad_bert/model]



INFO:tensorflow:Restoring parameters from /Users/asingh/.deeppavlov/models/squad_bert/model


In [25]:
predicted_answer_list = []
for passage in top_passages:
    predicted_answer = model([passage], [question])
    predicted_answer_list.append(predicted_answer)
    print(predicted_answer)

[['a watch'], [562], [1523999.75]]
[['you say " What a pity!"?\' the Rabbit asked.'], [6], [0.03372638300061226]]
[[''], [-1], [2.575139045715332]]
[['You did'], [2], [6.767302513122559]]
[[''], [-1], [0.09891779720783234]]
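
Each printed result appears to hold three parallel lists: the answer text, the character position where the answer starts in the passage, and a score, with an empty answer and a position of -1 when the model finds nothing. Under that reading, the best answer can be chosen by score rather than by assuming the top-ranked passage always wins. `best_answer` is a hypothetical helper, and the data is copied from the output above.

```python
# Outputs copied from the cell above: [[answer], [start position], [score]]
predictions = [
    [['a watch'], [562], [1523999.75]],
    [['you say " What a pity!"?\' the Rabbit asked.'], [6], [0.03372638300061226]],
    [[''], [-1], [2.575139045715332]],
    [['You did'], [2], [6.767302513122559]],
    [[''], [-1], [0.09891779720783234]],
]

def best_answer(predictions):
    """Return (answer, score) for the highest-scoring non-empty answer, or None."""
    candidates = [(answers[0], scores[0])
                  for answers, _starts, scores in predictions
                  if answers[0]]  # skip empty answers
    return max(candidates, key=lambda pair: pair[1]) if candidates else None

print(best_answer(predictions))  # ('a watch', 1523999.75)
```

The scores are on very different scales across passages, so they are probably best used for ranking only, not read as probabilities.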


In [26]:
display(HTML("<h2> Question: '" + question + "'</h2>"))

In [27]:
display(HTML('<h2> Best Answer: ' + str("'" + predicted_answer_list[0][0][0] + "'") + '</h2>'))

### Part 2b : Use HuggingFace API to extract the answer 

In [28]:
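from transformers import BertForQuestionAnswering, BertTokenizer
# Load the tokenizer and the model from the same checkpoint so their vocabularies are guaranteed to match
model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)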

Input Representation and Tokenization In BERT

In [29]:
input_text = "[CLS] " + question + " [SEP] " + top_passages[0] + " [SEP]"

In [30]:
tokenized_text = tokenizer.tokenize(input_text)
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)

In [31]:
print(top_passages[0])

There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself ,' Oh dear! Oh dear! I shall be late!' ( when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural ); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT - POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat - pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit - hole under the hedge.


In [32]:
print(tokenized_text)

['[CLS]', 'what', 'did', 'the', 'rabbit', 'took', 'out', 'of', 'its', 'waist', '##coat', '-', 'pocket', '?', '[SEP]', 'there', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', ';', 'nor', 'did', 'alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'rabbit', 'say', 'to', 'itself', ',', "'", 'oh', 'dear', '!', 'oh', 'dear', '!', 'i', 'shall', 'be', 'late', '!', "'", '(', 'when', 'she', 'thought', 'it', 'over', 'afterwards', ',', 'it', 'occurred', 'to', 'her', 'that', 'she', 'ought', 'to', 'have', 'wondered', 'at', 'this', ',', 'but', 'at', 'the', 'time', 'it', 'all', 'seemed', 'quite', 'natural', ')', ';', 'but', 'when', 'the', 'rabbit', 'actually', 'took', 'a', 'watch', 'out', 'of', 'its', 'waist', '##coat', '-', 'pocket', ',', 'and', 'looked', 'at', 'it', ',', 'and', 'then', 'hurried', 'on', ',', 'alice', 'started', 'to', 'her', 'feet', ',', 'for', 'it', 'flashed', 'across', 'her', 'mind', 'that', 'she', 'had', 'never', 'before', 'see

In [33]:
#input_ids = tokenizer.encode(input_text)

In [34]:
for tup in zip(tokenized_text, input_ids):
  print(tup)

('[CLS]', 101)
('what', 2054)
('did', 2106)
('the', 1996)
('rabbit', 10442)
('took', 2165)
('out', 2041)
('of', 1997)
('its', 2049)
('waist', 5808)
('##coat', 16531)
('-', 1011)
('pocket', 4979)
('?', 1029)
('[SEP]', 102)
('there', 2045)
('was', 2001)
('nothing', 2498)
('so', 2061)
('very', 2200)
('remarkable', 9487)
('in', 1999)
('that', 2008)
(';', 1025)
('nor', 4496)
('did', 2106)
('alice', 5650)
('think', 2228)
('it', 2009)
('so', 2061)
('very', 2200)
('much', 2172)
('out', 2041)
('of', 1997)
('the', 1996)
('way', 2126)
('to', 2000)
('hear', 2963)
('the', 1996)
('rabbit', 10442)
('say', 2360)
('to', 2000)
('itself', 2993)
(',', 1010)
("'", 1005)
('oh', 2821)
('dear', 6203)
('!', 999)
('oh', 2821)
('dear', 6203)
('!', 999)
('i', 1045)
('shall', 4618)
('be', 2022)
('late', 2397)
('!', 999)
("'", 1005)
('(', 1006)
('when', 2043)
('she', 2016)
('thought', 2245)
('it', 2009)
('over', 2058)
('afterwards', 5728)
(',', 1010)
('it', 2009)
('occurred', 4158)
('to', 2000)
('her', 2014)
('that

In [35]:
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
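
The segment ids tell BERT which tokens belong to the question and which to the passage: 0 up to and including the first `[SEP]`, 1 afterwards. The same rule is written out below on a toy input so the result is easy to inspect. `segment_ids` is a hypothetical helper, and the token ids are taken from the listing printed earlier.

```python
SEP_ID = 102  # id of [SEP] in the uncased BERT vocabulary, as printed earlier

def segment_ids(input_ids, sep_id=SEP_ID):
    """0 for the question up to and including the first [SEP], 1 for the rest."""
    first_sep = input_ids.index(sep_id)
    return [0] * (first_sep + 1) + [1] * (len(input_ids) - first_sep - 1)

# [CLS] what did [SEP] there was [SEP]
toy_ids = [101, 2054, 2106, 102, 2045, 2001, 102]
print(segment_ids(toy_ids))  # [0, 0, 0, 0, 1, 1, 1]
```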

In [36]:
#token_type_ids

In [37]:
#start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))

In [50]:
response = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))

In [53]:
start_scores = response['start_logits']
start_scores

tensor([[-5.6253, -6.5432, -8.7005, -8.0701, -7.9872, -7.1824, -8.0202, -7.7935,
         -8.3308, -8.4675, -9.2599, -9.4378, -8.8390, -9.9604, -5.6253, -7.8081,
         -8.1429, -6.5101, -7.7626, -8.4542, -7.7951, -7.0669, -6.8516, -8.6917,
         -7.7679, -7.9444, -5.3592, -7.5644, -7.2754, -7.7703, -8.5000, -8.1266,
         -7.1917, -8.1409, -8.4497, -8.4238, -7.9103, -8.4523, -7.7646, -5.8052,
         -8.8358, -8.5498, -7.9195, -8.7839, -6.5267, -5.7339, -8.1382, -8.4639,
         -6.5532, -7.8100, -8.3539, -5.7651, -7.6919, -8.0590, -6.0737, -8.7583,
         -8.2796, -8.1480, -7.4105, -7.7558, -7.7056, -7.7925, -8.5667, -7.6174,
         -8.4638, -7.4137, -7.4711, -8.6020, -7.8537, -8.5254, -7.4074, -8.2212,
         -8.6080, -8.7363, -7.5475, -8.9741, -7.7709, -8.7558, -7.8884, -6.8745,
         -7.1873, -7.9396, -6.4894, -7.6782, -6.4856, -6.7629, -5.9999, -8.1147,
         -7.3338, -5.7741, -2.6644,  0.3923, -1.2396, -2.3462,  0.3755,  5.8996,
          4.7149, -4.0624, -


In [56]:
end_scores = response['end_logits']

In [40]:
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)

In [41]:
print(all_tokens)

['[CLS]', 'what', 'did', 'the', 'rabbit', 'took', 'out', 'of', 'its', 'waist', '##coat', '-', 'pocket', '?', '[SEP]', 'there', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', ';', 'nor', 'did', 'alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'rabbit', 'say', 'to', 'itself', ',', "'", 'oh', 'dear', '!', 'oh', 'dear', '!', 'i', 'shall', 'be', 'late', '!', "'", '(', 'when', 'she', 'thought', 'it', 'over', 'afterwards', ',', 'it', 'occurred', 'to', 'her', 'that', 'she', 'ought', 'to', 'have', 'wondered', 'at', 'this', ',', 'but', 'at', 'the', 'time', 'it', 'all', 'seemed', 'quite', 'natural', ')', ';', 'but', 'when', 'the', 'rabbit', 'actually', 'took', 'a', 'watch', 'out', 'of', 'its', 'waist', '##coat', '-', 'pocket', ',', 'and', 'looked', 'at', 'it', ',', 'and', 'then', 'hurried', 'on', ',', 'alice', 'started', 'to', 'her', 'feet', ',', 'for', 'it', 'flashed', 'across', 'her', 'mind', 'that', 'she', 'had', 'never', 'before', 'see

In [57]:
predicted_answer = ' '.join(all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1])

In [58]:
predicted_answer

'a watch'

In [59]:
torch.max(end_scores).item()

6.369904518127441
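
The value above is a raw logit, not a probability. If a normalized confidence is wanted, one common option is a softmax over all positions. The sketch below shows the computation in plain Python on a handful of the start logits printed earlier; because it covers only a subset of positions, the resulting numbers are illustrative and are not the model's actual probabilities.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A few start logits from the output above (illustrative subset only)
toy_logits = [-5.6253, -2.6644, 0.3923, 5.8996, 4.7149]
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
```

With the real tensors, `torch.softmax(start_scores, dim=-1)` performs the same normalization across the full sequence.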

## Getting answers from all the top passages

In [60]:
predicted_answer_list = []
for passage in top_passages:
    input_text = "[CLS] " + question + " [SEP] " + passage + " [SEP]"
    tokenized_text = tokenizer.tokenize(input_text)
    input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
    token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
    response = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
    start_scores = response['start_logits']
    end_scores = response['end_logits']
    all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    predicted_answer = ' '.join(all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1])
    predicted_answer_list.append(predicted_answer)
    print(predicted_answer)

a watch
what a pity ! " ? ' the rabbit asked .
drinking
what did the rabbit took out of its waist ##coat - pocket ? [SEP] ' you did ! ' said the hat ##ter
what did the rabbit took out of its waist ##coat - pocket
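
The last two answers leak question tokens and `[SEP]` because the start and end positions are picked independently, and the `##` WordPiece markers are left in place. A hedged sketch of one way to address both issues: restrict the search to spans inside the passage with start no later than end, and merge WordPiece continuations when joining. Both helper names, the `max_len` limit, and the toy scores are assumptions made for illustration.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

def best_context_span(tokens, start_scores, end_scores, max_len=30):
    """Best (start, end) with start <= end, both inside the passage segment."""
    first_sep = tokens.index("[SEP]")
    last = len(tokens) - 1  # position of the final [SEP], excluded from spans
    best, best_score = None, float("-inf")
    for start in range(first_sep + 1, last):
        for end in range(start, min(start + max_len, last)):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

tokens = ["[CLS]", "what", "?", "[SEP]", "its", "waist", "##coat", "[SEP]"]
start_scores = [9.0, 1.0, 0.0, 0.0, 0.5, 3.0, 0.0, 0.0]  # made-up scores
end_scores = [0.0, 0.0, 8.0, 0.0, 0.0, 1.0, 4.0, 0.0]    # made-up scores
start, end = best_context_span(tokens, start_scores, end_scores)
print(merge_wordpieces(tokens[start:end + 1]))  # waistcoat
```

In this toy case, taking the two argmax positions independently would select the question span `[CLS] what ?`, while the constrained search returns `waistcoat`. With the model outputs above, the score lists could be obtained with `start_scores[0].tolist()` and `end_scores[0].tolist()`.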


 <div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> The End </h1>
 </div>