Deep Learning for NLP
--

Recipe 7-1:  Retrieving Information
--

Information retrieval is one of the most widely used applications of NLP, and it is quite tricky. The meaning of a word or sentence depends not only on the exact words used but also on the context. Two sentences may share no words at all and still convey the same meaning, and a retrieval system should be able to capture that as well.

An information retrieval (IR) system allows users to efficiently search documents and retrieve meaningful information based on a search text/query.

![Information_retrieval_block_dg](images/Information_retrieval_block_dg.png  'Information_retrieval_block_dg')

Problem
--
Information retrieval using word embeddings.

Solution
--
There are multiple ways to do information retrieval. Here we will see how to
do it using word embeddings, which is very effective because it also takes
context into consideration.

> We discussed how word embeddings are built in Chapter 4. 

We will just use the pretrained word2vec in this case. 

Let’s take a simple example and see how to build document retrieval from a query input. Say we have 4 documents in our database, as below. (This only showcases how it works; a real-world application will have many more documents.)

In [9]:
Doc1 = ["With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders." ]
        
Doc2 = ["Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."]

Doc3 = ["He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems."] 

Doc4 = ["But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg."]

Assume we have numerous documents like these, and we want to retrieve
the most relevant one for the query “cricket.” Let’s see how to build it.

query = "cricket"

In [10]:
# How It Works

# Step 7.1.1 : Import the libraries
import gensim
from gensim.models import Word2Vec
import numpy as np
import nltk
import itertools
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

In [11]:
# Step 7.1.2 - Create/import documents

# Doc1 , Doc2 , Doc3 and Doc4 as defined above in the code.

# Put all the documents in one list
fin = Doc1+Doc2+Doc3+Doc4

Step 7.1.3 Download word2vec

As mentioned earlier, we are going to use word embeddings to solve
this problem. We downloaded word2vec in the Jupyter notebook "Converting Text to Features". Also recall that loading it may raise a ValueError, as below:
> The Google word2vec file is so large that loading it can fail with: ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.

The download link for word2vec is:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
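
As a rough check on why the ValueError above can appear on a small or 32-bit machine, the sketch below estimates the size of the embedding matrix. It assumes the published figures for the GoogleNews model (about 3 million vocabulary entries, 300 dimensions) and 32-bit floats; the function name is ours and is not part of gensim.

```python
def embedding_matrix_bytes(vocab_size, dim, bytes_per_value=4):
    """Size in bytes of a dense vocab_size x dim matrix of floats."""
    return vocab_size * dim * bytes_per_value

# Assumed figures for GoogleNews-vectors-negative300: ~3,000,000 entries, 300 dims.
size = embedding_matrix_bytes(3_000_000, 300)
print(f"{size / 1024**3:.2f} GiB")
```

If memory is tight, gensim's load_word2vec_format also accepts a limit argument that loads only the first N vectors, at the cost of a smaller vocabulary.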

In [12]:
# step 7.1.3
# load the model (raw string, so the backslashes in the Windows path are not read as escapes)
model = gensim.models.KeyedVectors.load_word2vec_format(r'C:\Program Files\Python36\suven\Adv ML\datasets\datasets/GoogleNews-vectors-negative300.bin', binary=True)

Notes for 7.1: how the embedding lookup used in Step 7.1.4 behaves

The model maps each word in its vocabulary to a vector of size 300:

    apple   : [0.3, 0.445, ...]    size = 300
    bat     : [0.34, 0.245, ...]   size = 300
    cricket : [0.2, 0.665, ...]    size = 300
    ...

Case 1: assume word = "cricket"

    if "cricket" in model.wv.vocab:    -> True
        return model[word]             -> word found, hence its vector is returned
    else:
        return np.zeros(300)

Case 2: assume word = "hitesh"

    if "hitesh" in model.wv.vocab:     -> False
        return model[word]
    else:
        return np.zeros(300)           -> make a vector of 300 zeros and return it
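
The two cases above can be tried without downloading the pretrained file. The sketch below is only an illustration: toy_model is a hypothetical dictionary standing in for the word2vec model, the vectors are made up, and the dimension is 4 instead of 300.

```python
import numpy as np

DIM = 4  # toy dimension; the GoogleNews vectors use 300

# Hypothetical stand-in for the pretrained model: word -> vector
toy_model = {
    "cricket": np.array([0.2, 0.6, 0.1, 0.0]),
    "bat": np.array([0.3, 0.2, 0.0, 0.1]),
}

def get_toy_embedding(word, vectors=toy_model, dim=DIM):
    """Return the stored vector, or a zero vector for unknown words."""
    if word in vectors:
        return vectors[word]
    return np.zeros(dim)

print(get_toy_embedding("cricket"))  # case 1: word found
print(get_toy_embedding("hitesh"))   # case 2: word missing, zeros returned
```

Because cosine similarity ignores vector length, zero vectors for unknown words shrink a document average but do not change its direction. A document or query made up only of unknown words, however, averages to the zero vector, for which cosine similarity is undefined.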

In [30]:
# Step 7.1.4 : Create IR system

# Now we build the information retrieval system:

# Preprocessing
def remove_stopwords(text, is_lower_case=False):
    # Strip everything except letters, digits and whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        # Text is already lower-cased, so compare tokens directly
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


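# Function to get the embedding vector of a word; the pretrained vectors have 300 dimensions
def get_embedding(word):
    # model.wv.vocab is the gensim 3.x API; gensim 4+ exposes model.key_to_index instead
    if word in model.wv.vocab:
        return model[word]
    else:
        # Unknown word: fall back to a zero vector of the same size
        return np.zeros(300)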
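
The punctuation-stripping pattern in remove_stopwords is sensitive to a single character. A class written as A-z rather than A-Z also matches the six ASCII characters that sit between "Z" and "a", so brackets and underscores survive the cleanup. The sketch below uses a made-up sentence and only the standard re module to show the difference.

```python
import re

text = "first_time [offenders] pay Rs 10,000!"

# A-z spans ASCII 65..122, which also covers [ \ ] ^ _ `
loose = re.sub(r'[^a-zA-z0-9\s]', '', text)
# A-Z stops at 'Z', so those six characters are stripped like any other punctuation
strict = re.sub(r'[^a-zA-Z0-9\s]', '', text)

print(loose)
print(strict)
```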

In [32]:
# Getting average vector for each document
out_dict = {}
for sen in fin:  # picks one document at a time from fin (the list of all documents)
    tokens = nltk.word_tokenize(remove_stopwords(sen))
    average_vector = np.mean(np.array([get_embedding(x) for x in tokens]), axis=0)
    out_dict[sen] = average_vector  # map each document to its average vector

# Function to calculate the similarity between the query vector and document vector
def get_sim(query_embedding, average_vector_doc):
 sim = [(1 - scipy.spatial.distance.cosine(query_embedding, average_vector_doc))]
 return sim



# Remember: scipy.spatial.distance.cosine returns the cosine distance (a dissimilarity),
# defined as 1 - cosine similarity, therefore we compute 1 - (....) to get the similarity back.
# Note: scipy's cosine is therefore not the plain cosine similarity a.b / (|a| |b|) used elsewhere in ML.
# Refer : https://datascience.stackexchange.com/questions/8426/cosine-distance-1-in-scipy
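
To make the relationship between averaging, cosine similarity and cosine distance concrete, here is a minimal sketch with made-up 3-dimensional vectors. It computes the similarity directly with NumPy instead of calling scipy, and the helper name cosine_similarity is ours.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a| |b|); assumes neither vector is all zeros."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional word vectors (made up for illustration)
word_vecs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
doc_vec = word_vecs.mean(axis=0)        # document vector = average of its word vectors
query_vec = np.array([1.0, 1.0, 0.0])   # points in the same direction as doc_vec

similarity = cosine_similarity(query_vec, doc_vec)
distance = 1.0 - similarity             # the quantity scipy's cosine() reports
print(round(similarity, 6), round(distance, 6))
```

The query and the document vector differ in length but not in direction, so the similarity is 1 (up to floating-point rounding) and the distance is 0.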

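Note: re.sub takes the pattern, the replacement and the input string, in that order. Calling it with only two arguments raises: TypeError: sub() missing 1 required positional argument: 'string'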

In [14]:
# Rank all the documents based on the similarity to get relevant docs
def Ranked_documents(query):
 query_words = (np.mean(np.array([get_embedding(x) for x in nltk.word_tokenize(query.lower())],dtype=float), axis=0))
 rank = []
 
 for k,v in out_dict.items():
  rank.append((k, get_sim(query_words, v)))
 
 rank = sorted(rank,key=lambda t: t[1], reverse=True)
 print('Ranked Documents :')
 return rank
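
The ranking step can be tried on its own with made-up vectors. The sketch below is a simplified variant, not the notebook's function: the names safe_cosine and rank_documents are ours, the score is a plain float rather than a one-element list, and a zero-length vector (for example a query whose words are all unknown) is given a score of 0.0 instead of an undefined value.

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, safe_cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 2-dimensional document vectors, for illustration only
toy_docs = {
    "doc_a": np.array([1.0, 0.0]),
    "doc_b": np.array([1.0, 1.0]),
    "doc_c": np.array([0.0, 1.0]),
}

print(rank_documents(np.array([1.0, 0.0]), toy_docs))
print(rank_documents(np.zeros(2), toy_docs))  # unknown query: every score is 0.0
```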

In [15]:
# Call the IR function with a query
Ranked_documents("cricket")



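Note: NameError: name 'x' is not defined is raised when get_embedding refers to a variable x instead of its own parameter word; x exists only inside the list comprehension that calls get_embedding.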

Running this code on a PC with 16 GB RAM and the Microsoft Visual Studio C++ build tools correctly installed, our team member got the following output.

In [None]:
# As you can see, Doc4 (on top in the result) is the most relevant for the
# query “cricket”, with a similarity of 0.449, even though the word “cricket”
# is not mentioned even once.

# Let’s take one more example, say “driving”.
Ranked_documents("driving")

In [None]:
[('With the Union cabinet approving the amendments to the Motor
Vehicles Act, 2016, those caught for drunken driving will have
to have really deep pockets, as the fine payable in court has
been enhanced to Rs 10,000 for first-time offenders.',
[0.35947287723800669]),
('But the man behind the wickets at the other end was watching
just as keenly. With an affirmative nod from Dhoni, India
captain Rohit Sharma promptly asked for a review. Sure enough,
the ball would have clipped the top of middle and leg.',
[0.19042556935316801]),
('He points out that public transport is very good in Mumbai
and New Delhi, where there is a good network of suburban and
metro rail systems.',
[0.17066536985237601]),
('Natural language processing (NLP) is an area of computer
science and artificial intelligence concerned with the
interactions between computers and human (natural) languages,
in particular how to program computers to process and analyze
large amounts of natural language data.',
[0.088723080005327359])]

In [None]:
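# Again, since driving is connected to the Motor Vehicles Act, that document
# comes out on top (0.359). Note that the public transport document ranks only
# third (0.171), just below the cricket document (0.190), so the averaged
# vectors are a rough signal rather than an exact one.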

We can use the same approach and scale it up to as many documents as needed.
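
One possible way to scale up, sketched here under our own assumptions rather than taken from the notebook, is to stack the document vectors into a matrix and score all of them with a single matrix-vector product. The function name top_k and the toy vectors are hypothetical.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=2):
    """Return indices and cosine scores of the k rows most similar to query_vec."""
    # Row norms; in a larger system the normalised matrix would be computed once and stored.
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    unit_docs = doc_matrix / np.where(doc_norms == 0, 1.0, doc_norms)
    query_norm = np.linalg.norm(query_vec)
    unit_query = query_vec / (query_norm if query_norm > 0 else 1.0)
    scores = unit_docs @ unit_query      # one product scores every document
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]

# Made-up 2-dimensional document vectors, one row per document
doc_matrix = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k(np.array([1.0, 0.0]), doc_matrix, k=2)
print(indices, scores)
```

Normalising the rows up front means each query costs one multiplication over the whole collection, which avoids a Python loop over documents.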

For more accuracy, we can build our own embeddings for specific industries, as we learned in Chapter 4, since the one we are using is generalized (although this is a time-consuming task).

This is the fundamental approach that can be used for many applications like the following:
1. Search engines (Internet Search Engines)
2. Document retrieval
3. Passage retrieval
4. Question answering

![result_length_vs_query_len](images/result_length_vs_query_len.png 'result_length_vs_query_len' )

Strongly recommended: I have learned a lot from the videos of Mr. Brandon Rohrer. I would advise each of my students to go through this free course. You may skip the CNN part. https://end-to-end-machine-learning.teachable.com/p/how-deep-neural-networks-work