
Deep Learning for NLP
--

Recipe 7-1:  Retrieving Information
--

Information retrieval is one of the most widely used applications of NLP, and it is tricky to do well. The meaning of a sentence depends not only on the exact words used but also on their context. Two sentences can use completely different words and still convey the same meaning, and a good retrieval system should capture that as well.
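
To make the point about different words carrying the same meaning concrete, here is a minimal sketch. The `toy_vectors` table and its 3-dimensional values are invented for illustration only; they are not real word2vec output. The two "same meaning" sentences share no words, yet their averaged vectors point in almost the same direction.

```python
import math

# Hypothetical 3-d "embeddings", hand-made for illustration only.
toy_vectors = {
    "doctor":    [0.9, 0.1, 0.0],
    "physician": [0.8, 0.2, 0.1],
    "treats":    [0.1, 0.9, 0.0],
    "heals":     [0.2, 0.8, 0.1],
    "football":  [0.0, 0.1, 0.9],
    "match":     [0.1, 0.0, 0.8],
}

def sentence_vector(words):
    """Average the word vectors dimension by dimension."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

s1 = ["doctor", "treats"]
s2 = ["physician", "heals"]   # different words, similar meaning
s3 = ["football", "match"]    # unrelated topic

shared_words = set(s1) & set(s2)
sim_same_meaning = cosine_similarity(sentence_vector(s1), sentence_vector(s2))
sim_different = cosine_similarity(sentence_vector(s1), sentence_vector(s3))

print("Shared words:", shared_words)
print("Similar meaning:", round(sim_same_meaning, 3))
print("Different topic:", round(sim_different, 3))
```

A purely word-matching search would see no overlap between the first two sentences, while the vector comparison scores them as close.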

An information retrieval (IR) system allows users to efficiently search documents and retrieve meaningful information based on a search text/query.

<img src="https://drive.google.com/uc?id=1QOtLrYLecQSPH3D_SgaFg12urVyeOu3b" />

Problem
--
Information retrieval using word embeddings.

Solution
--
There are multiple ways to do information retrieval. Here we will see how to do it using word embeddings. This works well because embeddings are learned from the contexts in which words appear, so related words end up with similar vectors.

We will use a pretrained word2vec model in this case.
( Src : https://radimrehurek.com/gensim/models/word2vec.html )

Let’s take a simple example and see how to build a document retrieval system driven by a query. Let’s say we have 4 documents in our database, as below. (This is only to show how it works. A real-world application would have far more documents.)

In [None]:
Doc1 = ["With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders." ]

Doc2 = ["Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."]

Doc3 = ["He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems."]

Doc4 = ["But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg."]

Assume we have many documents like these, and we want to retrieve the most relevant one for the query “cricket.” Let’s see how to build it.

**query = "cricket"**

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Step 7.1.1 : Import the libraries
import gensim
from gensim.models import Word2Vec
import numpy as np
import nltk
import itertools
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

In [None]:
# Step 7.1.2 - Create/import documents

# Doc1 , Doc2 , Doc3 and Doc4 as defined above in the code.

# Put all the documents in one list
fin = Doc1+Doc2+Doc3+Doc4
print(fin)

['With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders.', 'Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.', 'He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems.', 'But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg.']


In [None]:
# Step 7.1.3 : Load the pretrained word2vec model

# Alternative: load the vectors directly from a saved binary file.
#model = gensim.models.KeyedVectors.load_word2vec_format('http://d1ufe8q8sjuo99.cloudfront.net/public/GoogleNews-vectors-negative300.bin.gz', binary=True)

# Download (on first use) and load the Google News vectors through gensim's downloader.
# The result is a KeyedVectors object holding 300-dimensional word vectors.
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')



In [None]:
#print(model)
#print(gensim.models.Word2Vec())

In [None]:
# Step 7.1.4 : Create IR system

# Now we build the information retrieval system:

# Preprocessing: strip punctuation, tokenize, and drop English stopwords
def remove_stopwords(text, is_lower_case=False):
    # Keep only letters, digits, and whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, "", text)

    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        # Text is already lower case, so compare tokens directly
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]

    # Join outside the if/else so that both branches return a value
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Return the 300-dimensional embedding for a word,
# or a zero vector if the word is not in the vocabulary
def get_embedding(word):
    if word in wv.key_to_index:
        return wv[word]
    else:
        return np.zeros(300)
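
The preprocessing step above can be tried in isolation. The sketch below mirrors its logic without NLTK: `toy_stopwords` is a small stand-in set (an assumption, not NLTK's English list), `strip_stopwords` is a hypothetical name, and whitespace splitting replaces the Toktok tokenizer.

```python
import re

# Tiny stand-in stopword set; the notebook uses NLTK's full English list.
toy_stopwords = {"the", "is", "in", "and", "of", "a", "to", "that", "there", "where"}

def strip_stopwords(text, stopwords=toy_stopwords):
    """Remove punctuation, then drop stopwords (case-insensitive match)."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    tokens = cleaned.split()
    return " ".join(t for t in tokens if t.lower() not in stopwords)

sample = "He points out that public transport is very good in Mumbai and New Delhi."
result = strip_stopwords(sample)
print(result)

# Side effect of the punctuation pattern: hyphens are deleted, not split on.
hyphenated = strip_stopwords("first-time offenders")
print(hyphenated)
```

Note that the punctuation pattern fuses hyphenated words ("first-time" becomes "firsttime"), and such fused tokens are unlikely to be found in a pretrained vocabulary.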

In [None]:
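# Just to see the word vector for some word.
print(wv['cricket'])
print("-----------------------------------")
print(len(wv['cricket']))  # number of dimensions: 300
print(np.mean(np.array(wv['cricket']), axis=0))  # mean of this one vector's components (a single number)

# Just to see the number of tokens in Doc1.
# Doc1 is a one-element list, so tokenize the string inside it.
print(len(tokenizer.tokenize(Doc1[0])))


# Each token gets its own vector of size 300,
# so a document with N tokens gives N word vectors.
# To get one fixed-size vector per document, we average those
# word vectors element-wise (axis=0). That leaves a single
# 300-dimensional vector to compare against the query.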
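
The averaging step can be sketched on a tiny example. The 4-dimensional `toy_vocab` and the helper `mean_vector` are hypothetical names used only here; unlike `get_embedding`, this variant skips unknown words instead of substituting zero vectors.

```python
# Hypothetical 4-d vocabulary; real word2vec vectors have 300 dimensions.
toy_vocab = {
    "ball":   [1.0, 0.0, 2.0, 0.0],
    "wicket": [3.0, 0.0, 0.0, 2.0],
}

def mean_vector(tokens, vocab, dim=4):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vocab[t] for t in tokens if t in vocab]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all word vectors
    return [sum(column) / len(known) for column in zip(*known)]

doc_vector = mean_vector(["ball", "wicket", "zzzunknown"], toy_vocab)
all_unknown = mean_vector(["zzzunknown"], toy_vocab)
print(doc_vector)
print(all_unknown)
```

Skipping unknown words rather than adding zero vectors changes only the length of the average, not its direction, so cosine similarity is unaffected. The practical difference is that the all-unknown case is handled explicitly.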

In [None]:
# Getting average vector for each document
# (the 'punkt' tokenizer data was already downloaded above)
out_dict = {}
for sen in fin:  # pick one document at a time from fin (the full document list)
    tokens = nltk.word_tokenize(remove_stopwords(sen))
    average_vector = np.mean(np.array([get_embedding(x) for x in tokens]), axis=0)
    out_dict[sen] = average_vector  # plain assignment avoids shadowing the built-in dict

# Function to calculate the similarity between the query vector and document vector
def get_sim(query_embedding, average_vector_doc):
    # scipy returns the cosine distance, so 1 - distance is the cosine similarity
    sim = 1 - scipy.spatial.distance.cosine(query_embedding, average_vector_doc)
    return sim
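
For reference, the quantity computed by `get_sim` is the standard cosine similarity. Writing $q$ for the query vector and $d$ for a document vector (symbols introduced here only for illustration):

$$
\operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = 1 - \operatorname{dist}_{\cos}(q, d)
$$

The value lies between -1 and 1, with higher values meaning the vectors point in more similar directions. It is undefined when either vector is all zeros, which can happen here if every word in a query is out of vocabulary.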

In [None]:
print(out_dict)

In [None]:
# Rank all the documents based on the similarity to get relevant docs
def Ranked_documents(query):
 query_words = (np.mean(np.array([get_embedding(x) for x in nltk.word_tokenize(query.lower())],dtype=float), axis=0))
 rank = []

 for k,v in out_dict.items():
  rank.append((k, get_sim(query_words, v)))

 rank = sorted(rank,key=lambda t: t[1], reverse=True)
 print('Ranked Documents :')
 return rank

In [None]:
# Call the IR function with a query
Ranked_documents("cricket")

In [None]:
# Let’s try one more query, this time about driving.
Ranked_documents("driving is cool on National Highways")

We can use the same approach and scale it up to as many documents as needed.
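
When the collection grows, usually only the best few matches are needed, so sorting every document is unnecessary. The sketch below assumes similarity scores have already been computed; the `scores` values and the helper name `top_k` are hypothetical. It uses the standard library's `heapq.nlargest` to pick the top results.

```python
import heapq

# Hypothetical similarity scores, as if one query vector had been
# compared against every document vector.
scores = {"doc_a": 0.12, "doc_b": 0.57, "doc_c": 0.08, "doc_d": 0.41, "doc_e": 0.33}

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

best = top_k(scores, 2)
print(best)
```

For small `k`, this does less work than sorting the full list, while returning the same top entries in the same order.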

For better accuracy, we can train our own embeddings for a specific industry, since the pretrained ones we are using are general-purpose.
(This is, however, a time-consuming task.)

This is the fundamental approach that can be used for many applications like the following:
1. Search engines (Internet Search Engines)
2. Document retrieval
3. Passage retrieval
4. Question answering

<img src="https://drive.google.com/uc?id=1kGARF7WUBYrDKbRkQh5l7V01pDTYCwiU">



Results tend to be better when queries are longer and the result length is shorter. That is why search engines often return weaker results when the search query has only a few words.