![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AnTeDe Lab 4: Search Engine with the Vector Space Model

## Summary
The aim of this lab is to build a simple document search engine based on TF-IDF document vectors. 

The lab is inspired by a notebook designed by [Kavita Ganesan](https://github.com/kavgan/nlp-in-practice/blob/master/TF-IDF/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb).

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [31]:
import os    
import nltk  # on Colab, you might find it helpful to run nltk.download('popular') to install packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger') 
nltk.download('omw-1.4')

import gensim
import numpy as np  # needed below for np.where and np.asarray
import pandas as pd
from nltk.corpus import stopwords, wordnet

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from gensim import models, corpora, similarities

In [32]:
!pip install contractions
import contractions

In [None]:
# Import TextPreprocessor.py from Google Drive
# from google.colab import drive
# drive.mount('/content/gdrive')

# # Modify path according to your configuration
# # !ls "/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022"
# import sys
# sys.path.insert(0,'/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022')

# from TextPreprocessor import * 


In [3]:
# Import TextPreprocessor.py from local directory structure
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from Lab1.TextPreprocessor import *


The data used in this lab is a set of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It consists of texts of headline stories from around the years 2000-2001.  This is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF).  It is available as test data in the **gensim** package, so you do not need to download it separately.

The following code will load the documents into a Pandas dataframe.

In [4]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
text = open(lee_train_file).read().splitlines()
data_df = pd.DataFrame({'text': text})

The following code will run our in-house Text Preprocessor provided in the `TextPreprocessor.py` file, and documented in the `MSE_AnTeDe_TextPreprocessingDemo.ipynb` notebook provided in Lab 1 (see Lab 1 archive on Moodle for both files).

<font color='green'> **Question**: Please enhance the code by adding special characters (e.g., quotation marks such as ' or ") to the stopwords, and by using the adjective and noun POS tag sets in the `TextPreprocessor` options.</font>

In [5]:
language = 'english'
stop_words = set(stopwords.words(language))
# Extend the list here:
# BEGIN_REMOVE
# END_REMOVE

processor = TextPreprocessor(
# Add options here:
# BEGIN_REMOVE    
# END_REMOVE 
)

In [33]:
data_df['processed'] = processor.transform(data_df['text'])

We can now look at a few examples of processed texts.

In [14]:
print(data_df['processed'].iloc[136])

In [15]:
data_df.head()

## Generation of document vectors with [Scikit-learn](https://scikit-learn.org/stable)

We will use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from scikit-learn to create a vocabulary and generate word counts or *Term Frequencies* (TF).
    
The result is a matrix representation of the counts: each column represents a _word_ in the vocabulary and each row represents a document in our dataset. Each cell holds the number of times the word occurs in the document.

The matrix is very sparse, because all words not appearing in a document have 0 counts.

Recommended reading on the usage of, and differences between, scikit-learn’s `TfidfTransformer` and `TfidfVectorizer`:

https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/

Before using `TfidfTransformer`, you first have to create a `CountVectorizer` to count the words (term frequencies), limit the vocabulary size, apply stop words, etc.

In [9]:
cv = CountVectorizer(max_features=3000) # keep only the 3000 most frequent words in the corpus
word_count_vector = cv.fit_transform(data_df['processed'])

Let's look at some words from our vocabulary:

In [10]:
feature_names = cv.get_feature_names_out()

In [16]:
print(len(feature_names)) # has the max_features value been reached?
print(feature_names[2500:2505]) # try various slices
print(np.where(feature_names == 'hundred')[0]) # find a word's index
print(feature_names[1315]) # find a word corresponding to an index
print(cv.vocabulary_.items())

# print(cv.vocabulary_.keys())
# print(cv.vocabulary_.values())

Now, let’s check the shape of the document-term matrix, which should have 300 rows (the documents of the Lee Corpus) and 3000 columns (the terms).

In [17]:
word_count_vector.shape

In [18]:
# Output word counts for a particular word  
count_list = np.asarray(word_count_vector.sum(axis=0))[0]
word_count_dict = dict(zip(feature_names, count_list))

res = {k: v for k, v in word_count_dict.items() if k.startswith('hundred')}
# res = {k: v for k, v in word_count_dict.items() if k.startswith('people')}
res

**TfidfTransformer to Compute Inverse Document Frequency (IDF)**

We now use the (sparse) matrix generated by `CountVectorizer` to compute the IDF values of each word.  Note that the IDF should in reality be based on a large and representative corpus.

In [None]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

The IDF values are stored in the `idf_` field of the `TfidfTransformer`.  It has the same size as the array of feature names (words).

In [19]:
print(len(tfidf_transformer.idf_)) # check length

To get a glimpse of how the IDF values look, we are going to print them by placing them in a pandas DataFrame. The values will be sorted in ascending order.

In [21]:
# print a single idf value 
print(tfidf_transformer.idf_[np.where(cv.get_feature_names_out() == 'hundred')]) # check IDF value of a word

# print idf values in a data frame 
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"]) 
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

**We define here two helper functions:**
 * the first one is a sorting function for the columns of a sparse matrix in COOrdinate format (a.k.a "ijv" or "triplet" format [explained here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html));
 * the second one extracts the feature names (*words*) and their TF-IDF values from the sorted list.
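
For readers unfamiliar with the COO format, here is a minimal sketch of the idea behind the first helper. It uses a made-up one-row matrix rather than a real TF-IDF vector: `.col` gives the column index and `.data` the value of each non-zero cell, and the pairs are then sorted by decreasing value.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 1 x 5 row vector standing in for one document's TF-IDF scores
toy_row = csr_matrix(np.array([[0.0, 0.4, 0.0, 0.9, 0.1]]))
toy_coo = toy_row.tocoo()

# Pair each non-zero value with its column index, then sort by decreasing value
pairs = list(zip(toy_coo.col.tolist(), toy_coo.data.tolist()))
ranked = sorted(pairs, key=lambda p: (p[1], p[0]), reverse=True)

print(ranked)  # (column index, score) pairs; zero cells do not appear
```

The column indices can then be mapped back to words through the array of feature names.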

In [None]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Get the feature names and TF-IDF scores of the top n items from a sorted list."""

    # Use only the topn items from the vector
    sorted_items = sorted_items[:topn]

    # Map each feature name to its rounded score (kept in decreasing score order)
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    return results

We now select a document for which we will generate TF-IDF values.  <font color="green">Please select a random document of your choice, with an index between 0 and 299.</font>

In [22]:
doc_orig = data_df['text'].iloc[136]
doc_processed = data_df['processed'].iloc[136]
print(doc_orig)
print(doc_processed)

The next instruction generates the vector of TF-IDF values for the document using the `tfidf_transformer`.

In [None]:
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc_processed]))

Next, we sort the words in the `tf_idf_vector` by decreasing TF-IDF values, first transforming the vector into a coordinate format ('coo'), and then applying our sorting function from above.  We then extract the words with the top 10 scores (and the scores) for the selected document using our second helper function from above and display them.

In [23]:
sorted_items=sort_coo(tf_idf_vector.tocoo())

topn_words = extract_topn_from_vector(feature_names, sorted_items, 10)

print(doc_orig, '\n', topn_words)

Alternatively, the TF-IDF values of the selected document can be inspected by placing them into a pandas DataFrame. Since `tf_idf_vector` contains a single row, its first (and only) row is the vector of the selected document.

In [25]:
feature_names = cv.get_feature_names_out() 
#get tfidf vector for first document 
first_document_vector=tf_idf_vector[0] 
#densify and print the scores 
df_firstdoc = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"]) 
print("the size of the data frame is: ", df_firstdoc.shape)

In [26]:
df_firstdoc.sort_values(by=["tfidf"],ascending=False).head(20)

In [27]:
df_firstdoc.sort_values(by=["tfidf"],ascending=False).tail(20)

Notice that only certain words have non-zero scores. This is because the selected document does not contain all of the top 3000 tokens; the absent ones show up as zeroes. Notice also that one-character words such as “a” never appear in this list: the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more alphanumeric characters.

The more common a word is across documents, the lower its score, and the more specific a word is to the selected document, the higher its score. So the weighting works as expected, apart from the dropped one-character tokens.
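
The missing one-character tokens can be reproduced on a toy example. The sketch below compares the default `CountVectorizer` with one that uses a more permissive `token_pattern`. The alternative pattern is an illustrative assumption, not something the lab requires.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus containing one-character words
toy_docs = ["a plan b", "i saw a uav"]

default_cv = CountVectorizer()
default_cv.fit(toy_docs)

# Illustrative alternative pattern that also accepts one-character tokens
single_char_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_cv.fit(toy_docs)

print(sorted(default_cv.vocabulary_))      # one-character tokens are absent
print(sorted(single_char_cv.vocabulary_))  # 'a', 'b' and 'i' are kept
```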

<font color="green"> **Question**: Please comment briefly on the relevance of these words with respect to the document content.</font>

## Document-document similarity using scikit-learn

In this section, you will write the commands to compute a document-document similarity matrix over the above documents, in scikit-learn.

Please use a processing [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) and a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and compute the *cosine similarities* between all documents.  
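
As a reminder of what cosine similarity measures, here is a small sketch on hand-made vectors (not TF-IDF vectors). It compares scikit-learn's `cosine_similarity` with the textbook definition, the dot product divided by the product of the vector norms. The vectors are arbitrary illustrative values.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy_vectors = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as the first row, twice the length
    [0.0, 3.0, 0.0],  # orthogonal to the first two rows
])

sims = cosine_similarity(toy_vectors)  # 3 x 3 matrix of pairwise similarities

# Same quantity from the definition: dot product divided by the product of the norms
norms = np.linalg.norm(toy_vectors, axis=1)
manual = (toy_vectors @ toy_vectors.T) / np.outer(norms, norms)

print(sims.round(3))
print(np.allclose(sims, manual))
```

Rows pointing in the same direction get a similarity of 1 regardless of their length, and orthogonal rows get 0.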

Note: With `TfidfTransformer`, you first compute the word counts using `CountVectorizer`, then the Inverse Document Frequency (IDF) values, and only then the TF-IDF scores.

With `TfidfVectorizer`, on the contrary, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and TF-IDF scores all using the same dataset.

General guidelines on when to use `TfidfTransformer` rather than `TfidfVectorizer`:

- If you need the term frequency (term count) vectors for different tasks, use `TfidfTransformer`.
- If you need to compute TF-IDF scores on documents within your “training” dataset, use `TfidfVectorizer`.
- If you need to compute TF-IDF scores on documents outside your “training” dataset, use either one; both will work.
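
The relation between the two routes can be checked on a toy corpus. This is a minimal sketch with default parameters and made-up documents, showing that the two-step route and the one-step route yield the same TF-IDF matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, for illustration only
toy_docs = ["fire crew fight fire", "crew rescue people", "people fight flood"]

# Two-step route: counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: counts and TF-IDF weighting in a single object
one_step = TfidfVectorizer().fit_transform(toy_docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

With the two-step route, the intermediate `counts` matrix stays available for other tasks, which is the main reason to prefer it.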

<font color="green">**Question**: At the end, you will be asked to display the five most similar documents to the one you selected above, and compare the 1st and the 5th best results.</font>

You can draw inspiration from:
 * the above code
 * https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XkK2ceFCe-Y
 * https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
 * https://stackoverflow.com/questions/12118720/python-TF-IDF-cosine-to-find-document-similarity
 * https://markhneedham.com/blog/2016/07/27/scitkit-learn-tfidf-and-cosine-similarity-for-computer-science-papers

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
tfidf = TfidfVectorizer(use_idf=True)
pipe = Pipeline(steps=[('pre', processor), ('tfidf', tfidf)]) # the 'processor' was defined above

<font color='green'>**Question**: Please write a function called `find_similar` which receives a `tfidf_matrix` containing the TF-IDF vectors of all documents, and the `index` of a document in the collection, and returns the `top_n` most similar documents to it using cosine similarity.</font>

In [None]:
def find_similar(tfidf_matrix, index, top_n = 5):
# BEGIN_REMOVE
# END_REMOVE

<font color="green">**Question**: Using the data from the Pandas DataFrame created above, please use "fit" and "transform" to generate the matrix of TF-IDF document vectors called "tfidf_matrix". -- How long do these two operations take on your computer?  -- Please explain briefly in your own words what the difference is between "fit" and "transform".</font>

In [29]:
import time
# BEGIN_REMOVE
# END_REMOVE

<font color="green">**Question**: Using `find_similar` and the `tfidf_matrix`, please display the five most similar documents to the one you selected above, with their scores, comment on them, and compare the 1st and the 5th best results.</font>

In [28]:
# BEGIN_REMOVE
# END_REMOVE

<font color='green'>**Question**: Could you also use the dot product instead of the cosine similarity in the `find_similar` function?  Please answer in the following box.</font>

In [None]:
# BEGIN_REMOVE
# END_REMOVE

## Building a search engine using Gensim

<font color='green'>**Question**: Using the [tutorial on Topics and Transformations from Gensim](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py), please implement a method that returns the documents most similar to a given query.
    
Use [Gensim's TF-IDF Model](https://radimrehurek.com/gensim/models/tfidfmodel.html) to build the model and the [MatrixSimilarity function](https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity) to measure cosine similarity between documents.</font>

<font color='green'>Please write a query of your own (5-10 words), retrieve the 5 most similar documents, and comment on the result.</font>

In [30]:
# BEGIN_REMOVE
# END_REMOVE

## End of Lab 4
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).