# SCS 3546: Deep Learning
> **Assignment 3: Contextualized Word Embeddings**

### Your name & student number:

<pre> Student Name: Fernando Espinosa </pre>

<pre> Student Number: X566420 </pre>

## **Assignment Description**
***

Search engines are a standard tool for finding relevant content, and the quality of their results depends heavily on how well they measure the similarity between pieces of text.

### **Objectives**

**Your goal in this assignment is to calculate the textual similarity between queries and the provided sample documents, using a variety of NLP approaches.**

In achieving the above goal, you will also:
- Demonstrate how to preprocess text and embed textual data.
- Compare the results of textual similarity scoring between traditional and deep-learning-based NLP methods.

### **Data and Queries**

You will use the document repository provided by `sample_repository.json`, which you can download from the following link, or from the assignment description in Quercus: https://q.utoronto.ca/courses/286389/files/21993451/download?download_frd=1

The queries you will run against these sample documents are the following:

- Query 1: “fruits”
- Query 2: “vegetables”
- Query 3: “healthy foods in Canada”

### **Techniques to Demonstrate**

The techniques you will use to compute the similarity scores are:
1. TF-IDF.
2. Semantic similarity using GloVe word vectors.
3. Semantic similarity using a BERT-based model.


### **Feel Free to Choose Your Own Approach**

How you demonstrate each of the above techniques is up to you, and you are not expected to use any particular library. The code below is only meant to help you get started. You **do**, however, need to obtain similarity scores **with all 3 techniques above**. The evaluation will be based on your ability to obtain results using all three techniques, plus your discussion/comparison of any differences you observe.



## **Grade Allocation**
***
15 points total

- Experiment 1 (TF-IDF), implementation: 2 marks
- Experiment 2 (GloVe), implementation: 3 marks
- Experiment 3 (BERT), implementation: 3 marks
- Comparison and Discussion: 3 marks
  - Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the _what_, but also the _how_ and _why_ one technique produces different results from another).
- Text Pre-Processing: 2 marks
  - Cleaning and standardization (e.g. lemmatization, stemming) in Experiment 1.
  - Basic text cleaning (e.g. removal of special characters or tags) in Experiments 2 and 3.
- Clarity: 2 marks
  - The marks for clarity are awarded for code documentation, clean code (e.g. avoiding repetition by building reusable functions), and how well you explain and support your answers, including the use of visualizations.


# Setup and Data Import
***
You can use the code snippets below to help you load and extract the document repository.
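
As a quick orientation, the loader below reads a top-level `data` key and treats each entry as a `[title, document]` pair. The following sketch reproduces that layout with a made-up two-entry JSON string (`sample_json` and its contents are hypothetical, not taken from `sample_repository.json`), so you can see what the two list comprehensions produce.

```python
import json

# Hypothetical miniature of the repository file: a top-level "data" key
# holding [title, document] pairs (layout inferred from the loader code).
sample_json = (
    '{"data": ['
    '["botany", "Botany is the science of plant life."], '
    '["Ford Bronco", "The Ford Bronco is a sport utility vehicle."]'
    ']}'
)

repo = json.loads(sample_json)
sample_titles = [title for title, _ in repo["data"]]
sample_documents = [text for _, text in repo["data"]]

# Titles and documents stay aligned by position.
for title, text in zip(sample_titles, sample_documents):
    print(f"{title!r} -> {text!r}")
```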


In [1]:
# you can either drop the file manually into your Colab drive, or otherwise
# use this widget to upload it

from google.colab import files
# uploaded = files.upload()

In [2]:
# this will unpack the json file contents into a list of titles and documents
import json

with open('sample_repository.json') as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]


In [4]:
# let's take a look at some of these documents and titles;
# here we print the last five entries
for idx in range(-5, 0):  # `idx` avoids shadowing the built-in `id`
  print(f"Document title: {titles[idx]}")
  print(f"Document contents: {documents[idx]}")
  print("\n")  # adds blank lines between entries

Document title: botany
Document contents: Botany, also called plant science(s), plant biology or phytology, is the science of plant life and a branch of biology. A botanist, plant scientist or phytologist is a scientist who specialises in this field. 


Document title: Ford Bronco 
Document contents: The Ford Bronco is a model line of sport utility vehicles manufactured and marketed by Ford. ... The first SUV model developed by the company, five generations of the Bronco were sold from the 1966 to 1996 model years. A sixth generation of the model line is sold from the 2021 model year. the Ford Bronco will be available in Canada, with first deliveries beginning in spring of 2021. The Bronco will come in six versions in Canada: Base, Big Bend, Black Diamond, Outer Banks, Wildtrak and Badlands. 


Document title: List of fruit dishes
Document contents: Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in

# Experiment 1: TF-IDF
***

**T**erm **F**requency - **I**nverse **D**ocument **F**requency (TF-IDF) is a traditional NLP technique that weights each word by how often it appears in a document (term frequency), discounted by how many documents in the corpus contain it (inverse document frequency). Two texts that share highly weighted words receive a high similarity score. For this experiment, you are free to use the TF-IDF implementation provided by scikit-learn.
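
To make the weighting concrete, here is a minimal plain-Python sketch. It assumes whitespace-tokenized toy documents (`toy_docs` is made up) and uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which corresponds to scikit-learn's documented defaults. It is an illustration only, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors)."""
    n_docs = len(token_lists)
    vocab = sorted({tok for toks in token_lists for tok in toks})
    # document frequency: number of documents containing each term
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # smoothed IDF: terms found in fewer documents get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)  # raw term counts in this document
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

# Made-up toy corpus, for illustration only.
toy_docs = [
    "apple banana apple".split(),
    "banana carrot".split(),
    "carrot carrot truck".split(),
]
vocab, vectors = tfidf_vectors(toy_docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term}: {weight:.4f}")
```

In the first toy document, `apple` outweighs `banana` both because it occurs twice and because it appears in fewer documents.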


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

nltk.download('punkt')
# keep the list version to avoid an error in the TfidfVectorizer() constructor:
#     InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.
stop_words_list = stopwords.words('english')
stop_words = set(stop_words_list)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [21]:

vectorizer = TfidfVectorizer(stop_words=stop_words_list)

# >>> fit the TF-IDF vectorizer to the document corpus and transform the documents
#     into TF-IDF vectors
document_vectors = vectorizer.fit_transform(documents)


In [None]:

# Calculate the word frequency, and a measure of similarity of the search terms with each document.
# Do not apply any text pre-processing (i.e. cleanup) yet.

# >>> calculate word frequencies (TF-IDF values)
word_frequencies = document_vectors.toarray()
terms = vectorizer.get_feature_names_out()

# >>> print term frequencies for each document
print("Word Frequencies (TF-IDF Values):")
for i, doc in enumerate(documents):
    print(f"Document {i+1}: {doc}")
    for j, term in enumerate(terms):
        if word_frequencies[i, j] > 0:  # Only show terms with non-zero TF-IDF
            print(f"  {term}: {word_frequencies[i, j]:.4f}")
    print()


In [None]:

# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results

queries = ['fruits', 'vegetables', 'healthy foods in Canada']

for query in queries:
  # transforms each query into a TF-IDF vector based on the same vocabulary and weights learned from the documents
  query_vector = vectorizer.transform([query])

  # use cosine similarity between the query vector and all document vectors
  cosine_similarities = linear_kernel(query_vector, document_vectors).flatten()

  # sort the indices of the documents by their similarity scores in descending order
  top_indices = cosine_similarities.argsort()[::-1]

  print(f"query: {query}")
  for index in top_indices[:5]:
      print(f">> score: {cosine_similarities[index]:.4f} for document: {documents[index]}")
  print()
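
The loop above treats `linear_kernel` (a plain dot product) as cosine similarity. That is valid because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit-length vectors the two quantities coincide. The sketch below checks this on two made-up vectors; the helper names are illustrative.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalise(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

# Made-up vectors standing in for a query row and a document row.
query_vec = l2_normalise([0.0, 2.0, 1.0])
doc_vec = l2_normalise([3.0, 1.0, 0.0])

print(f"dot product: {dot(query_vec, doc_vec):.6f}")
print(f"cosine:      {cosine(query_vec, doc_vec):.6f}")
```

If you ever disable normalization (for example with `norm=None`), the dot product is no longer a cosine and `cosine_similarity` from `sklearn.metrics.pairwise` would be the safer choice.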


## Repeat the same task after some preprocessing

Use a minimum of 2 different text cleaning/standardization techniques (e.g. lemmatization, removing punctuation, etc).
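
One possible shape for such a cleaning step is sketched below, using only the standard library. It lowercases, strips punctuation, and applies an optional per-token normalizer. The `naive_singular` helper is a deliberately crude stand-in for a real lemmatizer or stemmer, and both function names are hypothetical.

```python
import string

def clean_text(text, normalise_token=None):
    """Lowercase, strip punctuation, and optionally normalise each token."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if normalise_token is not None:
        tokens = [normalise_token(tok) for tok in tokens]
    return " ".join(tokens)

def naive_singular(token):
    """Crude plural stripper; a placeholder for a real lemmatizer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

print(clean_text("Fruits, vegetables & healthy FOODS!", naive_singular))
```

To use NLTK instead, you could pass `WordNetLemmatizer().lemmatize` as `normalise_token` (this needs the `wordnet` resource to be downloaded first). Apply the same cleaning function to both the documents and the queries so that their vocabularies match.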

In [None]:
# e.g. you can use a lemmatizer to reduce words down to their
# simplest 'lemma' (helpful when dealing with plurals)

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer


In [None]:
#!pip install tfidf

What impact did the text cleaning / preprocessing have on your results?

In [None]:
# your response here

# Experiment 2: Semantic matching using GloVe embeddings
***

In [None]:
# if you decide to use the gensim library and the sample code below,
# you need gensim version >= 4.0.1 installed
!pip install gensim==4.0.1
import gensim
print(gensim.__version__)

In [None]:
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [None]:
# optional, but it helps
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [None]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

In [None]:
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # you may decide to add additional steps here
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [None]:
# Load test data
with open('sample_repository.json') as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]

In [None]:
query_s = 'Your queries here'

# Preprocess the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = preprocess(query_s)

In [None]:
# Download and load the GloVe word vector embeddings
if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)

In [None]:
# Build the term dictionary and the TF-IDF model.
# The search query must be included in the dictionary as well, in case its terms do not appear in any document.
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix.
# The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column.
# In my case, I got best results by removing the default value of 100
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)

In [None]:
# Compute similarity measure between the query and the documents.
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]
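
For intuition about what `SoftCosineSimilarity` computes, here is a minimal sketch of the soft cosine measure, `x·S·y / (sqrt(x·S·x) · sqrt(y·S·y))`, where `S` holds term-to-term similarities. The three-term vocabulary and the values in `S` are made up for illustration. In the cells above, `S` is instead derived from GloVe vectors, and gensim works with sparse matrices rather than nested lists.

```python
import math

def bilinear(x, S, y):
    """Compute x^T S y for plain Python lists."""
    return sum(x[i] * S[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft cosine similarity between bag-of-words vectors x and y."""
    denom = math.sqrt(bilinear(x, S, x)) * math.sqrt(bilinear(y, S, y))
    return bilinear(x, S, y) / denom if denom else 0.0

# Hypothetical vocabulary: ["fruit", "apple", "truck"].
# S[i][j] is an assumed similarity between terms i and j (1.0 on the diagonal).
S = [
    [1.0, 0.7, 0.0],
    [0.7, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

query_bow = [1.0, 0.0, 0.0]  # contains only "fruit"
doc_apple = [0.0, 1.0, 0.0]  # contains only "apple"
doc_truck = [0.0, 0.0, 1.0]  # contains only "truck"

print("soft cosine, fruit vs apple:", soft_cosine(query_bow, doc_apple, S))
print("hard cosine, fruit vs apple:", soft_cosine(query_bow, doc_apple, identity))
print("soft cosine, fruit vs truck:", soft_cosine(query_bow, doc_truck, S))
```

With the identity matrix the measure reduces to ordinary cosine similarity, so documents that share no exact terms with the query score zero. The off-diagonal entries are what let a query match a related term.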

In [None]:
# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results


# Experiment 3: BERT Model
***
Use a BERT model to obtain sentence embeddings and calculate the similarity between queries and documents.

> Hint: see the Module 07 Jupyter notebook for examples of how to work with BERT.
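
One common way to turn per-token BERT outputs into a single sentence embedding is mean pooling: average the token vectors while ignoring padding positions. The sketch below shows only that pooling step and the cosine comparison, on made-up 2-dimensional vectors. Obtaining real token embeddings from a model is left out, and whether the Module 07 notebook pools this way is an assumption.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-dimensional "token embeddings"; a real model produces
# hundreds of dimensions per token.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is padding
query_mask = [1, 1, 0]
doc_tokens = [[2.0, 2.0], [4.0, 4.0]]
doc_mask = [1, 1]

query_embedding = mean_pool(query_tokens, query_mask)
doc_embedding = mean_pool(doc_tokens, doc_mask)
print(query_embedding, doc_embedding, cosine(query_embedding, doc_embedding))
```

Other pooling choices exist (for example, taking the `[CLS]` token vector), and they can rank documents differently, which is worth noting in your discussion.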

In [None]:
# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results

# Technique Comparison
***

Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the what, but also the how and why one technique produces different results from another).
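
Because raw scores from different techniques are not on a common scale, one simple way to support the comparison is to look at how much the top-ranked documents overlap. The sketch below computes a Jaccard overlap between top-k lists. All scores are made up for illustration, and the helper names are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Made-up scores for six documents, for illustration only.
scores_by_method = {
    "tfidf": [0.9, 0.0, 0.0, 0.4, 0.0, 0.1],
    "glove": [0.8, 0.5, 0.1, 0.6, 0.0, 0.2],
    "bert":  [0.7, 0.6, 0.2, 0.3, 0.1, 0.5],
}
top = {name: top_k(scores, 3) for name, scores in scores_by_method.items()}

names = list(top)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        overlap = jaccard(top[first], top[second])
        print(f"{first} vs {second}: top-3 overlap = {overlap:.2f}")
```

A low overlap for a given query is a good prompt for explaining why the techniques disagree, for example exact-term matching versus embedding-based similarity.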
