# SCS 3546: Deep Learning
> **Assignment 3: Contextualized Word Embeddings**

### Your name & student number:

<pre> Gordon Chan </pre>

<pre> qq525548 </pre>

## **Assignment Description**
***

Search engines are a standard tool for finding relevant content, and the quality of their results depends heavily on how well they measure the similarity between pieces of text.

### **Objectives**

**Your goal in this assignment is to calculate the textual similarity between queries and the provided sample documents, using a variety of NLP approaches.**

In achieving the above goal, you will also:
- Demonstrate how to preprocess text and embed textual data.
- Compare the results of textual similarity scoring between traditional and deep-learning based NLP methods.

### **Data and Queries**

You will use the document repository provided by `sample_repository.json`, which you can download from the following link, or from the assignment description in Quercus: https://q.utoronto.ca/courses/286389/files/21993451/download?download_frd=1

The queries you will run against these sample documents are the following:

- Query 1: “fruits”
- Query 2: “vegetables”
- Query 3: “healthy foods in Canada”

### **Techniques to Demonstrate**

The techniques you will use to compute the similarity scores are:
- 1. TF-IDF.
- 2. Semantic similarity using GloVe word vectors.
- 3. Semantic similarity using a BERT-based model.


### **Feel Free to Choose Your Own Approach**

How you go about demonstrating each of the above techniques is up to you. You are not expected to use any particular library. The code below is just meant to provide you with some guidance to get started. You **do**, however, need to demonstrate obtaining similarity scores **with all 3 techniques above**, but how you go about doing this is totally up to you. The evaluation will be based on your ability to obtain results using all three techniques, plus your discussion/comparison of any differences you observe.



## **Grade Allocation**
***
15 points total

- Experiment 1 (TF-IDF), implementation: 2 marks
- Experiment 2 (GloVe), implementation: 3 marks
- Experiment 3 (BERT), implementation: 3 marks
- Comparison and Discussion: 3 marks
  - Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the _what_, but also the _how_ and _why_ one technique produces different results from another).
- Text Pre-Processing: 2 marks
 - Cleaning and standardization (e.g. lemmatization, stemming) in Experiment 1
 - Basic text cleaning (e.g. removal of special characters or tags) in Experiments 2 and 3.
- Clarity: 2 marks
 - The marks for clarity are awarded for code documentation, clean code (e.g. avoiding repetition by building re-usable functions) and how well you explained/supported your answers, including the use of visualizations.


# Setup and Data Import
***
You can use the code snippets below to help you load and extract the document repository.


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
filePath ="/content/gdrive/MyDrive/neural_data/sample_repository.json"

In [3]:
# this will unpack the json file contents into a list of titles and documents
import json

with open(filePath) as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]


In [4]:
# let's take a look at some of these documents and titles;
# here we print the last five entries
for i in range(-5, 0):  # use i rather than id, which shadows a Python builtin
  print(f"Document title: {titles[i]}")
  print(f"Document contents: {documents[i]}")
  print("\n") # adds newline

Document title: botany
Document contents: Botany, also called plant science(s), plant biology or phytology, is the science of plant life and a branch of biology. A botanist, plant scientist or phytologist is a scientist who specialises in this field. 


Document title: Ford Bronco 
Document contents: The Ford Bronco is a model line of sport utility vehicles manufactured and marketed by Ford. ... The first SUV model developed by the company, five generations of the Bronco were sold from the 1966 to 1996 model years. A sixth generation of the model line is sold from the 2021 model year. the Ford Bronco will be available in Canada, with first deliveries beginning in spring of 2021. The Bronco will come in six versions in Canada: Base, Big Bend, Black Diamond, Outer Banks, Wildtrak and Badlands. 


Document title: List of fruit dishes
Document contents: Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in

In [5]:
titles[0], len(titles)

('Pomegranate Bhagwa', 32)

In [6]:
documents[0][:80], len(documents)

('Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat',
 32)

# Experiment 1: TF-IDF
***

**T**erm **F**requency - **I**nverse **D**ocument **F**requency (TF-IDF) is a traditional NLP technique that weights each term by how often it appears in a document (term frequency), scaled down for terms that appear in many documents (inverse document frequency). Two pieces of text are then compared through the weighted terms they share. For this experiment, you are free to use the TF-IDF implementation provided by scikit-learn.
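
To make the weighting concrete, here is a minimal from-scratch sketch of TF-IDF with cosine similarity in plain Python. It assumes the smoothed IDF formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The toy texts and the helper names `tfidf_vectors` and `cosine` are illustrative assumptions, not part of the assignment.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse {term: weight} dict per text, L2-normalized."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed IDF (assumed scikit-learn default formula).
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two sparse vectors that are already L2-normalized."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy corpus (made up): the first entry plays the role of the query.
toy_texts = [
    "fruit",
    "fruit salad with fresh fruit",
    "ford bronco sport utility vehicle",
]
query_vec, *doc_vecs = tfidf_vectors(toy_texts)
scores = [cosine(query_vec, d) for d in doc_vecs]
print(scores)
```

Only the document that shares a term with the query gets a non-zero score, which is the behaviour to expect from the scikit-learn cells that follow.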


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

nltk.download('punkt')
stop_words = list(set(stopwords.words('english')))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [8]:
def find_top_similarity(query, top_n=5):
  vectorizer = TfidfVectorizer(stop_words=stop_words)
  # Row 0 of the matrix is the query; the remaining rows are the documents
  vectors = vectorizer.fit_transform([query] + documents)

  # Calculate similarity scores
  # Compare the query vector (vectors[0:1]) with every row in vectors
  # linear_kernel computes dot products, which equal cosine similarity here
  # because TfidfVectorizer L2-normalizes its rows by default
  cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()

  # Skip the best match (the query itself) and take the next top_n rows
  top_indices = cosine_similarities.argsort()[-2:-top_n-2:-1]
  print("Query:", query)
  print(f"Top {top_n} similar documents:")
  for idx in top_indices:
      # idx - 1 maps a matrix row back to its position in documents
      print(f"Document {idx-1}: \t{documents[idx-1][:80]}, \tsimilarity score {round(cosine_similarities[idx], 2)}")


In [9]:
queries = ['fruits', 'vegetables', 'healthy foods in Canada']
for query in queries:
  find_top_similarity(query)
  print()

Query: fruits
Top 5 similar documents:
Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.16
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.07
Document 14: 	Berry Size: 18mm and above Packaging Packing Size: 4.5 kg loose in carry bags 8., 	similarity score 0.0
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.0
Document 1: 	Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bi, 	similarity score 0.0

Query: vegetables
Top 5 similar documents:
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.08
Document 14: 	Berry Size: 18mm and above Packaging Packing Size: 4.5 kg loose in carry bags 8., 	similarity score 0.0
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegran

In [10]:
documents[10]

"Canada's Food Guide is a nutrition guide produced by Health Canada to promote Healthy behaviours and habits, and lifestyles in Canada - this is to increase the number of healthy people in Canada. In 2007, it was reported to be the second most requested Canadian government publication, behind the Income Tax Forms. The Health Canada website states: Food guides are basic education tools that are designed to help people follow a healthy diet. The Guide recommends eating a variety of healthy foods each day including plenty of vegetables and fruits, protein foods, and whole grain foods. It recommends choosing protein foods that come from plants more often. It also recommends limiting highly processed foods."

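`documents[10]` appears in the top 5 for all three queries. This is expected for an exact-match method, since the document contains the terms "fruits", "vegetables", "healthy", "foods" and "Canada" verbatim.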

## Repeat the same task after some preprocessing

Use a minimum of 2 different text cleaning/standardization techniques (e.g. lemmatization, removing punctuation, etc).

In [11]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [12]:
# e.g. you can use a lemmatizer to reduce words down to their
# simplest 'lemma' (helpful when dealing with plurals)

from nltk.corpus import wordnet
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

In [13]:
def get_wordnet_pos(treebank_tag):
    """
    Convert Treebank POS tags to WordNet POS tags
    The Penn Treebank tagset uses different tags (eg, NN for noun)
    than WordNet (e.g., N for noun)
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun if POS tag is not found


In [14]:
def lemmatize_document(document):
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(document)

    # Remove punctuation and convert to lowercase
    words = [word.lower() for word in words if word.isalnum()]

    pos_tags = nltk.pos_tag(words)
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    return ' '.join(lemmatized_words)


In [15]:
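def find_top_similarity_2(query, top_n=5, stop_words=None):
    # stop_words defaults to None, so no stop words are removed
    # unless a list is passed in explicitly

    # preprocessing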
    lemmatized_documents = [lemmatize_document(doc) for doc in documents]
    lemmatized_query = lemmatize_document(query)

    vectorizer = TfidfVectorizer(stop_words=stop_words)
    vectors = vectorizer.fit_transform([lemmatized_query] + lemmatized_documents)

    # Calculate similarity scores
    cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()

    # Output the similarity scores for the top N documents
    top_indices = cosine_similarities.argsort()[-2:-top_n-2:-1]
    print("Query:", query)
    print(f"Top {top_n} similar documents:")
    for idx in top_indices:
        print(f"Document {idx-1}: \t{documents[idx-1][:80]}, \tsimilarity score {round(cosine_similarities[idx], 2)}")


In [16]:
for query in queries:
  find_top_similarity_2(query)
  print()

Query: fruits
Top 5 similar documents:
Document 29: 	Fruit dishes are those that use fruit as a primary ingredient. Condiments prepar, 	similarity score 0.44
Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.23
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.15
Document 31: 	A fruit serving bowl is a round dish or container typically used to prepare and , 	similarity score 0.08
Document 1: 	Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bi, 	similarity score 0.05

Query: vegetables
Top 5 similar documents:
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.07
Document 14: 	Berry Size: 18mm and above Packaging Packing Size: 4.5 kg loose in carry bags 8., 	similarity score 0.0
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomeg

What impact did the text cleaning / preprocessing have on your results?

After preprocessing, the scores change for some queries. For the query 'fruits', the top document without preprocessing was Document 6 (score 0.16), but after preprocessing Document 29 comes first (score 0.44). The likely reason is that lemmatization maps "fruits" to "fruit", so the query now matches documents that only use the singular form. For the query 'vegetables', the ranking is unchanged and the top score barely moves (0.08 versus 0.07). For the query 'healthy foods in Canada', the top 3 documents are the same for both approaches. With preprocessing, Document 31 and Document 28 are also picked up. Hence, for some queries, relevant documents are missed when the preprocessing steps are skipped.
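
The change for the 'fruits' query comes down to exact token matching: TF-IDF treats `fruits` and `fruit` as unrelated terms unless preprocessing maps them to the same form. A minimal sketch of this effect follows. It uses a deliberately crude plural-stripping rule as a stand-in for the WordNet lemmatizer; `crude_singular` and `shared_terms` are hypothetical helpers, and the rule is far weaker than real lemmatization.

```python
def crude_singular(token):
    """Toy stand-in for lemmatization: strip one trailing 's' (assumption, not WordNet)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def shared_terms(query, document, normalize=None):
    """Return the set of terms that the query and the document have in common."""
    def tokens(text):
        words = [w.strip(".,").lower() for w in text.split()]
        return {normalize(w) if normalize else w for w in words if w}
    return tokens(query) & tokens(document)

# First sentence of the 'List of fruit dishes' document shown earlier.
doc = "Fruit dishes are those that use fruit as a primary ingredient."

print(shared_terms("fruits", doc))                  # raw tokens: no overlap
print(shared_terms("fruits", doc, crude_singular))  # plurals folded: overlap on 'fruit'
```

With no shared term the TF-IDF dot product is zero, so the document cannot be ranked for that query no matter how relevant it is.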

# Experiment 2: Semantic matching using GloVe embeddings
***

In [17]:
# if you decide to use the gensim library and the sample codes below,
# you would need gensim version >=4.0.1 to be installed

import gensim
print(gensim.__version__)

4.3.2


In [18]:
import logging
import json
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [19]:
# optional, but it helps
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [20]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # you may decide to add additional steps here
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [23]:
# Load test data
with open(filePath) as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]

In [24]:
queries

['fruits', 'vegetables', 'healthy foods in Canada']

In [25]:
# Preprocess the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = preprocess(queries[0])

In [26]:
# Download and load the GloVe word vector embeddings
if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)



In [27]:
# Build the term dictionary and the TF-IDF model
# The search query must be added to the dictionary as well, in case its terms do not occur in any document
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix.
# The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column.
# In my case, I got best results by removing the default value of 100
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)

100%|██████████| 568/568 [00:15<00:00, 37.66it/s]


In [28]:
# Compute similarity measure between the query and the documents.
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]



In [30]:
doc_similarity_scores, len(doc_similarity_scores)

(array([0.8091958 , 0.72479224, 0.        , 0.        , 0.41293627,
        0.        , 0.88390946, 0.        , 0.5453234 , 0.5064042 ,
        0.7613044 , 0.        , 0.        , 0.5223634 , 0.        ,
        0.5223635 , 0.56572443, 0.41293627, 0.45817053, 0.        ,
        0.        , 0.50264883, 0.50177664, 0.63024676, 0.        ,
        0.69269216, 0.        , 0.        , 0.        , 0.8436877 ,
        0.        , 0.8436877 ], dtype=float32),
 32)
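
Unlike plain cosine similarity, the soft cosine measure also gives credit for related but non-identical terms through a term-similarity matrix `S`: `softcos(a, b) = aᵀSb / sqrt(aᵀSa · bᵀSb)`. The sketch below shows the formula in plain Python. The three-term vocabulary and the values in `S` are made-up toy numbers, not GloVe output, and `soft_cosine` is a hypothetical helper rather than gensim's implementation.

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine similarity of bag-of-words vectors a and b under term-similarity matrix S."""
    def bilinear(x, y):
        return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(bilinear(a, a) * bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0

# Toy vocabulary: index 0 = "fruit", 1 = "pomegranate", 2 = "bronco".
# The off-diagonal 0.6 is an invented term similarity (assumption).
S = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

query = [1.0, 0.0, 0.0]            # contains only "fruit"
pomegranate_doc = [0.0, 1.0, 0.0]  # contains only "pomegranate"
bronco_doc = [0.0, 0.0, 1.0]       # contains only "bronco"

print(soft_cosine(query, pomegranate_doc, identity))  # identity S reduces to plain cosine
print(soft_cosine(query, pomegranate_doc, S))         # related term now contributes
print(soft_cosine(query, bronco_doc, S))              # unrelated term still scores zero
```

This is why a document can score well for a query here without containing the query word itself.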

In [55]:
sorted_indices = np.argsort(doc_similarity_scores)[::-1]  # Sort indices in descending order

for idx in sorted_indices[:5]:
  # Format with :.2f; a rounded float32 would otherwise print as 0.8799999952316284
  print(f"Document {idx}: \t{documents[idx][:80]}, \tsimilarity score {doc_similarity_scores[idx]:.2f}")

Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.88
Document 31: 	A fruit serving bowl is a round dish or container typically used to prepare and , 	similarity score 0.84
Document 29: 	Fruit dishes are those that use fruit as a primary ingredient. Condiments prepar, 	similarity score 0.84
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.81
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.76


In [60]:
def glove_find_top_similarity(query, top_n=5):

SyntaxError: incomplete input (<ipython-input-60-4c75df8f0335>, line 1)
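
The cell above is incomplete. One way to factor out the repeated "sort and print the top N" step, independent of how the scores were computed, is sketched below. The helper name `top_n_documents` and its signature are assumptions for illustration, not the intended implementation of `glove_find_top_similarity`, and the inputs in the usage example are made up.

```python
def top_n_documents(scores, documents, top_n=5, preview_chars=80):
    """Return (index, score, preview) tuples for the top_n highest-scoring documents."""
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    # Sort indices by descending score; sorted() is stable, so ties keep document order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [(i, float(scores[i]), documents[i][:preview_chars]) for i in ranked[:top_n]]

# Toy inputs (made up for illustration).
toy_documents = ["doc about fruit", "doc about cars", "doc about vegetables"]
toy_scores = [0.88, 0.0, 0.41]

for idx, score, preview in top_n_documents(toy_scores, toy_documents, top_n=2):
    print(f"Document {idx}: \t{preview}, \tsimilarity score {score:.2f}")
```

Because the helper only needs a sequence of scores, the same function could serve the TF-IDF, GloVe and BERT experiments.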

# Experiment 3: BERT Model
***
Use a BERT model to obtain sentence embeddings and calculate the similarity between queries and documents.

> Hint: see the Module 07 jupyter notebook for examples of how to work with BERT.
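
One common way to turn a BERT model's per-token outputs into a single sentence embedding is mask-aware mean pooling, followed by cosine similarity between the query and document embeddings. The sketch below shows only those two steps in plain Python. The token vectors are made-up toy numbers standing in for model output, and `mean_pool` and `cosine_similarity` are hypothetical helper names, not a library API.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, so padding is ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 2-dimensional "token embeddings"; the last row of each is padding (mask 0).
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
doc_tokens = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
doc_mask = [1, 1, 0]

query_emb = mean_pool(query_tokens, query_mask)
doc_emb = mean_pool(doc_tokens, doc_mask)
print(query_emb, doc_emb, cosine_similarity(query_emb, doc_emb))
```

Cosine similarity ignores vector length, so the two toy embeddings count as identical in direction even though their magnitudes differ.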

In [None]:
# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results

# Technique Comparison
***

Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the what, but also the how and why one technique produces different results from another).
