# SCS 3546: Deep Learning
> **Assignment 3: Contextualized Word Embeddings**

### Your name & student number:

<pre> Gordon Chan </pre>

<pre> qq525548 </pre>

## **Assignment Description**
***

Search Engines are a standard tool for finding relevant content. The calculation of similarity between textual information is an important factor for better search results.

### **Objectives**

**Your goal in this assignment is to calculate the textual similarity between queries and the provided sample documents, using a variety of NLP approaches.**

In achieving the above goal, you will also:
- Demonstrate how to preprocess text and embed textual data.
- Compare the results of textual similarity scoring between traditional and deep-learning based NLP methods.

### **Data and Queries**

You will use the document repository provided by `sample_repository.json`, which you can download from the following link, or from the assignment description in Quercus: https://q.utoronto.ca/courses/286389/files/21993451/download?download_frd=1

The queries you will run against these sample documents are the following:

- Query 1: “fruits”
- Query 2: “vegetables”
- Query 3: “healthy foods in Canada”

### **Techniques to Demonstrate**

The techniques you will use to compute the similarity scores are:
1. TF-IDF.
2. Semantic similarity using GloVe word vectors.
3. Semantic similarity using a BERT-based model.


### **Feel Free to Choose Your Own Approach**

How you go about demonstrating each of the above techniques is up to you. You are not expected to use any particular library. The code below is just meant to provide you with some guidance to get started. You **do**, however, need to demonstrate obtaining similarity scores **with all 3 techniques above**. The evaluation will be based on your ability to obtain results using all three techniques, plus your discussion/comparison of any differences you observe.



## **Grade Allocation**
***
15 points total

- Experiment 1 (TF-IDF), implementation: 2 marks
- Experiment 2 (GloVe), implementation: 3 marks
- Experiment 3 (BERT), implementation: 3 marks
- Comparison and Discussion: 3 marks
  - Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the _what_, but also the _how_ and _why_ one technique produces different results from another).
- Text Pre-Processing: 2 marks
  - Cleaning and standardization (e.g. lemmatization, stemming) in Experiment 1
  - Basic text cleaning (e.g. removal of special characters or tags) in Experiments 2 and 3.
- Clarity: 2 marks
  - The marks for clarity are awarded for code documentation, clean code (e.g. avoiding repetition by building re-usable functions) and how well you explained/supported your answers, including the use of visualizations.


# Setup and Data Import
***
You can use the code snippets below to help you load and extract the document repository.


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
filePath ="/content/gdrive/MyDrive/neural_data/sample_repository.json"

In [3]:
# this will unpack the json file contents into a list of titles and documents
import json

with open(filePath) as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]


In [4]:
# let's take a look at some of these documents and titles;
# here we print the last five entries
for idx in range(-5, 0):
  print(f"Document title: {titles[idx]}")
  print(f"Document contents: {documents[idx]}")
  print("\n")  # adds blank lines between entries

Document title: botany
Document contents: Botany, also called plant science(s), plant biology or phytology, is the science of plant life and a branch of biology. A botanist, plant scientist or phytologist is a scientist who specialises in this field. 


Document title: Ford Bronco 
Document contents: The Ford Bronco is a model line of sport utility vehicles manufactured and marketed by Ford. ... The first SUV model developed by the company, five generations of the Bronco were sold from the 1966 to 1996 model years. A sixth generation of the model line is sold from the 2021 model year. the Ford Bronco will be available in Canada, with first deliveries beginning in spring of 2021. The Bronco will come in six versions in Canada: Base, Big Bend, Black Diamond, Outer Banks, Wildtrak and Badlands. 


Document title: List of fruit dishes
Document contents: Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in

In [5]:
titles[0], len(titles)

('Pomegranate Bhagwa', 32)

In [6]:
documents[0][:80], len(documents)

('Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat',
 32)

# Experiment 1: TF-IDF
***

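**T**erm **F**requency - **I**nverse **D**ocument **F**requency (TF-IDF) is a traditional NLP technique that scores each word by how often it appears in a document (term frequency), weighted down by how many documents in the collection contain it (inverse document frequency). Two pieces of text are then compared through the words they share: the more distinctive the shared words, the higher the similarity. For this experiment, you are free to use the TF-IDF implementation provided by scikit-learn.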
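To make the scoring concrete, here is a minimal pure-Python sketch of TF-IDF followed by cosine similarity. It is an illustration only, not the scikit-learn implementation: the helper names (`tfidf_vectors`, `dot`) and the toy texts are made up, tokenisation is a plain whitespace split, and the smoothed IDF formula is assumed to mirror scikit-learn's default (`smooth_idf=True`).

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised texts."""
    tokenised = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    n_docs = len(tokenised)
    # Document frequency: number of texts that contain each term.
    df = Counter(token for tokens in tokenised for token in set(tokens))
    # Smoothed IDF (assumption: same form as scikit-learn's default).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, rows

def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

toy_texts = [
    "fresh fruit",                  # plays the role of the query
    "fresh fruit and fruit juice",  # shares two terms with the query
    "used car for sale",            # shares no terms with the query
]
vocabulary, rows = tfidf_vectors(toy_texts)
similarities = [dot(rows[0], row) for row in rows[1:]]
print(similarities)
```

The second toy text scores above zero because it shares terms with the query, while the third scores exactly zero: TF-IDF only rewards literal word overlap.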


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

nltk.download('punkt')
stop_words = list(set(stopwords.words('english')))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Word Frequency in documents

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
# binary=True records only whether a word occurs in a document (0 or 1), not how many times
cv = CountVectorizer(binary=True)

X = cv.fit_transform(documents)

In [10]:
cv.vocabulary_

{'fresh': 273,
 'pomegranate': 483,
 'from': 274,
 'anushka': 70,
 'avni': 88,
 'international': 331,
 'bhagwa': 112,
 'is': 333,
 'premium': 488,
 'variety': 638,
 'india': 320,
 'the': 604,
 'deep': 189,
 'red': 521,
 'arils': 79,
 'pleasing': 478,
 'but': 133,
 'rugged': 535,
 'skin': 563,
 'enhances': 234,
 'appearance': 71,
 'whilst': 656,
 'promoting': 502,
 'shelf': 554,
 'life': 357,
 'of': 428,
 'fruit': 275,
 'widely': 661,
 'known': 348,
 'for': 264,
 'its': 336,
 'soft': 565,
 'seed': 542,
 'dark': 184,
 'color': 160,
 'and': 68,
 'extremely': 246,
 'delicious': 191,
 'packaging': 453,
 'net': 412,
 'weight': 649,
 'box': 128,
 '5kg': 47,
 '00kg': 1,
 'details': 202,
 'minimum': 400,
 '180gm': 11,
 'maximum': 386,
 '400gm': 38,
 'cherry': 151,
 'taste': 599,
 'sweet': 594,
 'count': 174,
 'carton': 145,
 '50': 43,
 'kg': 345,
 'wt': 666,
 'numbers': 420,
 'packed': 454,
 'per': 467,
 '350': 36,
 '400gms': 39,
 'arakta': 77,
 'this': 609,
 'are': 78,
 'bigger': 114,
 'in': 3

In [11]:
# show the occurrence of vocabulary words in each document, with each row corresponding to one document
print(X.toarray())

[[0 1 0 ... 1 0 0]
 [0 1 1 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [12]:
# Number of documents containing each word (X is binary, so each column sum is a document frequency)
import numpy as np
word_freq = np.sum(X.toarray(), axis=0)

word_freq_dict = dict(zip(cv.get_feature_names_out(), word_freq))
print("\nWord Frequency:\n", word_freq_dict)


Word Frequency:
 {'00': 1, '00kg': 2, '10': 1, '10kg': 3, '12': 4, '15': 1, '1540s': 1, '15kg': 3, '16mm': 1, '1753': 1, '17kg': 3, '180gm': 2, '18kg': 3, '18mm': 2, '1966': 1, '1970s': 1, '1996': 1, '20': 1, '2007': 1, '2021': 1, '2040': 2, '20ft': 3, '20kg': 3, '220': 1, '225': 1, '2400': 2, '25kg': 3, '275': 1, '275gms': 1, '28': 3, '290': 1, '30': 3, '30kg': 3, '320gms': 1, '325gms': 1, '3400': 2, '350': 2, '3kg': 3, '400gm': 2, '400gms': 2, '40ft': 3, '4400': 1, '45': 3, '50': 5, '500': 2, '55': 3, '5500': 1, '5kg': 5, '60': 3, '70': 3, '9kg': 3, 'aai': 6, 'ability': 6, 'about': 2, 'above': 6, 'absorption': 1, 'academic': 1, 'acclaimed': 1, 'according': 1, 'account': 1, 'across': 1, 'actively': 1, 'agro': 6, 'all': 6, 'along': 1, 'also': 5, 'an': 5, 'analyses': 1, 'and': 24, 'animals': 1, 'anushka': 12, 'appearance': 1, 'apples': 1, 'approach': 1, 'apricots': 1, 'april': 3, 'arabia': 1, 'arakta': 1, 'are': 14, 'arils': 2, 'as': 13, 'assimilation': 1, 'assortment': 6, 'attached': 

In [13]:
print(len(word_freq_dict))

669
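Because the count matrix above is built with `binary=True`, summing its columns gives the number of documents that contain each word rather than the total number of occurrences. The following standard-library sketch shows the difference on a made-up three-document corpus (the variable names and texts are illustrative assumptions, not part of the assignment data).

```python
from collections import Counter

toy_documents = [
    "fruit fruit salad",
    "fruit bowl",
    "car engine",
]

# Total occurrences: every repetition of a word is counted.
term_counts = Counter(word for doc in toy_documents for word in doc.split())

# Binary presence: each document contributes at most 1 per word,
# so the total is the number of documents containing the word.
document_frequency = Counter(
    word for doc in toy_documents for word in set(doc.split())
)

print(term_counts["fruit"])         # 3 occurrences in total
print(document_frequency["fruit"])  # present in 2 documents
```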


In [14]:
def find_top_similarity(query, top_n=5):
  vectorizer = TfidfVectorizer(stop_words=stop_words)
  vectors = vectorizer.fit_transform([query] + documents)

  # Calculate similarity scores
  # Compare the query vector (vectors[0:1]) with every vector in vectors.
  # linear_kernel computes dot products; TfidfVectorizer L2-normalises rows by default,
  # so each dot product equals the cosine similarity.
  cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()

  # Select the top_n documents. The last position in the sorted order is the query
  # itself (similarity 1.0), so the slice starts at -2.
  top_indices = cosine_similarities.argsort()[-2:-top_n-2:-1]
  print("Query:", query)
  print(f"Top {top_n} similar documents:")
  for idx in top_indices:
      print(f"Document {idx-1}: \t{documents[idx-1][:80]}, \tsimilarity score {round(cosine_similarities[idx], 2)}")
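The index arithmetic in this function is easy to get wrong, so here is a small standard-library sketch of the same idea. It assumes, as in the cell above, that row 0 of the similarity vector is the query compared with itself and that rows 1 onwards are the documents; the function name and the scores are made up for illustration.

```python
def top_document_indices(similarities, top_n=5):
    """similarities[0] is the query against itself; the rest are documents."""
    # Row numbers in ascending order of similarity, like numpy's argsort.
    ascending = sorted(range(len(similarities)), key=lambda i: similarities[i])
    # Position -1 is assumed to be the query itself (self-similarity 1.0),
    # so start at -2 and walk backwards for top_n items.
    selected = ascending[-2:-top_n - 2:-1]
    # Shift by one to turn a row number into an index into the documents list.
    return [row - 1 for row in selected]

scores = [1.0, 0.10, 0.40, 0.0, 0.25]  # made-up values; row 0 is the query
print(top_document_indices(scores, top_n=3))
```

If fewer than `top_n` documents exist, the slice simply returns all of them, best match first.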


In [15]:
queries = ['fruits', 'vegetables', 'healthy foods in Canada']
for query in queries:
  find_top_similarity(query)
  print()

Query: fruits
Top 5 similar documents:
Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.16
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.07
Document 14: 	Berry Size: 18mm and above Packaging Packing Size: 4.5 kg loose in carry bags 8., 	similarity score 0.0
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.0
Document 1: 	Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bi, 	similarity score 0.0

Query: vegetables
Top 5 similar documents:
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.08
Document 14: 	Berry Size: 18mm and above Packaging Packing Size: 4.5 kg loose in carry bags 8., 	similarity score 0.0
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegran

In [16]:
documents[10]

"Canada's Food Guide is a nutrition guide produced by Health Canada to promote Healthy behaviours and habits, and lifestyles in Canada - this is to increase the number of healthy people in Canada. In 2007, it was reported to be the second most requested Canadian government publication, behind the Income Tax Forms. The Health Canada website states: Food guides are basic education tools that are designed to help people follow a healthy diet. The Guide recommends eating a variety of healthy foods each day including plenty of vegetables and fruits, protein foods, and whole grain foods. It recommends choosing protein foods that come from plants more often. It also recommends limiting highly processed foods."

`documents[10]` appears in the top 5 for all three queries.

## Repeat the same task after some preprocessing

Use a minimum of 2 different text cleaning/standardization techniques (e.g. lemmatization, removing punctuation, etc).

In [17]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [18]:
# e.g. you can use a lemmatizer to reduce words down to their
# simplest 'lemma' (helpful when dealing with plurals)

from nltk.corpus import wordnet
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

In [19]:
def get_wordnet_pos(treebank_tag):
    """
    Convert Treebank POS tags to WordNet POS tags.
    The Penn Treebank tagset uses different tags (e.g., NN for noun)
    than WordNet (e.g., n for noun).
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun if POS tag is not found


In [20]:
def lemmatize_document(document):
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(document)

    # Remove punctuation and convert to lowercase
    words = [word.lower() for word in words if word.isalnum()]

    pos_tags = nltk.pos_tag(words)
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    return ' '.join(lemmatized_words)

In [21]:
# preprocessing
lemmatized_documents = [lemmatize_document(doc) for doc in documents]

In [22]:
lemmatized_documents[0][:80]

'fresh pomegranate from anushka avni international bhagwa be a premium pomegranat'

### Word Frequency

In [23]:
# binary=True records only whether a word occurs in a document (0 or 1), not how many times
cv = CountVectorizer(binary=True)
X = cv.fit_transform(lemmatized_documents)

In [24]:
cv.vocabulary_

{'fresh': 238,
 'pomegranate': 425,
 'from': 239,
 'anushka': 56,
 'avni': 73,
 'international': 287,
 'bhagwa': 95,
 'be': 84,
 'premium': 430,
 'variety': 563,
 'india': 276,
 'the': 534,
 'deep': 165,
 'red': 459,
 'arils': 65,
 'pleasing': 420,
 'but': 113,
 'rugged': 472,
 'skin': 497,
 'enhance': 202,
 'appearance': 57,
 'whilst': 579,
 'promote': 440,
 'shelf': 490,
 'life': 310,
 'of': 375,
 'fruit': 240,
 'widely': 584,
 'know': 302,
 'for': 230,
 'it': 289,
 'soft': 499,
 'seed': 479,
 'dark': 160,
 'color': 137,
 'and': 54,
 'extremely': 213,
 'delicious': 167,
 'package': 398,
 'net': 360,
 'weight': 573,
 'box': 109,
 'detail': 175,
 'minimum': 348,
 '180gm': 9,
 'maximum': 335,
 '400gm': 28,
 'aril': 64,
 'cherry': 130,
 'taste': 529,
 'sweet': 525,
 'count': 150,
 'carton': 125,
 'kg': 299,
 'wt': 589,
 'number': 367,
 'pack': 397,
 'per': 410,
 'arakta': 63,
 'this': 539,
 'big': 96,
 'in': 272,
 'size': 496,
 'with': 587,
 'bold': 103,
 'also': 51,
 'possess': 427,
 'g

In [25]:
# show the occurrence of vocabulary words in each document, with each row corresponding to one lemmatized document
print(X.toarray())

[[0 0 0 ... 0 1 0]
 [1 0 1 ... 0 1 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [26]:
word_freq = np.sum(X.toarray(), axis=0)
word_freq_dict = dict(zip(cv.get_feature_names_out(), word_freq))
print("\nWord Frequency:\n", word_freq_dict)


Word Frequency:
 {'10': 1, '10kg': 3, '12': 1, '15': 1, '1540s': 1, '15kg': 3, '16mm': 1, '1753': 1, '17kg': 3, '180gm': 2, '18kg': 3, '18mm': 2, '1966': 1, '1970s': 1, '1996': 1, '20': 1, '2007': 1, '2021': 1, '2040': 2, '20ft': 3, '20kg': 3, '220': 1, '2400': 2, '25kg': 3, '28': 3, '30kg': 3, '3400': 2, '3kg': 3, '400gm': 2, '40ft': 3, '4400': 1, '50': 3, '500': 2, '5500': 1, '5kg': 3, '60': 3, '9kg': 3, 'aai': 6, 'ability': 6, 'about': 2, 'above': 6, 'absorption': 1, 'academic': 1, 'acclaim': 1, 'accord': 1, 'account': 1, 'across': 1, 'actively': 1, 'agro': 6, 'all': 6, 'along': 1, 'also': 5, 'an': 5, 'analysis': 1, 'and': 24, 'animal': 1, 'anushka': 12, 'appearance': 1, 'apple': 1, 'approach': 1, 'apricots': 1, 'april': 3, 'arabia': 1, 'arakta': 1, 'aril': 2, 'arils': 1, 'assimilation': 1, 'assortment': 6, 'attach': 1, 'attractive': 1, 'august': 1, 'availability': 4, 'available': 7, 'avni': 12, 'aware': 1, 'background': 1, 'badlands': 1, 'bag': 5, 'bahrain': 1, 'bandler': 1, 'bang

In [27]:
print(len(word_freq_dict))

591


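Without preprocessing, `word_freq_dict` has 669 entries. After preprocessing it has 591, largely because lemmatization merges inflected forms into a single entry (for example, 'enhances' becomes 'enhance' and 'details' becomes 'detail').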

In [28]:
def find_top_similarity_preprocess(query, top_n=5, stop_words=None):

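    # Lemmatize the query the same way as the documents.
    # Note: stop_words defaults to None, so no stop words are removed unless a list is passed in.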
    lemmatized_query = lemmatize_document(query)

    vectorizer = TfidfVectorizer(stop_words=stop_words)
    vectors = vectorizer.fit_transform([lemmatized_query] + lemmatized_documents)

    # Calculate similarity scores
    cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()

    # Output the similarity scores for the top N documents
    top_indices = cosine_similarities.argsort()[-2:-top_n-2:-1]
    print("Query:", query)
    print(f"Top {top_n} similar documents:")
    for idx in top_indices:
        print(f"Document {idx-1}: \t{documents[idx-1][:80]}, \tsimilarity score {round(cosine_similarities[idx], 2)}")


In [29]:
for query in queries:
  find_top_similarity_preprocess(query)
  print()

Query: fruits
Top 5 similar documents:
Document 29: 	Fruit dishes are those that use fruit as a primary ingredient. Condiments prepar, 	similarity score 0.44
Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.23
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.15
Document 31: 	A fruit serving bowl is a round dish or container typically used to prepare and , 	similarity score 0.08
Document 1: 	Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bi, 	similarity score 0.05

Query: vegetables
Top 5 similar documents:
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.07
Document 14: 	Berry Size: 18mm and above Packaging Packing Size: 4.5 kg loose in carry bags 8., 	similarity score 0.0
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomeg

What impact did the text cleaning / preprocessing have on your results?

After preprocessing, the scores change for some queries. For the query 'fruits', the top document without preprocessing was Document 6 (score 0.16), whereas after preprocessing Document 29 ranks first (score 0.44). This is most likely because lemmatization maps 'fruits' to 'fruit', so the query now also matches documents that only use the singular form. For the query 'vegetables', the ranking is unchanged and the scores are nearly identical (0.08 versus 0.07 for the top document). For the query 'healthy foods in Canada', the top 3 documents are the same for both approaches, but with preprocessing Document 31 and Document 28 are also picked up. Hence, for some queries, relevant documents are missed if no preprocessing takes place.
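The effect on the 'fruits' query can be reproduced in miniature. The sketch below is an illustration under stated assumptions: `TOY_LEMMAS` is a tiny hand-written lookup table standing in for a real lemmatizer, and `normalise` and `shared_terms` are hypothetical helpers, not functions from the notebook.

```python
# Hypothetical stand-in for a real lemmatizer: a tiny hand-written lookup table.
TOY_LEMMAS = {"fruits": "fruit", "dishes": "dish", "are": "be", "is": "be"}

def normalise(text):
    """Lowercase, keep alphanumeric tokens only, and map words to their toy lemma."""
    tokens = [token.lower() for token in text.split() if token.isalnum()]
    return [TOY_LEMMAS.get(token, token) for token in tokens]

def shared_terms(query, document, preprocess=None):
    """Return the terms that the query and the document have in common."""
    split = preprocess or (lambda text: text.lower().split())
    return set(split(query)) & set(split(document))

document = "Fruit dishes are those that use fruit as a primary ingredient"
print(shared_terms("fruits", document))             # no overlap on raw tokens
print(shared_terms("fruits", document, normalise))  # overlap after normalisation
```

Without normalisation the query and the document share no token, so a TF-IDF score would be zero; after mapping 'fruits' to 'fruit' they overlap and the document can be ranked.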

# Experiment 2: Semantic matching using GloVe embeddings
***

GloVe (Global Vectors) is a model for distributed word representation developed at Stanford and launched in 2014. It is an unsupervised learning algorithm that creates vector representations for words by mapping them into a space where distances reflect semantic similarity. GloVe is trained on global word-word co-occurrence statistics from a corpus, resulting in word vectors that exhibit interesting linear substructures. It combines features of global matrix factorization and local context window methods.
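As a rough illustration of "distances reflect semantic similarity", the sketch below compares word vectors with cosine similarity. The three-dimensional vectors are made up for this example; real GloVe vectors are learned from co-occurrence statistics and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors for illustration only (not real GloVe values).
toy_vectors = {
    "fruit":     [0.9, 0.8, 0.1],
    "vegetable": [0.8, 0.9, 0.2],
    "truck":     [0.1, 0.2, 0.9],
}

print(round(cosine(toy_vectors["fruit"], toy_vectors["vegetable"]), 2))
print(round(cosine(toy_vectors["fruit"], toy_vectors["truck"]), 2))
```

Words whose vectors point in similar directions get a cosine close to 1, which is what allows 'fruit' and 'vegetable' to be treated as related even though they share no characters.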

In [30]:
# if you decide to use the gensim library and the sample codes below,
# you would need gensim version >=4.0.1 to be installed

import gensim
print(gensim.__version__)

4.3.2


In [31]:
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [32]:
# optional, but it helps
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [33]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [34]:
import re

def preprocess(doc):
    # Convert to lower case
    doc = doc.lower()
    # Replace <img> tags with a placeholder token
    doc = re.sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # Remove special characters (anything other than letters, digits and whitespace)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc)

    # Tokenize and remove stopwords
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [35]:
# Load test data
with open(filePath) as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]

In [36]:
queries

['fruits', 'vegetables', 'healthy foods in Canada']

In [37]:
# Preprocess the documents (each query is preprocessed later, inside glove_find_top_similarity)
corpus = [preprocess(document) for document in documents]

In [38]:
# Download and load the GloVe word vector embeddings
if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)



In [39]:
def glove_find_top_similarity(query, top_n=5):
  query = preprocess(query)

  # Build the term dictionary and the TF-IDF model
  dictionary = Dictionary(corpus+[query])
  tfidf = TfidfModel(dictionary=dictionary)

  # Create the term similarity matrix.
  similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

  # Compute similarity measure between the query and the documents.
  query_tf = tfidf[dictionary.doc2bow(query)]

  index = SoftCosineSimilarity(
              tfidf[[dictionary.doc2bow(document) for document in corpus]],
              similarity_matrix)

  doc_similarity_scores = index[query_tf]
  print()

  # sort documents by similarity scores and keep the top_n best matches
  sorted_indices = np.argsort(doc_similarity_scores)[::-1]  # Sort indices in descending order
  top_n_docs = [(documents[idx], idx, doc_similarity_scores[idx]) for idx in sorted_indices[:top_n]]

  print()
  for doc, idx, score in top_n_docs:
    print(f"Document {idx}: \t{documents[idx][:80]}, \tsimilarity score {score:.2f}")
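The function above relies on soft cosine similarity, which differs from ordinary cosine similarity by letting related terms contribute through a term-similarity matrix. Below is a minimal dense sketch of the standard soft cosine formula, s(x, y) = xᵀSy / sqrt(xᵀSx · yᵀSy). It is not gensim's implementation (which works on sparse matrices and may handle normalisation differently); the matrix `S` and its 0.8 entry are made-up values.

```python
import math

def soft_inner(x, y, S):
    """Compute x^T S y for dense lists x, y and a square term-similarity matrix S."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft cosine similarity between bag-of-words vectors x and y."""
    return soft_inner(x, y, S) / math.sqrt(soft_inner(x, x, S) * soft_inner(y, y, S))

# Toy vocabulary: ["fruit", "apple", "truck"]. The off-diagonal 0.8 is a made-up
# similarity between "fruit" and "apple"; a real matrix would come from embeddings.
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

query = [1.0, 0.0, 0.0]     # contains only "fruit"
document = [0.0, 1.0, 0.0]  # contains only "apple"

print(soft_cosine(query, document, identity))  # ordinary cosine: no shared term
print(soft_cosine(query, document, S))         # related terms now contribute
```

With the identity matrix the measure reduces to ordinary cosine similarity and the score is zero; with a term-similarity matrix, a document can score well without sharing any exact word with the query.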

In [40]:
for query in queries:
  print(f"Query: {query}")
  glove_find_top_similarity(query)
  print()

Query: fruits


100%|██████████| 569/569 [00:18<00:00, 31.24it/s]




Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.88
Document 31: 	A fruit serving bowl is a round dish or container typically used to prepare and , 	similarity score 0.84
Document 29: 	Fruit dishes are those that use fruit as a primary ingredient. Condiments prepar, 	similarity score 0.84
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.79
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.76

Query: vegetables


100%|██████████| 569/569 [00:16<00:00, 34.29it/s]




Document 6: 	To a botanist, a fruit is an entity that develops from the fertilized ovary of a, 	similarity score 0.90
Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.81
Document 29: 	Fruit dishes are those that use fruit as a primary ingredient. Condiments prepar, 	similarity score 0.76
Document 17: 	We are one of the leading organizations engaged in delivering our customers with, 	similarity score 0.75
Document 4: 	White Onions from Anushka Avni International Fresh White Onion, which is widely , 	similarity score 0.71

Query: healthy foods in Canada


100%|██████████| 569/569 [00:14<00:00, 38.89it/s]




Document 10: 	Canada's Food Guide is a nutrition guide produced by Health Canada to promote He, 	similarity score 0.93
Document 9: 	In nutrition, the diet of an organism is the sum of foods it eats, which is larg, 	similarity score 0.68
Document 31: 	A fruit serving bowl is a round dish or container typically used to prepare and , 	similarity score 0.59
Document 16: 	Anushka Avni International (AAI) takes pleasure in presenting itself as one of t, 	similarity score 0.59
Document 0: 	Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranat, 	similarity score 0.53



With GloVe embeddings, the similarity scores are consistently higher for all of the queries. Soft cosine similarity gives credit to terms that are semantically related, not only to exact matches, which suggests that this approach is able to discern graded similarity between a query and the documents.

For the query 'fruits', the top documents are similar to those found at the end of Experiment 1. The order differs slightly, and all five show a high degree of similarity (between 0.76 and 0.88). At the end of Experiment 1, by contrast, the similarity scores ranged from 0.05 to 0.44.

A similar pattern is observed for the queries 'vegetables' and 'healthy foods in Canada'. In conclusion, GloVe embeddings may be better at detecting subtle differences in how similar different documents are to a query.

# Experiment 3: BERT Model
***
Use a BERT model to obtain sentence embeddings and calculate the similarity between queries and documents.
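One common way to turn per-token BERT outputs into a single sentence embedding is mean pooling over the real (non-padding) tokens, followed by cosine similarity between the resulting vectors. The sketch below shows only that pooling and comparison step with made-up two-dimensional "token embeddings"; it does not call a BERT model, and using the model's pooled output instead is an equally valid choice. The helper names are hypothetical.

```python
import math

def mean_pool(token_vectors, mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    return [sum(values) / len(kept) for values in zip(*kept)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up token embeddings; a real encoder returns one vector per token.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]  # the last position is padding and must be ignored

query_embedding = mean_pool(query_tokens, query_mask)
doc_a_embedding = mean_pool([[2.0, 2.0]], [1])
doc_b_embedding = mean_pool([[1.0, -1.0]], [1])

print(query_embedding)
print(cosine(query_embedding, doc_a_embedding))
print(cosine(query_embedding, doc_b_embedding))
```

Documents can then be ranked by their cosine similarity to the query embedding, exactly as in the earlier experiments.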

> Hint: see the Module 07 jupyter notebook for examples of how to work with BERT.

In [49]:
!pip uninstall tensorflow

Found existing installation: tensorflow 2.16.2
Uninstalling tensorflow-2.16.2:
  Would remove:
    /usr/local/bin/import_pb_to_tensorboard
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python3.10/dist-packages/tensorflow-2.16.2.dist-info/*
    /usr/local/lib/python3.10/dist-packages/tensorflow/*
Proceed (Y/n)? y
  Successfully uninstalled tensorflow-2.16.2


In [50]:
!pip install tensorflow==2.15.1

Collecting tensorflow==2.15.1
  Downloading tensorflow-2.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 475.2/475.2 MB 1.5 MB/s eta 0:00:00
Collecting tensorboard<2.16,>=2.15 (from tensorflow==2.15.1)
  Downloading tensorboard-2.15.2-py3-none-any.whl (5.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 52.5 MB/s eta 0:00:00
Collecting keras<2.16,>=2.15.0 (from tensorflow==2.15.1)
  Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 62.4 MB/s eta 0:00:00
Installing collected packages: keras, tensorboard, tensorflow
  Attempting uninstall: keras
    Found existing installation: keras 3.4.1
    Uninstalling keras-3.4.1:
      Successfully uninstalled keras-3.4.1
  Attempting uninstall: tensorboard
    Found existing installation:

In [4]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.15.1


In [5]:
import tensorflow_hub as hub

In [6]:
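# Choose the BERT model to load from TensorFlow Hub.
# Here we use a Small BERT (uncased; L=4 layers, H=512 hidden units, A=8 attention heads),
# which has the same general architecture as BERT-Base but fewer and smaller Transformer blocks.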
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3


In [10]:
# tensorflow_text must match the TensorFlow version loaded in the runtime
!pip install -q tensorflow_text==2.11.0


In [11]:
import tensorflow_text

ImportError: /usr/local/lib/python3.10/dist-packages/tensorflow_text/core/pybinds/tflite_registrar.so: undefined symbol: _ZN4absl12lts_2022062320raw_logging_internal21internal_log_functionB5cxx11E

In [45]:
# Load the preprocessing model into a hub.KerasLayer so it can be composed with the small BERT encoder.
# The input is truncated to 128 tokens (the number of tokens can be customized).
# Requires a successful `import tensorflow_text`, which registers ops such as CaseFoldUTF8.
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

RuntimeError: Op type not registered 'CaseFoldUTF8' in binary running on ab4d514c0e0a. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib (e.g. `tf.contrib.resampler`), accessing should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

In [None]:
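# Load the encoder once so it is not re-created on every call
bert_model = hub.KerasLayer(tfhub_handle_encoder)

# Function to compute BERT embeddings
def compute_bert_embeddings(texts):
    preprocessed_text = bert_preprocess_model(texts)
    # "pooled_output" is one fixed-size vector per input text
    return bert_model(preprocessed_text)["pooled_output"]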

In [None]:
def bert_find_top_similarity(query, top_n=5):
    # Compute embeddings for query and documents
    query_embedding = compute_bert_embeddings([query])
    document_embeddings = [compute_bert_embeddings([doc]) for doc in documents]

    # Calculate cosine similarity between query and documents
    similarity_scores = []
    for doc_emb in document_embeddings:
        dot_product = tf.reduce_sum(query_embedding * doc_emb)
        query_norm = tf.linalg.norm(query_embedding)
        doc_norm = tf.linalg.norm(doc_emb)
        cosine_sim = dot_product / (query_norm * doc_norm)
        similarity_scores.append(cosine_sim.numpy())

    # Create a list of tuples (index, score) for sorting
    scored_documents = list(enumerate(similarity_scores))

    # Sort documents by similarity score in descending order
    scored_documents.sort(key=lambda x: x[1], reverse=True)

    # Print top_n documents with indices and similarity scores
    print(f"Query: {query}\n")
    for idx, score in scored_documents[:top_n]:
        print(f"Document {idx}: \t{documents[idx][:80]}, \tsimilarity score {score:.2f}")


In [None]:
for query in queries:
  bert_find_top_similarity(query)
  print()

# Technique Comparison
***

Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the what, but also the how and why one technique produces different results from another).


Approach 1: TF-IDF

In this approach, I used TF-IDF, which converts each document into a sparse vector with one weight per vocabulary term. Each weight combines the term frequency (TF) within the document with the inverse document frequency (IDF), which down-weights words that are common across all documents. Documents are then compared using the cosine similarity of these vectors. Because the score depends on exact term overlap, this approach ignores word order and cannot recognize synonyms or related words: a query for "fruits" gets no credit from a document that only mentions "apples".
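
The TF-IDF scoring described above can be sketched in plain Python. This is a minimal illustration, not the code used earlier in the notebook: the two toy documents, the whitespace tokenizer, and the helper names `fit_idf`, `tfidf`, and `cosine` are all assumptions made for the example, and the IDF uses one common smoothed form, ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1 (one common variant)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; terms unseen in the corpus are dropped."""
    return {term: tf * idf[term] for term, tf in Counter(tokens).items() if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents (hypothetical, not taken from sample_repository.json)
toy_docs = ["fresh fruits are sold here", "apples and bananas are sweet"]
tokenized = [doc.lower().split() for doc in toy_docs]

idf = fit_idf(tokenized)
doc_vectors = [tfidf(tokens, idf) for tokens in tokenized]
query_vector = tfidf("fruits".split(), idf)

scores = [cosine(query_vector, doc_vector) for doc_vector in doc_vectors]
for doc, score in zip(toy_docs, scores):
    print(f"{score:.2f}  {doc}")
```

The second toy document is about fruit but shares no term with the query, so its score is exactly zero. This is the limitation that the embedding-based approaches are meant to address.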

Approach 2: GloVe Embedding

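In this approach, I used GloVe embeddings, which are learned from word co-occurrence statistics in a large corpus. Words that appear in similar contexts receive similar vectors, so this method captures semantic relationships between words. However, each word has a single fixed vector regardless of the sentence it appears in, so the embeddings do not reflect context. This approach was able to uncover some documents that were not detected using approach 1.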
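
One common way to turn static word vectors into a text-level score is to average the vectors of the words in each text and compare the averages with cosine similarity. The sketch below assumes that pooling strategy, which may differ from what the notebook did. The 3-dimensional `toy_vectors` table is hand-made for illustration and is not real GloVe data; `mean_vector` and `cosine` are hypothetical helper names.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# Real GloVe vectors are loaded from a pretrained file and have many more dimensions.
toy_vectors = {
    "fruits":  [0.90, 0.10, 0.00],
    "apples":  [0.80, 0.20, 0.10],
    "bananas": [0.85, 0.15, 0.00],
    "cars":    [0.00, 0.10, 0.90],
    "engines": [0.10, 0.00, 0.95],
}

def mean_vector(text):
    """Average the vectors of known words; return None if no word is known."""
    known = [toy_vectors[word] for word in text.lower().split() if word in toy_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query_vec = mean_vector("fruits")
food_score = cosine(query_vec, mean_vector("apples bananas"))
car_score = cosine(query_vec, mean_vector("cars engines"))
print(f"apples bananas: {food_score:.2f}")
print(f"cars engines:   {car_score:.2f}")
```

Unlike exact term matching, the query "fruits" scores highly against a text that never contains that word, because related words sit close together in the vector space. The lookup is still one fixed vector per word, so a word with several senses gets the same vector in every sentence.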

Approach 3: BERT Embedding

In this approach, I used BERT embeddings. BERT computes each word's representation from the surrounding words in the sentence or document, so the same word can receive different vectors in different contexts. This makes it better suited to capturing nuanced, context-specific meaning.

Using BERT embeddings, I was able to uncover documents not identified in approaches 1 and 2. This approach seems to be better for document classification. Computation was fastest for approach 1, followed by approach 2, with BERT the slowest, which is expected given its deep architecture.
