<a href="https://colab.research.google.com/github/fernandoespinosa/3546-034/blob/master/Assignment%203%20-%20Contextualized%20Word%20Embeddings/SCS_3546_Assignment_03_Fall2022_Fernando_Espinosa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SCS 3546: Deep Learning
> **Assignment 3: Contextualized Word Embeddings**

### Your name & student number:

<pre> Student Name: Fernando Espinosa </pre>

<pre> Student Number: X566420 </pre>

## **Assignment Description**
***

Search Engines are a standard tool for finding relevant content. The calculation of similarity between textual information is an important factor for better search results.

### **Objectives**

**Your goal in this assignment is to calculate the textual similarity between queries and the provided sample documents, using a variety of NLP approaches.**

In achieving the above goal, you will also:
- Demonstrate how to preprocess text and embed textual data.
- Compare the results of textual similarity scoring between traditional and deep-learning based NLP methods.

### **Data and Queries**

You will use the document repository provided by `sample_repository.json`, which you can download from the following link, or from the assignment description in Quercus: https://q.utoronto.ca/courses/286389/files/21993451/download?download_frd=1

The queries you will run against these sample documents are the following:

- Query 1: “fruits”
- Query 2: “vegetables”
- Query 3: “healthy foods in Canada”

### **Techniques to Demonstrate**

The techniques you will use to compute the similarity scores are:
1. TF-IDF.
2. Semantic similarity using GloVe word vectors.
3. Semantic similarity using a BERT-based model.


### **Feel Free to Choose Your Own Approach**

How you go about demonstrating each of the above techniques is up to you. You are not expected to use any particular library. The code below is just meant to provide you with some guidance to get started. You **do**, however, need to demonstrate obtaining similarity scores **with all 3 techniques above**, but how you go about doing this is totally up to you. The evaluation will be based on your ability to obtain results using all three techniques, plus your discussion/comparison of any differences you observe.



## **Grade Allocation**
***
15 points total

- Experiment 1 (TF-IDF), implementation: 2 marks
- Experiment 2 (GloVe), implementation: 3 marks
- Experiment 3 (BERT), implementation: 3 marks
- Comparison and Discussion: 3 marks
  - Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the _what_, but also the _how_ and _why_ one technique produces different results from another).
- Text Pre-Processing: 2 marks
 - Cleaning and standardization (e.g. lemmatization, stemming) in Experiment 1
 - Basic text cleaning (e.g. removal of special characters or tags) in Experiments 2 and 3.
- Clarity: 2 marks
 - The marks for clarity are awarded for code documentation, clean code (e.g. avoiding repetition by building re-usable functions)  and how well you explained/supported your answers, including the use of visualizations.


# Setup and Data Import
***
You can use the code snippets below to help you load and extract the document repository.


In [1]:
# you can either drop the file manually into your Colab drive, or otherwise
# use this widget to upload it

from google.colab import files
# uploaded = files.upload()

In [2]:
# this will unpack the json file contents into a list of titles and documents
import json

with open('sample_repository.json') as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]


In [3]:
# let's take a look at some of these documents and titles;
# here we print the last five entries
for idx in range(-5, 0):  # `idx` avoids shadowing the built-in `id`
  print(f"Document title: {titles[idx]}")
  print(f"Document contents: {documents[idx]}")
  print("\n") # adds blank lines between entries

Document title: botany
Document contents: Botany, also called plant science(s), plant biology or phytology, is the science of plant life and a branch of biology. A botanist, plant scientist or phytologist is a scientist who specialises in this field. 


Document title: Ford Bronco 
Document contents: The Ford Bronco is a model line of sport utility vehicles manufactured and marketed by Ford. ... The first SUV model developed by the company, five generations of the Bronco were sold from the 1966 to 1996 model years. A sixth generation of the model line is sold from the 2021 model year. the Ford Bronco will be available in Canada, with first deliveries beginning in spring of 2021. The Bronco will come in six versions in Canada: Base, Big Bend, Black Diamond, Outer Banks, Wildtrak and Badlands. 


Document title: List of fruit dishes
Document contents: Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in

# Experiment 1: TF-IDF
***

**T**erm **F**requency - **I**nverse **D**ocument **F**requency (TF-IDF) is a traditional NLP technique that scores each word by how often it appears in a document, discounted by how many documents in the corpus contain it. Words shared by a query and a document then contribute to their similarity in proportion to these weights. For this experiment, you are free to use the TF-IDF implementation provided by scikit-learn.
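
To make the weighting concrete, here is a minimal from-scratch sketch in plain Python. It assumes the smoothed IDF and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the toy documents and the helper names `tfidf_vectors` and `dot` are made up for illustration and are not part of the assignment code.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # smoothed IDF (assumed to mirror scikit-learn's default variant)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors

def dot(u, v):
    """Dot product; for unit-length vectors this is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

# hypothetical, already-tokenized toy corpus
toy_docs = [
    ["fruit", "bowl", "fruit"],
    ["fruit", "dishes"],
    ["ford", "bronco"],
]
vocab, vecs = tfidf_vectors(toy_docs)

print(vocab)
print("doc 0 vs doc 1:", round(dot(vecs[0], vecs[1]), 4))
print("doc 0 vs doc 2:", round(dot(vecs[0], vecs[2]), 4))
print("'fruits' in vocabulary:", "fruits" in vocab)
```

Because the vectors have unit length, the dot product is already the cosine similarity. Note also that the plural "fruits" is simply not in the vocabulary, which is the kind of mismatch that text preprocessing is meant to address.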


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

nltk.download('punkt')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:

print(f'stop_words: {type(stop_words)}')

# TfidfVectorizer() needs a list here; passing the set raises:
# InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be
# a str among {'english'}, an instance of 'list' or None ...

stop_words_list = list(stop_words)
print(f'stop_words_list: {type(stop_words_list)}')


stop_words: <class 'set'>
stop_words_list: <class 'list'>


In [6]:

# Calculate the word frequency, and a measure of similarity of the search terms with each document.
# Do not apply any text pre-processing (i.e. cleanup) yet.


In [7]:

def word_frequency_measure_similarity(vectorizer: TfidfVectorizer):
  """
  Common function that will take different instantiations of `TfidfVectorizer`
  """

  # this will do 2 things:
  #   1. fit the TF-IDF vectorizer to the document corpus, and
  #   2. transform the documents into TF-IDF vectors
  document_vectors = vectorizer.fit_transform(documents)

  # densify the sparse matrix so that individual TF-IDF weights can be indexed
  frequencies = document_vectors.toarray()

  # get feature names to see which words are retained
  terms = vectorizer.get_feature_names_out()

  # print term frequencies for each document
  print("Word Frequencies (TF-IDF values):")
  for i, doc in enumerate(documents):
      print(f"document #{i}: {doc[:30]}...")
      for j, term in enumerate(terms):
          if frequencies[i, j] > 0:  # only show terms with non-zero TF-IDF
              print(f"  {term}: {frequencies[i, j]:.4f}")
      print()

  # return `document_vectors` and vocabulary
  return document_vectors, terms


In [8]:

# now invoke word_frequency_measure_similarity() with a simple `TfidfVectorizer`,
# without any text pre-processing beyond stop-word removal

vectorizer = TfidfVectorizer(stop_words=stop_words_list)

document_vectors, vocabulary = word_frequency_measure_similarity(vectorizer)

Word Frequencies (TF-IDF values):
document #0: Fresh Pomegranate from Anushka...
  00kg: 0.1041
  180gm: 0.1041
  350: 0.1041
  400gm: 0.1041
  400gms: 0.1041
  50: 0.0829
  5kg: 0.1658
  anushka: 0.0592
  appearance: 0.1166
  arils: 0.2083
  avni: 0.0592
  bhagwa: 0.2331
  box: 0.1041
  carton: 0.1770
  cherry: 0.1166
  color: 0.2331
  count: 0.1041
  dark: 0.2083
  deep: 0.1166
  delicious: 0.1166
  details: 0.0885
  enhances: 0.1166
  extremely: 0.1166
  fresh: 0.0741
  fruit: 0.1658
  india: 0.1041
  international: 0.0592
  kg: 0.0829
  known: 0.1166
  life: 0.0953
  maximum: 0.1041
  minimum: 0.1041
  net: 0.2083
  numbers: 0.1041
  packaging: 0.0741
  packed: 0.1041
  per: 0.0741
  pleasing: 0.1166
  pomegranate: 0.2083
  premium: 0.1166
  promoting: 0.1166
  red: 0.2963
  rugged: 0.1166
  seed: 0.1166
  shelf: 0.1166
  skin: 0.0885
  soft: 0.1041
  sweet: 0.0953
  taste: 0.0885
  variety: 0.1041
  weight: 0.3124
  whilst: 0.1166
  widely: 0.1041
  wt: 0.1041

document #1: Fresh 

In [9]:

# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results


In [10]:

def top_N_documents_similarity_scores(fit_vectorizer: TfidfVectorizer, document_vectors, queries: list[str], n: int):
  """
  Print the `n` highest-scoring documents for each query.
  Shared by the different `TfidfVectorizer` configurations.

  :param fit_vectorizer: a TfidfVectorizer, already fit to a document corpus
  :param document_vectors: the transformed document vectors
  :param queries: the query strings to score against the documents
  :param n: how many top-scoring documents to print per query
  """

  for query in queries:
    # the vectorizer is already fit, just transform each query into a TF-IDF vector
    # based on the same vocabulary and weights learned from the documents
    query_vector = fit_vectorizer.transform([query])

    # TfidfVectorizer L2-normalizes its vectors by default, so the plain dot
    # product computed by linear_kernel equals the cosine similarity
    cosine_similarities = linear_kernel(query_vector, document_vectors).flatten()

    # sort the indices of the documents by their similarity scores in descending order
    top_indices = cosine_similarities.argsort()[::-1]

    print(f"query: {query}")
    for index in top_indices[:n]:  # honour `n` instead of a hard-coded 5
        print(f">> score: {cosine_similarities[index]:.4f} for document #{index}: {documents[index][:30]}...")
    print()


In [11]:
queries = ['fruits', 'vegetables', 'healthy foods in Canada']

# now pass simple `vectorizer` and `document_vectors` already computed above
# along with the query terms

top_N_documents_similarity_scores(vectorizer, document_vectors, queries, 5);

query: fruits
>> score: 0.1790 for document #6: To a botanist, a fruit is an e...
>> score: 0.0800 for document #10: Canada's Food Guide is a nutri...
>> score: 0.0000 for document #31: A fruit serving bowl is a roun...
>> score: 0.0000 for document #30: Neuro linguistic programming (...
>> score: 0.0000 for document #1: Fresh Pomegranate Arakta from ...

query: vegetables
>> score: 0.0895 for document #10: Canada's Food Guide is a nutri...
>> score: 0.0000 for document #31: A fruit serving bowl is a roun...
>> score: 0.0000 for document #30: Neuro linguistic programming (...
>> score: 0.0000 for document #1: Fresh Pomegranate Arakta from ...
>> score: 0.0000 for document #2: About Us Anushka Avni Internat...

query: healthy foods in Canada
>> score: 0.6315 for document #10: Canada's Food Guide is a nutri...
>> score: 0.3137 for document #9: In nutrition, the diet of an o...
>> score: 0.1072 for document #12: Canadian Industry Statistics (...
>> score: 0.0877 for document #11: UK, Neth

## Repeat the same task after some preprocessing

Use a minimum of 2 different text cleaning/standardization techniques (e.g. lemmatization, removing punctuation, etc).
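
As a rough illustration of what basic text cleaning can look like, the sketch below lowercases a string, strips HTML-like tags and punctuation, and collapses whitespace using only the standard library. The function name `basic_clean` and the sample string are hypothetical, and this is only one of many reasonable cleaning pipelines.

```python
import re

def basic_clean(text):
    """Lowercase, drop HTML-like tags and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)   # strip tags such as <b> or <img ...>
    text = re.sub(r"[^\w\s]", " ", text)    # replace punctuation with spaces
    return " ".join(text.split())           # collapse runs of whitespace

# hypothetical sample, loosely modelled on the product descriptions
sample = "Fresh <b>Pomegranate</b>, 5kg / carton!"
print(basic_clean(sample))
```

Cleaning of this kind only standardizes surface form. It does not merge "fruits" with "fruit", which is what lemmatization or stemming adds.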

In [12]:
# e.g. you can use a lemmatizer to reduce words down to their
# simplest 'lemma' (helpful when dealing with plurals)

from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [13]:

def lemmatized_preprocessor(text):
  """
  Custom preprocessor for `TfidfVectorizer`: lowercases and tokenizes the text,
  drops stop words and non-alphanumeric tokens, and lemmatizes what remains
  """

  # initialize the lemmatizer
  lemmatizer = WordNetLemmatizer()
  # tokenize text
  tokens = nltk.word_tokenize(text.lower())
  # lemmatize each token
  lemmatized_tokens = [
      lemmatizer.lemmatize(token) for token in tokens if token.isalnum() and token not in stop_words
  ]
  return " ".join(lemmatized_tokens)


In [14]:

# initialize the vectorizer with the custom `lemmatized_preprocessor`
lemmatized_vectorizer = TfidfVectorizer(
    stop_words=stop_words_list,
    preprocessor=lemmatized_preprocessor)

# invoke word_frequency_measure_similarity() with `lemmatized_vectorizer`
lemmatized_document_vectors, lemmatized_vocabulary = word_frequency_measure_similarity(lemmatized_vectorizer);




Word Frequencies (TF-IDF values):
document #0: Fresh Pomegranate from Anushka...
  180gm: 0.1085
  400gm: 0.1085
  anushka: 0.0617
  appearance: 0.1214
  aril: 0.2169
  avni: 0.0617
  bhagwa: 0.2428
  box: 0.1085
  carton: 0.1727
  cherry: 0.1214
  color: 0.2428
  count: 0.1085
  dark: 0.2169
  deep: 0.1214
  delicious: 0.1214
  detail: 0.0922
  enhances: 0.1214
  extremely: 0.1214
  fresh: 0.0772
  fruit: 0.1628
  india: 0.1085
  international: 0.0617
  kg: 0.0863
  known: 0.1214
  life: 0.0993
  maximum: 0.1085
  minimum: 0.1085
  net: 0.2169
  number: 0.0993
  packaging: 0.0772
  packed: 0.1085
  per: 0.0772
  pleasing: 0.1214
  pomegranate: 0.2169
  premium: 0.1214
  promoting: 0.1214
  red: 0.3086
  rugged: 0.1214
  seed: 0.1085
  shelf: 0.1214
  skin: 0.0922
  soft: 0.1085
  sweet: 0.1085
  taste: 0.0922
  variety: 0.0993
  weight: 0.3254
  whilst: 0.1214
  widely: 0.1085
  wt: 0.1085

document #1: Fresh Pomegranate Arakta from ...
  10: 0.0827
  12: 0.0827
  15: 0.0827
  180gm: 

In [15]:

# pass `lemmatized_vectorizer` and `lemmatized_document_vectors` already computed above
top_N_documents_similarity_scores(
    lemmatized_vectorizer,
    lemmatized_document_vectors, queries, 5)


query: fruits
>> score: 0.4722 for document #29: Fruit dishes are those that us...
>> score: 0.2662 for document #6: To a botanist, a fruit is an e...
>> score: 0.1628 for document #0: Fresh Pomegranate from Anushka...
>> score: 0.0970 for document #31: A fruit serving bowl is a roun...
>> score: 0.0569 for document #10: Canada's Food Guide is a nutri...

query: vegetables
>> score: 0.0849 for document #10: Canada's Food Guide is a nutri...
>> score: 0.0000 for document #31: A fruit serving bowl is a roun...
>> score: 0.0000 for document #30: Neuro linguistic programming (...
>> score: 0.0000 for document #1: Fresh Pomegranate Arakta from ...
>> score: 0.0000 for document #2: About Us Anushka Avni Internat...

query: healthy foods in Canada
>> score: 0.6552 for document #10: Canada's Food Guide is a nutri...
>> score: 0.2770 for document #9: In nutrition, the diet of an o...
>> score: 0.1291 for document #31: A fruit serving bowl is a roun...
>> score: 0.1102 for document #12: Canadian

## What impact did the text cleaning / preprocessing have on your results?


### Response:

1. Stop words ("the", "is", "and", etc.) were removed in both runs, which keeps non-informative terms from adding noise to the TF-IDF matrix.
2. With stop words gone, meaningful terms receive relatively higher TF-IDF weights and have more influence on the similarity calculations.
3. Lemmatization groups different forms of a word (e.g. "fruits" and "fruit") under a single term, so queries generalize better.
4. Vocabulary size:
   - Before preprocessing: 616 terms
   - After preprocessing: 565 terms
5. Similarity scores (for query "fruits"):
   - Without preprocessing: only two documents scored above zero (top score 0.1790), because "fruits" does not match "fruit".
   - With preprocessing: all of the top five documents scored above zero (top score 0.4722), and document #29 ("Fruit dishes...") moved from no match to first place.
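
The effect of lemmatization on both vocabulary size and query matching can be sketched with a tiny lookup table. The dictionary `LEMMAS` below is a hypothetical stand-in for `WordNetLemmatizer`; its entries mirror mappings visible in the TF-IDF listings above (e.g. "arils" becoming "aril"), and the token lists are made up for illustration.

```python
# Hypothetical lookup standing in for WordNetLemmatizer (illustration only)
LEMMAS = {"fruits": "fruit", "arils": "aril", "details": "detail", "numbers": "number"}

def normalize(tokens):
    """Map each token to its lemma when one is known, else keep it as-is."""
    return [LEMMAS.get(token, token) for token in tokens]

doc_tokens = ["fruit", "arils", "aril", "details", "numbers"]
query_tokens = ["fruits"]

raw_vocab = set(doc_tokens)
lemma_vocab = set(normalize(doc_tokens))

# exact-match overlap between query and document, before and after
raw_overlap = set(query_tokens) & raw_vocab
lemma_overlap = set(normalize(query_tokens)) & lemma_vocab

print(len(raw_vocab), "->", len(lemma_vocab), "terms")
print("overlap before:", raw_overlap, "after:", lemma_overlap)
```

The vocabulary shrinks because inflected forms collapse into one entry, and the query gains a match it did not have before. Both effects are consistent with the drop from 616 to 565 terms and the higher "fruits" scores reported above.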

In [16]:
print("len(vocabulary): ", len(vocabulary))

print("len(lemmatized_vocabulary): ", len(lemmatized_vocabulary))

len(vocabulary):  616
len(lemmatized_vocabulary):  565


# Experiment 2: Semantic matching using GloVe embeddings
***

In [17]:
# if you decide to use the gensim library and the sample codes below,
# you would need gensim version >=4.0.1 to be installed
# !pip install  gensim==4.0.1
import gensim
print(gensim.__version__)

4.3.3


In [18]:
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [19]:
# optional, but it helps
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [20]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # you may decide to add additional steps here
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [22]:
# Load test data
with open('sample_repository.json') as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]

In [23]:
# Preprocess the documents. Each query is preprocessed later, inside the
# query loop, so that several queries can share this corpus.
corpus = [preprocess(document) for document in documents]

In [24]:
# Download and load the GloVe word vector embeddings
if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)

In [25]:
def dictionary_tfidf_similarity_matrix(query):
  # Build the term dictionary and the TF-IDF model.
  # The query is added to the dictionary as well, so that its terms are known
  # even when they do not occur in any document.
  dictionary = Dictionary(corpus+[query])
  tfidf = TfidfModel(dictionary=dictionary)

  # Create the term similarity matrix.
  # The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column.
  # The call below keeps the default limit of 100; the commented-out argument
  # is the alternative of lifting that cap.
  similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)

  return dictionary, tfidf, similarity_matrix

In [26]:
# Compute similarity measure between the query and the documents.

def compute_similarity_query_documents(
    dictionary: Dictionary,
    tfidf : TfidfModel,
    similarity_matrix : SparseTermSimilarityMatrix,
    query):

  query_tf = tfidf[dictionary.doc2bow(query)]

  index = SoftCosineSimilarity(
              tfidf[[dictionary.doc2bow(document) for document in corpus]],
              similarity_matrix)

  doc_similarity_scores = index[query_tf]
  return doc_similarity_scores

In [27]:

# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results


In [28]:
query_s = ['fruits', 'vegetables', 'healthy foods in Canada']

for query in [preprocess(q) for q in query_s]:
  dictionary, tfidf, similarity_matrix = dictionary_tfidf_similarity_matrix(query)
  doc_similarity_scores = compute_similarity_query_documents(
      dictionary,
      tfidf,
      similarity_matrix,
      query)

  top_indices = doc_similarity_scores.argsort()[::-1]

  print()
  print(f"query: {query}")
  for index in top_indices[:5]:
    print(f">> score: {doc_similarity_scores[index]:.4f} for document #{index}: {documents[index][:30]}...")
  print()


100%|██████████| 568/568 [00:15<00:00, 35.88it/s]



query: ['fruits']
>> score: 0.8839 for document #6: To a botanist, a fruit is an e...
>> score: 0.8437 for document #31: A fruit serving bowl is a roun...
>> score: 0.8437 for document #29: Fruit dishes are those that us...
>> score: 0.8092 for document #0: Fresh Pomegranate from Anushka...
>> score: 0.7613 for document #10: Canada's Food Guide is a nutri...



100%|██████████| 568/568 [00:14<00:00, 39.03it/s]



query: ['vegetables']
>> score: 0.8961 for document #6: To a botanist, a fruit is an e...
>> score: 0.8104 for document #10: Canada's Food Guide is a nutri...
>> score: 0.7603 for document #29: Fruit dishes are those that us...
>> score: 0.7505 for document #17: We are one of the leading orga...
>> score: 0.7265 for document #0: Fresh Pomegranate from Anushka...



100%|██████████| 568/568 [00:16<00:00, 34.86it/s]


query: ['healthy', 'foods', 'canada']
>> score: 0.9388 for document #10: Canada's Food Guide is a nutri...
>> score: 0.6835 for document #9: In nutrition, the diet of an o...
>> score: 0.5887 for document #31: A fruit serving bowl is a roun...
>> score: 0.5879 for document #16: Anushka Avni International (AA...
>> score: 0.5269 for document #0: Fresh Pomegranate from Anushka...






## Interpretation of GloVe results:

1. GloVe matches documents with varying similarity scores based on semantic closeness, even when they share no exact term with the query.

2. TF-IDF separates one document more sharply from the rest, but it does not recognize semantic relationships such as the one between "fruit" and "apples".

3. TF-IDF is more efficient to compute, and suits domain-specific term matching or traditional search engines with exact keyword-based queries.

4. GloVe therefore appears better suited to small datasets where capturing semantic relationships is important, such as semantic search applications.
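
The difference between exact matching and embedding-aware matching can be sketched with the soft cosine measure, which weights every pair of terms by a term-similarity matrix `S`. The code below is a plain-Python illustration of that formula, not gensim's implementation; the three-term vocabulary and the similarity value 0.6 are invented for the example.

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine: (a^T S b) / sqrt((a^T S a) * (b^T S b))."""
    def bilinear(u, v):
        return sum(u[i] * S[i][j] * v[j]
                   for i in range(len(u)) for j in range(len(v)))
    denom = math.sqrt(bilinear(a, a) * bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0

terms = ["fruits", "apple", "bronco"]

# made-up term similarities: "fruits" and "apple" are related, "bronco" is not
S_embedding = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# the identity matrix reduces soft cosine to ordinary cosine similarity
S_identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

query = [1.0, 0.0, 0.0]        # contains only "fruits"
doc_apple = [0.0, 1.0, 0.0]    # contains only "apple"
doc_bronco = [0.0, 0.0, 1.0]   # contains only "bronco"

print("exact match, apple :", soft_cosine(query, doc_apple, S_identity))
print("soft match, apple  :", soft_cosine(query, doc_apple, S_embedding))
print("soft match, bronco :", soft_cosine(query, doc_bronco, S_embedding))
```

With the identity matrix the query and the "apple" document share nothing. With a term-similarity matrix derived from word embeddings they receive partial credit, which is why documents without the literal query term can still rank highly in this experiment.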

# Experiment 3: BERT Model
***
Use a BERT model to obtain sentence embeddings and calculate the similarity between queries and documents.
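
BERT produces one vector per token, so a sentence embedding needs a pooling step. A common choice is mean pooling over the non-padding tokens, followed by cosine similarity between the pooled vectors. The sketch below illustrates that idea in plain Python; the helper names `mean_pool` and `cosine` and the toy vectors are made up, and this is not the internals of any particular library.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# toy 2-dimensional "token embeddings"; the last token is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
print(cosine(sentence_vector, [2.0, 3.0]))
```

Because every token vector already depends on its neighbours, the pooled vector reflects the whole sentence rather than a bag of independent words.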

> Hint: see the Module 07 jupyter notebook for examples of how to work with BERT.

In [29]:
!pip install sentence-transformers



In [30]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")



In [31]:
# for each query, output the similarity scores for the top 5 documents with
# the highest score, and interpret your results

In [32]:
# compute embeddings for queries and documents
query_embeddings = model.encode(queries, convert_to_tensor=True)
document_embeddings = model.encode(documents, convert_to_tensor=True)

# compute cosine similarity
cosine_scores = util.cos_sim(query_embeddings, document_embeddings)


In [33]:
import torch

queries = ['fruits', 'vegetables', 'healthy foods in Canada']

# find top 5 documents for each query by similarity score
for i, query in enumerate(queries):

    # get similarity scores for the current query
    scores = cosine_scores[i]
    # get top 5 scores
    top_results = torch.topk(scores, k=5)

    print(f"query: {query}")
    for index, score in zip(top_results.indices, top_results.values):
        print(f">> score: {score.item():.4f} for document #{index}: {documents[index][:30]}")
    print()

query: fruits
>> score: 0.6041 for document #6: To a botanist, a fruit is an e
>> score: 0.5436 for document #29: Fruit dishes are those that us
>> score: 0.4571 for document #18: Fresh Tomatoes from Anushka Av
>> score: 0.4364 for document #31: A fruit serving bowl is a roun
>> score: 0.4042 for document #23: Flame / Red seedless grapesFla

query: vegetables
>> score: 0.4787 for document #6: To a botanist, a fruit is an e
>> score: 0.4280 for document #29: Fruit dishes are those that us
>> score: 0.3915 for document #18: Fresh Tomatoes from Anushka Av
>> score: 0.3826 for document #17: We are one of the leading orga
>> score: 0.3549 for document #8: Nutrients are substances used 

query: healthy foods in Canada
>> score: 0.6388 for document #10: Canada's Food Guide is a nutri
>> score: 0.3855 for document #12: Canadian Industry Statistics (
>> score: 0.3480 for document #18: Fresh Tomatoes from Anushka Av
>> score: 0.3017 for document #29: Fruit dishes are those that us
>> score: 0.24

# Technique Comparison
***

Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the what, but also the how and why one technique produces different results from another).



## TF-IDF

1. Overall, TF-IDF prioritized exact term matches between the query and the document.
   - It did not capture semantic similarity: synonyms and related terms are treated as unrelated, so "vegetables" matched only document #10.
   - Preprocessing such as lemmatization helps with inflected forms (e.g. "fruits" vs. "fruit"), but not with true synonyms.
   - Shorter, keyword-rich documents tend to score higher, because the query terms carry more of the document's total TF-IDF weight.
   - TF-IDF is efficient to compute, and suits domain-specific term matching or traditional search engines with exact keyword-based queries.

## GloVe

2. GloVe captured semantic relationships between words: related words such as "apple" and "fruit" have similar embeddings.
   - This led to better performance for queries with synonyms or related terms.
   - For example, a query like "fruits" can match documents that contain semantically related words such as "apple" or "apricots".
   - GloVe appears better suited than TF-IDF to small datasets where capturing semantic relationships is important, such as semantic search applications.

## BERT

3. BERT captured semantic and contextual relationships, making it surprisingly good for ambiguous queries.
   - It performed strongly on long, complex queries such as "healthy foods in Canada", because it can model the relationships between all the words in a query and a document.
   - Queries like "vegetables" match related contexts as well as semantically similar words. For example, BERT gave document #6 ("To a botanist, a fruit is...") the highest score (0.4787) even though the word "vegetables" does not appear in it. GloVe also ranked document #6 first (0.8961), whereas TF-IDF very strictly matched only document #10 ("Canada's Food Guide...").
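
One simple way to quantify how much the three techniques agree is the Jaccard overlap of their top-5 result sets. The sketch below uses the document indices printed above for the query "fruits" (lemmatized TF-IDF, GloVe, and BERT); the helper `jaccard` is a hypothetical name, and set overlap deliberately ignores the order of the results.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two result lists, ignoring rank order."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# top-5 document indices for the query "fruits", copied from the outputs above
top5 = {
    "TF-IDF (lemmatized)": [29, 6, 0, 31, 10],
    "GloVe": [6, 31, 29, 0, 10],
    "BERT": [6, 29, 18, 31, 23],
}

for (name_a, docs_a), (name_b, docs_b) in combinations(top5.items(), 2):
    print(f"{name_a} vs {name_b}: {jaccard(docs_a, docs_b):.2f}")
```

For this query, lemmatized TF-IDF and GloVe retrieve the same five documents in a different order, while BERT replaces two of them with documents #18 and #23.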
