<a href="https://colab.research.google.com/github/ua-deti-information-retrieval/Neural-IR-hands-on/blob/main/RI_practical_tutorial_2_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RI practical tutorial #2

## Embeddings

An important component of natural language processing (NLP) is the ability to translate words, phrases, or larger bodies of text into continuous numerical vectors.



## Dependencies

In [None]:
!pip install torch matplotlib tqdm
!git clone https://github.com/ua-deti-information-retrieval/Neural-IR-hands-on.git

In [None]:
import torch
from tqdm import tqdm

## Recap

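Embeddings convert words, sentences, or even entire documents into vectors of real numbers. Unlike traditional methods such as one-hot encoding, which represent words as isolated, high-dimensional points, embeddings are dense and low-dimensional, so related words can end up close to each other in the vector space.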
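
To make the contrast concrete, here is a minimal, dependency-free sketch. The helper names (`cosine`, `one_hot`) and the 2-d dense vectors are hypothetical and invented for illustration only; they are not taken from any trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

vocab = ["king", "queen", "apple"]
king, queen, apple = (one_hot(i, len(vocab)) for i in range(len(vocab)))

# Distinct one-hot vectors are orthogonal, so every pair has similarity 0:
# the encoding cannot say that "king" is closer to "queen" than to "apple".
one_hot_sims = (cosine(king, queen), cosine(king, apple))

# Hypothetical 2-d dense vectors (made up, not from a trained model).
dense = {"king": [0.9, 0.8], "queen": [0.8, 0.9], "apple": [-0.7, 0.1]}
dense_sims = (
    cosine(dense["king"], dense["queen"]),
    cosine(dense["king"], dense["apple"]),
)
```

With these made-up values, the dense vectors give "king" a higher similarity to "queen" than to "apple", while the one-hot pairs are indistinguishable.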

In [None]:
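toy_tokens = ['the','supreme','art','of','war','is','to','subdue','the','enemy','without','fighting']
# a vocabulary holds each token only once ('the' appears twice in the sentence)
toy_vocab = list(dict.fromkeys(toy_tokens))
torch.eye(len(toy_vocab))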

In [None]:
embedding_layer = torch.nn.Embedding(len(toy_vocab), 4)
print(embedding_layer.weight)
print("embeddings norm", torch.linalg.norm(embedding_layer.weight, ord=2, dim=-1))

## Hands on

To get started with practical exercises in embeddings, it's beneficial to use pre-trained models. This allows us to explore and understand the power of embeddings without the need for extensive computational resources and time to train our models.

For our exercise, we will use DESM (Dual Embedding Space Model) from Microsoft, the same model introduced in class. DESM is a unique model that leverages two types of embeddings.

In [None]:
# run to download the desm embeddings
!wget https://download.microsoft.com/download/A/7/C/A7C7F0A6-B925-4C07-A14B-04ACF8A8E030/desm.zip
!unzip desm.zip

In [None]:
!wget https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt

In [None]:
# get a simple vocab, because loading the full IN and OUT matrices would exhaust the resources
with open("words_alpha.txt") as f:
#with open("simple_vocab_example.txt") as f:
  vocab_set = {token.rstrip() for token in f}

In [None]:

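def load_embeddings_from_txt(path, vocab):
  """
  Load the embeddings stored in path (one token per line, followed by its
  tab-separated vector values), keeping only the tokens present in vocab.

  Returns token_to_id, id_to_token and a tensor with one row per kept token.
  """
  emb = {}

  with open(path) as f:
    for line in tqdm(f):
      token, *values = line.split("\t")
      if token in vocab:
        emb[token] = list(map(float, values))

  # separating the tokens from their vectors (same order in both)
  tokens, vectors = zip(*emb.items())
  token_to_id = {token: i for i, token in enumerate(tokens)}
  id_to_token = {i: token for token, i in token_to_id.items()}

  return token_to_id, id_to_token, torch.tensor(vectors)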

in_token_to_id, in_id_to_token, in_embeddings = load_embeddings_from_txt("in.txt", vocab_set)

Let's explore the loaded embeddings.



In [None]:
print("shape", in_embeddings.shape)
nurse_id = in_token_to_id["nurse"]
print("Token: nurse | id:", nurse_id)
print("embeddings norm", torch.linalg.norm(in_embeddings[nurse_id], ord=2))
print("nurse embedding:",in_embeddings[nurse_id])

## How to find similar tokens with embeddings?

The same way you find similar vectors with TF-IDF: using cosine similarity!

More precisely, given two vectors $\vec{a}$ and $\vec{b}$:

$\cos(\vec{a},\vec{b}) = \frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}$

Then, we just need to compute the cosine similarity between $\vec{a}$ and all of the vectors in our matrix $C$ (collection).

As an example, complete the following function. It should calculate the cosine similarity between a given vector and all the collection vectors and return the most similar tokens and scores.
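
Before writing the tensor version, it may help to see the idea on plain Python lists. The sketch below is only an illustration: `topk_by_cosine` and the toy vectors are hypothetical, and the function you are asked to complete should operate on the torch tensors instead (for instance with a matrix-vector product on normalized vectors, followed by `torch.topk`).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def topk_by_cosine(query_vec, collection, id_to_token, topk=2):
    """Score every row of `collection` against `query_vec`, keep the best `topk`."""
    scores = [
        (id_to_token[i], cosine(query_vec, vec))
        for i, vec in enumerate(collection)
    ]
    # sort by score, highest first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topk]

# Hypothetical toy collection: row i holds the vector of toy_tokens[i].
toy_tokens = ["car", "truck", "banana"]
toy_collection = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# Query with the vector of "car": it matches itself first, then "truck".
toy_result = topk_by_cosine(toy_collection[0], toy_collection, toy_tokens, topk=2)
```

Note that the query token always appears as its own best match, which is also what the example output in the docstring of the exercise shows.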

In [None]:
def find_topk_similar_to(token, embeddings, token_to_id, id_to_token, topk=10):
  """
  Given a token, return the topk most similar tokens according to the cosine
  similarity between the token vector and every vector in embeddings
  """

  token_embedding = embeddings[token_to_id[token]]
  return find_topk_similar_to_vec(token_embedding, embeddings, token_to_id, id_to_token, topk)

def find_topk_similar_to_vec(token_embedding, embeddings, token_to_id, id_to_token, topk=10):
  """
  Given a token embedding, return the topk most similar tokens according to the
  cosine similarity between that vector and every vector in embeddings

  Example output (top 10 for the embedding of "mercedes"):
  [('mercedes', 0.9999992251396179),
  ('cabriolet', 0.6590193510055542),
  ('sprinter', 0.6370120048522949),
  ('volkswagen', 0.6347604393959045),
  ('fiat', 0.6245887875556946),
  ('jaguar', 0.6102705001831055),
  ('toyota', 0.5901010632514954),
  ('honda', 0.5850051641464233),
  ('rover', 0.5818690061569214),
  ('freightliner', 0.5783664584159851)]

  """
  ## complete
  pass




In [None]:
find_topk_similar_to("yale", in_embeddings, in_token_to_id, in_id_to_token)


In [None]:
find_topk_similar_to("apple", in_embeddings, in_token_to_id, in_id_to_token)

In [None]:
find_topk_similar_to("oak", in_embeddings, in_token_to_id, in_id_to_token)

In [None]:
# Why does it work badly for "covid"? Any guess?
find_topk_similar_to("covid", in_embeddings, in_token_to_id, in_id_to_token)

## Word analogies

Another interesting property of word embeddings is their ability to capture word analogies through geometric relationships in the vector space. This phenomenon is often illustrated by the famous example: "king" - "man" + "woman" ≈ "queen". In this case, the embeddings capture the relationship between gender roles and royal titles.

With the help of the previous function, build an approximation of the "queen" vector by applying the relation ("king" - "man") to "woman".
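
The arithmetic behind the analogy can be sketched on hand-made vectors. Everything here is an assumption for illustration: the 2-d vectors are invented so that the first axis loosely stands for "royalty" and the second for "gender", and `toy_analogy` is a hypothetical helper, not the solution expected below.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: [royalty, gender], made up for this sketch.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def toy_analogy(token_a, token_b, token_c, emb):
    """Rank tokens by cosine similarity to emb[a] - emb[b] + emb[c]."""
    target = [
        a - b + c
        for a, b, c in zip(emb[token_a], emb[token_b], emb[token_c])
    ]
    # the three input tokens are excluded from the candidates
    candidates = [
        (token, cosine(target, vec))
        for token, vec in emb.items()
        if token not in (token_a, token_b, token_c)
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

toy_ranking = toy_analogy("king", "man", "woman", toy)
```

Here `king - man + woman` lands exactly on the invented "queen" vector; with real embeddings the match is only approximate, which is why the result is a ranked list rather than a single token.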



In [None]:


def word_analogy(token_a, token_b, token_c):
  """
  Performs token_a - token_b + token_c

  and returns a list with the closest tokens

  Note: token_a, token_b and token_c should be removed from the returned list

  Example:
  word_analogy("king", "man", "woman")
  [('queen', 0.6244865655899048),
 ('kings', 0.4600622057914734),
 ('prince', 0.42849528789520264),
 ('princess', 0.42579346895217896),
 ('royal', 0.41185224056243896),
 ('crown', 0.4051671624183655),
 ('princes', 0.40045303106307983),
 ('lamb', 0.3960754871368408),
 ('hamilton', 0.39465370774269104)]
  """
  ## Complete
  pass

word_analogy("king", "man", "woman") # expected queen

In [None]:
word_analogy("paris", "france", "portugal") # expected lisbon

In [None]:
word_analogy("france", "paris", "lisbon") # expected portugal

In [None]:
word_analogy("teacher", "school", "hospital") # expected ? (maybe doctor?)

## Okay, but what if I want to use sentences or documents?

In such scenarios, a straightforward approach is to average the embeddings of all tokens within a sentence. This method offers a means to condense the rich information of a sentence into a single vector.

By averaging the embeddings of each word in a sentence, we create a composite representation that captures the essence of the sentence as a whole. This can then be used to compare and measure the similarity between different sentences or documents. It's a practical method, especially when dealing with small texts. Let's proceed to implement this and see how well it performs in identifying sentence similarities.
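
As a rough illustration of the averaging step, here is a sketch on plain Python lists. The `mean_vector` helper and the toy embeddings are hypothetical; the exercise below should use the tensors returned by `text_to_vec` instead.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-d embeddings, invented for this sketch.
toy_emb = {"red": [1.0, 0.0], "fox": [0.0, 1.0], "sleeps": [1.0, 1.0]}

tokens = "red fox jumps".split()
# "jumps" has no embedding here, so it is skipped
known_vectors = [toy_emb[token] for token in tokens if token in toy_emb]

sentence_vec = mean_vector(known_vectors)  # average of "red" and "fox"
```

One consequence worth keeping in mind: the average ignores word order, and tokens without an embedding contribute nothing, so a sentence made only of unknown tokens has no representation at all.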

In [None]:
sentences_corpus = [
    "A nimble red fox leaped over a sleeping canine.",
    "New York is known for its bustling city life.",
    "The city of Tokyo is lively and vibrant at night.",
    "The development of AI has significant implications for society.",
    "Fresh vegetables and fruits are essential for a healthy diet.",
    "Eating a variety of greens and fruits contributes to good health.",
    "The book on the shelf is old and worn.",
    "An ancient, tattered tome sits in the library."
]

sentence_to_id = {s:i for i,s in enumerate(sentences_corpus)}
id_to_sentence = sentences_corpus
#id_to_sentence = {v:k for k,v in sentence_to_id.items()}

In [None]:


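def text_to_vec(text, embeddings, in_token_to_id):
  # lowercase and split on whitespace; punctuation stays attached (e.g. "canine."),
  # and any token missing from the vocabulary is skipped
  tokens = text.lower().split()
  return [embeddings[in_token_to_id[token]] for token in tokens if token in in_token_to_id]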

def sentence_embedding(text, embeddings):
  """
  Given a text, compute its sentence embedding by averaging its token embeddings

  use the function text_to_vec to convert text to vectors: text_to_vec(text, embeddings, in_token_to_id)

  Out: sentence embeddings
  """
  ## Complete

  pass

sentences_corpus_embeddings = torch.stack([sentence_embedding(sent, in_embeddings) for sent in sentences_corpus])


In [None]:
sent_embedding = sentence_embedding("Artificial Intelligence will shape the future of humanity.", in_embeddings)
find_topk_similar_to_vec(sent_embedding, sentences_corpus_embeddings, sentence_to_id, id_to_sentence, topk=5)



In [None]:
sent_embedding = sentence_embedding("The quick brown fox jumps over the lazy dog.", in_embeddings)
find_topk_similar_to_vec(sent_embedding, sentences_corpus_embeddings, sentence_to_id, id_to_sentence, topk=5)

## Well, if it works for sentence similarity, maybe it works for retrieval?

Let's apply the same approach to this toy collection of documents.

In [None]:
documents = [
    "Apples are rich in antioxidants, which help in fighting free radicals.",
    "The water cycle consists of evaporation, condensation, and precipitation.",
    "Recent trends in AI include advancements in deep learning and neural networks.",
    "Good mental health can be maintained by regular exercise and proper sleep.",
    "The Olympic Games originated in ancient Greece and have evolved over centuries.",
    "Eating fruits and vegetables is essential for physical well-being.",
    "Cloud formation is a key aspect of the earth's hydrological process.",
    "Machine learning and AI are becoming integral in various industries.",
    "Mindfulness and meditation are effective for stress management.",
    "The modern Olympics include a variety of sports from track to swimming."
]

doc_to_id = {s:i for i,s in enumerate(documents)}
id_to_doc = documents

doc_embeddings = torch.stack([sentence_embedding(sent, in_embeddings) for sent in documents])


In [None]:
sent_embedding = sentence_embedding("How does the water cycle work?", in_embeddings)
find_topk_similar_to_vec(sent_embedding, doc_embeddings, doc_to_id, id_to_doc, topk=3)

In [None]:
sent_embedding = sentence_embedding("What is the history of the Olympic Games?", in_embeddings)
find_topk_similar_to_vec(sent_embedding, doc_embeddings, doc_to_id, id_to_doc, topk=3)

## DESM model

Up to this point, we have only used the 'IN' embeddings of DESM (Dual Embedding Space Model). Let's take a closer look at the model.

DESM is unique in its dual-embedding approach. It leverages both 'IN' and 'OUT' embeddings to enhance the representation of words and phrases.

First, let's load the OUT embeddings.

In [None]:
# note that out_token_to_id and out_id_to_token should be exactly the same as in_token_to_id and in_id_to_token
out_token_to_id, out_id_to_token, out_embeddings = load_embeddings_from_txt("out.txt", vocab_set)


In continuation of what we've learned in class, we'll now calculate similarities using different combinations of embeddings from the DESM model. Namely, IN-IN, IN-OUT and OUT-OUT.

In [None]:
def in_out_comparison_for_token(token, topk=10):

  in_in_results = find_topk_similar_to(token, in_embeddings, in_token_to_id, in_id_to_token, topk=topk)
  out_out_results = find_topk_similar_to(token, out_embeddings, out_token_to_id, out_id_to_token, topk=topk)
  # IN vector of the token compared against all OUT vectors, so the OUT mappings are used
  in_out_results = find_topk_similar_to_vec(in_embeddings[in_token_to_id[token]], out_embeddings, out_token_to_id, out_id_to_token, topk=topk)
  print(f'|{"IN-IN":^25}|{"OUT-OUT":^25}|{"IN-OUT":^25}|')
  for i in range(topk):
    in_in_str = f'{in_in_results[i][0]} ({in_in_results[i][1]:.3f})'
    out_out_str = f"{out_out_results[i][0]} ({out_out_results[i][1]:.3f})"
    in_out_str = f"{in_out_results[i][0]} ({in_out_results[i][1]:.3f})"
    print(f'|{in_in_str:^25}|{out_out_str:^25}|{in_out_str:^25}|')



In [None]:
in_out_comparison_for_token("yale")


In [None]:
in_out_comparison_for_token("apple")

## DESM Retrieval

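Following the slides, let's implement the DESM retrieval model:

$DESM(Q, D) = \frac{1}{|Q|}\sum_{q_i \in Q}\cos(\vec{q_i},\vec{D})$

where $\vec{q_i}$ is the embedding of the $i$-th query token and $\vec{D}$ is the average of the document's token embeddings.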
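
The scoring formula can be traced by hand on a tiny example. The sketch below is a plain-Python illustration under stated assumptions: `centroid`, `desm_score` and all vectors are hypothetical, and it does not decide which embedding space (IN or OUT) each side should use, which is part of the exercise.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def desm_score(query_vecs, doc_vecs):
    """Mean cosine similarity between each query vector and the document centroid."""
    doc_centroid = centroid(doc_vecs)
    return sum(cosine(q, doc_centroid) for q in query_vecs) / len(query_vecs)

# Hypothetical token vectors, invented for this sketch.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [1.0, 1.0]]    # centroid [1.0, 0.5]
doc_b = [[-1.0, 0.0], [0.0, -1.0]]  # centroid [-0.5, -0.5]

score_a = desm_score(query, doc_a)
score_b = desm_score(query, doc_b)

# rank documents by score, highest first
ranking = sorted([("doc_a", score_a), ("doc_b", score_b)],
                 key=lambda pair: pair[1], reverse=True)
```

Unlike the averaged-query approach used earlier, each query token is compared to the document on its own and the similarities are averaged afterwards.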

In [None]:
documents = [
    "Apples are rich in antioxidants, which help in fighting free radicals.",
    "The water cycle consists of evaporation, condensation, and precipitation.",
    "Recent trends in AI include advancements in deep learning and neural networks.",
    "Good mental health can be maintained by regular exercise and proper sleep.",
    "The Olympic Games originated in ancient Greece and have evolved over centuries.",
    "Eating fruits and vegetables is essential for physical well-being.",
    "Cloud formation is a key aspect of the earth's hydrological process.",
    "Machine learning and AI are becoming integral in various industries.",
    "Mindfulness and meditation are effective for stress management.",
    "The modern Olympics include a variety of sports from track to swimming."
]



In [None]:
def desm(query, documents, topk=3):
  """
  Implement the desm algorithm
  query: text of a question
  documents: list of documents text that make the collection
  topk: maximum number of documents that we want to return

  desm("How does the water cycle work?", documents)
  [('The water cycle consists of evaporation, condensation, and precipitation.',
  -0.0023205685429275036),
 ("Cloud formation is a key aspect of the earth's hydrological process.",
  -0.028624113649129868),
 ('Good mental health can be maintained by regular exercise and proper sleep.',
  -0.031198585405945778)]
  """
  ## COMPLETE


  # average embeddings for the doc
  pass



In [None]:
desm("How does the water cycle work?", documents) # does it help?


In [None]:
desm("What is the history of the Olympic Games?", documents)