<div style="background-color: #f6a800; color: #ffffff; padding: 10px;">
<h2>Part 3.2 - Alternative static embeddings: GloVe</h2>
</div>

We are going to repeat what we did with the Word2Vec models, this time using GloVe embeddings, and we will visualize the embeddings in two dimensions.

We start with some imports as usual.

In [None]:
# imports
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from nb_config import EXTERNAL_DATA_PATH

from src.data import data_loader
from src.plotting import visualize_embeddings

import warnings
warnings.filterwarnings('ignore')

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>1. Overview</h3>
</div>

**GloVe (Global Vectors for Word Representation)** is an algorithm developed by researchers at Stanford University to generate word embeddings. GloVe operates on the idea that the meaning of words can be inferred from their contexts of use. The algorithm begins by constructing a co-occurrence matrix that represents the frequency of word pairs appearing together within a specified context window. This matrix reflects the statistical information about how often words occur together in the given corpus.
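
To make the co-occurrence matrix more concrete, here is a minimal sketch of how such counts could be collected. The function name `build_cooccurrence`, the toy corpus and the window size are assumptions made for illustration. The sketch also simplifies the original algorithm, which additionally down-weights pairs that are further apart within the window.

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """Count how often each (word, context word) pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # look at the neighbours to the right and count the pair in both directions
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return dict(counts)

# toy corpus, invented for illustration only
toy_corpus = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
]
cooc = build_cooccurrence(toy_corpus, window=2)
print(cooc[("king", "rules")], cooc[("rules", "land")])
```

Pairs that show up in both toy sentences, such as *rules* and *land*, get a higher count than pairs that show up in only one of them.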

The optimization objective of GloVe is to learn word vectors whose dot product approximates the logarithm of the words' co-occurrence probability. With this choice, differences between word vectors encode ratios of co-occurrence probabilities, which is the signal GloVe aims to capture. Training adjusts the word vectors iteratively to minimize the weighted squared difference between the dot products and the logarithms of the observed co-occurrence statistics.
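
The objective can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation: the function names and the toy matrix are assumptions. In the published formulation the target is the logarithm of the co-occurrence count, bias terms absorb the normalization, and a weighting function (with `x_max=100` and `alpha=0.75` as the values reported by the authors) limits the influence of very rare and very frequent pairs.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: damps rare pairs and caps the influence of frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted squared error between (dot product + biases) and log co-occurrence counts."""
    mask = X > 0                                # only observed pairs contribute
    log_X = np.log(np.where(mask, X, 1.0))      # avoid log(0) on unobserved pairs
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
    return float(np.sum(mask * glove_weight(X) * (pred - log_X) ** 2))

# toy co-occurrence counts for a 2-word vocabulary (illustrative values)
X = np.array([[0.0, 2.0], [2.0, 0.0]])
W = np.zeros((2, 3))
W_ctx = np.zeros((2, 3))
b = np.zeros(2)
b_ctx = np.zeros(2)
print(glove_loss(W, W_ctx, b, b_ctx, X))
```

Training would then update `W`, `W_ctx` and the biases by gradient descent until this loss stops decreasing.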

Compared to Word2Vec, which also aims to capture word relationships but does so through a predictive model (either skip-gram or continuous bag of words), GloVe stands out for its emphasis on global context and direct optimization for co-occurrence probabilities. Word2Vec, on the other hand, uses a neural network-based approach to predict words in context, learning embeddings that capture syntactic and semantic relationships.

We will make use of **GloVe** pretrained word vectors, specifically the ones that use the Wikipedia 2014 + Gigaword 5 dataset (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download).

> **NOTE**: If you followed the instructions for preparing this workshop, you should have in the <kbd>data/external</kbd> folder some files called <kbd>glove.6B.XXXd.txt</kbd>. We will work with the 50-dimensional embeddings file. 

In [None]:
# Load GloVe embeddings
glove_embeds = {}
with open(EXTERNAL_DATA_PATH+'glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip('\n').split(' ')
        word = values[0]  # the first entry is the token
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is the embedding for that token
        glove_embeds[word] = coefs

# show what a few of them look like (printing the whole dictionary would be very long)
dict(list(glove_embeds.items())[:5])

We store the embeddings in two pandas DataFrames: one with the original embeddings as read from the txt file, and a second one with 2-dimensional embeddings created with UMAP for visualization purposes.

In [None]:
# save the embeddings in a DataFrame
embeds_df = pd.DataFrame(glove_embeds).T

# load the precomputed 2D UMAP projection of those embeddings
umap_df = pd.read_parquet(EXTERNAL_DATA_PATH+'glove_umap_emb.parquet')

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>2. Related words</h3>
</div>

We start by looking for the tokens closest to a given token. We first need to check that the token is in the vocabulary; then we can calculate its cosine similarity with all the other tokens.
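
As a reminder of what the cosine similarity measures, here is a small self-contained sketch that ranks neighbours by hand. The helper `nearest_tokens` and the 3-dimensional toy vectors are invented for illustration and are not real GloVe values.

```python
import numpy as np

def nearest_tokens(token, embeddings, n=2):
    """Return the n tokens whose vectors have the highest cosine similarity to `token`."""
    words = list(embeddings)
    matrix = np.array([embeddings[w] for w in words], dtype=float)
    query = np.asarray(embeddings[token], dtype=float)
    # cosine similarity = dot product divided by the product of the vector norms
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    ranked = [words[i] for i in np.argsort(-sims) if words[i] != token]
    return ranked[:n]

# toy 3-dimensional vectors, invented for illustration (not real GloVe values)
toy_embeds = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.8, 0.9, 0.2],
    "princess": [0.9, 0.5, 0.3],
    "banana": [0.1, 0.0, 0.9],
}
print(nearest_tokens("queen", toy_embeds))
```

Because the similarity depends only on the angle between vectors, tokens whose vectors point in a similar direction rank first regardless of vector length.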

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Choose a word and think about which other words you would expect to be closest to it when you plot them. Then take a look at the code to understand what it does. Finally, run the code and see which GloVe embeddings are closest to the word you chose.

What can you say about the results?

Try very different words/tokens and see if your observations are consistent with the results.
</div>

In [None]:
# the token we want to check
word_lookup = "queen"

# create the embeddings matrix
embed_matrix = embeds_df.to_numpy()

# get the vector for the lookup token, stopping early if it is not in the vocabulary
if word_lookup not in glove_embeds:
    raise KeyError(f"The token {word_lookup} is not present in the vocabulary.")
lookup_vector = glove_embeds[word_lookup].reshape((1, -1))

# calculate the cosine similarity
cos_similarities = cosine_similarity(lookup_vector, embed_matrix).flatten()

# put the results in a DataFrame
results = pd.DataFrame({'cos_sim': cos_similarities}, index=embeds_df.index)

# choose number of results
n = 5

# sort, list and print (the first result is the lookup token itself)
close_tokens = results.sort_values(by='cos_sim', ascending=False).head(n+1).index.to_list()
print(f"The closest {n} tokens to {close_tokens[0]} are {close_tokens[1:]}")

We can now visualize the results with the following code:

In [None]:
visualize_embeddings(close_tokens, umap_df)

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>3. Word analogies with GloVe</h3>
</div>

We can also try word analogies as we did for Word2Vec embeddings.
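
The arithmetic behind a word analogy can be sketched with toy vectors. The helper `solve_analogy` and the 2-dimensional values below are assumptions made for illustration. This variant also skips the three input tokens when ranking, which is a common convention in analogy evaluation.

```python
import numpy as np

def solve_analogy(base, negative, positive, embeddings, n=1):
    """Rank tokens by cosine similarity to base - negative + positive, skipping the inputs."""
    target = (np.asarray(embeddings[base], dtype=float)
              - np.asarray(embeddings[negative], dtype=float)
              + np.asarray(embeddings[positive], dtype=float))
    scores = {}
    for word, vec in embeddings.items():
        if word in (base, negative, positive):
            continue  # the input tokens themselves are often among the nearest neighbours
        vec = np.asarray(vec, dtype=float)
        scores[word] = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
    return sorted(scores, key=scores.get, reverse=True)[:n]

# toy 2-dimensional vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (invented values)
toy_embeds = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}
print(solve_analogy("king", "man", "woman", toy_embeds))
```

In this idealized setting the offset from *man* to *woman* is exactly the offset from *king* to *queen*. Real embeddings are much noisier.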

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Take a look at the code to understand what it does. Choose some different word analogies like:
- *Which word is to woman what king is to man?*
- *Which word is to France what Berlin is to Germany?*

Run the code and see which tokens are closest to the vector of your word analogy.
</div>


In [None]:
# define the tokens for the word analogy
base_token = "king"
negative_token = "man"
positive_token = "woman"

# number of closest tokens we want to show
n = 5

# create the lookup vector, stopping early if any token is not in the vocabulary
missing = [t for t in (base_token, negative_token, positive_token) if t not in glove_embeds]
if missing:
    raise KeyError(f"Tokens not present in the vocabulary: {missing}")
lookup_vector = glove_embeds[base_token] - glove_embeds[negative_token] + glove_embeds[positive_token]
lookup_vector = lookup_vector.reshape([1, -1])

# calculate cosine similarity
cos_similarities = cosine_similarity(lookup_vector, embed_matrix).flatten()

# put results in a DataFrame
results = pd.DataFrame({'cos_sim': cos_similarities}, index=embeds_df.index)

# list the closest n tokens
close_tokens = results.sort_values(by='cos_sim', ascending=False).head(n).index.to_list()
print(f"The closest {n} tokens are {close_tokens}")

# build the query vector in 2D by applying the same arithmetic to the UMAP coordinates
lookup_umap = (umap_df.loc[base_token].values - umap_df.loc[negative_token].values + umap_df.loc[positive_token].values).reshape([1, -1])

# visualize the query and the closest tokens
visualize_embeddings(close_tokens, umap_df, lookup_umap)

There is something strange going on with our word analogies...

Did you get what you were expecting? Did the results surprise you?

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

What happened with the countries and their capitals? Why didn't it work?

Formulate a hypothesis that justifies this anomaly. Discuss your hypothesis with other participants.
</div>
<br>

Let's focus now on the first analogy. What happened?

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Try other word analogies like:
- *king - son + daughter = ?*
- *father - uncle + aunt = ?*

Which analogies worked well?
</div>

To try to shed some light on the problem, let us visualize some related tokens that we would expect to behave in the same way.

In [None]:
analogy_words = ['man', 'woman', 'queen', 'king', 'father', 'mother', 'son', 'daughter']

visualize_embeddings(analogy_words, umap_df)


Now we can see where the problem is. All pairs (*father-mother*, *son-daughter*, *king-queen*) point roughly in the same direction, downwards to the right from the male to the female counterpart. However, the *man-woman* pair goes in the opposite direction, upwards to the left. What do you think could be the reason for this?

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Formulate a hypothesis that justifies this anomaly. Discuss your hypothesis with other participants.
</div>
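
One way to test a hypothesis is to compare the offsets between pairs numerically instead of reading them off a plot. The sketch below uses invented toy vectors and a hypothetical helper, `direction_similarity`. With the notebook's data one could pass `glove_embeds` to check the original 50-dimensional space, or a dictionary built from `umap_df` to check the 2D projection.

```python
import numpy as np

def direction_similarity(pair_a, pair_b, embeddings):
    """Cosine similarity between the offset vectors of two (first, second) token pairs."""
    offset_a = (np.asarray(embeddings[pair_a[1]], dtype=float)
                - np.asarray(embeddings[pair_a[0]], dtype=float))
    offset_b = (np.asarray(embeddings[pair_b[1]], dtype=float)
                - np.asarray(embeddings[pair_b[0]], dtype=float))
    return float(offset_a @ offset_b / (np.linalg.norm(offset_a) * np.linalg.norm(offset_b)))

# toy 3-dimensional vectors, invented for illustration (not real GloVe values)
toy_embeds = {
    "man": [0.0, 1.0, 0.3],
    "woman": [0.0, -1.0, 0.3],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.1],
}
print(direction_similarity(("man", "woman"), ("king", "queen"), toy_embeds))
```

A value close to 1 means that both offsets point in almost the same direction, while a value close to -1 means that they point in opposite directions.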



<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>4. Queries with GloVe</h3>
</div>

Now we want to check our use case again. How does this model perform?

We need to retrieve our tokenized descriptors again, but we already know how that works. As with Word2Vec, we are going to create descriptor embeddings by averaging the vectors of all the tokens in the descriptor.
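
The averaging step can be sketched as a small helper. The function `average_embedding`, its `dim` parameter and the toy vectors are assumptions made for illustration. Returning a zero vector when no token is in the vocabulary is one possible convention, not necessarily the one used elsewhere in this workshop.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors of in-vocabulary tokens; a zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0)

# toy 2-dimensional vectors, invented for illustration
toy_embeds = {
    "prison": np.array([1.0, 0.0], dtype="float32"),
    "hope": np.array([0.0, 1.0], dtype="float32"),
}
print(average_embedding(["prison", "hope", "unknowntoken"], toy_embeds, dim=2))
print(average_embedding(["unknowntoken"], toy_embeds, dim=2))
```

Tokens that are not in the vocabulary are silently ignored, so two descriptors that differ only in unknown tokens end up with the same embedding.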

> **NOTE**: We have seen that with this model some ways of tokenizing could be better than others. This could be a good moment to go back and create new tokens with a different configuration, if you think that would improve the semantic search engine.

We load and prepare the data.

In [None]:
df, params = data_loader("my_tokenized_data.parquet")

# extract the arguments for the normalize function
tokenizer = params['tokenizer']
arguments = params['args']

# print the tokenizer used to obtain the tokens
print(f"Tokenizer used: {tokenizer}")
print(f"Arguments passed to the normalize function:\n{arguments}")

Now we create the corpus matrix.

In [None]:
# creates a Pandas series with the averaged vectors (np.mean, so that the length of the descriptor does not scale the vector)
corpus_matrix = df.loc[:, 'tokens'].map(lambda x: np.mean([glove_embeds[token] for token in x if token in glove_embeds], axis=0))
# creates the matrix from the vectors
corpus_matrix = np.vstack(corpus_matrix)

It's time to tokenize the query and create its embedding.

In [None]:

from src.normalizing import normalize, NLTKTokenizer

query = "An innocent man goes to prison accused of killing his wife and her lover, but never loses the hope"

# instantiate the tokenizer
tkn = NLTKTokenizer()

# tokenize the query with the same tokenizer you used for the corpus texts
query_tokens = normalize(query, tkn, punct_signs=True)[1]

# average the vectors of the query tokens that are in the vocabulary
query_vector = np.mean([glove_embeds[token] for token in query_tokens if token in glove_embeds], axis=0)
query_vector = query_vector.reshape((1, -1))

And as usual, we compare the query with the descriptors in our dataset.
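
The three metrics can rank the same descriptors differently, because only the cosine similarity ignores the length of the vectors. The sketch below illustrates this with invented toy vectors. The helper `compare_metrics` is hypothetical and only mirrors the kind of computation done in the next cell.

```python
import numpy as np

def compare_metrics(query, corpus):
    """Return Euclidean distance, dot product and cosine similarity of each row to the query."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    euclid = np.linalg.norm(corpus - query, axis=1)
    dot = corpus @ query
    cos = dot / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    return euclid, dot, cos

# two toy "descriptors" pointing in the same direction, one ten times longer
toy_corpus = [[1.0, 1.0], [10.0, 10.0]]
toy_query = [1.0, 1.0]
euclid, dot, cos = compare_metrics(toy_query, toy_corpus)
print(euclid, dot, cos)
```

Both rows get the same cosine similarity, while the dot product favours the longer vector and the Euclidean distance favours the shorter one.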

In [None]:
# calculating the distances and similarities
euclid_distances = np.linalg.norm(corpus_matrix - query_vector, axis=1)
dotprod_similarities = np.dot(corpus_matrix, query_vector.T).flatten()
cos_similarities = cosine_similarity(query_vector, corpus_matrix).flatten()

# creating the new dataframe and adding the extra columns
results = df.loc[:, ['title', 'descriptor']].copy()
results.loc[:, 'euclid_dist'] = euclid_distances
results.loc[:, 'dot_prod_sim'] = dotprod_similarities
results.loc[:, 'cos_sim'] = cos_similarities
results.loc[:, 'common_tokens'] = df.loc[:, 'tokens'].map(lambda x: list(set(x).intersection(query_tokens)))

In [None]:
# choose the metric! options:
# 'euclid_dist', 'dot_prod_sim', 'cos_sim'
metric = 'cos_sim'

# choose number of results
n = 10

# show the results
# similarities ('dot_prod_sim', 'cos_sim') are sorted in descending order,
# distances ('euclid_dist') in ascending order
results.sort_values(by=metric, ascending=(metric == 'euclid_dist')).head(n)

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

How was the result? Try several queries with different levels of complexity and try to understand when the search works better and when it works worse.
</div>

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>5. Advantages and disadvantages of GloVe</h3>
</div>

#### Advantages:

> - **Semantic Precision**: GloVe captures semantic relationships from global co-occurrence patterns over the whole corpus, rather than only from local context windows. This can give a more nuanced picture of word meaning than purely local methods.
> - **Reduced Sensitivity to Noise**: Because GloVe is trained on aggregated co-occurrence counts and down-weights rare pairs, the resulting embeddings tend to be stable in the presence of varied and noisy textual data.
> - **Pre-trained Embeddings Availability**: Pre-trained GloVe embeddings are widely accessible for several corpora and domains. This availability facilitates transfer learning, allowing users to leverage pre-trained embeddings in downstream tasks with limited data.

#### Disadvantages:

> - **Fixed Vocabulary Size**: GloVe operates with a fixed vocabulary size determined during training. This fixed nature may limit adaptation, particularly in scenarios where language usage evolves over time.
> - **Limited Representation of Polysemy**: GloVe may struggle with polysemy, as it assigns a single vector to each word. This limitation poses challenges in effectively representing words with multiple meanings.
> - **Dependency on Training Data Quality**: The quality of GloVe embeddings heavily relies on the diversity and representativeness of the training corpus. Biased or unrepresentative training data may result in suboptimal embeddings.
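
The fixed vocabulary has a practical consequence: tokens that are not in the vocabulary have no vector at all. A quick way to gauge the impact is to measure how many tokens of a text are covered. The helper `vocabulary_coverage` and the toy vocabulary below are assumptions made for illustration. With the notebook's data one could pass `glove_embeds` as the vocabulary.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return the share of tokens found in the vocabulary and the list of missing ones."""
    missing = [t for t in tokens if t not in vocabulary]
    coverage = 1.0 - len(missing) / len(tokens) if tokens else 0.0
    return coverage, missing

# toy vocabulary and tokens, invented for illustration
toy_vocab = {"an", "innocent", "man", "goes", "to", "prison"}
toy_tokens = ["an", "innocent", "man", "goes", "to", "Shawshank"]
coverage, missing = vocabulary_coverage(toy_tokens, toy_vocab)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

Low coverage on a corpus is a hint that the tokenization (for example, casing) does not match the one used to train the embeddings.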

#### Applications:

> - **Review Sentiment Classification**: Using GloVe embeddings as features for sentiment analysis can improve the classification of product or service reviews into positive and negative sentiments.
> - **Document Clustering for Topic Discovery**: Applying GloVe embeddings in document clustering facilitates the discovery of topics within a large corpus, grouping similar documents based on the semantic relationships between words.
> - **Named Entity Recognition (NER) in Text**: Integrating GloVe embeddings into Named Entity Recognition models can improve the identification and classification of named entities within textual data, aiding in information extraction.
> - **Keyword Extraction from Scientific Papers**: Using GloVe embeddings for keyword extraction in scientific papers can help identify critical terms, supporting the summarization and categorization of research content.
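
As an illustration of the document clustering application, the sketch below groups a few toy document vectors with k-means from scikit-learn. The vectors are invented stand-ins for averaged GloVe embeddings. The number of clusters is an assumption that would have to be chosen for a real corpus.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy document vectors, standing in for averaged GloVe embeddings (invented values)
doc_vectors = np.array([
    [0.9, 0.1], [1.0, 0.0],   # two documents about one topic
    [0.0, 1.0], [0.1, 0.9],   # two documents about another topic
])

# group the documents into two clusters; random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)
print(labels)
```

Documents with similar vectors receive the same cluster label. The label numbers themselves are arbitrary.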
