This notebook explores retrofitted word vectors. Retrofitting is a post-processing step that adjusts pre-trained word vectors (such as GloVe) so that they agree more closely with a semantic lexicon such as WordNet. It pulls words that the lexicon links, for example members of the same synonym set (synset), toward one another while keeping each vector close to its original position. The technique was introduced by Faruqui et al. (2015); their [reference implementation](https://github.com/mfaruqui/retrofitting) is used below.

In [None]:
# 'glove2word2vec' converts GloVe-format text vectors to word2vec text format.
# Note: gensim 4.x marks this script as deprecated, because
# KeyedVectors.load_word2vec_format(..., no_header=True) can read GloVe text files directly.
from gensim.scripts.glove2word2vec import glove2word2vec
# 'KeyedVectors' is gensim's standard class for loading and querying pre-trained word embeddings.
from gensim.models import KeyedVectors

### Step 1: Process the Original GloVe Vectors

First, we need to convert the original GloVe vectors from their standard text format into a format that the `gensim` library can load efficiently. We'll use the `glove2word2vec` utility for this.
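
As background, the two text formats differ only in a header line: a word2vec text file begins with `<vocab_size> <dimensions>`, while a GloVe file goes straight to the vectors. The following is a minimal sketch of that conversion in plain Python. The function name `add_word2vec_header` and the toy file are illustrative assumptions, not part of `gensim`, and the sketch reads the whole file into memory for simplicity.

```python
import os
import tempfile


def add_word2vec_header(glove_path, w2v_path):
    """Copy a GloVe text file, prepending the 'vocab_size dim' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    # Each line is "word v1 v2 ... vN", so the dimension is the field count minus one.
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        dst.writelines(lines)
    return len(lines), dim


# Toy two-word, three-dimensional file (made-up values, for illustration only).
toy_dir = tempfile.mkdtemp()
toy_glove = os.path.join(toy_dir, "toy.glove.txt")
toy_w2v = os.path.join(toy_dir, "toy.w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("hate 0.1 0.2 0.3\nlove 0.4 0.5 0.6\n")

vocab_size, dim = add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)  # prints "2 3"
```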

In [None]:
# Path to the original GloVe vector file.
glove_file = "../data/glove.42B.300d.50K.txt"
# Path where the converted word2vec-format file will be saved.
original_file = "../data/glove.42B.300d.50K.w2v.txt"
# Read the GloVe file and write the word2vec-format file.
# The return value is not needed, so it is assigned to '_' by convention.
_ = glove2word2vec(glove_file, original_file)

Download the already-retrofitted vectors [here](https://drive.google.com/file/d/1sr0xEUzlLtjbrs0NY4-vek60SY7Gk9bQ/view?usp=sharing). They were produced with the code of [Faruqui et al. (2015)](https://github.com/mfaruqui/retrofitting), using the WordNet synonym lexicon and 10 iterations (`-n 10`):

```sh
python retrofit.py -i glove.42B.300d.50K.txt -l lexicons/wordnet-synonyms.txt -n 10 -o glove.42B.300d.50K.txt.retrofit
```
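
For intuition, each retrofitting iteration moves a word to the average of its original vector and the mean of its lexicon neighbors' current vectors. The following is a minimal sketch in plain Python, assuming the uniform weights described in the paper (α = 1 for the original vector, β = 1/degree for each neighbor). It is not the `retrofit.py` script itself, and the function name, toy vectors, and toy lexicon are invented for illustration.

```python
def retrofit_vectors(vectors, lexicon, num_iters=10):
    """Return retrofitted copies of `vectors` (dict: word -> list of floats)."""
    new_vectors = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(num_iters):
        for word, original_vec in vectors.items():
            # Only neighbors that have a vector can pull on the word.
            neighbors = [n for n in lexicon.get(word, []) if n in vectors]
            if not neighbors:
                continue  # words absent from the lexicon keep their vector
            count = len(neighbors)
            updated = []
            for k, orig_k in enumerate(original_vec):
                neighbor_sum = sum(new_vectors[n][k] for n in neighbors)
                # Equal weight on the original vector and on the neighbor mean.
                updated.append((count * orig_k + neighbor_sum) / (2 * count))
            new_vectors[word] = updated
    return new_vectors


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy data (hypothetical): two synonyms and one unrelated word.
toy_vectors = {"hate": [1.0, 0.0], "detest": [0.0, 1.0], "pizza": [1.0, 1.0]}
toy_lexicon = {"hate": ["detest"], "detest": ["hate"]}

retrofitted = retrofit_vectors(toy_vectors, toy_lexicon, num_iters=10)
print(squared_distance(toy_vectors["hate"], toy_vectors["detest"]))
print(squared_distance(retrofitted["hate"], retrofitted["detest"]))
```

In this toy setup the two synonyms end up closer together than they started, while "pizza", which has no lexicon entry, is left unchanged.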

### Step 2: Process the Retrofitted GloVe Vectors

Next, we perform the same conversion for the retrofitted vectors. These vectors have been adjusted using WordNet synonym data to bring related words closer together in the vector space.

In [None]:
# Path to the input file containing the retrofitted GloVe vectors.
# A separate variable name avoids overwriting 'glove_file' from Step 1.
retrofit_glove_file = "../data/glove.42B.300d.50K.txt.retrofit"
# Path for the retrofitted vectors in word2vec format.
retrofit_file = "../data/glove.42B.300d.50K.w2v.txt.retrofit"
# Convert the retrofitted vectors from GloVe format to word2vec format.
_ = glove2word2vec(retrofit_glove_file, retrofit_file)

### Step 3: Load the Vector Models

With both sets of vectors converted, we can now load them into `gensim`'s `KeyedVectors` objects. This will allow us to easily query them for word similarities and other tasks.

In [None]:
# Load the original (non-retrofitted) vectors from the converted file into a KeyedVectors object.
# The 'binary=False' argument specifies that the file is in a text format.
original = KeyedVectors.load_word2vec_format(original_file, binary=False)

Now we load the retrofitted vectors in the same way.

In [None]:
# Load the retrofitted vectors from the converted file.
# This creates a second KeyedVectors object that we can compare with the original.
retrofit = KeyedVectors.load_word2vec_format(retrofit_file, binary=False)

### Step 4: Compare the Models

Explore these two vector sets to see how the retrofitting process has encoded the synonym relationships from WordNet into the vector representations. We will do this by checking the nearest neighbors of a word in both models.
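
"Nearest" here means highest cosine similarity, which is what `most_similar` ranks by. The following is a minimal sketch of that ranking in plain Python, with no `gensim` dependency. The helper names and the two-dimensional toy vectors are made up for illustration and are not taken from the GloVe files.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, topn=10):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (hypothetical values).
toy = {
    "hate": [1.0, 0.0],
    "dislike": [0.9, 0.1],
    "love": [0.6, 0.8],
    "pizza": [0.0, 1.0],
}
neighbors = nearest_neighbors("hate", toy, topn=3)
print([word for word, _ in neighbors])  # prints ['dislike', 'love', 'pizza']
```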

First, let's examine the most similar words to "hate" in the **original** GloVe model. These similarities are based purely on how words co-occur in the training corpus.

In [None]:
# Use the .most_similar() method on the 'original' model to find the top 10 words closest to "hate".
# This shows us the neighbors based on the original distributional semantics.
original.most_similar("hate", topn=10)

Now, let's look at the most similar words to "hate" in the **retrofitted** model. We expect more direct synonyms near the top of this list, because retrofitting explicitly pulled each word toward its WordNet synonyms.

In [None]:
# Use the .most_similar() method on the 'retrofit' model to find the top 10 words closest to "hate".
# Comparing these results to the previous cell highlights the effect of retrofitting.
retrofit.most_similar("hate", topn=10)
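
To make the comparison less dependent on eyeballing two lists, a small helper can report which neighbors were kept, gained, or lost. This is a sketch only: `compare_neighbors` is a hypothetical helper, and the two lists below are invented placeholders in the `(word, similarity)` shape that `most_similar` returns, not actual output from either model.

```python
def compare_neighbors(before, after):
    """Summarize how a neighbor list changed between two models."""
    before_words = [word for word, _ in before]
    after_words = [word for word, _ in after]
    return {
        "kept": [w for w in after_words if w in before_words],
        "gained": [w for w in after_words if w not in before_words],
        "lost": [w for w in before_words if w not in after_words],
    }


# Placeholder lists (made-up words and scores, for illustration only).
before = [("love", 0.7), ("dislike", 0.6), ("hates", 0.5)]
after = [("detest", 0.8), ("dislike", 0.7), ("hates", 0.6)]

diff = compare_neighbors(before, after)
print(diff)
```

In the notebook itself, the same helper could be called as `compare_neighbors(original.most_similar("hate", topn=10), retrofit.most_similar("hate", topn=10))`.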