### **Initial Setup and Library Installation**

This notebook explores word embeddings using the Gensim library. We will train our own Word2Vec model from a small text corpus and then compare its performance with pre-trained GloVe embeddings.

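Before getting started, install the `gensim` library. The version is pinned because the code below uses the Gensim 3.x API (for example, the `size` argument of `Word2Vec`, which Gensim 4.0 renamed to `vector_size`).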

```sh
conda install gensim=3.4.0
```

### **Importing Necessary Libraries**

This cell imports all the required modules and functions.
* `re`: The regular expression library, which we'll use for text cleaning.
* `Word2Vec`: The class from Gensim used to train a Word2Vec model.
* `KeyedVectors`: A class used to store and query word vectors efficiently, often used for loading pre-trained models.
* `glove2word2vec`: A utility script to convert word vectors from the GloVe text format to the Word2Vec format that Gensim can load.
* `datapath`: A helper function to locate test datasets that come bundled with Gensim.

In [None]:
# Import the regular expression module for text processing
import re
# Import the main Word2Vec model class and the KeyedVectors class from gensim
from gensim.models import Word2Vec, KeyedVectors
# Import a utility to convert GloVe embeddings to the Word2Vec format
from gensim.scripts.glove2word2vec import glove2word2vec
# Import a utility to access test datasets included with gensim
from gensim.test.utils import datapath

### **Part 1: Training a New Word2Vec Model**

First, let's train a new Word2Vec model on our own data. This involves two steps:
1.  **Preprocessing the Data**: We'll read a text file, convert it to lowercase, clean it up, and split it into a list of sentences (where each sentence is a list of words). This format is required by Gensim.
2.  **Training the Model**: We'll feed this processed data into the `Word2Vec` model to learn vector representations for the words in the corpus.

#### **Loading and Preprocessing the Corpus**

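Here, we open a text file (`wiki.10K.txt`), read it line by line, and process each line to create a list of tokenized sentences. Tokenization is by whitespace only, so punctuation stays attached to the neighboring word unless the file has already been tokenized.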

In [None]:
# Initialize an empty list to hold our sentences
sentences=[]
# Define the path to our text data file
filename="../data/wiki.10K.txt"
# Open the file for reading (UTF-8 is an assumption about how this file is encoded)
with open(filename, encoding="utf-8") as file:
    # Iterate over each line in the file
    for line in file:
        # Remove surrounding whitespace (including the newline) and convert the line to lowercase
        words=line.strip().lower()
        # Use a regular expression to replace any run of whitespace characters with a single space
        words=re.sub(r"\s+", " ", words)
        # Skip blank lines, which would otherwise add a "sentence" holding only an empty string
        if words:
            # Split the cleaned line into a list of words (tokens) and append it to our sentences list
            sentences.append(words.split(" "))
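
To make the target data shape concrete, here is a minimal sketch of the same preprocessing applied to a few in-memory lines instead of a file. The helper name `tokenize_lines` and the sample lines are made up for illustration; they are not part of the notebook's data.

```python
import re

def tokenize_lines(lines):
    """Lowercase each line, collapse whitespace, and split it into tokens."""
    result = []
    for line in lines:
        # Same cleaning steps as the cell above: trim, lowercase, collapse whitespace
        words = re.sub(r"\s+", " ", line.strip().lower())
        if words:  # skip blank lines
            result.append(words.split(" "))
    return result

# Hypothetical sample input: one line with irregular spacing, one blank line, one padded line
sample_lines = ["The  Actor\twon an award.\n", "\n", "  She is a director. \n"]
toy_sentences = tokenize_lines(sample_lines)
print(toy_sentences)
```

Each inner list is one sentence, which is the list-of-lists shape that `Word2Vec` consumes. Note that `award.` and `director.` keep their trailing period, because splitting on whitespace does not separate punctuation.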

#### **Instantiating and Training the Model**

Now we create an instance of the `Word2Vec` model and train it on our `sentences` data.
* `sentences`: Our corpus, formatted as a list of lists of words.
* `size=100`: The dimensionality of the resulting word vectors will be 100.
* `window=5`: The model considers context words at most 5 positions to the left and 5 positions to the right of the target word.
* `min_count=2`: Words that appear fewer than 2 times in the corpus will be ignored.
* `workers=10`: Use 10 parallel threads to speed up the training process.

In [None]:
# Create and train the Word2Vec model with our specified parameters
model = Word2Vec(
        sentences,      # The corpus to train on
        size=100,       # The desired vector dimension
        window=5,       # The context window size
        min_count=2,    # The minimum word frequency to consider
        workers=10)     # The number of threads to use for training
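
The roles of `min_count` and `window` can be illustrated with a small pure-Python sketch that builds (target, context) pairs from a toy corpus. This is a simplification under stated assumptions, not Gensim's implementation: the function `context_pairs` and the toy corpus are hypothetical, and the real training loop differs in details (for example, it may shrink the window randomly and subsample frequent words).

```python
from collections import Counter

def context_pairs(sentences, window, min_count):
    """Return the kept vocabulary and all (target, context) pairs within the window."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # min_count: words seen fewer than min_count times are dropped entirely
    vocab = {word for word, count in counts.items() if count >= min_count}
    pairs = []
    for sentence in sentences:
        kept = [word for word in sentence if word in vocab]
        for i, target in enumerate(kept):
            # window: look at most `window` positions to each side of the target
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, kept[j]))
    return vocab, pairs

# Hypothetical toy corpus in the same list-of-lists format as `sentences`
toy_corpus = [["the", "actor", "won", "the", "award"], ["the", "actor", "smiled"]]
toy_vocab, toy_pairs = context_pairs(toy_corpus, window=1, min_count=2)
print(sorted(toy_vocab))
print(toy_pairs)
```

With `min_count=2`, only `the` and `actor` survive in this toy corpus, so every training pair involves just those two words. A larger `window` would produce more pairs per target word.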

#### **Accessing and Saving the Trained Vectors**
After training, the learned vectors are stored in the `wv` (word vectors) attribute of the model. We can save these vectors to a file for later use without having to retrain the model.

In [None]:
# Access the KeyedVectors instance containing the trained word vectors
my_trained_vectors = model.wv
# Save the vectors to a text file named 'embeddings.txt'
# binary=False saves them in a human-readable text format
my_trained_vectors.save_word2vec_format('embeddings.txt', binary=False)

#### **Testing Our Trained Model**
Let's test our new model by finding the 10 words whose vectors are closest, by cosine similarity, to the vector we just learned for "actor". The query word must be in the model's vocabulary; a word that was dropped by `min_count` raises a `KeyError`.

In [None]:
# Use the most_similar method to find the top 10 words closest to "actor" in the vector space
my_trained_vectors.most_similar("actor", topn=10)
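
As a rough illustration of what a nearest-neighbor query computes, the sketch below ranks words by cosine similarity using hand-written toy vectors. The helpers `cosine` and `nearest` and the 3-dimensional vectors are invented for this example; real vectors come from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn):
    """Rank every other word by cosine similarity to `word` and keep the top `topn`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    "actor": [0.9, 0.1, 0.0],
    "actress": [0.8, 0.2, 0.1],
    "director": [0.6, 0.5, 0.0],
    "banana": [0.0, 0.1, 0.9],
}
toy_neighbors = nearest("actor", toy_vectors, topn=2)
print(toy_neighbors)
```

The result has the same shape as the output of `most_similar`: a list of `(word, similarity)` tuples sorted from most to least similar.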

### **Part 2: Using Pre-trained GloVe Embeddings**

Word embeddings trained from scratch need a very large corpus to be effective, so a common practice is to use vectors that have already been trained on massive text collections. Here, we'll load pre-trained **GloVe** (Global Vectors for Word Representation) vectors. These 300-dimensional vectors were trained on the "Common Crawl" dataset, which contains 42 billion tokens.

First, download the vectors from [this link](https://drive.google.com/file/d/1n1jt0UIdI3CD26cY1EIeks39XH5S8O8M/view?usp=sharing) and place the file in your `data` directory.

#### **Converting GloVe Vectors to Word2Vec Format**

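The pre-trained GloVe vectors are stored in a slightly different text format than the one `gensim`'s `KeyedVectors` class expects: a Word2Vec text file starts with a header line giving the number of vectors and their dimensionality, and a GloVe file has no such header. We first need to use the `glove2word2vec` utility to convert the file. This will create a new file in the correct format.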

In [None]:
# Define the path to the original downloaded GloVe file
glove_file="../data/glove.42B.300d.50K.txt"
# Define the path for the output file in Word2Vec format
glove_in_w2v_format="../data/glove.42B.300d.50K.w2v.txt"
# Run the conversion utility. The '_' is used to ignore the function's return value.
_ = glove2word2vec(glove_file, glove_in_w2v_format)
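
Conceptually, the conversion amounts to counting the vectors, reading off their dimensionality, and prepending a `<count> <dimensions>` header line. The sketch below shows that idea on a tiny in-memory example; the function `add_word2vec_header` and the two toy vectors are hypothetical, and this is not the code of the Gensim utility itself.

```python
import io

def add_word2vec_header(glove_lines):
    """Return the lines of a Word2Vec-style text file for headerless GloVe-style lines."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    num_vectors = len(rows)
    # Each row is "<word> <v1> <v2> ...", so the dimension is the field count minus one
    num_dims = len(rows[0].split(" ")) - 1
    return ["{} {}".format(num_vectors, num_dims)] + rows

# Hypothetical GloVe-style content: one word and its vector per line, no header
toy_glove = io.StringIO("actor 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n")
toy_w2v_lines = add_word2vec_header(toy_glove)
print("\n".join(toy_w2v_lines))
```

The first printed line is the header (`2 3` for two 3-dimensional vectors); the remaining lines are unchanged.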

#### **Loading the Pre-trained Vectors**
Now we can load the newly converted file into a `KeyedVectors` object. This object provides an efficient way to query the vectors.

In [None]:
# Load the converted GloVe vectors from the text file
# binary=False indicates that the file is in a text format
glove = KeyedVectors.load_word2vec_format(glove_in_w2v_format, binary=False)

#### **Testing the Pre-trained GloVe Model**
Let's run the same test as before: find the 10 most similar words to "actor". Because these GloVe vectors were trained on a much larger and more diverse dataset, we expect the results to be more robust and relevant than those from our small custom model.

In [None]:
# Find the top 10 words most similar to "actor" using the pre-trained GloVe vectors
glove.most_similar("actor", topn=10)

### **Exploring Analogies with Vector Arithmetic**

A fascinating property of word embeddings is that they capture semantic relationships, which can be explored using vector arithmetic. The classic example is "king - man + woman ≈ queen". The `most_similar` function can perform this calculation by specifying which vectors to add (`positive`) and which to subtract (`negative`). Let's test this with a couple of analogies.

#### **Performing Analogy Tasks**

Here, we solve the analogy "Paris is to France as Berlin is to ______?". This translates to the vector expression `france - paris + berlin`; the words are lowercase to match the tokens used in the query.

In [None]:
# Example 1 (commented out): King - Man + Woman
# one="man"
# two="king"
# three="woman"

# Example 2: France - Paris + Berlin
one="paris"     # The vector to subtract
two="france"    # A vector to add
three="berlin"  # The other vector to add

# Find the top 5 words closest to the result of the vector operation: france + berlin - paris
glove.most_similar(positive=[two, three], negative=[one], topn=5)
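
The arithmetic behind an analogy query can be sketched in pure Python with hand-made vectors. The assumptions here are loud ones: the function `analogy`, the 4-dimensional toy vectors, and their "meaning" per dimension are all invented for illustration. The sketch normalizes each vector to unit length, adds the `positive` vectors, subtracts the `negative` ones, and ranks the remaining words by cosine similarity while excluding the query words, which mirrors the general approach rather than reproducing Gensim's code.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn):
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    dims = len(next(iter(unit.values())))
    query = [0.0] * dims
    for word in positive:  # vectors to add
        query = [q + x for q, x in zip(query, unit[word])]
    for word in negative:  # vectors to subtract
        query = [q - x for q, x in zip(query, unit[word])]
    query = normalize(query)
    exclude = set(positive) | set(negative)
    # Dot product of unit vectors equals cosine similarity
    scores = [(word, sum(q * x for q, x in zip(query, vec)))
              for word, vec in unit.items() if word not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical dimensions: [french, german, italian, capital-city]
toy_vectors = {
    "paris": [1.0, 0.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0, 0.0],
    "berlin": [0.0, 1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0, 0.0],
    "rome": [0.0, 0.0, 1.0, 1.0],
    "italy": [0.0, 0.0, 1.0, 0.0],
}
toy_answer = analogy(toy_vectors, positive=["france", "berlin"], negative=["paris"], topn=1)
print(toy_answer)
```

In this toy space, subtracting `paris` from `berlin` cancels the capital-city component, and what remains points mostly in the "german" direction, so `germany` ranks first.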

### **Part 3: Evaluating Embedding Quality**

We can also perform an *intrinsic evaluation* to quantitatively measure the quality of our word vectors. We do this by comparing the similarity scores our model assigns to word pairs against similarity scores assigned by humans. The **WordSim353** dataset is a standard benchmark for this task. It contains 353 pairs of words, each with a human-assigned similarity score.

#### **Evaluating the Pre-trained GloVe Vectors**

The `evaluate_word_pairs` method calculates the cosine similarity for each word pair from the WordSim353 dataset that is present in our model's vocabulary. It then computes the Pearson and Spearman correlation coefficients between our model's similarities and the human scores, and returns them together with the share of pairs that were skipped because a word was out of vocabulary. A higher correlation indicates that our model's understanding of word similarity aligns better with human judgment.

In [None]:
# Evaluate the GloVe vectors against the WordSim353 dataset
# datapath('wordsim353.tsv') provides the path to the test dataset
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))
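
To show what the Spearman coefficient measures, here is a minimal pure-Python sketch: it ranks both score lists and then computes the Pearson correlation of the ranks. The helper names and the four score pairs are made up for illustration; they are not taken from WordSim353 or from either model.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for four word pairs (not real benchmark data)
human_scores = [9.0, 7.5, 3.0, 1.0]      # human similarity ratings
model_scores = [0.80, 0.85, 0.30, 0.10]  # cosine similarities from a model
toy_rho = spearman(human_scores, model_scores)
print(round(toy_rho, 3))
```

Because Spearman depends only on the ordering, the model is penalized here solely for swapping the order of the two most similar pairs, not for the scale of its scores.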

#### **Evaluating Our Custom-Trained Vectors**

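Finally, let's run the same evaluation on the model we trained ourselves. We expect the correlation score to be significantly lower than the GloVe model's score because our model was trained on a much smaller dataset, limiting its ability to learn nuanced semantic relationships. Pairs containing an out-of-vocabulary word are skipped by default, so the two scores may also be computed over different subsets of the 353 pairs.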

In [None]:
# Evaluate our custom-trained vectors against the WordSim353 dataset
my_trained_vectors.evaluate_word_pairs(datapath('wordsim353.tsv'))