# Lab 3

## Exploring Word2Vec with Gensim

In this notebook we will train and explore Word2Vec embeddings using gensim's API. Unlike Lab 1, where we built CBOW from scratch, here we focus on using Word2Vec in practice and exploring the embedding space.

**Key Learning Objectives:**
- Train Word2Vec models using gensim
- Understand vocabulary indexing and word-to-vector mapping
- Extract and manipulate the embedding matrix
- Calculate word similarities and perform vector arithmetic
- Apply embeddings to document similarity tasks

**Dataset:** Yelp reviews dataset (same as previous labs)

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook for this lab.

Follow the instructions. Good luck!

In [None]:
!nvidia-smi

In [None]:
# Install required packages with version constraints for compatibility
# numpy<2 is required for gensim compatibility
# gensim==4.2.0 is the version used throughout this course
!pip install 'numpy<2' 'gensim==4.2.0' textblob smart_open

In [None]:
import gensim
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
import warnings

# Seed NumPy's global RNG for reproducibility
# (gensim's Word2Vec takes its own `seed` argument)
np.random.seed(42)
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Configuration parameters
embedding_dim = 100  # Dimensionality of word vectors
window_size = 5      # Context window size (words before and after)
min_count = 2        # Ignore words that appear fewer than this many times
workers = 4          # Number of parallel threads

## Data Acquisition

We'll use the same Yelp reviews dataset from previous labs. The download script checks if the file already exists before downloading.

In [None]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  # Quote the URL so the shell does not treat '&' as a background operator
  wget -O yelp.csv 'https://www.dropbox.com/scl/fi/dr6xmgw59kliq74gcd340/yelp.csv?rlkey=la6ue9a899v54f04eu92lbmlx&st=fld39cyt&dl=0'
fi

In [None]:
# Load the Yelp dataset
path = './yelp.csv'
yelp = pd.read_csv(path)

# Create a DataFrame with only 5-star and 1-star reviews (extremes for better contrast)
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# Extract text and labels
X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})

# Preview the data
print(f"Total reviews: {len(X)}")
print(f"\nSample review:\n{X.iloc[0][:200]}...")
yelp_best_worst.head()

In [None]:
# Demo: Preprocess text using gensim's simple_preprocess
# simple_preprocess lowercases the text, tokenizes it, and by default
# drops tokens shorter than 2 or longer than 15 characters
# It returns a list of tokens (words)

def preprocess_text(text):
    """
    Preprocess a single text document.
    
    Args:
        text: Input text string
    
    Returns:
        List of tokens (words)
    """
    return gensim.utils.simple_preprocess(str(text))

# Example preprocessing
sample_text = X.iloc[0]
print(f"Original text:\n{sample_text[:150]}...\n")
print(f"Preprocessed tokens:\n{preprocess_text(sample_text)[:20]}...")

# Preprocess the entire corpus
# Each review becomes a list of tokens; preprocess_text casts its input to str,
# so every row is kept and processed_corpus[i] stays aligned with X.iloc[i]
processed_corpus = [preprocess_text(text) for text in X]

print(f"\nTotal documents in corpus: {len(processed_corpus)}")
print(f"First document (first 15 tokens): {processed_corpus[0][:15]}")

### Lab Exercise 1: Train Your Own Word2Vec Model

Now it's your turn! Train a Word2Vec model with different parameters to see how they affect the results.

**Tasks:**
1. Filter out very short documents (fewer than 5 tokens) from the preprocessed corpus
2. Train a Word2Vec model using CBOW instead of Skip-gram
3. Use a smaller embedding dimension (50) and larger window (10)
4. Print the vocabulary size and compare with the demo model

In [None]:
# Lab Exercise 1: Train your own Word2Vec model

# Step 1: Filter corpus to keep only documents with at least 5 words
filtered_corpus = None  # FILL: Use list comprehension to filter processed_corpus

print(f"Original corpus size: {len(processed_corpus)}")
print(f"Filtered corpus size: {len(filtered_corpus) if filtered_corpus is not None else 'Not implemented yet'}")

# Step 2: Train Word2Vec with CBOW
# FILL: Create a Word2Vec model with:
# - sentences: filtered_corpus
# - vector_size: 50
# - window: 10
# - min_count: 2
# - workers: 4
# - sg: 0 (for CBOW)
# - epochs: 10

model_lab = None  # FILL

# Verification
if model_lab is not None:
    print(f"\nYour model vocabulary size: {len(model_lab.wv.key_to_index)} words")
    print(f"Your model vector dimensions: {model_lab.wv.vector_size}")
    print(f"Training algorithm: {'Skip-gram' if model_lab.sg == 1 else 'CBOW'}")

In [None]:
# Demo: Vocabulary exploration

# Get word-to-index mapping
word_to_idx = model.wv.key_to_index
idx_to_word = model.wv.index_to_key

# Example: Find index of specific words
example_words = ['pizza', 'good', 'restaurant', 'food']
print("Word to Index mapping:")
for word in example_words:
    if word in word_to_idx:
        idx = word_to_idx[word]
        print(f"  '{word}' -> index {idx}")

# Example: Find words at specific indices
print("\nIndex to Word mapping:")
for idx in [0, 1, 100, 500]:
    if idx < len(idx_to_word):
        word = idx_to_word[idx]
        print(f"  index {idx} -> '{word}'")

# Most frequent words are at the beginning
print(f"\nMost frequent words (first 20): {idx_to_word[:20]}")

In [None]:
# Demo: Extract embedding matrix and word vectors

# Get the full embedding matrix (vocab_size x vector_size)
embedding_matrix = model.wv.vectors
print(f"Embedding matrix shape: {embedding_matrix.shape}")
print(f"This means {embedding_matrix.shape[0]} words, each with {embedding_matrix.shape[1]} dimensions\n")

# Get vector for a specific word (two ways)
word = 'pizza'

# Method 1: Direct access using word
vector_method1 = model.wv[word]

# Method 2: Using index
word_idx = model.wv.key_to_index[word]
vector_method2 = embedding_matrix[word_idx]

print(f"Vector for '{word}' (Method 1 - direct):")
print(f"  First 10 values: {vector_method1[:10]}")
print(f"\nVector for '{word}' (Method 2 - via index):")
print(f"  First 10 values: {vector_method2[:10]}")
print(f"\nVectors are identical: {np.allclose(vector_method1, vector_method2)}")

### Lab Exercise 2: Explore Vocabulary and Extract Vectors

Practice working with vocabulary mappings and the embedding matrix.

**Tasks:**
1. Find the indices for the words: 'delicious', 'terrible', 'service', 'atmosphere'
2. Get the word at index 50 and index 200
3. Extract the embedding matrix and get the full vectors for 'good' and 'bad'
4. Calculate the Euclidean distance between 'good' and 'bad' vectors

In [None]:
# Lab Exercise 2: Vocabulary and vector extraction

# Task 1: Find indices for specific words
words_to_find = ['delicious', 'terrible', 'service', 'atmosphere']
print("Task 1: Word to index mapping")
for word in words_to_find:
    idx = None  # FILL: Get the index for this word from model.wv.key_to_index
    if idx is not None:
        print(f"  '{word}' -> index {idx}")

# Task 2: Find words at specific indices
print("\nTask 2: Index to word mapping")
word_at_50 = None   # FILL: Get word at index 50
word_at_200 = None  # FILL: Get word at index 200
print(f"  Index 50 -> '{word_at_50}'")
print(f"  Index 200 -> '{word_at_200}'")

# Task 3: Extract embedding matrix and get vectors for 'good' and 'bad'
embedding_matrix = None  # FILL: Get the embedding matrix from model.wv.vectors
vector_good = None       # FILL: Get vector for 'good'
vector_bad = None        # FILL: Get vector for 'bad'

if vector_good is not None and vector_bad is not None:
    print(f"\nTask 3: Extracted vectors")
    print(f"  'good' vector shape: {vector_good.shape}")
    print(f"  'bad' vector shape: {vector_bad.shape}")
    
    # Task 4: Calculate Euclidean distance
    euclidean_dist = None  # FILL: Use np.linalg.norm to calculate distance
    print(f"\nTask 4: Euclidean distance between 'good' and 'bad': {euclidean_dist}")

In [None]:
# Demo: Find most similar words

# Find words similar to 'pizza'
print("Most similar words to 'pizza':")
similar_to_pizza = model.wv.most_similar('pizza', topn=5)
for word, score in similar_to_pizza:
    print(f"  {word}: {score:.4f}")

# Find words similar to 'delicious'
print("\nMost similar words to 'delicious':")
similar_to_delicious = model.wv.most_similar('delicious', topn=5)
for word, score in similar_to_delicious:
    print(f"  {word}: {score:.4f}")

# Calculate similarity between word pairs
word_pairs = [('good', 'great'), ('good', 'bad'), ('pizza', 'burger'), ('pizza', 'service')]
print("\nWord pair similarities:")
for w1, w2 in word_pairs:
    try:
        sim = model.wv.similarity(w1, w2)
        print(f"  '{w1}' <-> '{w2}': {sim:.4f}")
    except KeyError:
        print(f"  '{w1}' <-> '{w2}': Word not in vocabulary")

In [None]:
# Demo: Vector arithmetic (analogies)

# Analogy: "good" is to "great" as "bad" is to ?
# Formula: great - good + bad ≈ ?
print("Analogy 1: 'good' is to 'great' as 'bad' is to ___")
try:
    result = model.wv.most_similar(positive=['great', 'bad'], negative=['good'], topn=3)
    for word, score in result:
        print(f"  {word}: {score:.4f}")
except KeyError:
    print("  Some words not in vocabulary")

# Analogy: "delicious" is to "food" as "friendly" is to ?
print("\nAnalogy 2: 'delicious' is to 'food' as 'friendly' is to ___")
try:
    result = model.wv.most_similar(positive=['friendly', 'food'], negative=['delicious'], topn=3)
    for word, score in result:
        print(f"  {word}: {score:.4f}")
except KeyError:
    print("  Some words not in vocabulary")

# Analogy: restaurant context
print("\nAnalogy 3: 'excellent' is to 'restaurant' as 'wonderful' is to ___")
try:
    result = model.wv.most_similar(positive=['wonderful', 'restaurant'], negative=['excellent'], topn=3)
    for word, score in result:
        print(f"  {word}: {score:.4f}")
except KeyError:
    print("  Some words not in vocabulary")

## Section 3: Word Similarity & Vector Arithmetic

### Understanding Semantic Similarity

Word2Vec captures semantic relationships in vector space. Words that appear in similar contexts, and therefore tend to have similar meanings, get vectors that point in similar directions (measured by cosine similarity).

**Key operations:**

- `most_similar(word, topn=N)`: Find N most similar words
- `similarity(word1, word2)`: Calculate cosine similarity between two words
- **Vector arithmetic**: `king - man + woman ≈ queen`

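This arithmetic works when the model has learned a consistent direction for a relationship (e.g., a "gender" direction). The classic example comes from models trained on very large corpora, so expect noisier analogies from a model trained only on Yelp reviews.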
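
To see the mechanics without a trained model, here is a small sketch with hand-made 2-dimensional vectors (the values and the `toy` dictionary are invented for illustration). It computes cosine similarity directly and solves the analogy by picking the nearest remaining word. It is a simplification: gensim's `most_similar` also normalizes the input vectors to unit length before combining them.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "pizza": np.array([-1.0, 0.2]),
}

# king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Rank the remaining words by cosine similarity, excluding the inputs
inputs = {"king", "man", "woman"}
candidates = {w: cosine(target, v) for w, v in toy.items() if w not in inputs}
best = max(candidates, key=candidates.get)

print(f"Nearest word: {best} (cosine {candidates[best]:.4f})")
```

Excluding the input words matters: without that step, one of the inputs is often the closest vector to the result.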

**Demo: Similarity and Analogies**

## Section 2: Vocabulary & Embedding Matrix

### Understanding Vocabulary Indexing

Word2Vec builds a vocabulary from the training corpus and assigns each word a unique index. This creates a mapping:

- **word → index**: Use `model.wv.key_to_index` dictionary
- **index → word**: Use `model.wv.index_to_key` list

The **embedding matrix** stores all word vectors in a 2D array with shape `(vocab_size, vector_size)`. Each row corresponds to one word's vector.
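
The relationship between the two mappings and the matrix can be illustrated with a toy vocabulary. This is a sketch with made-up words and values: plain Python and NumPy objects stand in for `model.wv.index_to_key`, `model.wv.key_to_index`, and `model.wv.vectors`, and the local variable names are illustrative.

```python
import numpy as np

# index -> word: a list, so position in the list is the index
index_to_key = ["the", "food", "pizza"]

# word -> index: the inverse mapping, as a dictionary
key_to_index = {word: i for i, word in enumerate(index_to_key)}

# Embedding matrix with shape (vocab_size, vector_size) = (3, 4)
vectors = np.array([
    [0.1, 0.2, 0.3, 0.4],   # row 0: "the"
    [0.5, 0.6, 0.7, 0.8],   # row 1: "food"
    [0.9, 1.0, 1.1, 1.2],   # row 2: "pizza"
])

# Looking up a word's vector is a row lookup by index
row = key_to_index["pizza"]
pizza_vector = vectors[row]

print(f"Matrix shape: {vectors.shape}")
print(f"'pizza' is at row {row}: {pizza_vector}")
print(f"Row {row} maps back to '{index_to_key[row]}'")
```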

**Demo: Exploring Vocabulary and Vectors**

## Section 1: Data Preparation & Training Word2Vec

### Understanding Word2Vec

Word2Vec is a technique to learn word embeddings by predicting context. There are two architectures:

- **CBOW (Continuous Bag of Words)**: Predicts target word from context words
- **Skip-gram**: Predicts context words from target word (generally better for semantic tasks and rare words, but slower to train)
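
To make the difference concrete, the sketch below enumerates the training examples each architecture would see for one toy sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) are hypothetical and not part of gensim, and the sketch uses a fixed window for clarity.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: one example per position, (context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """Skip-gram: one example per (target word, context word) pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

sentence = ["the", "pizza", "was", "really", "good"]
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)

print(cbow[2])       # (['the', 'pizza', 'really', 'good'], 'was')
print(skipgram[:2])  # [('the', 'pizza'), ('the', 'was')]
print(len(cbow), len(skipgram))  # 5 14
```

Skip-gram produces many more examples from the same sentence, one per word pair, which is one reason it is slower to train.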

Gensim's `Word2Vec` class builds the vocabulary and trains the model in a single call. Key parameters:

- `vector_size`: Dimensionality of word vectors (e.g., 100, 300)
- `window`: Maximum distance between current and predicted word
- `min_count`: Ignores words with frequency less than this
- `workers`: Number of CPU threads for training
- `sg`: Training algorithm (0=CBOW, 1=Skip-gram)

**Demo: Preprocessing and Training**

First, we need to preprocess the text. Word2Vec expects a list of sentences, where each sentence is a list of tokens (words).

In [None]:
!bash get_data.sh

### Lab Exercise 3: Word Similarity and Analogies

Explore semantic relationships in the embedding space.

**Tasks:**
1. Find the top 10 most similar words to 'restaurant'
2. Calculate similarity scores between: ('amazing', 'awesome'), ('amazing', 'terrible'), ('chicken', 'burger')
3. Create an analogy: "lunch" is to "dinner" as "breakfast" is to ___
4. Create your own custom analogy using food/restaurant-related words

In [None]:
# Lab Exercise 3: Similarity and analogies

# Task 1: Find top 10 similar words to 'restaurant'
print("Task 1: Most similar to 'restaurant':")
similar_restaurant = None  # FILL: Use model.wv.most_similar with topn=10
if similar_restaurant:
    for word, score in similar_restaurant:
        print(f"  {word}: {score:.4f}")

# Task 2: Calculate similarities for word pairs
word_pairs = [('amazing', 'awesome'), ('amazing', 'terrible'), ('chicken', 'burger')]
print("\nTask 2: Word pair similarities:")
for w1, w2 in word_pairs:
    sim = None  # FILL: Use model.wv.similarity
    if sim is not None:
        print(f"  '{w1}' <-> '{w2}': {sim:.4f}")

# Task 3: Analogy - "lunch" is to "dinner" as "breakfast" is to ___
print("\nTask 3: Analogy - 'lunch' is to 'dinner' as 'breakfast' is to ___")
analogy_result = None  # FILL: Use model.wv.most_similar with positive and negative parameters
if analogy_result:
    for word, score in analogy_result[:3]:
        print(f"  {word}: {score:.4f}")

# Task 4: Your custom analogy
# Example structure: "coffee" is to "morning" as "wine" is to ___
print("\nTask 4: Custom analogy - YOUR CHOICE")
# FILL: Create your own analogy
# custom_analogy = model.wv.most_similar(positive=[?, ?], negative=[?], topn=3)
# Print results

## Section 4: Document Similarity

### From Words to Documents

Individual word vectors are useful, but how do we represent entire documents? A simple approach is to **average all word vectors** in a document.

While this loses word order information, it's surprisingly effective for tasks like:
- Document similarity
- Document clustering
- Simple document classification

**Formula:** For a document whose in-vocabulary words are `w1, w2, ..., wn` (words the model has never seen are skipped):

```python
doc_vector = (vector(w1) + vector(w2) + ... + vector(wn)) / n
```

**Demo: Document Similarity**

In [None]:
# Demo: Convert documents to vectors by averaging word embeddings

def document_to_vector(doc_tokens, model):
    """
    Convert a document (list of tokens) to a vector by averaging word embeddings.
    
    Args:
        doc_tokens: List of tokens (words)
        model: Trained Word2Vec model
    
    Returns:
        numpy array: Document vector (averaged word vectors)
    """
    # Filter tokens that exist in vocabulary
    valid_tokens = [token for token in doc_tokens if token in model.wv]
    
    if not valid_tokens:
        # If no valid tokens, return zero vector
        return np.zeros(model.wv.vector_size)
    
    # Get vectors for all valid tokens and average them
    word_vectors = [model.wv[token] for token in valid_tokens]
    doc_vector = np.mean(word_vectors, axis=0)
    
    return doc_vector

# Test with a few sample reviews
sample_indices = [0, 1, 2]
doc_vectors = []

print("Converting documents to vectors:\n")
for idx in sample_indices:
    tokens = processed_corpus[idx]
    vec = document_to_vector(tokens, model)
    doc_vectors.append(vec)
    print(f"Document {idx}:")
    print(f"  Tokens (first 10): {tokens[:10]}")
    print(f"  Vector shape: {vec.shape}")
    print(f"  Vector (first 5 values): {vec[:5]}\n")

doc_vectors = np.array(doc_vectors)

In [None]:
# Demo: Calculate document similarity

# Calculate cosine similarity between document vectors
similarity_matrix = cosine_similarity(doc_vectors)

print("Document similarity matrix:")
print("(Shows cosine similarity between documents 0, 1, and 2)\n")
print(similarity_matrix)

print("\n\nInterpretation:")
print(f"  Doc 0 vs Doc 1: {similarity_matrix[0][1]:.4f}")
print(f"  Doc 0 vs Doc 2: {similarity_matrix[0][2]:.4f}")
print(f"  Doc 1 vs Doc 2: {similarity_matrix[1][2]:.4f}")

# Show the actual reviews for context
print("\n\nActual reviews (first 100 chars):")
for idx in sample_indices:
    print(f"  Doc {idx}: {X.iloc[idx][:100]}...")

### Lab Exercise 4: Build a Document Similarity System

Create a simple document retrieval system using averaged word embeddings.

**Tasks:**
1. Convert the first 100 documents from the corpus to document vectors
2. Pick a query document (e.g., document 0) and find the 5 most similar documents
3. Print the query document and the top 5 similar documents
4. Verify that the most similar documents make semantic sense

In [None]:
# Lab Exercise 4: Document similarity system

# Task 1: Convert first 100 documents to vectors
num_docs = 100
all_doc_vectors = []

for i in range(num_docs):
    vec = None  # FILL: Use document_to_vector function
    if vec is not None:
        all_doc_vectors.append(vec)

all_doc_vectors = np.array(all_doc_vectors)
print(f"Converted {len(all_doc_vectors)} documents to vectors")
print(f"Document vector matrix shape: {all_doc_vectors.shape}")

# Task 2: Find 5 most similar documents to a query document
query_idx = 0  # Using document 0 as query
query_vector = None  # FILL: Get the vector for the query document

# Calculate similarities between query and all other documents
# (cosine_similarity expects 2D inputs, e.g. query_vector.reshape(1, -1))
similarities = None  # FILL: Use cosine_similarity to compare query_vector with all_doc_vectors

# Get indices of top 5 most similar documents (excluding the query itself)
# FILL: Use np.argsort to sort similarities and get top 5 indices
top_5_indices = None

# Task 3: Print results
if top_5_indices is not None:
    print(f"\nQuery document (index {query_idx}):")
    print(f"  {X.iloc[query_idx][:150]}...\n")
    
    print("Top 5 most similar documents:")
    for rank, idx in enumerate(top_5_indices, 1):
        if idx < len(X):
            sim_score = None  # FILL: Get similarity score
            print(f"\n{rank}. Document {idx} (similarity: {sim_score:.4f}):")
            print(f"   {X.iloc[idx][:150]}...")

## Summary

Congratulations! You've explored Word2Vec embeddings using gensim and learned how to:

1. **Train Word2Vec models** with different architectures (Skip-gram, CBOW) and parameters
2. **Navigate the vocabulary** using word-to-index and index-to-word mappings
3. **Access the embedding matrix** and extract word vectors
4. **Calculate word similarities** and perform vector arithmetic (analogies)
5. **Build document representations** by averaging word vectors
6. **Create a simple document retrieval system** using cosine similarity

### Key Takeaways

- **Word2Vec captures semantic relationships** in vector space
- **Skip-gram** typically performs better for semantic tasks than CBOW
- **Larger embeddings** (e.g., 300d) can capture more nuance but require more data and training time
- **Document averaging** is simple but effective for many tasks
- **Word2Vec paved the way** for later embedding techniques (GloVe, FastText, and the contextual embeddings used in transformers)

### Next Steps

- Try training on larger corpora for better quality embeddings
- Experiment with pre-trained embeddings (GloVe, FastText)
- Explore more sophisticated document representations (Doc2Vec, sentence transformers)
- Use embeddings as features for classification or clustering tasks

## Optional: Save and Load Models

Word2Vec models can be saved and loaded for reuse. This is useful when working with large corpora that take time to train.

In [None]:
# Optional: Save the trained model
model.save("word2vec_yelp.model")
print("Model saved to 'word2vec_yelp.model'")

# Optional: Load the model later
# loaded_model = Word2Vec.load("word2vec_yelp.model")
# print(f"Model loaded! Vocabulary size: {len(loaded_model.wv.key_to_index)}")

# Optional: Save just the word vectors (smaller file, but can't continue training)
model.wv.save("word2vec_yelp.wordvectors")
print("Word vectors saved to 'word2vec_yelp.wordvectors'")

# Optional: Load just word vectors
# from gensim.models import KeyedVectors
# loaded_wv = KeyedVectors.load("word2vec_yelp.wordvectors")
# print(loaded_wv.most_similar('pizza', topn=5))