# Lab 3

## Exploring Word2Vec with Gensim

In this notebook we will train and explore Word2Vec embeddings using gensim. We'll learn how word embeddings capture semantic relationships and how to use them for similarity tasks.

Key concepts:
- Training Word2Vec models (Skip-gram vs CBOW)
- Vocabulary and embedding matrix
- Word similarity and analogies
- Document similarity using averaged word vectors

You can run this lab either locally or in Colab.
- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook for this lab.

Follow the instructions. Good luck!

In [None]:
!nvidia-smi

In [None]:
# CRITICAL: Version constraints for compatibility
# These versions are tested and required for this course
!pip install 'numpy<2' \
             'gensim==4.2.0' \
             textblob \
             smart_open

In [None]:
import gensim
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Set random seeds for reproducibility
np.random.seed(42)

# Configuration
embedding_dim = 100
window_size = 5
min_count = 2
workers = 4

## Simple Word2Vec Demo

Before working with real data, let's see Word2Vec in action on a tiny example. This will help you understand:
- How gensim builds a vocabulary
- How words are mapped to indices
- How to access the embedding matrix
- How each word gets its vector representation

In [None]:
# Create a simple corpus of 10 sentences
simple_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies",
    "dogs and cats fight sometimes",
    "the mat is comfortable",
    "the log is wooden",
    "comfortable mats are great",
    "wooden logs are heavy",
    "the cat loves the mat",
    "the dog loves the log"
]

# Tokenize each sentence (split by spaces and lowercase)
tokenized_corpus = [sentence.lower().split() for sentence in simple_corpus]

print(f"Number of sentences: {len(tokenized_corpus)}")
print(f"First sentence tokens: {tokenized_corpus[0]}")
print(f"Second sentence tokens: {tokenized_corpus[1]}")

In [None]:
# Train a simple Word2Vec model
# Using smaller dimensions (20) for this toy example
simple_model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=20,      # Small embedding dimension for demo
    window=3,            # Context window
    min_count=1,         # Include all words (even if they appear once)
    sg=1,                # Skip-gram
    epochs=100           # More epochs for small dataset
)

print(f"Model trained!")
print(f"Vocabulary size: {len(simple_model.wv.key_to_index)}")
print(f"Embedding dimension: {simple_model.wv.vector_size}")

In [None]:
# Explore the vocabulary
print("Vocabulary (word to index mapping):")
for word, idx in simple_model.wv.key_to_index.items():
    print(f"  '{word}' -> index {idx}")

# Most frequent words
print(f"\nMost frequent words: {simple_model.wv.index_to_key[:5]}")

In [None]:
# Get the full embedding matrix
embedding_matrix = simple_model.wv.vectors
print(f"Embedding matrix shape: {embedding_matrix.shape}")
print(f"  -> {embedding_matrix.shape[0]} words, each represented by {embedding_matrix.shape[1]} dimensions\n")

# Method 1: Get vector by word directly
word = "cat"
vector_by_word = simple_model.wv[word]
print(f"Vector for '{word}' (by word): {vector_by_word[:5]}... (showing first 5 values)")

# Method 2: Get vector by index from embedding matrix
word_index = simple_model.wv.key_to_index[word]
vector_by_index = embedding_matrix[word_index]
print(f"Vector for '{word}' (by index {word_index}): {vector_by_index[:5]}... (showing first 5 values)")

# Verify they're the same
print(f"\nAre they identical? {np.allclose(vector_by_word, vector_by_index)}")

# Show similarity
similar_words = simple_model.wv.most_similar(word, topn=3)
print(f"\nWords most similar to '{word}':")
for w, score in similar_words:
    print(f"  {w}: {score:.4f}")

## Working with Real Data

Now that you understand the basics, let's train Word2Vec on a real dataset of Yelp reviews!

In [None]:
%%writefile get_data.sh
# Quote the URL so the shell does not treat '&' as a background operator
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv "https://www.dropbox.com/scl/fi/dr6xmgw59kliq74gcd340/yelp.csv?rlkey=la6ue9a899v54f04eu92lbmlx&st=fld39cyt&dl=0"
fi

In [None]:
!bash get_data.sh

In [None]:
path = './yelp.csv'
yelp = pd.read_csv(path)

# Create a DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})

print(f"Total reviews: {len(X)}")
print(f"Sample: {X.iloc[0][:100]}...")

Word2Vec expects text as a list of sentences, where each sentence is a list of tokens (words). We need to preprocess the raw text by tokenizing and lowercasing it.
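To make the expected input shape concrete, here is a minimal standard-library sketch of the tokenize-and-lowercase step. `rough_preprocess` is a hypothetical helper, not gensim's implementation: it only approximates what `gensim.utils.simple_preprocess` does, and its regular expression and length bounds are assumptions chosen for illustration.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, keep alphabetic tokens, drop very short or very long ones.

    Illustrative approximation only; the length bounds are assumptions.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

reviews = ["The pizza was GREAT, 10/10!", "A bit pricey..."]

# A corpus is a list of documents; each document is a list of tokens.
corpus = [rough_preprocess(review) for review in reviews]
print(corpus)
```

The result is a list of token lists, which is the structure the `sentences` argument of `Word2Vec` takes in the demo above.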

In [None]:
# FILL: Preprocess text using gensim's simple_preprocess
def preprocess_text(text):
    """Tokenize and preprocess text"""
    return None  # FILL: Use gensim.utils.simple_preprocess

# Preprocess the entire corpus
processed_corpus = None  # FILL: Use list comprehension to preprocess all texts

print(f"Total documents: {len(processed_corpus) if processed_corpus else 'Not done yet'}")

Now we'll train a Word2Vec model. Word2Vec has two architectures:
- **Skip-gram**: Predicts context words from a target word (better for semantic relationships)
- **CBOW**: Predicts target word from context words (faster training)

Key parameters:
- `vector_size`: Dimension of word vectors (we use 100)
- `window`: How many words before/after to consider as context (we use 5)
- `min_count`: Ignore words that appear fewer than this many times (we use 2)
- `sg`: 1 for Skip-gram, 0 for CBOW
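The difference between the two architectures is easiest to see in the training examples each one draws from a sentence. The sketch below is illustrative only: `skipgram_pairs` and `cbow_examples` are hypothetical helpers, they use a fixed window, and they omit everything else a real trainer does (for example negative sampling and frequent-word subsampling).

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs: the target predicts one neighbour at a time."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, target) examples: all neighbours jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples

tokens = ["the", "cat", "sat"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (target, neighbour) pair, while CBOW produces one example per position. That gives CBOW fewer updates per sentence, which fits the note that it trains faster.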

In [None]:
# FILL: Train Word2Vec model
# Use the Word2Vec class with these parameters:
# - sentences: processed_corpus
# - vector_size: embedding_dim
# - window: window_size
# - min_count: min_count
# - workers: workers
# - sg: 1 (for Skip-gram)
# - epochs: 10
model = None  # FILL

if model:
    print(f"Model trained!")
    print(f"Vocabulary size: {len(model.wv.key_to_index)}")
    print(f"Vector dimensions: {model.wv.vector_size}")

The model builds a vocabulary from the corpus, assigning each word a unique index. We can access:
- `model.wv.key_to_index`: Dictionary mapping words to indices
- `model.wv.index_to_key`: List mapping indices to words (sorted by frequency)
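As a rough picture of how these two mappings relate, the following sketch builds them by hand from a tiny tokenized corpus. `build_vocab` is a hypothetical helper written for illustration; it is not how gensim constructs its vocabulary internally, and tie-breaking between equally frequent words may differ.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Count tokens, drop rare ones, and index the rest by descending frequency."""
    counts = Counter(tok for sent in sentences for tok in sent)
    index_to_key = [word for word, count in counts.most_common() if count >= min_count]
    key_to_index = {word: idx for idx, word in enumerate(index_to_key)}
    return key_to_index, index_to_key

sentences = [["the", "cat"], ["the", "dog"], ["the", "cat"]]
key_to_index, index_to_key = build_vocab(sentences, min_count=2)

print(index_to_key)   # "dog" appears once, so min_count=2 filters it out
print(key_to_index)
```

The two structures are inverses of each other: `index_to_key[key_to_index[w]] == w` for every word `w` in the vocabulary.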

In [None]:
# Explore vocabulary
word_to_idx = model.wv.key_to_index
idx_to_word = model.wv.index_to_key

print("Word to Index examples:")
for word in ['pizza', 'good', 'restaurant']:
    if word in word_to_idx:
        print(f"  '{word}' -> index {word_to_idx[word]}")

print(f"\nMost frequent words: {idx_to_word[:10]}")

The **embedding matrix** stores all word vectors. It's a 2D array of shape `(vocab_size, vector_size)` where each row is a word's vector representation. We can access it via `model.wv.vectors` or get individual word vectors via `model.wv['word']`.
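The row-lookup idea can be sketched with plain NumPy, without a trained model. The vocabulary, the vector size of 4, and the random values below are all made up for illustration; a real model's `vectors` array would hold learned values.

```python
import numpy as np

# Hypothetical three-word vocabulary and its index mapping
index_to_key = ["the", "pizza", "good"]
key_to_index = {word: idx for idx, word in enumerate(index_to_key)}

# Stand-in embedding matrix of shape (vocab_size, vector_size)
vector_size = 4
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(index_to_key), vector_size)).astype(np.float32)

# Looking up a word is just selecting its row
row = key_to_index["pizza"]
pizza_vector = vectors[row]

print(vectors.shape)       # (3, 4)
print(pizza_vector.shape)  # (4,)
```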

In [None]:
# Get the embedding matrix
embedding_matrix = model.wv.vectors
print(f"Embedding matrix shape: {embedding_matrix.shape}")

# Get vector for a word
word = 'pizza'
vector = model.wv[word]
print(f"\nVector for '{word}' (first 10 values): {vector[:10]}")

Word2Vec captures semantic similarity - words with similar meanings have vectors that are close together (measured by cosine similarity). The `most_similar()` method finds words with the highest cosine similarity.
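Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them and not on their magnitudes. A minimal sketch, using a hypothetical `cosine` helper and hand-picked vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine([1, 2], [2, 4])   # parallel vectors, different lengths
orthogonal = cosine([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine([1, 0], [-1, 0])        # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.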

In [None]:
# FILL: Find most similar words to 'pizza'
# Use model.wv.most_similar with topn=5
similar = None  # FILL

if similar:
    print("Most similar to 'pizza':")
    for word, score in similar:
        print(f"  {word}: {score:.4f}")

# FILL: Calculate similarity between 'good' and 'great'
sim = None  # FILL: Use model.wv.similarity
print(f"\nSimilarity 'good' vs 'great': {sim}")

Word2Vec supports vector arithmetic! The famous example is: `king - man + woman ≈ queen`. This works because the model learns directional relationships. We can do this with `most_similar(positive=[...], negative=[...])`.
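The arithmetic can be illustrated with hand-made two-dimensional vectors in which one axis loosely stands for "royalty" and the other for "gender". These vectors are invented for the example, and `analogy` is a hypothetical helper: it adds and subtracts raw vectors and ranks the remaining words by cosine similarity, which is a simplification of what gensim's `most_similar` does.

```python
import numpy as np

# Invented toy vectors: axis 0 ~ royalty, axis 1 ~ gender
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# king - man + woman
print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy))
```

Here `king - man + woman` equals `[1, -1]`, which is exactly the toy vector for `queen`. With learned embeddings the match is only approximate, hence the `≈`.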

In [None]:
# Vector arithmetic: king - man + woman ≈ queen
# In our Yelp context: restaurant - good + delicious ≈ ?
print("Analogy: 'good' is to 'restaurant' as 'delicious' is to ___")
try:
    result = model.wv.most_similar(positive=['restaurant', 'delicious'], negative=['good'], topn=3)
    for word, score in result:
        print(f"  {word}: {score:.4f}")
except KeyError as e:
    print(f"  Word not in vocabulary: {e}")

To represent entire documents, we can average all word vectors in the document. While this loses word order, it's effective for document similarity tasks.
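A small sketch of the averaging idea on a made-up embedding table (the words and vectors are invented, and `average_vector` is a hypothetical helper, separate from the exercise function in this lab). Unknown words are skipped, and a document with no known words falls back to a zero vector.

```python
import numpy as np

# Invented two-dimensional embeddings for illustration
toy_vectors = {
    "good":  np.array([1.0, 0.0]),
    "pizza": np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_a = average_vector(["good", "pizza", "unknownword"], toy_vectors, dim=2)
doc_b = average_vector(["pizza", "good"], toy_vectors, dim=2)
empty = average_vector([], toy_vectors, dim=2)

print(doc_a, doc_b, empty)
```

`doc_a` and `doc_b` come out identical even though the word order differs, which is the loss of word order mentioned above.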

In [None]:
# FILL: Create function to convert document to vector
def document_to_vector(doc_tokens, model):
    """Convert document to vector by averaging word embeddings"""
    # FILL: Filter tokens that exist in vocabulary
    valid_tokens = None

    if not valid_tokens:
        return np.zeros(model.wv.vector_size)

    # FILL: Average the word vectors
    return None

# FILL: Convert first 3 documents to vectors
doc_vectors = None

# FILL: Calculate similarity matrix using cosine_similarity
similarity_matrix = None

if similarity_matrix is not None:
    print("Document similarity matrix:")
    print(similarity_matrix)

We successfully trained a Word2Vec model and explored word embeddings! The model captures semantic relationships between words and can be used for various NLP tasks.

In [None]:
# Save model (optional)
model.save("word2vec_yelp.model")
print("Model saved!")

# You can load it later with:
# loaded_model = Word2Vec.load("word2vec_yelp.model")