# Word2Vec: Learning Word Embeddings

Word2Vec is a neural-network-based technique that learns dense vector representations (embeddings) of words from large text corpora. Unlike TF-IDF and bag-of-words (BoW), which treat words as independent symbols, Word2Vec captures **semantic relationships**: words that occur in similar contexts end up with similar vectors.

**Key concepts:**
- **CBOW (Continuous Bag of Words)**: Predicts the target word from its surrounding context words
- **Skip-gram**: Predicts the surrounding context words from the target word
- Words that appear in similar contexts get similar embeddings
- Enables vector arithmetic: `king - man + woman ≈ queen`
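
To make the CBOW vs Skip-gram distinction concrete, here is a minimal pure-Python sketch of how the two framings carve one sentence into training examples. The helper `training_pairs` and the toy sentence are illustrative assumptions, not part of gensim; the real implementation adds details such as subsampling and negative sampling that are left out here.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target

sentence = ["the", "cat", "sat", "on", "the", "mat"]

# CBOW framing: all context words together -> one target word
cbow_examples = list(training_pairs(sentence, window=2))

# Skip-gram framing: one target word -> each context word separately
skipgram_examples = [
    (target, ctx_word)
    for context, target in training_pairs(sentence, window=2)
    for ctx_word in context
]

print(cbow_examples[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_examples[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Skip-gram produces more, smaller examples from the same text, which is one reason it is often reported to work better for rare words at the cost of slower training.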

---

## 1. Setup & Installation

In [None]:
!pip install --upgrade scipy numpy gensim -q

In [None]:
from gensim.models import Word2Vec          # The Word2Vec model implementation
import gensim.downloader as api              # Download pre-built corpora/models
from gensim.utils import simple_preprocess   # Tokenization utility (lowercase, remove punctuation)

## 2. Load the 20 Newsgroups Dataset

The **20 Newsgroups** corpus contains ~18,000 newsgroup posts across 20 different topics (politics, sports, religion, tech, etc.). It's a classic text classification benchmark.

In [None]:
# Download the 20-newsgroups dataset via gensim's API
# Returns an iterable of dicts with 'data' (text), 'topic', etc.
corpus = api.load('20-newsgroups')

# Convert to list so we can inspect and reuse it
corpus_list = list(corpus)
print(f"Loaded {len(corpus_list)} documents")

In [None]:
# Inspect a sample document
sample = corpus_list[0]
print(f"Keys: {sample.keys()}")
print(f"Topic: {sample.get('topic', 'N/A')}")
print(f"Text preview:\n{sample['data'][:500]}...")

## 3. Preprocess the Text

`simple_preprocess()` performs the following steps:
- Lowercases the text
- Splits it into alphabetic tokens, dropping punctuation and numbers
- Discards tokens that are too short or too long (defaults: fewer than 2 or more than 15 characters)

Word2Vec expects input as a **list of tokenized sentences/documents** (list of lists of strings).
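
To show the expected input shape without needing gensim, the sketch below approximates the tokenization with the standard library. `rough_preprocess` is a hypothetical stand-in, not gensim's actual code, and its output may differ from `simple_preprocess()` on edge cases.

```python
import re

# Runs of word characters that are not digits (an approximation, not gensim's code)
_TOKEN = re.compile(r"(?:(?!\d)\w)+")

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or long ones."""
    tokens = _TOKEN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

docs = ["Hello, World! 42 is a number.", "Word2Vec learns embeddings."]

# List of lists of strings: one inner list per document
tokenized = [rough_preprocess(d) for d in docs]
print(tokenized)
# [['hello', 'world', 'is', 'number'], ['word', 'vec', 'learns', 'embeddings']]
```

Note that digits split tokens, so `Word2Vec` becomes `word` and `vec` in this sketch.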

In [None]:
# Tokenize each document
processed_docs = [simple_preprocess(doc['data']) for doc in corpus_list]

print(f"Processed {len(processed_docs)} documents")
print(f"Sample tokenized doc: {processed_docs[0][:20]}...")

## 4. Train the Word2Vec Model

Key parameters:
- **`min_count`**: Ignore words appearing fewer than N times (removes rare/noisy words)
- **`vector_size`**: Dimensionality of word vectors (default 100)
- **`window`**: Context window size (maximum distance between the target word and a context word)
- **`sg`**: 0=CBOW, 1=Skip-gram
- **`workers`**: Parallel training threads
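
As an illustration of what `min_count` does to the vocabulary, the following sketch counts token frequencies and keeps only words that meet the threshold. The function `prune_vocab` and the toy documents are assumptions made for this example; gensim does its own vocabulary building internally.

```python
from collections import Counter

def prune_vocab(tokenized_docs, min_count=2):
    """Return {word: count} for words seen at least min_count times in the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word: c for word, c in counts.items() if c >= min_count}

toy_docs = [
    ["gpu", "driver", "crash"],
    ["gpu", "driver", "update"],
    ["gpu", "benchmark"],
]

vocab = prune_vocab(toy_docs, min_count=2)
print(vocab)  # {'gpu': 3, 'driver': 2}
```

Words below the threshold get no embedding at all, which is why looking up a rare word in the trained model can raise a `KeyError`.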

In [None]:
# Train Word2Vec model
# min_count=3 filters out words appearing < 3 times (noise reduction)
model = Word2Vec(
    sentences=processed_docs,
    vector_size=100,    # embedding dimension
    window=5,           # context window
    min_count=3,        # minimum word frequency
    workers=4,          # parallel threads
    sg=0                # 0=CBOW, 1=Skip-gram
)

print(f"Vocabulary size: {len(model.wv)}")
print(f"Vector dimensionality: {model.wv.vector_size}")

## 5. Explore Word Embeddings

Now the fun part — let's see what the model learned!

In [None]:
# Access word vectors via model.wv (KeyedVectors object)
wv = model.wv

# Get the vector for a word
print("Vector for 'computer':")
print(wv['computer'][:10], "...")  # First 10 dimensions

In [None]:
# Find similar words (by cosine similarity)
print("Words most similar to 'computer':")
wv.most_similar('computer', topn=10)

In [None]:
# Try different words based on the 20-newsgroups topics
print("Similar to 'science':")
print(wv.most_similar('science', topn=5))

print("\nSimilar to 'god':")
print(wv.most_similar('god', topn=5))

print("\nSimilar to 'windows':")
print(wv.most_similar('windows', topn=5))

## 6. Word Analogies (Vector Arithmetic)

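The classic Word2Vec demo: `king - man + woman ≈ queen`

This works because the model learns that the offset between the vectors for king and queen is roughly the same as the offset between man and woman. The result is approximate: the answer is the nearest word to the computed vector, not an exact match.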
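
The arithmetic can be sketched with hand-made 2-dimensional vectors. Everything here (the `toy` vectors, their two made-up axes, and the `analogy` helper) is an assumption for illustration; gensim's `most_similar` differs in details such as vector normalization, and real embeddings do not have neatly interpretable axes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors with two invented axes: [royalty, maleness]
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", toy))  # queen
```

Excluding the input words matters: without it, the nearest neighbor of the query vector is often one of the inputs themselves.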

In [None]:
# Word analogies: add the `positive` vectors, subtract the `negative` ones
# "X is to A as B is to ?"  ->  most_similar(positive=[A, B], negative=[X])
# Note: Results depend heavily on corpus size/domain; 20-newsgroups is relatively small

try:
    # mac - apple + microsoft = ?
    result = wv.most_similar(positive=['mac', 'microsoft'], negative=['apple'], topn=3)
    print("mac - apple + microsoft =", result)
except KeyError as e:
    print(f"Word not in vocabulary: {e}")

try:
    # university - student + teach = ?
    result = wv.most_similar(positive=['university', 'teach'], negative=['student'], topn=3)
    print("university - student + teach =", result)
except KeyError as e:
    print(f"Word not in vocabulary: {e}")

## 7. Word Similarity Scores

Directly compute cosine similarity between word pairs.

In [None]:
# Similarity between word pairs (cosine similarity, range -1 to 1)
pairs = [
    ('computer', 'software'),
    ('computer', 'religion'),
    ('science', 'research'),
    ('god', 'jesus'),
]

for w1, w2 in pairs:
    try:
        sim = wv.similarity(w1, w2)
        print(f"{w1} <-> {w2}: {sim:.4f}")
    except KeyError as e:
        print(f"Word not found: {e}")

In [None]:
# Find the odd one out (word that doesn't fit)
try:
    odd = wv.doesnt_match(['computer', 'software', 'hardware', 'god'])
    print(f"Doesn't match: {odd}")
except KeyError as e:
    print(f"Word not found: {e}")

## 8. Save & Load the Model

In [None]:
# Save the full model (can continue training later)
model.save("word2vec_20newsgroups.model")

# Save just the word vectors (smaller, read-only)
model.wv.save("word2vec_20newsgroups.wordvectors")

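# Load them back (KeyedVectors must be imported to load the vectors-only file)
# from gensim.models import KeyedVectors
# loaded_model = Word2Vec.load("word2vec_20newsgroups.model")
# loaded_wv = KeyedVectors.load("word2vec_20newsgroups.wordvectors")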

print("Model saved!")

---

## Summary

| TF-IDF/BoW | Word2Vec |
|------------|----------|
| Sparse, high-dimensional | Dense, low-dimensional (100-300d) |
| Words are independent symbols | Words with similar meanings → similar vectors |
| Explicit term frequency | Learned from context (neural network) |
| Good for exact keyword matching | Good for semantic similarity |
| No model training (only corpus counts and statistics) | Requires training on a large corpus |

**Next steps:**
- Try pre-trained embeddings (e.g., `word2vec-google-news-300`)
- Experiment with Skip-gram vs CBOW (`sg=1` vs `sg=0`)
- Use embeddings for downstream tasks (classification, clustering)
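
For the downstream-task idea, one common baseline is to represent a document as the average of its word vectors. The sketch below uses a plain dict of toy vectors; `document_vector` and `toy_vectors` are hypothetical names for this example. With the trained model, one would pass `model.wv` instead, assuming it supports membership tests and indexing by word as in gensim 4.

```python
def document_vector(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens; None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

toy_vectors = {"gpu": [1.0, 0.0], "driver": [0.0, 1.0], "god": [-1.0, -1.0]}

doc_vec = document_vector(["gpu", "driver", "unknownword"], toy_vectors)
print(doc_vec)  # [0.5, 0.5]
```

The resulting fixed-length vectors can be fed to any standard classifier or clustering algorithm, though averaging discards word order.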

## Bonus: Using Pre-trained Word Vectors

Training on 20-newsgroups is enough for a demo, but the corpus is small and domain-specific. Pre-trained embeddings (trained on billions of words) generally give much better results for general-purpose use.

In [None]:
# Uncomment to download pre-trained Google News vectors (~1.7GB)
# This gives MUCH better results for analogies like king-man+woman=queen

# pretrained = api.load('word2vec-google-news-300')
# print(pretrained.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))

# List available pre-trained models:
print("Available pre-trained models:")
print([m for m in api.info()['models'].keys() if 'word2vec' in m.lower() or 'glove' in m.lower()])