# Word2Vec: Skip-gram and CBOW

---

## Table of Contents
1. [Introduction to Word2Vec](#introduction)
2. [Intuition Behind Word2Vec](#intuition)
3. [CBOW Architecture](#cbow)
4. [Skip-gram Architecture](#skipgram)
5. [Training Word2Vec Models](#training)
6. [Comparing CBOW vs Skip-gram](#comparison)
7. [Advanced Topics](#advanced)
8. [Real-World Applications](#applications)
9. [Best Practices](#best-practices)

---

## 1. Introduction to Word2Vec <a id='introduction'></a>

**Word2Vec** is a technique for learning word embeddings introduced by Tomas Mikolov et al. at Google in 2013.

### What Makes Word2Vec Special?

1. **Efficient**: Can train on billions of words
2. **Effective**: Captures semantic and syntactic relationships
3. **Simple**: Based on a shallow neural network with a single hidden layer
4. **Widely Used**: Foundation for many modern NLP techniques

### Two Architectures:

Word2Vec comes in two flavors:

1. **CBOW (Continuous Bag of Words)**
   - Predicts target word from context words
   - Faster to train
   - Better for frequent words

2. **Skip-gram**
   - Predicts context words from target word
   - Slower to train
   - Better for rare words and small datasets

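### Key Idea:

**Distributional Hypothesis**: *"You shall know a word by the company it keeps"* (J. R. Firth, 1957)

Words that appear in similar contexts tend to have similar meanings. Word2Vec's contribution is an efficient way to learn vectors that encode this idea.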

In [None]:
# Setup: Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Gensim for Word2Vec
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import gensim.downloader as api

# NLTK for text processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("✓ All libraries imported successfully!")

## 2. Intuition Behind Word2Vec <a id='intuition'></a>

### The Core Idea:

```
Given a sentence:
"The quick brown fox jumps over the lazy dog"

For the word "fox":
  Context window (size=2): [quick, brown, jumps, over]
  
Word2Vec learns: Words with similar contexts should have similar embeddings
```

### Example:

```
Sentence 1: "The cat sat on the mat"
Sentence 2: "The dog sat on the rug"

Both "cat" and "dog" appear in context: [The ___ sat]
→ "cat" and "dog" will have similar embeddings
```
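
The shared-context idea in the example above can be checked with a few lines of Python. This is a minimal sketch: the helper `context_sets` is a hypothetical name, and the window of one word on each side is an assumption chosen to match the `[The ___ sat]` pattern.

```python
from collections import defaultdict

def context_sets(sentences, window=1):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            contexts[word].update(tokens[j] for j in range(start, end) if j != i)
    return contexts

contexts = context_sets(["The cat sat on the mat", "The dog sat on the rug"])
print(sorted(contexts["cat"]))             # ['sat', 'the']
print(sorted(contexts["dog"]))             # ['sat', 'the']
print(contexts["cat"] == contexts["dog"])  # True
```

Word2Vec does not compare context sets directly, but words whose contexts overlap like this receive similar training signals and so tend to end up with similar vectors.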

### How It Works:

1. **Slide a window** over the text
2. **Create training pairs** (target word, context words)
3. **Train a neural network** to predict relationships
4. **Extract embeddings** from the weights between the input and hidden layers

### Shallow Neural Network:

```
Input Layer → Hidden Layer → Output Layer
(Vocab Size)  (Embedding)    (Vocab Size)
   10,000    →     300     →    10,000
                    ↑
              These become our
              word embeddings!
```
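
The diagram above can be made concrete with a small NumPy sketch. It assumes a toy vocabulary of 5 words and 3 embedding dimensions (both invented for illustration), and the names `W_in` and `word_index` are hypothetical. It shows that multiplying a one-hot vector by the input weight matrix is just a row lookup, which is why the rows of that matrix serve as the word embeddings.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)

# Input-to-hidden weights: one row per vocabulary word
W_in = rng.normal(size=(vocab_size, embedding_dim))

word_index = 2  # hypothetical index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# The hidden layer is the one-hot vector times the weight matrix...
hidden = one_hot @ W_in

# ...which is the same as selecting that word's row directly
print(np.allclose(hidden, W_in[word_index]))
print(hidden.shape)
```

Because of this equivalence, practical implementations skip the one-hot multiplication and index into the matrix directly.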

In [None]:
# Visualize the sliding window concept

def show_context_windows(sentence, window_size=2):
    """
    Show how Word2Vec creates context windows.
    """
    words = word_tokenize(sentence.lower())
    
    print(f"Sentence: '{sentence}'")
    print(f"Window size: {window_size}\n")
    print("="*80)
    print(f"\n{'Target Word':<20} {'Context Words'}")
    print("-"*60)
    
    for i, target_word in enumerate(words):
        # Get context words
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        
        context = [words[j] for j in range(start, end) if j != i]
        
        print(f"{target_word:<20} {context}")

# Example
sentence = "The quick brown fox jumps over the lazy dog"
show_context_windows(sentence, window_size=2)

## 3. CBOW Architecture <a id='cbow'></a>

**CBOW (Continuous Bag of Words)** predicts the target word from its context.

### How CBOW Works:

```
Given context: ["quick", "brown", "jumps", "over"]
Predict:       "fox"
```

### Architecture:

```
Context Words
    ↓  ↓  ↓  ↓
[One-Hot Vectors]
    ↓  ↓  ↓  ↓
[Embedding Layer] ← Weights are word embeddings
    ↓  ↓  ↓  ↓
  [Average/Sum]
       ↓
[Hidden Layer]
       ↓
[Output Layer]
       ↓
   Softmax
       ↓
Target Word Probability
```
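
The architecture above can be sketched as a single forward pass in NumPy. This is an illustrative sketch with randomly initialized, untrained weights and a full softmax; the names `W_in`, `W_out`, and `cbow_forward` are hypothetical, and the toy vocabulary and 4-dimensional embeddings are assumptions.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps", "over"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
embedding_dim = 4  # toy size for illustration

rng = np.random.default_rng(42)
W_in = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(embedding_dim, len(vocab)))  # output weights

def cbow_forward(context_words):
    """Return a probability distribution over the vocabulary for the target word."""
    idx = [word_to_idx[w] for w in context_words]
    hidden = W_in[idx].mean(axis=0)             # average the context embeddings
    scores = hidden @ W_out                     # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return exp_scores / exp_scores.sum()

probs = cbow_forward(["quick", "brown", "jumps", "over"])
print(dict(zip(vocab, probs.round(3))))
```

With untrained weights the distribution is close to uniform; training adjusts `W_in` and `W_out` so that the probability of the true target ("fox" here) increases.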

### Advantages:
- **Faster** than Skip-gram
- **Better for frequent words**
- **Smooths over context**: averaging the context vectors gives stable estimates for common words

### Disadvantages:
- **Less effective for rare words**
- **Loses word order** (bag of words)

In [None]:
# Train a simple CBOW model

# Sample corpus
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "the dog runs fast",
    "the fox is quick",
    "the cat and dog are friends",
    "a quick brown cat",
    "the lazy cat sleeps",
]

# Tokenize sentences
tokenized_sentences = [sentence.split() for sentence in sentences]

print("Training CBOW Model...\n")
print("Sample sentences:")
for i, sent in enumerate(tokenized_sentences[:3], 1):
    print(f"  {i}. {' '.join(sent)}")
print(f"  ... ({len(tokenized_sentences)} total sentences)\n")

# Train CBOW model
# sg=0 means CBOW (sg=1 is Skip-gram)
cbow_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,      # Embedding dimension
    window=2,            # Context window size
    min_count=1,         # Minimum word frequency
    sg=0,                # 0 = CBOW, 1 = Skip-gram
    epochs=100,          # Training iterations
    seed=42
)

print("✓ CBOW Model trained!\n")
print(f"Vocabulary size: {len(cbow_model.wv)}")
print(f"Vector dimensions: {cbow_model.wv.vector_size}")
print(f"\nVocabulary: {list(cbow_model.wv.index_to_key)}")

In [None]:
# Explore CBOW embeddings

# Get similar words
test_words = ['dog', 'cat', 'quick']

print("CBOW Model - Similar Words:\n")
print("="*60)

for word in test_words:
    try:
        similar = cbow_model.wv.most_similar(word, topn=3)
        print(f"\nMost similar to '{word}':")
        for sim_word, score in similar:
            print(f"  {sim_word:<15} → {score:.4f}")
    except KeyError:
        print(f"\n'{word}' not in vocabulary")

## 4. Skip-gram Architecture <a id='skipgram'></a>

**Skip-gram** predicts context words from the target word (opposite of CBOW).

### How Skip-gram Works:

```
Given target:  "fox"
Predict:       ["quick", "brown", "jumps", "over"]
```

### Architecture:

```
Target Word
     ↓
[One-Hot Vector]
     ↓
[Embedding Layer] ← Weights are word embeddings
     ↓
[Hidden Layer]
     ↓
[Output Layer]
     ↓
  Softmax
     ↓
Context Words Probabilities
```

### Training Pairs Created:

```
"The quick brown fox jumps"

Target: "fox"
Training pairs:
  (fox, quick)
  (fox, brown)
  (fox, jumps)
```
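
The pairs listed above can be generated with a short helper. This is a minimal sketch assuming a fixed symmetric window of 2; the function name `skipgram_pairs` is hypothetical and is not part of Gensim.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every position in a token list."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]

tokens = "the quick brown fox jumps".split()
pairs = list(skipgram_pairs(tokens, window=2))

fox_pairs = [p for p in pairs if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps')]
```

Every word in the sentence takes a turn as the target, so even this five-word sentence yields 14 training pairs in total.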

### Advantages:
- **Better for rare words**
- **Better for small datasets**
- **Captures more nuanced relationships**

### Disadvantages:
- **Slower to train**: each target word produces one prediction per context word

In [None]:
# Train Skip-gram model

print("Training Skip-gram Model...\n")

# Train Skip-gram model (sg=1)
skipgram_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,
    window=2,
    min_count=1,
    sg=1,                # 1 = Skip-gram
    epochs=100,
    seed=42
)

print("✓ Skip-gram Model trained!\n")
print(f"Vocabulary size: {len(skipgram_model.wv)}")
print(f"Vector dimensions: {skipgram_model.wv.vector_size}")

In [None]:
# Explore Skip-gram embeddings

print("Skip-gram Model - Similar Words:\n")
print("="*60)

for word in test_words:
    try:
        similar = skipgram_model.wv.most_similar(word, topn=3)
        print(f"\nMost similar to '{word}':")
        for sim_word, score in similar:
            print(f"  {sim_word:<15} → {score:.4f}")
    except KeyError:
        print(f"\n'{word}' not in vocabulary")

## 5. Training Word2Vec Models <a id='training'></a>

Let's train on a larger, more realistic corpus.

In [None]:
# Create a larger training corpus

corpus_text = """
Natural language processing is a subfield of artificial intelligence.
It focuses on the interaction between computers and human language.
Machine learning algorithms are essential for modern NLP systems.
Deep learning has revolutionized natural language understanding.
Word embeddings capture semantic relationships between words.
Neural networks can learn complex patterns in text data.
Transformers are the foundation of modern language models.
BERT and GPT are examples of transformer-based models.
Text classification is a common NLP task.
Sentiment analysis determines the emotional tone of text.
Named entity recognition identifies important entities.
Part of speech tagging assigns grammatical categories.
Machine translation converts text from one language to another.
Question answering systems provide direct responses to queries.
Text summarization creates concise versions of documents.
Chatbots use NLP to understand and respond to users.
Language models predict the next word in a sequence.
Transfer learning allows models to leverage pre-trained knowledge.
Fine-tuning adapts models to specific tasks and domains.
Word vectors represent words as dense numerical features.
"""

# Tokenize into sentences
sentences = sent_tokenize(corpus_text)

# Tokenize each sentence into words
tokenized_corpus = [word_tokenize(sent.lower()) for sent in sentences]

print(f"Corpus Statistics:")
print(f"  Sentences: {len(tokenized_corpus)}")
print(f"  Total words: {sum(len(sent) for sent in tokenized_corpus)}")
print(f"  Unique words: {len(set([word for sent in tokenized_corpus for word in sent]))}")

print(f"\nSample sentences:")
for i, sent in enumerate(tokenized_corpus[:3], 1):
    print(f"  {i}. {' '.join(sent[:15])}...")

In [None]:
# Train both models on the corpus

print("Training Word2Vec Models...\n")
print("="*80)

# CBOW model
print("\n1. Training CBOW...")
cbow = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,     # Increase embedding dimension
    window=5,            # Larger context window
    min_count=1,
    sg=0,                # CBOW
    epochs=100,
    workers=4,           # Parallel processing (use workers=1 for fully reproducible runs)
    seed=42
)
print("   ✓ CBOW trained")

# Skip-gram model
print("\n2. Training Skip-gram...")
skipgram = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,                # Skip-gram
    epochs=100,
    workers=4,
    seed=42
)
print("   ✓ Skip-gram trained")

print("\n" + "="*80)
print("\nModel Statistics:")
print(f"\nCBOW:")
print(f"  Vocabulary: {len(cbow.wv)} words")
print(f"  Dimensions: {cbow.wv.vector_size}")

print(f"\nSkip-gram:")
print(f"  Vocabulary: {len(skipgram.wv)} words")
print(f"  Dimensions: {skipgram.wv.vector_size}")

In [None]:
# Explore learned embeddings

# Only the plural 'models' occurs in the corpus; 'model' would raise a KeyError
test_words_nlp = ['language', 'learning', 'models', 'text']

print("Exploring Trained Embeddings:\n")
print("="*80)

for word in test_words_nlp:
    print(f"\nWord: '{word}'")
    print("-"*60)
    
    # CBOW similar words
    cbow_similar = cbow.wv.most_similar(word, topn=5)
    print("\nCBOW - Most similar:")
    for w, score in cbow_similar:
        print(f"  {w:<20} {score:.4f}")
    
    # Skip-gram similar words
    sg_similar = skipgram.wv.most_similar(word, topn=5)
    print("\nSkip-gram - Most similar:")
    for w, score in sg_similar:
        print(f"  {w:<20} {score:.4f}")

In [None]:
# Test analogies

print("Testing Word Analogies:\n")
print("="*80)

# Test if our model learned relationships
analogies = [
    ('machine', 'learning', 'deep'),     # deep learning
    ('natural', 'language', 'artificial'), # artificial intelligence
]

for word1, word2, word3 in analogies:
    print(f"\n'{word1}' is to '{word2}' as '{word3}' is to:")
    print("-"*60)
    
    try:
        # CBOW
        cbow_result = cbow.wv.most_similar(
            positive=[word3, word2],
            negative=[word1],
            topn=3
        )
        print("\nCBOW predictions:")
        for w, score in cbow_result:
            print(f"  {w:<20} {score:.4f}")
        
        # Skip-gram
        sg_result = skipgram.wv.most_similar(
            positive=[word3, word2],
            negative=[word1],
            topn=3
        )
        print("\nSkip-gram predictions:")
        for w, score in sg_result:
            print(f"  {w:<20} {score:.4f}")
    except KeyError as e:
        print(f"  Error: {e}")

## 6. Comparing CBOW vs Skip-gram <a id='comparison'></a>

In [None]:
# Detailed comparison table

comparison_data = {
    'Aspect': [
        'Training Speed',
        'Best For',
        'Rare Words',
        'Dataset Size',
        'Context Usage',
        'Typical Use Case'
    ],
    'CBOW': [
        'Faster',
        'Frequent words',
        'Less effective',
        'Larger datasets',
        'Predicts target from context',
        'Quick prototyping, large corpora'
    ],
    'Skip-gram': [
        'Slower',
        'Rare words',
        'More effective',
        'Any size (especially small)',
        'Predicts context from target',
        'Quality embeddings, rare words'
    ]
}

df_comparison = pd.DataFrame(comparison_data)

print("CBOW vs Skip-gram Comparison:\n")
print("="*90)
print(df_comparison.to_string(index=False))

print("\n" + "="*90)
print("\nRule of Thumb:")
print("  • Use CBOW when: You have large datasets and need speed")
print("  • Use Skip-gram when: You have small datasets or care about rare words")

In [None]:
# Visualize embeddings from both models

# Select words to visualize (every word must be in the trained vocabulary;
# the corpus contains 'models' but not 'model')
words_to_viz = ['language', 'natural', 'machine', 'learning', 
                'deep', 'neural', 'text', 'models', 'word']

# Get vectors from both models
cbow_vectors = np.array([cbow.wv[word] for word in words_to_viz])
sg_vectors = np.array([skipgram.wv[word] for word in words_to_viz])

# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
cbow_2d = tsne.fit_transform(cbow_vectors)

tsne = TSNE(n_components=2, random_state=42, perplexity=5)
sg_2d = tsne.fit_transform(sg_vectors)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# CBOW
ax1.scatter(cbow_2d[:, 0], cbow_2d[:, 1], s=100, c='steelblue', alpha=0.6)
for i, word in enumerate(words_to_viz):
    ax1.annotate(word, (cbow_2d[i, 0], cbow_2d[i, 1]),
                fontsize=12, fontweight='bold',
                xytext=(5, 5), textcoords='offset points')
ax1.set_title('CBOW Embeddings (t-SNE)', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Skip-gram
ax2.scatter(sg_2d[:, 0], sg_2d[:, 1], s=100, c='coral', alpha=0.6)
for i, word in enumerate(words_to_viz):
    ax2.annotate(word, (sg_2d[i, 0], sg_2d[i, 1]),
                fontsize=12, fontweight='bold',
                xytext=(5, 5), textcoords='offset points')
ax2.set_title('Skip-gram Embeddings (t-SNE)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNote: Both models learn similar relationships but may differ in details")

## 7. Advanced Topics <a id='advanced'></a>

### Negative Sampling

Training Word2Vec with full softmax is computationally expensive:
- Need to compute probabilities for entire vocabulary (10,000+ words)

**Solution: Negative Sampling**
- Instead of updating all words, update only:
  - The target word (positive sample)
  - A few random words (negative samples, typically 5-20)
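
The objective for a single training pair under negative sampling can be sketched in NumPy. This is an illustrative sketch rather than Gensim's implementation: the vectors are random, the function name `negative_sampling_loss` is hypothetical, and the negatives here are simply random vectors (Mikolov et al. draw negative words from the unigram distribution raised to the 3/4 power).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one (target, context) pair plus k sampled negative words."""
    # Push the true context word's score up...
    positive_term = -np.log(sigmoid(context_vec @ target_vec))
    # ...and push the scores of the k negative words down
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ target_vec))))
    return positive_term + negative_term

rng = np.random.default_rng(0)
dim, k = 8, 5  # toy embedding size and number of negatives
target_vec = rng.normal(size=dim)
context_vec = rng.normal(size=dim)
negative_vecs = rng.normal(size=(k, dim))

loss = negative_sampling_loss(target_vec, context_vec, negative_vecs)
print(f"Loss over 1 positive + {k} negatives: {loss:.4f}")
```

Each update touches only `k + 1` output vectors instead of one per vocabulary word, which is where the speed-up over the full softmax comes from.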

### Hierarchical Softmax

Alternative to negative sampling:
- Uses binary tree structure
- Reduces complexity from O(V) to O(log V)
- Tends to work better for infrequent words (negative sampling tends to work better for frequent words)
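
To get a feel for the O(V) versus O(log V) difference, the arithmetic can be sketched as below. It assumes a perfectly balanced binary tree, which is a simplification: Word2Vec builds a Huffman tree, so frequent words get shorter paths than this estimate. The helper name `balanced_tree_depth` is hypothetical.

```python
import math

def balanced_tree_depth(vocab_size):
    """Number of binary decisions needed to reach one leaf in a balanced tree."""
    return math.ceil(math.log2(vocab_size))

for vocab_size in (10_000, 100_000, 1_000_000):
    print(f"V = {vocab_size:>9,}: full softmax scores {vocab_size:,} words, "
          f"tree path needs about {balanced_tree_depth(vocab_size)} decisions")
```

For a 10,000-word vocabulary this is roughly 14 binary decisions per prediction instead of 10,000 output scores.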

### Subsampling Frequent Words

Very frequent words ("the", "is", "a") provide less information:
- Randomly skip frequent words during training
- Speeds up training
- Improves quality
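
The subsampling rule from the original paper (Mikolov et al., 2013) discards each occurrence of a word with probability `1 - sqrt(t / f)`, where `f` is the word's relative frequency and `t` is a small threshold. The sketch below applies that formula; the function name `discard_probability` and the example frequencies are invented for illustration, and implementations may use a slightly different variant of the formula.

```python
import math

def discard_probability(word_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word (formula from the 2013 paper)."""
    if word_frequency <= threshold:
        return 0.0  # rare words are always kept
    return 1.0 - math.sqrt(threshold / word_frequency)

# Relative frequencies below are invented for illustration
for word, freq in [("the", 0.05), ("language", 0.0005), ("embedding", 0.00001)]:
    print(f"{word:<12} frequency={freq:<8} discard probability={discard_probability(freq):.3f}")
```

In Gensim, the `sample` parameter of `Word2Vec` sets this threshold.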

In [None]:
# Train with different hyperparameters

print("Training with Different Configurations:\n")
print("="*80)

configs = [
    {'name': 'Small window', 'window': 2, 'vector_size': 50},
    {'name': 'Large window', 'window': 10, 'vector_size': 50},
    {'name': 'High dimensions', 'window': 5, 'vector_size': 200},
]

models = {}

for config in configs:
    print(f"\nTraining: {config['name']}")
    print(f"  Window: {config['window']}, Dimensions: {config['vector_size']}")
    
    model = Word2Vec(
        sentences=tokenized_corpus,
        vector_size=config['vector_size'],
        window=config['window'],
        min_count=1,
        sg=1,  # Skip-gram
        epochs=50,
        seed=42
    )
    
    models[config['name']] = model
    print(f"  ✓ Training complete")

print("\n" + "="*80)
print("\nComparing similar words for 'learning':")
print("-"*80)

for name, model in models.items():
    similar = model.wv.most_similar('learning', topn=3)
    print(f"\n{name}:")
    for word, score in similar:
        print(f"  {word:<20} {score:.4f}")

## 8. Real-World Applications <a id='applications'></a>

Using Word2Vec embeddings for practical tasks.

In [None]:
# Application 1: Document Similarity

def document_vector(doc, model):
    """
    Convert document to vector by averaging word vectors.
    """
    words = word_tokenize(doc.lower())
    
    # Get vectors for words in vocabulary
    word_vecs = []
    for word in words:
        try:
            word_vecs.append(model.wv[word])
        except KeyError:
            pass
    
    if len(word_vecs) == 0:
        return np.zeros(model.wv.vector_size)
    
    return np.mean(word_vecs, axis=0)

# Test documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "I enjoy eating pizza and pasta for dinner"
]

# Convert to vectors
doc_vectors = [document_vector(doc, skipgram) for doc in documents]

# Calculate similarities
print("Document Similarity Matrix:\n")
print("="*80)

for i in range(len(documents)):
    for j in range(i+1, len(documents)):
        sim = cosine_similarity(
            doc_vectors[i].reshape(1, -1),
            doc_vectors[j].reshape(1, -1)
        )[0][0]
        
        print(f"\nDoc {i+1} vs Doc {j+1}: {sim:.4f}")
        print(f"  Doc {i+1}: {documents[i][:60]}...")
        print(f"  Doc {j+1}: {documents[j][:60]}...")

In [None]:
# Application 2: Text Classification with Word2Vec Features

from sklearn.linear_model import LogisticRegression

# Sample training data
texts = [
    "machine learning algorithm",
    "deep neural network",
    "natural language processing",
    "computer vision task",
    "pizza pasta dinner",
    "restaurant food menu",
    "cooking recipe ingredients",
    "breakfast lunch dinner"
]

labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0=Tech, 1=Food

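# Convert to vectors
# Note: words missing from the Word2Vec vocabulary are skipped, so texts made
# entirely of unseen words (here, the food examples) map to the zero vector
X = np.array([document_vector(text, skipgram) for text in texts])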
y = np.array(labels)

# Train classifier
clf = LogisticRegression(random_state=42)
clf.fit(X, y)

# Test
test_texts = [
    "deep learning model",
    "delicious food",
    "neural network training"
]

print("Text Classification using Word2Vec:\n")
print("="*80)

for text in test_texts:
    vec = document_vector(text, skipgram).reshape(1, -1)
    prediction = clf.predict(vec)[0]
    probability = clf.predict_proba(vec)[0]
    
    category = "Tech" if prediction == 0 else "Food"
    confidence = max(probability) * 100
    
    print(f"\nText: '{text}'")
    print(f"Predicted: {category} (confidence: {confidence:.1f}%)")

## 9. Best Practices <a id='best-practices'></a>

### Training Tips:

1. **Choose the Right Architecture**
   - Large corpus → CBOW (faster)
   - Small corpus or rare words → Skip-gram

2. **Window Size**
   - Smaller (2-5): Captures syntactic relationships
   - Larger (5-10): Captures semantic/topical relationships

3. **Vector Dimensions**
   - Typical: 100-300
   - More dimensions = more capacity, but also more data needed

4. **Training Epochs**
   - Typically 5-100 epochs
   - Monitor convergence

5. **Min Count**
   - Filter very rare words (min_count=5 is common)
   - Reduces vocabulary size and noise
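
The effect of `min_count` (tip 5) can be illustrated without training a model. This is a minimal sketch of the filtering idea only; the helper `filter_by_min_count` is a hypothetical name, and Gensim applies its own version of this step internally when it builds the vocabulary.

```python
from collections import Counter

def filter_by_min_count(tokenized_sentences, min_count):
    """Drop every token that occurs fewer than min_count times in the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count]
            for sent in tokenized_sentences]

toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
filtered = filter_by_min_count(toy_corpus, min_count=2)
print(filtered)
# [['the', 'sat', 'on', 'the'], ['the', 'sat', 'on', 'the']]
```

On a corpus this small, `min_count=2` already removes every content word, which is why the toy examples in this notebook use `min_count=1`.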

### When to Use Pre-trained vs Training Your Own:

**Use Pre-trained:**
- General domain
- Limited training data
- Quick prototyping

**Train Your Own:**
- Domain-specific vocabulary
- Lots of domain-specific data
- Need custom embeddings

In [None]:
# Save and load models

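# Save model
# Note: .save() writes Gensim's native format, not the word2vec binary format,
# so a ".model" extension is less misleading than ".bin"
model_path = "/tmp/word2vec_model.model"
skipgram.save(model_path)
print(f"✓ Model saved to: {model_path}")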

# Load model
loaded_model = Word2Vec.load(model_path)
print(f"✓ Model loaded from: {model_path}")

# Verify loaded model works
test_word = 'language'
similar = loaded_model.wv.most_similar(test_word, topn=3)

print(f"\nTest: Similar words to '{test_word}':")
for word, score in similar:
    print(f"  {word:<20} {score:.4f}")

print("\n✓ Loaded model works correctly!")

## Summary

In this comprehensive notebook, we covered:

‚úÖ **Introduction to Word2Vec**: Revolutionary word embedding technique  
‚úÖ **Core Intuition**: Distributional hypothesis - words in similar contexts  
‚úÖ **CBOW Architecture**: Predict target from context (faster, frequent words)  
‚úÖ **Skip-gram Architecture**: Predict context from target (better for rare words)  
‚úÖ **Training Models**: Hands-on training with Gensim  
‚úÖ **Comparison**: When to use CBOW vs Skip-gram  
‚úÖ **Advanced Topics**: Negative sampling, hyperparameters  
‚úÖ **Real-World Applications**: Document similarity, classification  
‚úÖ **Best Practices**: Training tips and guidelines

### Key Takeaways:

1. **Word2Vec revolutionized NLP** by learning semantic representations
2. **Two architectures**:
   - CBOW: Fast, good for frequent words
   - Skip-gram: Better for rare words and small datasets
3. **Amazing properties**:
   - Semantic similarity
   - Vector arithmetic (king - man + woman ≈ queen)
   - Clustering related words
4. **Practical applications**: Classification, similarity, feature extraction
5. **Foundation for modern NLP**: Paved the way for contextual embeddings and transformer-based models such as BERT and GPT

### The Word2Vec Legacy:

```
2013: Word2Vec introduced
       ↓
2014: GloVe (improved global statistics)
       ↓
2016: FastText (subword information)
       ↓
2018: ELMo (context-dependent)
       ↓
2018: BERT (bidirectional transformers)
       ↓
2019-2020: GPT-2, GPT-3 (massive language models)
       ↓
Today: ChatGPT, modern LLMs
```

### What's Next?

To continue your NLP journey:
1. **Explore other embeddings**: GloVe, FastText
2. **Learn transformers**: BERT, GPT architecture
3. **Try advanced models**: Fine-tune pre-trained models
4. **Build projects**: Apply Word2Vec to your own data

---

## Congratulations! 🎉

You've completed the comprehensive NLP for Machine Learning tutorial series!

You now understand:
- Text preprocessing fundamentals
- Classical text representation methods
- POS tagging and NER
- Modern word embeddings
- Word2Vec architectures and training

**Keep learning and building! 🚀**

---