# Word Embeddings: Word2Vec Training and Visualization

## 📚 Learning Objectives

By completing this notebook, you will:
- Train word embeddings using Word2Vec from Gensim
- Understand CBOW vs Skip-gram architectures
- Visualize word embeddings using dimensionality reduction
- Apply Word2Vec to real-world text datasets

## 🔗 Prerequisites

- ✅ Unit 1: Text preprocessing completed
- ✅ Understanding of tokenization and text representation
- ✅ Python, NumPy, Matplotlib knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 07, Unit 2**:
- Training and representing word embeddings using Word2Vec from Gensim
- Applying dimensionality reduction on high-dimensional vectors and visualizing results
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

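**Word2Vec** is a widely used word embedding technique that learns dense vector representations of words by predicting words from their context. It comes in two architectures: CBOW (Continuous Bag of Words), which predicts a target word from its surrounding context words, and Skip-gram, which predicts the surrounding context words from a target word.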
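
To make the two architectures concrete, the sketch below builds the training examples each one would see for a single tokenized sentence. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical and are not part of Gensim; real implementations also add details such as subsampling and negative sampling, which are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs: each center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context_words, target) examples: the neighbors jointly predict the center."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


tokens = ['word', 'embeddings', 'represent', 'words']
sg_pairs = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

print("Skip-gram pairs:", sg_pairs)
print("CBOW examples:  ", cbow)
```

Skip-gram produces one training pair per (center, neighbor) combination, while CBOW produces one example per position. This is one way to see why Skip-gram gives each word more updates but takes longer to train.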

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Try importing Gensim for Word2Vec
try:
    from gensim.models import Word2Vec
    from gensim.models import KeyedVectors
    HAS_GENSIM = True
    print("✅ Gensim available for Word2Vec")
except ImportError:
    HAS_GENSIM = False
    print("⚠️  Gensim not available. Install with: pip install gensim")

# Try importing sklearn for dimensionality reduction
try:
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    HAS_SKLEARN = True
    print("✅ scikit-learn available for visualization")
except ImportError:
    HAS_SKLEARN = False
    print("⚠️  scikit-learn not available. Install with: pip install scikit-learn")

print("\n✅ Libraries imported!")

⚠️  Gensim not available. Install with: pip install gensim
✅ scikit-learn available for visualization

✅ Libraries imported!


## Part 1: Training Word2Vec Model with Gensim


In [2]:
if HAS_GENSIM:
    # Sample text corpus (tokenized sentences)
    # In practice, you would use preprocessed text from a dataset
    sentences = [
        ['natural', 'language', 'processing', 'is', 'important'],
        ['machine', 'learning', 'helps', 'with', 'nlp'],
        ['word', 'embeddings', 'represent', 'words', 'as', 'vectors'],
        ['deep', 'learning', 'models', 'use', 'embeddings'],
        ['neural', 'networks', 'process', 'language', 'data'],
        ['nlp', 'applications', 'include', 'translation', 'and', 'summarization'],
        ['text', 'classification', 'uses', 'word', 'vectors'],
        ['language', 'models', 'understand', 'context', 'and', 'meaning']
    ]
    
    print("=" * 60)
    print("Training Word2Vec Model")
    print("=" * 60)
    print(f"Number of sentences: {len(sentences)}")
    print(f"Sample sentence: {sentences[0]}")
    
    # Train Word2Vec model (Skip-gram)
    print("\nTraining Word2Vec with Skip-gram architecture...")
    model_skipgram = Word2Vec(
        sentences=sentences,
        vector_size=100,      # Dimension of word vectors
        window=5,             # Context window size
        min_count=1,          # Minimum word count
        sg=1,                 # Skip-gram (1) vs CBOW (0)
        epochs=100,
        workers=1
    )
    
    print(f"✅ Model trained! Vocabulary size: {len(model_skipgram.wv.key_to_index)}")
    print(f"Word vector dimensions: {model_skipgram.wv.vector_size}")
    
    # Train Word2Vec model (CBOW)
    print("\nTraining Word2Vec with CBOW architecture...")
    model_cbow = Word2Vec(
        sentences=sentences,
        vector_size=100,
        window=5,
        min_count=1,
        sg=0,                 # CBOW (0)
        epochs=100,
        workers=1
    )
    
    print(f"✅ CBOW model trained! Vocabulary size: {len(model_cbow.wv.key_to_index)}")
    
    # Get word vectors
    print("\nSample word vectors:")
    sample_word = 'language'
    if sample_word in model_skipgram.wv.key_to_index:
        vector = model_skipgram.wv[sample_word]
        print(f"Word: '{sample_word}'")
        print(f"Vector shape: {vector.shape}")
        print(f"First 5 values: {vector[:5]}")
        
        # Find similar words
        similar_words = model_skipgram.wv.most_similar(sample_word, topn=5)
        print(f"\nWords similar to '{sample_word}':")
        for word, similarity in similar_words:
            print(f"  {word}: {similarity:.4f}")
else:
    print("=" * 60)
    print("Word2Vec Training (Installation Required)")
    print("=" * 60)
    print("""
    To train Word2Vec models:
    
    1. Install Gensim:
       pip install gensim
    
    2. Prepare tokenized sentences:
       sentences = [['word1', 'word2'], ['word3', 'word4'], ...]
    
    3. Train model:
       from gensim.models import Word2Vec
       model = Word2Vec(sentences, vector_size=100, window=5, sg=1)
    
    4. Use embeddings:
       vector = model.wv['word']
       similar = model.wv.most_similar('word')
    """)


Word2Vec Training (Installation Required)

    To train Word2Vec models:
    
    1. Install Gensim:
       pip install gensim
    
    2. Prepare tokenized sentences:
       sentences = [['word1', 'word2'], ['word3', 'word4'], ...]
    
    3. Train model:
       from gensim.models import Word2Vec
       model = Word2Vec(sentences, vector_size=100, window=5, sg=1)
    
    4. Use embeddings:
       vector = model.wv['word']
       similar = model.wv.most_similar('word')
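
Since the cell above could not run without Gensim, here is a minimal pure-Python sketch of the idea behind `most_similar`: ranking words by cosine similarity to a query vector. The three-dimensional `toy_vectors` are invented for illustration and are not trained embeddings; the local `most_similar` function is a hypothetical stand-in, not Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors (assumption): real Word2Vec vectors are learned, not hand-written.
toy_vectors = {
    'language': [1.0, 0.0, 1.0],
    'text': [0.9, 0.1, 0.8],
    'network': [0.0, 1.0, 0.1],
}

ranked = most_similar('language', toy_vectors)
for word, score in ranked:
    print(f"  {word}: {score:.4f}")
```

Cosine similarity ignores vector length and compares direction only, so two words score highly when their vectors point the same way regardless of magnitude.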
    


## Part 2: Visualizing Word Embeddings with Dimensionality Reduction


In [3]:
if HAS_GENSIM and HAS_SKLEARN:
    # Select words to visualize
    words_to_visualize = ['natural', 'language', 'processing', 'machine', 'learning', 
                         'word', 'embeddings', 'neural', 'networks', 'text', 'models']
    
    # Get word vectors
    word_vectors = []
    words_found = []
    for word in words_to_visualize:
        if word in model_skipgram.wv.key_to_index:
            word_vectors.append(model_skipgram.wv[word])
            words_found.append(word)
    
    word_vectors = np.array(word_vectors)
    
    print("=" * 60)
    print("Visualizing Word Embeddings")
    print("=" * 60)
    print(f"Original dimensions: {word_vectors.shape}")
    print(f"Words to visualize: {len(words_found)}")
    
    # Apply PCA for dimensionality reduction (2D)
    pca = PCA(n_components=2, random_state=42)
    vectors_2d_pca = pca.fit_transform(word_vectors)
    
    print(f"\nPCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")
    
    # Apply t-SNE for dimensionality reduction (2D)
    print("\nApplying t-SNE...")
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(words_found)-1))
    vectors_2d_tsne = tsne.fit_transform(word_vectors)
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # PCA visualization
    axes[0].scatter(vectors_2d_pca[:, 0], vectors_2d_pca[:, 1], s=100, alpha=0.7)
    for i, word in enumerate(words_found):
        axes[0].annotate(word, (vectors_2d_pca[i, 0], vectors_2d_pca[i, 1]), 
                        fontsize=10, alpha=0.8)
    axes[0].set_title('Word Embeddings Visualization (PCA)', fontsize=14, fontweight='bold')
    axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
    axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
    axes[0].grid(True, alpha=0.3)
    
    # t-SNE visualization
    axes[1].scatter(vectors_2d_tsne[:, 0], vectors_2d_tsne[:, 1], s=100, alpha=0.7, c='green')
    for i, word in enumerate(words_found):
        axes[1].annotate(word, (vectors_2d_tsne[i, 0], vectors_2d_tsne[i, 1]), 
                        fontsize=10, alpha=0.8)
    axes[1].set_title('Word Embeddings Visualization (t-SNE)', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('t-SNE Component 1')
    axes[1].set_ylabel('t-SNE Component 2')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n✅ Word embeddings visualized!")
    print("  - PCA: Linear dimensionality reduction")
    print("  - t-SNE: Non-linear, preserves local neighborhoods")
    print("  - Similar words tend to be close together (more reliable with a larger corpus)")
else:
    print("Note: Install gensim and scikit-learn to visualize word embeddings")


Note: Install gensim and scikit-learn to visualize word embeddings
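
As a rough illustration of what the PCA step computes, the sketch below projects a matrix onto its top two principal components using only NumPy. The random matrix `X` is a hypothetical stand-in for an `(n_words, vector_size)` embedding matrix, not real Word2Vec output, and the signs of the components may differ from what scikit-learn's `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for 11 word vectors of dimension 100 (assumption, not trained data).
X = rng.normal(size=(11, 100))

# PCA works on mean-centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of total variance captured by each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project every vector onto the first two principal directions.
X_2d = X_centered @ Vt[:2].T

print("Reduced shape:", X_2d.shape)
print(f"Variance kept by 2 components: {explained_variance_ratio[:2].sum():.2%}")
```

The share of variance kept by two components indicates how much structure a 2D plot can show. When it is low, distances in the plot should be read with caution.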


## Part 3: CBOW vs Skip-gram Comparison


In [4]:
if HAS_GENSIM:
    print("=" * 60)
    print("CBOW vs Skip-gram Comparison")
    print("=" * 60)
    
    test_word = 'language'
    if test_word in model_skipgram.wv.key_to_index and test_word in model_cbow.wv.key_to_index:
        print(f"\nSimilar words to '{test_word}':")
        
        print("\nSkip-gram results:")
        skipgram_similar = model_skipgram.wv.most_similar(test_word, topn=3)
        for word, sim in skipgram_similar:
            print(f"  {word}: {sim:.4f}")
        
        print("\nCBOW results:")
        cbow_similar = model_cbow.wv.most_similar(test_word, topn=3)
        for word, sim in cbow_similar:
            print(f"  {word}: {sim:.4f}")
        
        print("\n✅ Key Differences:")
        print("  - Skip-gram: Better for rare words, slower training")
        print("  - CBOW: Faster training, better for frequent words")
        print("  - Skip-gram: Generally better for small datasets")
        print("  - CBOW: More efficient for large datasets")


## Summary

### Key Concepts:
1. **Word2Vec**: Word embedding technique that learns dense vector representations
   - **CBOW**: Predicts target word from context (faster, better for frequent words)
   - **Skip-gram**: Predicts context from target word (better for rare words)

2. **Training**: Use Gensim Word2Vec to train on tokenized sentences
3. **Visualization**: Use PCA or t-SNE to reduce dimensions for visualization
4. **Applications**: Word similarity, semantic relationships, feature representation
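
One simple way to use embeddings as a feature representation is to average the vectors of the words in a sentence. The sketch below assumes a plain dictionary of invented two-dimensional vectors; `average_vector` is a hypothetical helper, and averaging is only one of several possible pooling choices.

```python
def average_vector(tokens, vectors, size):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Invented toy vectors (assumption): stand-ins for trained embeddings.
demo_vectors = {
    'word': [1.0, 0.0],
    'vectors': [0.0, 1.0],
    'text': [1.0, 1.0],
}

sentence = ['text', 'classification', 'uses', 'word', 'vectors']
features = average_vector(sentence, demo_vectors, size=2)
print(features)
```

Out-of-vocabulary tokens (`'classification'` and `'uses'` here) are skipped, so the resulting feature vector depends only on words the model has seen.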

### Best Practices:
- Preprocess text properly (tokenization, lowercasing)
- Use appropriate window size (typically 5-10)
- Train on large corpora for better embeddings
- Use pretrained embeddings (for example, Google News Word2Vec vectors or GloVe) when your own corpus is too small to train good vectors
- Visualize embeddings to understand relationships
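
The preprocessing advice above can be sketched as a small function that turns raw text into the list-of-token-lists format Word2Vec expects. This is a minimal example using only the standard library; the helper names and the regular expressions are assumptions chosen for illustration, and a real pipeline would likely use the preprocessing from Unit 1.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def to_sentences(document):
    """Split on sentence-ending punctuation and tokenize each non-empty piece."""
    pieces = re.split(r"[.!?]+", document)
    return [tokens for tokens in (tokenize(piece) for piece in pieces) if tokens]


raw = "Natural Language Processing is important. Machine learning helps with NLP!"
sentences = to_sentences(raw)
print(sentences)
```

The result has the same shape as the `sentences` list used in Part 1, so it could be passed to `Word2Vec(sentences=...)` directly.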

**Reference:** Course 07, Unit 2: "Text Representation and Feature Engineering" - Word2Vec practical content
