# Hands-on Lab: Explore Embeddings

## Goal
Visualize word similarity using pretrained Word2Vec embeddings.

## Tools
- **gensim**: For loading pretrained Word2Vec models
- **scikit-learn** (`sklearn`): For dimensionality reduction and similarity calculations
- **matplotlib** and **seaborn**: For visualization
- **numpy** and **pandas**: For array and table handling

## Learning Objectives
By the end of this lab, you will be able to:
1. Load and work with pretrained Word2Vec embeddings
2. Find nearest neighbors in vector space
3. Visualize word similarities using 2D plots
4. Analyze business-relevant vocabulary through embeddings

## Step 1: Import Required Libraries

First, let's import all the necessary libraries for our embedding exploration.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# NLP and embeddings
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

# Machine learning
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"Gensim version: {gensim.__version__}")

## Step 2: Load Pretrained Word2Vec Model

We'll use Google's pretrained Word2Vec model, trained on the Google News dataset. This model contains 300-dimensional vectors for 3 million words and phrases.

**Local Data Management**:
- On the first run, the model is downloaded through gensim's downloader API and a copy is saved to a local `data/` directory
- Subsequent runs load the model directly from `data/`
- This avoids repeated downloads and keeps project files together

**Model Details**:
- **Source**: Google News dataset (3 million words and phrases)
- **Dimensions**: 300-dimensional vectors
- **Local Path**: `./data/word2vec-google-news-300.bin`
- **Fallback**: Downloads automatically if not found locally

In [None]:
# Try to load Google's pretrained Word2Vec model
import os

# Create data directory if it doesn't exist
data_dir = 'data'
os.makedirs(data_dir, exist_ok=True)

try:
    # Try to load from local data directory first
    model_path = os.path.join(data_dir, 'word2vec-google-news-300.bin')
    word_vectors = KeyedVectors.load_word2vec_format(model_path, binary=True)
    print("✅ Successfully loaded Word2Vec model from local data directory")
    print(f"📁 Loaded from: {model_path}")
    print(f"Vocabulary size: {len(word_vectors.key_to_index):,}")
    print(f"Vector dimensions: {word_vectors.vector_size}")
    
except FileNotFoundError:
    try:
        # Try the original Google News model filename
        model_path = os.path.join(data_dir, 'GoogleNews-vectors-negative300.bin')
        word_vectors = KeyedVectors.load_word2vec_format(model_path, binary=True)
        print("✅ Successfully loaded Google News Word2Vec model from local data directory")
        print(f"📁 Loaded from: {model_path}")
        print(f"Vocabulary size: {len(word_vectors.key_to_index):,}")
        print(f"Vector dimensions: {word_vectors.vector_size}")
        
    except FileNotFoundError:
        print("❌ No model found in data directory. Downloading...")
        print("This will download the model to the local data directory.")
        
        # Use gensim's API to download the model
        import gensim.downloader as api
        
        # Download word2vec-google-news-300 model
        print("Downloading word2vec-google-news-300 model...")
        word_vectors = api.load('word2vec-google-news-300')
        
        # Save the model to our data directory for future use
        model_save_path = os.path.join(data_dir, 'word2vec-google-news-300.bin')
        word_vectors.save_word2vec_format(model_save_path, binary=True)
        print(f"✅ Model downloaded and saved to: {model_save_path}")
        
        print("✅ Successfully loaded Word2Vec model from gensim API")
        print(f"Vocabulary size: {len(word_vectors.key_to_index):,}")
        print(f"Vector dimensions: {word_vectors.vector_size}")
        print(f"📁 Model cached in: ./{data_dir}/")

# Verify the model is working
print(f"\n🔍 Model verification:")
print(f"   Variable 'word_vectors' is ready for use")
print(f"   Model type: {type(word_vectors)}")
print(f"   Sample word 'business' in vocabulary: {'business' in word_vectors.key_to_index}")

## Step 3: Define Business-Relevant Words

Let's select a set of business-relevant words to explore their embeddings and relationships.

In [None]:
# Define business-relevant words for analysis
business_words = [
    # Finance & Economics
    'profit', 'revenue', 'investment', 'budget', 'finance', 'economy',
    
    # Technology & Innovation
    'technology', 'innovation', 'digital', 'software', 'artificial_intelligence', 'data',
    
    # Marketing & Sales
    'marketing', 'sales', 'customer', 'brand', 'advertising', 'promotion',
    
    # Operations & Management
    'management', 'leadership', 'strategy', 'operations', 'efficiency', 'quality',
    
    # Human Resources
    'employee', 'talent', 'training', 'performance', 'recruitment', 'teamwork'
]

# Filter words that exist in our vocabulary
available_words = [word for word in business_words if word in word_vectors.key_to_index]
missing_words = [word for word in business_words if word not in word_vectors.key_to_index]

print(f"✅ Available words ({len(available_words)}): {available_words}")
print(f"❌ Missing words ({len(missing_words)}): {missing_words}")

# Use available words for our analysis
target_words = available_words[:15]  # Keep the first 15 available words so the plots stay readable (later categories are dropped)
print(f"\n🎯 Words selected for analysis: {target_words}")
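
The Google News vectors are case-sensitive and store multi-word phrases joined with underscores, so a lowercase entry such as `artificial_intelligence` may end up in the missing list even if a differently cased form exists. The sketch below shows one way to try a few casing variants before giving up. `resolve_token` is a hypothetical helper, and `toy_vocab` is a made-up stand-in for `word_vectors.key_to_index`.

```python
def resolve_token(word, vocab):
    """Return the first spelling variant of `word` found in `vocab`, else None.

    Hypothetical helper (not part of gensim): tries the word as given,
    then a few common casing variants. `vocab` is any container supporting `in`.
    """
    variants = [
        word,
        word.lower(),
        word.capitalize(),                                        # 'Artificial_intelligence'
        "_".join(part.capitalize() for part in word.split("_")),  # 'Artificial_Intelligence'
        word.upper(),
    ]
    for candidate in variants:
        if candidate in vocab:
            return candidate
    return None


# Toy vocabulary, invented for illustration only
toy_vocab = {"profit", "CEO", "Machine_Learning"}

resolved = {
    w: resolve_token(w, toy_vocab)
    for w in ["profit", "ceo", "machine_learning", "blockchain"]
}
print(resolved)
```

With the real model, you would pass `word_vectors.key_to_index` as `vocab` and keep only the words that resolve to something other than `None`.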

## Step 4: Explore Word Similarity - Find Nearest Neighbors

Let's find the nearest neighbors for each of our target words in the vector space.

In [None]:
def find_nearest_neighbors(word, model, top_n=5):
    """Find nearest neighbors for a given word"""
    try:
        neighbors = model.most_similar(word, topn=top_n)
        return neighbors
    except KeyError:
        return None

# Find nearest neighbors for each target word
print("🔍 Finding nearest neighbors for each business word...\n")

neighbors_data = {}
for word in target_words:
    neighbors = find_nearest_neighbors(word, word_vectors, top_n=5)
    if neighbors:
        neighbors_data[word] = neighbors
        print(f"📊 **{word.upper()}** - Nearest neighbors:")
        for neighbor, similarity in neighbors:
            print(f"   {neighbor}: {similarity:.3f}")
        print()
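
To see what a nearest-neighbor lookup involves, here is a minimal NumPy sketch that ranks words by cosine similarity. It is not gensim's implementation, only the same idea on a tiny scale. The function name `nearest_neighbors` and the 3-dimensional toy vectors are invented for illustration.

```python
import numpy as np

def nearest_neighbors(query, words, vectors, top_n=3):
    """Rank `words` by cosine similarity to `query`, excluding the query itself."""
    # Normalize each row to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_idx = words.index(query)
    sims = unit @ unit[query_idx]
    # Sort by descending similarity and skip the query word
    order = [i for i in np.argsort(-sims) if i != query_idx]
    return [(words[i], float(sims[i])) for i in order[:top_n]]

# Toy vectors, made up for this example (not real embeddings)
toy_words = ["profit", "revenue", "software", "data"]
toy_vectors = np.array([
    [0.9, 0.1, 0.0],   # profit
    [0.8, 0.2, 0.1],   # revenue
    [0.1, 0.9, 0.3],   # software
    [0.3, 0.7, 0.5],   # data
])

toy_neighbors = nearest_neighbors("profit", toy_words, toy_vectors, top_n=2)
for word, sim in toy_neighbors:
    print(f"{word}: {sim:.3f}")
```

In the toy data, `revenue` points in almost the same direction as `profit`, so it ranks first. The real model does the same comparison against all 3 million vectors.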

## Step 5: Visualize Word Embeddings in 2D

Now let's visualize our business words in a 2D space using dimensionality reduction techniques.

In [None]:
# Extract vectors for our target words
word_vectors_matrix = np.array([word_vectors[word] for word in target_words])

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
word_vectors_2d_pca = pca.fit_transform(word_vectors_matrix)

# Apply t-SNE for dimensionality reduction (perplexity must be smaller than the number of words)
tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(target_words)-1))
word_vectors_2d_tsne = tsne.fit_transform(word_vectors_matrix)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# Plot PCA results
scatter1 = ax1.scatter(word_vectors_2d_pca[:, 0], word_vectors_2d_pca[:, 1], 
                      c=range(len(target_words)), cmap='viridis', s=100, alpha=0.7)

for i, word in enumerate(target_words):
    ax1.annotate(word, (word_vectors_2d_pca[i, 0], word_vectors_2d_pca[i, 1]), 
                xytext=(5, 5), textcoords='offset points', fontsize=10, fontweight='bold')

ax1.set_title('Word Embeddings Visualization - PCA', fontsize=14, fontweight='bold')
ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
ax1.grid(True, alpha=0.3)

# Plot t-SNE results
scatter2 = ax2.scatter(word_vectors_2d_tsne[:, 0], word_vectors_2d_tsne[:, 1], 
                      c=range(len(target_words)), cmap='viridis', s=100, alpha=0.7)

for i, word in enumerate(target_words):
    ax2.annotate(word, (word_vectors_2d_tsne[i, 0], word_vectors_2d_tsne[i, 1]), 
                xytext=(5, 5), textcoords='offset points', fontsize=10, fontweight='bold')

ax2.set_title('Word Embeddings Visualization - t-SNE', fontsize=14, fontweight='bold')
ax2.set_xlabel('t-SNE Component 1')
ax2.set_ylabel('t-SNE Component 2')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"📈 Visualized {len(target_words)} business words in 2D space")
print(f"📊 PCA explained variance: {pca.explained_variance_ratio_.sum():.1%}")
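
The explained-variance percentages in the PCA axis labels tell you how much of the original 300-dimensional spread survives the projection. As a rough sketch of where those numbers come from, the snippet below performs PCA by hand with a singular value decomposition. The helper name `pca_2d` and the random input matrix are assumptions for illustration, and the random data stands in for real word vectors.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:2].T
    # Squared singular values are proportional to the variance per component
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return coords, explained_ratio[:2]

# Random stand-in for a (15 words x 300 dimensions) matrix
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(15, 300))

toy_coords, toy_ratio = pca_2d(toy_matrix)
print(toy_coords.shape)
print(f"Variance kept by 2 components: {toy_ratio.sum():.1%}")
```

When two components keep only a small share of the variance, distances in the 2D plot are a loose guide at best, so it is worth checking that number before reading much into the picture.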

## Step 6: Calculate and Visualize Similarity Matrix

Let's create a similarity matrix to see how related our business words are to each other.

In [None]:
# Calculate cosine similarity matrix
similarity_matrix = cosine_similarity(word_vectors_matrix)

# Create a DataFrame for better visualization
similarity_df = pd.DataFrame(similarity_matrix, 
                           index=target_words, 
                           columns=target_words)

# Create heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(similarity_df, dtype=bool))  # Mask upper triangle

sns.heatmap(similarity_df, 
            annot=True, 
            cmap='RdYlBu_r', 
            vmin=0, 
            vmax=1,
            center=0.5,
            square=True,
            mask=mask,
            fmt='.2f',
            cbar_kws={'label': 'Cosine Similarity'})

plt.title('Business Words Similarity Matrix', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Words', fontsize=12)
plt.ylabel('Words', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("🔥 Similarity matrix created! Higher values indicate more similar words.")

## Step 7: Find Most and Least Similar Word Pairs

Let's identify the most and least similar word pairs from our business vocabulary.

In [None]:
# Find most and least similar pairs
word_pairs = []
similarities = []

for i in range(len(target_words)):
    for j in range(i+1, len(target_words)):
        word1, word2 = target_words[i], target_words[j]
        similarity = similarity_matrix[i, j]
        word_pairs.append((word1, word2))
        similarities.append(similarity)

# Sort by similarity
sorted_pairs = sorted(zip(word_pairs, similarities), key=lambda x: x[1], reverse=True)

print("🔝 TOP 10 MOST SIMILAR WORD PAIRS:")
print("="*50)
for i, ((word1, word2), sim) in enumerate(sorted_pairs[:10]):
    print(f"{i+1:2d}. {word1:12} ↔ {word2:12} | Similarity: {sim:.3f}")

print("\n🔻 TOP 10 LEAST SIMILAR WORD PAIRS:")
print("="*50)
# Reverse the tail of the descending list so rank 1 is the least similar pair
for i, ((word1, word2), sim) in enumerate(reversed(sorted_pairs[-10:])):
    print(f"{i+1:2d}. {word1:12} ↔ {word2:12} | Similarity: {sim:.3f}")
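
The similarity matrix and the pair ranking can also be written without explicit loops. The following sketch, which uses made-up toy vectors rather than the real embeddings, normalizes the rows, multiplies the matrix by its transpose, and uses `np.triu_indices` to visit each unordered pair once.

```python
import numpy as np

# Toy vectors, invented for illustration (not real embeddings)
toy_words = ["profit", "revenue", "software", "data"]
toy_vectors = np.array([
    [0.9, 0.1, 0.0],   # profit
    [0.8, 0.2, 0.1],   # revenue
    [0.1, 0.9, 0.3],   # software
    [0.3, 0.7, 0.5],   # data
])

# Cosine similarity matrix: unit-length rows, then all pairwise dot products
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
sim_matrix = unit @ unit.T

# Indices of the strict upper triangle, i.e. each pair (i, j) with i < j
rows, cols = np.triu_indices(len(toy_words), k=1)
pair_sims = sim_matrix[rows, cols]
order = np.argsort(-pair_sims)  # descending similarity

ranked_pairs = [
    (toy_words[rows[i]], toy_words[cols[i]], float(pair_sims[i]))
    for i in order
]
print("Most similar: ", ranked_pairs[0])
print("Least similar:", ranked_pairs[-1])
```

Because the list is sorted from most to least similar, the least similar pair sits at the end, which is why the tail of the list has to be reversed when printing a "least similar" ranking.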

## Step 8: Interactive Word Similarity Explorer

Let's create a reusable function to explore how any word relates to our vocabulary. Call it with different words to compare the results.

In [None]:
def explore_word_relationships(word, model, target_words_list):
    """Explore relationships between a word and our target vocabulary"""
    if word not in model.key_to_index:
        print(f"❌ '{word}' not found in vocabulary")
        return
    
    print(f"🔍 Exploring relationships for: **{word.upper()}**")
    print("="*60)
    
    # Find similarities with our target words
    similarities = []
    for target_word in target_words_list:
        if target_word in model.key_to_index:
            sim = model.similarity(word, target_word)
            similarities.append((target_word, sim))
    
    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    print("📊 Similarity with business words:")
    for target_word, sim in similarities[:10]:
        print(f"   {target_word:15} | {sim:.3f} {'🔥' if sim > 0.5 else '📊' if sim > 0.3 else '📉'}")
    
    # Find general nearest neighbors
    print("\n🎯 Top 5 nearest neighbors:")
    neighbors = model.most_similar(word, topn=5)
    for neighbor, sim in neighbors:
        print(f"   {neighbor:15} | {sim:.3f}")

# Example usage
print("Try exploring different words! For example:")
explore_word_relationships('profit', word_vectors, target_words)

## Step 9: Word Arithmetic and Analogies

One fascinating aspect of word embeddings is their ability to capture semantic relationships through vector arithmetic. For example, the vector `king - man + woman` tends to land close to `queen`.

In [None]:
def find_analogy(word1, word2, word3, model, top_n=5):
    """Find word that completes the analogy: word1 is to word2 as word3 is to ?"""
    try:
        result = model.most_similar(positive=[word2, word3], negative=[word1], topn=top_n)
        return result
    except KeyError as e:
        return f"Error: {e}"

# Analogies to explore, as (word1, word2, word3): "word1 is to word2 as word3 is to ?"
# find_analogy computes word2 - word1 + word3
analogies = [
    ('man', 'king', 'woman'),  # Classic example: king - man + woman ≈ queen
    ('company', 'CEO', 'school'),  # company is to CEO as school is to ?
    ('business', 'profit', 'education'),  # business is to profit as education is to ?
    ('product', 'marketing', 'candidate'),  # product is to marketing as candidate is to ?
    ('money', 'investment', 'time'),  # money is to investment as time is to ?
]

print("🧮 WORD ARITHMETIC AND ANALOGIES")
print("="*60)

for word1, word2, word3 in analogies:
    print(f"\n🔍 {word1} is to {word2} as {word3} is to...")
    result = find_analogy(word1, word2, word3, word_vectors, top_n=3)
    
    if isinstance(result, str):
        print(f"   {result}")
    else:
        print(f"   Top predictions:")
        for word, score in result:
            print(f"     {word} ({score:.3f})")
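
To make the arithmetic concrete, here is a small sketch on hand-built vectors whose three dimensions loosely mean "person", "royalty", and "female". The vectors and the `toy_analogy` helper are invented for illustration. Note also that gensim's `most_similar` works on unit-normalized vectors, so its scores can differ somewhat from this raw arithmetic.

```python
import numpy as np

# Hand-built toy vectors: [person, royalty, female] (not real embeddings)
toy_embeddings = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 0.0, 1.0]),
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "child": np.array([1.0, 0.0, 0.5]),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the three input words, as analogy lookups usually do
    candidates = [w for w in embeddings if w not in (a, b, c)]
    scored = [(w, cosine(embeddings[w], target)) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

toy_result = toy_analogy("man", "king", "woman", toy_embeddings)
for word, score in toy_result:
    print(f"{word}: {score:.3f}")
```

The order of the arguments matters: the first word is subtracted and the other two are added, so swapping the first two words asks a different question.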

## Step 10: Summary and Key Insights

Let's summarize our findings and extract key insights from our embedding exploration.

In [None]:
# Summary statistics
print("📈 EMBEDDING EXPLORATION SUMMARY")
print("="*60)

print(f"📊 Total words analyzed: {len(target_words)}")
print(f"📐 Vector dimensions: {word_vectors.vector_size}")
print(f"📚 Total vocabulary size: {len(word_vectors.key_to_index):,}")

# Calculate some statistics
avg_similarity = np.mean(similarity_matrix[np.triu_indices(len(target_words), k=1)])
max_similarity = np.max(similarity_matrix[np.triu_indices(len(target_words), k=1)])
min_similarity = np.min(similarity_matrix[np.triu_indices(len(target_words), k=1)])

print(f"\n📊 Similarity Statistics:")
print(f"   Average similarity: {avg_similarity:.3f}")
print(f"   Maximum similarity: {max_similarity:.3f}")
print(f"   Minimum similarity: {min_similarity:.3f}")

print("\n🎯 KEY INSIGHTS:")
print("1. Word embeddings capture semantic relationships between business terms")
print("2. Similar words cluster together in the vector space")
print("3. Vector arithmetic can reveal analogical relationships")
print("4. Dimensionality reduction helps visualize high-dimensional embeddings")
print("5. Cosine similarity is effective for measuring word relationships")

print("\n🚀 NEXT STEPS:")
print("• Experiment with different word lists (industry-specific terms)")
print("• Try different pretrained models (GloVe, FastText, etc.)")
print("• Explore domain-specific embedding models")
print("• Apply embeddings to text classification or clustering tasks")
print("• Create custom embeddings from your own text data")

## 🎯 Lab Exercise: Your Turn!

Now it's your turn to explore embeddings! Complete the following exercises:

### Exercise 1: Custom Word List
Create your own list of words related to your field of interest (e.g., healthcare, education, sports) and repeat the analysis above.

### Exercise 2: Word Arithmetic
Try to find interesting analogies using word arithmetic. Can you find business-related analogies?

### Exercise 3: Similarity Threshold
Experiment with different similarity thresholds to group words into clusters.

Use the cells below to implement your solutions:

In [None]:
# Exercise 1: Your custom word list
my_words = []
# Add your words here and run the analysis

# Your code here...

In [None]:
# Exercise 2: Word arithmetic experiments
# Try your own analogies here

# Your code here...

In [None]:
# Exercise 3: Clustering with similarity thresholds
# Group words based on similarity thresholds

# Your code here...
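
If you are unsure where to start with Exercise 3, one possible approach is to link every pair of words whose similarity reaches the threshold and treat each connected group as a cluster. The sketch below does this with a small union-find structure. The function name `threshold_clusters` and the toy similarity values are assumptions for illustration, and other clustering strategies are equally valid.

```python
import numpy as np

def threshold_clusters(words, sim_matrix, threshold):
    """Group words into connected components of the 'similarity >= threshold' graph."""
    parent = list(range(len(words)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that meets the threshold
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if sim_matrix[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    return sorted(groups.values())

# Toy similarity matrix, made up for this example
toy_words = ["profit", "revenue", "software", "data"]
toy_sims = np.array([
    [1.00, 0.98, 0.21, 0.41],
    [0.98, 1.00, 0.37, 0.57],
    [0.21, 0.37, 1.00, 0.93],
    [0.41, 0.57, 0.93, 1.00],
])

for threshold in (0.9, 0.5):
    print(threshold, threshold_clusters(toy_words, toy_sims, threshold))
```

A high threshold yields small, tight groups, while a lower one lets a single moderately similar pair chain separate groups together. Try the same function on `similarity_matrix` and `target_words` from Step 6.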