# EmbedEvolution Stage 2: Word2Vec Embeddings

Welcome to the second stage of EmbedEvolution! We now explore Word2Vec, a highly influential model that learns static word embeddings from local word co-occurrence statistics. Unlike RNNs, which process sequences and can produce context-dependent (though limited) representations, Word2Vec assigns a single, fixed vector to each word in the vocabulary.

**Goal:** Understand how Word2Vec captures semantic similarities, its efficiency, and its inherent limitations, particularly its static nature and lack of true contextual understanding compared to sequence models.

## 1. Setup and Imports

Import the necessary libraries. We'll use `gensim` for Word2Vec and `nltk` for tokenization.

In [None]:
import numpy as np
import gensim
from gensim.models import Word2Vec
import gensim.downloader as api
import seaborn as sns
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Download nltk resources if not already present (run once)
try:
    # Attempt to use a function that requires 'punkt' to see if it's available
    _ = sent_tokenize("This is a test sentence.")
    print("NLTK 'punkt' resource seems to be available.")
except LookupError:
    print("NLTK 'punkt' resource not found. Downloading...")
    nltk.download('punkt')
    nltk.download('punkt_tab')
    print("NLTK 'punkt' resource downloaded.")
except Exception as e:
    print(f"An error occurred during NLTK setup: {e}")

print("Gensim version:", gensim.__version__)

## 2. Define the Corpus: The "Lumina Codex"

We'll use the detailed "Lumina Codex and the Solara System" text.

In [None]:
text = """
The Lumina Codex and the Solara System: A Tapestry of Ancient Wisdom and Cosmic Discovery
In the shadowed halls of the Cairo Museum, a dusty papyrus scroll, cataloged as Papyrus K-37b from the Middle Kingdom, lay forgotten for centuries. Dubbed the Lumina Codex by its discoverers, this fragile relic was initially dismissed as a mythological curiosity, its cryptic hieroglyphs and star charts interpreted as poetic musings of a priestly scribe. Yet, in 2024, a team of linguists and astronomers, led by Dr. Amara Nassar, deciphered its veiled verses, revealing an astonishing truth: the codex described a distant star system with uncanny precision, orbiting a radiant G-type star named the Star of Eternal Radiance—now known as Lumina. Intriguingly, the codex also spoke of a divine figure, "the Swift One," bearing a staff entwined with serpents, reminiscent of the god Mercury in Roman mythology or Thoth in Egyptian lore. This messenger, often depicted as Thoth the scribe of the gods, is said to have imparted the knowledge of the stars to the ancient scribes, guiding their hands in creating the star charts. This revelation sparked a scientific odyssey, merging ancient Egyptian cosmology with cutting-edge astronomy, as the Solara System emerged from the Nebula Cygnus-X1, nestled in the Orion Spur of the Milky Way Galaxy.

The Lumina Codex spoke of Lumina and its ten celestial attendants, organized into poetic regions: the searing Forges of Ra for the inner worlds, the verdant Blessed Belt of Osiris for the habitable zone, the majestic Domains of the Sky Titans for gas giants, and the enigmatic Frozen Outlands for the outer realms. Its star charts, etched with meticulous care, hinted at a cosmic map, with references to the Rivers of Stars—likely the Milky Way—and the Celestial Gardens, evoking the Local Group within the Virgo Supercluster. The codex's verses, such as "Ten jewels dance in the embrace of the Eternal Radiance, their faces veiled in fire, water, and ice," seemed to prefigure a system now confirmed by the Cygnus-X1 Deep Sky Array, a fictional next-generation telescope orbiting beyond Earth's atmosphere.

Discovery and Modern Corroboration
The Solara System's discovery began in 2023, when the Cygnus-X1 Deep Sky Array detected subtle wobbles in Lumina's light, indicating a complex system of orbiting bodies. Located 1,200 light-years away in the Nebula Cygnus-X1, Lumina is a stable, middle-aged G-type star, slightly larger than the Sun, with a luminosity that sustains a diverse array of worlds. As astronomers analyzed the data, they identified ten planets, each with unique characteristics that eerily echoed the Lumina Codex. The parallels were undeniable: the codex's Forges of Ra matched the inner rocky planets, while the Blessed Belt of Osiris aligned with two habitable worlds teeming with life. The Domains of the Sky Titans and Frozen Outlands described gas giants and icy dwarfs with striking accuracy. The scientific community buzzed with excitement, as linguists and astronomers collaborated to decode the codex's metaphors, revealing a blend of ancient intuition and cosmic truth.

The Solara System: A Celestial Menagerie
Lumina: The Star of Eternal Radiance

Lumina, a G2V star, radiates a warm, golden light, its stable fusion cycle supporting a system spanning 12 astronomical units. Its magnetic fields are unusually calm, suggesting a long lifespan conducive to life's evolution. The codex describes Lumina as "the hearth of eternity, whose breath kindles the dance of worlds," a poetic nod to its life-giving energy.

The Forges of Ra: Inner Planets
1- Ignis: The closest planet to Lumina, Ignis is a scorched, iron-rich world with a molten surface pocked by ancient impact craters, reminiscent of the planet Mercury in our own solar system. Its thin atmosphere, rich in sulfur dioxide, glows faintly under Lumina's intense radiation. The codex calls it "Ra's Anvil, where molten rivers forge the bones of the cosmos," reflecting its volcanic past and metallic crust.
2- Ferrus: Slightly larger, Ferrus is a rocky planet with vast plains of oxidized iron, giving it a crimson hue. Its surface bears scars of past tectonic activity, with towering cliffs and deep chasms. The codex names it "the Forge of Hephaestus's Twin," hinting at its metallic wealth, now confirmed by spectroscopic analysis revealing nickel and cobalt deposits.

The Blessed Belt of Osiris: Habitable Zone
1- Aqua: A breathtaking ocean world, Aqua is enveloped in turquoise clouds of water vapor and nitrogen. Its surface is 90% liquid water, with archipelagos of coral-like structures hosting complex aquatic ecosystems. Bioluminescent Aquarelles, jellyfish-like creatures with crystalline tentacles, drift in vast schools, their light pulses synchronizing in rhythmic displays. Predatory Thalacynths, eel-like organisms with electromagnetic sensors, hunt in the deep trenches. Aqua's moon, Thalassa, is an ice-covered world with a subglacial ocean, where astrobiologists hypothesize microbial extremophiles thrive in hydrothermal vents, metabolizing sulfur compounds. The codex describes Aqua as "Osiris's Chalice, where life swims in the tears of the gods," and Thalassa as "the frozen veil hiding the spark of creation."
2- Veridia: A super-Earth, Veridia boasts lush continents of bioluminescent flora, such as Luminara trees, which pulse with green and violet light, and Crystalferns, whose fractal leaves refract Lumina's rays into dazzling spectra. Veridia is home to the Sylvans, sentient, silicon-based life forms resembling ambulatory crystal shrubs. Their bodies, composed of lattice-like structures, shimmer with bioluminescent patterns used for communication. Intriguingly, the Sylvans' technology incorporates liquid mercury, a metal known for its unique properties, in their communication devices. This allows them to transmit their bioluminescent patterns through conductive channels, enhancing their collective consciousness. Sylvan society is decentralized, with "groves" of individuals linked via light-based signals, forming a collective consciousness deeply attuned to Veridia's ecosystem. Their architecture, grown from crystalline minerals, integrates seamlessly with the landscape. The codex calls Veridia "the Garden of Osiris's Breath," where "the shining ones weave light into wisdom."

The Domains of the Sky Titans: Gas Giants
1- Zephyrus: A massive hydrogen-helium gas giant, Zephyrus dominates the system with its radiant ring system, composed of ice and silicate particles. Its atmosphere swirls with golden storms, driven by intense winds. Among its 47 moons, Io-Prime stands out, a volcanically active world spewing sulfur plumes, likely powered by tidal heating. The codex names Zephyrus "the Sky Titan's Crown," its rings "the jeweled girdle of the heavens."
2- Boreas: An ice giant with a deep blue methane atmosphere, Boreas exhibits retrograde rotation and an asymmetrical magnetic field, creating auroras that dance across its poles. Its 22 moons include Erynnis, a rocky moon with methane lakes. The codex describes Boreas as "the Frost Titan, whose breath chills the void," capturing its icy majesty.

The Frozen Outlands: Outer Planets
1- Umbriel: A dwarf planet with a charcoal-dark surface, Umbriel's icy crust is fractured by ancient impacts. Its moon Nyx, a captured object, is rich in organic compounds, hinting at prebiotic chemistry. The codex calls Umbriel "the Shadowed Outcast, guarded by the dark sentinel."
2- Erebus: An icy world with a nitrogen-methane atmosphere, Erebus has a highly elliptical orbit, suggesting a captured origin. Its surface sparkles with frost-covered ridges. The codex names it "the Silent Wanderer, cloaked in eternal frost."
3- Aetheria: The outermost planet, Aetheria is a rogue dwarf with a thin atmosphere of neon and argon. Its moon Lethe exhibits cryovolcanism, spewing ammonia-water mixtures. Astrobiologists speculate that Lethe's subsurface ocean may harbor microbial life, analogous to Thalassa's. The codex describes Aetheria as "the Veiled Wanderer, whose dreams freeze in the outer dark," and Lethe as "the weeping mirror of the cosmos."
4- Nyxara: A small, icy body with a chaotic orbit, Nyxara's surface is a mosaic of frozen nitrogen and carbon monoxide. The codex calls it "the Lost Jewel, dancing beyond the Titans' gaze."

Life in the Solara System
Aqua's aquatic ecosystems are a marvel, with Aquarelles forming symbiotic networks with coral-like Hydroskeletons, which filter nutrients from the water. Thalacynths use electromagnetic pulses to stun prey, suggesting an evolutionary arms race. On Thalassa, microbial life is hypothesized based on chemical signatures of sulfur and methane in its subglacial ocean, though no direct evidence exists yet.
Veridia's Sylvans are the system's crown jewel. Their crystalline bodies, averaging two meters tall, refract light into complex patterns, encoding emotions, ideas, and memories. Their society operates as a "luminous collective," with no central authority; decisions emerge from synchronized light displays across groves. Sylvan technology manipulates crystalline minerals and liquid mercury to create tools and habitats, all in harmony with Veridia's ecosystem. Their discovery has sparked intense study by linguists decoding their light-based language, revealing a philosophy centered on balance and interconnectedness.
On Lethe, cryovolcanic activity suggests a subsurface ocean with potential microbial ecosystems, possibly metabolizing ammonia. Unlike Aqua's confirmed complex life and Veridia's sentient Sylvans, life on Thalassa and Lethe remains speculative, driving astrobiological research.

Galactic Context
The Solara System resides in the Orion Spur, a minor arm of the Milky Way, part of the Local Group within the Virgo Supercluster. The codex's Rivers of Stars evoke the Milky Way's spiral arms, while the Celestial Gardens suggest a poetic grasp of the Local Group's galactic cluster. This cosmic placement underscores Solara's significance as a microcosm of the universe's diversity.

Ongoing Exploration
Scientific teams, including astrobiologists, geologists, and linguists, are studying Solara via the Cygnus-X1 Deep Sky Array and planned probes, such as the Lumina Pathfinder Mission. Challenges include the 1,200-light-year distance, requiring advanced telemetry for data transmission. Sylvan communication poses a unique hurdle, as their light patterns defy traditional linguistic models. Future missions aim to deploy orbiters around Aqua, Veridia, and Lethe to confirm microbial life and study Sylvan culture.

A Cosmic Tapestry
The Solara System, unveiled through the Lumina Codex and modern astronomy, blends ancient wisdom with scientific discovery. Its worlds—from the fiery Forges of Ra to the icy Frozen Outlands—offer a rich tapestry of environments, life forms, and mysteries. As scientists probe this distant system, the codex's poetic verses resonate, reminding humanity that the cosmos has long whispered its secrets, awaiting those bold enough to listen.

To share this cosmic tapestry with the world, an educational initiative has launched the Solara Explorer Kit for children. The kit includes detailed models of the ten planets, with Ignis crafted to resemble the planet Mercury as a toy for the kids to play and learn, it's complete with its cratered surface and a tiny, movable rover that kids can roll across its terrain. A central figure of Lumina, the star, shines brightly, paired with a glow-in-the-dark Mercury space blaster that shoots safe, squishy foam darts, letting kids pretend they're defending the Solara System from space invaders. This kit aims to spark curiosity in young minds, inviting them to explore the night sky and imagine distant worlds through hands-on fun.
"""

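# Clean the text slightly
cleaned_text = text.lower()
# Replace em dashes (U+2014) with spaces first; otherwise the words they join get fused
# once punctuation is stripped (e.g. "radiance" and "now" would become one token)
cleaned_text = cleaned_text.replace('\u2014', ' ')
# Keep letters, digits, whitespace, and hyphens (for terms like "g-type", "light-years", "silicon-based").
# Numbers (planet numbers, distances) are kept too, although they become separate tokens.
cleaned_text = re.sub(r'[^a-z0-9\s-]', '', cleaned_text)
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Collapse runs of whitespace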

print("Cleaned Text Sample (Lumina Codex):")
print(cleaned_text[:300] + "...")

## 3. Text Preprocessing for Word2Vec

Word2Vec expects input as a list of sentences, where each sentence is a list of tokens (words).
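
As a small illustration of that nested shape, the sketch below turns a two-sentence string into a list of token lists. It uses a naive regex split rather than NLTK so that it runs without any downloads; `to_token_lists` and `toy_text` are hypothetical names introduced only for this example.

```python
import re

def to_token_lists(raw_text):
    """Split text into sentences, then each sentence into lowercase tokens."""
    # Naive sentence split on ., ! or ? followed by whitespace
    # (an assumption for this sketch; NLTK's sent_tokenize is more robust)
    sentences = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    token_lists = []
    for sentence in sentences:
        # Keep runs of letters, digits, and hyphens as tokens
        tokens = re.findall(r'[a-z0-9-]+', sentence.lower())
        if tokens:
            token_lists.append(tokens)
    return token_lists

toy_text = "Lumina is a G-type star. Ignis orbits Lumina."
print(to_token_lists(toy_text))
```

Each inner list is one sentence, which is the unit Word2Vec draws its context windows from.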


In [None]:
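# Split the raw text into sentences first. `cleaned_text` has already lost its
# sentence-ending punctuation, so sent_tokenize would return it as one long sentence.
raw_sentences = sent_tokenize(text)
tokenized_sentences_for_w2v = []
for sentence in raw_sentences:
    # Clean each sentence individually after sentence tokenization
    sentence_cleaned = sentence.lower()
    sentence_cleaned = sentence_cleaned.replace('\u2014', ' ')  # Keep words joined by em dashes apart
    sentence_cleaned = re.sub(r'[^a-z0-9\s-]', '', sentence_cleaned)
    sentence_cleaned = re.sub(r'\s+', ' ', sentence_cleaned).strip()
    words = word_tokenize(sentence_cleaned)  # Split the cleaned sentence into word tokens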
    words = [word for word in words if word] # Remove empty strings that might result
    if words: # Only add non-empty lists of words
        tokenized_sentences_for_w2v.append(words)

if tokenized_sentences_for_w2v:
    print(f"Number of sentences for Word2Vec: {len(tokenized_sentences_for_w2v)}")
    print("Sample tokenized sentence:", tokenized_sentences_for_w2v[0])
else:
    print("Warning: No sentences were tokenized. Check the text and preprocessing.")

## Part 1: Word2Vec Trained from Scratch on "Lumina Codex"

We'll first train a Word2Vec model using only our "Lumina Codex" text. This will help us understand how Word2Vec learns from a specific, smaller corpus and its limitations.
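
To make "learning from local co-occurrence" concrete, here is a minimal sketch of how skip-gram training examples can be formed: each word is paired with the words that fall inside a window around it. The function `skipgram_pairs` is a hypothetical helper for illustration, not part of `gensim`, and it uses a fixed window; real implementations add refinements such as negative sampling that this sketch leaves out.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["ignis", "orbits", "lumina", "closely"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

With `window=1` the four-token sentence yields six pairs; widening the window adds more distant neighbours as context.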

### 4.1 Train Word2Vec Model (Scratch)
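
On a corpus this small, the minimum word count has a large effect: words seen fewer times than the threshold are dropped from the vocabulary and never receive a vector. The sketch below imitates that filtering step with a plain counter; `build_vocab` and `toy_sentences` are hypothetical names used only here, and the code is an approximation of the idea rather than `gensim`'s internal logic.

```python
from collections import Counter

def build_vocab(token_lists, min_count):
    """Count tokens across all sentences and keep those seen at least `min_count` times."""
    counts = Counter(tok for sent in token_lists for tok in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}

toy_sentences = [["lumina", "is", "a", "star"], ["ignis", "orbits", "lumina"]]
print(build_vocab(toy_sentences, min_count=1))  # every word survives
print(build_vocab(toy_sentences, min_count=2))  # only repeated words survive
```

This is why some analysis keywords may be missing from the scratch model's vocabulary later on.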

In [None]:
# Word2Vec Parameters
embedding_dim_w2v = 100    # Dimensionality of the word vectors
window_size_w2v = 5        # Max distance between current and predicted word
min_word_count_w2v = 2     # Ignores words with frequency lower than this
num_workers_w2v = 4        # Number of worker threads used for training
sg_w2v = 1                 # 1 for Skip-gram; 0 for CBOW
epochs_w2v = 50            # Number of iterations over the corpus

if tokenized_sentences_for_w2v:
    print(f"Training Word2Vec model from scratch on {len(tokenized_sentences_for_w2v)} sentences...")
    scratch_w2v_model = Word2Vec(sentences=tokenized_sentences_for_w2v,
                                 vector_size=embedding_dim_w2v,
                                 window=window_size_w2v,
                                 min_count=min_word_count_w2v,
                                 workers=num_workers_w2v,
                                 sg=sg_w2v,
                                 epochs=epochs_w2v)
    print("Word2Vec model (scratch) trained.")
    print(f"Vocabulary size (scratch model): {len(scratch_w2v_model.wv.key_to_index)}")
else:
    print("Cannot train Word2Vec model from scratch: no tokenized sentences available.")
    scratch_w2v_model = None

### 4.2 Keyword Embedding Analysis (Scratch Model)

Let's examine the embeddings for our target keywords from the "Lumina Codex" as learned by the scratch-trained model. We will assess semantic similarity, "contextual understanding" (how the static vector represents general meaning), vocabulary handling, and clustering quality.
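
Semantic similarity here means cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal sketch of that computation on hypothetical 2-dimensional vectors (not taken from any trained model) looks like this:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen so the angles are easy to see
vec_a = [1.0, 0.0]
vec_b = [1.0, 1.0]   # 45 degrees from vec_a
vec_c = [0.0, 2.0]   # 90 degrees from vec_a

print(round(cosine(vec_a, vec_b), 4))
print(round(cosine(vec_a, vec_c), 4))
```

Because the measure depends only on the angle between vectors, `vec_c` being twice as long as `vec_a` has no effect on the result.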

In [None]:
keywords_to_analyze = ["mercury", "lumina", "star", "sun", "ignis", "ferrus", "iron", "thoth", "orbit", "nebula"]

if scratch_w2v_model:
    # Filter keywords present in the scratch model's vocabulary
    keywords_in_scratch_vocab = sorted([kw for kw in keywords_to_analyze if kw in scratch_w2v_model.wv.key_to_index])
    print(f"Keywords for analysis found in scratch Word2Vec vocabulary ({len(keywords_in_scratch_vocab)}): {keywords_in_scratch_vocab}")

    if len(keywords_in_scratch_vocab) >= 2:
        embedding_matrix_scratch = np.array([scratch_w2v_model.wv[word] for word in keywords_in_scratch_vocab])
        print(f"\nShape of embedding matrix for analysis (scratch Word2Vec): {embedding_matrix_scratch.shape}")

        # --- 1. Semantic Similarity (Heatmap) ---
        print("\n--- Cosine Similarity Heatmap (Keywords - Scratch Word2Vec) ---")
        similarity_matrix_scratch = cosine_similarity(embedding_matrix_scratch)
        
        num_labels_scratch = len(keywords_in_scratch_vocab)
        fig_width_scratch = max(12, num_labels_scratch * 0.7)
        fig_height_scratch = max(10, num_labels_scratch * 0.5)
        plt.figure(figsize=(fig_width_scratch, fig_height_scratch))
        
        annotate_heatmap_scratch = num_labels_scratch < 25
        
        # Enhanced color map with better contrast (matching RNN analysis)
        sns.heatmap(similarity_matrix_scratch,
                    annot=annotate_heatmap_scratch, 
                    cmap='viridis', 
                    fmt=".2f",
                    xticklabels=keywords_in_scratch_vocab, 
                    yticklabels=keywords_in_scratch_vocab,
                    linewidths=.5, 
                    cbar_kws={"shrink": .8},
                    vmin=0,  # Set minimum value for better contrast
                    vmax=1)  # Set maximum value for better contrast
        plt.title(f'Word2Vec (Scratch) Cosine Similarity (Keywords - Lumina Codex)', fontsize=16)
        plt.xticks(rotation=65, ha='right', fontsize=max(8, 14 - num_labels_scratch // 4))
        plt.yticks(rotation=0, fontsize=max(8, 14 - num_labels_scratch // 4))
        plt.tight_layout()
        plt.show()

        # --- Specific Pairwise Similarities for Discussion (matching RNN pairs) ---
        print("\n--- Specific Pairwise Similarities (Scratch Word2Vec) ---")
        pairs_to_check_discussion = [
            ("mercury", "ignis"), ("lumina", "star"), ("star", "sun"),
            ("ignis", "ferrus"), ("mercury", "iron"), ("thoth", "mercury"),
            ("orbit", "lumina"), ("nebula", "lumina")
        ]
        for kw1, kw2 in pairs_to_check_discussion:
            if kw1 in scratch_w2v_model.wv and kw2 in scratch_w2v_model.wv:
                sim_val = scratch_w2v_model.wv.similarity(kw1, kw2)
                print(f"Similarity between '{kw1}' and '{kw2}': {sim_val:.4f}")
            else:
                print(f"Cannot compare '{kw1}' and '{kw2}' (one or both not in scratch vocab).")
        
        # --- 2. Clustering Quality (PCA Visualization) ---
        print("\n--- PCA Visualization (Keywords - Scratch Word2Vec) ---")
        pca_scratch = PCA(n_components=2)
        embeddings_2d_scratch = pca_scratch.fit_transform(embedding_matrix_scratch)

        plt.figure(figsize=(fig_width_scratch * 0.9, fig_height_scratch * 0.9))
        
        # Define keyword categories and colors (matching RNN analysis)
        keyword_categories = {
            'celestial': ['star', 'sun', 'lumina', 'orbit', 'nebula'],
            'elements': ['mercury', 'ignis', 'ferrus', 'iron'],
            'mythology': ['thoth']
        }
        
        # Create color mapping
        colors = []
        for label in keywords_in_scratch_vocab:
            if label in keyword_categories['celestial']:
                colors.append('gold')
            elif label in keyword_categories['elements']:
                colors.append('darkred')
            elif label in keyword_categories['mythology']:
                colors.append('purple')
            else:
                colors.append('gray')
        
        # Plot with colors
        plt.scatter(embeddings_2d_scratch[:, 0], embeddings_2d_scratch[:, 1], 
                   alpha=0.8, s=100, c=colors, edgecolors='black', linewidth=0.5)
        
        # Add annotations with better positioning
        for i, label in enumerate(keywords_in_scratch_vocab):
            plt.annotate(label, 
                       (embeddings_2d_scratch[i, 0], embeddings_2d_scratch[i, 1]), 
                       textcoords="offset points", 
                       xytext=(5,5), 
                       ha='center', 
                       fontsize=max(8, 12 - num_labels_scratch // 5),
                       fontweight='bold')
        
        plt.title('Word2Vec (Scratch) Keyword Embeddings (Lumina Codex) - PCA', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Add legend for categories
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='gold', edgecolor='black', label='Celestial'),
            Patch(facecolor='darkred', edgecolor='black', label='Elements'),
            Patch(facecolor='purple', edgecolor='black', label='Mythology')
        ]
        plt.legend(handles=legend_elements, 
                  loc='upper center', 
                  bbox_to_anchor=(0.5, -0.1), 
                  ncol=3,
                  fontsize=10,
                  title='Keyword Categories')
        
        # Adjust layout to prevent legend cutoff
        plt.tight_layout()
        plt.subplots_adjust(bottom=0.15)
        plt.show()
        
        # Print explained variance ratio
        print(f"\nPCA explained variance ratio: {pca_scratch.explained_variance_ratio_}")
        print(f"Total variance explained: {sum(pca_scratch.explained_variance_ratio_):.2%}")
    else:
        print("Not enough keywords in scratch vocabulary for analysis (need at least 2).")
else:
    print("Word2Vec model (scratch) not trained. Skipping keyword analysis.")

### 4.3 Detailed Analysis of 'mercury' Occurrences
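
The property examined in this section can be stated in a few lines: a static model is a lookup table keyed by the word alone, so the surrounding sentence cannot change the result. The sketch below uses hypothetical 3-dimensional vectors and a hypothetical `embed` helper; a trained model would supply real vectors.

```python
# Hypothetical toy vectors, for illustration only
toy_vectors = {
    "mercury": [0.2, -0.1, 0.7],
    "planet": [0.3, 0.0, 0.6],
}

def embed(sentence_tokens, word, vectors):
    """Look up `word`; a static model ignores the surrounding tokens."""
    if word not in sentence_tokens:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return vectors[word]

myth = ["the", "god", "mercury", "in", "roman", "mythology"]
metal = ["liquid", "mercury", "is", "a", "metal"]

# Same word, different contexts, identical vector
print(embed(myth, "mercury", toy_vectors) == embed(metal, "mercury", toy_vectors))
```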

In [None]:
# === Word2Vec Mercury Context Analysis (Scratch Model) ===
print("\n=== Word2Vec Mercury Context Analysis - Demonstrating Static vs Contextual Embeddings ===")

# Keywords to analyze - same as RNN analysis
mercury_context_keywords = ["mercury", "planet", "toy", "metal", "mythology"]

if scratch_w2v_model:
    # Check which keywords are in the vocabulary
    keywords_in_vocab = [kw for kw in mercury_context_keywords if kw in scratch_w2v_model.wv.key_to_index]
    missing_keywords = [kw for kw in mercury_context_keywords if kw not in scratch_w2v_model.wv.key_to_index]
    
    print(f"Keywords found in scratch Word2Vec vocabulary ({len(keywords_in_vocab)}): {keywords_in_vocab}")
    if missing_keywords:
        print(f"Keywords missing from scratch vocabulary: {missing_keywords}")
    
    if "mercury" in keywords_in_vocab and len(keywords_in_vocab) >= 2:
        print(f"\n=== Demonstrating Word2Vec's Static Nature ===")
        print("Unlike the RNN, which generated different embeddings for 'mercury' in different contexts,")
        print("Word2Vec produces the SAME embedding for 'mercury' regardless of context.")
        
        # Using the SAME hardcoded list as RNN analysis to show the limitation
        # This list represents the actual meaning of mercury in each occurrence (in order)
        mercury_categories = ["mythology", "planet", "metal", "metal", "planet", "toy"]
        
        print("Mercury occurrences in text order (same as RNN analysis):")
        for i, category in enumerate(mercury_categories, 1):
            print(f"  Occurrence {i}: mercury in {category} context")
        
        # Get the single mercury embedding from Word2Vec
        mercury_embedding = scratch_w2v_model.wv["mercury"]
        
        # Create labeled contexts using the same category mapping as RNN
        # BUT all will have identical embeddings (demonstrating Word2Vec limitation)
        mercury_contexts = {}
        label_counts = {"mythology": 0, "planet": 0, "metal": 0, "toy": 0}
        
        for i, category in enumerate(mercury_categories):
            label_counts[category] += 1
            context_label = f"mercury_{category}_{label_counts[category]}"
            mercury_contexts[context_label] = mercury_embedding  # Same embedding for all!
            print(f"  {context_label}: IDENTICAL Word2Vec embedding (limitation)")
        
        print(f"\nKey Difference from RNN:")
        print(f"• RNN: Generated 6 DIFFERENT embeddings based on surrounding context")
        print(f"• Word2Vec: Uses 1 IDENTICAL embedding regardless of context")
        
        # Add other context keywords if available
        other_embeddings = {}
        for kw in ["planet", "toy", "metal", "mythology"]:
            if kw in keywords_in_vocab:
                other_embeddings[kw] = scratch_w2v_model.wv[kw]
        
        # Combine all embeddings
        all_embeddings_w2v = {**mercury_contexts, **other_embeddings}
        all_labels_w2v = list(all_embeddings_w2v.keys())
        
        print(f"\nCreated 'contextual' labels for mercury (all identical embeddings):")
        for label in mercury_contexts.keys():
            print(f"- {label}: Uses same Word2Vec embedding")
        
        print(f"\nAdded context keywords:")
        for label in other_embeddings.keys():
            print(f"- {label}: Separate Word2Vec embedding")
        
        # Create embedding matrix
        embedding_matrix_w2v = np.array(list(all_embeddings_w2v.values()))
        print(f"\nShape of embedding matrix (Word2Vec Context Analysis): {embedding_matrix_w2v.shape}")
        
        # --- Cosine Similarity Heatmap ---
        print("\n--- Cosine Similarity Heatmap (Word2Vec 'Context' Analysis) ---")
        similarity_matrix_w2v = cosine_similarity(embedding_matrix_w2v)
        
        num_labels_w2v = len(all_labels_w2v)
        fig_width_w2v = max(10, num_labels_w2v * 0.8)
        fig_height_w2v = max(8, num_labels_w2v * 0.6)
        plt.figure(figsize=(fig_width_w2v, fig_height_w2v))
        
        annotate_heatmap_w2v = num_labels_w2v < 15
        
        sns.heatmap(similarity_matrix_w2v,
                    annot=annotate_heatmap_w2v,
                    cmap='viridis',
                    fmt=".2f",
                    xticklabels=all_labels_w2v,
                    yticklabels=all_labels_w2v,
                    linewidths=.5,
                    cbar_kws={"shrink": .8},
                    vmin=0,
                    vmax=1)
        plt.title(f'Word2Vec (Scratch) "Context" Analysis - Shows Static Nature', fontsize=16)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.yticks(rotation=0, fontsize=10)
        plt.tight_layout()
        plt.show()
        
        print(f"\n=== Key Observations from Heatmap ===")
        print("• All mercury_* entries show PERFECT similarity (1.00) to each other")
        print("• mercury_mythology_1, mercury_planet_1, mercury_metal_1, etc. are IDENTICAL")
        print("• This demonstrates Word2Vec's inability to distinguish context")
        print("• Compare this to RNN which showed varying similarities:")
        print("  - RNN: mercury_mythology_1 vs mercury_planet_1 = 0.96")
        print("  - RNN: mercury_toy_1 was often isolated from other mercury contexts")
        print("  - Word2Vec: ALL mercury pairs = 1.00 (no distinction)")
        
        # Show specific similarities
        print(f"\n=== Mercury 'Context' Similarities (All Should Be 1.00) ===")
        mercury_labels = [label for label in all_labels_w2v if 'mercury' in label]
        for i, label1 in enumerate(mercury_labels):
            for label2 in mercury_labels[i+1:]:
                sim_idx1 = all_labels_w2v.index(label1)
                sim_idx2 = all_labels_w2v.index(label2)
                similarity = similarity_matrix_w2v[sim_idx1][sim_idx2]
                print(f"'{label1}' vs '{label2}': {similarity:.4f}")
                
        print("\nContrast with the RNN results from the earlier analysis:")
        print(f"• RNN mercury_mythology_1 vs mercury_planet_1: 0.96 (high but not perfect)")
        print(f"• RNN mercury_toy_1 vs others: often 0.30-0.40 (clearly different)")
        print(f"• Word2Vec: ALL mercury comparisons = 1.00 (no contextual distinction)")
        
        # Show mercury relationships to context words
        print(f"\n=== Mercury Relationships to Context Words ===")
        if "mercury" in keywords_in_vocab:
            for context_word in ["mythology", "planet", "metal", "toy"]:
                if context_word in keywords_in_vocab:
                    similarity = scratch_w2v_model.wv.similarity("mercury", context_word)
                    print(f"'mercury' vs '{context_word}': {similarity:.4f}")
        
        # --- PCA Visualization ---
        print("\n--- PCA Visualization (Word2Vec 'Context' Analysis) ---")
        pca_w2v = PCA(n_components=2)
        embeddings_2d_w2v = pca_w2v.fit_transform(embedding_matrix_w2v)
        
        plt.figure(figsize=(fig_width_w2v * 0.9, fig_height_w2v * 0.9))
        
        # Create color map - all mercury contexts will overlap perfectly
        colors_w2v = []
        sizes_w2v = []
        for label in all_labels_w2v:
            if 'mercury' in label:
                colors_w2v.append('red')
                sizes_w2v.append(120)  # Slightly larger so the stacked mercury points stand out
            elif label == 'planet':
                colors_w2v.append('blue')
                sizes_w2v.append(100)
            elif label == 'toy':
                colors_w2v.append('green')
                sizes_w2v.append(100)
            elif label == 'metal':
                colors_w2v.append('orange')
                sizes_w2v.append(100)
            elif label == 'mythology':
                colors_w2v.append('purple')
                sizes_w2v.append(100)
            else:
                colors_w2v.append('gray')
                sizes_w2v.append(100)
        
        plt.scatter(embeddings_2d_w2v[:, 0], embeddings_2d_w2v[:, 1], 
                   alpha=0.7, s=sizes_w2v, c=colors_w2v, edgecolors='black', linewidth=0.5)
        
        # Add annotations - mercury contexts will overlap
        mercury_count = 0
        for i, label in enumerate(all_labels_w2v):
            if 'mercury' in label:
                mercury_count += 1
                # Offset multiple mercury labels slightly to show they're the same point
                offset_x = (mercury_count - 3.5) * 0.05  # Small offset for visibility
                offset_y = (mercury_count - 3.5) * 0.05
                plt.annotate(label, 
                           (embeddings_2d_w2v[i, 0] + offset_x, embeddings_2d_w2v[i, 1] + offset_y), 
                           textcoords="offset points", xytext=(5,5), ha='center', fontsize=9)
            else:
                plt.annotate(label, (embeddings_2d_w2v[i, 0], embeddings_2d_w2v[i, 1]), 
                            textcoords="offset points", xytext=(5,5), ha='center', fontsize=10, fontweight='bold')
        
        plt.title('Word2Vec "Context" Analysis - All Mercury Points Overlap', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='red', edgecolor='black', label='mercury (all contexts - overlapping)'),
            Patch(facecolor='blue', edgecolor='black', label='planet'),
            Patch(facecolor='green', edgecolor='black', label='toy'),
            Patch(facecolor='orange', edgecolor='black', label='metal'),
            Patch(facecolor='purple', edgecolor='black', label='mythology')
        ]
        plt.legend(handles=legend_elements, loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=3)
        
        plt.tight_layout()
        plt.subplots_adjust(bottom=0.15)
        plt.show()
        
        print(f"\n=== PCA Insights ===")
        print("• All mercury_* points appear at the EXACT same location (perfect overlap)")
        print("• This visually demonstrates Word2Vec's context-independence")
        print("• Compare to RNN PCA where mercury contexts were scattered:")
        print("  - RNN mercury_toy_1: appeared isolated in lower left")
        print("  - RNN mercury_planet contexts: clustered with 'planet' keyword")
        print("  - RNN mercury_metal contexts: appeared in different regions")
        print("  - Word2Vec: ALL mercury points stack on top of each other")
        print(f"• Explained variance: {pca_w2v.explained_variance_ratio_}")
        
        print(f"\n=== Summary: Word2Vec vs RNN for Mercury Analysis ===")
        print("RNN Approach:")
        print("  ✓ Generated 6 different embeddings for mercury contexts")
        print("  ✓ Showed varying similarities (0.30-0.96 range)")
        print("  ✓ Could distinguish mythology vs planet vs metal vs toy")
        print("  ✓ mercury_toy_1 was clearly separated from others")
        print("  ✓ Context-dependent representations")
        print("\nWord2Vec Approach:")
        print("  ✗ Generates identical embedding regardless of context")
        print("  ✗ Cannot distinguish between different mercury meanings")
        print("  ✗ All mercury contexts show perfect 1.00 similarity")
        print("  ✗ All mercury points overlap in PCA space")
        print("  ✓ Provides consistent representation across all occurrences")
        print("  ✓ Shows relationships to individual context words")
        print("\nThis demonstrates why sequence models like RNNs have an advantage in handling polysemy!")
        
    else:
        print("Cannot perform mercury context analysis - missing required keywords.")
else:
    print("Scratch Word2Vec model not available. Skipping mercury context analysis.")
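
The perfect 1.00 similarities and the stacked PCA points above follow directly from how a static embedding is retrieved: it is a table lookup keyed by the word alone. The sketch below is a minimal, dependency-free illustration of that idea. The three-dimensional `toy_vectors`, the two example sentences, and the helper `embed_occurrence` are all hypothetical and invented for this example; they are not taken from the trained model.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
# A trained Word2Vec model would supply real, higher-dimensional ones.
toy_vectors = {
    "mercury": [0.9, 0.1, 0.3],
    "planet": [0.8, 0.2, 0.1],
    "metal": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed_occurrence(sentence, target):
    """Static lookup: the sentence is accepted but never consulted."""
    return toy_vectors[target]


planet_sentence = "mercury is the planet closest to the sun"
metal_sentence = "liquid mercury is a toxic metal"

vec_planet_sense = embed_occurrence(planet_sentence, "mercury")
vec_metal_sense = embed_occurrence(metal_sentence, "mercury")

same_vector = vec_planet_sense == vec_metal_sense
similarity = cosine(vec_planet_sense, vec_metal_sense)

print(f"Identical vectors: {same_vector}")
print(f"Cosine similarity between the two occurrences: {similarity:.2f}")
```

Because the surrounding sentence never enters the computation, no choice of sentence can move the vector, which is why every `mercury_*` row in the matrix above is the same row repeated.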

## Part 2: Using a Pre-trained Word2Vec Model

Now, let's load a comprehensive pre-trained Word2Vec model to see how its general English embeddings represent our "Lumina Codex" keywords and compare its performance.

### 5.1 Load Pre-trained Word2Vec Model

In [None]:
model_name_pretrained = 'word2vec-google-news-300'
try:
    print(f"Loading pre-trained Word2Vec model: '{model_name_pretrained}'... (This may take time and memory)")
    pretrained_w2v_model = api.load(model_name_pretrained)
    print(f"Successfully loaded pre-trained Word2Vec model.")
    print(f"Vocabulary size (pre-trained): {len(pretrained_w2v_model.key_to_index)}")
    print(f"Vector dimension (pre-trained): {pretrained_w2v_model.vector_size}")
except Exception as e:
    print(f"Error loading pre-trained Word2Vec model: {e}")
    pretrained_w2v_model = None

### 5.2 Keyword Embedding Analysis (Pre-trained Model)

In [None]:
keywords_to_analyze = ["mercury", "lumina", "star", "sun", "ignis", "ferrus", "iron", "thoth", "orbit", "nebula"]

if pretrained_w2v_model is not None:
    # Filter keywords present in the pre-trained model's vocabulary
    keywords_in_pretrained_vocab = sorted([kw for kw in keywords_to_analyze if kw in pretrained_w2v_model.key_to_index])
    print(f"Keywords for analysis found in pre-trained Word2Vec vocabulary ({len(keywords_in_pretrained_vocab)}): {keywords_in_pretrained_vocab}")
    
    # Note which keywords are missing from pre-trained model
    missing_keywords = [kw for kw in keywords_to_analyze if kw not in pretrained_w2v_model.key_to_index]
    if missing_keywords:
        print(f"Keywords missing from pre-trained vocabulary: {missing_keywords}")

    if len(keywords_in_pretrained_vocab) >= 2:
        embedding_matrix_pretrained = np.array([pretrained_w2v_model[word] for word in keywords_in_pretrained_vocab])
        print(f"\nShape of embedding matrix for analysis (pre-trained Word2Vec): {embedding_matrix_pretrained.shape}")

        # --- 1. Semantic Similarity (Heatmap) ---
        print("\n--- Cosine Similarity Heatmap (Keywords - Pre-trained Word2Vec) ---")
        similarity_matrix_pretrained = cosine_similarity(embedding_matrix_pretrained)
        
        num_labels_pt = len(keywords_in_pretrained_vocab)
        fig_width_pt = max(12, num_labels_pt * 0.7)
        fig_height_pt = max(10, num_labels_pt * 0.5)
        plt.figure(figsize=(fig_width_pt, fig_height_pt))
        
        annotate_heatmap_pt = num_labels_pt < 25
        
        # Enhanced color map with better contrast (matching other analyses)
        sns.heatmap(similarity_matrix_pretrained,
                    annot=annotate_heatmap_pt, 
                    cmap='viridis', 
                    fmt=".2f",
                    xticklabels=keywords_in_pretrained_vocab, 
                    yticklabels=keywords_in_pretrained_vocab,
                    linewidths=.5, 
                    cbar_kws={"shrink": .8},
                    vmin=0,  # Lower color bound; negative cosine similarities are clipped to this color
                    vmax=1)  # Upper color bound (a word compared with itself)
        plt.title(f'Word2Vec (Pre-trained) Cosine Similarity (Keywords - Lumina Codex)', fontsize=16)
        plt.xticks(rotation=65, ha='right', fontsize=max(8, 14 - num_labels_pt // 4))
        plt.yticks(rotation=0, fontsize=max(8, 14 - num_labels_pt // 4))
        plt.tight_layout()
        plt.show()

        # --- Specific Pairwise Similarities for Discussion (matching other analyses) ---
        print("\n--- Specific Pairwise Similarities (Pre-trained Word2Vec) ---")
        pairs_to_check_discussion = [
            ("mercury", "ignis"), ("lumina", "star"), ("star", "sun"),
            ("ignis", "ferrus"), ("mercury", "iron"), ("thoth", "mercury"),
            ("orbit", "lumina"), ("nebula", "lumina")
        ]
        for kw1, kw2 in pairs_to_check_discussion:
            if kw1 in pretrained_w2v_model and kw2 in pretrained_w2v_model:
                sim_val = pretrained_w2v_model.similarity(kw1, kw2)
                print(f"Similarity between '{kw1}' and '{kw2}': {sim_val:.4f}")
            else:
                print(f"Cannot compare '{kw1}' and '{kw2}' (one or both not in pre-trained vocab).")
        
        # --- 2. Clustering Quality (PCA Visualization) ---
        print("\n--- PCA Visualization (Keywords - Pre-trained Word2Vec) ---")
        pca_pretrained = PCA(n_components=2)
        embeddings_2d_pretrained = pca_pretrained.fit_transform(embedding_matrix_pretrained)

        plt.figure(figsize=(fig_width_pt * 0.9, fig_height_pt * 0.9))
        
        # Define keyword categories and colors (matching other analyses)
        keyword_categories = {
            'celestial': ['star', 'sun', 'lumina', 'orbit', 'nebula'],
            'elements': ['mercury', 'ignis', 'ferrus', 'iron'],
            'mythology': ['thoth']
        }
        
        # Create color mapping
        colors = []
        for label in keywords_in_pretrained_vocab:
            if label in keyword_categories['celestial']:
                colors.append('gold')
            elif label in keyword_categories['elements']:
                colors.append('darkred')
            elif label in keyword_categories['mythology']:
                colors.append('purple')
            else:
                colors.append('gray')
        
        # Plot with colors
        plt.scatter(embeddings_2d_pretrained[:, 0], embeddings_2d_pretrained[:, 1], 
                   alpha=0.8, s=100, c=colors, edgecolors='black', linewidth=0.5)
        
        # Add annotations with better positioning
        for i, label in enumerate(keywords_in_pretrained_vocab):
            plt.annotate(label, 
                       (embeddings_2d_pretrained[i, 0], embeddings_2d_pretrained[i, 1]), 
                       textcoords="offset points", 
                       xytext=(5,5), 
                       ha='center', 
                       fontsize=max(8, 12 - num_labels_pt // 5),
                       fontweight='bold')
        
        plt.title('Word2Vec (Pre-trained) Keyword Embeddings (Lumina Codex) - PCA', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Add legend for categories
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='gold', edgecolor='black', label='Celestial'),
            Patch(facecolor='darkred', edgecolor='black', label='Elements'),
            Patch(facecolor='purple', edgecolor='black', label='Mythology')
        ]
        plt.legend(handles=legend_elements, 
                  loc='upper center', 
                  bbox_to_anchor=(0.5, -0.1), 
                  ncol=3,
                  fontsize=10,
                  title='Keyword Categories')
        
        # Adjust layout to prevent legend cutoff
        plt.tight_layout()
        plt.subplots_adjust(bottom=0.15)
        plt.show()
        
        # Print explained variance ratio
        print(f"\nPCA explained variance ratio: {pca_pretrained.explained_variance_ratio_}")
        print(f"Total variance explained: {sum(pca_pretrained.explained_variance_ratio_):.2%}")
    else:
        print("Not enough keywords in pre-trained vocabulary for analysis (need at least 2).")
else:
    print("Pre-trained Word2Vec model not loaded. Skipping keyword analysis.")
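
Two details of the cell above are easy to miss: a pair can only be scored when both words are in the vocabulary, and cosine similarity ranges from -1 to 1, so some pairs can fall below the heatmap's lower color bound of 0. The following sketch illustrates both points with plain Python. The two-dimensional `toy_vectors` and the helper `safe_similarity` are hypothetical, chosen only so that one pair comes out negative and one word is missing; they say nothing about the actual Google News vectors.

```python
import math

# Hypothetical 2-dimensional vectors, invented for illustration only.
toy_vectors = {
    "star": [1.0, 0.2],
    "sun": [0.9, 0.3],
    "iron": [-0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def safe_similarity(vectors, w1, w2):
    """Return the cosine similarity, or None if either word is out of vocabulary."""
    if w1 not in vectors or w2 not in vectors:
        return None
    return cosine(vectors[w1], vectors[w2])


pairs = [("star", "sun"), ("star", "iron"), ("lumina", "star")]
results = {pair: safe_similarity(toy_vectors, *pair) for pair in pairs}

for (w1, w2), sim in results.items():
    if sim is None:
        print(f"Cannot compare '{w1}' and '{w2}' (one or both not in vocabulary).")
    else:
        print(f"Similarity between '{w1}' and '{w2}': {sim:.4f}")
```

Returning `None` rather than raising keeps the reporting loop simple, and the negative score for the second pair is the kind of value that a heatmap with `vmin=0` would render in the same color as 0.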

### 5.3 Detailed Analysis of 'mercury' Occurrences

In [None]:
# === Pre-trained Word2Vec Mercury Context Analysis ===
print("\n=== Pre-trained Word2Vec Mercury Context Analysis - Demonstrating Static vs Contextual Embeddings ===")

# Keywords to analyze - same as RNN analysis
mercury_context_keywords = ["mercury", "planet", "toy", "metal", "mythology"]

if pretrained_w2v_model is not None:
    # Check which keywords are in the vocabulary
    keywords_in_vocab = [kw for kw in mercury_context_keywords if kw in pretrained_w2v_model.key_to_index]
    missing_keywords = [kw for kw in mercury_context_keywords if kw not in pretrained_w2v_model.key_to_index]
    
    print(f"Keywords found in pre-trained Word2Vec vocabulary ({len(keywords_in_vocab)}): {keywords_in_vocab}")
    if missing_keywords:
        print(f"Keywords missing from pre-trained vocabulary: {missing_keywords}")
    
    if "mercury" in keywords_in_vocab and len(keywords_in_vocab) >= 2:
        print(f"\n=== Demonstrating Pre-trained Word2Vec's Static Nature ===")
        print("Like scratch Word2Vec, pre-trained Word2Vec also produces the SAME embedding")
        print("for 'mercury' regardless of context, but with potentially better semantic relationships")
        print("due to training on much larger, diverse corpora.")
        
        # Using the SAME hardcoded list as RNN analysis to show the limitation
        # This list represents the actual meaning of mercury in each occurrence (in order)
        mercury_categories = ["mythology", "planet", "metal", "metal", "planet", "toy"]
        
        print("\nMercury occurrences in text order (same as RNN analysis):")
        for i, category in enumerate(mercury_categories, 1):
            print(f"  Occurrence {i}: mercury in {category} context")
        
        # Get the single mercury embedding from pre-trained Word2Vec
        mercury_embedding = pretrained_w2v_model["mercury"]
        
        # Create labeled contexts using the same category mapping as RNN
        # BUT all will have identical embeddings (demonstrating Word2Vec limitation)
        mercury_contexts = {}
        label_counts = {"mythology": 0, "planet": 0, "metal": 0, "toy": 0}
        
        for i, category in enumerate(mercury_categories):
            label_counts[category] += 1
            context_label = f"mercury_{category}_{label_counts[category]}"
            mercury_contexts[context_label] = mercury_embedding  # Same embedding for all!
            print(f"  {context_label}: IDENTICAL pre-trained Word2Vec embedding")
        
        print(f"\nKey Difference from RNN:")
        print(f"• RNN: Generated 6 DIFFERENT embeddings based on surrounding context")
        print(f"• Pre-trained Word2Vec: Uses 1 IDENTICAL embedding regardless of context")
        
        # Add other context keywords if available
        other_embeddings = {}
        for kw in ["planet", "toy", "metal", "mythology"]:
            if kw in keywords_in_vocab:
                other_embeddings[kw] = pretrained_w2v_model[kw]
        
        # Combine all embeddings
        all_embeddings_w2v_pt = {**mercury_contexts, **other_embeddings}
        all_labels_w2v_pt = list(all_embeddings_w2v_pt.keys())
        
        print(f"\nAdded context keywords:")
        for label in other_embeddings.keys():
            print(f"- {label}: Separate pre-trained Word2Vec embedding")
        
        # Create embedding matrix
        embedding_matrix_w2v_pt = np.array(list(all_embeddings_w2v_pt.values()))
        print(f"\nShape of embedding matrix (Pre-trained Word2Vec Context Analysis): {embedding_matrix_w2v_pt.shape}")
        
        # --- Cosine Similarity Heatmap ---
        print("\n--- Cosine Similarity Heatmap (Pre-trained Word2Vec 'Context' Analysis) ---")
        similarity_matrix_w2v_pt = cosine_similarity(embedding_matrix_w2v_pt)
        
        num_labels_w2v_pt = len(all_labels_w2v_pt)
        fig_width_w2v_pt = max(10, num_labels_w2v_pt * 0.8)
        fig_height_w2v_pt = max(8, num_labels_w2v_pt * 0.6)
        plt.figure(figsize=(fig_width_w2v_pt, fig_height_w2v_pt))
        
        annotate_heatmap_w2v_pt = num_labels_w2v_pt < 15
        
        sns.heatmap(similarity_matrix_w2v_pt,
                    annot=annotate_heatmap_w2v_pt,
                    cmap='viridis',
                    fmt=".2f",
                    xticklabels=all_labels_w2v_pt,
                    yticklabels=all_labels_w2v_pt,
                    linewidths=.5,
                    cbar_kws={"shrink": .8},
                    vmin=0,
                    vmax=1)
        plt.title(f'Word2Vec (Pre-trained) "Context" Analysis - Shows Static Nature', fontsize=16)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.yticks(rotation=0, fontsize=10)
        plt.tight_layout()
        plt.show()
        
        print(f"\n=== Key Observations from Heatmap ===")
        print("• All mercury_* entries show PERFECT similarity (1.00) to each other")
        print("• mercury_mythology_1, mercury_planet_1, mercury_metal_1, etc. are IDENTICAL")
        print("• This demonstrates pre-trained Word2Vec's inability to distinguish context")
        print("• However, relationships to context words may be stronger than scratch model")
        print("• Compare this to RNN which showed varying similarities:")
        print("  - RNN: mercury_mythology_1 vs mercury_planet_1 = 0.96")
        print("  - RNN: mercury_toy_1 was often isolated from other mercury contexts")
        print("  - Pre-trained Word2Vec: ALL mercury pairs = 1.00 (no distinction)")
        
        # Show specific similarities
        print(f"\n=== Mercury 'Context' Similarities (All Should Be 1.00) ===")
        mercury_labels = [label for label in all_labels_w2v_pt if 'mercury' in label]
        for i, label1 in enumerate(mercury_labels):
            for label2 in mercury_labels[i+1:]:
                sim_idx1 = all_labels_w2v_pt.index(label1)
                sim_idx2 = all_labels_w2v_pt.index(label2)
                similarity = similarity_matrix_w2v_pt[sim_idx1][sim_idx2]
                print(f"'{label1}' vs '{label2}': {similarity:.4f}")
        
        # Show mercury relationships to context words
        print(f"\n=== Mercury Relationships to Context Words (Pre-trained) ===")
        if "mercury" in keywords_in_vocab:
            for context_word in ["mythology", "planet", "metal", "toy"]:
                if context_word in keywords_in_vocab:
                    similarity = pretrained_w2v_model.similarity("mercury", context_word)
                    print(f"'mercury' vs '{context_word}': {similarity:.4f}")
        
        # Show top similar words to mercury from full vocabulary
        print(f"\n=== Top 10 Most Similar Words to 'Mercury' (Pre-trained Vocabulary) ===")
        try:
            most_similar = pretrained_w2v_model.most_similar("mercury", topn=10)
            for word, similarity in most_similar:
                print(f"'{word}': {similarity:.4f}")
            print("• Notice: These may include planet names, chemical elements, or mythological terms")
            print("• Pre-trained model captures broader semantic relationships")
        except Exception as e:
            print(f"Error finding most similar words: {e}")
        
        # --- PCA Visualization ---
        print("\n--- PCA Visualization (Pre-trained Word2Vec 'Context' Analysis) ---")
        pca_w2v_pt = PCA(n_components=2)
        embeddings_2d_w2v_pt = pca_w2v_pt.fit_transform(embedding_matrix_w2v_pt)
        
        plt.figure(figsize=(fig_width_w2v_pt * 0.9, fig_height_w2v_pt * 0.9))
        
        # Create color map - all mercury contexts will overlap perfectly
        colors_w2v_pt = []
        sizes_w2v_pt = []
        for label in all_labels_w2v_pt:
            if 'mercury' in label:
                colors_w2v_pt.append('red')
                sizes_w2v_pt.append(120)  # Slightly larger so the stacked points stay visible
            elif label == 'planet':
                colors_w2v_pt.append('blue')
                sizes_w2v_pt.append(100)
            elif label == 'toy':
                colors_w2v_pt.append('green')
                sizes_w2v_pt.append(100)
            elif label == 'metal':
                colors_w2v_pt.append('orange')
                sizes_w2v_pt.append(100)
            elif label == 'mythology':
                colors_w2v_pt.append('purple')
                sizes_w2v_pt.append(100)
            else:
                colors_w2v_pt.append('gray')
                sizes_w2v_pt.append(100)
        
        plt.scatter(embeddings_2d_w2v_pt[:, 0], embeddings_2d_w2v_pt[:, 1], 
                   alpha=0.7, s=sizes_w2v_pt, c=colors_w2v_pt, edgecolors='black', linewidth=0.5)
        
        # Add annotations - mercury contexts will overlap
        mercury_count = 0
        for i, label in enumerate(all_labels_w2v_pt):
            if 'mercury' in label:
                mercury_count += 1
                # Offset multiple mercury labels slightly to show they're the same point
                offset_x = (mercury_count - 3.5) * 0.05  # Small offset for visibility
                offset_y = (mercury_count - 3.5) * 0.05
                plt.annotate(label, 
                           (embeddings_2d_w2v_pt[i, 0] + offset_x, embeddings_2d_w2v_pt[i, 1] + offset_y), 
                           textcoords="offset points", xytext=(5,5), ha='center', fontsize=9)
            else:
                plt.annotate(label, (embeddings_2d_w2v_pt[i, 0], embeddings_2d_w2v_pt[i, 1]), 
                            textcoords="offset points", xytext=(5,5), ha='center', fontsize=10, fontweight='bold')
        
        plt.title('Pre-trained Word2Vec "Context" Analysis - All Mercury Points Overlap', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='red', edgecolor='black', label='mercury (all contexts - overlapping)'),
            Patch(facecolor='blue', edgecolor='black', label='planet'),
            Patch(facecolor='green', edgecolor='black', label='toy'),
            Patch(facecolor='orange', edgecolor='black', label='metal'),
            Patch(facecolor='purple', edgecolor='black', label='mythology')
        ]
        plt.legend(handles=legend_elements, loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=3)
        
        plt.tight_layout()
        plt.subplots_adjust(bottom=0.15)
        plt.show()
        
        print(f"\n=== PCA Insights ===")
        print("• All mercury_* points appear at the EXACT same location (perfect overlap)")
        print("• This visually demonstrates pre-trained Word2Vec's context-independence")
        print("• Compare to RNN PCA where mercury contexts were scattered:")
        print("  - RNN mercury_toy_1: appeared isolated in lower left")
        print("  - RNN mercury_planet contexts: clustered with 'planet' keyword")
        print("  - RNN mercury_metal contexts: appeared in different regions")
        print("  - Pre-trained Word2Vec: ALL mercury points stack on top of each other")
        print(f"• Explained variance: {pca_w2v_pt.explained_variance_ratio_}")
        
        print(f"\n=== Summary: Pre-trained Word2Vec vs RNN for Mercury Analysis ===")
        print("RNN Approach:")
        print("  ✓ Generated 6 different embeddings for mercury contexts")
        print("  ✓ Showed varying similarities (0.30-0.96 range)")
        print("  ✓ Could distinguish mythology vs planet vs metal vs toy")
        print("  ✓ mercury_toy_1 was clearly separated from others")
        print("  ✓ Context-dependent representations")
        print("\nPre-trained Word2Vec Approach:")
        print("  ✗ Generates identical embedding regardless of context")
        print("  ✗ Cannot distinguish between different mercury meanings")
        print("  ✗ All mercury contexts show perfect 1.00 similarity")
        print("  ✗ All mercury points overlap in PCA space")
        print("  ✓ Provides consistent representation across all occurrences")
        print("  ✓ Shows stronger relationships to context words (due to large training corpus)")
        print("  ✓ Captures broader semantic knowledge from diverse training data")
        print("\nKey Advantage over Scratch Word2Vec:")
        print("  ✓ Better semantic relationships due to massive training corpus")
        print("  ✓ More robust word similarities")
        print("  ✗ Still cannot solve the polysemy problem that RNNs attempted to address")
        
    else:
        print("Cannot perform mercury context analysis - missing required keywords.")
else:
    print("Pre-trained Word2Vec model not available. Skipping mercury context analysis.")

## Part 3: Word2Vec Stage Conclusion - Insights from the "Lumina Codex" Experiments

This notebook provided a deep dive into Word2Vec, a cornerstone technique for learning static word embeddings. We took a dual approach:
1.  Training a Word2Vec model from scratch using our custom "Lumina Codex" corpus.
2.  Leveraging the power of a comprehensive pre-trained Word2Vec model (Google News 300-dim).

For both models, we analyzed embeddings for keywords relevant to the "Lumina Codex," focusing on semantic similarity, the nature of Word2Vec's "contextual understanding" (as static representations), vocabulary handling, clustering quality via PCA, and the trainability of the scratch model on our specific corpus. The naive "autocompletion" tests further served to highlight Word2Vec's non-generative nature.

**Key Insights & Performance Summary:**

1.  **Semantic Similarity:**
    * **Scratch Model (Lumina Codex):** The heatmap and specific similarity scores showed that the model learned associations *specific to the "Lumina Codex"*. For example, "lumina" and "star" likely showed high similarity, as did terms frequently co-occurring within that narrative. However, due to the limited data, many unrelated terms might also have shown inflated similarities, reflecting the model's constrained worldview.
    * **Pre-trained Model (Google News):** The heatmap and similarity scores demonstrated a much richer and more general understanding of English semantics. "Lumina Codex" keywords were interpreted based on their general meanings (e.g., "aqua" relating to "water"), providing a broader semantic context than the scratch model.
    * *Insight:* Word2Vec effectively captures semantic similarity, but its quality and generalizability are heavily dependent on the size and diversity of the training corpus.

2.  **"Contextual Understanding" (as a Static Representation):**
    * Word2Vec provides a single, static vector for each word. This vector is an aggregation of all contexts in which the word appeared during training.
    * **Observation:** In the pre-trained model, "aqua" is likely to sit close to "water" or "ocean." This is not dynamic context as in BERT; rather, the static vector for "aqua" has learned to be close to these general concepts. The scratch model's understanding of "aqua" would be based solely on its usage within the "Lumina Codex."
    * *Limitation Highlighted:* This static nature means Word2Vec cannot disambiguate polysemy: "mercury" received one vector whether the text meant the mythological figure, the planet, the metal, or the toy, just as "bank" would get a single vector regardless of its financial or riverside meaning.

3.  **Handling Vocabulary Terms & Novelty (OOV):**
    * **Scratch Model:** Its vocabulary was strictly limited to words that occur at least `min_count` times in the "Lumina Codex." Any keyword that fell below `min_count` or did not appear in the text was OOV.
    * **Pre-trained Model:** Possessed a vast vocabulary (3 million words/phrases from Google News), covering most of our keywords and general English terms. However, highly niche or newly coined fictional terms specific *only* to the "Lumina Codex" (if any existed that weren't common words) could still be OOV.
    * *Limitation Highlighted:* Standard Word2Vec offers no mechanism to generate embeddings for true OOV words not encountered during its training.

4.  **Clustering Quality (PCA Visualization):**
    * **Scratch Model:** The PCA plot showed some thematic groupings reflecting the narrative of the "Lumina Codex" (e.g., astronomical terms together, characters together). However, these clusters might have been less distinct and more influenced by frequent co-occurrences in the single document.
    * **Pre-trained Model:** The PCA plot demonstrated much clearer and more intuitive semantic clustering for general English words. "Lumina Codex" specific keywords were positioned within this broader semantic space based on their general meanings.
    * *Insight:* Larger, diverse training data leads to a more robust and semantically organized embedding space.

5.  **Trainability on Specific Corpus (Scratch Model):**
    * The scratch model successfully trained on the "Lumina Codex," learning vector representations for its vocabulary. The specific similarities and clusters observed (e.g., "Lumina" being close to "star" *within the context of this model*) demonstrate that it captured patterns from this particular text.
    * *Insight:* Word2Vec can be effectively trained on domain-specific corpora to produce embeddings that reflect the nuances of that domain, though these will lack broader world knowledge.
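
The vocabulary behaviour described in point 3 can be sketched in a few lines. This is a simplified illustration of frequency-based filtering, not gensim's actual implementation: the helper `build_vocab`, the three-sentence `toy_corpus`, and the `queries` list are hypothetical and exist only for this example.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical miniature corpus, invented for illustration only.
toy_corpus = [
    ["lumina", "is", "a", "star"],
    ["the", "star", "lumina", "shines"],
    ["thoth", "guided", "the", "scribes"],
]

vocab = build_vocab(toy_corpus, min_count=2)

# "thoth" appears once (below the threshold); "bank" never appears at all.
queries = ["lumina", "thoth", "bank"]
oov = [word for word in queries if word not in vocab]

print(f"Vocabulary: {sorted(vocab)}")
print(f"Out-of-vocabulary queries: {oov}")
```

Both kinds of missing word end up in the same place: a rare word that was filtered out and a word that was never seen are equally absent from the lookup table, so neither has a vector.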

**Word2Vec in the "EmbedEvolution" Context:**

This exploration confirms Word2Vec's role as a highly efficient and effective method for generating **static word embeddings**. It excels at capturing word-level semantic similarity and analogies, especially when leveraging large, pre-trained models. The comparison between our scratch-trained model and the Google News pre-trained model vividly illustrates the impact of data scale on embedding quality and generalizability.

However, the core limitations remain:
* **Static Nature:** Inability to capture context-dependent meanings (polysemy).
* **Non-Sequential:** Not designed for understanding or generating ordered sequences of text, as clearly shown by the naive "autocompletion" experiments (outputs for pre-trained: e.g., `Seed: 'the white rabbit' -> Generated: 'the white rabbit rabbits animals animal Animal Animals Animals_Peta Animals_PETA PETA PeTA founder_Ingrid_Newkirk'`).
* **OOV Handling:** Standard Word2Vec struggles with words not seen during training.

These limitations, particularly the need for dynamic, context-aware representations, were major drivers for the subsequent evolution towards Transformer-based models.
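
To see why chaining nearest neighbours drifts through related words instead of producing a sentence, consider the sketch below. It assumes the naive "autocompletion" worked by repeatedly appending the most similar unused word to the last token; that is a guess at the procedure, not a reproduction of the earlier cell. The function `naive_autocomplete` and the two-dimensional `toy_vectors` are hypothetical.

```python
import math

# Hypothetical 2-dimensional vectors, invented for illustration only.
toy_vectors = {
    "rabbit": [1.0, 0.0],
    "rabbits": [0.95, 0.1],
    "animal": [0.7, 0.6],
    "animals": [0.75, 0.5],
    "codex": [-0.2, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def naive_autocomplete(seed_tokens, steps):
    """Greedily append the unused word most similar to the last token.

    Only the final token is consulted; word order and grammar play no role.
    """
    generated = list(seed_tokens)
    for _ in range(steps):
        last = generated[-1]
        if last not in toy_vectors:
            break  # The last token is out of vocabulary, so there is nothing to look up.
        candidates = [w for w in toy_vectors if w not in generated]
        if not candidates:
            break
        best = max(candidates, key=lambda w: cosine(toy_vectors[last], toy_vectors[w]))
        generated.append(best)
    return generated


completion = naive_autocomplete(["the", "white", "rabbit"], steps=3)
print(" ".join(completion))
```

The procedure has no notion of syntax or of the tokens before the last one, so the output is a walk through semantically similar words, much like the pre-trained output quoted in the list above.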

**Next Steps:**

Having thoroughly examined static word embeddings with Word2Vec, our "EmbedEvolution" journey now takes a significant leap. We will dive into the world of **contextual embeddings** with **BERT (Bidirectional Encoder Representations from Transformers)**. BERT revolutionized NLP by providing word representations that dynamically change based on the surrounding words, addressing polysemy and capturing much deeper linguistic nuances.