# EmbedEvolution Stage 1: RNN Embeddings

Welcome to the first stage of EmbedEvolution! Here, we explore how early sequence models like Recurrent Neural Networks (RNNs) began to capture context in language, going beyond static word vectors. We'll train a simple RNN, extract contextual representations, measure semantic similarity, and visualize the results.

**Goal:** Understand how RNNs process sequences and generate context-dependent embeddings, observing their strengths and weaknesses (especially with long-range dependencies).
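
To make "context-dependent" concrete before we reach Keras, here is a minimal sketch of the recurrence a simple RNN applies at every time step, `h_t = tanh(W_x x_t + W_h h_{t-1} + b)`. The weights and two-dimensional "word vectors" are hypothetical toy values, not learned ones, and `rnn_step` / `run_rnn` are illustrative helper names rather than Keras functions.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_x[j][k] * x_t[k] for k in range(len(x_t)))
        total += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, W_x, W_h, b):
    """Feed a sequence through the recurrence and return the final hidden state."""
    h = [0.0] * len(b)  # the hidden state starts at zero
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h

# Hypothetical toy parameters: 2-dim inputs, 2 hidden units
W_x = [[0.5, -0.3], [0.2, 0.8]]
W_h = [[0.1, 0.4], [-0.6, 0.3]]
b = [0.0, 0.0]

word_a = [1.0, 0.0]
word_b = [0.0, 1.0]

# Both sequences end with word_b, but the preceding word differs
h_ab = run_rnn([word_a, word_b], W_x, W_h, b)
h_bb = run_rnn([word_b, word_b], W_x, W_h, b)
print("state after (a, b):", h_ab)
print("state after (b, b):", h_bb)
```

The two final states differ even though the last input is the same, which is the property the rest of this stage relies on when it treats the final hidden state as a contextual embedding.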

## 1. Setup and Imports

Import the necessary libraries.

In [None]:
import re
import numpy as np
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

print("TensorFlow Version:", tf.__version__)

## 2. Define the Corpus
We'll use a sample text to train our RNN and extract embeddings. It is a slightly longer, fictional passage, chosen to help illustrate context better.

In [None]:
text = """
The Lumina Codex and the Solara System: A Tapestry of Ancient Wisdom and Cosmic Discovery
In the shadowed halls of the Cairo Museum, a dusty papyrus scroll, cataloged as Papyrus K-37b from the Middle Kingdom, lay forgotten for centuries. Dubbed the Lumina Codex by its discoverers, this fragile relic was initially dismissed as a mythological curiosity, its cryptic hieroglyphs and star charts interpreted as poetic musings of a priestly scribe. Yet, in 2024, a team of linguists and astronomers, led by Dr. Amara Nassar, deciphered its veiled verses, revealing an astonishing truth: the codex described a distant star system with uncanny precision, orbiting a radiant G-type star named the Star of Eternal Radiance—now known as Lumina. Intriguingly, the codex also spoke of a divine figure, "the Swift One," bearing a staff entwined with serpents, reminiscent of the god Mercury in Roman mythology or Thoth in Egyptian lore. This messenger, often depicted as Thoth the scribe of the gods, is said to have imparted the knowledge of the stars to the ancient scribes, guiding their hands in creating the star charts. This revelation sparked a scientific odyssey, merging ancient Egyptian cosmology with cutting-edge astronomy, as the Solara System emerged from the Nebula Cygnus-X1, nestled in the Orion Spur of the Milky Way Galaxy.

The Lumina Codex spoke of Lumina and its ten celestial attendants, organized into poetic regions: the searing Forges of Ra for the inner worlds, the verdant Blessed Belt of Osiris for the habitable zone, the majestic Domains of the Sky Titans for gas giants, and the enigmatic Frozen Outlands for the outer realms. Its star charts, etched with meticulous care, hinted at a cosmic map, with references to the Rivers of Stars—likely the Milky Way—and the Celestial Gardens, evoking the Local Group within the Virgo Supercluster. The codex's verses, such as "Ten jewels dance in the embrace of the Eternal Radiance, their faces veiled in fire, water, and ice," seemed to prefigure a system now confirmed by the Cygnus-X1 Deep Sky Array, a fictional next-generation telescope orbiting beyond Earth's atmosphere.

Discovery and Modern Corroboration
The Solara System's discovery began in 2023, when the Cygnus-X1 Deep Sky Array detected subtle wobbles in Lumina's light, indicating a complex system of orbiting bodies. Located 1,200 light-years away in the Nebula Cygnus-X1, Lumina is a stable, middle-aged G-type star, slightly larger than the Sun, with a luminosity that sustains a diverse array of worlds. As astronomers analyzed the data, they identified ten planets, each with unique characteristics that eerily echoed the Lumina Codex. The parallels were undeniable: the codex's Forges of Ra matched the inner rocky planets, while the Blessed Belt of Osiris aligned with two habitable worlds teeming with life. The Domains of the Sky Titans and Frozen Outlands described gas giants and icy dwarfs with striking accuracy. The scientific community buzzed with excitement, as linguists and astronomers collaborated to decode the codex's metaphors, revealing a blend of ancient intuition and cosmic truth.

The Solara System: A Celestial Menagerie
Lumina: The Star of Eternal Radiance

Lumina, a G2V star, radiates a warm, golden light, its stable fusion cycle supporting a system spanning 12 astronomical units. Its magnetic fields are unusually calm, suggesting a long lifespan conducive to life's evolution. The codex describes Lumina as "the hearth of eternity, whose breath kindles the dance of worlds," a poetic nod to its life-giving energy.

The Forges of Ra: Inner Planets
1- Ignis: The closest planet to Lumina, Ignis is a scorched, iron-rich world with a molten surface pocked by ancient impact craters, reminiscent of the planet Mercury in our own solar system. Its thin atmosphere, rich in sulfur dioxide, glows faintly under Lumina's intense radiation. The codex calls it "Ra's Anvil, where molten rivers forge the bones of the cosmos," reflecting its volcanic past and metallic crust.
2- Ferrus: Slightly larger, Ferrus is a rocky planet with vast plains of oxidized iron, giving it a crimson hue. Its surface bears scars of past tectonic activity, with towering cliffs and deep chasms. The codex names it "the Forge of Hephaestus's Twin," hinting at its metallic wealth, now confirmed by spectroscopic analysis revealing nickel and cobalt deposits.

The Blessed Belt of Osiris: Habitable Zone
1- Aqua: A breathtaking ocean world, Aqua is enveloped in turquoise clouds of water vapor and nitrogen. Its surface is 90% liquid water, with archipelagos of coral-like structures hosting complex aquatic ecosystems. Bioluminescent Aquarelles, jellyfish-like creatures with crystalline tentacles, drift in vast schools, their light pulses synchronizing in rhythmic displays. Predatory Thalacynths, eel-like organisms with electromagnetic sensors, hunt in the deep trenches. Aqua's moon, Thalassa, is an ice-covered world with a subglacial ocean, where astrobiologists hypothesize microbial extremophiles thrive in hydrothermal vents, metabolizing sulfur compounds. The codex describes Aqua as "Osiris's Chalice, where life swims in the tears of the gods," and Thalassa as "the frozen veil hiding the spark of creation."
2- Veridia: A super-Earth, Veridia boasts lush continents of bioluminescent flora, such as Luminara trees, which pulse with green and violet light, and Crystalferns, whose fractal leaves refract Lumina's rays into dazzling spectra. Veridia is home to the Sylvans, sentient, silicon-based life forms resembling ambulatory crystal shrubs. Their bodies, composed of lattice-like structures, shimmer with bioluminescent patterns used for communication. Intriguingly, the Sylvans' technology incorporates liquid mercury, a metal known for its unique properties, in their communication devices. This allows them to transmit their bioluminescent patterns through conductive channels, enhancing their collective consciousness. Sylvan society is decentralized, with "groves" of individuals linked via light-based signals, forming a collective consciousness deeply attuned to Veridia's ecosystem. Their architecture, grown from crystalline minerals, integrates seamlessly with the landscape. The codex calls Veridia "the Garden of Osiris's Breath," where "the shining ones weave light into wisdom."

The Domains of the Sky Titans: Gas Giants
1- Zephyrus: A massive hydrogen-helium gas giant, Zephyrus dominates the system with its radiant ring system, composed of ice and silicate particles. Its atmosphere swirls with golden storms, driven by intense winds. Among its 47 moons, Io-Prime stands out, a volcanically active world spewing sulfur plumes, likely powered by tidal heating. The codex names Zephyrus "the Sky Titan's Crown," its rings "the jeweled girdle of the heavens."
2- Boreas: An ice giant with a deep blue methane atmosphere, Boreas exhibits retrograde rotation and an asymmetrical magnetic field, creating auroras that dance across its poles. Its 22 moons include Erynnis, a rocky moon with methane lakes. The codex describes Boreas as "the Frost Titan, whose breath chills the void," capturing its icy majesty.

The Frozen Outlands: Outer Planets
1- Umbriel: A dwarf planet with a charcoal-dark surface, Umbriel's icy crust is fractured by ancient impacts. Its moon Nyx, a captured object, is rich in organic compounds, hinting at prebiotic chemistry. The codex calls Umbriel "the Shadowed Outcast, guarded by the dark sentinel."
2- Erebus: An icy world with a nitrogen-methane atmosphere, Erebus has a highly elliptical orbit, suggesting a captured origin. Its surface sparkles with frost-covered ridges. The codex names it "the Silent Wanderer, cloaked in eternal frost."
3- Aetheria: The outermost planet, Aetheria is a rogue dwarf with a thin atmosphere of neon and argon. Its moon Lethe exhibits cryovolcanism, spewing ammonia-water mixtures. Astrobiologists speculate that Lethe's subsurface ocean may harbor microbial life, analogous to Thalassa's. The codex describes Aetheria as "the Veiled Wanderer, whose dreams freeze in the outer dark," and Lethe as "the weeping mirror of the cosmos."
4- Nyxara: A small, icy body with a chaotic orbit, Nyxara's surface is a mosaic of frozen nitrogen and carbon monoxide. The codex calls it "the Lost Jewel, dancing beyond the Titans' gaze."

Life in the Solara System
Aqua's aquatic ecosystems are a marvel, with Aquarelles forming symbiotic networks with coral-like Hydroskeletons, which filter nutrients from the water. Thalacynths use electromagnetic pulses to stun prey, suggesting an evolutionary arms race. On Thalassa, microbial life is hypothesized based on chemical signatures of sulfur and methane in its subglacial ocean, though no direct evidence exists yet.
Veridia's Sylvans are the system's crown jewel. Their crystalline bodies, averaging two meters tall, refract light into complex patterns, encoding emotions, ideas, and memories. Their society operates as a "luminous collective," with no central authority; decisions emerge from synchronized light displays across groves. Sylvan technology manipulates crystalline minerals and liquid mercury to create tools and habitats, all in harmony with Veridia's ecosystem. Their discovery has sparked intense study by linguists decoding their light-based language, revealing a philosophy centered on balance and interconnectedness.
On Lethe, cryovolcanic activity suggests a subsurface ocean with potential microbial ecosystems, possibly metabolizing ammonia. Unlike Aqua's confirmed complex life and Veridia's sentient Sylvans, life on Thalassa and Lethe remains speculative, driving astrobiological research.

Galactic Context
The Solara System resides in the Orion Spur, a minor arm of the Milky Way, part of the Local Group within the Virgo Supercluster. The codex's Rivers of Stars evoke the Milky Way's spiral arms, while the Celestial Gardens suggest a poetic grasp of the Local Group's galactic cluster. This cosmic placement underscores Solara's significance as a microcosm of the universe's diversity.

Ongoing Exploration
Scientific teams, including astrobiologists, geologists, and linguists, are studying Solara via the Cygnus-X1 Deep Sky Array and planned probes, such as the Lumina Pathfinder Mission. Challenges include the 1,200-light-year distance, requiring advanced telemetry for data transmission. Sylvan communication poses a unique hurdle, as their light patterns defy traditional linguistic models. Future missions aim to deploy orbiters around Aqua, Veridia, and Lethe to confirm microbial life and study Sylvan culture.

A Cosmic Tapestry
The Solara System, unveiled through the Lumina Codex and modern astronomy, blends ancient wisdom with scientific discovery. Its worlds—from the fiery Forges of Ra to the icy Frozen Outlands—offer a rich tapestry of environments, life forms, and mysteries. As scientists probe this distant system, the codex's poetic verses resonate, reminding humanity that the cosmos has long whispered its secrets, awaiting those bold enough to listen.

To share this cosmic tapestry with the world, an educational initiative has launched the Solara Explorer Kit for children. The kit includes detailed models of the ten planets, with Ignis crafted to resemble the planet Mercury as a toy for the kids to play and learn, it's complete with its cratered surface and a tiny, movable rover that kids can roll across its terrain. A central figure of Lumina, the star, shines brightly, paired with a glow-in-the-dark Mercury space blaster that shoots safe, squishy foam darts, letting kids pretend they're defending the Solara System from space invaders. This kit aims to spark curiosity in young minds, inviting them to explore the night sky and imagine distant worlds through hands-on fun.
"""

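# Clean the text slightly

text = text.lower()
text = re.sub(r'[-\u2014]', ' ', text) # Turn hyphens and dashes into spaces so joined words are not fused into one token
text = re.sub(r'[^a-z\s]', '', text) # Keep only lowercase letters and whitespace
text = re.sub(r'\s+', ' ', text).strip() # Collapse whitespace runs into single spaces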

print("Cleaned Text Sample:")
print(text[:300] + "...")

## 3. Text Preprocessing

We need to convert the text into numerical sequences that the RNN can process.
1.  **Tokenization:** Split the text into words (tokens).
2.  **Vocabulary Creation:** Build a mapping from unique words to integer indices.
3.  **Sequence Generation:** Create input sequences (e.g., sequences of 5 words) and corresponding target words (the 6th word) for training the RNN to predict the next word.
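
The three steps above can be sketched on a toy sentence in plain Python. The helpers `build_vocab` and `make_windows` are illustrative names, and the indices they produce are assigned in order of first appearance, whereas the Keras `Tokenizer` orders words by frequency, so the actual numbers in the notebook will differ.

```python
def build_vocab(tokens):
    """Map each unique token to an integer, starting at 1 (0 is left free for padding)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab

def make_windows(tokens, seq_length):
    """Slide a window over the tokens: seq_length inputs, then the next word as target."""
    pairs = []
    for i in range(seq_length, len(tokens)):
        pairs.append((tokens[i - seq_length:i], tokens[i]))
    return pairs

# Toy sentence (hypothetical, loosely echoing the corpus)
toy_tokens = "the codex spoke of lumina and its ten attendants".split()
toy_vocab = build_vocab(toy_tokens)
toy_pairs = make_windows(toy_tokens, seq_length=5)

# Encode every (input window, target word) pair as integers
encoded = [([toy_vocab[t] for t in inp], toy_vocab[tgt]) for inp, tgt in toy_pairs]

for (inp, tgt), (x, y_idx) in zip(toy_pairs, encoded):
    print(inp, "->", tgt, "|", x, "->", y_idx)
```

Nine tokens with a window of five give four training pairs; in general a corpus of `N` tokens yields `N - seq_length` pairs.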

In [None]:
tokenizer = Tokenizer(oov_token="<unk>") # Add OOV token for any novel terms during generation
tokenizer.fit_on_texts([text])
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1

print(f"Vocabulary Size: {vocab_size}")

tokens = text.split(' ')
seq_length = 5 # Length of input sequence for prediction, can be tuned
sequences = []

for i in range(seq_length, len(tokens)):
    input_seq_tokens = tokens[i-seq_length:i]
    target_word_token = tokens[i]
    sequences.append((input_seq_tokens, target_word_token))

print(f"Number of sequences: {len(sequences)}")
if sequences:
    print("Sample sequence input:", sequences[0][0])
    print("Sample sequence output:", sequences[0][1])

X_list = []
y_list = []
for input_tokens, target_token in sequences:
    # Convert list of token strings to list of indices
    current_X_indices = [word_index.get(token, word_index["<unk>"]) for token in input_tokens]
    X_list.append(current_X_indices)
    y_list.append(word_index.get(target_token, word_index["<unk>"]))

X = pad_sequences(X_list, maxlen=seq_length, padding='pre')
y = np.array(y_list)

if X.size > 0:
    print("\nSample encoded X (after padding):", X[0])
    print("Sample encoded y:", y[0])
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

## 4. Build the Simple RNN Model
We'll create a sequential model with:
1.  An `Embedding` layer: Maps word indices to dense vectors.
2.  A `SimpleRNN` layer: Processes the sequence of embeddings.
3.  A `Dense` layer: Outputs predictions for the next word over the vocabulary.
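
As a rough sanity check on the size of this architecture, the trainable parameter count of each layer can be worked out by hand. The sketch below assumes the Keras defaults (bias terms enabled) and uses a hypothetical `vocab_size` of 1000; substitute the vocabulary size printed in Section 3 to compare against `model.summary()`.

```python
def count_parameters(vocab_size, embedding_dim, rnn_units):
    """Trainable parameters for Embedding -> SimpleRNN -> Dense(softmax)."""
    # One embedding_dim-sized vector per vocabulary index
    embedding = vocab_size * embedding_dim
    # Input-to-hidden weights + hidden-to-hidden weights + one bias per unit
    simple_rnn = rnn_units * embedding_dim + rnn_units * rnn_units + rnn_units
    # Hidden-to-vocabulary weights + one bias per vocabulary entry
    dense = rnn_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "simple_rnn": simple_rnn,
        "dense": dense,
        "total": embedding + simple_rnn + dense,
    }

# vocab_size=1000 is a placeholder assumption, not the notebook's real value
counts = count_parameters(vocab_size=1000, embedding_dim=100, rnn_units=128)
for layer, n in counts.items():
    print(f"{layer:>10}: {n:,}")
```

Note that the recurrent layer is the smallest of the three here: most parameters sit in the embedding table and the output projection, both of which scale with the vocabulary size.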

In [None]:
embedding_dim = 100 # Dimensionality of the learned word vectors
rnn_units = 128   # Number of units in the SimpleRNN layer (memory capacity)

# Ensure X is not empty before proceeding
if X.shape[0] == 0:
    print("Error: No training sequences generated. Check text preprocessing and corpus size.")
    model = None # Or handle error appropriately
else:
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dim),  # input_length is deprecated in recent Keras; model.build() below fixes the input shape
        SimpleRNN(units=rnn_units),
        Dense(units=vocab_size, activation='softmax')
    ])
    model.build(input_shape=(None, seq_length))
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()

## 5. Train the RNN Model

We'll train the model for a fixed number of epochs (`epochs = 100` by default). The goal isn't perfect prediction, but rather to get the Embedding and RNN layers to learn some representation of the sequence context.
*Note: Training RNNs on larger texts can be time-consuming. We use a small dataset here so that the full run stays short.*
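
When reading the training log, it helps to know what the loss value means. The model is compiled with `sparse_categorical_crossentropy`, which for a single example is the negative log of the probability assigned to the correct next word. The sketch below reproduces that arithmetic in plain Python on hypothetical logits for a four-word vocabulary; the function names are illustrative, not Keras APIs.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cce(probs, target_index):
    """Cross-entropy for one example with an integer target label."""
    return -math.log(probs[target_index])

# Hypothetical scores over a 4-word vocabulary
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)

loss_likely = sparse_cce(probs, 0)    # target is the word the model favours
loss_unlikely = sparse_cce(probs, 2)  # target is the word the model rates lowest

# A model that guesses uniformly has loss ln(vocabulary size)
chance_loss = math.log(len(logits))

print("probabilities:", [round(p, 3) for p in probs])
print("loss (likely target):", round(loss_likely, 3))
print("loss (unlikely target):", round(loss_unlikely, 3))
print("chance-level loss:", round(chance_loss, 3))
```

A useful reference point is `ln(vocab_size)`: a training loss well below that value indicates the model has learned something beyond uniform guessing, even if its accuracy remains modest.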

In [None]:
epochs = 100 # Adjust as needed for better results vs. training time
batch_size = 64
print("\nTraining RNN model...")

history = model.fit(X, y,
                    epochs=epochs,
                    batch_size=batch_size,
                    verbose=1)
print("Model training complete.")

## 6. Analyzing Contextual Keyword Embeddings: Similarity Heatmap & PCA Visualization

We'll extract the RNN's hidden state after processing short sequences ending with (or containing) our target keywords from the "Lumina Codex." This hidden state acts as a contextual embedding for that keyword in its specific preceding context. We will then:
1.  Visualize the cosine similarities between these contextual keyword embeddings using a heatmap.
2.  Visualize the embeddings themselves in 2D using PCA to observe potential clustering.

This helps evaluate semantic similarity, contextual understanding, and clustering quality.
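
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction only and lies between -1 and 1. The sketch below computes a small similarity matrix in plain Python; the three-dimensional vectors are hypothetical stand-ins for RNN hidden states (which here have `rnn_units` dimensions), and `sklearn.metrics.pairwise.cosine_similarity` performs the same pairwise computation on real arrays.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all vectors, as a nested list."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical toy "hidden states"; real values come from the trained RNN
toy_embeddings = {
    "star": [0.9, 0.1, -0.2],
    "sun": [0.8, 0.2, -0.1],
    "iron": [-0.3, 0.7, 0.5],
}
labels = list(toy_embeddings)
sim = similarity_matrix([toy_embeddings[k] for k in labels])

for label, row in zip(labels, sim):
    print(f"{label:>5}", [round(x, 2) for x in row])
```

Because `SimpleRNN` uses a tanh activation by default, hidden-state components can be negative, so negative similarities between contexts are possible and are not a sign of an error.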

In [None]:
keywords_to_analyze = ["mercury", "lumina", "star", "sun", "ignis", "ferrus", "iron", "thoth", "orbit", "nebula"]

# Ensure prerequisite variables from previous cells are available:
# model, tokenizer, word_index, tokens, seq_length, rnn_units

if 'model' not in locals() or model is None:
    print("Model not trained. Please run model training cell first.")
elif 'tokenizer' not in locals() or 'word_index' not in locals():
    print("Tokenizer or word_index not available. Please run text preprocessing cell.")
else:
    # Debug: Check which keywords are in the vocabulary
    keywords_in_vocab = [kw for kw in keywords_to_analyze if kw in word_index]
    if not keywords_in_vocab:
        print("Warning: None of the specified keywords_to_analyze are in the trained vocabulary.")
    else:
        print(f"Keywords from list found in vocabulary for analysis ({len(keywords_in_vocab)} words): {keywords_in_vocab}")

        # Function to find short sequences in the text containing a keyword
        def get_sequences_for_keywords(text_tokens, keywords, seq_len, max_sequences_per_keyword=1):
            keyword_sequences = {}
            for kw in keywords:
                kw_sequences_found = []
                # Case-insensitive search
                for i in range(len(text_tokens) - seq_len + 1):
                    window = text_tokens[i : i + seq_len]
                    # Check if keyword is in window (case-insensitive)
                    if any(token.lower() == kw.lower() for token in window):
                        kw_sequences_found.append(window)
                        if max_sequences_per_keyword and len(kw_sequences_found) >= max_sequences_per_keyword:
                            break
                if kw_sequences_found:
                    if max_sequences_per_keyword == 1:
                        label = kw
                        if label not in keyword_sequences:
                            keyword_sequences[label] = kw_sequences_found[0]
                    else:
                        for idx, seq in enumerate(kw_sequences_found):
                            label = f"{kw}_{idx+1}"
                            if label not in keyword_sequences:
                                keyword_sequences[label] = seq
            return keyword_sequences

        embedding_layer = model.layers[0]
        rnn_layer = model.layers[1]
        rnn_output_model = Sequential([embedding_layer, rnn_layer])
        rnn_output_model.build(input_shape=(None, seq_length))

        def get_rnn_contextual_embedding(token_list):
            token_list_str = [str(t) for t in token_list]
            encoded_sequence = [word_index.get(token, word_index.get("<unk>", 0)) for token in token_list_str]
            padded_sequence = pad_sequences([encoded_sequence], maxlen=seq_length, padding='pre')
            if padded_sequence.shape[1] == 0:
                return np.zeros(rnn_units)
            return rnn_output_model.predict(padded_sequence, verbose=0)[0]

        # === Section 1: General Keyword Analysis ===
        print("\n=== Section 1: General Keyword Analysis ===")
        contextual_phrase_dict = get_sequences_for_keywords(tokens, keywords_in_vocab, seq_length, max_sequences_per_keyword=1)

        embeddings_for_analysis = {}
        plot_labels_for_analysis = []

        print("\nExtracting RNN hidden states for keyword contexts:")
        for name, phrase_tokens in contextual_phrase_dict.items():
            if not phrase_tokens:
                print(f"Warning: No tokens for '{name}'. Skipping.")
                continue
            embedding = get_rnn_contextual_embedding(phrase_tokens)
            embeddings_for_analysis[name] = embedding
            plot_labels_for_analysis.append(name)
            print(f"- Extracted embedding for '{name}' using context: '{' '.join(phrase_tokens)}'")

        if embeddings_for_analysis and len(embeddings_for_analysis) >= 2:
            embedding_matrix_analysis = np.array(list(embeddings_for_analysis.values()))
            print("\nShape of embedding matrix for analysis:", embedding_matrix_analysis.shape)

            # --- 1. Semantic Similarity & 2. Contextual Understanding (via Heatmap) ---
            print("\n--- Cosine Similarity Heatmap (Contextual Keywords) ---")
            similarity_matrix_keywords = cosine_similarity(embedding_matrix_analysis)
            
            num_labels = len(plot_labels_for_analysis)
            fig_width = max(12, num_labels * 0.7)
            fig_height = max(10, num_labels * 0.5)
            plt.figure(figsize=(fig_width, fig_height))
            
            annotate_heatmap = num_labels < 25

            # Enhanced color map with better contrast
            sns.heatmap(similarity_matrix_keywords,
                        annot=annotate_heatmap,
                        cmap='viridis', 
                        fmt=".2f",
                        xticklabels=plot_labels_for_analysis,
                        yticklabels=plot_labels_for_analysis,
                        linewidths=.5,
                        cbar_kws={"shrink": .8},
                        vmin=-1,  # Cosine similarity spans -1 to 1; tanh hidden states can give negative values
                        vmax=1)   # Fixed upper bound so colors are comparable across runs
            plt.title('RNN Cosine Similarity (Contextual Keywords - Lumina Codex)', fontsize=16)
            plt.xticks(rotation=65, ha='right', fontsize=max(8, 14 - num_labels // 4))
            plt.yticks(rotation=0, fontsize=max(8, 14 - num_labels // 4))
            plt.tight_layout()
            plt.show()

            # Print a few specific similarities to discuss
            print("\n--- Specific Pairwise Similarities for Discussion ---")
            pairs_to_check_for_discussion = [
                ("mercury", "ignis"), ("lumina", "star"), ("star", "sun"),
                ("ignis", "ferrus"), ("mercury", "iron"), ("thoth", "mercury"),
                ("orbit", "lumina"), ("nebula", "lumina")
            ]
            for kw1_label, kw2_label in pairs_to_check_for_discussion:
                if kw1_label in embeddings_for_analysis and kw2_label in embeddings_for_analysis:
                    sim_val = cosine_similarity(
                        embeddings_for_analysis[kw1_label].reshape(1,-1),
                        embeddings_for_analysis[kw2_label].reshape(1,-1)
                    )[0][0]
                    print(f"Similarity between '{kw1_label}' and '{kw2_label}': {sim_val:.4f}")

            # --- 3. Clustering Quality (PCA Visualization) ---
            print("\n--- PCA Visualization (Contextual Keywords) ---")
            pca = PCA(n_components=2)
            embeddings_2d_pca = pca.fit_transform(embedding_matrix_analysis)

            plt.figure(figsize=(fig_width * 0.9, fig_height * 0.9))
            
            # Define keyword categories and colors
            keyword_categories = {
                'celestial': ['star', 'sun', 'lumina', 'orbit', 'nebula'],
                'elements': ['mercury', 'ignis', 'ferrus', 'iron'],
                'mythology': ['thoth']
            }
            
            # Create color mapping
            colors = []
            for label in plot_labels_for_analysis:
                if label in keyword_categories['celestial']:
                    colors.append('gold')
                elif label in keyword_categories['elements']:
                    colors.append('darkred')
                elif label in keyword_categories['mythology']:
                    colors.append('purple')
                else:
                    colors.append('gray')
            
            # Plot with colors
            plt.scatter(embeddings_2d_pca[:, 0], embeddings_2d_pca[:, 1], 
                       alpha=0.8, s=100, c=colors, edgecolors='black', linewidth=0.5)
            
            # Add annotations with better positioning
            for i, label in enumerate(plot_labels_for_analysis):
                plt.annotate(label, 
                           (embeddings_2d_pca[i, 0], embeddings_2d_pca[i, 1]), 
                           textcoords="offset points", 
                           xytext=(5,5), 
                           ha='center', 
                           fontsize=max(8, 12 - num_labels // 5),
                           fontweight='bold')
            
            plt.title('RNN Contextual Keyword Embeddings (Lumina Codex) - PCA', fontsize=16)
            plt.xlabel('PCA Component 1', fontsize=12)
            plt.ylabel('PCA Component 2', fontsize=12)
            plt.grid(True, alpha=0.3)
            
            # Add legend for categories
            from matplotlib.patches import Patch
            legend_elements = [
                Patch(facecolor='gold', edgecolor='black', label='Celestial'),
                Patch(facecolor='darkred', edgecolor='black', label='Elements'),
                Patch(facecolor='purple', edgecolor='black', label='Mythology')
            ]
            plt.legend(handles=legend_elements, 
                      loc='upper center', 
                      bbox_to_anchor=(0.5, -0.1), 
                      ncol=3,
                      fontsize=10,
                      title='Keyword Categories')
            
            # Adjust layout to prevent legend cutoff
            plt.tight_layout()
            plt.subplots_adjust(bottom=0.15)
            plt.show()
            
            # Print explained variance ratio
            print(f"\nPCA explained variance ratio: {pca.explained_variance_ratio_}")
            print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")
        else:
            print("Not enough keyword embeddings (need at least 2) for similarity matrix or PCA plot.")

## 7. Detailed Analysis of 'mercury' Occurrences
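
The word "mercury" appears in the corpus in several senses (a god, a planet, a metal, and a toy), so each occurrence needs its own context window. The sketch below shows one simple way to collect such windows, assuming each window should end on the keyword so that the RNN's final hidden state is computed with the keyword as its last input. The helper `windows_ending_at` and the toy sentence are illustrative only; the cell that follows uses its own, more general search.

```python
def windows_ending_at(tokens, keyword, seq_len):
    """Return (position, window) pairs where each window ends on `keyword`."""
    results = []
    for pos, tok in enumerate(tokens):
        # Skip occurrences too close to the start to fill a whole window
        if tok == keyword and pos + 1 >= seq_len:
            results.append((pos, tokens[pos + 1 - seq_len:pos + 1]))
    return results

# Toy token list (hypothetical, paraphrasing phrases from the corpus)
toy = ("the god mercury in roman mythology and the planet mercury in our "
       "solar system and liquid mercury a metal").split()

hits = windows_ending_at(toy, "mercury", seq_len=3)
for pos, window in hits:
    print(pos, window)
```

Each window shares the same final token but differs in what precedes it, which is exactly the variation the contextual embeddings are expected to reflect.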

In [None]:
# Keywords to analyze - expanded to include toy, metal, mythology
keywords_to_analyze = ["mercury", "planet", "toy", "metal", "mythology", "lumina", "star", "sun", "ignis", "ferrus", "iron", "thoth", "orbit", "nebula"]

# Ensure prerequisite variables from previous cells are available:
# model, tokenizer, word_index, tokens, seq_length, rnn_units

if 'model' not in locals() or model is None:
    print("Model not trained. Please run model training cell first.")
elif 'tokenizer' not in locals() or 'word_index' not in locals():
    print("Tokenizer or word_index not available. Please run text preprocessing cell.")
else:
    # Debug: Check which keywords are in the vocabulary
    keywords_in_vocab = [kw for kw in keywords_to_analyze if kw in word_index]
    if not keywords_in_vocab:
        print("Warning: None of the specified keywords_to_analyze are in the trained vocabulary.")
    else:
        print(f"Keywords from list found in vocabulary for analysis ({len(keywords_in_vocab)} words): {keywords_in_vocab}")

        # Debug: Check for all "mercury" occurrences in tokens
        mercury_positions = [i for i, token in enumerate(tokens) if token.lower() == "mercury"]
        print(f"\nDiagnostic: Found {len(mercury_positions)} occurrences of 'mercury' in tokens at positions: {mercury_positions}")

        # Updated function to find sequences for each unique occurrence of a keyword
        def get_sequences_for_keywords(text_tokens, keywords, seq_len, max_sequences_per_keyword=None):
            keyword_sequences = {}
            for kw in keywords:
                # Find all positions of the keyword (case-insensitive)
                kw_positions = [i for i, token in enumerate(text_tokens) if token.lower() == kw]
                kw_sequences_found = []
                used_positions = set()  # Track positions to avoid overlap

                for pos in kw_positions:
                    # Find the earliest full-length window that includes this occurrence
                    start = max(0, pos - seq_len + 1)
                    for i in range(start, pos + 1):
                        window = text_tokens[i:i + seq_len]
                        if len(window) == seq_len and kw.lower() in [w.lower() for w in window] and i not in used_positions:
                            kw_sequences_found.append(window)
                            used_positions.update(range(i, i + seq_len))
                            break  # Take the first valid window for this occurrence

                    # Stop if we've reached the max number of sequences
                    if max_sequences_per_keyword and len(kw_sequences_found) >= max_sequences_per_keyword:
                        break

                # Assign labels based on the number of sequences found
                if kw_sequences_found:
                    if max_sequences_per_keyword == 1 and len(kw_positions) <= 1:
                        label = kw
                        if label not in keyword_sequences:
                            keyword_sequences[label] = kw_sequences_found[0]
                    else:
                        for idx, seq in enumerate(kw_sequences_found):
                            label = f"{kw}_{idx+1}" if len(kw_positions) > 1 else kw
                            if label not in keyword_sequences:
                                keyword_sequences[label] = seq
            return keyword_sequences

        embedding_layer = model.layers[0]
        rnn_layer = model.layers[1]
        rnn_output_model = Sequential([embedding_layer, rnn_layer])
        rnn_output_model.build(input_shape=(None, seq_length))

        def get_rnn_contextual_embedding(token_list):
            token_list_str = [str(t) for t in token_list]
            encoded_sequence = [word_index.get(token, word_index.get("<unk>", 0)) for token in token_list_str]
            padded_sequence = pad_sequences([encoded_sequence], maxlen=seq_length, padding='pre')
            if padded_sequence.shape[1] == 0:
                return np.zeros(rnn_units)
            return rnn_output_model.predict(padded_sequence, verbose=0)[0]

        # === Section 2: Analysis of All 'mercury' Occurrences and Related Keywords ===
        print("\n=== Section 2: Analysis of All 'mercury' Occurrences and Related Keywords ===")
        
        # Keywords to extract contexts for
        analysis_keywords = ["mercury", "planet", "toy", "metal", "mythology"]
        
        # Check if all keywords are in vocabulary
        missing_keywords = [kw for kw in analysis_keywords if kw not in word_index]
        if missing_keywords:
            print(f"Warning: The following keywords are not in the trained vocabulary: {missing_keywords}")
        
        # Extract all contexts for 'mercury' without limiting to one sequence total
        mercury_contexts_dict = get_sequences_for_keywords(tokens, ["mercury"], seq_length, max_sequences_per_keyword=None)
        
        # Extract contexts for other keywords (limit to 1 sequence each for simplicity)
        other_contexts = {}
        for kw in [k for k in analysis_keywords if k != "mercury"]:  # reuse the list defined above
            if kw in word_index:
                contexts = get_sequences_for_keywords(tokens, [kw], seq_length, max_sequences_per_keyword=1)
                if contexts:
                    other_contexts.update(contexts)

        # Hardcoded list of mercury categories based on text analysis (6 occurrences)
        mercury_categories = ["mythology", "planet", "metal", "metal", "planet", "toy"]

        # Manually assign context-specific labels to 'mercury' occurrences based on the category list
        all_embeddings = {}
        all_labels = []
        label_counts = {"mythology": 0, "planet": 0, "metal": 0, "toy": 0}  # Track indices for each category

        print("\nExtracting RNN hidden states for all 'mercury' contexts:")
        for idx, (label, phrase_tokens) in enumerate(mercury_contexts_dict.items()):
            if not phrase_tokens:
                print(f"Warning: No tokens for '{label}'. Skipping.")
                continue
            if idx >= len(mercury_categories):
                print(f"Warning: Extra occurrence detected at index {idx}; skipping because only {len(mercury_categories)} categories are defined.")
                continue  # Skip occurrences beyond the hand-labelled category list
            category = mercury_categories[idx]
            label_counts[category] += 1
            context_label = f"mercury_{category}_{label_counts[category]}"
            embedding = get_rnn_contextual_embedding(phrase_tokens)
            all_embeddings[context_label] = embedding
            all_labels.append(context_label)
            print(f"- Extracted embedding for '{context_label}' using context: '{' '.join(phrase_tokens)}'")

        # Extract embeddings for other keywords
        print("\nExtracting RNN hidden states for other keywords:")
        for label, phrase_tokens in other_contexts.items():
            if not phrase_tokens:
                print(f"Warning: No tokens for '{label}'. Skipping.")
                continue
            embedding = get_rnn_contextual_embedding(phrase_tokens)
            # Use the keyword itself as the label
            keyword_label = label.split('_')[0]  # Get the base keyword without any suffix
            all_embeddings[keyword_label] = embedding
            all_labels.append(keyword_label)
            print(f"- Extracted embedding for '{keyword_label}' using context: '{' '.join(phrase_tokens)}'")

        # Diagnostic: Check number of extracted embeddings
        print(f"\nDiagnostic: Total embeddings extracted: {len(all_embeddings)}")
        print(f"Labels: {all_labels}")

        # Proceed with analysis if there are at least 2 total embeddings
        if len(all_embeddings) >= 2:
            # Create embedding matrix in the order of all_labels so rows always match the plot labels
            embedding_matrix = np.array([all_embeddings[label] for label in all_labels])
            print("\nShape of embedding matrix for all contexts:", embedding_matrix.shape)

            # Heatmap for all contexts
            print("\n--- Cosine Similarity Heatmap (All Contexts) ---")
            similarity_matrix = cosine_similarity(embedding_matrix)
            
            num_labels = len(all_labels)
            fig_width = max(10, num_labels * 0.8)
            fig_height = max(8, num_labels * 0.6)
            plt.figure(figsize=(fig_width, fig_height))
            
            annotate_heatmap = num_labels < 15
            
            sns.heatmap(similarity_matrix,
                        annot=annotate_heatmap,
                        cmap='viridis',
                        fmt=".2f",
                        xticklabels=all_labels,
                        yticklabels=all_labels,
                        linewidths=.5,
                        cbar_kws={"shrink": .8})
            plt.title('RNN Cosine Similarity (All Contexts - Lumina Codex)', fontsize=16)
            plt.xticks(rotation=45, ha='right', fontsize=10)
            plt.yticks(rotation=0, fontsize=10)
            plt.tight_layout()
            plt.show()

            # PCA for all contexts
            print("\n--- PCA Visualization (All Contexts) ---")
            pca = PCA(n_components=2)
            embeddings_2d_pca = pca.fit_transform(embedding_matrix)

            plt.figure(figsize=(fig_width * 0.9, fig_height * 0.9))
            
            # Create color map for different keyword types
            colors = []
            for label in all_labels:
                if 'mercury' in label:
                    colors.append('red')
                elif label == 'planet':
                    colors.append('blue')
                elif label == 'toy':
                    colors.append('green')
                elif label == 'metal':
                    colors.append('orange')
                elif label == 'mythology':
                    colors.append('purple')
                else:
                    colors.append('gray')
            
            plt.scatter(embeddings_2d_pca[:, 0], embeddings_2d_pca[:, 1], alpha=0.7, s=100, c=colors)
            
            for i, label in enumerate(all_labels):
                plt.annotate(label, (embeddings_2d_pca[i, 0], embeddings_2d_pca[i, 1]), 
                            textcoords="offset points", xytext=(5,5), ha='center', fontsize=10)
            
            plt.title('RNN Contextual Embeddings (All Contexts - Lumina Codex) - PCA', fontsize=16)
            plt.xlabel('PCA Component 1', fontsize=12)
            plt.ylabel('PCA Component 2', fontsize=12)
            plt.grid(True)
            
            # Add legend for colors
            from matplotlib.patches import Patch
            legend_elements = [
                Patch(facecolor='red', label='mercury contexts'),
                Patch(facecolor='blue', label='planet'),
                Patch(facecolor='green', label='toy'),
                Patch(facecolor='orange', label='metal'),
                Patch(facecolor='purple', label='mythology')
            ]
            plt.legend(handles=legend_elements, loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=5)
            
            plt.tight_layout()
            plt.show()
        else:
            print("Not enough total embeddings (need at least 2) for similarity matrix or PCA plot.")

## 8. Illustrating Limitations: A Code-Driven Look at RNN Embeddings with the "Lumina Codex"

The conceptual limitations of Recurrent Neural Networks become more concrete when we examine their performance on our "Lumina Codex" corpus. In the `EmbedEvolution` RNN notebook (`1_RNN_Embeddings.ipynb`), we trained a SimpleRNN for next-word prediction and then evaluated its ability to understand context and generate text.

* **Code Setup Snapshot:**
    The experiment involved using **TensorFlow/Keras** to build a `Sequential` model comprising an `Embedding` layer, a `SimpleRNN` layer, and a `Dense` output layer. Text preprocessing utilized the Keras `Tokenizer`, while analysis involved **NumPy** for numerical tasks, **Scikit-learn** for `cosine_similarity` and `PCA`, and **Matplotlib/Seaborn** for plotting the heatmap and PCA visualizations.
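
To make the term "hidden state" concrete, here is a minimal NumPy sketch of the recurrence a Keras `SimpleRNN` layer applies with its default `tanh` activation. The sizes, random weights, token ids, and the helper name `simple_rnn_final_state` are illustrative assumptions, not values from the notebook. The notebook obtains the same kind of vector from the trained layers via `rnn_output_model.predict`, assuming the layer returns only its last output.

```python
import numpy as np

def simple_rnn_final_state(inputs, W_x, W_h, b):
    """Apply h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b) and return the last h_t."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:  # one embedded token per time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

# Hypothetical sizes and untrained random weights, for illustration only.
rng = np.random.default_rng(0)
vocab_size, embed_dim, units = 50, 8, 16
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # stand-in embedding table
W_x = rng.normal(scale=0.3, size=(embed_dim, units))      # input kernel
W_h = rng.normal(scale=0.3, size=(units, units))          # recurrent kernel
b = np.zeros(units)

# The same final token id (7) preceded by two different contexts.
context_a = [3, 12, 7]
context_b = [21, 40, 7]
h_a = simple_rnn_final_state(E[context_a], W_x, W_h, b)
h_b = simple_rnn_final_state(E[context_b], W_x, W_h, b)

print("Vector length:", h_a.shape[0])
print("Same vector for both contexts?", np.allclose(h_a, h_b))
```

Because the final state depends on every preceding token, the same word yields a different vector in each context. The analysis that follows asks whether those differences actually track meaning.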

* **What the Code Revealed: RNNs with the "Lumina Codex"**

    Our tests aimed to evaluate Semantic Similarity, Contextual Understanding, Clustering Quality, the model's handling of its vocabulary, and its basic trainability on this new, richer corpus.

    1. **Semantic Similarity & Contextual Understanding (via Heatmap & Specific Pairs):**
        We extracted hidden states representing keywords within their contexts. The cosine similarities between these contextual keyword embeddings revealed both strengths and limitations.
        
        * **Mercury's Multiple Contexts:** The most striking finding was how the RNN handled the word "mercury" across its six different contexts in the Lumina Codex:
            * **mercury_mythology_1** and **mercury_planet_1** showed very high similarity (0.96), despite referring to completely different concepts (Roman god vs. planet)
            * **mercury_metal_1** showed moderate similarity to mythology (0.32) and planet contexts (0.29)
            * **mercury_metal_2** showed higher similarity to mythology (0.53) and planet contexts (0.47)
            * **mercury_toy_1** maintained moderate connections to mythology (0.33) and planet contexts (0.30)
            * **mercury_planet_2** showed only moderate similarity to mercury_planet_1 (0.48), lower than expected for two uses of the same sense
            
        * **Observation from Heatmap:** The heatmap reveals the RNN's struggle with polysemy. Although "mercury" appears in vastly different contexts (mythology, astronomy, chemistry, toys), the similarities between its embeddings (0.29 to 0.96) do not track word sense: the highest value (0.96) links two different senses, mythology and planet, while the two planet contexts reach only 0.48. This suggests the RNN is not effectively differentiating senses based on surrounding context and, for some pairs, is influenced mainly by the shared word "mercury".

        * **Specific Similarities from General Keywords:**
            * "lumina" and "star" showed moderate similarity (0.41), reflecting their astronomical connection
            * "mercury" and "thoth" exhibited strong similarity (0.62), correctly capturing their mythological relationship
            * "ferrus" and "iron" showed surprisingly low similarity (0.16), despite being related metal terms
            * "ignis" and "mercury" showed moderate similarity (0.41), possibly due to alchemical associations

    2. **Clustering Quality (via PCA Plot):**
        The PCA visualizations reveal interesting patterns in how the RNN organizes semantic space:
        
        * **Mercury Context Clustering:** In the mercury-focused PCA plot, we observe:
            * mercury_mythology_1 and mercury_planet_1 cluster very closely despite different meanings
            * mercury_metal_1 appears isolated in the lower right
            * mercury_toy_1 is separated in the lower left
            * The two planet contexts (mercury_planet_1 and mercury_planet_2) don't cluster as tightly as expected
            * Standalone keywords (planet, metal, mythology, toy) don't align well with their corresponding mercury contexts
        
        * **General Keyword Clustering:** The broader keyword PCA shows:
            * Celestial terms (star, sun, nebula) cluster in the lower portion but aren't tightly grouped
            * Elements (mercury, ferrus, iron, ignis) are scattered across the space
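            * "thoth" (mythology) appears isolated at the top, far from "mercury", even though their cosine similarity is 0.62; a 2-D projection can distort distances, so the plot and the heatmap need not agree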
            * No clear semantic clusters emerge, indicating the RNN struggles to create well-defined conceptual regions

    3. **Contextual Understanding Limitations:**
        The RNN's handling of "mercury" exemplifies its fundamental limitation with context:
        * Despite appearing in sentences about mythology ("Mercury, messenger of the gods"), astronomy ("Mercury orbits closest to Lumina"), chemistry ("liquid mercury metal"), and toys ("mercury toy from Earth"), the model fails to create sufficiently distinct representations
        * The high similarity between mythology and planet contexts (0.96) suggests the model relies heavily on the word itself rather than contextual disambiguation
        * This demonstrates the SimpleRNN's limited ability to maintain and utilize broader context beyond immediate surrounding words


    4. **Handling of Vocabulary Terms:**
        * The tokenizer successfully captured domain-specific terms from the Lumina Codex (lumina, ignis, ferrus, thoth, etc.)
        * However, the quality of learned representations varies significantly, with some related terms (ferrus/iron) showing unexpectedly low similarity
        * Words outside the Lumina Codex vocabulary fall back to the `<unk>` index (or 0 if it is absent), so the model cannot generate meaningful representations for them

    5. **Trainability and Adaptation:**
        * The model successfully trained on the Lumina Codex, learning to associate terms within this specific domain
        * However, the quality of learned representations reveals the architecture's limitations:
            * Poor disambiguation of polysemous words (mercury)
            * Inconsistent semantic similarities (ferrus/iron low, mercury/thoth high)
            * Lack of clear semantic clustering in the embedding space
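
One way to condense the heatmap observations above into two numbers is to compare the mean cosine similarity of same-sense pairs with that of different-sense pairs. The sketch below is an illustration under stated assumptions: `sense_separation` is a hypothetical helper, and the embeddings are random placeholders rather than the notebook's hidden states. Only the sense list mirrors `mercury_categories`.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return Xn @ Xn.T

def sense_separation(embeddings, senses):
    """Return (mean same-sense similarity, mean cross-sense similarity)."""
    sims = cosine_matrix(embeddings)
    senses = np.asarray(senses)
    same = senses[:, None] == senses[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once
    return sims[same & upper].mean(), sims[~same & upper].mean()

# Sense labels in the order used by mercury_categories in the notebook.
senses = ["mythology", "planet", "metal", "metal", "planet", "toy"]

# Placeholder vectors (assumption): random data, NOT the trained RNN states.
rng = np.random.default_rng(0)
placeholder = rng.normal(size=(len(senses), 16))

within, between = sense_separation(placeholder, senses)
print(f"mean same-sense similarity:  {within:.2f}")
print(f"mean cross-sense similarity: {between:.2f}")
```

In the notebook, the rows of `embedding_matrix` that belong to the `mercury_*` labels could be passed in place of `placeholder`. A same-sense mean clearly above the cross-sense mean would indicate that context is separating the senses. The reported values (0.96 across senses versus 0.48 within the planet sense) point the other way.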

* **Conclusion on RNNs from "Lumina Codex" Exploration:**
    This hands-on experiment with a SimpleRNN on the "Lumina Codex" text concretely demonstrates both the capabilities and severe limitations of basic RNN architectures. While the model can learn sequential patterns and produce contextual representations, it fundamentally struggles with:
    
    1. **Polysemy**: The "mercury" example starkly illustrates how this SimpleRNN fails to adequately distinguish between different meanings of the same word based on context
    2. **Long-range dependencies**: The model cannot effectively use broader document context to inform word representations
    3. **Semantic coherence**: The lack of clear clustering in PCA space indicates poor semantic organization
    
    These limitations, particularly evident in the mercury polysemy problem, motivate the move to more sophisticated architectures and embedding methods: Word2Vec for capturing static word meanings and, eventually, transformer-based models that can handle dynamic contextual variation far more effectively.
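
The long-range dependency point can be illustrated with a small NumPy experiment. The sketch below uses untrained random weights and rescales the recurrent matrix to spectral norm 0.5. Both choices are assumptions made to keep the effect visible, not properties of the trained notebook model. It replaces either the first or the last input of a 20-step sequence and measures how far the final hidden state moves.

```python
import numpy as np

def final_state(xs, W_x, W_h):
    """Final hidden state of a tanh recurrence of the SimpleRNN form (no bias)."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(0)
steps, embed_dim, units = 20, 8, 16      # hypothetical sizes

W_x = rng.normal(scale=0.5, size=(embed_dim, units))
W_h = rng.normal(size=(units, units))
W_h *= 0.5 / np.linalg.norm(W_h, 2)      # assumption: spectral norm fixed at 0.5

xs = rng.normal(size=(steps, embed_dim)) # stand-in for embedded tokens
base = final_state(xs, W_x, W_h)

def change_when_replacing(position):
    """Distance the final state moves when one input vector is swapped out."""
    altered = xs.copy()
    altered[position] = rng.normal(size=embed_dim)
    return np.linalg.norm(final_state(altered, W_x, W_h) - base)

early = change_when_replacing(0)          # first token of the sequence
late = change_when_replacing(steps - 1)   # last token of the sequence

print(f"effect of changing the first token: {early:.2e}")
print(f"effect of changing the last token:  {late:.2e}")
```

Under these assumptions each step shrinks the effect of an earlier change by at least half, so the first token's influence on the final state is negligible next to the last token's. A trained SimpleRNN is not guaranteed to be contractive in this way, but the same mechanism underlies the vanishing-gradient problem that makes distant context hard to learn.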