# EmbedEvolution Stage 4: BERT Embeddings

Welcome to Stage 4! We now delve into BERT (Bidirectional Encoder Representations from Transformers). This marks a significant shift from static embeddings (Word2Vec, GloVe) and basic sequential models (RNNs) to powerful, deeply contextualized word representations. BERT leverages the Transformer architecture, specifically its encoder part, and is pre-trained on vast amounts of text using tasks like Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

The key feature of BERT is that its embeddings depend on the surrounding words: the same word receives a different vector in each sentence it appears in, which lets the representation reflect the word's meaning in that specific context.
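
To make the contrast with static embeddings concrete, here is a toy sketch in plain Python. It is not BERT: the 2-d vectors and the names `STATIC_VECTORS`, `static_embed`, and `toy_contextual_embed` are invented for illustration, and "context" is simulated by blending a word's vector with the mean of its neighbours' vectors.

```python
# Toy illustration only: hypothetical 2-d vectors, not real model output.
STATIC_VECTORS = {
    "star": [1.0, 1.0],
    "charts": [0.0, 2.0],
    "radiant": [2.0, 0.0],
}

def static_embed(word, sentence):
    """A static model ignores the sentence and always returns the same vector."""
    return STATIC_VECTORS[word]

def toy_contextual_embed(word, sentence, mix=0.5):
    """Blend the word's vector with the mean vector of the other words."""
    others = [STATIC_VECTORS[w] for w in sentence if w != word]
    mean = [sum(col) / len(others) for col in zip(*others)]
    base = STATIC_VECTORS[word]
    return [(1 - mix) * b + mix * m for b, m in zip(base, mean)]

s1 = ["star", "charts"]
s2 = ["radiant", "star"]

print(static_embed("star", s1) == static_embed("star", s2))  # True
print(toy_contextual_embed("star", s1))  # [0.5, 1.5]
print(toy_contextual_embed("star", s2))  # [1.5, 0.5]
```

The static lookup cannot tell the two uses of "star" apart, while even this crude context mixing yields two different vectors. BERT achieves the same effect with stacked self-attention layers rather than a fixed average.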

**Goal:** Understand how BERT generates contextual embeddings, observe how these embeddings differ for the same word in different sentences, and appreciate the advantages over static embedding methods.

## 1. Setup and Imports

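We'll use Hugging Face's `transformers` library and `torch` for the model, `nltk` for sentence splitting, `scikit-learn` for PCA and cosine similarity, and `matplotlib`/`seaborn` for plotting.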

In [None]:
import torch
import nltk
from transformers import BertTokenizerFast, BertModel, BertForMaskedLM, get_linear_schedule_with_warmup
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import re
from tqdm.auto import tqdm # For progress bars

# Device selection (CUDA/MPS/CPU)
if torch.cuda.is_available():
    device_str = "cuda"
    print("CUDA is available. Using GPU (CUDA).")
elif torch.backends.mps.is_available():
    device_str = "mps"
    print("MPS is available. Using Apple Silicon GPU (MPS).")
else:
    device_str = "cpu"
    print("CUDA and MPS not available. Using CPU.")
device = torch.device(device_str)
print(f"Selected device: {device}")

## 2. Define the Corpus: The "Lumina Codex"
We'll use our detailed "Lumina Codex and the Solara System" text.

In [None]:
text = """
The Lumina Codex and the Solara System: A Tapestry of Ancient Wisdom and Cosmic Discovery
In the shadowed halls of the Cairo Museum, a dusty papyrus scroll, cataloged as Papyrus K-37b from the Middle Kingdom, lay forgotten for centuries. Dubbed the Lumina Codex by its discoverers, this fragile relic was initially dismissed as a mythological curiosity, its cryptic hieroglyphs and star charts interpreted as poetic musings of a priestly scribe. Yet, in 2024, a team of linguists and astronomers, led by Dr. Amara Nassar, deciphered its veiled verses, revealing an astonishing truth: the codex described a distant star system with uncanny precision, orbiting a radiant G-type star named the Star of Eternal Radiance—now known as Lumina. This revelation sparked a scientific odyssey, merging ancient Egyptian cosmology with cutting-edge astronomy, as the Solara System emerged from the Nebula Cygnus-X1, nestled in the Orion Spur of the Milky Way Galaxy.

The Lumina Codex spoke of Lumina and its ten celestial attendants, organized into poetic regions: the searing Forges of Ra for the inner worlds, the verdant Blessed Belt of Osiris for the habitable zone, the majestic Domains of the Sky Titans for gas giants, and the enigmatic Frozen Outlands for the outer realms. Its star charts, etched with meticulous care, hinted at a cosmic map, with references to the Rivers of Stars—likely the Milky Way—and the Celestial Gardens, evoking the Local Group within the Virgo Supercluster. The codex’s verses, such as “Ten jewels dance in the embrace of the Eternal Radiance, their faces veiled in fire, water, and ice,” seemed to prefigure a system now confirmed by the Cygnus-X1 Deep Sky Array, a fictional next-generation telescope orbiting beyond Earth’s atmosphere.

Discovery and Modern Corroboration
The Solara System’s discovery began in 2023, when the Cygnus-X1 Deep Sky Array detected subtle wobbles in Lumina’s light, indicating a complex system of orbiting bodies. Located 1,200 light-years away in the Nebula Cygnus-X1, Lumina is a stable, middle-aged G-type star, slightly larger than the Sun, with a luminosity that sustains a diverse array of worlds. As astronomers analyzed the data, they identified ten planets, each with unique characteristics that eerily echoed the Lumina Codex. The parallels were undeniable: the codex’s Forges of Ra matched the inner rocky planets, while the Blessed Belt of Osiris aligned with two habitable worlds teeming with life. The Domains of the Sky Titans and Frozen Outlands described gas giants and icy dwarfs with striking accuracy. The scientific community buzzed with excitement, as linguists and astronomers collaborated to decode the codex’s metaphors, revealing a blend of ancient intuition and cosmic truth.

The Solara System: A Celestial Menagerie
Lumina: The Star of Eternal Radiance
Lumina, a G2V star, radiates a warm, golden light, its stable fusion cycle supporting a system spanning 12 astronomical units. Its magnetic fields are unusually calm, suggesting a long lifespan conducive to life’s evolution. The codex describes Lumina as “the hearth of eternity, whose breath kindles the dance of worlds,” a poetic nod to its life-giving energy.

The Forges of Ra: Inner Planets
1- Ignis: The closest planet to Lumina, Ignis is a scorched, iron-rich world with a molten surface pocked by ancient impact craters. Its thin atmosphere, rich in sulfur dioxide, glows faintly under Lumina’s intense radiation. The codex calls it “Ra’s Anvil, where molten rivers forge the bones of the cosmos,” reflecting its volcanic past and metallic crust.
2- Ferrus: Slightly larger, Ferrus is a rocky planet with vast plains of oxidized iron, giving it a crimson hue. Its surface bears scars of past tectonic activity, with towering cliffs and deep chasms. The codex names it “the Forge of Hephaestus’s Twin,” hinting at its metallic wealth, now confirmed by spectroscopic analysis revealing nickel and cobalt deposits.
The Blessed Belt of Osiris: Habitable Zone
1- Aqua: A breathtaking ocean world, Aqua is enveloped in turquoise clouds of water vapor and nitrogen. Its surface is 90% liquid water, with archipelagos of coral-like structures hosting complex aquatic ecosystems. Bioluminescent Aquarelles, jellyfish-like creatures with crystalline tentacles, drift in vast schools, their light pulses synchronizing in rhythmic displays. Predatory Thalacynths, eel-like organisms with electromagnetic sensors, hunt in the deep trenches. Aqua’s moon, Thalassa, is an ice-covered world with a subglacial ocean, where astrobiologists hypothesize microbial extremophiles thrive in hydrothermal vents, metabolizing sulfur compounds. The codex describes Aqua as “Osiris’s Chalice, where life swims in the tears of the gods,” and Thalassa as “the frozen veil hiding the spark of creation.”
2- Veridia: A super-Earth, Veridia boasts lush continents of bioluminescent flora, such as Luminara trees, which pulse with green and violet light, and Crystalferns, whose fractal leaves refract Lumina’s rays into dazzling spectra. Veridia is home to the Sylvans, sentient, silicon-based life forms resembling ambulatory crystal shrubs. Their bodies, composed of lattice-like structures, shimmer with bioluminescent patterns used for communication. Sylvan society is decentralized, with “groves” of individuals linked via light-based signals, forming a collective consciousness deeply attuned to Veridia’s ecosystem. Their architecture, grown from crystalline minerals, integrates seamlessly with the landscape. The codex calls Veridia “the Garden of Osiris’s Breath,” where “the shining ones weave light into wisdom.”
The Domains of the Sky Titans: Gas Giants
1- Zephyrus: A massive hydrogen-helium gas giant, Zephyrus dominates the system with its radiant ring system, composed of ice and silicate particles. Its atmosphere swirls with golden storms, driven by intense winds. Among its 47 moons, Io-Prime stands out, a volcanically active world spewing sulfur plumes, likely powered by tidal heating. The codex names Zephyrus “the Sky Titan’s Crown,” its rings “the jeweled girdle of the heavens.”
2- Boreas: An ice giant with a deep blue methane atmosphere, Boreas exhibits retrograde rotation and an asymmetrical magnetic field, creating auroras that dance across its poles. Its 22 moons include Erynnis, a rocky moon with methane lakes. The codex describes Boreas as “the Frost Titan, whose breath chills the void,” capturing its icy majesty.
The Frozen Outlands: Outer Planets
1- Umbriel: A dwarf planet with a charcoal-dark surface, Umbriel’s icy crust is fractured by ancient impacts. Its moon Nyx, a captured object, is rich in organic compounds, hinting at prebiotic chemistry. The codex calls Umbriel “the Shadowed Outcast, guarded by the dark sentinel.”
2- Erebus: An icy world with a nitrogen-methane atmosphere, Erebus has a highly elliptical orbit, suggesting a captured origin. Its surface sparkles with frost-covered ridges. The codex names it “the Silent Wanderer, cloaked in eternal frost.”
3- Aetheria: The outermost planet, Aetheria is a rogue dwarf with a thin atmosphere of neon and argon. Its moon Lethe exhibits cryovolcanism, spewing ammonia-water mixtures. Astrobiologists speculate that Lethe’s subsurface ocean may harbor microbial life, analogous to Thalassa’s. The codex describes Aetheria as “the Veiled Wanderer, whose dreams freeze in the outer dark,” and Lethe as “the weeping mirror of the cosmos.”
4- Nyxara: A small, icy body with a chaotic orbit, Nyxara’s surface is a mosaic of frozen nitrogen and carbon monoxide. The codex calls it “the Lost Jewel, dancing beyond the Titans’ gaze.”
Life in the Solara System
Aqua’s aquatic ecosystems are a marvel, with Aquarelles forming symbiotic networks with coral-like Hydroskeletons, which filter nutrients from the water. Thalacynths use electromagnetic pulses to stun prey, suggesting an evolutionary arms race. On Thalassa, microbial life is hypothesized based on chemical signatures of sulfur and methane in its subglacial ocean, though no direct evidence exists yet.

Veridia’s Sylvans are the system’s crown jewel. Their crystalline bodies, averaging two meters tall, refract light into complex patterns, encoding emotions, ideas, and memories. Their society operates as a “luminous collective,” with no central authority; decisions emerge from synchronized light displays across groves. Sylvan technology manipulates crystalline minerals to create tools and habitats, all in harmony with Veridia’s ecosystem. Their discovery has sparked intense study by linguists decoding their light-based language, revealing a philosophy centered on balance and interconnectedness.

On Lethe, cryovolcanic activity suggests a subsurface ocean with potential microbial ecosystems, possibly metabolizing ammonia. Unlike Aqua’s confirmed complex life and Veridia’s sentient Sylvans, life on Thalassa and Lethe remains speculative, driving astrobiological research.

Galactic Context
The Solara System resides in the Orion Spur, a minor arm of the Milky Way, part of the Local Group within the Virgo Supercluster. The codex’s Rivers of Stars evoke the Milky Way’s spiral arms, while the Celestial Gardens suggest a poetic grasp of the Local Group’s galactic cluster. This cosmic placement underscores Solara’s significance as a microcosm of the universe’s diversity.

Ongoing Exploration
Scientific teams, including astrobiologists, geologists, and linguists, are studying Solara via the Cygnus-X1 Deep Sky Array and planned probes, such as the Lumina Pathfinder Mission. Challenges include the 1,200-light-year distance, requiring advanced telemetry for data transmission. Sylvan communication poses a unique hurdle, as their light patterns defy traditional linguistic models. Future missions aim to deploy orbiters around Aqua, Veridia, and Lethe to confirm microbial life and study Sylvan culture.

A Cosmic Tapestry
The Solara System, unveiled through the Lumina Codex and modern astronomy, blends ancient wisdom with scientific discovery. Its worlds—from the fiery Forges of Ra to the icy Frozen Outlands—offer a rich tapestry of environments, life forms, and mysteries. As scientists probe this distant system, the codex’s poetic verses resonate, reminding humanity that the cosmos has long whispered its secrets, awaiting those bold enough to listen.
"""
# Simple text cleaning
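def clean_text(input_text):
    input_text = input_text.lower()
    input_text = re.sub(r'[\u2013\u2014]', ' ', input_text) # Turn en/em dashes into spaces so the words on either side are not fused together
    input_text = re.sub(r'[^a-z0-9\s-]', '', input_text) # Keep letters, numbers, spaces, hyphens
    input_text = re.sub(r'\s+', ' ', input_text).strip()
    return input_text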

cleaned_text = clean_text(text)
print("Cleaned Text Sample (Lumina Codex):")
print(cleaned_text[:300] + "...")

# Split text into sentences or manageable chunks for BERT.
# For keyword analysis, we'll pick sentences containing keywords.
# For fine-tuning, we might use overlapping chunks if sentences are too long.
# NLTK's sent_tokenize can be helpful here for more robust sentence splitting.
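try:
    nltk.data.find('tokenizers/punkt')
except LookupError: # nltk.data.find raises LookupError when the resource is missing
    # Note: newer NLTK releases may ask for the 'punkt_tab' resource instead.
    nltk.download('punkt')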
sentences_from_corpus = nltk.sent_tokenize(text) # Use original text for better sentence tokenization
cleaned_sentences_from_corpus = [clean_text(s) for s in sentences_from_corpus]
print(f"\nFound {len(cleaned_sentences_from_corpus)} sentences in the corpus.")


## 3. Load Pre-trained BERT Model and Tokenizer
We'll use `bert-base-uncased`, the 12-layer English model with 768-dimensional hidden states that lowercases its input.

In [None]:
BERT_MODEL_NAME = 'bert-base-uncased'

try:
    tokenizer = BertTokenizerFast.from_pretrained(BERT_MODEL_NAME)
    bert_model_pretrained = BertModel.from_pretrained(BERT_MODEL_NAME).to(device)
    bert_model_pretrained.eval() # Set to evaluation mode
    print(f"Successfully loaded pre-trained BERT model ('{BERT_MODEL_NAME}') and tokenizer.")
except Exception as e:
    print(f"Error loading BERT model or tokenizer: {e}")
    tokenizer = None
    bert_model_pretrained = None

## 4. Helper Function to Extract BERT Embeddings
This function will get contextual embeddings for specified words within given sentences.
It handles subword tokenization by averaging the embeddings of a word's subword tokens.
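
The averaging step can be sketched without loading a model. The example below assumes character offsets like those a fast tokenizer returns and picks every token whose span overlaps the target word. The function name `average_subword_vectors`, the toy vectors, and the split of "sylvans" into `syl ##van ##s` are all invented for illustration; the real tokenizer may split the word differently.

```python
def average_subword_vectors(word_span, offsets, token_vectors):
    """Average the vectors of every token whose character span overlaps word_span."""
    start, end = word_span
    picked = [
        vec
        for (tok_start, tok_end), vec in zip(offsets, token_vectors)
        if tok_start != tok_end                        # skip zero-width special tokens
        and max(start, tok_start) < min(end, tok_end)  # token span overlaps the word span
    ]
    if not picked:
        return None  # the word was not found among the tokens
    return [sum(col) / len(picked) for col in zip(*picked)]

# Hypothetical tokenization of "the sylvans": [CLS] the syl ##van ##s [SEP]
offsets = [(0, 0), (0, 3), (4, 7), (7, 10), (10, 11), (0, 0)]
vectors = [[9.0, 9.0], [0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]

# "sylvans" occupies characters 4..11 of the sentence
print(average_subword_vectors((4, 11), offsets, vectors))  # [3.0, 4.0]
```

Mean pooling is only one option. Taking the first subword's vector is another common choice, and the two can give noticeably different results for long, rare words.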

In [None]:
def get_bert_contextual_embeddings(texts, target_words_info, tokenizer_instance, model_instance, layer_index=-1):
    """
    Extracts contextual embeddings for target words from specified texts using BERT.

    Args:
        texts (list of str): List of sentences/texts.
        target_words_info (dict): Dict where keys are unique identifiers and values are dicts
                                  {"target_word": "word", "text_idx": index_in_texts_list}.
        tokenizer_instance: Initialized BERT tokenizer.
        model_instance: Initialized BERT model.
        layer_index (int): Which hidden layer to use for embeddings (-1 for last).

    Returns:
        dict: Embeddings for target words, keyed by the unique identifiers.
    """
    if not tokenizer_instance or not model_instance:
        print("Tokenizer or model not available.")
        return {}

    embeddings_dict = {}
    model_instance.eval() # Ensure model is in eval mode

    for key, info in tqdm(target_words_info.items(), desc="Extracting BERT embeddings"):
        text_idx = info["text_idx"]
        target_word_str = info["target_word"].lower() # Match uncased model
        current_text = texts[text_idx].lower()

        inputs = tokenizer_instance(current_text, return_tensors="pt", truncation=True, padding=True, return_offsets_mapping=True)
        offset_mapping = inputs.pop("offset_mapping").squeeze(0) # Remove batch dim for single sentence
        inputs = {k: v.to(model_instance.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model_instance(**inputs, output_hidden_states=True)
        
        # Use hidden states from the specified layer (or last layer by default)
        if layer_index == -1 or layer_index >= len(outputs.hidden_states):
            hidden_states = outputs.hidden_states[-1].squeeze(0) # Last hidden state, remove batch dim
        else:
            hidden_states = outputs.hidden_states[layer_index].squeeze(0)


        word_token_indices = []
        # Find tokens corresponding to the target word using offset mapping
        # This is a simplified approach, robust matching can be more complex
        try:
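            # Match the target as a whole word; a plain str.find would also hit substrings (e.g. "life" inside "lifespan")
            word_match = re.search(r'\b' + re.escape(target_word_str) + r'\b', current_text)
            start_char_idx = word_match.start() if word_match else -1
            if start_char_idx != -1:
                end_char_idx = word_match.end()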
                for token_idx, (offset_start, offset_end) in enumerate(offset_mapping):
                    if offset_start == offset_end and offset_start == 0: continue # Skip special tokens like [CLS], [PAD] by checking zero offset
                    # Check if the token's span overlaps with the target word's span
                    if max(start_char_idx, offset_start) < min(end_char_idx, offset_end):
                        word_token_indices.append(token_idx)
            else: # Fallback if find doesn't work, try exact token match
                 tokenized_target = tokenizer_instance.tokenize(target_word_str)
                 input_tokens_str = tokenizer_instance.convert_ids_to_tokens(inputs['input_ids'].squeeze(0))
                 for i in range(len(input_tokens_str) - len(tokenized_target) + 1):
                     if input_tokens_str[i:i+len(tokenized_target)] == tokenized_target:
                         word_token_indices = list(range(i, i+len(tokenized_target)))
                         break


        except AttributeError: # If current_text is not string (should not happen)
             print(f"Error processing text: {current_text}")
             continue


        if word_token_indices:
            word_embeddings = hidden_states[word_token_indices]
            word_embedding_avg = torch.mean(word_embeddings, dim=0).cpu().numpy()
            embeddings_dict[key] = word_embedding_avg
        else:
            print(f"Warning: Target word '{target_word_str}' not precisely found in tokenized sentence: '{current_text}'. Skipping '{key}'. This might be due to subword tokenization complexities or word variations.")
            # Try to get [CLS] token as a fallback representation of the sentence containing the word
            # embeddings_dict[key] = hidden_states[0].cpu().numpy() # [CLS] token

    return embeddings_dict

## 5. Part 1: Analysis with Pre-trained BERT
First, let's see how a standard pre-trained BERT model represents our keywords from the "Lumina Codex" in different contexts.

### 5.1 Define Keywords and Sentences for Contextual Analysis
We need to pick specific sentences from the "Lumina Codex" that contain our keywords. This will allow us to test Contextual Understanding (same word, different sentence -> different embedding) and Semantic Similarity between different keywords in their contexts.
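
Both tests rest on cosine similarity, so it helps to see the definition written out. The notebook itself uses scikit-learn's `cosine_similarity`; the plain-Python `cosine` function below is only a sketch of the same formula, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, applied to made-up 2-d vectors.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))            # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0  (orthogonal)
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

A score of exactly 1.0 between two occurrences of the same word would mean the model ignored the context. Contextual embeddings of the same word typically score high but below 1.0.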

In [None]:
keywords_for_bert_analysis = [
    "lumina", "solara", "aqua", "veridia", "sylvans", "thalassa", "lethe",
    "star", "planet", "moon", "orbit", "atmosphere", "ecosystem",
    "rocky", "icy", "habitable", "bioluminescent", "sentient",
    "codex", "discovery", "life", "egyptian", "astronomy", "science"
]

target_words_info_pretrained = {}
processed_keywords_for_multiple_contexts = set()

for kw in keywords_for_bert_analysis:
    count = 0
    indices_found = []
    for i, sentence in enumerate(cleaned_sentences_from_corpus):
        if kw in sentence.split(): # Simple check if word is in sentence
            indices_found.append(i)
            if kw not in processed_keywords_for_multiple_contexts:
                target_words_info_pretrained[f"{kw}_ctx1"] = {"target_word": kw, "text_idx": i}
                processed_keywords_for_multiple_contexts.add(kw)
                count += 1
            elif count < 2 : # Get a second context if keyword was already processed once
                target_words_info_pretrained[f"{kw}_ctx2"] = {"target_word": kw, "text_idx": i}
                count += 1
            if count >= 2:
                break # Got two contexts for this keyword
    if not indices_found:
        print(f"Keyword '{kw}' not found in any sentence.")


print(f"\nSelected {len(target_words_info_pretrained)} keyword instances for pre-trained BERT analysis.")
print("Details:", target_words_info_pretrained)

### 5.2 Extract and Analyze Embeddings (Pre-trained BERT)

In [None]:
if bert_model_pretrained and tokenizer and target_words_info_pretrained:
    print("Extracting embeddings with Pre-trained BERT...")
    embeddings_bert_pretrained = get_bert_contextual_embeddings(
        cleaned_sentences_from_corpus,
        target_words_info_pretrained,
        tokenizer,
        bert_model_pretrained
    )

    if embeddings_bert_pretrained and len(embeddings_bert_pretrained) >= 2:
        plot_labels_bert_pretrained = list(embeddings_bert_pretrained.keys())
        embedding_matrix_bert_pretrained = np.array(list(embeddings_bert_pretrained.values()))

        print(f"\nShape of embedding matrix (Pre-trained BERT): {embedding_matrix_bert_pretrained.shape}")

        # --- 1. Semantic Similarity & 2. Contextual Understanding (Heatmap) ---
        print("\n--- Cosine Similarity Heatmap (Keywords - Pre-trained BERT) ---")
        similarity_matrix_bert_pretrained = cosine_similarity(embedding_matrix_bert_pretrained)
        
        num_labels_pt_bert = len(plot_labels_bert_pretrained)
        fig_width_pt_bert = max(15, num_labels_pt_bert * 0.5) # Adjust for potentially many labels
        fig_height_pt_bert = max(12, num_labels_pt_bert * 0.4)
        plt.figure(figsize=(fig_width_pt_bert, fig_height_pt_bert))
        
        annotate_heatmap_pt_bert = num_labels_pt_bert < 30 
        sns.heatmap(similarity_matrix_bert_pretrained,
                    annot=annotate_heatmap_pt_bert, cmap='coolwarm', fmt=".2f",
                    xticklabels=plot_labels_bert_pretrained, yticklabels=plot_labels_bert_pretrained,
                    linewidths=.5, cbar_kws={"shrink": .8})
        plt.title('BERT (Pre-trained) Cosine Similarity (Contextual Keywords - Lumina Codex)', fontsize=16)
        plt.xticks(rotation=65, ha='right', fontsize=max(8, 12 - num_labels_pt_bert // 6))
        plt.yticks(rotation=0, fontsize=max(8, 12 - num_labels_pt_bert // 6))
        plt.tight_layout()
        plt.show()

        # --- Specific comparisons for Contextual Understanding ---
        print("\n--- Contextual Understanding Examples (Pre-trained BERT) ---")
        if "life_ctx1" in embeddings_bert_pretrained and "life_ctx2" in embeddings_bert_pretrained:
            sim_life_contexts = cosine_similarity(
                embeddings_bert_pretrained["life_ctx1"].reshape(1,-1),
                embeddings_bert_pretrained["life_ctx2"].reshape(1,-1)
            )[0][0]
            print(f"Similarity between 'life' in context 1 vs. context 2: {sim_life_contexts:.4f}")
            original_sent_life1 = cleaned_sentences_from_corpus[target_words_info_pretrained['life_ctx1']['text_idx']]
            original_sent_life2 = cleaned_sentences_from_corpus[target_words_info_pretrained['life_ctx2']['text_idx']]
            print(f"  Context 1 for 'life': '...{original_sent_life1[max(0, original_sent_life1.find('life')-30):original_sent_life1.find('life')+30]}...'")
            print(f"  Context 2 for 'life': '...{original_sent_life2[max(0, original_sent_life2.find('life')-30):original_sent_life2.find('life')+30]}...'")


        if "planet_ctx1" in embeddings_bert_pretrained and "planet_ctx2" in embeddings_bert_pretrained:
            sim_planet_contexts = cosine_similarity(
                embeddings_bert_pretrained["planet_ctx1"].reshape(1,-1),
                embeddings_bert_pretrained["planet_ctx2"].reshape(1,-1)
            )[0][0]
            print(f"Similarity between 'planet' in context 1 vs. context 2: {sim_planet_contexts:.4f}")


        # --- 4. Clustering Quality (PCA Visualization) ---
        print("\n--- PCA Visualization (Keywords - Pre-trained BERT) ---")
        pca_bert_pretrained = PCA(n_components=2)
        embeddings_2d_bert_pretrained = pca_bert_pretrained.fit_transform(embedding_matrix_bert_pretrained)

        plt.figure(figsize=(fig_width_pt_bert * 0.9, fig_height_pt_bert * 0.9))
        plt.scatter(embeddings_2d_bert_pretrained[:, 0], embeddings_2d_bert_pretrained[:, 1], alpha=0.7, s=60)
        for i, label in enumerate(plot_labels_bert_pretrained):
            plt.annotate(label, (embeddings_2d_bert_pretrained[i, 0], embeddings_2d_bert_pretrained[i, 1]),
                         textcoords="offset points", xytext=(5,5), ha='center', fontsize=max(7, 10 - num_labels_pt_bert // 7))
        plt.title('BERT (Pre-trained) Contextual Keyword Embeddings - PCA', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True)
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough keyword embeddings extracted from pre-trained BERT for analysis.")
else:
    print("Pre-trained BERT model or tokenizer not available.")

### Interpretation Notes for Pre-trained BERT:
* **Semantic Similarity & Contextual Understanding:** The heatmap and the pairwise comparisons show whether BERT separates the same word in different contexts (e.g., `life_ctx1` vs `life_ctx2` should score below 1.0) and whether keywords used in similar sentence contexts score high against each other.
* **Handling Novel Terms/Vocabulary:** BERT uses WordPiece tokenization, so a word missing from its pre-training vocabulary is broken into known subwords instead of being mapped to an unknown token. True "OOV" failures are therefore rare for words built from common characters, though the representation of rare or invented words may be less robust. Common entries in `keywords_for_bert_analysis` such as `star` or `planet` are likely single tokens, while invented names such as `veridia` or `sylvans` will probably be split into several pieces, which the helper function then averages.
* **Clustering Quality:** The PCA plot for BERT is expected to be more nuanced than the one for Word2Vec, because contextual variations of the same word plot as separate points. Observe whether these variations cluster together, and whether different keywords used in similar semantic roles appear close to each other.
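
The WordPiece behaviour described above can be sketched as a greedy longest-match-first split. This is a simplified illustration, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical, and the real `bert-base-uncased` vocabulary has roughly 30,000 entries and may split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary
toy_vocab = {"planet", "bio", "##lum", "##ines", "##cent", "##s"}

print(wordpiece_tokenize("planet", toy_vocab))          # ['planet']
print(wordpiece_tokenize("planets", toy_vocab))         # ['planet', '##s']
print(wordpiece_tokenize("bioluminescent", toy_vocab))  # ['bio', '##lum', '##ines', '##cent']
print(wordpiece_tokenize("xyz", toy_vocab))             # ['[UNK]']
```

Because a real vocabulary usually contains single characters and many short continuation pieces, the `[UNK]` branch is rarely reached for ordinary alphabetic words.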

## 6. Part 2: Fine-tuning BERT on "Lumina Codex"

To assess "Fine-tuning Capability" and see if we can make BERT's embeddings even more specific to our "Lumina Codex" domain, we'll perform a brief fine-tuning process. We'll use a Masked Language Modeling (MLM) objective.
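
As a rough picture of the MLM objective, the sketch below corrupts a token list using the usual 80/10/10 rule: each non-special token is selected with probability `mask_prob`; a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time. It works on strings rather than token IDs, and `mask_tokens`, `toy_vocab`, and the example sentence are assumptions made for illustration.

```python
import random

MASK, IGNORE = "[MASK]", -100
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule."""
    corrupted, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= mask_prob:
            labels.append(IGNORE)             # not selected: no loss at this position
            continue
        labels.append(tok)                    # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

toy_vocab = ["star", "planet", "moon", "codex", "life"]
tokens = ["[CLS]", "the", "codex", "describes", "a", "distant", "star", "[SEP]"]

# mask_prob is raised well above 0.15 so that this short sentence shows some masking
corrupted, labels = mask_tokens(tokens, toy_vocab, random.Random(0), mask_prob=0.5)
print(corrupted)
print(labels)
```

The loss is computed only where the label is not `-100`, which is why unselected positions and special tokens receive that value.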

### 6.1 Prepare Data for MLM Fine-tuning

In [None]:
class LuminaMLMDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128): # Max length can be tuned
        self.tokenizer = tokenizer
        self.inputs = []
        self.labels = []

        for text in tqdm(texts, desc="Processing texts for MLM"):
            # Tokenize the text
            tokenized_text = self.tokenizer(text, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
            
            # Create labels and mask some tokens
            input_ids = tokenized_text['input_ids'].squeeze()
            label_ids = input_ids.clone()
            
            # Masking strategy: mask ~15% of tokens
            # We'll mask non-special tokens.
            # Create probability matrix for masking.
            prob_matrix = torch.full(label_ids.shape, 0.15)
            # Prevent masking of special tokens [CLS], [SEP], [PAD]
            special_tokens_mask = self.tokenizer.get_special_tokens_mask(label_ids.tolist(), already_has_special_tokens=True)
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
            prob_matrix.masked_fill_(special_tokens_mask, value=0.0)
            
            # Decide which tokens to mask
            masked_indices = torch.bernoulli(prob_matrix).bool()
            label_ids[~masked_indices] = -100 # We only compute loss on masked tokens

            # Of the masked tokens, 80% are [MASK], 10% random, 10% same
            indices_replaced = torch.bernoulli(torch.full(label_ids.shape, 0.8)).bool() & masked_indices
            input_ids[indices_replaced] = self.tokenizer.mask_token_id

            indices_random = torch.bernoulli(torch.full(label_ids.shape, 0.5)).bool() & masked_indices & ~indices_replaced
            random_words = torch.randint(len(self.tokenizer), label_ids.shape, dtype=torch.long)
            input_ids[indices_random] = random_words[indices_random]
            
            # The remaining 10% of selected tokens keep their original ID, so nothing more needs to be done for them.
            # Note: masks are drawn once, at dataset construction, so every epoch sees the same masked positions.

            self.inputs.append({"input_ids": input_ids, "attention_mask": tokenized_text['attention_mask'].squeeze()})
            self.labels.append(label_ids)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        item = self.inputs[idx]
        item['labels'] = self.labels[idx]
        return item

if tokenizer:
    # Use a subset of sentences for faster demo fine-tuning, or all of them
    # For a real fine-tuning, you'd want much more data or many epochs over this data.
    # Using only the first N sentences for this demo, or splitting into train/eval.
    num_finetune_sentences = min(len(cleaned_sentences_from_corpus), 500) # Limit for demo
    finetune_texts = cleaned_sentences_from_corpus[:num_finetune_sentences]
    
    mlm_dataset = LuminaMLMDataset(finetune_texts, tokenizer)
    # Ensure dataset is not empty
    if len(mlm_dataset) > 0:
        mlm_dataloader = DataLoader(mlm_dataset, batch_size=8, shuffle=True) # Small batch size for small dataset
        print(f"Created MLM dataset with {len(mlm_dataset)} instances.")
    else:
        print("MLM dataset is empty. Fine-tuning cannot proceed.")
        mlm_dataloader = None
else:
    print("Tokenizer not available. Skipping MLM data preparation.")
    mlm_dataloader = None

### 6.2 Fine-tune BERT Model
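
The fine-tuning loop in this section pairs AdamW with `get_linear_schedule_with_warmup`. As a sketch of what that schedule computes, the function below returns the multiplier applied to the base learning rate at each optimizer step. The name `linear_warmup_decay` is hypothetical, and the total of 10 steps is chosen only to keep the printout short.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
total_steps = 10
for step in (0, 5, 10):
    multiplier = linear_warmup_decay(step, num_warmup_steps=0, num_training_steps=total_steps)
    print(step, multiplier, base_lr * multiplier)
```

With `num_warmup_steps=0`, as in this notebook, the rate starts at its full value and falls linearly to zero by the last step. A short warmup is often used when training is unstable in the first few batches.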

In [None]:
if bert_model_pretrained and mlm_dataloader and len(mlm_dataset)>0:
    print("Preparing BERT for fine-tuning (Masked Language Modeling)...")
    # Load BertForMaskedLM using the same pre-trained weights
    try:
        bert_model_finetuning = BertForMaskedLM.from_pretrained(BERT_MODEL_NAME).to(device)
        bert_model_finetuning.train() # Set model to training mode

        # Optimizer and Scheduler
        optimizer = AdamW(bert_model_finetuning.parameters(), lr=5e-5) # Common learning rate for BERT fine-tuning
        num_epochs_finetune = 3 # Small number for demo; real fine-tuning needs more
        total_steps = len(mlm_dataloader) * num_epochs_finetune
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

        print(f"Starting fine-tuning for {num_epochs_finetune} epochs...")
        for epoch in range(num_epochs_finetune):
            epoch_loss = 0
            for batch in tqdm(mlm_dataloader, desc=f"Epoch {epoch + 1}/{num_epochs_finetune}"):
                optimizer.zero_grad()
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                outputs = bert_model_finetuning(input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                epoch_loss += loss.item()
                
                loss.backward()
                optimizer.step()
                scheduler.step()
            
            avg_epoch_loss = epoch_loss / len(mlm_dataloader)
            print(f"Epoch {epoch + 1} complete. Average Loss: {avg_epoch_loss:.4f}")

        print("Fine-tuning complete.")
        # For extracting embeddings, we need the base BertModel part
        bert_model_finetuned = bert_model_finetuning.bert # Extract the base BertModel
        bert_model_finetuned.eval() # Set to evaluation mode for embedding extraction
    except Exception as e:
        print(f"An error occurred during fine-tuning: {e}")
        bert_model_finetuned = None # Fallback
else:
    print("Pre-trained BERT model or MLM data loader not available. Skipping fine-tuning.")
    bert_model_finetuned = None # Ensure it's defined
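
For intuition about what `get_linear_schedule_with_warmup` contributes to the loop above, here is a minimal, dependency-free sketch of the learning-rate multiplier it applies at each optimizer step: a linear ramp during warmup, then a linear decay to zero. The function name `linear_schedule_factor` and the step counts are hypothetical and chosen only for illustration; the real scheduler in `transformers` wraps equivalent logic in a PyTorch `LambdaLR`.

```python
def linear_schedule_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5      # same base rate as the optimizer above
total_steps = 12    # illustrative value only, e.g. 4 batches x 3 epochs
for step in (0, 3, 6, 9, 12):
    factor = linear_schedule_factor(step, num_warmup_steps=0, num_training_steps=total_steps)
    print(f"step {step:2d}: lr = {base_lr * factor:.2e}")
```

With `num_warmup_steps=0`, as in the cell above, the rate starts at its full value and only decays. On a very small corpus the total step count is small, so each step lowers the rate noticeably.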

### 6.3 Extract and Analyze Embeddings (Fine-tuned BERT)
Now we repeat the keyword analysis with the fine-tuned BERT model.

In [None]:
if bert_model_finetuned and tokenizer and target_words_info_pretrained: # Use same target_words_info
    print("\nExtracting embeddings with Fine-tuned BERT...")
    embeddings_bert_finetuned = get_bert_contextual_embeddings(
        cleaned_sentences_from_corpus,
        target_words_info_pretrained, # Using the same selection of keyword contexts
        tokenizer,
        bert_model_finetuned # Pass the fine-tuned base model
    )

    if embeddings_bert_finetuned and len(embeddings_bert_finetuned) >= 2:
        plot_labels_bert_finetuned = list(embeddings_bert_finetuned.keys())
        embedding_matrix_bert_finetuned = np.array(list(embeddings_bert_finetuned.values()))

        print(f"\nShape of embedding matrix (Fine-tuned BERT): {embedding_matrix_bert_finetuned.shape}")

        # --- Semantic Similarity & Contextual Understanding (Heatmap - Fine-tuned) ---
        print("\n--- Cosine Similarity Heatmap (Keywords - Fine-tuned BERT) ---")
        similarity_matrix_bert_finetuned = cosine_similarity(embedding_matrix_bert_finetuned)
        
        num_labels_ft_bert = len(plot_labels_bert_finetuned)
        fig_width_ft_bert = max(15, num_labels_ft_bert * 0.5)
        fig_height_ft_bert = max(12, num_labels_ft_bert * 0.4)
        plt.figure(figsize=(fig_width_ft_bert, fig_height_ft_bert))
        
        annotate_heatmap_ft_bert = num_labels_ft_bert < 30
        sns.heatmap(similarity_matrix_bert_finetuned,
                    annot=annotate_heatmap_ft_bert, cmap='coolwarm', fmt=".2f",
                    xticklabels=plot_labels_bert_finetuned, yticklabels=plot_labels_bert_finetuned,
                    linewidths=.5, cbar_kws={"shrink": .8})
        plt.title('BERT (Fine-tuned on Lumina Codex) Cosine Similarity', fontsize=16)
        plt.xticks(rotation=65, ha='right', fontsize=max(8, 12 - num_labels_ft_bert // 6))
        plt.yticks(rotation=0, fontsize=max(8, 12 - num_labels_ft_bert // 6))
        plt.tight_layout()
        plt.show()
        
        # --- Specific comparisons for Contextual Understanding (Fine-tuned) ---
        print("\n--- Contextual Understanding Examples (Fine-tuned BERT) ---")
        if "life_ctx1" in embeddings_bert_finetuned and "life_ctx2" in embeddings_bert_finetuned:
            sim_life_contexts_ft = cosine_similarity(
                embeddings_bert_finetuned["life_ctx1"].reshape(1,-1),
                embeddings_bert_finetuned["life_ctx2"].reshape(1,-1)
            )[0][0]
            print(f"Similarity between 'life' (fine-tuned) in context 1 vs. context 2: {sim_life_contexts_ft:.4f}")
            # Compare with sim_life_contexts from pre-trained if available

        if "planet_ctx1" in embeddings_bert_finetuned and "planet_ctx2" in embeddings_bert_finetuned:
            sim_planet_contexts_ft = cosine_similarity(
                embeddings_bert_finetuned["planet_ctx1"].reshape(1,-1),
                embeddings_bert_finetuned["planet_ctx2"].reshape(1,-1)
            )[0][0]
            print(f"Similarity between 'planet' (fine-tuned) in context 1 vs. context 2: {sim_planet_contexts_ft:.4f}")


        # --- Clustering Quality (PCA Visualization - Fine-tuned) ---
        print("\n--- PCA Visualization (Keywords - Fine-tuned BERT) ---")
        pca_bert_finetuned = PCA(n_components=2)
        embeddings_2d_bert_finetuned = pca_bert_finetuned.fit_transform(embedding_matrix_bert_finetuned)

        plt.figure(figsize=(fig_width_ft_bert * 0.9, fig_height_ft_bert * 0.9))
        plt.scatter(embeddings_2d_bert_finetuned[:, 0], embeddings_2d_bert_finetuned[:, 1], alpha=0.7, s=60)
        for i, label in enumerate(plot_labels_bert_finetuned):
            plt.annotate(label, (embeddings_2d_bert_finetuned[i, 0], embeddings_2d_bert_finetuned[i, 1]),
                         textcoords="offset points", xytext=(5,5), ha='center', fontsize=max(7, 10 - num_labels_ft_bert // 7))
        plt.title('BERT (Fine-tuned on Lumina Codex) Contextual Keyword Embeddings - PCA', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True)
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough keyword embeddings extracted from fine-tuned BERT for analysis.")
else:
    print("Fine-tuned BERT model not available.")
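
The cell above prints the fine-tuned similarities on their own. To judge the effect of fine-tuning, they need to be read against the pre-trained values. A minimal sketch of such a side-by-side comparison follows. The helper names `cosine` and `similarity_shift` are hypothetical, and the made-up 3-dimensional vectors stand in for real BERT embeddings. In the notebook you would pass the pre-trained embedding dictionary from the earlier cell together with `embeddings_bert_finetuned`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_shift(emb_before, emb_after, key_a, key_b):
    """Return (before, after, delta) cosine similarity for one pair of labels.

    Returns None if either label is missing from either dictionary.
    """
    for emb in (emb_before, emb_after):
        if key_a not in emb or key_b not in emb:
            return None
    before = cosine(emb_before[key_a], emb_before[key_b])
    after = cosine(emb_after[key_a], emb_after[key_b])
    return before, after, after - before


# Toy stand-ins for the two embedding dictionaries (made-up numbers).
toy_pretrained = {"life_ctx1": [1.0, 0.0, 0.0], "life_ctx2": [0.0, 1.0, 0.0]}
toy_finetuned = {"life_ctx1": [1.0, 0.0, 0.0], "life_ctx2": [1.0, 1.0, 0.0]}

result = similarity_shift(toy_pretrained, toy_finetuned, "life_ctx1", "life_ctx2")
if result is not None:
    before, after, delta = result
    print(f"pre-trained: {before:.4f}  fine-tuned: {after:.4f}  change: {delta:+.4f}")
```

A positive change means the two contexts moved closer together after fine-tuning; a negative change means the model now separates them more.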

## 7. Discussion & Conclusion for BERT Stage (Lumina Codex)

In this notebook, we explored the capabilities of BERT (Bidirectional Encoder Representations from Transformers) in generating deeply contextual embeddings, using our "Lumina Codex" corpus as the testbed. Our investigation involved two key phases:
1.  Analyzing contextual keyword embeddings derived from a standard **pre-trained `bert-base-uncased` model**.
2.  Briefly **fine-tuning this BERT model on the "Lumina Codex" text** using a Masked Language Modeling (MLM) objective, and then re-evaluating its contextual keyword embeddings.

Throughout, we focused on assessing Semantic Similarity, Contextual Understanding, Vocabulary Handling (via subword tokenization), Clustering Quality (through PCA), and the impact of Fine-tuning.

**Key Insights from Our BERT Experiments:**

1.  **Deep Contextual Understanding (Pre-trained & Fine-tuned BERT):**
    * **Pre-trained BERT:** A core strength of BERT is its ability to generate different embeddings for the same word depending on its surrounding context. For instance, when we extracted embeddings for "life" from two different sentences in the Lumina Codex (`life_ctx1` vs. `life_ctx2`), their cosine similarity was [e.g., `0.XX`]. A value clearly below 1.0 shows that the pre-trained model represents "life" differently in each sentence; a static embedding would give exactly 1.0 here, because it assigns the word a single vector. *(You'll replace this with your actual similarity value and specific contexts from cell 19 output).*
    * **Fine-tuned BERT:** After fine-tuning on the "Lumina Codex," the contextual distinction might have shifted. For example, the similarity between `life_ctx1` and `life_ctx2` might now be [e.g., `0.YY`] *(your actual value from cell 23 output)*. A change here (increase or decrease) would suggest that fine-tuning has altered how the model perceives these specific contexts within the Lumina Codex narrative. *(Here, discuss if the fine-tuning made the distinction sharper or if it found common thematic links specific to the codex, based on your results.)*

2.  **Semantic Similarity & Clustering Quality (Heatmaps & PCA Plots):**
    * **Pre-trained BERT:**
        * The heatmap of cosine similarities for keyword contexts ([see Figure Pre-trained-BERT-Heatmap - your `bert_pretrained_heatmap.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) likely showed relationships based on general English understanding, influenced by the specific sentences chosen from the Lumina Codex. For instance, `lumina_ctx1` (the star) and `star_ctx1` might show strong similarity.
        * The PCA plot ([see Figure Pre-trained-BERT-PCA - your `bert_pretrained_pca.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) would visualize these relationships. Contextual variations of the same keyword (e.g., `planet_ctx1`, `planet_ctx2`) would appear as distinct points. Keywords used in similar overall sentence themes might cluster.
    * **Fine-tuned BERT:**
        * The heatmap ([see Figure Fine-tuned-BERT-Heatmap - your `bert_finetuned_heatmap.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) and PCA plot ([see Figure Fine-tuned-BERT-PCA - your `bert_finetuned_pca.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) for the fine-tuned model are crucial for comparison. *(Describe the changes: Did clusters become tighter for Lumina Codex-specific concepts like "Sylvans" and "Veridia"? Did the similarity scores between, for example, "codex" and "egyptian" contexts increase after fine-tuning on this text where they are closely related? Did general English terms perhaps become slightly more specialized in their relationships reflecting their usage **only** within the codex?)*

3.  **Handling Vocabulary & Novel Terms (BERT's Subword Power):**
    * BERT's WordPiece tokenizer handles open-ended input with a fixed vocabulary (about 30,000 entries for `bert-base-uncased`) by breaking words it does not contain into known subword units. This is a major advantage over word-level vocabularies.
    * **Observation:** Most, if not all, keywords from the "Lumina Codex" (including fictional names like "Sylvans," "Aquarelles," "Thalacynths," or "Nyxara") would have been successfully tokenized and embedded by both the pre-trained and fine-tuned models. The model is unlikely to have seen these exact words during its original pre-training, but it can still build context-dependent representations from their subword pieces. As a result, words made of known characters do not trigger "Out-of-Vocabulary" errors; only a word that cannot be split into known pieces falls back to the `[UNK]` token.

4.  **Impact and Demonstration of Fine-tuning Capability:**
    * The fine-tuning process (the MLM training loop, where a falling average epoch loss indicates that the model is fitting the text) shows that BERT's representations **can be adapted** to a new, specific domain or corpus like the "Lumina Codex."
    * **Evidence:** The shifts (however subtle or pronounced) observed in the heatmaps and PCA plots between the pre-trained and fine-tuned model outputs are direct evidence of this adaptation. The fine-tuned model's embeddings are now more influenced by the specific contexts and word relationships present in the "Lumina Codex." For example, if "Lumina" (the star) and "Aqua" (a planet) are frequently mentioned in contexts discussing life-sustaining energy, their embeddings might become closer after fine-tuning, even if their general English meanings are less directly linked.
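
The subword handling described in point 3 can be made concrete with a minimal sketch of greedy longest-match-first WordPiece splitting. The vocabulary below is a made-up toy set, not the real `bert-base-uncased` vocabulary, and `wordpiece_split` is a hypothetical helper written for illustration. The real tokenizer also normalizes text first, so its actual splits for these names will differ.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example only).
toy_vocab = {"planet", "sy", "##lva", "##ns", "ny", "##x", "##ara", "##s"}

for word in ["planet", "planets", "sylvans", "nyxara", "qzx"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

To see the real splits, call `tokenizer.tokenize(...)` on the same words with the notebook's `BertTokenizerFast` instance.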

**BERT in the "EmbedEvolution" Context:**

BERT, and the Transformer architecture it's built upon, represent a monumental leap in generating text embeddings. This exploration with the "Lumina Codex" highlights:

* **True Contextualization:** BERT's ability to produce different embeddings for the same word based on its surrounding context is its defining strength, effectively addressing the polysemy problem that plagued static embeddings.
* **Robust Vocabulary Handling:** Subword tokenization largely overcomes the OOV issue.
* **Adaptability through Fine-tuning:** Pre-trained BERT models can be further specialized to specific domains or tasks, making their embeddings even more relevant and powerful for those applications, as hinted by our brief MLM fine-tuning.

**Limitations (General Considerations for BERT):**
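* **Computational Cost:** BERT models are far more computationally intensive than Word2Vec or simple RNNs; `bert-base-uncased` alone has roughly 110 million parameters, so both fine-tuning and embedding extraction benefit from a GPU.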
* **Sentence-Level Embeddings for Similarity:** While BERT excels at token-level contextual embeddings, using its raw outputs (e.g., `[CLS]` token or mean-pooled token embeddings) directly for semantic similarity scoring between *sentences* is not always optimal without specific fine-tuning for that task.
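
To make the sentence-level options in the last point concrete, here is a minimal sketch of the two naive pooling strategies it mentions: taking the first (`[CLS]`) token vector, and mean pooling over real tokens while ignoring padding. The function names and the tiny 2-dimensional vectors are illustrative assumptions. With a real model, the inputs would be the last hidden states and the attention mask for one sentence.

```python
def cls_pool(token_vectors):
    """Use the first token's vector ([CLS] in BERT inputs) as the sentence vector."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, skipping padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in totals]


# Toy "hidden states" for [CLS], two word tokens, and one padding position.
hidden = [[0.5, 0.5], [1.0, 3.0], [3.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print("CLS pooling: ", cls_pool(hidden))
print("Mean pooling:", mean_pool(hidden, mask))
```

Both strategies produce a fixed-size vector, but neither is trained for sentence similarity, which is why their similarity scores can be unreliable without task-specific fine-tuning.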

**Next Steps:**

The power of BERT's contextual token embeddings is undeniable. However, for many applications, a single, high-quality vector representing an entire sentence or paragraph, optimized for similarity comparisons, is highly desirable. This leads us directly to the next stage in our "EmbedEvolution": **Sentence Transformers (SBERT)**, which build upon models like BERT to achieve precisely that.