# EmbedEvolution Stage 4: Sentence Transformers with "Lumina Codex"

Welcome to Stage 4! We now explore **Sentence Transformers** (often abbreviated SBERT). Having seen BERT's strength with contextual word embeddings, we turn to models that fine-tune BERT, RoBERTa, or similar Transformers in siamese or triplet network setups. This training specifically teaches them to produce **semantically meaningful sentence embeddings** that can be compared directly with cosine similarity, a need that simply pooling vanilla BERT token outputs does not meet well.

We will:
1. Use a pre-trained Sentence Transformer model (`all-mpnet-base-v2`).
2. Generate embeddings for selected sentences from the "Lumina Codex."
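3. Analyze these sentence embeddings for:
    - Semantic similarity (between sentences)
    - Clustering quality (of sentences, via PCA)
    - Vocabulary handling (via the underlying Transformer's subword tokenization)
    - Its nature as a model that has already been fine-tuned for sentence similarity
4. Experimentally fine-tune the underlying Transformer on the "Lumina Codex" with Masked Language Modeling (MLM) and compare the resulting embeddings.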
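
To make the idea of pooling token vectors and comparing sentences by cosine similarity concrete, here is a minimal sketch in plain Python. The vectors are made-up toy numbers rather than model output, and `mean_pool` is a hypothetical helper that illustrates mean pooling as one common pooling choice; the real model does this internally on much larger vectors.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "token embeddings" (illustrative values only).
sentence_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0], [9.0, 9.0, 9.0]]  # last row is padding
mask_a = [1, 1, 0]
sentence_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
mask_b = [1, 1]
sentence_c = [[0.0, 5.0, 0.0]]
mask_c = [1]

emb_a = mean_pool(sentence_a, mask_a)
emb_b = mean_pool(sentence_b, mask_b)
emb_c = mean_pool(sentence_c, mask_c)

print("a vs b:", cosine_similarity(emb_a, emb_b))  # same direction, so close to 1
print("a vs c:", cosine_similarity(emb_a, emb_c))  # orthogonal, so 0
```

Because the padded row is masked out, `sentence_a` and `sentence_b` pool to the same vector even though their token vectors differ.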

## 1. Setup and Imports

In [None]:
import torch
from sentence_transformers import SentenceTransformer, models
from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from tqdm.auto import tqdm
import os

# Device selection
if torch.cuda.is_available():
    device_str = "cuda"
elif torch.backends.mps.is_available():
    device_str = "mps"
else:
    device_str = "cpu"
device = torch.device(device_str)
print(f"Selected device: {device}")

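# Download NLTK's sentence tokenizer data if it is not already present.
# Newer NLTK releases look for 'punkt_tab' while older ones use 'punkt', so request both.
try:
    _ = nltk.sent_tokenize("Test sentence.")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')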

## 2. Define the Corpus: The "Lumina Codex"
We'll use our detailed "Lumina Codex and the Solara System" text.

In [None]:
text = """
The Lumina Codex and the Solara System: A Tapestry of Ancient Wisdom and Cosmic Discovery
In the shadowed halls of the Cairo Museum, a dusty papyrus scroll, cataloged as Papyrus K-37b from the Middle Kingdom, lay forgotten for centuries. Dubbed the Lumina Codex by its discoverers, this fragile relic was initially dismissed as a mythological curiosity, its cryptic hieroglyphs and star charts interpreted as poetic musings of a priestly scribe. Yet, in 2024, a team of linguists and astronomers, led by Dr. Amara Nassar, deciphered its veiled verses, revealing an astonishing truth: the codex described a distant star system with uncanny precision, orbiting a radiant G-type star named the Star of Eternal Radiance—now known as Lumina. This revelation sparked a scientific odyssey, merging ancient Egyptian cosmology with cutting-edge astronomy, as the Solara System emerged from the Nebula Cygnus-X1, nestled in the Orion Spur of the Milky Way Galaxy.

The Lumina Codex spoke of Lumina and its ten celestial attendants, organized into poetic regions: the searing Forges of Ra for the inner worlds, the verdant Blessed Belt of Osiris for the habitable zone, the majestic Domains of the Sky Titans for gas giants, and the enigmatic Frozen Outlands for the outer realms. Its star charts, etched with meticulous care, hinted at a cosmic map, with references to the Rivers of Stars—likely the Milky Way—and the Celestial Gardens, evoking the Local Group within the Virgo Supercluster. The codex’s verses, such as “Ten jewels dance in the embrace of the Eternal Radiance, their faces veiled in fire, water, and ice,” seemed to prefigure a system now confirmed by the Cygnus-X1 Deep Sky Array, a fictional next-generation telescope orbiting beyond Earth’s atmosphere.

Discovery and Modern Corroboration
The Solara System’s discovery began in 2023, when the Cygnus-X1 Deep Sky Array detected subtle wobbles in Lumina’s light, indicating a complex system of orbiting bodies. Located 1,200 light-years away in the Nebula Cygnus-X1, Lumina is a stable, middle-aged G-type star, slightly larger than the Sun, with a luminosity that sustains a diverse array of worlds. As astronomers analyzed the data, they identified ten planets, each with unique characteristics that eerily echoed the Lumina Codex. The parallels were undeniable: the codex’s Forges of Ra matched the inner rocky planets, while the Blessed Belt of Osiris aligned with two habitable worlds teeming with life. The Domains of the Sky Titans and Frozen Outlands described gas giants and icy dwarfs with striking accuracy. The scientific community buzzed with excitement, as linguists and astronomers collaborated to decode the codex’s metaphors, revealing a blend of ancient intuition and cosmic truth.

The Solara System: A Celestial Menagerie
Lumina: The Star of Eternal Radiance
Lumina, a G2V star, radiates a warm, golden light, its stable fusion cycle supporting a system spanning 12 astronomical units. Its magnetic fields are unusually calm, suggesting a long lifespan conducive to life’s evolution. The codex describes Lumina as “the hearth of eternity, whose breath kindles the dance of worlds,” a poetic nod to its life-giving energy.

The Forges of Ra: Inner Planets
1- Ignis: The closest planet to Lumina, Ignis is a scorched, iron-rich world with a molten surface pocked by ancient impact craters. Its thin atmosphere, rich in sulfur dioxide, glows faintly under Lumina’s intense radiation. The codex calls it “Ra’s Anvil, where molten rivers forge the bones of the cosmos,” reflecting its volcanic past and metallic crust.
2- Ferrus: Slightly larger, Ferrus is a rocky planet with vast plains of oxidized iron, giving it a crimson hue. Its surface bears scars of past tectonic activity, with towering cliffs and deep chasms. The codex names it “the Forge of Hephaestus’s Twin,” hinting at its metallic wealth, now confirmed by spectroscopic analysis revealing nickel and cobalt deposits.
The Blessed Belt of Osiris: Habitable Zone
1- Aqua: A breathtaking ocean world, Aqua is enveloped in turquoise clouds of water vapor and nitrogen. Its surface is 90% liquid water, with archipelagos of coral-like structures hosting complex aquatic ecosystems. Bioluminescent Aquarelles, jellyfish-like creatures with crystalline tentacles, drift in vast schools, their light pulses synchronizing in rhythmic displays. Predatory Thalacynths, eel-like organisms with electromagnetic sensors, hunt in the deep trenches. Aqua’s moon, Thalassa, is an ice-covered world with a subglacial ocean, where astrobiologists hypothesize microbial extremophiles thrive in hydrothermal vents, metabolizing sulfur compounds. The codex describes Aqua as “Osiris’s Chalice, where life swims in the tears of the gods,” and Thalassa as “the frozen veil hiding the spark of creation.”
2- Veridia: A super-Earth, Veridia boasts lush continents of bioluminescent flora, such as Luminara trees, which pulse with green and violet light, and Crystalferns, whose fractal leaves refract Lumina’s rays into dazzling spectra. Veridia is home to the Sylvans, sentient, silicon-based life forms resembling ambulatory crystal shrubs. Their bodies, composed of lattice-like structures, shimmer with bioluminescent patterns used for communication. Sylvan society is decentralized, with “groves” of individuals linked via light-based signals, forming a collective consciousness deeply attuned to Veridia’s ecosystem. Their architecture, grown from crystalline minerals, integrates seamlessly with the landscape. The codex calls Veridia “the Garden of Osiris’s Breath,” where “the shining ones weave light into wisdom.”
The Domains of the Sky Titans: Gas Giants
1- Zephyrus: A massive hydrogen-helium gas giant, Zephyrus dominates the system with its radiant ring system, composed of ice and silicate particles. Its atmosphere swirls with golden storms, driven by intense winds. Among its 47 moons, Io-Prime stands out, a volcanically active world spewing sulfur plumes, likely powered by tidal heating. The codex names Zephyrus “the Sky Titan’s Crown,” its rings “the jeweled girdle of the heavens.”
2- Boreas: An ice giant with a deep blue methane atmosphere, Boreas exhibits retrograde rotation and an asymmetrical magnetic field, creating auroras that dance across its poles. Its 22 moons include Erynnis, a rocky moon with methane lakes. The codex describes Boreas as “the Frost Titan, whose breath chills the void,” capturing its icy majesty.
The Frozen Outlands: Outer Planets
1- Umbriel: A dwarf planet with a charcoal-dark surface, Umbriel’s icy crust is fractured by ancient impacts. Its moon Nyx, a captured object, is rich in organic compounds, hinting at prebiotic chemistry. The codex calls Umbriel “the Shadowed Outcast, guarded by the dark sentinel.”
2- Erebus: An icy world with a nitrogen-methane atmosphere, Erebus has a highly elliptical orbit, suggesting a captured origin. Its surface sparkles with frost-covered ridges. The codex names it “the Silent Wanderer, cloaked in eternal frost.”
3- Aetheria: The outermost planet, Aetheria is a rogue dwarf with a thin atmosphere of neon and argon. Its moon Lethe exhibits cryovolcanism, spewing ammonia-water mixtures. Astrobiologists speculate that Lethe’s subsurface ocean may harbor microbial life, analogous to Thalassa’s. The codex describes Aetheria as “the Veiled Wanderer, whose dreams freeze in the outer dark,” and Lethe as “the weeping mirror of the cosmos.”
4- Nyxara: A small, icy body with a chaotic orbit, Nyxara’s surface is a mosaic of frozen nitrogen and carbon monoxide. The codex calls it “the Lost Jewel, dancing beyond the Titans’ gaze.”
Life in the Solara System
Aqua’s aquatic ecosystems are a marvel, with Aquarelles forming symbiotic networks with coral-like Hydroskeletons, which filter nutrients from the water. Thalacynths use electromagnetic pulses to stun prey, suggesting an evolutionary arms race. On Thalassa, microbial life is hypothesized based on chemical signatures of sulfur and methane in its subglacial ocean, though no direct evidence exists yet.

Veridia’s Sylvans are the system’s crown jewel. Their crystalline bodies, averaging two meters tall, refract light into complex patterns, encoding emotions, ideas, and memories. Their society operates as a “luminous collective,” with no central authority; decisions emerge from synchronized light displays across groves. Sylvan technology manipulates crystalline minerals to create tools and habitats, all in harmony with Veridia’s ecosystem. Their discovery has sparked intense study by linguists decoding their light-based language, revealing a philosophy centered on balance and interconnectedness.

On Lethe, cryovolcanic activity suggests a subsurface ocean with potential microbial ecosystems, possibly metabolizing ammonia. Unlike Aqua’s confirmed complex life and Veridia’s sentient Sylvans, life on Thalassa and Lethe remains speculative, driving astrobiological research.

Galactic Context
The Solara System resides in the Orion Spur, a minor arm of the Milky Way, part of the Local Group within the Virgo Supercluster. The codex’s Rivers of Stars evoke the Milky Way’s spiral arms, while the Celestial Gardens suggest a poetic grasp of the Local Group’s galactic cluster. This cosmic placement underscores Solara’s significance as a microcosm of the universe’s diversity.

Ongoing Exploration
Scientific teams, including astrobiologists, geologists, and linguists, are studying Solara via the Cygnus-X1 Deep Sky Array and planned probes, such as the Lumina Pathfinder Mission. Challenges include the 1,200-light-year distance, requiring advanced telemetry for data transmission. Sylvan communication poses a unique hurdle, as their light patterns defy traditional linguistic models. Future missions aim to deploy orbiters around Aqua, Veridia, and Lethe to confirm microbial life and study Sylvan culture.

A Cosmic Tapestry
The Solara System, unveiled through the Lumina Codex and modern astronomy, blends ancient wisdom with scientific discovery. Its worlds—from the fiery Forges of Ra to the icy Frozen Outlands—offer a rich tapestry of environments, life forms, and mysteries. As scientists probe this distant system, the codex’s poetic verses resonate, reminding humanity that the cosmos has long whispered its secrets, awaiting those bold enough to listen.
"""
def clean_text_for_sbert(input_text):
    input_text = re.sub(r'\s+', ' ', input_text).strip()
    return input_text

cleaned_full_text_sbert = clean_text_for_sbert(text)
sentences_from_corpus_sbert = nltk.sent_tokenize(text) # Use original for sentence splitting
cleaned_sentences_for_sbert_analysis = [clean_text_for_sbert(s) for s in sentences_from_corpus_sbert if s.strip()]

print(f"Found {len(cleaned_sentences_for_sbert_analysis)} sentences in the corpus for SBERT.")
if cleaned_sentences_for_sbert_analysis:
    print("Sample sentence for SBERT:", cleaned_sentences_for_sbert_analysis[0])


## 3. Load Pre-trained Sentence Transformer Model
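We'll use `all-mpnet-base-v2`, a general-purpose model that maps each sentence to a 768-dimensional vector. Among the general-purpose Sentence Transformers models it favors embedding quality over speed; smaller models such as `all-MiniLM-L6-v2` are faster.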

In [None]:
SBERT_MODEL_NAME = 'all-mpnet-base-v2'

try:
    sbert_model_pretrained = SentenceTransformer(SBERT_MODEL_NAME, device=device_str)
    print(f"Successfully loaded Sentence Transformer model: '{SBERT_MODEL_NAME}'")
except Exception as e:
    print(f"Error loading Sentence Transformer model: {e}")
    sbert_model_pretrained = None

## 4. Define Sentences for Analysis & Keyword Spotting

Instead of embedding individual keywords in isolation (SBERT is designed for sentences), we'll select sentences from the "Lumina Codex" that *contain* our keywords or are thematically relevant. We will then analyze the embeddings of these *full sentences*.
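
The selection relies on whole-word keyword matching, which can be sketched on its own as follows. This is a simplified illustration, not the notebook's exact selection logic: `contains_keyword` and `first_sentence_per_keyword` are hypothetical helper names, and the toy sentences are invented for the example.

```python
import re

def contains_keyword(sentence, keyword):
    """True if `keyword` occurs as a whole word, case-insensitively."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

def first_sentence_per_keyword(sentences, keywords):
    """Map each keyword to the first not-yet-used sentence that contains it."""
    chosen = {}
    used = set()
    for kw in keywords:
        for idx, sent in enumerate(sentences):
            if idx not in used and contains_keyword(sent, kw):
                chosen[kw] = sent
                used.add(idx)
                break  # one sentence per keyword
    return chosen

toy_sentences = [
    "Aqua is an ocean world.",
    "Aquarelles drift in vast schools.",
    "The codex describes Aqua and Veridia.",
]
picked = first_sentence_per_keyword(toy_sentences, ["aqua", "veridia", "codex"])
for kw, sent in picked.items():
    print(f"{kw}: {sent}")
```

The word boundaries (`\b`) keep "aqua" from matching inside "Aquarelles". Since each sentence is used at most once, "codex" gets no sentence here: its only match was already taken by "veridia".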

In [None]:
selected_sentences_for_sbert = []
selected_sentence_labels_sbert = []
temp_selected_indices_sbert = set()

keywords_for_sbert_themes = [
    "lumina", "solara", "aqua", "veridia", "sylvans", "thalassa", "lethe",
    "star", "planet", "moon", "orbit", "atmosphere", "ecosystem",
    "rocky", "icy", "habitable", "bioluminescent", "sentient",
    "codex", "discovery", "life", "egyptian", "astronomy", "science"
]

if cleaned_sentences_for_sbert_analysis:
    prominent_keywords_sbert = ["lumina", "aqua", "veridia", "sylvans", "codex", "star", "planet", "life", "discovery"]
    for kw in prominent_keywords_sbert:
        found_count = 0
        for i, s_text in enumerate(cleaned_sentences_for_sbert_analysis):
            s_lower = s_text.lower()
            # Whole-word match on the lowercased sentence, skipping sentences already picked
            if i not in temp_selected_indices_sbert and re.search(r'\b' + re.escape(kw) + r'\b', s_lower):
                selected_sentences_for_sbert.append(s_text)
                selected_sentence_labels_sbert.append(f"{kw}_ctx{found_count+1}")
                temp_selected_indices_sbert.add(i)
                found_count += 1
                break  # One sentence per prominent keyword for now
    
    num_needed_sbert = max(0, 15 - len(selected_sentences_for_sbert))
    for i, s_text in enumerate(cleaned_sentences_for_sbert_analysis):
        if num_needed_sbert == 0: break
        if i not in temp_selected_indices_sbert:
            s_lower = s_text.lower()
            added_for_diversity = False
            for kw_other in keywords_for_sbert_themes:
                if kw_other not in prominent_keywords_sbert and re.search(r'\b' + re.escape(kw_other) + r'\b', s_lower):
                    selected_sentences_for_sbert.append(s_text)
                    selected_sentence_labels_sbert.append(f"{kw_other}_ctx1")
                    temp_selected_indices_sbert.add(i)
                    num_needed_sbert -= 1
                    added_for_diversity = True
                    break
            # Fallback: take sentences at a regular stride. max(1, ...) prevents a
            # modulo-by-zero when the corpus has fewer sentences than are still needed.
            stride_sbert = max(1, len(cleaned_sentences_for_sbert_analysis) // (num_needed_sbert + 1))
            if not added_for_diversity and num_needed_sbert > 0 and i % stride_sbert == 0:
                selected_sentences_for_sbert.append(s_text)
                selected_sentence_labels_sbert.append(f"misc_ctx{len(selected_sentence_labels_sbert)}")
                temp_selected_indices_sbert.add(i)
                num_needed_sbert -= 1
    
    # Ensure a minimum number of sentences for plotting if selection is too sparse
    if len(selected_sentences_for_sbert) < 5 and cleaned_sentences_for_sbert_analysis:
        print(f"Selected sentences list too short ({len(selected_sentences_for_sbert)}), taking first 15 sentences for SBERT analysis.")
        selected_sentences_for_sbert = cleaned_sentences_for_sbert_analysis[:min(15, len(cleaned_sentences_for_sbert_analysis))]
        selected_sentence_labels_sbert = [f"S{i+1}" for i in range(len(selected_sentences_for_sbert))]

    print(f"\nSelected {len(selected_sentences_for_sbert)} diverse sentences for SBERT analysis.")
    for i in range(min(len(selected_sentences_for_sbert), 5)):
         print(f"- {selected_sentence_labels_sbert[i]}: {selected_sentences_for_sbert[i][:80]}...")
else:
    print("No sentences available for selection for SBERT.")

## 5. Part 1: Analysis with Pre-trained `all-mpnet-base-v2`

### 5.1 Generate and Evaluate Pre-trained SBERT Embeddings

In [None]:
sentence_embeddings_sbert_pt = None # pt for pre-trained (as-is)
if sbert_model_pretrained and selected_sentences_for_sbert:
    print(f"\nGenerating SBERT (pre-trained) embeddings for {len(selected_sentences_for_sbert)} sentences...")
    sentence_embeddings_sbert_pt = sbert_model_pretrained.encode(
        selected_sentences_for_sbert,
        convert_to_numpy=True,
        normalize_embeddings=True # Often recommended for SBERT cosine similarity
    )
    print(f"Generated SBERT (pre-trained) sentence embeddings. Shape: {sentence_embeddings_sbert_pt.shape}")

    if sentence_embeddings_sbert_pt is not None and sentence_embeddings_sbert_pt.shape[0] >= 2:
        embedding_matrix_sbert_pt = sentence_embeddings_sbert_pt
        plot_labels_sbert_pt = selected_sentence_labels_sbert

        # --- Semantic Similarity of Sentences (Heatmap) ---
        print("\n--- Cosine Similarity Heatmap (Sentences - Pre-trained SBERT) ---")
        similarity_matrix_sbert_pt = cosine_similarity(embedding_matrix_sbert_pt)
        
        num_labels_sbert_pt = len(plot_labels_sbert_pt)
        fig_width_sbert_pt = max(12, num_labels_sbert_pt * 0.6)
        fig_height_sbert_pt = max(10, num_labels_sbert_pt * 0.45)
        plt.figure(figsize=(fig_width_sbert_pt, fig_height_sbert_pt))
        
        annotate_heatmap_sbert_pt = num_labels_sbert_pt < 20
        sns.heatmap(similarity_matrix_sbert_pt,
                    annot=annotate_heatmap_sbert_pt, cmap='cividis', fmt=".2f",
                    xticklabels=plot_labels_sbert_pt, yticklabels=plot_labels_sbert_pt,
                    linewidths=.5, cbar_kws={"shrink": .8})
        plt.title(f'SBERT ({SBERT_MODEL_NAME}) Sentence Cosine Similarity (Pre-trained)', fontsize=16)
        plt.xticks(rotation=65, ha='right', fontsize=max(8, 12 - num_labels_sbert_pt // 5))
        plt.yticks(rotation=0, fontsize=max(8, 12 - num_labels_sbert_pt // 5))
        plt.tight_layout()
        plt.show()

        # --- Clustering Quality of Sentences (PCA Visualization) ---
        print("\n--- PCA Visualization (Sentences - Pre-trained SBERT) ---")
        pca_sbert_pt = PCA(n_components=2)
        embeddings_2d_sbert_pt = pca_sbert_pt.fit_transform(embedding_matrix_sbert_pt)

        plt.figure(figsize=(fig_width_sbert_pt * 0.9, fig_height_sbert_pt * 0.9))
        plt.scatter(embeddings_2d_sbert_pt[:, 0], embeddings_2d_sbert_pt[:, 1], alpha=0.7, s=60)
        for i, label in enumerate(plot_labels_sbert_pt):
            plt.annotate(label, (embeddings_2d_sbert_pt[i, 0], embeddings_2d_sbert_pt[i, 1]),
                         textcoords="offset points", xytext=(5,5), ha='center', fontsize=max(7, 10 - num_labels_sbert_pt // 6))
        plt.title(f'SBERT ({SBERT_MODEL_NAME}) Sentence Embeddings (Pre-trained) - PCA', fontsize=16)
        plt.xlabel('PCA Component 1', fontsize=12)
        plt.ylabel('PCA Component 2', fontsize=12)
        plt.grid(True)
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough sentence embeddings generated for pre-trained SBERT analysis.")
else:
    print("Pre-trained SBERT model not loaded or no sentences selected. Skipping pre-trained analysis.")


### Interpretation Notes for Pre-trained `all-mpnet-base-v2`:
* **Semantic Similarity:** Observe the heatmap. Do sentences you intuitively feel are similar (e.g., two sentences describing Aqua's ecosystem, or two sentences about the Sylvans) show higher similarity scores? How does it compare to dissimilar sentences (e.g., one about Ignis vs. one about Sylvan philosophy)?
* **Clustering Quality (PCA):** Does the PCA plot show meaningful clusters of sentences? Do sentences about similar topics group together? `all-mpnet-base-v2` is a strong general model, so expect reasonable performance.
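* **Contextual Understanding (Sentence Level):** Each embedding encodes a whole sentence in context, so there is no separate test for it here. The quality of the similarity scores and clusters above is the evidence for how well context is captured.
* **Vocabulary:** MPNet (the base of this SBERT model) uses subword tokenization, so rare or invented names such as "Thalacynths" are typically split into known word pieces rather than collapsed into a single unknown token.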
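
The vocabulary point can be illustrated with a toy version of BERT-style WordPiece splitting, which matches the longest known piece first and marks continuation pieces with `##`. This is a sketch under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical, the vocabulary is a tiny hand-made set, and the splits a real tokenizer produces depend on its actual vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary for illustration only.
toy_vocab = {"star", "lum", "##ina", "##s", "syl", "##vans"}

for word in ["lumina", "sylvans", "stars", "zephyrus"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With a realistic vocabulary of tens of thousands of pieces, the unknown-token fallback is rare for alphabetic words, which is why invented names from the codex can still be embedded.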

## 6. Part 2: Experimental Fine-tuning of SBERT's Base Model on "Lumina Codex" (MLM)

We will attempt to further adapt the underlying Transformer model of `all-mpnet-base-v2` to our "Lumina Codex" domain using Masked Language Modeling (MLM).
**Important Note:** `all-mpnet-base-v2` is already a highly optimized sentence embedder, so this MLM fine-tuning is *experimental*. MLM adapts the model's token-level representations to the domain; it does not train sentence similarity directly. Without a sentence-pair similarity objective, it may leave general sentence-similarity performance unchanged or slightly degrade it. The goal is to observe whether, and how, the embeddings change after this domain exposure.
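
For readers new to MLM, the masking step can be sketched in plain Python. It follows the usual BERT-style recipe: select about 15% of non-special tokens, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged. `mask_tokens` is a hypothetical helper and all token ids below are made up; in the notebook this job is done by `DataCollatorForLanguageModeling`.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_tokens(token_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one tokenized sentence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens (start, end, padding) are never selected.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

# Toy ids (assumed for illustration): 0 = start, 2 = end, 1 = padding, 49 = mask.
toy_ids = [0, 10, 11, 12, 13, 14, 15, 2, 1, 1]
masked, labels = mask_tokens(
    toy_ids, special_ids={0, 1, 2}, mask_id=49, vocab_size=50,
    mlm_probability=0.5, rng=random.Random(0),  # high probability so the effect is visible
)
print("inputs:", masked)
print("labels:", labels)
```

Only positions whose label differs from `IGNORE_INDEX` contribute to the loss, so the model is trained solely on the tokens that were selected for prediction.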

### 6.1 Prepare Model and Data for MLM Fine-tuning

In [None]:
import copy
from transformers import AutoTokenizer, MPNetForMaskedLM

if sbert_model_pretrained:
    try:
        # Copy the underlying Hugging Face encoder (an MPNetModel) out of the SentenceTransformer.
        # The deep copy keeps `sbert_model_pretrained` unchanged while the copy is fine-tuned.
        transformer_model_for_mlm = copy.deepcopy(sbert_model_pretrained[0].auto_model)

        # all-mpnet-base-v2 was fine-tuned from microsoft/mpnet-base, so the tokenizer and the
        # MLM head are taken from that checkpoint. MPNet uses a BERT-style WordPiece vocabulary
        # but its own special tokens, so AutoTokenizer is used to resolve the matching class.
        base_tokenizer_name = 'microsoft/mpnet-base'
        try:
            mlm_tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_name)
            print(f"Loaded tokenizer '{base_tokenizer_name}' for MLM fine-tuning.")
        except Exception as e_tok:
            print(f"Could not load base tokenizer '{base_tokenizer_name}', falling back to SBERT's tokenizer. Error: {e_tok}")
            mlm_tokenizer = sbert_model_pretrained.tokenizer

        # An MPNet checkpoint does not load into BertForMaskedLM: the parameter names differ,
        # so every weight, including the MLM head, would be randomly initialized.
        # MPNetForMaskedLM loads the pre-trained MLM head from the base checkpoint.
        mlm_model_to_finetune = MPNetForMaskedLM.from_pretrained(base_tokenizer_name)
        # Swap in SBERT's encoder in place of the base checkpoint's encoder. The head was
        # trained against the base encoder, so expect a somewhat higher initial MLM loss.
        mlm_model_to_finetune.mpnet = transformer_model_for_mlm
        # Alias: expose the same encoder as `.bert`, because the training cell reads the
        # fine-tuned encoder back through that attribute name.
        mlm_model_to_finetune.bert = mlm_model_to_finetune.mpnet
        mlm_model_to_finetune.to(device)
        print("Prepared underlying Transformer model for MLM fine-tuning.")

    except Exception as e:
        print(f"Error preparing model for MLM fine-tuning: {e}")
        mlm_model_to_finetune = None
        mlm_tokenizer = None
else:
    mlm_model_to_finetune = None
    mlm_tokenizer = None


class MLMDatasetSBERT(Dataset):
    def __init__(self, texts, tokenizer, max_length=128): # Max length of MPNet is 512
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length
        # Tokenize all texts once
        self.encoded_texts = self.tokenizer(
            texts,
            add_special_tokens=True,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt"
        )
        self.data_collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=True, mlm_probability=0.15)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # We need to return a dict that the collator can work with.
        # The collator will handle the actual masking.
        # We just return the input_ids and attention_mask for the given sentence.
        return {
            "input_ids": self.encoded_texts["input_ids"][idx],
            "attention_mask": self.encoded_texts["attention_mask"][idx]
        }


mlm_dataloader_sbert = None
if mlm_tokenizer and cleaned_sentences_for_sbert_analysis:
    # Use a subset of sentences for faster demo fine-tuning
    num_finetune_sentences_sbert = min(len(cleaned_sentences_for_sbert_analysis), 500) # Or all if you have time
    finetune_texts_sbert = cleaned_sentences_for_sbert_analysis[:num_finetune_sentences_sbert]
    
    if finetune_texts_sbert:
        mlm_dataset_sbert = MLMDatasetSBERT(finetune_texts_sbert, mlm_tokenizer)
        if len(mlm_dataset_sbert) > 0:
            # The DataCollatorForLanguageModeling will handle creating labels and masking
            mlm_dataloader_sbert = DataLoader(mlm_dataset_sbert, batch_size=8, shuffle=True, collate_fn=mlm_dataset_sbert.data_collator)
            print(f"Created MLM dataset for SBERT base model with {len(mlm_dataset_sbert)} instances.")
        else:
            print("MLM dataset for SBERT base is empty.")
    else:
        print("No texts selected for SBERT MLM fine-tuning.")
else:
    print("MLM Tokenizer not available for SBERT base model fine-tuning.")

### 6.2 Fine-tune SBERT's Base Model (Experimental MLM)

In [None]:
sbert_model_finetuned_base = None # This will store the fine-tuned Transformer base

if mlm_model_to_finetune and mlm_dataloader_sbert and len(mlm_dataset_sbert)>0:
    print("Starting experimental MLM fine-tuning of SBERT's base Transformer...")
    mlm_model_to_finetune.train()

    optimizer = torch.optim.AdamW(mlm_model_to_finetune.parameters(), lr=5e-6) # Lower LR for fine-tuning
    num_epochs_sbert_finetune = 1 # Just 1 epoch for a quick domain adaptation demo
    
    for epoch in range(num_epochs_sbert_finetune):
        epoch_loss = 0
        progress_bar = tqdm(mlm_dataloader_sbert, desc=f"Epoch {epoch + 1}/{num_epochs_sbert_finetune}")
        for batch in progress_bar:
            optimizer.zero_grad()
            # Batch already contains masked input_ids and labels from DataCollator
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = mlm_model_to_finetune(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            
            if loss is not None: # Check if loss is computed (it should be)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
                progress_bar.set_postfix({'loss': loss.item()})
            else:
                print("Warning: Loss is None during fine-tuning step.")

        avg_epoch_loss = epoch_loss / len(mlm_dataloader_sbert) if len(mlm_dataloader_sbert) > 0 else float('nan')
        print(f"Epoch {epoch + 1} complete. Average Loss: {avg_epoch_loss:.4f}")

    print("Experimental MLM fine-tuning of SBERT's base Transformer complete.")
    sbert_model_finetuned_base = mlm_model_to_finetune.bert # Get the fine-tuned base Bert/MPNet model
    sbert_model_finetuned_base.eval()

    # Note: `sbert_model_finetuned_base` is a raw Hugging Face model. It cannot be passed
    # directly as the first SentenceTransformer module, because that module must be a
    # `models.Transformer` wrapper that handles tokenization and feature dictionaries.
    # The full SentenceTransformer is therefore rebuilt in Section 6.3 by saving this
    # base model to disk and reloading it through `models.Transformer`.
    sbert_model_finetuned_full = None

else:
    print("MLM model or dataloader not available. Skipping experimental fine-tuning.")
    sbert_model_finetuned_full = None

### 6.3 Evaluate Fine-tuned SBERT Embeddings

In [None]:
sentence_embeddings_sbert_ft = None # ft for fine-tuned
sbert_model_finetuned_full = None   # Initialize

# Ensure the fine-tuned base model and the tokenizer used for MLM are available
if ('sbert_model_finetuned_base' in locals() and sbert_model_finetuned_base is not None and
    'sbert_model_pretrained' in locals() and sbert_model_pretrained is not None and
    'mlm_tokenizer' in locals() and mlm_tokenizer is not None): # mlm_tokenizer was used for fine-tuning

    print("Preparing fine-tuned SBERT model for evaluation...")
    try:
        # --- Step 1: Save the fine-tuned base Hugging Face Transformer model & its tokenizer ---
        finetuned_base_model_path = "./sbert_finetuned_base_mlm_from_notebook" # Temporary path
        
        sbert_model_finetuned_base.save_pretrained(finetuned_base_model_path)
        mlm_tokenizer.save_pretrained(finetuned_base_model_path) # Save the tokenizer too
        print(f"Fine-tuned base Transformer model and tokenizer saved to {finetuned_base_model_path}")

        # --- Step 2: Reconstruct the SentenceTransformer using the path to the fine-tuned base ---
        # This SentenceTransformer layer will load the fine-tuned model and its tokenizer from the path.
        # We get max_seq_length from the original pre-trained SBERT's first layer,
        # as this is a property of the SBERT wrapper layer, not the HF model itself.
        # The models.Transformer class has a max_seq_length attribute it uses.
        original_sbert_transformer_layer = sbert_model_pretrained[0]

        word_embedding_module_ft = models.Transformer(
            model_name_or_path=finetuned_base_model_path, # LOAD FROM OUR SAVED PATH
            max_seq_length=original_sbert_transformer_layer.max_seq_length # Use original SBERT's max_seq_length
        )
        word_embedding_module_ft.to(device)

        # Get the original pooling and normalization layers
        pooling_model = sbert_model_pretrained[1]
        pooling_model.to(device)

        reconstructed_modules_ft = [word_embedding_module_ft, pooling_model]
        
        if len(sbert_model_pretrained) > 2 and isinstance(sbert_model_pretrained[2], models.Normalize):
            normalize_model = sbert_model_pretrained[2]
            normalize_model.to(device)
            reconstructed_modules_ft.append(normalize_model)

        sbert_model_finetuned_full = SentenceTransformer(modules=reconstructed_modules_ft, device=device_str)
        print("Successfully reconstructed SentenceTransformer with fine-tuned base model.")

        # Optional: Clean up the saved model directory if no longer needed
        # import shutil
        # shutil.rmtree(finetuned_base_model_path)
        # print(f"Cleaned up temporary directory: {finetuned_base_model_path}")

        # --- Step 3: Proceed with evaluation ---
        if sbert_model_finetuned_full and selected_sentences_for_sbert:
            print(f"\nGenerating SBERT (fine-tuned) embeddings for {len(selected_sentences_for_sbert)} sentences...")
            # encode() batches the sentences and runs them on the requested device.
            sentence_embeddings_sbert_ft = sbert_model_finetuned_full.encode(
                selected_sentences_for_sbert,
                convert_to_numpy=True,
                normalize_embeddings=True,  # L2-normalize so dot product equals cosine similarity
                device=device_str  # Run on the device selected in Setup
            )
            print(f"Generated SBERT (fine-tuned) sentence embeddings. Shape: {sentence_embeddings_sbert_ft.shape}")

            if sentence_embeddings_sbert_ft is not None and sentence_embeddings_sbert_ft.shape[0] >= 2:
                embedding_matrix_sbert_ft = sentence_embeddings_sbert_ft
                plot_labels_sbert_ft = selected_sentence_labels_sbert

                # --- Semantic Similarity of Sentences (Heatmap - Fine-tuned) ---
                print("\n--- Cosine Similarity Heatmap (Sentences - Fine-tuned SBERT) ---")
                similarity_matrix_sbert_ft = cosine_similarity(embedding_matrix_sbert_ft)
                
                num_labels_sbert_ft = len(plot_labels_sbert_ft)
                fig_width_sbert_ft = max(12, num_labels_sbert_ft * 0.6)
                fig_height_sbert_ft = max(10, num_labels_sbert_ft * 0.45)

                plt.figure(figsize=(fig_width_sbert_ft, fig_height_sbert_ft))
                annotate_heatmap_sbert_ft = num_labels_sbert_ft < 20
                sns.heatmap(similarity_matrix_sbert_ft,
                            annot=annotate_heatmap_sbert_ft, cmap='cividis', fmt=".2f",
                            xticklabels=plot_labels_sbert_ft, yticklabels=plot_labels_sbert_ft,
                            linewidths=.5, cbar_kws={"shrink": .8})
                plt.title(f'SBERT ({SBERT_MODEL_NAME}) Cosine Similarity (Fine-tuned on Lumina Codex)', fontsize=16)
                plt.xticks(rotation=65, ha='right', fontsize=max(8, 12 - num_labels_sbert_ft // 5))
                plt.yticks(rotation=0, fontsize=max(8, 12 - num_labels_sbert_ft // 5))
                plt.tight_layout()
                plt.show()

                # --- Clustering Quality of Sentences (PCA Visualization - Fine-tuned) ---
                print("\n--- PCA Visualization (Sentences - Fine-tuned SBERT) ---")
                pca_sbert_ft = PCA(n_components=2)
                embeddings_2d_sbert_ft = pca_sbert_ft.fit_transform(embedding_matrix_sbert_ft)

                plt.figure(figsize=(fig_width_sbert_ft * 0.9, fig_height_sbert_ft * 0.9))
                plt.scatter(embeddings_2d_sbert_ft[:, 0], embeddings_2d_sbert_ft[:, 1], alpha=0.7, s=60)
                for i, label in enumerate(plot_labels_sbert_ft):
                    plt.annotate(label, (embeddings_2d_sbert_ft[i, 0], embeddings_2d_sbert_ft[i, 1]),
                                 textcoords="offset points", xytext=(5,5), ha='center', fontsize=max(7, 10 - num_labels_sbert_ft // 6))
                plt.title(f'SBERT ({SBERT_MODEL_NAME}) Sentence Embeddings (Fine-tuned on Lumina Codex) - PCA', fontsize=16)
                plt.xlabel('PCA Component 1', fontsize=12)
                plt.ylabel('PCA Component 2', fontsize=12)
                plt.grid(True)
                plt.tight_layout()
                plt.show()
            else:
                print("Not enough sentence embeddings generated from fine-tuned SBERT for analysis.")
        else:
            print("Fine-tuned SBERT model was not properly reconstructed or no sentences to analyze.")

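    except Exception as e:
        # Report the exception type as well as the message to make failures easier to diagnose.
        print(f"An error occurred during fine-tuned SBERT model evaluation: {type(e).__name__}: {e}")
        sbert_model_finetuned_full = None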

else:
    # Conditional print statements for missing components
    if 'sbert_model_finetuned_base' not in locals() or sbert_model_finetuned_base is None:
        print("Fine-tuned base Transformer model ('sbert_model_finetuned_base') is not available.")
    if 'sbert_model_pretrained' not in locals() or sbert_model_pretrained is None:
        print("Original pre-trained SBERT model ('sbert_model_pretrained') is not available.")
    if 'mlm_tokenizer' not in locals() or mlm_tokenizer is None:
        print("MLM Tokenizer ('mlm_tokenizer') used for fine-tuning is not available.")
    print("Skipping evaluation of fine-tuned SBERT model.")

## 7. Discussion & Conclusion for Sentence Transformer Stage (Lumina Codex)

In this notebook, we explored Sentence Transformers, specifically using the powerful `all-mpnet-base-v2` model, to generate embeddings for sentences from our "Lumina Codex" corpus. Our primary goal was to evaluate how well these pre-trained sentence embeddings capture semantic meaning for similarity and clustering. We also conducted an *experimental* further fine-tuning of the model's underlying Transformer base using Masked Language Modeling (MLM) on the "Lumina Codex" text to observe any domain adaptation effects.

Our analysis focused on: Semantic Similarity, Sentence-Level Contextual Understanding, Vocabulary Handling, Clustering Quality (via PCA), and the impact of our experimental Fine-tuning.

**Key Insights from `all-mpnet-base-v2` Experiments:**

1.  **Semantic Similarity of Sentences (Pre-trained vs. Fine-tuned):**
    * **Pre-trained `all-mpnet-base-v2`:**
        The heatmap of cosine similarities for sentences embedded with the pre-trained `all-mpnet-base-v2` (see Figure SBERT-PT-Heatmap, `sbert_pretrained_heatmap.png`) shows how well the model discerns semantic relationships between full sentences.
        *(Your interpretation here: "For example, sentences from the 'Lumina Codex' that describe [similar theme A, e.g., 'planetary habitability for Aqua and Veridia'] likely show a high similarity score of [your_value_from_heatmap]. Conversely, sentences with disparate topics, such as one detailing 'Ignis's molten surface' versus one about 'Sylvan philosophy,' would exhibit low similarity [your_value_lower]. This reflects the model's robust general-purpose training for semantic textual similarity.")*
    * **Experimentally Fine-tuned SBERT (MLM):**
        After the experimental MLM fine-tuning on the "Lumina Codex," the corresponding heatmap (see Figure SBERT-FT-Heatmap, `sbert_finetuned_heatmap.png`) should be compared cell by cell against the pre-trained one.
        *(Your interpretation here: "The heatmap for the fine-tuned model might show [subtle shifts/stronger differentiation/little change] in similarity scores for sentences specific to the 'Lumina Codex.' For instance, the relationship between sentences discussing 'ancient Egyptian cosmology' and 'modern astronomy' within the codex might now have a similarity score of [new_value], potentially different from the pre-trained model's assessment. This indicates that MLM fine-tuning, even if not directly optimizing for sentence similarity, can adapt the underlying representations to the domain vocabulary and common phrasing.")*

2.  **Sentence-Level Contextual Understanding & Vocabulary Handling:**
    * `all-mpnet-base-v2` produces a single vector that represents the meaning of an entire sentence, so word order and context within the sentence are reflected in the embedding. The quality of the semantic similarities observed above is a direct reflection of this.
    * The underlying MPNet model uses subword tokenization, so specialized or fictional terms in the "Lumina Codex" are split into known subword units and represented by composing them, rather than being collapsed into a single unknown token.

3.  **Clustering Quality (PCA of Sentences - Pre-trained vs. Fine-tuned):**
    * **Pre-trained `all-mpnet-base-v2`:**
        The PCA plot of sentence embeddings from the pre-trained model (see Figure SBERT-PT-PCA, `sbert_pretrained_pca.png`) visualizes the semantic space in two dimensions.
        *(Your interpretation here: "This plot should demonstrate good thematic clustering. Sentences from the 'Lumina Codex' discussing, for example, 'planetary descriptions,' 'life forms,' or 'astronomical discovery' might form distinguishable groups, showcasing the model's strong general semantic organization.")*
    * **Experimentally Fine-tuned SBERT (MLM):**
        Comparing this with the PCA plot from the MLM fine-tuned model (see Figure SBERT-FT-PCA, `sbert_finetuned_pca.png`) can reveal the impact of domain adaptation.
        *(Your interpretation here: "The fine-tuning might lead to [e.g., 'tighter clusters for themes very specific to the Lumina Codex,' 'a shift in the relative positions of certain sentence groups,' or 'subtle refinement of existing clusters.']. This visually represents how the embedding space has been influenced by further exposure to the specific narrative and vocabulary of the codex.")*

4.  **Impact of Experimental Fine-tuning (MLM Adaptation):**
    * `all-mpnet-base-v2` is already a highly optimized sentence embedder, so our brief MLM fine-tuning on the "Lumina Codex" was an experiment in domain adaptation rather than an attempt to improve it. A falling MLM loss during this process indicates that the model's underlying token representations were adjusting to the statistical patterns of our specific corpus.
    * The key question when comparing the "before" and "after" heatmaps and PCA plots is how these token-level adaptations (from MLM) carry through SBERT's pooling mechanism into the final *sentence embeddings*. MLM does not optimize for sentence similarity the way SBERT's original training did, and nothing in the MLM objective keeps the pooled vectors aligned with that training. It can make the model more "familiar" with the domain's language, which may subtly shift sentence-level representations in either direction.
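
One way to make the "before" versus "after" comparison concrete is to subtract the two cosine similarity matrices and list the sentence pairs whose similarity moved the most. The following is a minimal sketch rather than the notebook's own code: `cosine_sim_matrix` and `compare_similarity_matrices` are hypothetical helpers, and the random arrays are toy stand-ins for the pre-trained and fine-tuned embedding matrices.

```python
import numpy as np

def cosine_sim_matrix(embeddings):
    """Row-normalize the embeddings and return pairwise cosine similarities."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

def compare_similarity_matrices(sim_before, sim_after, labels, top_k=3):
    """Return the mean absolute change and the top_k most-changed sentence pairs."""
    delta = sim_after - sim_before
    rows, cols = np.triu_indices_from(delta, k=1)  # visit each unordered pair once
    pair_deltas = delta[rows, cols]
    order = np.argsort(-np.abs(pair_deltas))[:top_k]  # largest absolute change first
    top_pairs = [
        (labels[rows[i]], labels[cols[i]], float(pair_deltas[i])) for i in order
    ]
    return float(np.mean(np.abs(pair_deltas))), top_pairs

# Toy stand-ins for the two embedding matrices (hypothetical data, not notebook output).
rng = np.random.default_rng(0)
emb_before = rng.normal(size=(5, 8))
emb_after = emb_before + 0.1 * rng.normal(size=(5, 8))
labels = [f"S{i + 1}" for i in range(5)]

mean_abs_change, top_pairs = compare_similarity_matrices(
    cosine_sim_matrix(emb_before), cosine_sim_matrix(emb_after), labels
)
print(f"Mean |delta cosine| over sentence pairs: {mean_abs_change:.4f}")
for first, second, change in top_pairs:
    print(f"{first} vs {second}: {change:+.4f}")
```

In the notebook, the fine-tuned input would be `similarity_matrix_sbert_ft` and the labels `plot_labels_sbert_ft`; the pre-trained matrix would come from the earlier evaluation cell, whatever name it was given there.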

**Sentence Transformers in the "EmbedEvolution" Context:**

Sentence Transformers like `all-mpnet-base-v2` represent a critical advancement, providing off-the-shelf, high-quality sentence embeddings specifically designed for semantic comparison. They address a key limitation of vanilla BERT, whose pooled outputs are not always optimal for direct sentence similarity tasks.
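
To illustrate what "pooling" refers to here, the sketch below averages token vectors while ignoring padding and then L2-normalizes the result, which to our understanding matches the mean-pooling-plus-normalization setup that `all-mpnet-base-v2` uses. It is plain NumPy on hand-made values, not the library's implementation; `masked_mean_pool` and `l2_normalize` are hypothetical names.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(float)      # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # avoid division by zero
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy batch: 2 sentences, 3 token slots, 2-dimensional vectors (hypothetical values).
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],  # third slot is padding and must be ignored
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
sentence_vectors = l2_normalize(pooled)
print(pooled)            # rows: [2, 0] and [0, 4]
print(sentence_vectors)  # rows: [1, 0] and [0, 1]
```

Pooling alone is not what makes the vectors comparable: the same averaging applied to a vanilla BERT encoder gives weaker similarity scores, because only the SBERT-style training shapes the token vectors so that their mean is meaningful.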

* **Strengths Demonstrated:** Excellent performance in capturing sentence-level semantic similarity, robust vocabulary handling, and the clear benefit of specialized fine-tuning (as embodied by the pre-trained SBERT model itself).
* Our experimental MLM fine-tuning further showed that the underlying Transformer models are adaptable to new domains, although for sentence-level tasks, task-specific fine-tuning objectives (like NLI or STS) are generally preferred for significant performance gains in similarity.
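
For contrast with MLM, a task-specific STS-style objective scores sentence *pairs*: it pushes the cosine similarity of the two embeddings toward a gold similarity label. The sketch below only computes that loss value in NumPy on toy vectors and does no training; the function names are hypothetical. The `sentence-transformers` library offers a trainable counterpart as `losses.CosineSimilarityLoss`.

```python
import numpy as np

def pairwise_cosine(emb_a, emb_b):
    """Cosine similarity between matching rows of two embedding matrices."""
    dots = np.sum(emb_a * emb_b, axis=1)
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return dots / norms

def cosine_similarity_loss(emb_a, emb_b, gold_scores):
    """Mean squared error between predicted cosine similarity and gold scores."""
    predicted = pairwise_cosine(emb_a, emb_b)
    return float(np.mean((predicted - gold_scores) ** 2))

# Toy sentence-pair embeddings (hypothetical values).
emb_a = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [0.0, 1.0]])

gold = np.array([1.0, 0.0])        # pair 1 is a paraphrase, pair 2 is unrelated
gold_wrong = np.array([0.0, 1.0])  # labels that contradict the embeddings

print(cosine_similarity_loss(emb_a, emb_b, gold))        # 0.0
print(cosine_similarity_loss(emb_a, emb_b, gold_wrong))  # 1.0
```

Because the gradient of this loss flows through the pooled sentence vectors, it shapes exactly the quantity that the heatmaps measure, which MLM does only indirectly.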

**Next Steps:**

Having explored Sentence Transformers, which are fine-tuned for general semantic understanding between sentences, we next investigate **GTE (General Text Embeddings)**-style models, like `BAAI/bge-base-en-v1.5` (strictly a member of the BGE, or BAAI General Embedding, family). These often represent the cutting edge in general-purpose text embedding performance across diverse benchmarks and may introduce subtle aspects of task-awareness, bridging towards fully instruction-tuned models.