# EmbedEvolution Stage 6: GTE (General Text Embeddings) with "Lumina Codex"

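Welcome to Stage 6! We explore **General Text Embedding (GTE)** models, using `BAAI/bge-base-en-v1.5` (BGE, short for BAAI General Embedding) as our example. These models are trained to produce one embedding space that works across tasks such as semantic similarity, clustering, and retrieval.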

We will:
1. Use the pre-trained `BAAI/bge-base-en-v1.5`.
2. Analyze its sentence embeddings from the "Lumina Codex."
3. Attempt an *experimental* further fine-tuning of its underlying Transformer using Masked Language Modeling (MLM) on the "Lumina Codex" to observe domain adaptation.
4. Compare embeddings before and after this experimental fine-tuning.
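
One simple way to quantify the comparison in step 4 is to measure how much each pairwise similarity moves between the two models. The sketch below is a minimal illustration under our own assumptions: `similarity_shift` is a hypothetical helper (not part of any library), and the matrices are made-up numbers rather than model output.

```python
from itertools import combinations

def similarity_shift(sim_before, sim_after):
    """Mean and max absolute change over all distinct sentence pairs.

    Both arguments are square, symmetric similarity matrices (lists of
    lists or NumPy arrays) built from the same sentences in the same order.
    """
    n = len(sim_before)
    if n != len(sim_after):
        raise ValueError("matrices must cover the same sentences")
    # Only look at pairs (i, j) with i < j: the diagonal is always 1.0
    # and the lower triangle mirrors the upper one.
    diffs = [abs(sim_after[i][j] - sim_before[i][j])
             for i, j in combinations(range(n), 2)]
    if not diffs:
        raise ValueError("need at least two sentences")
    return sum(diffs) / len(diffs), max(diffs)

# Toy 3-sentence example (made-up numbers, not model output).
before = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.7], [0.6, 0.7, 1.0]]
after = [[1.0, 0.7, 0.6], [0.7, 1.0, 0.9], [0.6, 0.9, 1.0]]
mean_shift, max_shift = similarity_shift(before, after)
print(f"mean |delta| = {mean_shift:.3f}, max |delta| = {max_shift:.3f}")
```

A small mean shift with a large maximum would suggest that fine-tuning moved only a few sentence pairs, which is worth inspecting by hand.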

## 1. Setup and Imports

In [None]:
import torch
from sentence_transformers import SentenceTransformer, models
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, DataCollatorForLanguageModeling, TrainingArguments, Trainer # For MLM (AutoModel is used in section 6.2)
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from tqdm.auto import tqdm
import os
import shutil # For cleaning up saved model directory

# Device selection
if torch.cuda.is_available():
    device_str = "cuda"
elif torch.backends.mps.is_available():
    device_str = "mps"
else:
    device_str = "cpu"
device = torch.device(device_str)
print(f"Selected device: {device}")

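# Download the NLTK sentence tokenizer data if it is missing
try:
    _ = nltk.sent_tokenize("Test sentence.")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead of 'punkt'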

## 2. Define the Corpus: The "Lumina Codex"
We'll use our detailed "Lumina Codex and the Solara System" text.

In [None]:
text = """
The Lumina Codex and the Solara System: A Tapestry of Ancient Wisdom and Cosmic Discovery
In the shadowed halls of the Cairo Museum, a dusty papyrus scroll, cataloged as Papyrus K-37b from the Middle Kingdom, lay forgotten for centuries. Dubbed the Lumina Codex by its discoverers, this fragile relic was initially dismissed as a mythological curiosity, its cryptic hieroglyphs and star charts interpreted as poetic musings of a priestly scribe. Yet, in 2024, a team of linguists and astronomers, led by Dr. Amara Nassar, deciphered its veiled verses, revealing an astonishing truth: the codex described a distant star system with uncanny precision, orbiting a radiant G-type star named the Star of Eternal Radiance—now known as Lumina. This revelation sparked a scientific odyssey, merging ancient Egyptian cosmology with cutting-edge astronomy, as the Solara System emerged from the Nebula Cygnus-X1, nestled in the Orion Spur of the Milky Way Galaxy.

The Lumina Codex spoke of Lumina and its ten celestial attendants, organized into poetic regions: the searing Forges of Ra for the inner worlds, the verdant Blessed Belt of Osiris for the habitable zone, the majestic Domains of the Sky Titans for gas giants, and the enigmatic Frozen Outlands for the outer realms. Its star charts, etched with meticulous care, hinted at a cosmic map, with references to the Rivers of Stars—likely the Milky Way—and the Celestial Gardens, evoking the Local Group within the Virgo Supercluster. The codex’s verses, such as “Ten jewels dance in the embrace of the Eternal Radiance, their faces veiled in fire, water, and ice,” seemed to prefigure a system now confirmed by the Cygnus-X1 Deep Sky Array, a fictional next-generation telescope orbiting beyond Earth’s atmosphere.

Discovery and Modern Corroboration
The Solara System’s discovery began in 2023, when the Cygnus-X1 Deep Sky Array detected subtle wobbles in Lumina’s light, indicating a complex system of orbiting bodies. Located 1,200 light-years away in the Nebula Cygnus-X1, Lumina is a stable, middle-aged G-type star, slightly larger than the Sun, with a luminosity that sustains a diverse array of worlds. As astronomers analyzed the data, they identified ten planets, each with unique characteristics that eerily echoed the Lumina Codex. The parallels were undeniable: the codex’s Forges of Ra matched the inner rocky planets, while the Blessed Belt of Osiris aligned with two habitable worlds teeming with life. The Domains of the Sky Titans and Frozen Outlands described gas giants and icy dwarfs with striking accuracy. The scientific community buzzed with excitement, as linguists and astronomers collaborated to decode the codex’s metaphors, revealing a blend of ancient intuition and cosmic truth.

The Solara System: A Celestial Menagerie
Lumina: The Star of Eternal Radiance
Lumina, a G2V star, radiates a warm, golden light, its stable fusion cycle supporting a system spanning 12 astronomical units. Its magnetic fields are unusually calm, suggesting a long lifespan conducive to life’s evolution. The codex describes Lumina as “the hearth of eternity, whose breath kindles the dance of worlds,” a poetic nod to its life-giving energy.

The Forges of Ra: Inner Planets
1- Ignis: The closest planet to Lumina, Ignis is a scorched, iron-rich world with a molten surface pocked by ancient impact craters. Its thin atmosphere, rich in sulfur dioxide, glows faintly under Lumina’s intense radiation. The codex calls it “Ra’s Anvil, where molten rivers forge the bones of the cosmos,” reflecting its volcanic past and metallic crust.
2- Ferrus: Slightly larger, Ferrus is a rocky planet with vast plains of oxidized iron, giving it a crimson hue. Its surface bears scars of past tectonic activity, with towering cliffs and deep chasms. The codex names it “the Forge of Hephaestus’s Twin,” hinting at its metallic wealth, now confirmed by spectroscopic analysis revealing nickel and cobalt deposits.
The Blessed Belt of Osiris: Habitable Zone
1- Aqua: A breathtaking ocean world, Aqua is enveloped in turquoise clouds of water vapor and nitrogen. Its surface is 90% liquid water, with archipelagos of coral-like structures hosting complex aquatic ecosystems. Bioluminescent Aquarelles, jellyfish-like creatures with crystalline tentacles, drift in vast schools, their light pulses synchronizing in rhythmic displays. Predatory Thalacynths, eel-like organisms with electromagnetic sensors, hunt in the deep trenches. Aqua’s moon, Thalassa, is an ice-covered world with a subglacial ocean, where astrobiologists hypothesize microbial extremophiles thrive in hydrothermal vents, metabolizing sulfur compounds. The codex describes Aqua as “Osiris’s Chalice, where life swims in the tears of the gods,” and Thalassa as “the frozen veil hiding the spark of creation.”
2- Veridia: A super-Earth, Veridia boasts lush continents of bioluminescent flora, such as Luminara trees, which pulse with green and violet light, and Crystalferns, whose fractal leaves refract Lumina’s rays into dazzling spectra. Veridia is home to the Sylvans, sentient, silicon-based life forms resembling ambulatory crystal shrubs. Their bodies, composed of lattice-like structures, shimmer with bioluminescent patterns used for communication. Sylvan society is decentralized, with “groves” of individuals linked via light-based signals, forming a collective consciousness deeply attuned to Veridia’s ecosystem. Their architecture, grown from crystalline minerals, integrates seamlessly with the landscape. The codex calls Veridia “the Garden of Osiris’s Breath,” where “the shining ones weave light into wisdom.”
The Domains of the Sky Titans: Gas Giants
1- Zephyrus: A massive hydrogen-helium gas giant, Zephyrus dominates the system with its radiant ring system, composed of ice and silicate particles. Its atmosphere swirls with golden storms, driven by intense winds. Among its 47 moons, Io-Prime stands out, a volcanically active world spewing sulfur plumes, likely powered by tidal heating. The codex names Zephyrus “the Sky Titan’s Crown,” its rings “the jeweled girdle of the heavens.”
2- Boreas: An ice giant with a deep blue methane atmosphere, Boreas exhibits retrograde rotation and an asymmetrical magnetic field, creating auroras that dance across its poles. Its 22 moons include Erynnis, a rocky moon with methane lakes. The codex describes Boreas as “the Frost Titan, whose breath chills the void,” capturing its icy majesty.
The Frozen Outlands: Outer Planets
1- Umbriel: A dwarf planet with a charcoal-dark surface, Umbriel’s icy crust is fractured by ancient impacts. Its moon Nyx, a captured object, is rich in organic compounds, hinting at prebiotic chemistry. The codex calls Umbriel “the Shadowed Outcast, guarded by the dark sentinel.”
2- Erebus: An icy world with a nitrogen-methane atmosphere, Erebus has a highly elliptical orbit, suggesting a captured origin. Its surface sparkles with frost-covered ridges. The codex names it “the Silent Wanderer, cloaked in eternal frost.”
3- Aetheria: The outermost planet, Aetheria is a rogue dwarf with a thin atmosphere of neon and argon. Its moon Lethe exhibits cryovolcanism, spewing ammonia-water mixtures. Astrobiologists speculate that Lethe’s subsurface ocean may harbor microbial life, analogous to Thalassa’s. The codex describes Aetheria as “the Veiled Wanderer, whose dreams freeze in the outer dark,” and Lethe as “the weeping mirror of the cosmos.”
4- Nyxara: A small, icy body with a chaotic orbit, Nyxara’s surface is a mosaic of frozen nitrogen and carbon monoxide. The codex calls it “the Lost Jewel, dancing beyond the Titans’ gaze.”
Life in the Solara System
Aqua’s aquatic ecosystems are a marvel, with Aquarelles forming symbiotic networks with coral-like Hydroskeletons, which filter nutrients from the water. Thalacynths use electromagnetic pulses to stun prey, suggesting an evolutionary arms race. On Thalassa, microbial life is hypothesized based on chemical signatures of sulfur and methane in its subglacial ocean, though no direct evidence exists yet.

Veridia’s Sylvans are the system’s crown jewel. Their crystalline bodies, averaging two meters tall, refract light into complex patterns, encoding emotions, ideas, and memories. Their society operates as a “luminous collective,” with no central authority; decisions emerge from synchronized light displays across groves. Sylvan technology manipulates crystalline minerals to create tools and habitats, all in harmony with Veridia’s ecosystem. Their discovery has sparked intense study by linguists decoding their light-based language, revealing a philosophy centered on balance and interconnectedness.

On Lethe, cryovolcanic activity suggests a subsurface ocean with potential microbial ecosystems, possibly metabolizing ammonia. Unlike Aqua’s confirmed complex life and Veridia’s sentient Sylvans, life on Thalassa and Lethe remains speculative, driving astrobiological research.

Galactic Context
The Solara System resides in the Orion Spur, a minor arm of the Milky Way, part of the Local Group within the Virgo Supercluster. The codex’s Rivers of Stars evoke the Milky Way’s spiral arms, while the Celestial Gardens suggest a poetic grasp of the Local Group’s galactic cluster. This cosmic placement underscores Solara’s significance as a microcosm of the universe’s diversity.

Ongoing Exploration
Scientific teams, including astrobiologists, geologists, and linguists, are studying Solara via the Cygnus-X1 Deep Sky Array and planned probes, such as the Lumina Pathfinder Mission. Challenges include the 1,200-light-year distance, requiring advanced telemetry for data transmission. Sylvan communication poses a unique hurdle, as their light patterns defy traditional linguistic models. Future missions aim to deploy orbiters around Aqua, Veridia, and Lethe to confirm microbial life and study Sylvan culture.

A Cosmic Tapestry
The Solara System, unveiled through the Lumina Codex and modern astronomy, blends ancient wisdom with scientific discovery. Its worlds—from the fiery Forges of Ra to the icy Frozen Outlands—offer a rich tapestry of environments, life forms, and mysteries. As scientists probe this distant system, the codex’s poetic verses resonate, reminding humanity that the cosmos has long whispered its secrets, awaiting those bold enough to listen.
"""
def clean_text_for_embedding_models(input_text):
    # Collapse every run of whitespace (including newlines) into a single space
    input_text = re.sub(r'\s+', ' ', input_text).strip()
    return input_text

cleaned_full_text_gte_corpus = clean_text_for_embedding_models(text)
sentences_from_corpus_gte_analysis = nltk.sent_tokenize(text)
cleaned_sentences_for_gte_analysis = [clean_text_for_embedding_models(s) for s in sentences_from_corpus_gte_analysis if s.strip()]

print(f"Found {len(cleaned_sentences_for_gte_analysis)} sentences in the corpus for GTE analysis.")



## 3. Load Pre-trained GTE Model (`BAAI/bge-base-en-v1.5`)

In [None]:
GTE_MODEL_NAME_PRETRAINED = 'BAAI/bge-base-en-v1.5' # This is our base pre-trained GTE

try:
    gte_model_pretrained = SentenceTransformer(GTE_MODEL_NAME_PRETRAINED, device=device_str)
    print(f"Successfully loaded PRE-TRAINED GTE model: '{GTE_MODEL_NAME_PRETRAINED}'")
except Exception as e:
    print(f"Error loading PRE-TRAINED GTE model: {e}")
    gte_model_pretrained = None

## 4. Define Sentences for Analysis & Keyword Spotting

In [None]:
selected_sentences_for_gte = []
selected_sentence_labels_gte = []
# (Using the same selection logic as the SBERT notebook for consistency)
temp_selected_indices_gte = set()
keywords_for_gte_themes = [
    "lumina", "solara", "aqua", "veridia", "sylvans", "thalassa", "lethe", "star", "planet",
    "moon", "orbit", "atmosphere", "ecosystem", "rocky", "icy", "habitable",
    "bioluminescent", "sentient", "codex", "discovery", "life", "egyptian", "astronomy", "science"
]
if cleaned_sentences_for_gte_analysis:
    prominent_keywords_gte = ["lumina", "aqua", "veridia", "sylvans", "codex", "star", "planet", "life", "discovery"]
    for kw in prominent_keywords_gte:
        found_count = 0
        for i, s_text in enumerate(cleaned_sentences_for_gte_analysis):
            s_lower = s_text.lower()
            if re.search(r'\b' + re.escape(kw) + r'\b', s_lower) and i not in temp_selected_indices_gte:
                selected_sentences_for_gte.append(s_text)
                selected_sentence_labels_gte.append(f"{kw}_ctx{found_count+1}")
                temp_selected_indices_gte.add(i)
                found_count += 1
                if found_count >= 1: break
    num_needed_gte = max(0, 15 - len(selected_sentences_for_gte))
    idx_counter = 0
    while num_needed_gte > 0 and idx_counter < len(cleaned_sentences_for_gte_analysis):
        s_text = cleaned_sentences_for_gte_analysis[idx_counter]
        if idx_counter not in temp_selected_indices_gte:
            selected_sentences_for_gte.append(s_text)
            s_lower = s_text.lower()
            label_found = False
            for kw_other in keywords_for_gte_themes:
                if re.search(r'\b' + re.escape(kw_other) + r'\b', s_lower):
                    potential_label = f"{kw_other}_misc_ctx"
                    if potential_label not in selected_sentence_labels_gte:
                        selected_sentence_labels_gte.append(potential_label)
                        label_found = True
                        break
            if not label_found:
                selected_sentence_labels_gte.append(f"misc_ctx{len(selected_sentence_labels_gte)}")
            temp_selected_indices_gte.add(idx_counter)
            num_needed_gte -= 1
        idx_counter += 1
    if len(selected_sentences_for_gte) < 5 and cleaned_sentences_for_gte_analysis:
        selected_sentences_for_gte = cleaned_sentences_for_gte_analysis[:min(15, len(cleaned_sentences_for_gte_analysis))]
        selected_sentence_labels_gte = [f"S{i+1}_{s[:10].replace(' ','_')}" for i, s in enumerate(selected_sentences_for_gte)]
    print(f"\nSelected {len(selected_sentences_for_gte)} diverse sentences for GTE analysis.")


## 5. Part 1: Analysis with Pre-trained GTE Model
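We use `BAAI/bge-base-en-v1.5` as is. We pass `normalize_embeddings=True` so that every embedding has unit length: the heatmap code computes similarity with a plain dot product, which equals cosine similarity only for normalized vectors.
For embedding general passages (our sentences), no instruction prefix is needed.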
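
A minimal, dependency-free sketch of why normalization matters: once two vectors have unit length, their dot product is their cosine similarity. The vectors here are made-up 3-dimensional examples, not model output, and the helper names are ours.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

u = [3.0, 4.0, 0.0]   # length 5
v = [1.0, 2.0, 2.0]   # length 3

raw_dot = dot(u, v)                                # depends on vector lengths
cos_uv = cosine(u, v)                              # length-independent
unit_dot = dot(l2_normalize(u), l2_normalize(v))   # dot product after normalizing

print(f"raw dot = {raw_dot}, cosine = {cos_uv:.4f}, normalized dot = {unit_dot:.4f}")
```

The raw dot product changes if either vector is rescaled, while the last two values agree, which is the property the heatmap relies on.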


### 5.1 Generate and Evaluate Pre-trained GTE Embeddings

In [None]:
sentence_embeddings_gte_pt = None # pt for pre-trained
if gte_model_pretrained and selected_sentences_for_gte:
    print(f"\nGenerating GTE (pre-trained) embeddings for {len(selected_sentences_for_gte)} sentences...")
    sentence_embeddings_gte_pt = gte_model_pretrained.encode(
        selected_sentences_for_gte,
        convert_to_numpy=True,
        normalize_embeddings=True # Crucial for BGE
    )
    print(f"Generated GTE (pre-trained) sentence embeddings. Shape: {sentence_embeddings_gte_pt.shape}")

    if sentence_embeddings_gte_pt is not None and sentence_embeddings_gte_pt.shape[0] >= 2:
        embedding_matrix_gte_pt = sentence_embeddings_gte_pt
        plot_labels_gte_pt = selected_sentence_labels_gte

        # --- Semantic Similarity (Heatmap) ---
        print("\n--- Cosine Similarity Heatmap (Sentences - Pre-trained GTE) ---")
        # Embeddings are L2-normalized, so the dot product equals cosine similarity
        similarity_matrix_gte_pt = np.dot(embedding_matrix_gte_pt, embedding_matrix_gte_pt.T)
        
        num_labels_gte_pt = len(plot_labels_gte_pt)
        fig_width_gte_pt = max(12, num_labels_gte_pt * 0.6)
        fig_height_gte_pt = max(10, num_labels_gte_pt * 0.45)
        plt.figure(figsize=(fig_width_gte_pt, fig_height_gte_pt))
        
        annotate_heatmap_gte_pt = num_labels_gte_pt < 20
        sns.heatmap(similarity_matrix_gte_pt,
                    annot=annotate_heatmap_gte_pt, cmap='coolwarm', fmt=".2f",
                    xticklabels=plot_labels_gte_pt, yticklabels=plot_labels_gte_pt,
                    linewidths=.5, cbar_kws={"shrink": .8}, vmin=-1, vmax=1)
        plt.title(f'GTE ({GTE_MODEL_NAME_PRETRAINED}) Sentence Cosine Similarity (Pre-trained)', fontsize=16)
        plt.xticks(rotation=65, ha='right', fontsize=max(8, 12 - num_labels_gte_pt // 5))
        plt.yticks(rotation=0, fontsize=max(8, 12 - num_labels_gte_pt // 5))
        plt.tight_layout()
        plt.show()

        # --- Clustering Quality (PCA) ---
        print("\n--- PCA Visualization (Sentences - Pre-trained GTE) ---")
        pca_gte_pt = PCA(n_components=2)
        embeddings_2d_gte_pt = pca_gte_pt.fit_transform(embedding_matrix_gte_pt)

        plt.figure(figsize=(fig_width_gte_pt * 0.9, fig_height_gte_pt * 0.9))
        plt.scatter(embeddings_2d_gte_pt[:, 0], embeddings_2d_gte_pt[:, 1], alpha=0.7, s=60)
        for i, label in enumerate(plot_labels_gte_pt):
            plt.annotate(label, (embeddings_2d_gte_pt[i, 0], embeddings_2d_gte_pt[i, 1]),
                         textcoords="offset points", xytext=(5,5), ha='center', fontsize=max(7, 10 - num_labels_gte_pt // 6))
        plt.title(f'GTE ({GTE_MODEL_NAME_PRETRAINED}) Sentence Embeddings (Pre-trained) - PCA', fontsize=16)
        plt.xlabel('PCA Component 1')
        plt.ylabel('PCA Component 2')
        plt.grid(True)
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough sentence embeddings for pre-trained GTE analysis.")
else:
    print("Pre-trained GTE model not loaded or no sentences. Skipping pre-trained GTE analysis.")

### Interpretation Notes for Pre-trained GTE (`BAAI/bge-base-en-v1.5`):
* **Semantic Similarity & Clustering:** BGE models score well on general-purpose embedding benchmarks such as MTEB, so the heatmap should show intuitive similarities between Lumina Codex sentences, and the PCA plot should group sentences by theme. Compare these qualitatively to the SBERT results.
* **Vocabulary Handling:** The model's BERT-style WordPiece tokenizer splits unfamiliar names such as "Thalacynths" or "Sylvans" into known subword pieces, so no word in the corpus is out-of-vocabulary.
* **Instruction Hint:** We use BGE here for general passage embedding (no prefix). In retrieval tasks, BGE recommends prepending a short instruction to the query (not to the passages), which can improve results for short queries.
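
The query-side instruction can be sketched as follows. The instruction string is the one we understand the BGE model card to list for the English v1.5 models; treat it as an assumption and check the card before relying on it. `prepare_for_retrieval` is a hypothetical helper name, not a library function.

```python
# Assumed from the BGE model card for English v1.5 models; verify before use.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def prepare_for_retrieval(queries, passages, instruction=BGE_QUERY_INSTRUCTION):
    """Prefix each query with the instruction; return passages unchanged."""
    prefixed_queries = [instruction + q for q in queries]
    return prefixed_queries, list(passages)

queries = ["Which planets host life?"]
passages = ["Aqua is an ocean world with complex aquatic ecosystems."]
prefixed, passages_out = prepare_for_retrieval(queries, passages)

print(prefixed[0])
# Both lists would then go through the same encode() call with
# normalize_embeddings=True, exactly as the sentences do in this notebook.
```

Only the queries change; passages are embedded exactly as in the rest of this stage.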

## 6. Part 2: Experimental Fine-tuning of GTE's Base Model on "Lumina Codex" (MLM)

As with SBERT, we'll attempt an *experimental* further adaptation of the GTE model's underlying Transformer to our "Lumina Codex" domain using Masked Language Modeling (MLM).
**Note:** BGE was already trained with sentence-level objectives. MLM fine-tuning only adapts token-level representations to the domain; it does not optimize the pooled sentence embedding, so it may leave sentence similarity unchanged or even degrade it unless it is followed by training on sentence pairs.
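
To make the MLM objective concrete, here is a toy sketch of the masking step that a collator such as `DataCollatorForLanguageModeling` performs before each batch. It follows the commonly described 80/10/10 rule (replace with `[MASK]`, replace with a random token, or keep unchanged). `mask_tokens` is our own illustrative function; the token ids are placeholders and the library's exact sampling differs in detail.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips

def mask_tokens(token_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one tokenized sentence."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        # Never mask special tokens; select other tokens with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = tok            # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged

    return inputs, labels

# Placeholder ids: 101/102 stand in for [CLS]/[SEP], 103 for [MASK].
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103
token_ids = [CLS_ID] + list(range(2000, 2040)) + [SEP_ID]
masked_inputs, labels = mask_tokens(
    token_ids, special_ids={CLS_ID, SEP_ID}, mask_id=MASK_ID, vocab_size=30522
)
num_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{num_selected} of {len(token_ids)} tokens selected for prediction")
```

The loss is computed only at positions whose label is not `-100`, which is why MLM updates token-level predictions rather than the sentence embedding directly.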

### 6.1 Prepare Model and Data for MLM Fine-tuning

In [None]:
gte_mlm_model_to_finetune = None
gte_mlm_tokenizer = None
gte_mlm_dataloader = None

if gte_model_pretrained: # Original SentenceTransformer object for BGE
    try:
        # BGE models are BERT-like. We load the same checkpoint twice: once for its
        # tokenizer and once as an encoder with an MLM head on top.

        base_gte_model_name_for_mlm = GTE_MODEL_NAME_PRETRAINED # e.g., 'BAAI/bge-base-en-v1.5'

        gte_mlm_tokenizer = AutoTokenizer.from_pretrained(base_gte_model_name_for_mlm)
        print(f"Loaded tokenizer '{base_gte_model_name_for_mlm}' for GTE MLM fine-tuning.")

        # Load the architecture with an MLM head.
        # Note: the checkpoint is published as an embedding encoder, so transformers may warn
        # that some MLM head weights are newly initialized; those parts are then trained from scratch here.
        gte_mlm_model_to_finetune = AutoModelForMaskedLM.from_pretrained(base_gte_model_name_for_mlm).to(device)

        # `gte_model_pretrained[0].auto_model` was loaded from the same checkpoint, so the encoder
        # weights here already match it and no state_dict transfer is needed.
        # (A state_dict load would only be required if the SentenceTransformer came from a different path.)

        
        print("Prepared underlying GTE Transformer model for MLM fine-tuning.")

    except Exception as e:
        print(f"Error preparing GTE model for MLM fine-tuning: {e}")
        gte_mlm_model_to_finetune = None
        gte_mlm_tokenizer = None
else:
    print("Original pre-trained GTE model not available to extract base for fine-tuning.")


class MLMDatasetGTE(Dataset): # Can reuse the SBERT one if tokenizer API is compatible
    def __init__(self, texts, tokenizer, max_length=512): # 512 tokens is the maximum input length for bge-base-en-v1.5
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = min(max_length, tokenizer.model_max_length) # Respect model's max length
        
        # BGE uses CLS pooling for sentence embeddings, so it needs [CLS]
        # For MLM, it's standard to add special tokens.
        self.encoded_texts = self.tokenizer(
            texts,
            add_special_tokens=True, # Add [CLS] and [SEP]
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt"
        )
        self.data_collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=True, mlm_probability=0.15)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {
            "input_ids": self.encoded_texts["input_ids"][idx],
            "attention_mask": self.encoded_texts["attention_mask"][idx]
        }

if gte_mlm_tokenizer and cleaned_sentences_for_gte_analysis:
    num_finetune_sentences_gte = min(len(cleaned_sentences_for_gte_analysis), 500)
    finetune_texts_gte = cleaned_sentences_for_gte_analysis[:num_finetune_sentences_gte]
    
    if finetune_texts_gte:
        mlm_dataset_gte = MLMDatasetGTE(finetune_texts_gte, gte_mlm_tokenizer)
        if len(mlm_dataset_gte) > 0:
            gte_mlm_dataloader = DataLoader(mlm_dataset_gte, batch_size=4, shuffle=True, collate_fn=mlm_dataset_gte.data_collator) # Smaller batch for potentially larger model
            print(f"Created MLM dataset for GTE base model with {len(mlm_dataset_gte)} instances.")
        else:
            print("MLM dataset for GTE base is empty.")
    else:
        print("No texts selected for GTE MLM fine-tuning.")
else:
    print("MLM Tokenizer not available for GTE base model fine-tuning.")

### 6.2 Fine-tune GTE's Base Model (Experimental MLM)

In [None]:
gte_model_finetuned_base = None # This will store the fine-tuned Transformer base

if gte_mlm_model_to_finetune and gte_mlm_dataloader and len(mlm_dataset_gte)>0:
    print("Starting experimental MLM fine-tuning of GTE's base Transformer...")
    gte_mlm_model_to_finetune.train()

    optimizer_gte = torch.optim.AdamW(gte_mlm_model_to_finetune.parameters(), lr=2e-6) # Very low LR to avoid disturbing the pre-trained weights
    num_epochs_gte_finetune = 1 # 1-2 epochs are enough for a quick domain adaptation demo
    
    for epoch in range(num_epochs_gte_finetune):
        epoch_loss_gte = 0
        progress_bar_gte = tqdm(gte_mlm_dataloader, desc=f"GTE Epoch {epoch + 1}/{num_epochs_gte_finetune}")
        for batch in progress_bar_gte:
            optimizer_gte.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = gte_mlm_model_to_finetune(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            
            if loss is not None:
                loss.backward()
                optimizer_gte.step()
                epoch_loss_gte += loss.item()
                progress_bar_gte.set_postfix({'loss': loss.item()})
            else:
                print("Warning: Loss is None during GTE fine-tuning step.")

        avg_epoch_loss_gte = epoch_loss_gte / len(gte_mlm_dataloader) if len(gte_mlm_dataloader) > 0 else float('nan')
        print(f"GTE Epoch {epoch + 1} complete. Average Loss: {avg_epoch_loss_gte:.4f}")

    print("Experimental MLM fine-tuning of GTE's base Transformer complete.")
    # Extract the encoder (without the MLM head) so it can be wrapped by sentence-transformers.
    # BertForMaskedLM exposes the encoder as `.bert`; RoBERTa-style models expose it as `.roberta`.
    if hasattr(gte_mlm_model_to_finetune, 'bert'): # Standard BERT-like structure
        gte_model_finetuned_base = gte_mlm_model_to_finetune.bert
    elif hasattr(gte_mlm_model_to_finetune, 'roberta'): # RoBERTa-like
        gte_model_finetuned_base = gte_mlm_model_to_finetune.roberta
    else:
        # Fallback for architectures without a `.bert` or `.roberta` attribute:
        # save the whole fine-tuned MLM model, then reload only the encoder with AutoModel,
        # which leaves out the MLM head weights.
        print("No '.bert' or '.roberta' attribute found; reloading the encoder from a saved copy of the fine-tuned MLM model.")
        temp_finetuned_mlm_path = "./gte_finetuned_mlm_temp"
        gte_mlm_model_to_finetune.save_pretrained(temp_finetuned_mlm_path)
        gte_model_finetuned_base = AutoModel.from_pretrained(temp_finetuned_mlm_path).to(device) # Load base from fine-tuned MLM model path
        print(f"Loaded base model from fine-tuned MLM model at {temp_finetuned_mlm_path}")


    if gte_model_finetuned_base:
        gte_model_finetuned_base.eval()
else:
    print("GTE MLM model or dataloader not available. Skipping experimental fine-tuning.")

### 6.3 Evaluate Fine-tuned GTE Embeddings

In [None]:
sentence_embeddings_gte_ft = None # ft for fine-tuned
gte_model_finetuned_full = None   # Initialize

# This cell needs three objects from earlier cells:
# - gte_model_finetuned_base: the fine-tuned encoder extracted in section 6.2
# - gte_model_pretrained: the original SentenceTransformer ('BAAI/bge-base-en-v1.5')
# - gte_mlm_tokenizer: the tokenizer used for the MLM fine-tuning

if ('gte_model_finetuned_base' in locals() and gte_model_finetuned_base is not None and
    'gte_model_pretrained' in locals() and gte_model_pretrained is not None and
    'gte_mlm_tokenizer' in locals() and gte_mlm_tokenizer is not None):

    print("Preparing fine-tuned GTE model for evaluation...")
    try:
        # --- Step 1: Save the fine-tuned base Hugging Face Transformer model & its tokenizer ---
        finetuned_base_gte_model_path = "./gte_finetuned_base_for_sbert_construction" # Temporary path
        
        gte_model_finetuned_base.save_pretrained(finetuned_base_gte_model_path)
        gte_mlm_tokenizer.save_pretrained(finetuned_base_gte_model_path) # Save the tokenizer used for MLM
        print(f"Fine-tuned GTE base model and tokenizer saved to {finetuned_base_gte_model_path}")

        # --- Step 2: Reconstruct the SentenceTransformer using the path to the fine-tuned base ---
        # This SentenceTransformer layer will load the fine-tuned model and its tokenizer from the path.
        original_gte_transformer_layer = gte_model_pretrained[0] # This is a sentence_transformers.models.Transformer

        word_embedding_module_gte_ft = models.Transformer(
            model_name_or_path=finetuned_base_gte_model_path, # LOAD FROM OUR SAVED PATH
            max_seq_length=original_gte_transformer_layer.max_seq_length # Use original SBERT wrapper's max_seq_length
        )
        word_embedding_module_gte_ft.to(device)

        # Reuse the original pooling layer (BGE uses CLS pooling), which is the
        # second module of the pre-trained SentenceTransformer.
        pooling_model_gte = gte_model_pretrained[1]
        pooling_model_gte.to(device)
        
        reconstructed_modules_gte_ft = [word_embedding_module_gte_ft, pooling_model_gte]
        
        # Check if original gte_model_pretrained had a Normalize layer
        if len(gte_model_pretrained) > 2 and isinstance(gte_model_pretrained[2], models.Normalize):
            normalize_model_gte = gte_model_pretrained[2]
            normalize_model_gte.to(device)
            reconstructed_modules_gte_ft.append(normalize_model_gte)
        # If no explicit Normalize layer, but BGE needs normalization, ensure encode applies it
        # (The normalize_embeddings=True in encode() call later will handle this)

        gte_model_finetuned_full = SentenceTransformer(modules=reconstructed_modules_gte_ft, device=device_str)
        print("Successfully reconstructed SentenceTransformer with fine-tuned GTE base.")

        # Optional: Clean up the saved model directories (os and shutil are imported in the setup cell)
        # if os.path.exists(finetuned_base_gte_model_path):
        #     shutil.rmtree(finetuned_base_gte_model_path)
        #     print(f"Cleaned up temporary directory: {finetuned_base_gte_model_path}")
        # if os.path.exists("./gte_finetuned_mlm_temp"): # If path from MLM saving exists
        #    shutil.rmtree("./gte_finetuned_mlm_temp")


        # --- Step 3: Proceed with evaluation ---
        if gte_model_finetuned_full and selected_sentences_for_gte: # selected_sentences_for_gte from cell 4
            print(f"\nGenerating GTE (fine-tuned) embeddings for {len(selected_sentences_for_gte)} sentences...")
            sentence_embeddings_gte_ft = gte_model_finetuned_full.encode(
                selected_sentences_for_gte,
                convert_to_numpy=True,
                normalize_embeddings=True, # Crucial for BGE
                device=device_str # Explicitly pass device
            )
            print(f"Generated GTE (fine-tuned) sentence embeddings. Shape: {sentence_embeddings_gte_ft.shape}")

            if sentence_embeddings_gte_ft is not None and sentence_embeddings_gte_ft.shape[0] >= 2:
                embedding_matrix_gte_ft = sentence_embeddings_gte_ft
                plot_labels_gte_ft = selected_sentence_labels_gte # Use the same labels

                # --- Semantic Similarity of Sentences (Heatmap - Fine-tuned GTE) ---
                print("\n--- Cosine Similarity Heatmap (Sentences - Fine-tuned GTE) ---")
                # For normalized embeddings, dot product is equivalent to cosine similarity
                similarity_matrix_gte_ft = np.dot(embedding_matrix_gte_ft, embedding_matrix_gte_ft.T)
                
                # Size the figure from the number of sentence labels so tick labels stay readable
                num_labels_gte_ft = len(plot_labels_gte_ft)
                fig_width_gte_ft = max(12, num_labels_gte_ft * 0.6)
                fig_height_gte_ft = max(10, num_labels_gte_ft * 0.45)

                plt.figure(figsize=(fig_width_gte_ft, fig_height_gte_ft))
                annotate_heatmap_gte_ft = num_labels_gte_ft < 20
                sns.heatmap(similarity_matrix_gte_ft,
                            annot=annotate_heatmap_gte_ft, cmap='coolwarm', fmt=".2f",
                            xticklabels=plot_labels_gte_ft, yticklabels=plot_labels_gte_ft,
                            linewidths=.5, cbar_kws={"shrink": .8}, vmin=-1, vmax=1)
                plt.title(f'GTE ({GTE_MODEL_NAME_PRETRAINED}) Similarity (Fine-tuned on Lumina Codex)', fontsize=16)
                plt.xticks(rotation=65, ha='right', fontsize=max(8, 12 - num_labels_gte_ft // 5))
                plt.yticks(rotation=0, fontsize=max(8, 12 - num_labels_gte_ft // 5))
                plt.tight_layout()
                plt.show()

                # --- Clustering Quality of Sentences (PCA Visualization - Fine-tuned GTE) ---
                print("\n--- PCA Visualization (Sentences - Fine-tuned GTE) ---")
                pca_gte_ft = PCA(n_components=2)
                embeddings_2d_gte_ft = pca_gte_ft.fit_transform(embedding_matrix_gte_ft)

                plt.figure(figsize=(fig_width_gte_ft * 0.9, fig_height_gte_ft * 0.9))
                plt.scatter(embeddings_2d_gte_ft[:, 0], embeddings_2d_gte_ft[:, 1], alpha=0.7, s=60)
                for i, label in enumerate(plot_labels_gte_ft):
                    plt.annotate(label, (embeddings_2d_gte_ft[i, 0], embeddings_2d_gte_ft[i, 1]),
                                 textcoords="offset points", xytext=(5,5), ha='center', fontsize=max(7, 10 - num_labels_gte_ft // 6))
                plt.title(f'GTE ({GTE_MODEL_NAME_PRETRAINED}) Embeddings (Fine-tuned on Lumina Codex) - PCA', fontsize=16)
                plt.xlabel('PCA Component 1')
                plt.ylabel('PCA Component 2')
                plt.grid(True)
                plt.tight_layout()
                plt.show()
            else:
                print("Not enough sentence embeddings generated from fine-tuned GTE for analysis.")
        else:
            print("Fine-tuned GTE model was not properly reconstructed or no sentences to analyze.")
            
    except Exception as e:
        print(f"An error occurred during fine-tuned GTE model evaluation: {e}")
        gte_model_finetuned_full = None # Ensure it's None if any part of reconstruction/use fails
else:
    # Conditional print statements for missing components for this specific reconstruction block
    if 'gte_model_finetuned_base' not in locals() or gte_model_finetuned_base is None:
        print("Fine-tuned base GTE model ('gte_model_finetuned_base') is not available.")
    if 'gte_model_pretrained' not in locals() or gte_model_pretrained is None:
        print("Original pre-trained GTE model ('gte_model_pretrained') is not available.")
    if 'gte_mlm_tokenizer' not in locals() or gte_mlm_tokenizer is None:
        print("MLM Tokenizer ('gte_mlm_tokenizer') used for GTE fine-tuning is not available.")
    print("Skipping evaluation of fine-tuned GTE model.")
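
The heatmaps above are compared by eye. A numeric summary of how the pairwise similarities moved can make that comparison less subjective. The following is a minimal sketch, not part of the original notebook: `similarity_shift` is a hypothetical helper, and the random unit vectors are toy data standing in for the real embeddings. In the notebook you would pass the pre-trained similarity matrix (its variable name is not shown in this section) and `similarity_matrix_gte_ft`, along with the shared labels.

```python
import numpy as np


def similarity_shift(sim_before, sim_after, labels, top_k=3):
    """Summarize how pairwise similarities changed between two models.

    Returns the mean absolute change over all sentence pairs and the
    top_k pairs with the largest change (label_a, label_b, signed delta).
    """
    sim_before = np.asarray(sim_before, dtype=float)
    sim_after = np.asarray(sim_after, dtype=float)
    if sim_before.shape != sim_after.shape:
        raise ValueError("Both similarity matrices must have the same shape.")

    delta = sim_after - sim_before
    # Each unordered sentence pair appears once in the strict upper triangle.
    rows, cols = np.triu_indices_from(delta, k=1)
    pair_deltas = delta[rows, cols]

    # Indices of the pairs with the largest absolute change, largest first.
    order = np.argsort(-np.abs(pair_deltas))[:top_k]
    top_pairs = [
        (labels[rows[i]], labels[cols[i]], float(pair_deltas[i])) for i in order
    ]
    return float(np.mean(np.abs(pair_deltas))), top_pairs


# Toy data (assumption): unit-length rows standing in for real embeddings.
rng = np.random.default_rng(0)
emb_before = rng.normal(size=(4, 8))
emb_before /= np.linalg.norm(emb_before, axis=1, keepdims=True)
emb_after = emb_before + 0.05 * rng.normal(size=emb_before.shape)
emb_after /= np.linalg.norm(emb_after, axis=1, keepdims=True)

labels = ["S1", "S2", "S3", "S4"]
mean_abs_change, top_pairs = similarity_shift(
    emb_before @ emb_before.T, emb_after @ emb_after.T, labels
)
print(f"Mean absolute change in pairwise similarity: {mean_abs_change:.4f}")
for label_a, label_b, change in top_pairs:
    print(f"{label_a} vs {label_b}: {change:+.4f}")
```

A small mean change with a few large individual shifts would suggest that MLM fine-tuning moved only specific sentence pairs, which are then worth inspecting in the heatmaps.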

## 7. Discussion & Conclusion for GTE Stage (Lumina Codex)

In this notebook, we explored `BAAI/bge-base-en-v1.5`, a prominent General Text Embedding (GTE) model, using sentences from our "Lumina Codex" corpus. Our primary investigation involved:
1.  Analyzing sentence embeddings generated by the **pre-trained GTE model**.
2.  Conducting an **experimental further fine-tuning** of its underlying Transformer base via Masked Language Modeling (MLM) on the "Lumina Codex" to observe domain adaptation.
3.  Comparing the results before and after this experimental fine-tuning across semantic similarity, sentence-level contextual understanding, vocabulary handling, clustering quality (PCA), and the model's overall capabilities.

**Key Insights from GTE (`BAAI/bge-base-en-v1.5`) Experiments:**

1.  **Semantic Similarity & Sentence-Level Contextual Understanding:**
    * **Pre-trained GTE:** The heatmap of cosine similarities ([see Figure GTE-PT-Heatmap - your `gte_pretrained_heatmap.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_LINKEDIN_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) from the pre-trained `bge-base-en-v1.5` demonstrated its exceptional ability to capture nuanced semantic relationships between sentences from the "Lumina Codex."
        *(Your interpretation here: "For example, sentences describing [similar thematic elements, e.g., different aspects of 'Sylvan' society or descriptions of 'Aqua's ecosystem'] likely showed consistently high similarity scores (e.g., [your_value_1] and [your_value_2]). Conversely, sentences with clearly distinct topics, such as one detailing 'Ignis's fiery nature' versus another discussing 'ancient Egyptian cosmology in the codex,' would exhibit appropriately low similarity scores (e.g., [your_value_3]). This indicates the model's robust grasp of overall sentence meaning, crucial for contextual understanding.")* Because `normalize_embeddings=True` was used during encoding, as recommended for BGE, every embedding has unit length, so the dot products between embeddings are exactly their cosine similarities.
    * **Experimentally Fine-tuned GTE (MLM):** The corresponding heatmap after MLM fine-tuning ([see Figure GTE-FT-Heatmap - your `gte_finetuned_heatmap.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_LINKEDIN_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) provides a comparison point.
        *(Your interpretation here: "The heatmap from the MLM-adapted GTE model might reveal [subtle shifts / minimal changes / specific improvements] in how it scores the similarity of 'Lumina Codex' sentences. For instance, the similarity between two sentences discussing 'cryovolcanism on Lethe' might now be [new_value], potentially [higher/lower/similar] compared to the pre-trained version. This reflects how MLM fine-tuning, by exposing the model more to domain-specific vocabulary and syntax, can subtly alter its perception of semantic closeness within that domain, even if the general high quality of BGE means changes are not drastic.")*

2.  **Clustering Quality (PCA of Sentences):**
    * **Pre-trained GTE:** The PCA plot ([see Figure GTE-PT-PCA - your `gte_pretrained_pca.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_LINKEDIN_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) of sentence embeddings from the pre-trained BGE model is expected to showcase strong thematic clustering.
        *(Your interpretation: "Sentences from the 'Lumina Codex' related to [e.g., 'planetary environments,' 'life forms,' 'scientific discovery,' or 'ancient codex references'] likely formed relatively distinct and coherent groups in the 2D space. This visual evidence supports BGE's reputation for creating well-organized semantic spaces.")*
    * **Experimentally Fine-tuned GTE (MLM):** The PCA plot for the MLM-adapted model ([see Figure GTE-FT-PCA - your `gte_finetuned_pca.png`](https://www.linkedin.com/feed/update/urn:li:activity:YOUR_LINKEDIN_ACTIVITY_ID_WHERE_YOU_POST_IMAGE/)) should be compared.
        *(Your interpretation: "After MLM fine-tuning, the PCA plot might show [e.g., 'even tighter clustering for concepts very unique to the Lumina Codex,' or 'a slight re-arrangement of clusters as the model adapted to the corpus's specific linguistic patterns.']. This helps visualize the domain adaptation at a macro level.")*

3.  **Vocabulary Handling (Subword Tokenization):**
    * BGE, like BERT and SBERT models, is built on a Transformer architecture that uses subword tokenization (for `bge-base-en-v1.5`, the WordPiece scheme inherited from BERT). This lets it process the diverse and specialized vocabulary of the "Lumina Codex," including fictional names (e.g., "Sylvans," "Aquarelles," "Nyxara"), by breaking any word that is not in its vocabulary into known subword units. This largely avoids the "Out-of-Vocabulary" problem, although the meaning of such a name still has to be inferred from its pieces and the surrounding context.

4.  **"Fine-tuning Capability" (Interpreted as Leveraging SOTA Pre-training & Observing Domain Adaptation):**
    * `BAAI/bge-base-en-v1.5` is a pre-trained model whose strong out-of-the-box performance results from extensive training on massive, diverse datasets with objectives such as contrastive learning. Using it as-is therefore means leveraging fine-tuning work the community has already done, which is the sense of "fine-tuning capability" intended here.
    * Our *experimental* MLM fine-tuning on the "Lumina Codex" demonstrated that the underlying Transformer's weights can indeed be further adapted to a specific domain. The reduction in MLM loss during this process (if observed) and any subtle shifts in the subsequent embedding space (seen in heatmaps/PCA) are evidence of this adaptation. While this MLM approach might not be the optimal way to enhance BGE for *sentence similarity* (which would typically involve sentence-pair objectives), it illustrates the model's capacity to learn from new domain-specific text.
    * **Instruction Sensitivity (A Key GTE Characteristic):** It's crucial to note that GTE models like BGE often achieve their peak performance on specific tasks (especially asymmetric retrieval like query-document search) by utilizing **instruction prefixes** for queries (e.g., "Represent this sentence for searching relevant passages:"). While we embedded our "Lumina Codex" sentences directly as passages (where BGE typically doesn't require a prefix for indexing), this inherent design for instruction-sensitivity is a hallmark of many GTEs and distinguishes them further from earlier SBERT models.
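
The subword splitting described under Vocabulary Handling can be illustrated with a toy greedy longest-match tokenizer in the WordPiece style. This is only a sketch under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical, the vocabulary is invented for the example, and the pieces shown are not the ones the real BGE tokenizer produces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word maps to unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption); the real vocabulary is far larger.
toy_vocab = {"sy", "##lva", "##ns", "aqua", "##rell", "##es", "planet"}

for word in ["sylvans", "aquarelles", "planet", "nyxara"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

In this toy vocabulary "nyxara" falls back to `[UNK]` because no piece matches. A real WordPiece vocabulary also contains single characters, so names like this are normally split into several small pieces instead. To see the actual pieces, you could call `gte_mlm_tokenizer.tokenize("Nyxara")` on the tokenizer loaded earlier in the notebook.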

**GTE Models in the "EmbedEvolution" Context – A Step Up from SBERT & BERT:**

General Text Embedding models like `BAAI/bge-base-en-v1.5` represent a significant advancement in the pursuit of highly versatile and powerful text embeddings.

* **Compared to Vanilla BERT:** GTEs (like SBERTs) are directly optimized to produce sentence-level embeddings that are meaningful for similarity comparison, a task for which vanilla BERT's pooled outputs are often suboptimal without specific fine-tuning.
* **Compared to earlier Sentence Transformers (e.g., `all-MiniLM-L6-v2`):** GTE models like BGE are typically more recent, often trained on even larger and more diverse datasets with more refined training objectives (like advanced contrastive learning techniques). This generally leads to superior performance across a broader range of text embedding benchmarks (as seen on leaderboards like MTEB). They often aim for a higher degree of "general-purpose" excellence. The subtle introduction of instruction-sensitivity for certain tasks (like retrieval queries in BGE) also marks a step towards more controllable embeddings.
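
To make the instruction-sensitivity point concrete, here is a minimal sketch of asymmetric retrieval in which only the queries receive the instruction prefix. The helper names `with_query_instruction` and `rank_passages` are hypothetical, the trailing space after the colon is an assumption, and `model` is assumed to be a `SentenceTransformer` such as `gte_model_pretrained`. The ranking function is defined but not called here, so the block runs without loading a model.

```python
import numpy as np

# Instruction text as quoted in this notebook; the trailing space is an assumption.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def with_query_instruction(queries, instruction=BGE_QUERY_INSTRUCTION):
    """Prepend the retrieval instruction to each query (passages stay unprefixed)."""
    return [instruction + query for query in queries]


def rank_passages(model, queries, passages):
    """Return, for each query, passage indices ordered from most to least similar.

    `model` is assumed to expose SentenceTransformer-style
    `encode(texts, normalize_embeddings=True)`.
    """
    query_emb = np.asarray(
        model.encode(with_query_instruction(queries), normalize_embeddings=True)
    )
    passage_emb = np.asarray(model.encode(passages, normalize_embeddings=True))
    # With unit-length embeddings the dot product is the cosine similarity.
    scores = query_emb @ passage_emb.T
    return np.argsort(-scores, axis=1)


example_queries = ["Which planet has cryovolcanism?"]
prefixed_queries = with_query_instruction(example_queries)
print(prefixed_queries[0])
```

In the notebook, `rank_passages(gte_model_pretrained, example_queries, selected_sentences_for_gte)` would rank the selected "Lumina Codex" sentences against the query. Running it with and without the prefix is one way to observe how much the instruction matters for this corpus.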

**Strengths of GTEs like BGE:**
* State-of-the-art or near state-of-the-art performance for general text embedding tasks.
* Robust handling of diverse text and vocabulary.
* Excellent for semantic search, clustering, RAG systems, and as features for downstream tasks.

**Next Steps:**

The instruction-prefix capability noted in some GTE models (like BGE for queries) serves as a perfect bridge to our final stage in "EmbedEvolution": explicitly **Instruction-Aware Embedding Models** (such as `nomic-ai/nomic-embed-text-v1.5` or the Instructor series). These models make task-specific guidance a central design feature, aiming to produce embeddings even more precisely tailored to the user's expressed intent for a given task.