# EmbedEvolution Stage 7: Instruction-Aware Embeddings with `intfloat/e5-mistral-7b-instruct`

Welcome to the final stage of EmbedEvolution! We explore **Instruction-Aware Embedding Models**, using `intfloat/e5-mistral-7b-instruct`. These models generate embeddings tailored to a specific task by following a natural language instruction prepended to the input text, so a single model can serve retrieval, classification, or similarity tasks without retraining.

We will:
1. Use the pre-trained `intfloat/e5-mistral-7b-instruct` model to demonstrate its core instruction-following capability.
2. Analyze its sentence embeddings (using a default instruction) from the "Lumina Codex" for semantic similarity and clustering.
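3. Set up an *experimental* further fine-tuning of its underlying Transformer base using Masked Language Modeling (MLM) on the "Lumina Codex" to illustrate domain adaptation (the training loop itself is skipped for resource reasons).
4. Show how embeddings would be compared before and after this experimental fine-tuning, using the pre-trained base as a stand-in.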
5. Discuss its characteristics regarding vocabulary, contextual understanding (via instructions), and its nature as an advanced instruction-tuned model.

## 1. Setup and Imports
**Note:** `intfloat/e5-mistral-7b-instruct` is a very large model (about 7 billion parameters).
- Ensure you have **significant RAM (e.g., 32GB+, ideally 64GB for fine-tuning)** and **VRAM (e.g., 16GB+ for inference; fine-tuning realistically calls for a high-memory data-center GPU such as an A100 or H100)**.
- A powerful CUDA GPU is **essential** for reasonable performance, especially for fine-tuning. MPS on Apple Silicon (M1/M2/M3) will likely be extremely slow or run out of memory when fine-tuning this model, and CPU execution will be prohibitively slow.
- If resources are limited, consider a smaller instruction-tuned model such as `hkunlp/instructor-base` or `hkunlp/instructor-large`. This notebook proceeds with `e5-mistral-7b-instruct`, with strong caveats.
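
To make the memory figures above concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal 7e9 parameters and the common rule of thumb of about 16 bytes per parameter for mixed-precision training with Adam; activations, batch size, and framework overhead are ignored, so treat the numbers as lower bounds rather than measurements.

```python
# Back-of-the-envelope memory estimate; all figures are approximations.
N_PARAMS = 7e9  # nominal "7B"; the exact parameter count differs slightly

BYTES_PER_PARAM = {
    "float32 weights (inference)": 4,
    "float16/bfloat16 weights (inference)": 2,
    # Rule of thumb for mixed-precision training with Adam:
    # fp16 weights (2) + fp16 grads (2) + fp32 master weights (4) + Adam moments (4 + 4)
    "full fine-tuning with Adam (mixed precision)": 16,
}

def estimate_gb(n_params, bytes_per_param):
    """Return memory in gigabytes (1 GB = 1e9 bytes), excluding activations."""
    return n_params * bytes_per_param / 1e9

memory_estimates = {name: estimate_gb(N_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in memory_estimates.items():
    print(f"{name:<46s} ~{gb:5.0f} GB")
```

Under these assumptions, the weights alone fit on a 16GB+ card only in half precision, and full fine-tuning lands well beyond a single consumer GPU.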

In [None]:
import torch
from sentence_transformers import SentenceTransformer, models
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from torch.utils.data import Dataset, DataLoader  # DataLoader is used in the MLM data preparation cell
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk # For sentence tokenization
from tqdm.auto import tqdm
import os
import shutil # For cleaning up saved model directory

# Device selection
if torch.cuda.is_available():
    device_str = "cuda"
    print("CUDA is available. Using GPU (CUDA).")
    # Consider torch.cuda.empty_cache() if memory issues persist
elif torch.backends.mps.is_available():
    device_str = "mps"
    print("MPS is available. Using Apple Silicon GPU (MPS).")
    print("WARNING: Fine-tuning a 7B model on MPS will be extremely slow and may lead to memory issues.")
else:
    device_str = "cpu"
    print("CUDA and MPS not available. Using CPU.")
    print("WARNING: Using CPU for intfloat/e5-mistral-7b-instruct (especially fine-tuning) will be prohibitively slow and require very high RAM.")
device = torch.device(device_str)
print(f"Selected device: {device}")

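# Download the NLTK sentence tokenizer data if it is missing
try:
    _ = nltk.sent_tokenize("Test sentence.")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead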

## 2. Define the Corpus: The "Lumina Codex"
We'll use our detailed "Lumina Codex and the Solara System" text.

In [None]:
text = """
The Lumina Codex and the Solara System: A Tapestry of Ancient Wisdom and Cosmic Discovery
In the shadowed halls of the Cairo Museum, a dusty papyrus scroll, cataloged as Papyrus K-37b from the Middle Kingdom, lay forgotten for centuries. Dubbed the Lumina Codex by its discoverers, this fragile relic was initially dismissed as a mythological curiosity, its cryptic hieroglyphs and star charts interpreted as poetic musings of a priestly scribe. Yet, in 2024, a team of linguists and astronomers, led by Dr. Amara Nassar, deciphered its veiled verses, revealing an astonishing truth: the codex described a distant star system with uncanny precision, orbiting a radiant G-type star named the Star of Eternal Radiance—now known as Lumina. This revelation sparked a scientific odyssey, merging ancient Egyptian cosmology with cutting-edge astronomy, as the Solara System emerged from the Nebula Cygnus-X1, nestled in the Orion Spur of the Milky Way Galaxy.

The Lumina Codex spoke of Lumina and its ten celestial attendants, organized into poetic regions: the searing Forges of Ra for the inner worlds, the verdant Blessed Belt of Osiris for the habitable zone, the majestic Domains of the Sky Titans for gas giants, and the enigmatic Frozen Outlands for the outer realms. Its star charts, etched with meticulous care, hinted at a cosmic map, with references to the Rivers of Stars—likely the Milky Way—and the Celestial Gardens, evoking the Local Group within the Virgo Supercluster. The codex’s verses, such as “Ten jewels dance in the embrace of the Eternal Radiance, their faces veiled in fire, water, and ice,” seemed to prefigure a system now confirmed by the Cygnus-X1 Deep Sky Array, a fictional next-generation telescope orbiting beyond Earth’s atmosphere.

Discovery and Modern Corroboration
The Solara System’s discovery began in 2023, when the Cygnus-X1 Deep Sky Array detected subtle wobbles in Lumina’s light, indicating a complex system of orbiting bodies. Located 1,200 light-years away in the Nebula Cygnus-X1, Lumina is a stable, middle-aged G-type star, slightly larger than the Sun, with a luminosity that sustains a diverse array of worlds. As astronomers analyzed the data, they identified ten planets, each with unique characteristics that eerily echoed the Lumina Codex. The parallels were undeniable: the codex’s Forges of Ra matched the inner rocky planets, while the Blessed Belt of Osiris aligned with two habitable worlds teeming with life. The Domains of the Sky Titans and Frozen Outlands described gas giants and icy dwarfs with striking accuracy. The scientific community buzzed with excitement, as linguists and astronomers collaborated to decode the codex’s metaphors, revealing a blend of ancient intuition and cosmic truth.

The Solara System: A Celestial Menagerie
Lumina: The Star of Eternal Radiance
Lumina, a G2V star, radiates a warm, golden light, its stable fusion cycle supporting a system spanning 12 astronomical units. Its magnetic fields are unusually calm, suggesting a long lifespan conducive to life’s evolution. The codex describes Lumina as “the hearth of eternity, whose breath kindles the dance of worlds,” a poetic nod to its life-giving energy.

The Forges of Ra: Inner Planets
1- Ignis: The closest planet to Lumina, Ignis is a scorched, iron-rich world with a molten surface pocked by ancient impact craters. Its thin atmosphere, rich in sulfur dioxide, glows faintly under Lumina’s intense radiation. The codex calls it “Ra’s Anvil, where molten rivers forge the bones of the cosmos,” reflecting its volcanic past and metallic crust.
2- Ferrus: Slightly larger, Ferrus is a rocky planet with vast plains of oxidized iron, giving it a crimson hue. Its surface bears scars of past tectonic activity, with towering cliffs and deep chasms. The codex names it “the Forge of Hephaestus’s Twin,” hinting at its metallic wealth, now confirmed by spectroscopic analysis revealing nickel and cobalt deposits.
The Blessed Belt of Osiris: Habitable Zone
1- Aqua: A breathtaking ocean world, Aqua is enveloped in turquoise clouds of water vapor and nitrogen. Its surface is 90% liquid water, with archipelagos of coral-like structures hosting complex aquatic ecosystems. Bioluminescent Aquarelles, jellyfish-like creatures with crystalline tentacles, drift in vast schools, their light pulses synchronizing in rhythmic displays. Predatory Thalacynths, eel-like organisms with electromagnetic sensors, hunt in the deep trenches. Aqua’s moon, Thalassa, is an ice-covered world with a subglacial ocean, where astrobiologists hypothesize microbial extremophiles thrive in hydrothermal vents, metabolizing sulfur compounds. The codex describes Aqua as “Osiris’s Chalice, where life swims in the tears of the gods,” and Thalassa as “the frozen veil hiding the spark of creation.”
2- Veridia: A super-Earth, Veridia boasts lush continents of bioluminescent flora, such as Luminara trees, which pulse with green and violet light, and Crystalferns, whose fractal leaves refract Lumina’s rays into dazzling spectra. Veridia is home to the Sylvans, sentient, silicon-based life forms resembling ambulatory crystal shrubs. Their bodies, composed of lattice-like structures, shimmer with bioluminescent patterns used for communication. Sylvan society is decentralized, with “groves” of individuals linked via light-based signals, forming a collective consciousness deeply attuned to Veridia’s ecosystem. Their architecture, grown from crystalline minerals, integrates seamlessly with the landscape. The codex calls Veridia “the Garden of Osiris’s Breath,” where “the shining ones weave light into wisdom.”
The Domains of the Sky Titans: Gas Giants
1- Zephyrus: A massive hydrogen-helium gas giant, Zephyrus dominates the system with its radiant ring system, composed of ice and silicate particles. Its atmosphere swirls with golden storms, driven by intense winds. Among its 47 moons, Io-Prime stands out, a volcanically active world spewing sulfur plumes, likely powered by tidal heating. The codex names Zephyrus “the Sky Titan’s Crown,” its rings “the jeweled girdle of the heavens.”
2- Boreas: An ice giant with a deep blue methane atmosphere, Boreas exhibits retrograde rotation and an asymmetrical magnetic field, creating auroras that dance across its poles. Its 22 moons include Erynnis, a rocky moon with methane lakes. The codex describes Boreas as “the Frost Titan, whose breath chills the void,” capturing its icy majesty.
The Frozen Outlands: Outer Planets
1- Umbriel: A dwarf planet with a charcoal-dark surface, Umbriel’s icy crust is fractured by ancient impacts. Its moon Nyx, a captured object, is rich in organic compounds, hinting at prebiotic chemistry. The codex calls Umbriel “the Shadowed Outcast, guarded by the dark sentinel.”
2- Erebus: An icy world with a nitrogen-methane atmosphere, Erebus has a highly elliptical orbit, suggesting a captured origin. Its surface sparkles with frost-covered ridges. The codex names it “the Silent Wanderer, cloaked in eternal frost.”
3- Aetheria: The outermost planet, Aetheria is a rogue dwarf with a thin atmosphere of neon and argon. Its moon Lethe exhibits cryovolcanism, spewing ammonia-water mixtures. Astrobiologists speculate that Lethe’s subsurface ocean may harbor microbial life, analogous to Thalassa’s. The codex describes Aetheria as “the Veiled Wanderer, whose dreams freeze in the outer dark,” and Lethe as “the weeping mirror of the cosmos.”
4- Nyxara: A small, icy body with a chaotic orbit, Nyxara’s surface is a mosaic of frozen nitrogen and carbon monoxide. The codex calls it “the Lost Jewel, dancing beyond the Titans’ gaze.”
Life in the Solara System
Aqua’s aquatic ecosystems are a marvel, with Aquarelles forming symbiotic networks with coral-like Hydroskeletons, which filter nutrients from the water. Thalacynths use electromagnetic pulses to stun prey, suggesting an evolutionary arms race. On Thalassa, microbial life is hypothesized based on chemical signatures of sulfur and methane in its subglacial ocean, though no direct evidence exists yet.

Veridia’s Sylvans are the system’s crown jewel. Their crystalline bodies, averaging two meters tall, refract light into complex patterns, encoding emotions, ideas, and memories. Their society operates as a “luminous collective,” with no central authority; decisions emerge from synchronized light displays across groves. Sylvan technology manipulates crystalline minerals to create tools and habitats, all in harmony with Veridia’s ecosystem. Their discovery has sparked intense study by linguists decoding their light-based language, revealing a philosophy centered on balance and interconnectedness.

On Lethe, cryovolcanic activity suggests a subsurface ocean with potential microbial ecosystems, possibly metabolizing ammonia. Unlike Aqua’s confirmed complex life and Veridia’s sentient Sylvans, life on Thalassa and Lethe remains speculative, driving astrobiological research.

Galactic Context
The Solara System resides in the Orion Spur, a minor arm of the Milky Way, part of the Local Group within the Virgo Supercluster. The codex’s Rivers of Stars evoke the Milky Way’s spiral arms, while the Celestial Gardens suggest a poetic grasp of the Local Group’s galactic cluster. This cosmic placement underscores Solara’s significance as a microcosm of the universe’s diversity.

Ongoing Exploration
Scientific teams, including astrobiologists, geologists, and linguists, are studying Solara via the Cygnus-X1 Deep Sky Array and planned probes, such as the Lumina Pathfinder Mission. Challenges include the 1,200-light-year distance, requiring advanced telemetry for data transmission. Sylvan communication poses a unique hurdle, as their light patterns defy traditional linguistic models. Future missions aim to deploy orbiters around Aqua, Veridia, and Lethe to confirm microbial life and study Sylvan culture.

A Cosmic Tapestry
The Solara System, unveiled through the Lumina Codex and modern astronomy, blends ancient wisdom with scientific discovery. Its worlds—from the fiery Forges of Ra to the icy Frozen Outlands—offer a rich tapestry of environments, life forms, and mysteries. As scientists probe this distant system, the codex’s poetic verses resonate, reminding humanity that the cosmos has long whispered its secrets, awaiting those bold enough to listen.
"""
def clean_text_for_embedding_models(input_text):
    input_text = re.sub(r'\s+', ' ', input_text).strip()
    return input_text

cleaned_full_text_instruct_corpus = clean_text_for_embedding_models(text)
sentences_from_corpus_instruct = nltk.sent_tokenize(text)
cleaned_sentences_for_instruct_analysis = [clean_text_for_embedding_models(s) for s in sentences_from_corpus_instruct if s.strip()]

print(f"Found {len(cleaned_sentences_for_instruct_analysis)} sentences for Instruction-Aware model analysis.")

## 3. Load Pre-trained Instruction-Aware Model (`intfloat/e5-mistral-7b-instruct`)
This model is designed to follow natural language instructions prepended to the text.
Mistral is natively supported by recent `transformers` releases, so `trust_remote_code=True` should not be required here; it is passed below as a harmless fallback, since it only has an effect when a model repository ships custom modelling code.

In [None]:
INSTRUCT_MODEL_NAME = 'intfloat/e5-mistral-7b-instruct'
instructor_model_pretrained = None # Initialize

try:
    # trust_remote_code=True only matters when a model repo ships custom modelling code.
    # Mistral is natively supported in recent transformers releases, so it should not be
    # required here; it is kept as a harmless fallback.
    instructor_model_pretrained = SentenceTransformer(INSTRUCT_MODEL_NAME, device=device_str, trust_remote_code=True)
    print(f"Successfully loaded PRE-TRAINED Instruction-Aware model: '{INSTRUCT_MODEL_NAME}'")
except Exception as e:
    print(f"Error loading PRE-TRAINED Instruction-Aware model: {e}")
    print("This model is very large. Ensure sufficient VRAM/RAM and correct installation environment.")

## 4. Part 1: Analysis with Pre-trained `intfloat/e5-mistral-7b-instruct`

### 4.1 Demonstrating Instruction-Awareness
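We take a sample sentence and embed it with different instructions/prefixes.
Earlier E5 models (such as `intfloat/e5-base-v2`) use the short prefixes "query: " and "passage: " for retrieval.
Per its model card, `e5-mistral-7b-instruct` instead expects a one-sentence task description on the query side, formatted as `Instruct: {task_description}\nQuery: {text}`, while documents are embedded without any instruction. We include both styles to see how much the prefix moves the embedding.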
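
As a minimal sketch of the query format described on the model card, the helper below builds an instruction-prefixed query string. The task description, query, and document are hypothetical examples written for this notebook, and the helper only formats text; it does not call the model.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Format a query in the 'Instruct: ...\\nQuery: ...' style from the model card."""
    return f"Instruct: {task_description}\nQuery: {query}"

# Hypothetical task description and query, for illustration only.
task = "Given a question about a star system, retrieve passages that answer it"
queries = [get_detailed_instruct(task, "Which planets in the Solara System host life?")]

# Per the model card, documents are embedded as-is, without an instruction.
documents = ["Aqua is an ocean world whose surface is 90% liquid water."]

print(queries[0])
```

The resulting strings can be passed to `encode()` exactly like the prefixed sentences in the cell below.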

In [None]:
if instructor_model_pretrained and cleaned_sentences_for_instruct_analysis:
    sample_sentence_idx_instruct = min(20, len(cleaned_sentences_for_instruct_analysis) - 1) # Pick a descriptive sentence
    if sample_sentence_idx_instruct >=0 :
        sample_sentence_instruct = cleaned_sentences_for_instruct_analysis[sample_sentence_idx_instruct]
        print(f"Sample Sentence for Instruction Test: '{sample_sentence_instruct}'")

        # Define instructions. Per the model card, e5-mistral-7b-instruct expects queries as
        # "Instruct: {task_description}\nQuery: {text}" and documents with no prefix.
        # The "query: " / "passage: " prefixes come from earlier E5 models and are kept for contrast.
        instructions_to_test = {
            "as_passage": "passage: ", # Earlier-E5-style document prefix
            "as_query_for_discovery": "query: What new discoveries were made in the Solara system? ", # Earlier-E5-style query
            "for_classification_topic": "Instruct: Classify the main topic of this text.\nQuery: ",
            "for_similarity_check": "Instruct: Represent this sentence to find semantically similar sentences.\nQuery: ",
            "no_instruction": "" # Baseline; also how this model embeds documents
        }

        print(f"\n--- Embeddings for Sample Sentence with Different Instructions ---")
        sample_embeddings_by_instruction = {}
        for key, instruction_prefix in instructions_to_test.items():
            text_to_embed = instruction_prefix + sample_sentence_instruct
            try:
                embedding = instructor_model_pretrained.encode(
                    text_to_embed, convert_to_numpy=True, normalize_embeddings=True # E5 models usually require normalization
                )
                sample_embeddings_by_instruction[key] = embedding
                print(f"Embedding for '{key}' (first 3 dims): {embedding[:3]}")
            except Exception as e:
                print(f"Error encoding with instruction '{key}': {e}")
                sample_embeddings_by_instruction[key] = None
        
        # Compare some embeddings
        if sample_embeddings_by_instruction.get("as_passage") is not None and \
           sample_embeddings_by_instruction.get("as_query_for_discovery") is not None:
            sim_passage_query = cosine_similarity(
                sample_embeddings_by_instruction["as_passage"].reshape(1, -1),
                sample_embeddings_by_instruction["as_query_for_discovery"].reshape(1, -1)
            )[0][0]
            print(f"\nSim. of SAME sentence with 'passage:' vs 'query: What new discoveries...': {sim_passage_query:.4f}")
        
        # PCA of these instruction-varied embeddings for one sentence
        plot_labels_one_sent_instruct = [k for k,v in sample_embeddings_by_instruction.items() if v is not None]
        embeddings_to_plot_one_sent_instruct = [sample_embeddings_by_instruction[k] for k in plot_labels_one_sent_instruct]

        if len(embeddings_to_plot_one_sent_instruct) >= 2:
            pca_one_sent_instruct = PCA(n_components=2)
            embeddings_2d_one_sent_instruct = pca_one_sent_instruct.fit_transform(np.array(embeddings_to_plot_one_sent_instruct))
            plt.figure(figsize=(12, 8))
            plt.scatter(embeddings_2d_one_sent_instruct[:, 0], embeddings_2d_one_sent_instruct[:, 1], s=120)
            for i, label in enumerate(plot_labels_one_sent_instruct):
                plt.annotate(label, (embeddings_2d_one_sent_instruct[i, 0], embeddings_2d_one_sent_instruct[i, 1]),
                             textcoords="offset points", xytext=(5,5), ha='center', fontsize=9)
            plt.title(f'PCA of One Sentence with Different Instructions ({INSTRUCT_MODEL_NAME})', fontsize=14)
            plt.xlabel('PCA Comp 1'); plt.ylabel('PCA Comp 2'); plt.grid(True); plt.tight_layout(); plt.show()
    else:
        print("Not enough sentences in corpus to run prefix test.")
else:
    print("Instruction-Aware model not loaded or no sentences. Skipping instruction impact demo.")
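
The cell above compares only the `passage:` and `query:` variants. A small sketch of how all instruction variants could be compared at once follows; it uses random placeholder vectors instead of real model output, and `pairwise_cosine` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def pairwise_cosine(embeddings_by_key):
    """Return (keys, matrix) of cosine similarities, skipping entries that are None."""
    keys = [k for k, v in embeddings_by_key.items() if v is not None]
    matrix = np.stack([np.asarray(embeddings_by_key[k], dtype=np.float64) for k in keys])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # L2-normalize rows
    return keys, matrix @ matrix.T

# Placeholder vectors standing in for sample_embeddings_by_instruction.
rng = np.random.default_rng(42)
placeholder_embeddings = {
    "as_passage": rng.normal(size=8),
    "for_similarity_check": rng.normal(size=8),
    "failed_variant": None,  # mimics an instruction whose encoding raised an error
}

variant_keys, variant_similarity = pairwise_cosine(placeholder_embeddings)
for key, row in zip(variant_keys, variant_similarity):
    print(f"{key:<24s}", np.round(row, 3))
```

With the real dictionary from the cell above, low off-diagonal values would indicate that the instruction moves the embedding substantially.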

### 4.2 Define Sentences for Broader Analysis & Generate Embeddings (Default Instruction)
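For consistent analysis across sentences, we prepend the same "passage: " prefix to every sentence. This prefix is the document convention of earlier E5 models; per its model card, `e5-mistral-7b-instruct` embeds documents without any instruction, so an empty prefix is an equally valid default.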

In [None]:
selected_sentences_for_instruct_analysis = []
selected_sentence_labels_instruct_analysis = []
# Sentence selection: for brevity we take the first 15 sentences.
# Replace this with the keyword-based selection from the GTE/SBERT notebooks for a more varied sample.
if cleaned_sentences_for_instruct_analysis:
    selected_sentences_for_instruct_analysis = cleaned_sentences_for_instruct_analysis[:15]
    selected_sentence_labels_instruct_analysis = [f"S{i+1}_{s[:10].replace(' ','_')}" for i,s in enumerate(selected_sentences_for_instruct_analysis)]
    print(f"\nSelected {len(selected_sentences_for_instruct_analysis)} sentences for broader analysis.")
else:
    print("No cleaned sentences available for broader analysis.")

# Default instruction for embedding passages
default_instruction_e5 = "passage: "
sentences_with_default_e5_instruction = [default_instruction_e5 + s for s in selected_sentences_for_instruct_analysis]

sentence_embeddings_instruct_pt = None # pt for pre-trained
if instructor_model_pretrained and sentences_with_default_e5_instruction:
    print(f"\nGenerating Instruction-Aware embeddings (pre-trained) for {len(sentences_with_default_e5_instruction)} sentences using instruction: '{default_instruction_e5.strip()}'...")
    sentence_embeddings_instruct_pt = instructor_model_pretrained.encode(
        sentences_with_default_e5_instruction, convert_to_numpy=True, normalize_embeddings=True
    )
    print(f"Generated Instruction-Aware (pre-trained) embeddings. Shape: {sentence_embeddings_instruct_pt.shape if sentence_embeddings_instruct_pt is not None else 'None'}")
else:
    print("Instruction-Aware model or sentences not ready for default instruction embedding.")

### 4.3 Evaluate Pre-trained Instruction-Aware Embeddings (Default Instruction)

In [None]:
if sentence_embeddings_instruct_pt is not None and sentence_embeddings_instruct_pt.shape[0] >= 2:
    embedding_matrix_instruct_pt = sentence_embeddings_instruct_pt
    plot_labels_instruct_pt = selected_sentence_labels_instruct_analysis

    # --- Semantic Similarity (Heatmap) ---
    print("\n--- Cosine Similarity Heatmap (Sentences - Pre-trained Instruction-Aware, default instruction) ---")
    # Embeddings were L2-normalized by encode(), so the dot product equals cosine similarity
    similarity_matrix_instruct_pt = np.dot(embedding_matrix_instruct_pt, embedding_matrix_instruct_pt.T)
    num_lbl_instruct_pt = len(plot_labels_instruct_pt)
    fig_w_instruct_pt = max(12, num_lbl_instruct_pt * 0.6)
    fig_h_instruct_pt = max(10, num_lbl_instruct_pt * 0.45)
    plt.figure(figsize=(fig_w_instruct_pt, fig_h_instruct_pt))
    annot_hm_instruct_pt = num_lbl_instruct_pt < 20
    sns.heatmap(similarity_matrix_instruct_pt, annot=annot_hm_instruct_pt, cmap='magma', fmt=".2f", xticklabels=plot_labels_instruct_pt, yticklabels=plot_labels_instruct_pt, linewidths=.5, cbar_kws={"shrink":.8}, vmin=-1, vmax=1)
    plt.title(f'Instruction-Aware ({INSTRUCT_MODEL_NAME}) Similarity (Instruction: "{default_instruction_e5.strip()}")', fontsize=16)
    plt.xticks(rotation=65, ha='right'); plt.yticks(rotation=0); plt.tight_layout(); plt.show()

    # --- Clustering Quality (PCA) ---
    print("\n--- PCA Visualization (Sentences - Pre-trained Instruction-Aware, default instruction) ---")
    pca_instruct_pt = PCA(n_components=2)
    embeddings_2d_instruct_pt = pca_instruct_pt.fit_transform(embedding_matrix_instruct_pt)
    plt.figure(figsize=(fig_w_instruct_pt*0.9, fig_h_instruct_pt*0.9))
    plt.scatter(embeddings_2d_instruct_pt[:,0], embeddings_2d_instruct_pt[:,1], alpha=0.7, s=60)
    for i, lbl in enumerate(plot_labels_instruct_pt):
        plt.annotate(lbl, (embeddings_2d_instruct_pt[i, 0], embeddings_2d_instruct_pt[i, 1]),
                     textcoords="offset points", xytext=(5, 5), ha='center')
    plt.title(f'Instruction-Aware ({INSTRUCT_MODEL_NAME}) Embeddings (Instruction: "{default_instruction_e5.strip()}") - PCA', fontsize=16)
    plt.xlabel('PCA Comp 1'); plt.ylabel('PCA Comp 2'); plt.grid(True); plt.tight_layout(); plt.show()
else:
    print("Not enough embeddings for pre-trained Instruction-Aware model analysis.")
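
A 2D PCA plot is only as trustworthy as the share of variance the two components capture. The sketch below computes that share with a plain SVD on synthetic points that lie exactly on a plane, so the first two components should explain essentially all of the variance. The function name and data are illustrative assumptions; with scikit-learn, the same quantity is available as `explained_variance_ratio_` on a fitted `PCA` object.

```python
import numpy as np

def explained_variance_ratio(embeddings, n_components=2):
    """Fraction of total variance captured by the top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to per-component variance
    return variances[:n_components] / variances.sum()

# Synthetic data: 15 points on a 2D plane embedded in 6 dimensions (not model output).
rng = np.random.default_rng(0)
plane_basis = rng.normal(size=(2, 6))
plane_coords = rng.normal(size=(15, 2))
planar_points = plane_coords @ plane_basis

ratio = explained_variance_ratio(planar_points)
print("Per-component share:", np.round(ratio, 3), "total:", round(float(ratio.sum()), 3))
```

For real sentence embeddings the total is usually well below 1, in which case distances in the scatter plot should be read with caution.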

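### Interpretation Notes for Pre-trained `intfloat/e5-mistral-7b-instruct` (with `passage: `):
* **Semantic Similarity & Clustering:** How do sentences from "Lumina Codex" relate to each other when embedded for "document" purposes? Are the clusters/similarities intuitive for this task type? Compare with GTE/SBERT results.
* **Vocabulary & Context:** Mistral's subword tokenizer encodes invented names such as "Thalacynths" or "Veridia" as sequences of known pieces, so there are no out-of-vocabulary failures. The embedding now depends on both the sentence and the instruction prefix, so similarities are only comparable between sentences embedded with the same prefix.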
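
To read a similarity heatmap more systematically, it can help to list the highest-scoring sentence pairs. The sketch below does this for a tiny hand-written similarity matrix; `most_similar_pairs` is a hypothetical helper, and the toy values are not model output.

```python
import numpy as np

def most_similar_pairs(similarity, labels, top_k=3):
    """Return the top_k (label_a, label_b, score) pairs from the upper triangle."""
    sim = np.asarray(similarity)
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # skip the diagonal and duplicates
    order = np.argsort(sim[rows, cols])[::-1][:top_k]
    return [(labels[rows[i]], labels[cols[i]], float(sim[rows[i], cols[i]])) for i in order]

toy_similarity = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])
toy_labels = ["S1", "S2", "S3"]

top_pairs = most_similar_pairs(toy_similarity, toy_labels, top_k=2)
for a, b, score in top_pairs:
    print(f"{a} <-> {b}: {score:.2f}")
```

Applied to the similarity matrix and labels from section 4.3, this gives a quick list of pairs to sanity-check against the text.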

## 5. Part 2: Experimental Fine-tuning of `e5-mistral-7b-instruct`'s Base (MLM)

**WARNING: Fine-tuning a 7B parameter model like Mistral is extremely resource-intensive.**
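- Full fine-tuning needs far more memory than inference, because gradients and optimizer state are stored alongside the weights. Expect to need well beyond 24GB of VRAM, typically spread across several high-memory GPUs.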
- It will take a very long time on consumer hardware or even standard cloud GPU instances if not using high-end ones.
- This section is primarily for demonstrating the *concept* of domain adaptation. For practical fine-tuning of such large models, specialized hardware, distributed training setups, and techniques like LoRA/QLoRA are typically used.

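We will proceed with a very small MLM fine-tuning setup for illustrative purposes only. The expectation is to observe whether *any* adaptation to the "Lumina Codex" occurs, not to achieve state-of-the-art results. One caveat: Mistral is a decoder-only, causal language model whose tokenizer has no mask token, so MLM does not apply to it directly; in practice, next-token (causal LM) training would be the natural objective for continued pre-training.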
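
To see why LoRA-style methods are attractive at this scale, the following sketch compares the trainable parameters of one full square projection matrix with those of a low-rank adapter on the same matrix. The hidden size of 4096 matches the model's embedding size; the rank of 8 is an arbitrary illustrative choice, not a recommendation.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

HIDDEN_SIZE = 4096  # embedding size of e5-mistral-7b-instruct
LORA_RANK = 8       # illustrative assumption

full_matrix_params = HIDDEN_SIZE * HIDDEN_SIZE
lora_params = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)

print(f"Full 4096x4096 projection: {full_matrix_params:,} parameters")
print(f"LoRA adapter (rank {LORA_RANK}):     {lora_params:,} parameters")
print(f"Trainable fraction:        {lora_params / full_matrix_params:.4%}")
```

Because only the adapter receives gradients and optimizer state, the training-time memory overhead shrinks roughly in proportion to this fraction, while the frozen base weights still have to fit in memory.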

### 5.1 Prepare Model and Data for MLM Fine-tuning

In [None]:
instruct_mlm_model_to_finetune = None
instruct_mlm_tokenizer = None
instruct_mlm_dataloader = None

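# NOTE: e5-mistral-7b-instruct is reported to be initialised from the base Mistral-7B-v0.1 checkpoint;
# the Instruct variant used here shares the same architecture and tokenizer vocabulary.
BASE_MISTRAL_FOR_E5 = "mistralai/Mistral-7B-Instruct-v0.1"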
try:
    print(f"Attempting to load tokenizer for base model: {BASE_MISTRAL_FOR_E5}")
    instruct_mlm_tokenizer = AutoTokenizer.from_pretrained(BASE_MISTRAL_FOR_E5, trust_remote_code=True)
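    # Mistral's tokenizer has no padding token by default. Reuse EOS as padding instead of
    # adding a new '[PAD]' token: a new token would enlarge the vocabulary, and the model's
    # embedding matrix would then need resizing (resize_token_embeddings) to avoid index errors.
    if instruct_mlm_tokenizer.pad_token is None:
        instruct_mlm_tokenizer.pad_token = instruct_mlm_tokenizer.eos_token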
    print(f"Loaded tokenizer '{BASE_MISTRAL_FOR_E5}' for MLM fine-tuning.")

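    print(f"Attempting to build an MLM model from the base config: {BASE_MISTRAL_FOR_E5}")
    # Load the base model configuration
    config = AutoConfig.from_pretrained(BASE_MISTRAL_FOR_E5, trust_remote_code=True)
    # NOTE: from_config builds randomly initialised weights and never adds a missing head.
    # Mistral is a decoder-only causal LM, and transformers (to our knowledge) registers no
    # masked-LM class for it, so this call is expected to raise ValueError. The attempt is
    # isolated so that a failure does not discard the tokenizer or the stand-in model below.
    # The model is not moved to the device because the training loop is skipped.
    try:
        instruct_mlm_model_to_finetune = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)
    except ValueError as mlm_err:
        instruct_mlm_model_to_finetune = None
        print(f"No masked-LM architecture available for this config: {mlm_err}")
    print("INFO: MLM Fine-tuning setup for a 7B model is resource-intensive and complex.")
    print("For this notebook, we will proceed by conceptually acknowledging this step.")
    print("In a real scenario, one would load the base Mistral model, ensure it has/can have an MLM head, and fine-tune.")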

    if instructor_model_pretrained:
        instruct_model_finetuned_base = instructor_model_pretrained[0].auto_model 
        instruct_model_finetuned_base.to(device) # Ensure it's on device
        print("Using the base of the pre-trained instruction model as a STAND-IN for a fine-tuned base model for demonstration purposes.")
        # In a real scenario, this 'instruct_model_finetuned_base' would be the result of an actual MLM fine-tuning process.
    else:
        instruct_model_finetuned_base = None

except Exception as e:
    print(f"Error preparing base Mistral model for conceptual MLM fine-tuning: {e}")
    instruct_mlm_model_to_finetune = None; instruct_mlm_tokenizer = None; instruct_model_finetuned_base = None

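# MLMDatasetShared comes from earlier EmbedEvolution stages and is not defined in this notebook.
# The class below is a minimal stand-in with an ASSUMED interface (texts, tokenizer, max_length,
# plus a .data_collator attribute); swap in the original helper if you have it.
class MLMDatasetShared(Dataset):
    def __init__(self, texts, tokenizer, max_length=256):
        self.encodings = [
            tokenizer(t, truncation=True, max_length=max_length, return_special_tokens_mask=True)
            for t in texts
        ]
        # Without a mask token (as with Mistral), MLM collation raises an error,
        # so fall back to causal-LM collation in that case.
        self.data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer, mlm=tokenizer.mask_token is not None, mlm_probability=0.15
        )

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        return self.encodings[idx]

# MLM Dataset Preparation (using the tokenizer for the base model)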
if instruct_mlm_tokenizer and cleaned_sentences_for_instruct_analysis and instruct_model_finetuned_base is not None:
    num_ft_sents_instruct = min(len(cleaned_sentences_for_instruct_analysis), 100) # Even smaller subset for 7B model demo
    ft_texts_instruct = cleaned_sentences_for_instruct_analysis[:num_ft_sents_instruct]
    if ft_texts_instruct:
        mlm_dataset_instruct = MLMDatasetShared(ft_texts_instruct, instruct_mlm_tokenizer, max_length=256) # Shorter max_length for MLM
        if len(mlm_dataset_instruct) > 0:
            # We won't run the actual fine-tuning loop here due to resource constraints
            # but we prepare the dataloader to show the setup.
            instruct_mlm_dataloader = DataLoader(
                mlm_dataset_instruct, batch_size=1, shuffle=True, collate_fn=mlm_dataset_instruct.data_collator # Batch size 1 for 7B
            )
            print(f"Prepared MLM dataset for Instruction model's base with {len(mlm_dataset_instruct)} instances (FINE-TUNING LOOP IS SKIPPED).")
        else: print("MLM dataset for Instruction model base is empty.")
    else: print("No texts selected for Instruction model MLM fine-tuning.")
else:
    if not instruct_model_finetuned_base:
        print("Base model for fine-tuning not available.")
    else:
        print("MLM Tokenizer not available for Instruction model fine-tuning.")
    instruct_mlm_dataloader = None
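
For readers unfamiliar with what an MLM data collator does to each batch, here is a simplified, library-free sketch of BERT-style masking: about 15% of positions are selected, and of those 80% become the mask token, 10% a random token, and 10% stay unchanged. The `mask_id` value is hypothetical (Mistral's tokenizer defines no mask token), and unlike the `transformers` collator this sketch does not exclude special tokens.

```python
import random

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=None):
    """Return (masked_inputs, labels) following the BERT-style 80/10/10 rule."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, original in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected; it contributes nothing to the loss
        labels[i] = original  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

toy_ids = list(range(100, 120))  # placeholder token ids, not real tokenizer output
masked_inputs, mlm_labels = mask_tokens(toy_ids, mask_id=4, vocab_size=32000, seed=0)
print("inputs:", masked_inputs)
print("labels:", mlm_labels)
```

Only positions whose label differs from `-100` contribute to the loss, which is why a small corpus yields few training signals per pass.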

### 5.2 Fine-tune Base Model (Conceptual - Loop Skipped for 7B Model Practicality)
Due to the immense computational resources required to fine-tune a 7B parameter model like Mistral, we will not execute the training loop in this notebook. We have prepared the data and model structure as a demonstration of how one *would* approach it.
For the "fine-tuned" evaluation below, we will use the *original pre-trained base* of `intfloat/e5-mistral-7b-instruct` as a stand-in. The purpose is to show the *structure* of comparison, assuming fine-tuning had occurred and produced an `instruct_model_finetuned_base`. In a real experiment with adequate resources, this base model would be the output of an actual MLM training run on the "Lumina Codex".

In [None]:
if 'instruct_model_finetuned_base' in locals() and instruct_model_finetuned_base is not None:
    print("Proceeding with 'instruct_model_finetuned_base' (stand-in for actual fine-tuned weights) for evaluation.")
    instruct_model_finetuned_base.eval() # Ensure it's in eval mode
else:
    print("Fine-tuned base for instruction model not available. Skipping fine-tuned evaluation.")

### 5.3 Evaluate "Fine-tuned" Instruction-Aware Embeddings
We reconstruct a SentenceTransformer using the (stand-in) fine-tuned base Mistral model and the original pooling/normalization layers from `intfloat/e5-mistral-7b-instruct`. Then we evaluate sentence embeddings using the same default "passage: " instruction.

In [None]:
sentence_embeddings_instruct_ft = None
instructor_model_finetuned_full = None

if 'instruct_model_finetuned_base' in locals() and instruct_model_finetuned_base is not None and \
   'instructor_model_pretrained' in locals() and instructor_model_pretrained is not None and \
   'instruct_mlm_tokenizer' in locals() and instruct_mlm_tokenizer is not None:
    
    print("Reconstructing SentenceTransformer with 'fine-tuned' (stand-in) instruction model base...")
    try:
        # We need to save the 'instruct_model_finetuned_base' and its 'instruct_mlm_tokenizer'
        # to a temporary path so models.Transformer can load them together.
        temp_finetuned_instruct_base_path = "./instruct_finetuned_base_temp"
        
        instruct_model_finetuned_base.save_pretrained(temp_finetuned_instruct_base_path)
        instruct_mlm_tokenizer.save_pretrained(temp_finetuned_instruct_base_path)
        print(f"Stand-in fine-tuned instruction base model and tokenizer saved to {temp_finetuned_instruct_base_path}")

        # Create the word_embedding_module using the path
        word_embedding_module_instruct_ft = models.Transformer(
            model_name_or_path=temp_finetuned_instruct_base_path,
            max_seq_length=instructor_model_pretrained[0].max_seq_length, # from original SBERT wrapper
            # models.Transformer takes no trust_remote_code argument of its own (to our knowledge);
            # Mistral loads natively, and such flags would be passed through model_args if ever needed.
        )
        word_embedding_module_instruct_ft.to(device)
        
        # Get pooling and normalization from the original pre-trained SBERT object
        pooling_model_instruct = instructor_model_pretrained[1] # Assumes pooling is the second module
        pooling_model_instruct.to(device)
        
        reconstructed_modules_instruct_ft = [word_embedding_module_instruct_ft, pooling_model_instruct]
        
        if len(instructor_model_pretrained) > 2 and isinstance(instructor_model_pretrained[2], models.Normalize):
            normalize_model_instruct = instructor_model_pretrained[2]
            normalize_model_instruct.to(device)
            reconstructed_modules_instruct_ft.append(normalize_model_instruct)
        elif not any(isinstance(m, models.Normalize) for m in reconstructed_modules_instruct_ft):
            print("Adding Normalize layer for reconstructed instruction model.")
            reconstructed_modules_instruct_ft.append(models.Normalize())

        instructor_model_finetuned_full = SentenceTransformer(modules=reconstructed_modules_instruct_ft, device=device_str)
        print("Successfully reconstructed SentenceTransformer with 'fine-tuned' (stand-in) instruction model base.")

        # Optional: clean up the temporary directory (requires `import shutil`)
        # shutil.rmtree(temp_finetuned_instruct_base_path)

        # Proceed with evaluation using the default instruction
        if instructor_model_finetuned_full and sentences_with_default_e5_instruction: # From cell 4.2
            print(f"\nGenerating Instruction-Aware ('fine-tuned' stand-in) embeddings for {len(sentences_with_default_e5_instruction)} sentences...")
            sentence_embeddings_instruct_ft = instructor_model_finetuned_full.encode(
                sentences_with_default_e5_instruction, convert_to_numpy=True, normalize_embeddings=True
            )
            print(f"Generated Instruction-Aware ('fine-tuned' stand-in) embeddings. Shape: {sentence_embeddings_instruct_ft.shape if sentence_embeddings_instruct_ft is not None else 'None'}")

            if sentence_embeddings_instruct_ft is not None and sentence_embeddings_instruct_ft.shape[0] >= 2:
                embedding_matrix_instruct_ft = sentence_embeddings_instruct_ft
                plot_labels_instruct_ft = selected_sentence_labels_instruct_analysis # Same labels

                print("\n--- Cosine Similarity Heatmap (Sentences - 'Fine-tuned' Instruction-Aware, default instruction) ---")
                similarity_matrix_instruct_ft = np.dot(embedding_matrix_instruct_ft, embedding_matrix_instruct_ft.T)
                plt.figure(figsize=(fig_w_instruct_pt, fig_h_instruct_pt)) # Reuse fig sizes from pre-trained
                annot_hm_instruct_ft = num_lbl_instruct_pt < 20
                sns.heatmap(similarity_matrix_instruct_ft, annot=annot_hm_instruct_ft, cmap='inferno', fmt=".2f", xticklabels=plot_labels_instruct_ft, yticklabels=plot_labels_instruct_ft, linewidths=.5, cbar_kws={"shrink":.8}, vmin=-1, vmax=1)
                plt.title(f'Instruction-Aware ({INSTRUCT_MODEL_NAME}) Similarity ("Fine-tuned" Stand-in)', fontsize=16)
                plt.xticks(rotation=65, ha='right'); plt.yticks(rotation=0); plt.tight_layout(); plt.show()

                print("\n--- PCA Visualization (Sentences - 'Fine-tuned' Instruction-Aware, default instruction) ---")
                pca_instruct_ft = PCA(n_components=2)
                embeddings_2d_instruct_ft = pca_instruct_ft.fit_transform(embedding_matrix_instruct_ft)
                plt.figure(figsize=(fig_w_instruct_pt*0.9, fig_h_instruct_pt*0.9))
                plt.scatter(embeddings_2d_instruct_ft[:,0], embeddings_2d_instruct_ft[:,1], alpha=0.7, s=60)
                for i,lbl in enumerate(plot_labels_instruct_ft): plt.annotate(lbl,(embeddings_2d_instruct_ft[i,0],embeddings_2d_instruct_ft[i,1]),textcoords="offset points",xytext=(5,5),ha='center')
                plt.title(f'Instruction-Aware ({INSTRUCT_MODEL_NAME}) Embeddings ("Fine-tuned" Stand-in) - PCA', fontsize=16)
                plt.xlabel('PCA Comp 1'); plt.ylabel('PCA Comp 2'); plt.grid(True); plt.tight_layout(); plt.show()
        else:
            print("Reconstructed model or instruction-prefixed sentences unavailable; skipping 'fine-tuned' (stand-in) Instruction-Aware analysis.")
            
    except Exception as e:
        print(f"Error during 'fine-tuned' (stand-in) instruction model evaluation: {e}")
else:
    print("Prerequisites for 'fine-tuned' (stand-in) instruction model evaluation not met.")

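### Interpretation Notes for Fine-tuned Instruction-Aware Embeddings:
* **Compare with Pre-trained `e5-mistral-7b-instruct`:** The main goal is to see whether MLM fine-tuning on "Lumina Codex" noticeably changes how sentences (with the default `passage: ` prefix) relate to each other. Does the embedding space become more specialized for "Lumina Codex" themes? Compare the heatmap and PCA plot above with their pre-trained counterparts from Section 4.
* **Instruction Impact:** Remember that the primary way an instruction-aware model adapts is via the instruction prepended to its input. This MLM fine-tuning is an *additional* layer of domain adaptation for its underlying representations, and because the fine-tuned base here is a stand-in, any differences should be read as illustrative rather than conclusive.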
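
Eyeballing two heatmaps side by side can miss small shifts, so it may help to quantify the comparison. The sketch below is one possible approach, not part of the original notebook: the helper `compare_embedding_spaces` and the toy arrays are hypothetical, and it assumes both embedding sets cover the same sentences in the same order. It reports how much the pairwise cosine similarities changed between the two sets.

```python
import numpy as np

def compare_embedding_spaces(emb_before, emb_after):
    """Summarize how pairwise cosine similarities changed between two embedding sets.

    Both inputs are (n_sentences, dim) arrays for the same sentences, in the same order.
    This is a hypothetical helper for illustration, not a library function.
    """
    # L2-normalize rows so that dot products are cosine similarities
    before = emb_before / np.linalg.norm(emb_before, axis=1, keepdims=True)
    after = emb_after / np.linalg.norm(emb_after, axis=1, keepdims=True)

    sim_before = before @ before.T
    sim_after = after @ after.T
    delta = sim_after - sim_before

    # Only the upper triangle (i < j) holds distinct sentence pairs
    iu = np.triu_indices(delta.shape[0], k=1)
    pair_deltas = delta[iu]
    top = int(np.argmax(np.abs(pair_deltas)))
    return {
        "mean_abs_change": float(np.mean(np.abs(pair_deltas))),
        "max_abs_change": float(np.abs(pair_deltas[top])),
        "most_changed_pair": (int(iu[0][top]), int(iu[1][top])),
        "delta_matrix": delta,
    }

# Toy data standing in for the real pre-trained and fine-tuned embeddings
rng = np.random.default_rng(0)
toy_before = rng.normal(size=(5, 8))
toy_after = toy_before + 0.1 * rng.normal(size=(5, 8))
report = compare_embedding_spaces(toy_before, toy_after)
print("Mean |delta cosine|:", report["mean_abs_change"])
print("Most changed sentence pair (indices):", report["most_changed_pair"])
```

In the notebook, `sentence_embeddings_instruct_ft` would be passed as `emb_after`, with the pre-trained embeddings from Section 4 as `emb_before`. A mean absolute change near zero would suggest the fine-tuning step barely moved the sentence relationships, which is the expected outcome when the fine-tuned base is only a stand-in.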