### Install Dependencies

First, we'll set up our environment by installing the necessary Python libraries.

* **`unsloth`**: We install the latest version of Unsloth directly from its GitHub repository. Unsloth provides massive speedups and memory reductions for fine-tuning LLMs, enabling us to train models up to 2x faster and use 60% less memory. The `[colab-new]` option ensures compatibility with the latest Google Colab environments.
* **Hugging Face Ecosystem**: We install key libraries for training and optimization:
    * `peft`: Parameter-Efficient Fine-Tuning, for using techniques like LoRA.
    * `trl`: Transformer Reinforcement Learning, for its easy-to-use `SFTTrainer`.
    * `accelerate`: To easily run our training script on any hardware.
    * `bitsandbytes`: For 4-bit quantization (QLoRA), which drastically reduces model size.
* **`xformers`**: Provides memory-efficient attention mechanisms for another performance boost.
* **`wandb`**: Weights & Biases, for logging our experiments and tracking metrics like training loss.

In [14]:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers trl peft accelerate bitsandbytes psutil
!pip install -q wandb

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
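
Because two of the installs use `-q` and `--no-deps`, a missing or mismatched package can go unnoticed. As an optional sanity check, the sketch below lists the installed version of each library using only the standard library. The helper name `installed_versions` and the package list are illustrative choices, not part of the original notebook.

```python
from importlib import metadata

# Packages installed in the cell above (list assembled for illustration).
PACKAGES = ["unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes", "wandb"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Any line reporting `NOT INSTALLED` suggests re-running the corresponding `pip install` command.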


### Log In to Services

To download our model and log our training progress, we need to authenticate with Hugging Face and Weights & Biases (W&B). We use Colab's `userdata` to securely access our API keys without hardcoding them in the notebook.

* **Hugging Face Hub**: We need to log in to download the Phi-3 model, which requires accepting user conditions.
* **Weights & Biases**: We log in to `wandb` to enable experiment tracking. This will allow us to monitor metrics like training loss in real-time.

> **Action Required:** Before running this cell, you must store your API keys as secrets in Google Colab.
> 1.  Click the **🔑 (Secrets)** icon on the left sidebar.
> 2.  Create a new secret named `hugging` and paste your Hugging Face access token (with `write` permissions) as the value.
> 3.  Create another secret named `WANDB_API_KEY` and paste your W&B API key as the value.

In [2]:
# Fetch the Hugging Face token from Colab secrets (it is passed to from_pretrained later)
from google.colab import userdata
hf_token = userdata.get('hugging')

# Log in to wandb
import wandb
wandb_api_key = userdata.get('WANDB_API_KEY')
wandb.login(key=wandb_api_key)

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: eslam3012mohamed (eslam3012mohamed-not-yet) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True
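
The cell above only works inside Colab, because `google.colab.userdata` is not available elsewhere. If the notebook may also run locally, one possible approach is a small helper that falls back to environment variables. This is a sketch under that assumption; `get_secret` is a hypothetical helper name, not something the notebook or Colab provides.

```python
import os

def get_secret(name):
    """Read a secret from Colab's secret store, falling back to environment variables."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # For example: the secret does not exist or notebook access was not granted.
        return os.environ.get(name)

# Same secret names as in the cell above; either value is None if the secret is absent.
hf_token = get_secret("hugging")
wandb_api_key = get_secret("WANDB_API_KEY")
```

Outside Colab, exporting `WANDB_API_KEY` and `hugging` as environment variables before starting Jupyter would then be enough.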

### Import Core Libraries

With our environment set up and authenticated, we can now import the core components from the libraries we installed. Each of these plays a critical role in the fine-tuning pipeline.

* **`FastLanguageModel`**: The star of the show from Unsloth. This class will load our base model and automatically apply all the necessary optimizations for fast, memory-efficient training.
* **`torch`**: The fundamental deep learning framework.
* **`load_dataset`**: A function from the Hugging Face `datasets` library to easily pull our training data from the Hub.
* **`SFTTrainer`**: A specialized trainer from the `trl` library designed specifically for Supervised Fine-Tuning.
* **`TrainingArguments`**: A configuration class from `transformers` where we will define all the hyperparameters for our training job.
* **`is_bfloat16_supported`**: A utility from Unsloth to check if our hardware supports `bfloat16` precision, which is ideal for training modern transformers.

In [3]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Load Model and Tokenizer

Now we use Unsloth's `FastLanguageModel` to load our pre-trained model. This single, powerful command handles several critical steps for us:

1.  **Downloads the model** from the Hugging Face Hub.
2.  **Applies 4-bit quantization** to drastically reduce memory usage.
3.  **Patches the model** with performance optimizations for faster training.
4.  **Prepares the tokenizer** for use in training.

Let's look at the key parameters:
* `model_name`: We are loading `"microsoft/Phi-3-mini-4k-instruct"`, a highly capable small language model that is perfect for fine-tuning on consumer hardware.
* `load_in_4bit = True`: This is the core of our memory-saving strategy. It enables 4-bit quantization (QLoRA), reducing the VRAM footprint significantly.
* `max_seq_length = 2048`: We set the maximum context window for our training examples. This offers a good balance between capturing long-range dependencies and managing memory.
* `dtype = None`: This allows Unsloth to automatically detect and use a suitable data type for our GPU: `bfloat16` where the hardware supports it, and `float16` otherwise (as on the Tesla T4 used here, where the load log reports `Bfloat16 = FALSE`).
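
To get a feel for why `load_in_4bit` matters, here is a back-of-the-envelope estimate of the memory needed just to store the weights. It assumes Phi-3-mini has roughly 3.8 billion parameters and ignores quantization constants, activations, optimizer state, and the KV cache, so actual VRAM usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3.8e9  # assumption: Phi-3-mini has roughly 3.8 billion parameters

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# float16: ~7.6 GB
#   4-bit: ~1.9 GB
```

On a GPU with about 15 GB of memory, that difference leaves considerably more room for activations and LoRA training state.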

In [4]:
# Load model
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/Phi-3-mini-4k-instruct",
    max_seq_length = 2048,
    load_in_4bit=True,
    dtype=None,
    token = hf_token,
)

==((====))==  Unsloth 2025.12.9: Fast Mistral patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Test the Base Model (Before Fine-Tuning)

Before we fine-tune the model, it's crucial to establish a baseline. We need to see how the pre-trained model performs on a task similar to our goal. This helps us understand its out-of-the-box capabilities and gives us a "before" snapshot to compare against our "after" fine-tuned version.

Our test process involves a few key steps:
1.  **Craft a Prompt**: We create a sample conversation using the standard `system` and `user` roles. The system prompt sets the model's persona (an expert financial analyst), while the user prompt provides a specific context and a question.
2.  **Apply Chat Template**: We use `tokenizer.apply_chat_template`. This is a critical function that formats our structured conversation into the exact string format that `Phi-3-instruct` expects, including special tokens.
3.  **Generate Response**: We run a standard inference using `model.generate()` to get the model's answer based on our prompt.
4.  **Evaluate Output**: We'll examine the response to see if the model correctly follows instructions and extracts the required information from the context.
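
To make step 2 concrete, the sketch below hand-renders a conversation into the layout that appears in this notebook's printed prompts (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`). It is an illustration inferred from those outputs, and `render_phi3_prompt` is a hypothetical helper; in real code, `tokenizer.apply_chat_template` remains the source of truth.

```python
def render_phi3_prompt(messages, add_generation_prompt=True):
    """Approximate the Phi-3 chat layout seen in this notebook's printed prompts."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        # Inference: leave the assistant turn open so the model writes the answer.
        text += "<|assistant|>\n"
    else:
        # Training: the conversation is complete, so close it with the end-of-text token.
        text += "<|endoftext|>"
    return text

messages = [
    {"role": "system", "content": "You are an expert financial analyst."},
    {"role": "user", "content": "Context: ...\nQuestion: ..."},
]
print(render_phi3_prompt(messages))
```

The `add_generation_prompt` flag is the part worth noticing: it is `True` when asking the model for an answer and `False` when the answer is already part of the training text.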

In [5]:
# Test base model first to ensure it works
def test_model(model, tokenizer, context, question):
    messages = [
        {
            "role": "system",
            "content": "You are an expert financial analyst. Answer the user's question based only on the provided context."
        },
        {
            "role": "user",
            "content": f"""Context: {context}\nQuestion:{question}"""
        }
    ]

    # Use the model's built-in chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("Generated prompt:")
    print(prompt)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only new tokens
    response_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(response_tokens, skip_special_tokens=True)
    return response

In [6]:
# Test base model first
context = "The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt."
question = "What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?"

In [7]:
base_response = test_model(base_model, tokenizer, context, question)
print("=" * 50,"\nBase model response:",base_response)

Generated prompt:
<|system|>
You are an expert financial analyst. Answer the user's question based only on the provided context.<|end|>
<|user|>
Context: The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt.
Question:What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?<|end|>
<|assistant|>

Base model response: The two key factors that contributed to the increase in the company's gross margin in fiscal year 2023 were a favorable product mix with higher sales of premium software subscriptions, and manufacturing efficiencies gained from the new automated production line in Alexandria, Egypt.


### Configure LoRA for Efficient Fine-Tuning

Now we get to the core of Parameter-Efficient Fine-Tuning (PEFT). Instead of training the entire model, we'll use **Low-Rank Adaptation (LoRA)** to inject small, trainable "adapter" matrices into the model's architecture. This means we only need to train a tiny fraction of the total parameters (typically <1%), which is what makes fine-tuning feasible on a single GPU.

Unsloth's `get_peft_model` function seamlessly applies this configuration to our 4-bit model. Let's look at the key hyperparameters:

* `r = 16`: The rank or dimension of the LoRA adapter matrices. A higher rank means more trainable parameters and greater expressive power, but also more memory. `16` is a solid and popular choice.
* `lora_alpha = 16`: The scaling factor for the LoRA weights. A common convention is to set this equal to `r`.
* `target_modules`: This is a critical setting. We specify the names of the layers (in this case, the attention and feed-forward layers) where the LoRA adapters will be injected. Unsloth provides a utility to find all potential layers, and we're targeting the most impactful ones here.
* `use_gradient_checkpointing = "unsloth"`: A crucial memory-saving technique that trades a bit of computation time to drastically reduce VRAM usage, allowing us to use larger batch sizes or longer sequences. The `"unsloth"` option enables a custom, faster implementation.
* `random_state = 2002`: We set a seed for reproducibility. Fun fact: 2002 was the year the modern Bibliotheca Alexandrina was inaugurated, not too far from our model's fictional manufacturing plant in Alexandria.

After this cell, our model is fully prepared for training. The original weights are frozen, and only the small LoRA adapter matrices will be updated during training.

In [8]:
# Configure LoRA
ft_model = FastLanguageModel.get_peft_model(
    model = base_model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 2002,
)

Unsloth 2025.12.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [64]:
# Print the number of trainable parameters and the percentage that will be unfrozen
ft_model.print_trainable_parameters()

trainable params: 29,884,416 || all params: 3,850,963,968 || trainable%: 0.7760
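
The reported count can be reproduced by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The sketch below assumes Phi-3-mini's hidden size is 3072 and its MLP intermediate size is 8192, with separate (unfused) projections as listed in `target_modules`; these dimensions are assumptions not stated in the notebook.

```python
HIDDEN = 3072        # assumption: Phi-3-mini hidden size
INTERMEDIATE = 8192  # assumption: Phi-3-mini MLP intermediate size
NUM_LAYERS = 32      # matches "patched 32 layers" in the log above
R = 16               # LoRA rank used in get_peft_model

# (in_features, out_features) for each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # Matrix A has shape (r, in_features); matrix B has shape (out_features, r).
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}")  # per layer: 933,888
print(f"total:     {total:,}")      # total:     29,884,416
print(f"share:     {100 * total / 3_850_963_968:.4f}%")  # share:     0.7760%
```

Doubling `r` doubles the adapter size, which is a quick way to reason about the memory cost of a higher rank.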


### Load and Prepare the Dataset

The quality and format of your training data are paramount for a successful fine-tune. Instruction-tuned models like Phi-3 are highly sensitive to the prompt format they were trained on. In this step, we will load our financial Q&A dataset and transform each entry to perfectly match Phi-3's specific chat template.

Our workflow is as follows:
1.  **Load Dataset**: We start by loading the `virattt/llama-3-8b-financialQA` dataset from the Hugging Face Hub. Each row pairs a financial context and a question with a reference answer.
2.  **Define a Formatting Function**: We create a function, `formatting_prompts_func`, that takes a batch of examples and restructures them. For each row, it builds a conversation with three parts:
    * A `system` message to consistently set the model's persona.
    * A `user` message combining the `context` and `question`.
    * An `assistant` message containing the ground-truth `answer` that we want the model to learn.
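3.  **Apply the Chat Template**: Inside the function, we use the crucial `tokenizer.apply_chat_template` method. This converts the structured conversation into a single string containing Phi-3's special tokens (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`), which we store in a new `text` column.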
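
When `dataset.map` is called with `batched=True`, the formatting function receives a dictionary mapping each column name to a list of values, and must return a dictionary of equally long lists. The toy example below imitates that contract without the `datasets` library; `toy_template` is a hypothetical stand-in for `tokenizer.apply_chat_template`.

```python
def toy_template(system, user, assistant):
    # Hypothetical stand-in for tokenizer.apply_chat_template (not the real Phi-3 format).
    return f"[system] {system}\n[user] {user}\n[assistant] {assistant}"

def format_batch(examples):
    """Receive a dict of column -> list and return a dict with one new 'text' list."""
    texts = []
    for question, context, answer in zip(
        examples["question"], examples["context"], examples["answer"]
    ):
        user = f"Context: {context}\n\nQuestion: {question}"
        texts.append(toy_template("You are an expert financial analyst.", user, answer))
    return {"text": texts}

# A batch as the mapping function would see it: one list per column.
batch = {
    "question": ["Q1?", "Q2?"],
    "context": ["C1", "C2"],
    "answer": ["A1", "A2"],
}
out = format_batch(batch)
print(len(out["text"]))  # 2
print(out["text"][0])
```

Returning only the new column is sufficient, since the existing columns are kept by `map` unless they are explicitly removed.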

In [9]:
# Use Phi-3's actual chat template for training
def formatting_prompts_func(examples):
    questions = examples["question"]
    contexts = examples["context"]
    responses = examples["answer"]
    texts = []

    for question, context, response in zip(questions, contexts, responses):
        # Create proper conversation format
        messages = [
            {
                "role": "system",
                "content": "You are an expert financial analyst. Answer the user's question based only on the provided context."
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}"
            },
            {
                "role": "assistant",
                "content": response
            }
        ]

        # Use the model's chat template
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)

    return {"text": texts}

In [10]:
# Load and format dataset
dataset = load_dataset("virattt/llama-3-8b-financialQA", split="train")

print("Sample before formatting:", dataset[0])

Sample before formatting: {'question': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?', 'answer': 'NVIDIA initially focused on PC graphics.', 'context': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.', 'ticker': 'NVDA', 'filing': '2023_10K'}


In [47]:
# Show dataset column names (the `text` column appears because this cell was run after the formatting step below)
dataset.column_names

['question', 'answer', 'context', 'ticker', 'filing', 'text']

In [11]:
dataset = dataset.map(formatting_prompts_func, batched=True)

"Sample after formatting:", dataset[0]["text"][:500], "..."

('Sample after formatting:',
 "<|system|>\nYou are an expert financial analyst. Answer the user's question based only on the provided context.<|end|>\n<|user|>\nContext: Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.\n\nQuestion: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?<|end|>\n<|assistant|>\nNVIDIA initially focused on PC graphics.<|end|>\n<|endoftext|>",
 '...')

### Configure and Launch the Fine-Tuning Job

We have arrived at the final step. With our model loaded, LoRA configured, and the dataset perfectly formatted, we can now set up the trainer and launch the fine-tuning process.

We will use the `SFTTrainer` from the TRL library, which handles the complexities of the training loop for us. The behavior of the trainer is controlled by a comprehensive set of `TrainingArguments`.

#### Key Hyperparameters:
* **Batching**: We use a `per_device_train_batch_size` of 4 and `gradient_accumulation_steps` of 4. This gives us an effective batch size of `4 * 4 = 16`, which helps stabilize training while keeping memory usage low.
* **Training Steps**: We set `max_steps = 60` for a short, demonstrative training run. Because `max_steps` is set, it takes precedence over `num_train_epochs = 5`, so the run covers only a fraction of one epoch. In a real-world scenario, you would train for more steps or for a certain number of epochs.
* **Learning Rate**: A `learning_rate` of `2e-4` with a cosine scheduler and 5 `warmup_steps` is a common setup for LoRA.
* **Optimizations**: We use the `adamw_8bit` optimizer and enable `bf16` (bfloat16 mixed precision) if our GPU supports it, falling back to `fp16` otherwise. These techniques accelerate training and reduce memory consumption.
* **Logging and Saving**: We set `logging_steps = 100`, which is longer than this 60-step run, so no intermediate loss values are logged; lower it (for example to `1`) to see the loss at every step. A model checkpoint is saved to the `outputs` directory halfway through training (`save_steps = 30`).
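
The batching arithmetic is easy to check by hand. The sketch below uses the values from the `TrainingArguments` cell and the 7,000 examples reported in the training banner to work out how much of the dataset a 60-step run actually sees.

```python
import math

num_examples = 7000    # "Num examples = 7,000" in the training banner
per_device_batch = 4   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = effective_batch * max_steps
epochs_covered = samples_seen / num_examples
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")     # effective batch size: 16
print(f"samples seen:         {samples_seen}")        # samples seen:         960
print(f"epochs covered:       {epochs_covered:.5f}")  # epochs covered:       0.13714
print(f"steps per full epoch: {steps_per_epoch}")     # steps per full epoch: 438
```

The `0.13714` figure matches the `train/epoch` value in the run summary, which confirms that this demonstration run touches less than a seventh of the data.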

#### Launching the Training
With all the components in place, a single call to `trainer.train()` starts the fine-tuning process. As the training commences, keep an eye on the logged training loss: it should steadily decrease, indicating that the model is learning from our financial Q&A data. Let's kick it off and let the GPU work its magic through the early hours of this Saturday morning.

In [31]:
training_args = TrainingArguments(
    do_eval=True,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 4,
    gradient_accumulation_steps = 4,
    save_strategy = "steps",
    warmup_steps = 5,
    num_train_epochs = 5,
    learning_rate = 2e-4,
    fp16 = not is_bfloat16_supported(),  # use the Unsloth helper imported earlier
    bf16 = is_bfloat16_supported(),
    optim = "adamw_8bit",
    lr_scheduler_type = "cosine",
    output_dir = "outputs",
    logging_steps = 100,
    seed = 2002,
    max_steps = 60,
    weight_decay = 0.01,
    save_steps = 30,
)
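
To visualize what `warmup_steps = 5` and `lr_scheduler_type = "cosine"` imply over 60 steps, the sketch below reimplements the usual warmup-then-cosine formula in plain Python. It is meant as an approximation of the schedule's shape, not the Trainer's internal code, and `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 1, 5, 20, 40, 60):
    print(f"step {step:>2}: lr = {lr_at_step(step):.2e}")
```

The learning rate peaks at step 5 and decays smoothly afterwards, so the last few steps make only very small updates.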

In [32]:
# Training configuration
trainer = SFTTrainer(
    model = ft_model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 1,
    packing = False,
    args = training_args,
)

In [33]:
# Train the model
print("Starting training...")
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.


Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 29,884,416 of 3,850,963,968 (0.78% trained)


Step,Training Loss


Run summary:

| Metric | Value |
| --- | --- |
| total_flos | 4303114843299840.0 |
| train/epoch | 0.13714 |
| train/global_step | 60.0 |
| train_loss | 1.05234 |
| train_runtime | 327.2726 |
| train_samples_per_second | 2.933 |
| train_steps_per_second | 0.183 |


### Test the Fine-Tuned Model

The training is complete! Now for the moment of truth: did our fine-tuning work? We will now test our specialized model and compare its performance directly against the baseline we established earlier in "Test the Base Model (Before Fine-Tuning)".

To ensure a fair evaluation, our process is simple but critical:
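1.  **Use a Consistent Prompt Format**: We reuse the `test_model` function, which builds the prompt with the same system message and chat template that the model was trained on. This consistency is crucial for unlocking the model's new capabilities. One small difference remains: `test_model` joins the context and question with `\nQuestion:`, while the training formatter uses `\n\nQuestion: `, so aligning the two strings is worth considering.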
2.  **Rerun the Original Test Case**: We will ask the **exact same question** using the same context from our baseline test.

This provides our "after" snapshot. Compare this response to the one from the base model. Look for improvements in accuracy, conciseness, formatting (e.g., using a proper list), and overall adherence to the system prompt's instructions.

After a short but intense training session in the quiet of the Giza night, let's see how our newly specialized financial analyst performs.

In [34]:
# Test the fine-tuned model
context = "The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt."
question = "What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?"

response = test_model(ft_model, tokenizer, context, question)
print("\n" + "="*50)
print(f"Question: {question}")
print(f"Response: {response}")

Generated prompt:
<|system|>
You are an expert financial analyst. Answer the user's question based only on the provided context.<|end|>
<|user|>
Context: The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt.
Question:What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?<|end|>
<|assistant|>


Question: What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?
Response: The gross margin increased due to a favorable product mix with higher sales of premium software subscriptions and manufacturing efficiencies gained from a new automated production line in Alexandria, Egypt.
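
One simple way to put a number on "conciseness" is to compare word counts of the two responses shown in this notebook. This is only a sketch: `test_model` samples with `do_sample=True`, so a single pair of outputs is anecdotal, and word count says nothing about correctness.

```python
# Responses copied from the notebook outputs above.
base_text = (
    "The two key factors that contributed to the increase in the company's gross margin "
    "in fiscal year 2023 were a favorable product mix with higher sales of premium software "
    "subscriptions, and manufacturing efficiencies gained from the new automated production "
    "line in Alexandria, Egypt."
)
ft_text = (
    "The gross margin increased due to a favorable product mix with higher sales of premium "
    "software subscriptions and manufacturing efficiencies gained from a new automated "
    "production line in Alexandria, Egypt."
)

def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())

print(f"base model: {word_count(base_text)} words")
print(f"fine-tuned: {word_count(ft_text)} words")
```

Both answers name the same two factors; the fine-tuned response drops the restatement of the question, which is consistent with the short reference answers in the training data.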


In [81]:
import time

# Evaluation
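
A note on one of the quality metrics used by the evaluator in this section: the score it labels BLEU is computed from word-set overlap, not from the standard n-gram BLEU formula. The sketch below isolates that computation under the hypothetical name `word_set_recall`, to show that it ignores word order and repetition. A standard BLEU score could instead be obtained from the `evaluate` library, for example with `evaluate.load("bleu")`.

```python
def word_set_recall(prediction, reference):
    """Fraction of unique reference words that also appear in the prediction."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    return len(pred_words & ref_words) / len(ref_words)

reference = "NVIDIA initially focused on PC graphics."

# Identical text gives a perfect score.
print(word_set_recall("NVIDIA initially focused on PC graphics.", reference))  # 1.0
# Scrambled word order gives the same perfect score, because sets ignore order.
print(word_set_recall("graphics. PC on focused initially NVIDIA", reference))  # 1.0
# Three of the six unique reference words are present.
print(word_set_recall("NVIDIA focused on gaming", reference))                  # 0.5
```

Because punctuation stays attached to words after `split()`, "graphics" and "graphics." count as different tokens, which is another reason to read this score as a rough proxy.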

In [55]:
!pip install -q evaluate sentence-transformers bert-score scikit-learn

   61.1/61.1 kB 3.2 MB/s eta 0:00:00
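
The evaluator defined next estimates time to first token (TTFT) as a fixed share of total latency. One way to measure it more directly is to time a one-token generation separately from the full generation. The sketch below shows the idea with a hypothetical `fake_generate` stand-in that sleeps instead of running a model; with a real model, `generate_fn` would wrap `model.generate(**inputs, max_new_tokens=n)` followed by `torch.cuda.synchronize()`.

```python
import time

def measure_ttft_and_tpot(generate_fn, max_new_tokens):
    """generate_fn(n) must generate up to n new tokens and return how many it produced."""
    # TTFT: time to process the prompt and produce a single token.
    start = time.perf_counter()
    generate_fn(1)
    ttft = time.perf_counter() - start

    # Total latency for the full generation.
    start = time.perf_counter()
    produced = generate_fn(max_new_tokens)
    total = time.perf_counter() - start

    # TPOT: average time per token after the first one.
    decode_time = max(total - ttft, 0.0)
    tpot = decode_time / max(produced - 1, 1)
    return {"ttft_s": ttft, "total_s": total, "tpot_s": tpot}

def fake_generate(n, prefill_s=0.02, per_token_s=0.005):
    # Hypothetical stand-in for model.generate: sleeps to imitate prefill plus decoding.
    time.sleep(prefill_s + per_token_s * (n - 1))
    return n

stats = measure_ttft_and_tpot(fake_generate, max_new_tokens=20)
for key, value in stats.items():
    print(f"{key}: {value * 1000:.1f} ms")
```

This costs one extra short generation per measurement, but the resulting TTFT and TPOT reflect the model's behavior instead of a fixed ratio.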

In [84]:
import torch
import time
import numpy as np
import evaluate
from tqdm import tqdm

class ProductionModelEvaluator:
  """
  Comprehensive evaluation framework for production-ready language models.
  Measures latency, throughput, quality metrics, and resource usage.
  """

  def __init__(self, model, tokenizer, max_new_tokens=20, device='cuda'):
      """
      Initialize evaluator with model and tokenizer.

      Args:
          model: Fine-tuned language model
          tokenizer: Corresponding tokenizer
          max_new_tokens: Maximum number of tokens to generate per request
          device: 'cuda' or 'cpu'
      """
      self.model = model
      self.tokenizer = tokenizer
      self.device = device
      self.model.to(device)
      self.max_new_tokens = max_new_tokens
      self.model.eval()  # Set to evaluation mode

      # Storage for results
      self.results = {
          'latency': [],
          'throughput': [],
          'quality': {},
          'resource': {}
      }

  # ============================================================
  # 1. LATENCY METRICS - How fast is inference?
  # ============================================================

  def measure_latency(self, test_inputs, num_runs=100, warmup_runs=10):
    """
    Measure inference latency metrics including tokenization and decoding.

    Args:
        test_inputs: List of input texts to test
        num_runs: Number of runs for averaging
        warmup_runs: GPU warmup iterations

    Returns:
        dict with latency statistics
    """
    print("📊 Measuring Latency...")
    total_latencies = []
    ttft_latencies = []
    tgt_latencies = []
    tpot_latencies = []

    # Warmup GPU
    for _ in range(warmup_runs):
        with torch.no_grad():
            inputs = self.tokenizer(test_inputs[0], return_tensors='pt').to(self.device)
            _ = self.model.generate(**inputs, max_new_tokens=self.max_new_tokens)

    # Measure actual latency
    for _ in tqdm(range(num_runs), desc="Latency Test"):
        test_text = np.random.choice(test_inputs)

        # Start timing (includes tokenization)
        start_time = time.time()

        with torch.no_grad():
            inputs = self.tokenizer(test_text, return_tensors='pt').to(self.device)
            input_length = inputs['input_ids'].shape[1]  # Get input length BEFORE generation

            # Generate
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=self.max_new_tokens,
                return_dict_in_generate=True,
                output_scores=True
            )

            torch.cuda.synchronize()

            # Extract ONLY the new tokens (not the input prompt)
            generated_ids = outputs.sequences[0]
            new_tokens_ids = generated_ids[input_length:]  # Slice to get only new tokens

            # Decode only the NEW tokens
            output_text = self.tokenizer.decode(new_tokens_ids, skip_special_tokens=True)

            # End timing (includes decoding)
            end_time = time.time()

        # Do calculations outside timing
        output_length = generated_ids.shape[0]
        num_new_tokens = output_length - input_length

        # Total latency (entire process)
        total_latency = (end_time - start_time) * 1000  # ms
        total_latencies.append(total_latency)

        # TTFT estimate (NOT measured): generate() returns only after all tokens
        # are produced, so first-token time is approximated with a fixed heuristic
        # of 12% of total latency. Treat TTFT, TGT, and TPOT below as rough figures.
        ttft = total_latency * 0.12
        ttft_latencies.append(ttft)

        # TGT: Time after first token
        tgt = total_latency - ttft
        tgt_latencies.append(tgt)

        # TPOT: Time per output token
        if num_new_tokens > 1:
            tpot = tgt / (num_new_tokens - 1)
        else:
            tpot = tgt
        tpot_latencies.append(tpot)

    latency_stats = {
        'total_mean_ms': np.mean(total_latencies),
        'total_median_ms': np.median(total_latencies),
        'total_p95_ms': np.percentile(total_latencies, 95),
        'total_p99_ms': np.percentile(total_latencies, 99),
        'std_ms': np.std(total_latencies),
        'min_ms': np.min(total_latencies),
        'max_ms': np.max(total_latencies),
        'ttft_mean_ms': np.mean(ttft_latencies),
        'ttft_median_ms': np.median(ttft_latencies),
        'ttft_p95_ms': np.percentile(ttft_latencies, 95),
        'tgt_mean_ms': np.mean(tgt_latencies),
        'tgt_median_ms': np.median(tgt_latencies),
        'tpot_mean_ms': np.mean(tpot_latencies),
        'tpot_median_ms': np.median(tpot_latencies),
        'tpot_p95_ms': np.percentile(tpot_latencies, 95)
    }

    self.results['latency'] = latency_stats
    self._print_latency_results(latency_stats)
    return latency_stats

  def _print_latency_results(self, stats):
      """Print latency statistics in readable format."""
      print("\n✅ Latency Results:")
      print(f"\n  Total Latency:")
      print(f"    Mean: {stats['total_mean_ms']:.2f} ms")
      print(f"    P95: {stats['total_p95_ms']:.2f} ms")

      print(f"\n  TTFT (Time to First Token):")
      print(f"    Mean: {stats['ttft_mean_ms']:.2f} ms")
      print(f"    P95: {stats['ttft_p95_ms']:.2f} ms")

      print(f"\n  TGT (Token Generation Time):")
      print(f"    Mean: {stats['tgt_mean_ms']:.2f} ms")

      print(f"\n  TPOT (Time Per Output Token):")
      print(f"    Mean: {stats['tpot_mean_ms']:.2f} ms")
      print(f"    P95: {stats['tpot_p95_ms']:.2f} ms")

      # Production readiness
      if stats['total_p95_ms'] < 100:
          print("\n  ✅ EXCELLENT - Production ready")
      elif stats['total_p95_ms'] < 500:
          print("\n  ⚠️  GOOD - Acceptable")
      else:
          print("\n  ❌ SLOW - Needs optimization")

  # ============================================================
  # 2. THROUGHPUT METRICS - How many requests per second?
  # ============================================================

  def measure_throughput(self, test_inputs, batch_sizes=[1, 4, 8, 16]):
    """
    Measure throughput (requests/second and tokens/second) for different batch sizes.
    Includes tokenization and decoding in timing.

    Args:
        test_inputs: List of input texts
        batch_sizes: List of batch sizes to test

    Returns:
        dict with throughput for each batch size
    """
    print("\n📊 Measuring Throughput...")
    throughput_results = {}

    for batch_size in batch_sizes:
        batch_texts = test_inputs[:batch_size] * 10

        total_processed = 0
        total_tokens_generated = 0
        total_elapsed = 0

        with torch.no_grad():
            for i in range(0, len(batch_texts), batch_size):
                batch = batch_texts[i:i+batch_size]

                # Start timing for this batch
                batch_start = time.time()

                # Tokenize
                inputs = self.tokenizer(
                    batch,
                    return_tensors='pt',
                    padding=True,
                    truncation=True
                ).to(self.device)

                input_length = inputs['input_ids'].shape[1]  # Get input length

                # Generate
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=self.max_new_tokens
                )

                torch.cuda.synchronize()

                # Extract only NEW tokens for each sequence in batch
                new_tokens_batch = outputs[:, input_length:]  # Slice all sequences

                # Decode (part of user-facing process)
                decoded_outputs = [
                    self.tokenizer.decode(new_tokens, skip_special_tokens=True)
                    for new_tokens in new_tokens_batch  # Decode only new tokens
                ]

                # End timing for this batch
                batch_end = time.time()

                # Add to total elapsed time
                total_elapsed += (batch_end - batch_start)

                # Do calculations
                total_processed += len(batch)
                # Count only NEW tokens generated
                tokens_per_sequence = new_tokens_batch.shape[1]  # Shape of new tokens only
                total_tokens_generated += tokens_per_sequence * len(batch)

        # Calculate metrics using summed elapsed time
        rps = total_processed / total_elapsed
        tps = total_tokens_generated / total_elapsed

        throughput_results[batch_size] = {'rps': rps, 'tps': tps}

        print(f"  Batch size {batch_size}:")
        print(f"    RPS (Requests/sec): {rps:.2f}")
        print(f"    TPS (Tokens/sec): {tps:.2f}")

    self.results['throughput'] = throughput_results
    return throughput_results

  # ============================================================
  # 3. QUALITY METRICS - How good are the predictions?
  # ============================================================

  def evaluate_quality(self, test_data, metrics=['bleu', 'rouge', 'cosine', 'bertscore', 'bi_encoder']):
    """
    Evaluate generation quality using standard NLP metrics.

    Args:
        test_data: List of dicts with 'input' and 'expected_output'
        metrics: List of metrics to compute

    Returns:
        dict with quality scores
    """
    print("\n📊 Evaluating Quality...")

    predictions = []
    references = []

    # Generate predictions
    for sample in tqdm(test_data, desc="Generating predictions"):
        with torch.no_grad():
            inputs = self.tokenizer(sample['input'], return_tensors='pt').to(self.device)
            input_length = inputs['input_ids'].shape[1]  # Get input length

            outputs = self.model.generate(
                **inputs,
                max_new_tokens=self.max_new_tokens
            )

            # Extract only NEW tokens (not the input prompt)
            new_tokens = outputs[0][input_length:]  # Slice to get only new tokens
            pred_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)

            predictions.append(pred_text)
            references.append(sample['expected_output'])

    quality_scores = {}

    # Simplified BLEU proxy: fraction of unique reference words that also appear
    # in the prediction (a word-set recall, not true n-gram BLEU)
    if 'bleu' in metrics:
        bleu_scores = []
        for pred, ref in zip(predictions, references):
            pred_words = set(pred.lower().split())
            ref_words = set(ref.lower().split())
            if len(ref_words) > 0:
                overlap = len(pred_words.intersection(ref_words))
                bleu = overlap / len(ref_words)
                bleu_scores.append(bleu)

        quality_scores['bleu'] = np.mean(bleu_scores) if bleu_scores else 0.0
        print(f"  BLEU Score: {quality_scores['bleu']:.4f}")

    # Cosine Similarity with TF-IDF
    if 'cosine' in metrics:
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        print("  Computing Cosine Similarity...")
        cosine_scores = []

        for pred, ref in zip(predictions, references):
            try:
                vectorizer = TfidfVectorizer()
                tfidf_matrix = vectorizer.fit_transform([pred, ref])
                cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
                cosine_scores.append(cosine_sim)
            except ValueError:
                # TfidfVectorizer raises ValueError on an empty vocabulary
                cosine_scores.append(0.0)

        quality_scores['cosine_similarity'] = np.mean(cosine_scores)
        print(f"  Cosine Similarity: {quality_scores['cosine_similarity']:.4f}")

    # BERTScore
    if 'bertscore' in metrics:
        from bert_score import score

        print("  Computing BERTScore...")
        P, R, F1 = score(predictions, references, lang='en', verbose=False)

        quality_scores['bertscore_precision'] = P.mean().item()
        quality_scores['bertscore_recall'] = R.mean().item()
        quality_scores['bertscore_f1'] = F1.mean().item()

        print(f"  BERTScore F1: {quality_scores['bertscore_f1']:.4f}")
        print(f"  BERTScore Precision: {quality_scores['bertscore_precision']:.4f}")
        print(f"  BERTScore Recall: {quality_scores['bertscore_recall']:.4f}")

    # Bi-Encoder Semantic Similarity
    if 'bi_encoder' in metrics:
        from sentence_transformers import SentenceTransformer, util

        print("  Computing Bi-Encoder Similarity...")

        # Load bi-encoder model
        bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')

        # Encode all predictions and references
        pred_embeddings = bi_encoder.encode(predictions, convert_to_tensor=True)
        ref_embeddings = bi_encoder.encode(references, convert_to_tensor=True)

        # Calculate cosine similarity
        bi_encoder_scores = []
        for pred_emb, ref_emb in zip(pred_embeddings, ref_embeddings):
            similarity = util.cos_sim(pred_emb, ref_emb).item()
            bi_encoder_scores.append(similarity)

        quality_scores['bi_encoder_similarity'] = np.mean(bi_encoder_scores)
        print(f"  Bi-Encoder Similarity: {quality_scores['bi_encoder_similarity']:.4f}")

    # Calculate exact match accuracy
    exact_matches = sum([1 for p, r in zip(predictions, references)
                        if p.strip().lower() == r.strip().lower()])
    quality_scores['exact_match'] = exact_matches / len(predictions)
    print(f"  Exact Match: {quality_scores['exact_match']:.2%}")

    # Calculate partial match
    partial_matches = sum([1 for p, r in zip(predictions, references)
                          if r.lower() in p.lower()])
    quality_scores['partial_match'] = partial_matches / len(predictions)
    print(f"  Partial Match: {quality_scores['partial_match']:.2%}")

    # Store sample predictions
    quality_scores['samples'] = [
        {'input': test_data[i]['input'],
         'expected': test_data[i]['expected_output'],
         'predicted': predictions[i]}
        for i in range(min(5, len(predictions)))
    ]

    self.results['quality'] = quality_scores
    return quality_scores

  # ============================================================
  # 4. RESOURCE USAGE - Memory and GPU utilization
  # ============================================================

  def measure_resource_usage(self, test_input):
    """
    Measure GPU memory usage during inference.

    Args:
        test_input: Single input text for testing

    Returns:
        dict with memory statistics
    """
    print("\n📊 Measuring Resource Usage...")

    # Reset memory stats and clear cache PROPERLY
    if self.device == 'cuda':
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # Ensure all operations complete

    # Measure memory after cache clear
    if self.device == 'cuda':
        mem_before = torch.cuda.memory_allocated() / 1024**2  # MB

    # Run inference
    with torch.no_grad():
        inputs = self.tokenizer(test_input, return_tensors='pt').to(self.device)
        input_length = inputs['input_ids'].shape[1]  # Get input length

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=self.max_new_tokens
        )

        if self.device == 'cuda':
            torch.cuda.synchronize()  # Ensure generation completes

        # Extract only new tokens
        new_tokens = outputs[0][input_length:]
        decoded = self.tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Measure memory after generation completes
    if self.device == 'cuda':
        torch.cuda.synchronize()  # Ensure all operations complete
        mem_after = torch.cuda.memory_allocated() / 1024**2  # MB
        mem_peak = torch.cuda.max_memory_allocated() / 1024**2  # MB

        resource_stats = {
            'memory_before_mb': mem_before,
            'memory_after_mb': mem_after,
            'memory_peak_mb': mem_peak,
            'memory_used_mb': mem_peak - mem_before  # Use peak
        }

        print(f"  Memory Before: {mem_before:.2f} MB")
        print(f"  Memory After: {mem_after:.2f} MB")
        print(f"  Memory Peak: {mem_peak:.2f} MB")
        print(f"  Memory Used (Peak - Before): {resource_stats['memory_used_mb']:.2f} MB")
    else:
        resource_stats = {'device': 'cpu', 'memory_tracking': 'not_available'}

    self.results['resource'] = resource_stats
    return resource_stats

  # ============================================================
  # 5. COMPREHENSIVE REPORT
  # ============================================================

  def generate_report(self, save_path='model_evaluation_report.html'):
    """
    Generate comprehensive HTML report with all metrics.

    Args:
        save_path: Path to save HTML report
    """
    print(f"\n📝 Generating comprehensive report...")

    # Create report content
    report = f"""
    <html>
    <head>
        <title>Model Evaluation Report</title>
        <meta charset="utf-8">
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; }}
            h1 {{ color: #333; }}
            h2 {{ color: #666; border-bottom: 2px solid #ddd; padding-bottom: 10px; }}
            .metric {{ background: #f5f5f5; padding: 15px; margin: 10px 0; border-radius: 5px; }}
            .good {{ color: green; font-weight: bold; }}
            .warning {{ color: orange; font-weight: bold; }}
            .bad {{ color: red; font-weight: bold; }}
            table {{ border-collapse: collapse; width: 100%; }}
            th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
            th {{ background-color: #4CAF50; color: white; }}
        </style>
    </head>
    <body>
        <h1>🚀 Production Model Evaluation Report</h1>
        <p>Generated on: {time.strftime('%Y-%m-%d %H:%M:%S')}</p>

        <h2>⚡ Latency Metrics</h2>
        <div class="metric">
            <h3>Total Latency</h3>
            <table>
                <tr><th>Metric</th><th>Value</th></tr>
                <tr><td>Mean</td><td>{self.results['latency']['total_mean_ms']:.2f} ms</td></tr>
                <tr><td>Median</td><td>{self.results['latency']['total_median_ms']:.2f} ms</td></tr>
                <tr><td>P95</td><td>{self.results['latency']['total_p95_ms']:.2f} ms</td></tr>
                <tr><td>P99</td><td>{self.results['latency']['total_p99_ms']:.2f} ms</td></tr>
                <tr><td>Standard Deviation</td><td>{self.results['latency']['std_ms']:.2f} ms</td></tr>
                <tr><td>Min</td><td>{self.results['latency']['min_ms']:.2f} ms</td></tr>
                <tr><td>Max</td><td>{self.results['latency']['max_ms']:.2f} ms</td></tr>
            </table>

            <h3>TTFT (Time to First Token)</h3>
            <table>
                <tr><th>Metric</th><th>Value</th></tr>
                <tr><td>Mean</td><td>{self.results['latency']['ttft_mean_ms']:.2f} ms</td></tr>
                <tr><td>Median</td><td>{self.results['latency']['ttft_median_ms']:.2f} ms</td></tr>
                <tr><td>P95</td><td>{self.results['latency']['ttft_p95_ms']:.2f} ms</td></tr>
            </table>

            <h3>TGT (Token Generation Time)</h3>
            <table>
                <tr><th>Metric</th><th>Value</th></tr>
                <tr><td>Mean</td><td>{self.results['latency']['tgt_mean_ms']:.2f} ms</td></tr>
                <tr><td>Median</td><td>{self.results['latency']['tgt_median_ms']:.2f} ms</td></tr>
            </table>

            <h3>TPOT (Time Per Output Token)</h3>
            <table>
                <tr><th>Metric</th><th>Value</th></tr>
                <tr><td>Mean</td><td>{self.results['latency']['tpot_mean_ms']:.2f} ms</td></tr>
                <tr><td>Median</td><td>{self.results['latency']['tpot_median_ms']:.2f} ms</td></tr>
                <tr><td>P95</td><td>{self.results['latency']['tpot_p95_ms']:.2f} ms</td></tr>
            </table>
        </div>

        <h2>📈 Throughput</h2>
        <div class="metric">
            <table>
                <tr><th>Batch Size</th><th>Requests/Second</th><th>Tokens/Second</th></tr>
    """

    for batch_size, throughput in self.results['throughput'].items():
        report += f"<tr><td>{batch_size}</td><td>{throughput['rps']:.2f}</td><td>{throughput['tps']:.2f}</td></tr>"

    report += f"""
            </table>
        </div>

        <h2>✨ Quality Metrics</h2>
        <div class="metric">
            <table>
                <tr><th>Metric</th><th>Score</th></tr>
                <tr><td>BLEU Score</td><td>{self.results['quality'].get('bleu', 0):.4f}</td></tr>
                <tr><td>Cosine Similarity</td><td>{self.results['quality'].get('cosine_similarity', 0):.4f}</td></tr>
                <tr><td>BERTScore F1</td><td>{self.results['quality'].get('bertscore_f1', 0):.4f}</td></tr>
                <tr><td>BERTScore Precision</td><td>{self.results['quality'].get('bertscore_precision', 0):.4f}</td></tr>
                <tr><td>BERTScore Recall</td><td>{self.results['quality'].get('bertscore_recall', 0):.4f}</td></tr>
                <tr><td>Bi-Encoder Similarity</td><td>{self.results['quality'].get('bi_encoder_similarity', 0):.4f}</td></tr>
                <tr><td>Exact Match</td><td>{self.results['quality'].get('exact_match', 0):.2%}</td></tr>
                <tr><td>Partial Match</td><td>{self.results['quality'].get('partial_match', 0):.2%}</td></tr>
            </table>
        </div>

        <h2>💾 Resource Usage</h2>
        <div class="metric">
            <table>
                <tr><th>Metric</th><th>Value</th></tr>
                <tr><td>Memory Before</td><td>{self.results['resource'].get('memory_before_mb', 0):.2f} MB</td></tr>
                <tr><td>Peak Memory</td><td>{self.results['resource'].get('memory_peak_mb', 0):.2f} MB</td></tr>
                <tr><td>Memory Used</td><td>{self.results['resource'].get('memory_used_mb', 0):.2f} MB</td></tr>
            </table>
        </div>

        <h2>📋 Sample Predictions</h2>
    """

    if 'samples' in self.results['quality']:
        for i, sample in enumerate(self.results['quality']['samples']):
            # Escape HTML special characters ('&' first, so new entities are not double-escaped)
            input_text = sample['input'][:200].replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
            expected_text = sample['expected'].replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
            predicted_text = sample['predicted'].replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

            report += f"""
            <div class="metric">
                <p><strong>Sample {i+1}:</strong></p>
                <p><strong>Input:</strong> {input_text}...</p>
                <p><strong>Expected:</strong> {expected_text}</p>
                <p><strong>Predicted:</strong> {predicted_text}</p>
            </div>
            """

    report += """
    </body>
    </html>
    """

    # Save report
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write(report)

    print(f"✅ Report saved to: {save_path}")

  # ============================================================
  # 6. PRODUCTION READINESS CHECK
  # ============================================================

  def production_readiness_check(self):
    """
    Determine if model is production-ready based on all metrics.

    Returns:
        dict with pass/fail for each criterion
    """
    print("\n🔍 Production Readiness Check...")

    checks = {
        'latency_p95': {
            'pass': self.results['latency']['total_p95_ms'] < 500,
            'value': f"{self.results['latency']['total_p95_ms']:.2f} ms",
            'threshold': '< 500 ms'
        },
        'quality_bleu': {
            'pass': self.results['quality'].get('bleu', 0) > 0.3,
            'value': f"{self.results['quality'].get('bleu', 0):.4f}",
            'threshold': '> 0.3'
        },
        'bertscore_f1': {
            'pass': self.results['quality'].get('bertscore_f1', 0) > 0.7,
            'value': f"{self.results['quality'].get('bertscore_f1', 0):.4f}",
            'threshold': '> 0.7'
        },
        'bi_encoder_similarity': {
            'pass': self.results['quality'].get('bi_encoder_similarity', 0) > 0.7,
            'value': f"{self.results['quality'].get('bi_encoder_similarity', 0):.4f}",
            'threshold': '> 0.7'
        },
        'memory_usage': {
            'pass': self.results['resource'].get('memory_peak_mb', float('inf')) < 8000,
            'value': f"{self.results['resource'].get('memory_peak_mb', 0):.2f} MB",
            'threshold': '< 8000 MB'
        }
    }

    print("\n" + "="*60)
    for check_name, check_data in checks.items():
        status = "✅ PASS" if check_data['pass'] else "❌ FAIL"
        print(f"{status} | {check_name}: {check_data['value']} (threshold: {check_data['threshold']})")
    print("="*60)

    overall_pass = all([c['pass'] for c in checks.values()])
    if overall_pass:
        print("\n🎉 MODEL IS PRODUCTION READY!")
    else:
        print("\n⚠️  MODEL NEEDS OPTIMIZATION BEFORE PRODUCTION")

    return checks
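

The `bleu` value computed in `evaluate_quality` is a word-set recall: it ignores word order, repeated words, and n-grams longer than one token. As a rough illustration of what a closer approximation could look like, here is a minimal sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty. The helper names and the add-one smoothing on every n-gram order are assumptions made for this example, not part of the evaluator class, and an established implementation would be preferable for reported results.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(prediction, reference, max_n=4):
    """Illustrative sentence-level BLEU with add-one smoothing (hypothetical helper)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each predicted n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        # Add-one smoothing keeps short sentences from collapsing to zero.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))

    # Brevity penalty discourages predictions shorter than the reference.
    if len(pred) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu_sketch("NVIDIA invented the GPU in 1999.", "NVIDIA invented the GPU in 1999."))
print(sentence_bleu_sketch("the the the the", "the cat"))
```

Unlike the set-based overlap, this version penalizes a prediction that repeats one matching word, because each n-gram is only credited as often as it occurs in the reference.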


In [85]:
# Create evaluator instance
evaluator = ProductionModelEvaluator(
    model=ft_model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    device='cuda'
)

### 1. RUN LATENCY EVALUATION

In [87]:
# Prepare latency test data from the first 100 samples
latency_test_data = [question for question in dataset[:100]['question']]
print("First question:", latency_test_data[0])

First question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?


In [88]:
# Evaluate the model over 100 timed runs, preceded by 10 warmup runs for the GPU
latency_results = evaluator.measure_latency(
    test_inputs=latency_test_data,
    num_runs=100,
    warmup_runs=10
)

📊 Measuring Latency...


Latency Test: 100%|██████████| 100/100 [09:32<00:00,  5.73s/it]


✅ Latency Results:

  Total Latency:
    Mean: 5725.98 ms
    P95: 6877.57 ms

  TTFT (Time to First Token):
    Mean: 687.12 ms
    P95: 825.31 ms

  TGT (Token Generation Time):
    Mean: 5038.86 ms

  TPOT (Time Per Output Token):
    Mean: 56.44 ms
    P95: 61.21 ms

  ❌ SLOW - Needs optimization

In [89]:
# Print a summary of the latency results
print(f"\n📊 Latency Results Summary:")
print(f"  Mean Latency: {latency_results['total_mean_ms']:.2f} ms")
print(f"  P95 Latency: {latency_results['total_p95_ms']:.2f} ms")
print(f"  TTFT Mean: {latency_results['ttft_mean_ms']:.2f} ms")
print(f"  TPOT Mean: {latency_results['tpot_mean_ms']:.2f} ms")


📊 Latency Results Summary:
  Mean Latency: 5725.98 ms
  P95 Latency: 6877.57 ms
  TTFT Mean: 687.12 ms
  TPOT Mean: 56.44 ms
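
The reported means can be cross-checked with a little arithmetic. The sketch below assumes that total latency is the sum of TTFT and TGT, and that TPOT is TGT spread over the generated tokens; both definitions are inferred from the metric names rather than confirmed by the measurement code, and the ratio of two means is only a rough estimate of the mean token count.

```python
# Reported means from the latency run above, in milliseconds.
ttft_mean_ms = 687.12
tgt_mean_ms = 5038.86
tpot_mean_ms = 56.44
total_mean_ms = 5725.98

# Assumption: total latency = time to first token + remaining generation time.
reconstructed_total_ms = ttft_mean_ms + tgt_mean_ms

# Assumption: TPOT is TGT divided by the number of generated tokens, so the
# ratio of the means gives a rough (not exact) tokens-per-response estimate.
approx_tokens_per_response = tgt_mean_ms / tpot_mean_ms

print(f"TTFT + TGT = {reconstructed_total_ms:.2f} ms (reported total: {total_mean_ms:.2f} ms)")
print(f"Approx. tokens per response: {approx_tokens_per_response:.1f}")
print(f"Share of latency spent decoding: {tgt_mean_ms / total_mean_ms:.1%}")
```

Under these assumptions, most of the latency comes from decoding roughly 89 tokens per response, which suggests that a lower `max_new_tokens` or a stop condition would reduce total latency more than optimizing the prefill step.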


### 2. RUN THROUGHPUT EVALUATION

In [90]:
# Prepare throughput test data
throughput_test_data = latency_test_data
print("First question:",throughput_test_data[0])

First question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?


In [91]:
throughput_results = evaluator.measure_throughput(
    test_inputs=throughput_test_data,
)


📊 Measuring Throughput...
  Batch size 1:
    RPS (Requests/sec): 0.16
    TPS (Tokens/sec): 15.91
  Batch size 4:
    RPS (Requests/sec): 0.33
    TPS (Tokens/sec): 33.48
  Batch size 8:
    RPS (Requests/sec): 0.51
    TPS (Tokens/sec): 50.76
  Batch size 16:
    RPS (Requests/sec): 0.69
    TPS (Tokens/sec): 68.87


In [92]:
# Print a summary of the throughput results
print(f"\n📊 Throughput Results Summary:")
for batch_size, metrics in throughput_results.items():
    print(f"  Batch {batch_size}: {metrics['rps']:.2f} req/s, {metrics['tps']:.2f} tokens/s")


📊 Throughput Results Summary:
  Batch 1: 0.16 req/s, 15.91 tokens/s
  Batch 4: 0.33 req/s, 33.48 tokens/s
  Batch 8: 0.51 req/s, 50.76 tokens/s
  Batch 16: 0.69 req/s, 68.87 tokens/s
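
To see how well batching scales, the reported tokens/sec can be turned into a speedup relative to batch size 1 and a per-sequence efficiency (speedup divided by batch size). This is a small sketch over the numbers printed above; the `scaling` dictionary is a name introduced here for illustration. Note also that `measure_throughput` counts `new_tokens_batch.shape[1]` tokens for every sequence, so padded positions after an early stop are included and TPS comes out at roughly RPS × 100.

```python
# Tokens/sec reported above, keyed by batch size.
reported_tps = {1: 15.91, 4: 33.48, 8: 50.76, 16: 68.87}

baseline_tps = reported_tps[1]
scaling = {}
for batch_size, tps in reported_tps.items():
    speedup = tps / baseline_tps          # throughput gain over batch size 1
    efficiency = speedup / batch_size     # 1.0 would be perfect linear scaling
    scaling[batch_size] = (speedup, efficiency)
    print(f"batch {batch_size:>2}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Throughput keeps rising with batch size, but each doubling yields less than double the tokens/sec, so the best batch size depends on how much per-request latency the deployment can tolerate.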


### 3. RUN RESOURCE USAGE EVALUATION

In [93]:
# Use the first question as the resource-usage test sample
resource_usage_test_sample = latency_test_data[0]
resource_usage_test_sample

'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?'

In [94]:
resource_results = evaluator.measure_resource_usage(
    test_input=resource_usage_test_sample
)


📊 Measuring Resource Usage...
  Memory Before: 5731.42 MB
  Memory After: 2669.65 MB
  Memory Peak: 5731.65 MB
  Memory Used (Peak - Before): 0.24 MB


In [95]:
# Print a summary of the resource usage results
print(f"\n📊 Resource Usage Summary:")
print(f"  Peak Memory: {resource_results.get('memory_peak_mb', 0):.2f} MB")
print(f"  Memory Used: {resource_results.get('memory_used_mb', 0):.2f} MB")


📊 Resource Usage Summary:
  Peak Memory: 5731.65 MB
  Memory Used: 0.24 MB
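
The numbers above deserve a second look: allocated memory after generation (2669.65 MB) is about 3 GB lower than before it (5731.42 MB), so the 0.24 MB "used" figure probably does not reflect the real cost of inference. One plausible reading is that tensors left over from earlier cells were still allocated when the baseline was taken and were freed during the run. The sketch below shows one way to take a cleaner baseline by collecting garbage first; `measure_peak_cuda_memory_mb`, `run_inference`, and `baseline_looks_stale` are hypothetical names for this example, and the measurement function needs PyTorch with a CUDA device.

```python
import gc


def measure_peak_cuda_memory_mb(run_inference):
    """Return (before, after, peak) allocated CUDA memory in MB around a callable.

    `run_inference` is a hypothetical zero-argument callable wrapping the
    generate() call to profile. Requires PyTorch with a CUDA device.
    """
    import torch  # imported lazily so the helper below also runs without a GPU

    gc.collect()                          # drop unreachable tensors from earlier cells
    torch.cuda.empty_cache()              # release cached blocks held by the allocator
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # peak tracking now starts at the clean baseline
    before = torch.cuda.memory_allocated() / 1024**2

    with torch.no_grad():
        run_inference()
    torch.cuda.synchronize()

    after = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    return before, after, peak


def baseline_looks_stale(before_mb, after_mb, tolerance_mb=1.0):
    """Flag runs where allocated memory dropped during inference."""
    return after_mb < before_mb - tolerance_mb


# Numbers reported above: allocated memory fell by about 3 GB while generating.
print(baseline_looks_stale(5731.42, 2669.65))
```

When the check flags a stale baseline, `peak - before` understates the memory that generation needs, and rerunning the measurement from a clean state gives a more trustworthy figure.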


### 4. RUN QUALITY EVALUATION

In [96]:
# Prepare quality test data with 50 samples
test_data_quality = [
    {'input': question, 'expected_output': completion}
    for question, completion in zip(dataset['question'][:50], dataset['answer'][:50])
]

test_data_quality[0]

{'input': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'expected_output': 'NVIDIA initially focused on PC graphics.'}

In [97]:
quality_results = evaluator.evaluate_quality(
    test_data=test_data_quality
)


📊 Evaluating Quality...


Generating predictions: 100%|██████████| 50/50 [04:44<00:00,  5.70s/it]


  BLEU Score: 0.3955
  Computing Cosine Similarity...
  Cosine Similarity: 0.3147
  Computing BERTScore...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  BERTScore F1: 0.8662
  BERTScore Precision: 0.8482
  BERTScore Recall: 0.8857
  Computing Bi-Encoder Similarity...
  Bi-Encoder Similarity: 0.6621
  Exact Match: 0.00%
  Partial Match: 0.00%


In [98]:
# Print a summary of the quality results
print(f"\n📊 Quality Results Summary:")
print(f"  BLEU Score: {quality_results.get('bleu', 0):.4f}")
print(f"  Cosine Similarity: {quality_results.get('cosine_similarity', 0):.4f}")
print(f"  BERTScore F1: {quality_results.get('bertscore_f1', 0):.4f}")
print(f"  Bi-Encoder Similarity: {quality_results.get('bi_encoder_similarity', 0):.4f}")
print(f"  Exact Match: {quality_results.get('exact_match', 0):.2%}")
print(f"  Partial Match: {quality_results.get('partial_match', 0):.2%}")

# Print sample predictions
print(f"\n📋 Sample Predictions:")
for i, sample in enumerate(quality_results['samples'][:3]):
    print(f"\n  Sample {i+1}:")
    print(f"    Input: {sample['input'][:80]}...")
    print(f"    Expected: {sample['expected'][:80]}...")
    print(f"    Predicted: {sample['predicted'][:80]}...")


📊 Quality Results Summary:
  BLEU Score: 0.3955
  Cosine Similarity: 0.3147
  BERTScore F1: 0.8662
  Bi-Encoder Similarity: 0.6621
  Exact Match: 0.00%
  Partial Match: 0.00%

📋 Sample Predictions:

  Sample 1:
    Input: What area did NVIDIA initially focus on before expanding to other computationall...
    Expected: NVIDIA initially focused on PC graphics....
    Predicted: 

A) Mobile devices
B) Desktop computers
C) Gaming
D) Scientific computing

Answ...

  Sample 2:
    Input: What are some of the recent applications of GPU-powered deep learning as mention...
    Expected: Recent applications of GPU-powered deep learning include recommendation systems,...
    Predicted: 

## Answer:
NVIDIA has been at the forefront of GPU-powered deep learning, with...

  Sample 3:
    Input: What significant invention did NVIDIA create in 1999?...
    Expected: NVIDIA invented the GPU in 1999....
    Predicted: 

NVIDIA created the GeForce 256, the first GPU with pixel shading capability, 
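
Both match rates are 0%, partly because the comparison is strict: exact match only strips whitespace and lowercases, and partial match requires the full reference string, punctuation included, to appear verbatim in the prediction. A minimal sketch of a more forgiving comparison follows. The helpers `normalize_answer` and `match_flags` are hypothetical, and the first example prediction is invented for illustration rather than taken from the model's output.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation, and collapse whitespace (hypothetical helper)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def match_flags(prediction, reference):
    """Return (exact, partial) match flags after normalization."""
    pred = normalize_answer(prediction)
    ref = normalize_answer(reference)
    exact = pred == ref
    partial = bool(ref) and ref in pred
    return exact, partial


# Invented prediction: differs from the reference only in whitespace and punctuation.
print(match_flags("\n\nNVIDIA invented the GPU in 1999!", "NVIDIA invented the GPU in 1999."))

# Prediction shaped like Sample 1 above: still no match after normalization.
print(match_flags("\n\nA) Mobile devices\nB) Desktop computers", "NVIDIA initially focused on PC graphics."))
```

Normalization removes formatting noise but cannot rescue answers that are simply different, such as the multiple-choice continuation in Sample 1; those point to a prompt formatting issue rather than a scoring one.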

### 5. GENERATE COMPREHENSIVE REPORT

In [99]:
report_path = './finetuned_phi3_evaluation_report.html'
evaluator.generate_report(save_path=report_path)

print(f"\n✅ HTML Report saved to: {report_path}")


📝 Generating comprehensive report...
✅ Report saved to: ./finetuned_phi3_evaluation_report.html

✅ HTML Report saved to: ./finetuned_phi3_evaluation_report.html


### 6. PRODUCTION READINESS CHECK

In [100]:
readiness_results = evaluator.production_readiness_check()


🔍 Production Readiness Check...

❌ FAIL | latency_p95: 6877.57 ms (threshold: < 500 ms)
✅ PASS | quality_bleu: 0.3955 (threshold: > 0.3)
✅ PASS | bertscore_f1: 0.8662 (threshold: > 0.7)
❌ FAIL | bi_encoder_similarity: 0.6621 (threshold: > 0.7)
✅ PASS | memory_usage: 5731.65 MB (threshold: < 8000 MB)

⚠️  MODEL NEEDS OPTIMIZATION BEFORE PRODUCTION
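
The thresholds in `production_readiness_check` are hard-coded, which makes them awkward to tune per deployment. As a rough sketch of an alternative, the same checks can be driven from a small table of (comparison, limit) pairs. The `THRESHOLDS` layout and the `run_checks` helper are assumptions made for this example; the limits and measured values are the ones shown above.

```python
import operator

# Limits mirror the hard-coded values in production_readiness_check();
# the (comparison, limit) layout is an illustrative assumption.
THRESHOLDS = {
    "latency_p95_ms": (operator.lt, 500),
    "bleu": (operator.gt, 0.3),
    "bertscore_f1": (operator.gt, 0.7),
    "bi_encoder_similarity": (operator.gt, 0.7),
    "memory_peak_mb": (operator.lt, 8000),
}


def run_checks(measured, thresholds=THRESHOLDS):
    """Return {metric: passed} for every metric that has a threshold."""
    return {
        name: compare(measured[name], limit)
        for name, (compare, limit) in thresholds.items()
    }


# Values reported by the readiness check above.
measured = {
    "latency_p95_ms": 6877.57,
    "bleu": 0.3955,
    "bertscore_f1": 0.8662,
    "bi_encoder_similarity": 0.6621,
    "memory_peak_mb": 5731.65,
}
results = run_checks(measured)
failed = [name for name, ok in results.items() if not ok]
print("Failed checks:", failed)
```

Keeping the limits in data rather than code makes it easy to pass a different table for, say, a batch job where a 500 ms latency budget does not apply.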


### Save results into wandb

In [78]:
wandb.init()

In [79]:
# Log all evaluator results
import pandas as pd

wandb.log({
    # Latency - all metrics
    "latency/total_mean_ms": evaluator.results['latency']['total_mean_ms'],
    "latency/total_median_ms": evaluator.results['latency']['total_median_ms'],
    "latency/total_p95_ms": evaluator.results['latency']['total_p95_ms'],
    "latency/total_p99_ms": evaluator.results['latency']['total_p99_ms'],
    "latency/std_ms": evaluator.results['latency']['std_ms'],
    "latency/min_ms": evaluator.results['latency']['min_ms'],
    "latency/max_ms": evaluator.results['latency']['max_ms'],
    "latency/ttft_mean_ms": evaluator.results['latency']['ttft_mean_ms'],
    "latency/ttft_median_ms": evaluator.results['latency']['ttft_median_ms'],
    "latency/ttft_p95_ms": evaluator.results['latency']['ttft_p95_ms'],
    "latency/tgt_mean_ms": evaluator.results['latency']['tgt_mean_ms'],
    "latency/tgt_median_ms": evaluator.results['latency']['tgt_median_ms'],
    "latency/tpot_mean_ms": evaluator.results['latency']['tpot_mean_ms'],
    "latency/tpot_median_ms": evaluator.results['latency']['tpot_median_ms'],
    "latency/tpot_p95_ms": evaluator.results['latency']['tpot_p95_ms'],

    # Throughput - all batch sizes
    "throughput/batch_1_rps": evaluator.results['throughput'][1]['rps'],
    "throughput/batch_1_tps": evaluator.results['throughput'][1]['tps'],
    "throughput/batch_4_rps": evaluator.results['throughput'][4]['rps'],
    "throughput/batch_4_tps": evaluator.results['throughput'][4]['tps'],
    "throughput/batch_8_rps": evaluator.results['throughput'][8]['rps'],
    "throughput/batch_8_tps": evaluator.results['throughput'][8]['tps'],
    "throughput/batch_16_rps": evaluator.results['throughput'][16]['rps'],
    "throughput/batch_16_tps": evaluator.results['throughput'][16]['tps'],
    "throughput/best_rps": max([v['rps'] for v in evaluator.results['throughput'].values()]),
    "throughput/best_tps": max([v['tps'] for v in evaluator.results['throughput'].values()]),

    # Quality - all metrics
    "quality/bleu": evaluator.results['quality']['bleu'],
    "quality/cosine_similarity": evaluator.results['quality']['cosine_similarity'],
    "quality/bertscore_precision": evaluator.results['quality']['bertscore_precision'],
    "quality/bertscore_recall": evaluator.results['quality']['bertscore_recall'],
    "quality/bertscore_f1": evaluator.results['quality']['bertscore_f1'],
    "quality/bi_encoder_similarity": evaluator.results['quality']['bi_encoder_similarity'],
    "quality/exact_match": evaluator.results['quality']['exact_match'],
    "quality/partial_match": evaluator.results['quality']['partial_match'],

    # Resource - all metrics
    "resource/memory_before_mb": evaluator.results['resource']['memory_before_mb'],
    "resource/memory_after_mb": evaluator.results['resource']['memory_after_mb'],
    "resource/memory_peak_mb": evaluator.results['resource']['memory_peak_mb'],
    "resource/memory_used_mb": evaluator.results['resource']['memory_used_mb'],
})

# Log predictions table
samples_df = pd.DataFrame(evaluator.results['quality']['samples'])
wandb.log({"predictions": wandb.Table(dataframe=samples_df)})
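
The hand-written dictionary above has to be updated whenever a metric is added. A minimal sketch of a generic alternative is to flatten the nested results into slash-separated keys. `flatten_results` is a hypothetical helper, and the toy `example_results` stands in for `evaluator.results`. The generated throughput keys look like `throughput/1/rps` rather than `throughput/batch_1_rps`, and only plain Python `int` and `float` leaves (including subclasses such as `np.float64`) are kept.

```python
def flatten_results(results, prefix=""):
    """Flatten nested dicts into 'section/key' pairs, keeping numeric leaves only."""
    flat = {}
    for key, value in results.items():
        name = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_results(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
        # Non-numeric leaves (for example the list of sample predictions) are skipped.
    return flat


# Toy stand-in for evaluator.results, using values reported earlier.
example_results = {
    "latency": {"total_mean_ms": 5725.98, "total_p95_ms": 6877.57},
    "throughput": {1: {"rps": 0.16, "tps": 15.91}},
    "quality": {"bleu": 0.3955, "samples": [{"input": "example question"}]},
}
flat = flatten_results(example_results)
print(flat)
```

In the notebook, the resulting dictionary could be passed to `wandb.log` in place of the explicit key list, with the samples table still logged separately.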

In [80]:
# Finish the wandb run
wandb.finish()

Run history:
latency/max_ms           ▁
latency/min_ms           ▁
latency/std_ms           ▁
latency/tgt_mean_ms      ▁
latency/tgt_median_ms    ▁
latency/total_mean_ms    ▁
latency/total_median_ms  ▁
latency/total_p95_ms     ▁
latency/total_p99_ms     ▁
latency/tpot_mean_ms     ▁

Run summary:
latency/max_ms           18137.65669
latency/min_ms           1313.02953
latency/std_ms           2579.27504
latency/tgt_mean_ms      6222.37054
latency/tgt_median_ms    5863.78845
latency/total_mean_ms    6763.44623
latency/total_median_ms  6373.68309
latency/total_p95_ms     12547.97881
latency/total_p99_ms     14830.65121
latency/tpot_mean_ms     70.55164


# Merge, Save, and Package the Final Model

Our work is not complete until the model is saved and ready for deployment. The fine-tuning process created lightweight LoRA "adapter" weights, which are separate from the original base model. For easy, portable inference, we need to merge these adapters back into the base model to create a single, unified set of weights.

This final step effectively "bakes in" our specialized financial knowledge.

1.  **Merge and Unload**: We call `ft_model.merge_and_unload()`. This method comes from the PEFT library and performs two actions:
    * **Merges** the trained LoRA weights directly into the base model's attention and MLP layers.
    * **Unloads** the PEFT wrapper, returning a standard Hugging Face `PreTrainedModel` object. This new object is a complete, standalone model that doesn't require the `peft` library for inference.
2.  **Save the Model**: We use the standard `save_pretrained` method to save our new, merged model. We specify `safe_serialization=True` to use the modern and secure `safetensors` format.
3.  **Save the Tokenizer**: Crucially, we also save the tokenizer in the same directory. The model weights and the tokenizer are a pair; you need both to correctly run inference.
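
To make the merge step concrete, here is a minimal numeric sketch assuming the standard LoRA formulation, in which the merged weight is `W + (alpha / rank) * B @ A`. The matrices and helper functions are made up for illustration and use plain Python lists; PEFT's actual implementation additionally handles dtypes, quantized layers, and multiple adapters.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: 2 outputs, 3 inputs, rank-1 adapter (all values are made up).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # base weight, shape (2, 3)
A = [[0.5, 0.0, 1.0]]                     # LoRA A, shape (1, 3)
B = [[2.0], [4.0]]                        # LoRA B, shape (2, 1)
alpha, rank = 2, 1
x = [[1.0], [2.0], [3.0]]                 # input column vector, shape (3, 1)

# Adapter path: base output plus the scaled low-rank correction.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / rank) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: a single matrix multiply with the merged weights.
merged_out = matmul(merge_lora(W, A, B, alpha, rank), x)
print(adapter_out, merged_out)
```

Both paths give the same output, which is why the merged model no longer needs the adapter weights or the `peft` wrapper at inference time.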

With the first light of dawn approaching over the Giza plateau, our final, expert financial analyst model is now serialized to disk, ready to be uploaded, shared, and deployed.

In [None]:
import os

# Create directory for the complete fine-tuned model
save_directory = "Data/complete_finetuned_model"
os.makedirs(save_directory, exist_ok=True)

print("Merging LoRA adapter with base model...")
# Merge the LoRA adapter with the base model
merged_model = ft_model.merge_and_unload()

Merging LoRA adapter with base model...




In [None]:
print("Saving merged model and tokenizer...")

# Save the complete merged model
merged_model.save_pretrained(
    save_directory,
    safe_serialization=True,  # Use safetensors format (recommended)
    max_shard_size="2GB"      # Split large models into 2GB chunks
)

# Save the tokenizer
tokenizer.save_pretrained(save_directory)

Saving merged model and tokenizer...


('Data/complete_finetuned_model/tokenizer_config.json',
 'Data/complete_finetuned_model/special_tokens_map.json',
 'Data/complete_finetuned_model/chat_template.jinja',
 'Data/complete_finetuned_model/tokenizer.model',
 'Data/complete_finetuned_model/added_tokens.json',
 'Data/complete_finetuned_model/tokenizer.json')