# 🚀 RunPod GPU Setup

**This notebook is optimized for RunPod GPU pods with NVIDIA GPUs**

## Quick Start on RunPod:

1. **Launch a GPU Pod** (RTX 3090, 4090, or A5000 recommended)
2. **Upload this notebook** to the pod
3. **Upload test data** (`test_run_20251106.jsonl`) to `/workspace/data/`
4. **Run cells in order** - evaluation should complete in ~5-10 minutes

## Expected Performance:
- **GPU**: RTX 3090/4090 → ~0.5-1 sec/sample (~5 min total)
- **GPU**: RTX A5000 → ~1-2 sec/sample (~10 min total)
- **Full evaluation**: 300 samples
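
The totals above come from multiplying the per-sample latency by the 300 test samples. A quick sanity check of that arithmetic, assuming strictly sequential inference (the helper name `estimate_minutes` is illustrative and not part of the notebook):

```python
def estimate_minutes(num_samples: int, seconds_per_sample: float) -> float:
    """Return the expected wall-clock time in minutes for sequential inference."""
    return num_samples * seconds_per_sample / 60


# Per-sample latency ranges quoted above; 300 is the full test set size.
latency_ranges = {"RTX 3090/4090": (0.5, 1.0), "RTX A5000": (1.0, 2.0)}
for gpu, (low, high) in latency_ranges.items():
    print(f"{gpu}: {estimate_minutes(300, low):.1f}-{estimate_minutes(300, high):.1f} min")
```

Model download and loading time is not included in this estimate.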

---

# Medical NER Model Evaluation

This notebook evaluates the fine-tuned Llama 3.2 3B medical NER model.

## ✅ DATASET VERIFIED & READY FOR EVALUATION

**Current Dataset Distribution** (from `both_rel_instruct_all.jsonl`):
- **1,000 Chemical extraction** examples (25%)
- **2,000 Disease extraction** examples (50%) ⚠️ Intentionally 2x more
- **1,000 Relationship extraction** examples (25%)

**Data Splits Status**: ✅ Properly stratified using `stratify=` parameter
- Training (2,400): 25% chemical, 50% disease, 25% relationship
- Validation (300+): 25% chemical, 50% disease, 25% relationship
- Test (300+): 25% chemical, 50% disease, 25% relationship

**Why Disease is 2x more**:
- The original dataset has twice as many disease extraction examples
- Stratified splitting preserves this 25/50/25 distribution
- This appears intentional for better disease NER performance
- All splits are properly balanced relative to the source data
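
To illustrate what stratification guarantees, here is a minimal standard-library sketch. It is not the training notebook's code (which, per the note above, relies on a `stratify=` parameter); the function names, the 80/10/10 fractions, and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split example indices into train/val/test, preserving label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


def distribution(labels, indices):
    """Return the share of each label within the selected indices."""
    counts = Counter(labels[i] for i in indices)
    return {label: count / len(indices) for label, count in counts.items()}


# Toy labels mirroring the 25/50/25 task mix described above.
labels = ["chemical"] * 1000 + ["disease"] * 2000 + ["relationship"] * 1000
train_idx, val_idx, test_idx = stratified_split(labels)
print(distribution(labels, test_idx))
```

Because each task type is split separately, every split keeps the same task mix as the source data, and no index can appear in more than one split.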

**Next Steps**:
1. ✅ Training data is properly split with stratification
2. ✅ No data leakage between train/val/test
3. ✅ Update `HF_MODEL_ID` below with your trained model ID
4. ✅ Run this evaluation notebook on the balanced test set

---

## Prerequisites:
1. Complete training in `Medical_NER_Fine_Tuning.ipynb` (uses stratified splits!)
2. Model saved to `./final_model` or uploaded to HuggingFace Hub
3. Test data (`test_run_20251106.jsonl`) available in the notebook's working directory or in `../data/`
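
Before loading the model, it can help to confirm that the test file is well-formed. The sketch below assumes each line is a JSON object with `prompt` and `completion` keys, which is what the evaluation cells in this notebook read; `validate_jsonl` is a hypothetical helper, not part of the notebook.

```python
import json
from pathlib import Path


def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Return the number of records, raising ValueError on the first malformed line."""
    count = 0
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            count += 1
    return count
```

For example, `validate_jsonl("test_run_20251106.jsonl")` returns the record count when every line parses and has both keys.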

## Evaluation Tasks:
1. Load the fine-tuned model
2. Evaluate on test set (25% chem, 50% disease, 25% relationship)
3. Calculate precision, recall, F1 scores per task type
4. Test on custom medical texts
5. Analyze errors and false positives
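
Task 3 calls for scores per task type, but the test records only carry `prompt` and `completion`. One possible approach, sketched below, is to infer the task from the instruction wording and accumulate counts per task. The phrases checked in `infer_task` are assumptions based on the prompts shown in this notebook, and both function names are hypothetical.

```python
from collections import defaultdict


def infer_task(prompt: str) -> str:
    """Guess the task type from the first line of the prompt (heuristic)."""
    instruction = prompt.split("\n", 1)[0].lower()
    if "relationship" in instruction or "pairs" in instruction:
        return "relationship"
    if "only of the chemicals" in instruction:
        return "chemical"
    if "only of the diseases" in instruction:
        return "disease"
    return "unknown"


def per_task_scores(results):
    """results: iterable of (prompt, n_correct, n_predicted, n_expected) tuples."""
    totals = defaultdict(lambda: [0, 0, 0])
    for prompt, correct, predicted, expected in results:
        bucket = totals[infer_task(prompt)]
        bucket[0] += correct
        bucket[1] += predicted
        bucket[2] += expected

    scores = {}
    for task, (correct, predicted, expected) in totals.items():
        precision = correct / predicted if predicted else 0.0
        recall = correct / expected if expected else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[task] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Prompts that match none of the phrases land in an `unknown` bucket, which makes gaps in the heuristic visible rather than silently misattributing them.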

## 0. Environment Variables Setup

⚠️ **IMPORTANT**: Set your credentials before running the notebook!

**Note**: `hf_transfer` is enabled for faster downloads from HuggingFace Hub.

In [1]:
import os
from getpass import getpass

# Enable hf_transfer for faster downloads from HuggingFace Hub
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# HuggingFace Token (required to download your model from Hub)
# Get your token from: https://huggingface.co/settings/tokens
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    print("HF_TOKEN not found in environment variables")
    hf_token = getpass("Enter your HuggingFace token: ")
    os.environ["HF_TOKEN"] = hf_token
else:
    print("✓ HF_TOKEN loaded from environment")

# Weights & Biases API Key (optional - only if tracking evaluation metrics)
# Get your key from: https://wandb.ai/authorize
wandb_key = os.getenv("WANDB_API_KEY")
if wandb_key:
    print("✓ WANDB_API_KEY loaded from environment")
else:
    print("ℹ WANDB_API_KEY not set (optional)")

print("\n✓ Environment variables configured")
print(f"  HF_HUB_ENABLE_HF_TRANSFER: {os.getenv('HF_HUB_ENABLE_HF_TRANSFER')}")

HF_TOKEN not found in environment variables


Enter your HuggingFace token:  ········


ℹ WANDB_API_KEY not set (optional)

✓ Environment variables configured
  HF_HUB_ENABLE_HF_TRANSFER: 1


## 1. Setup and Installation


In [2]:
# Install the evaluation dependencies (PyTorch itself is not installed by this cell)
!pip install -q transformers datasets peft accelerate bitsandbytes
!pip install -q huggingface-hub tokenizers hf-transfer

print("✓ All packages installed successfully!")
print("  - transformers (HuggingFace models)")
print("  - peft (LoRA adapters)")
print("  - accelerate (device management)")
print("  - bitsandbytes (quantization)")
print("  - hf-transfer (fast downloads)")

✓ All packages installed successfully!
  - transformers (HuggingFace models)
  - peft (LoRA adapters)
  - accelerate (device management)
  - bitsandbytes (quantization)
  - hf-transfer (fast downloads)


## 2. Import Libraries


In [3]:

import json
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from huggingface_hub import login

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.8.0+cu128
CUDA available: True
GPU: NVIDIA RTX 2000 Ada Generation


## 3. Configuration

⚠️ **Update these paths** to match your model location!


In [4]:
# Model configuration
BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# ⚠️ IMPORTANT: Update with YOUR HuggingFace model ID
# Find it at: https://huggingface.co/your-username
# Format: "your-username/llama3-medical-ner-lora-YYYYMMDD_HHMMSS"
HF_MODEL_ID = "albyos/llama3-medical-ner-checkpoint-450-20251106_145403"  # ← UPDATE THIS!

# Alternative: Use local model if you prefer
USE_HF_HUB = True  # Set to False to use local ../final_model
PROJECT_ROOT = Path.cwd().parent
LOCAL_MODEL_PATH = PROJECT_ROOT / "final_model"

ADAPTER_PATH = HF_MODEL_ID if USE_HF_HUB else str(LOCAL_MODEL_PATH)

# Data configuration
# Candidate locations are checked in order; the first existing file is used.
# RunPod: upload the file to /workspace/data/; local: place it in ../data/
TEST_DATA_FILENAME = "test_run_20251106.jsonl"
TEST_DATA_CANDIDATES = [
    Path(TEST_DATA_FILENAME),                         # notebook working directory
    Path("/workspace/data") / TEST_DATA_FILENAME,     # RunPod upload location
    Path.cwd().parent / "data" / TEST_DATA_FILENAME,  # local checkout (../data/)
    # Author's machine-specific fallback; replace or remove on other systems
    Path("/Users/alberto/projects/courses/building_llms/ch_10_fine_tuning/data") / TEST_DATA_FILENAME,
]
# Fall back to the first candidate so the existence check below reports a usable path
TEST_DATA_PATH = next((p for p in TEST_DATA_CANDIDATES if p.exists()), TEST_DATA_CANDIDATES[0])

# Verify test data exists
if not TEST_DATA_PATH.exists():
    print(f"❌ Test data not found at {TEST_DATA_PATH}")
    print(f"💡 RunPod: Upload to /workspace/data/test_run_20251106.jsonl")
    print(f"💡 Local: Place in ../data/test_run_20251106.jsonl")
    raise FileNotFoundError(f"Test data file not found: {TEST_DATA_PATH}")

print("✓ Configuration loaded")
print(f"  Base model: {BASE_MODEL}")
print(f"  Adapter source: {'HuggingFace Hub' if USE_HF_HUB else 'Local filesystem'}")
print(f"  Adapter path: {ADAPTER_PATH}")
print(f"  Test data: {TEST_DATA_PATH}")
print(f"  Test data exists: {TEST_DATA_PATH.exists()}")

✓ Configuration loaded
  Base model: meta-llama/Llama-3.2-3B-Instruct
  Adapter source: HuggingFace Hub
  Adapter path: albyos/llama3-medical-ner-checkpoint-450-20251106_145403
  Test data: test_run_20251106.jsonl
  Test data exists: True


## 4. Authenticate with Hugging Face

Log into Hugging Face to download the LoRA adapter when `USE_HF_HUB` is enabled.

In [5]:
# Login to HuggingFace Hub to access your model
import os
from huggingface_hub import login

hf_token = os.environ.get("HF_TOKEN")

if not hf_token:
    print("❌ HF_TOKEN not found in environment")
    print("   Please run the Environment Variables Setup cell (section 0) first to set your HF token")
    raise ValueError("HF_TOKEN is required to download model from HuggingFace Hub")

# Login to HuggingFace
login(token=hf_token, add_to_git_credential=True)

print("✓ Logged into Hugging Face Hub")
print(f"  Will load model from: {HF_MODEL_ID}")

Token has not been saved to git credential helper.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
✓ Logged into Hugging Face Hub
  Will load model from: albyos/llama3-medical-ner-checkpoint-450-20251106_145403


## 5. Load the Fine-Tuned Model

Load the base model and attach the LoRA adapter from either Hugging Face Hub or your local filesystem.

**Note**: Using `hf_transfer` for faster downloads from HuggingFace Hub.

In [6]:
# Ensure hf_transfer is enabled for faster downloads
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Load the fine-tuned model for inference
print("="*80)
print("LOADING FINE-TUNED MODEL")
print("="*80)

print(f"\nLoading base model: {BASE_MODEL}...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

print(f"✓ Tokenizer loaded")

# Check for GPU support (optimized for RunPod/CUDA)
if torch.cuda.is_available():
    device = "cuda"
    print(f"🚀 NVIDIA GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
elif torch.backends.mps.is_available():
    device = "mps"
    print(f"🚀 Apple Silicon GPU (MPS) detected")
else:
    device = "cpu"
    print(f"⚠️  No GPU detected, using CPU (very slow)")

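# Load base model with GPU acceleration
# On RunPod: Uses CUDA with float16 for optimal performance
# Note: the recorded output below shows this transformers version warning that
# `torch_dtype` is deprecated in favor of `dtype`; older versions only accept `torch_dtype`.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",  # Automatically place the model on the available device(s)
    low_cpu_mem_usage=True,
)

print(f"\n✓ Base model loaded: {BASE_MODEL}")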
print(f"  Device: {device.upper()}")
print(f"  Precision: {base_model.dtype}")
if device == "cuda":
    print(f"  GPU Memory Used: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Load LoRA adapter from HuggingFace Hub or local path
print(f"\nLoading LoRA adapter from: {ADAPTER_PATH}...")
print(f"  Using hf_transfer for faster downloads...")

model = PeftModel.from_pretrained(
    base_model,
    ADAPTER_PATH,
)
model.eval()

print(f"\n✓ Fine-tuned model loaded successfully!")
print(f"  Base: {BASE_MODEL}")
print(f"  LoRA adapter: {ADAPTER_PATH}")
print(f"  Source: {'HuggingFace Hub' if USE_HF_HUB else 'Local filesystem'}")

LOADING FINE-TUNED MODEL

Loading base model: meta-llama/Llama-3.2-3B-Instruct...


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

✓ Tokenizer loaded
🚀 NVIDIA GPU detected: NVIDIA RTX 2000 Ada Generation
   GPU Memory: 16.8 GB


config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]


✓ Base model loaded: meta-llama/Llama-3.2-3B-Instruct
  Device: CUDA
  Precision: torch.float16
  GPU Memory Used: 6.43 GB

Loading LoRA adapter from: albyos/llama3-medical-ner-checkpoint-450-20251106_145403...
  Using hf_transfer for faster downloads...


adapter_config.json:   0%|          | 0.00/944 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/97.3M [00:00<?, ?B/s]


✓ Fine-tuned model loaded successfully!
  Base: meta-llama/Llama-3.2-3B-Instruct
  LoRA adapter: albyos/llama3-medical-ner-checkpoint-450-20251106_145403
  Source: HuggingFace Hub


In [7]:
def generate_response(prompt_text, max_new_tokens=128):
    """Generate a response for a given prompt - OPTIMIZED FOR SPEED."""
    formatted_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a medical NER expert. Extract the requested entities from medical texts accurately.<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # Default of 128 caps output length, and therefore latency
            temperature=0.3,  # Low temperature: more focused, less random sampling
            do_sample=True,  # Sampling is stochastic, so outputs can differ between runs
            top_k=50,  # Sample only from the 50 most likely tokens
            top_p=0.9,  # Nucleus sampling: keep tokens covering 90% cumulative probability
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=True,  # Enable KV cache for faster generation
        )
    
    # Decode only the newly generated tokens so the echoed prompt never leaks
    # into the response (more robust than splitting the text on "assistant")
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    return response.strip()

print("✓ Inference function ready - OPTIMIZED FOR GPU")
print("  Generation parameters:")
print("    - max_new_tokens: 128 (optimal for NER tasks)")
print("    - temperature: 0.3 (focused, low-randomness sampling)")
print("    - top_k: 50, top_p: 0.9 (balanced sampling)")
print("    - use_cache: True (KV cache for speed)")
print("\n  Expected speed on RunPod GPU:")
print("    - RTX 3090/4090: ~0.5-1 second per sample")
print("    - RTX A5000: ~1-2 seconds per sample")
print("    - Full evaluation (300 samples): ~5-10 minutes")


✓ Inference function ready - OPTIMIZED FOR GPU
  Generation parameters:
    - max_new_tokens: 128 (optimal for NER tasks)
    - temperature: 0.3 (focused, low-randomness sampling)
    - top_k: 50, top_p: 0.9 (balanced sampling)
    - use_cache: True (KV cache for speed)

  Expected speed on RunPod GPU:
    - RTX 3090/4090: ~0.5-1 second per sample
    - RTX A5000: ~1-2 seconds per sample
    - Full evaluation (300 samples): ~5-10 minutes
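
Because `generate_response` samples (`do_sample=True`), two runs over the same test set can yield different predictions and therefore different metrics. One option for reproducible evaluation is greedy decoding. The sketch below only builds the keyword arguments; `build_generation_kwargs` is a hypothetical helper, and whether greedy decoding helps or hurts extraction quality for this model is untested here.

```python
def build_generation_kwargs(deterministic: bool = True, max_new_tokens: int = 128) -> dict:
    """Return keyword arguments intended for `model.generate(**inputs, **kwargs)`."""
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": 1.2,  # same value as in generate_response
        "use_cache": True,
    }
    if deterministic:
        # Greedy decoding: the same prompt always produces the same output.
        kwargs["do_sample"] = False
    else:
        # Sampling settings used by generate_response in this notebook.
        kwargs.update(do_sample=True, temperature=0.3, top_k=50, top_p=0.9)
    return kwargs


print(build_generation_kwargs(deterministic=True))
```

The sampling-only parameters (`temperature`, `top_k`, `top_p`) are left out in the deterministic case because they have no effect when sampling is disabled.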


## 6. Evaluate on the Held-Out Test Set

Run inference on the unseen test set and compute per-sample precision, recall, and F1 scores. With `num_test_samples = len(test_data)`, as set in the cell below, every test sample is evaluated.

**Note**: The evaluation now uses case-insensitive matching to handle capitalization differences.
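
The matching step can be sketched in isolation. The variant below is a slightly stricter parser than the notebook's `normalize_text`: it also strips the leading bullet and drops empty bullets such as a lone `-`, which the notebook's version would count as a predicted item. `parse_entities` is a hypothetical helper, and using it would shift the reported counts slightly.

```python
def parse_entities(text: str) -> set[str]:
    """Turn a bulleted model output into a set of normalized entity strings."""
    entities = set()
    for line in text.splitlines():
        # Drop surrounding whitespace, the leading bullet marker, and case.
        item = line.strip().lstrip("-*").strip().lower()
        if item:  # skip blank lines and empty bullets
            entities.add(item)
    return entities


expected = parse_entities("- vitamin D3\n- Calcium")
predicted = parse_entities("- vitamin D\n- Vitamin D3\n- calcium\n- milk fever\n-")
print(sorted(expected & predicted))  # matched entities
print(sorted(predicted - expected))  # extra predictions
```

Since both sides are sets, repeated predictions of the same entity count only once.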

In [8]:
# Test on COMPLETELY UNSEEN test samples
# The test set was not used for training OR validation monitoring
# Load test data
with open(TEST_DATA_PATH, 'r', encoding='utf-8') as f:
    test_data = [json.loads(line) for line in f]


num_test_samples = len(test_data)
print(f"Testing on {num_test_samples} samples from TEST SET")
print(f"Total test set size: {len(test_data)}")
print(f"\n⚠️  IMPORTANT:")
print(f"  - Training set (80%): Used for fine-tuning")
print(f"  - Validation set (10%): Monitored during training (W&B)")
print(f"  - Test set (10%): Used ONLY NOW for final evaluation")

# Aggregate metrics
total_correct = 0
total_predicted = 0
total_expected = 0

def normalize_text(text):
    """Normalize text for comparison: lowercase and strip whitespace."""
    return text.lower().strip()

for i, sample in enumerate(test_data[:num_test_samples]):
    print("\n" + "="*80)
    print(f"FINAL TEST EXAMPLE {i+1}/{num_test_samples}")
    print("="*80)
    
    # Show prompt (truncated for readability)
    print(f"\n📝 PROMPT:")
    prompt_preview = sample['prompt'][:250] + "..." if len(sample['prompt']) > 250 else sample['prompt']
    print(f"{prompt_preview}")
    
    # Show expected output
    print(f"\n✅ EXPECTED OUTPUT:")
    print(f"{sample['completion']}")
    
    # Generate prediction
    print(f"\n🤖 MODEL PREDICTION:")
    prediction = generate_response(sample['prompt'])
    print(f"{prediction}")
    
    # Calculate metrics with normalization for case-insensitive comparison
    expected_items = set([normalize_text(item) for item in sample['completion'].split('\n') if item.strip()])
    predicted_items = set([normalize_text(item) for item in prediction.split('\n') if item.strip()])
    
    common = expected_items & predicted_items
    missing = expected_items - predicted_items
    extra = predicted_items - expected_items
    
    # Update aggregate counts
    total_correct += len(common)
    total_predicted += len(predicted_items)
    total_expected += len(expected_items)
    
    # Per-sample metrics
    accuracy = len(common) / len(expected_items) * 100 if len(expected_items) > 0 else 0
    precision = len(common) / len(predicted_items) * 100 if len(predicted_items) > 0 else 0
    recall = len(common) / len(expected_items) * 100 if len(expected_items) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    print(f"\n📊 EVALUATION METRICS:")
    print(f"  ✓ Correct extractions: {len(common)}/{len(expected_items)}")
    print(f"  ✗ Missed extractions: {len(missing)}")
    print(f"  ⚠ Extra extractions: {len(extra)}")
    print(f"\n  📈 Per-Sample Metrics:")
    print(f"    Accuracy:  {accuracy:.1f}%")
    print(f"    Precision: {precision:.1f}%")
    print(f"    Recall:    {recall:.1f}%")
    print(f"    F1 Score:  {f1:.1f}%")
    
    if missing:
        print(f"\n  Missed items: {list(missing)[:3]}")
    if extra:
        print(f"  Extra items: {list(extra)[:3]}")
    
    # Show matched items for verification
    if common:
        print(f"\n  ✓ Matched items: {list(common)[:5]}")

Testing on 300 samples from TEST SET
Total test set size: 300

⚠️  IMPORTANT:
  - Training set (80%): Used for fine-tuning
  - Validation set (10%): Monitored during training (W&B)
  - Test set (10%): Used ONLY NOW for final evaluation

FINAL TEST EXAMPLE 1/300

📝 PROMPT:
The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

Large parenteral doses of vitamin D3 (15 to 17.5 x 10(6) IU vitamin D3) were associated with prolonged hypercalcem...

✅ EXPECTED OUTPUT:
- vitamin D3
- Calcium

🤖 MODEL PREDICTION:
- vitamin D
- Vitamin D3
- calcium
- phosphatemia
- Ca
- vitamin D
- Vitamin D3
- milk fever
- vitamin D3
- calcium
- Vitamin D
- Vitamin D3
- calcium
- Phosphate
- Milk Fever
- milk fever
- calcium
- Vitamin D3
- milk fever
- vitamin D3
- calcium
- Vitamin D3
- milk fever
- milk fever
- Vitamin D3
- calcium
- Vitamin D3
- milk fever
- milk fever
- calcium
- Vitamin D3
-

📊 EVALUATION METRIC

## 7. Aggregate Metrics

Summarize performance across the evaluated samples to understand overall precision, recall, and F1 score.

In [9]:
# Aggregate Metrics across all test samples
print("\n" + "="*80)
print("AGGREGATE METRICS ACROSS TEST SAMPLES")
print("="*80)

# Calculate aggregate metrics
aggregate_precision = total_correct / total_predicted * 100 if total_predicted > 0 else 0
aggregate_recall = total_correct / total_expected * 100 if total_expected > 0 else 0
aggregate_f1 = 2 * (aggregate_precision * aggregate_recall) / (aggregate_precision + aggregate_recall) if (aggregate_precision + aggregate_recall) > 0 else 0
aggregate_accuracy = total_correct / total_expected * 100 if total_expected > 0 else 0

print(f"\nEvaluated on {num_test_samples} test samples:")
print(f"\n📊 Overall Performance:")
print(f"  Total expected entities:  {total_expected}")
print(f"  Total predicted entities: {total_predicted}")
print(f"  Correctly predicted:      {total_correct}")

print(f"\n📈 Aggregate Metrics:")
print(f"  Accuracy:  {aggregate_accuracy:.2f}%")
print(f"  Precision: {aggregate_precision:.2f}% (fewer false positives)")
print(f"  Recall:    {aggregate_recall:.2f}% (fewer false negatives)")
print(f"  F1 Score:  {aggregate_f1:.2f}% (balanced metric)")

print(f"\n💡 Interpretation:")
print(f"  - Accuracy: {aggregate_accuracy:.1f}% of expected entities were found")
print(f"  - Precision: Of all entities predicted, {aggregate_precision:.1f}% were correct")
print(f"  - Recall: Of all actual entities, {aggregate_recall:.1f}% were found")
print(f"  - F1: Harmonic mean balancing precision and recall")

print(f"\n🎯 What these metrics mean:")
print(f"  - High Precision, Low Recall → Model is conservative (misses entities)")
print(f"  - Low Precision, High Recall → Model is aggressive (predicts too many)")
print(f"  - High F1 Score → Good balance between precision and recall")


AGGREGATE METRICS ACROSS TEST SAMPLES

Evaluated on 300 test samples:

📊 Overall Performance:
  Total expected entities:  955
  Total predicted entities: 3430
  Correctly predicted:      753

📈 Aggregate Metrics:
  Accuracy:  78.85%
  Precision: 21.95% (fewer false positives)
  Recall:    78.85% (fewer false negatives)
  F1 Score:  34.34% (balanced metric)

💡 Interpretation:
  - Accuracy: 78.8% of expected entities were found
  - Precision: Of all entities predicted, 22.0% were correct
  - Recall: Of all actual entities, 78.8% were found
  - F1: Harmonic mean balancing precision and recall

🎯 What these metrics mean:
  - High Precision, Low Recall → Model is conservative (misses entities)
  - Low Precision, High Recall → Model is aggressive (predicts too many)
  - High F1 Score → Good balance between precision and recall


## 7.5 False Positive Analysis

Analyze the types of errors the model is making to understand and improve performance.

In [None]:
# Detailed False Positive Analysis
print("="*80)
print("FALSE POSITIVE ANALYSIS")
print("="*80)

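# Re-analyze test data to collect all false positives
# Note: this re-runs generation for every sample; because sampling is enabled,
# counts may differ slightly from the aggregate metrics in section 7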
all_false_positives = []
all_false_negatives = []
all_true_positives = []

for i, sample in enumerate(test_data[:num_test_samples]):
    prediction = generate_response(sample['prompt'])
    
    # Normalize for comparison
    expected_items = set([normalize_text(item) for item in sample['completion'].split('\n') if item.strip()])
    predicted_items = set([normalize_text(item) for item in prediction.split('\n') if item.strip()])
    
    common = expected_items & predicted_items
    false_positives = predicted_items - expected_items  # Model predicted but not in ground truth
    false_negatives = expected_items - predicted_items  # In ground truth but model missed
    
    all_false_positives.extend(false_positives)
    all_false_negatives.extend(false_negatives)
    all_true_positives.extend(common)

print(f"\n📊 Error Distribution:")
print(f"  True Positives:   {len(all_true_positives)} (Correct predictions)")
print(f"  False Positives:  {len(all_false_positives)} (Extra/wrong predictions)")
print(f"  False Negatives:  {len(all_false_negatives)} (Missed entities)")

# Calculate error rates
total_predictions = len(all_true_positives) + len(all_false_positives)
total_ground_truth = len(all_true_positives) + len(all_false_negatives)

false_positive_rate = len(all_false_positives) / total_predictions * 100 if total_predictions > 0 else 0
false_negative_rate = len(all_false_negatives) / total_ground_truth * 100 if total_ground_truth > 0 else 0

print(f"\n📈 Error Rates:")
print(f"  False Positive Rate: {false_positive_rate:.1f}% (of all predictions)")
print(f"  False Negative Rate: {false_negative_rate:.1f}% (of all expected)")

# Show example false positives
print(f"\n⚠️  Example False Positives (Extra predictions):")
for i, fp in enumerate(all_false_positives[:10], 1):
    print(f"  {i}. {fp}")

# Show example false negatives
print(f"\n❌ Example False Negatives (Missed entities):")
for i, fn in enumerate(all_false_negatives[:10], 1):
    print(f"  {i}. {fn}")

# Analysis insights
print(f"\n💡 Insights:")
if false_positive_rate > 20:
    print(f"  ⚠️  High false positive rate ({false_positive_rate:.1f}%)")
    print(f"     → Model is too aggressive, predicting entities that aren't in ground truth")
    print(f"     → Consider: More conservative prompting, post-processing filters, or additional training")
elif false_positive_rate < 10:
    print(f"  ✓ Low false positive rate ({false_positive_rate:.1f}%)")
    print(f"     → Model is conservative and precise")

if false_negative_rate > 20:
    print(f"  ⚠️  High false negative rate ({false_negative_rate:.1f}%)")
    print(f"     → Model is missing many expected entities")
    print(f"     → Consider: More training data, longer context, or prompt engineering")
elif false_negative_rate < 10:
    print(f"  ✓ Low false negative rate ({false_negative_rate:.1f}%)")
    print(f"     → Model has good recall")

print(f"\n🎯 Recommendations:")
if false_positive_rate > false_negative_rate:
    print(f"  Primary issue: TOO MANY FALSE POSITIVES")
    print(f"  Solutions:")
    print(f"    1. Add post-processing to filter common false positives")
    print(f"    2. Adjust generation parameters (lower temperature, lower top_p, or greedy decoding)")
    print(f"    3. Fine-tune with more negative examples")
    print(f"    4. Use stricter prompt instructions")
else:
    print(f"  Primary issue: TOO MANY FALSE NEGATIVES")
    print(f"  Solutions:")
    print(f"    1. Increase training data quantity")
    print(f"    2. Improve prompt clarity")
    print(f"    3. Check if test data format matches training data")
    print(f"    4. Consider ensemble methods")

FALSE POSITIVE ANALYSIS
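
Listing the first ten errors shows examples but not patterns. A frequency count often makes systematic false positives easier to spot. A minimal sketch, using toy data in place of the `all_false_positives` list built by the cell above (`top_errors` is a hypothetical helper):

```python
from collections import Counter


def top_errors(errors, n=5):
    """Return the n most frequent error strings with their counts."""
    return Counter(errors).most_common(n)


# Toy data standing in for `all_false_positives`.
sample_false_positives = ["- milk fever", "- vitamin d", "- milk fever", "- ca", "- milk fever"]
for item, count in top_errors(sample_false_positives, n=2):
    print(f"{count:>3}  {item}")
```

Entities that dominate this ranking are natural candidates for a post-processing filter or for targeted training examples.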


## 8. Interpret the Metrics

### Accuracy
- **Formula**: `Correct / Total Expected`
- **Meaning**: Percentage of expected entities that were correctly predicted
- **Limitation**: As computed in this notebook it is identical to recall, so it doesn't account for false positives (extra predictions)

### Precision
- **Formula**: `Correct / Total Predicted`
- **Meaning**: Of all entities the model predicted, how many were correct?
- **High Precision**: Model rarely makes false positive errors (rarely predicts wrong entities)

### Recall
- **Formula**: `Correct / Total Expected`
- **Meaning**: Of all actual entities, how many did the model find?
- **High Recall**: Model rarely makes false negative errors (rarely misses entities)

### F1 Score
- **Formula**: `2 × (Precision × Recall) / (Precision + Recall)`
- **Meaning**: Harmonic mean that balances precision and recall
- **Best metric**: When you care equally about false positives and false negatives

**Example**:
```
Ground truth: ['aspirin', 'ibuprofen', 'NSAIDs']
Prediction:   ['aspirin', 'ibuprofen']

Accuracy:  66.7% (2/3 found)
Precision: 100% (2/2 predicted were correct)
Recall:    66.7% (2/3 actual entities found)
F1 Score:  80.0% (balanced metric)
```
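
The same formulas can be checked numerically. This sketch reproduces both the worked example above and the aggregate figures reported in section 7 from their raw counts; `prf1` is an illustrative helper name.

```python
def prf1(correct: int, predicted: int, expected: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) as percentages from raw counts."""
    precision = 100 * correct / predicted if predicted else 0.0
    recall = 100 * correct / expected if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Worked example above: 2 correct out of 2 predicted and 3 expected.
print([round(v, 1) for v in prf1(2, 2, 3)])         # [100.0, 66.7, 80.0]
# Aggregate counts reported in section 7.
print([round(v, 2) for v in prf1(753, 3430, 955)])  # [21.95, 78.85, 34.34]
```

The second line shows why the aggregate F1 is low despite good recall: the model emitted 3,430 items for 955 expected ones, so precision drags the harmonic mean down.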

## 9. Custom Test Cases: Comprehensive NER Evaluation

Test the model's ability to:
1. **Extract Chemicals** - Identify drug names and chemical compounds
2. **Extract Diseases** - Identify medical conditions and diseases
3. **Extract Relationships** - Identify which chemicals are related to which diseases
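
The hand-written prompts in this section share one layout: a fixed preamble, a task instruction, the text, and a trailing cue. A small builder, sketched below, keeps that layout consistent when adding more test cases. The wording is copied from the test prompts in this section; `PREAMBLE`, `TASKS`, and `build_prompt` are hypothetical names.

```python
PREAMBLE = "The following article contains technical terms including diseases, drugs and chemicals."

# (instruction, trailing cue) per task type.
TASKS = {
    "chemical": ("Create a list only of the chemicals mentioned.", "List of extracted chemicals:"),
    "disease": ("Create a list only of the diseases mentioned.", "List of extracted diseases:"),
    "relationship": (
        "Extract the relationships between chemicals and diseases mentioned in the text.",
        "List the chemical-disease relationships:",
    ),
}


def build_prompt(task: str, text: str) -> str:
    """Assemble a prompt in the same layout as the hand-written test prompts."""
    instruction, cue = TASKS[task]
    return f"{PREAMBLE} {instruction}\n\n{text.strip()}\n\n{cue}\n"


print(build_prompt("chemical", "Patient was treated with metformin and insulin."))
```

An unknown task name raises `KeyError`, which surfaces typos immediately instead of sending a malformed prompt to the model.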

In [None]:
# Test 1: Chemical Extraction
print("="*80)
print("TEST 1: CHEMICAL EXTRACTION")
print("="*80)

chemical_test = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

A patient was treated with aspirin and ibuprofen for pain relief. The combination of these NSAIDs proved effective in reducing inflammation. Additionally, metformin was prescribed for glucose control.

List of extracted chemicals:
"""

print(f"\n📝 Prompt:\n{chemical_test}")
print("\n🤖 Model Output:")
print(generate_response(chemical_test))

In [None]:
# Test 2: Disease Extraction
print("\n" + "="*80)
print("TEST 2: DISEASE EXTRACTION")
print("="*80)

disease_test = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the diseases mentioned.

The patient presented with hypertension, diabetes mellitus, and chronic kidney disease. Laboratory findings revealed proteinuria and elevated creatinine levels, suggesting diabetic nephropathy.

List of extracted diseases:
"""

print(f"\n📝 Prompt:\n{disease_test}")
print("\n🤖 Model Output:")
print(generate_response(disease_test))

In [None]:
# Test 3: Chemical-Disease Relationship Extraction
print("\n" + "="*80)
print("TEST 3: RELATIONSHIP EXTRACTION - BASIC")
print("="*80)

relationship_test_1 = """The following article contains technical terms including diseases, drugs and chemicals. Extract the relationships between chemicals and diseases mentioned in the text.

Metformin is commonly prescribed for type 2 diabetes by improving insulin sensitivity and reducing hepatic glucose production. Aspirin is used in cardiovascular disease management in high-risk patients.

List the chemical-disease relationships:
"""

print(f"\n📝 Prompt:\n{relationship_test_1}")
print("\n🤖 Model Output:")
print(generate_response(relationship_test_1, max_new_tokens=600))

In [None]:
# Test 4: Multiple Relationship Extraction
print("\n" + "="*80)
print("TEST 4: RELATIONSHIP EXTRACTION - MULTIPLE PAIRS")
print("="*80)

relationship_test_2 = """The following article contains technical terms including diseases, drugs and chemicals. Identify all chemical-disease pairs and their relationships.

Long-term use of corticosteroids is associated with osteoporosis and increases the risk of bone fractures. NSAIDs are linked to chronic kidney disease and gastrointestinal bleeding in susceptible patients.

List of chemical-disease relationships:
"""

print(f"\n📝 Prompt:\n{relationship_test_2}")
print("\n🤖 Model Output:")
print(generate_response(relationship_test_2, max_new_tokens=600))

In [None]:
# Test 5: Complex Multi-Entity Relationship Extraction
print("\n" + "="*80)
print("TEST 5: COMPREHENSIVE EXTRACTION - ALL ENTITIES & RELATIONSHIPS")
print("="*80)

relationship_test_3 = """The following article contains technical terms including diseases, drugs and chemicals. Extract:
1. All chemicals mentioned
2. All diseases mentioned
3. All relationships between chemicals and diseases

The patient with rheumatoid arthritis was started on methotrexate for inflammatory joint disease. However, methotrexate is associated with hepatotoxicity and requires monitoring. The patient also has hypertension managed with lisinopril. Statins were prescribed for cardiovascular disease prevention given elevated cholesterol levels.

Extracted information:
"""

print(f"\n📝 Prompt:\n{relationship_test_3}")
print("\n🤖 Model Output:")
print(generate_response(relationship_test_3, max_new_tokens=800))

## 10. Suggested Next Steps

- Keep `num_test_samples = len(test_data)` (the current setting) for reported results; lower it only for quick smoke tests.
- Compare with the base model to quantify the lift from fine-tuning.
- Log metrics to Weights & Biases or another tracker for experiment history.
- Export predictions for manual spot checks with subject-matter experts.
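
For the export step, JSON Lines is a convenient format because it mirrors the test data and is easy to filter or diff. The sketch below uses only the standard library; the record layout and the `export_predictions` name are assumptions, not part of the notebook.

```python
import json
from pathlib import Path


def export_predictions(records, path):
    """Write one JSON object per line and return the output path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII characters readable for reviewers
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical record layout; adapt the keys to what reviewers need.
records = [
    {
        "index": 0,
        "expected": ["vitamin d3", "calcium"],
        "predicted": ["vitamin d", "vitamin d3", "calcium", "milk fever"],
    },
]
```

Calling `export_predictions(records, "predictions.jsonl")` writes the file next to the notebook; collecting `records` inside the evaluation loop avoids a second generation pass.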

## 11. Usage Example (Optional)

How to load the model in a production script or service.

In [None]:
# Example: How to load and use the model later
usage_code = '''
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter from Hub
model = PeftModel.from_pretrained(
    base_model,
    "your-username/llama3-medical-ner-lora"  # Your model ID
)
model.eval()

# Use the model
prompt = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

Patient was treated with metformin and insulin for diabetes management.

List of extracted chemicals:
"""

# Generate response
# ... (use the generate_response function from above)
'''

print("Usage Example:")
print("="*80)
print(usage_code)

---

## Summary

This notebook:
1. ✅ Configured environment variables and authentication for Hugging Face and W&B.
2. ✅ Installed required evaluation dependencies.
3. ✅ Loaded the fine-tuned medical NER model (base + LoRA adapter).
4. ✅ Evaluated performance on unseen test samples with detailed metrics.
5. ✅ Aggregated precision, recall, and F1 across all evaluated examples.
6. ✅ Validated behaviour on curated chemical, disease, and relationship prompts.
7. ✅ Outlined next steps and provided a ready-to-use inference snippet.

**Your medical NER evaluation workflow is ready! 🚀**