# üöÄ RunPod GPU Setup

**This notebook is optimized for RunPod GPU pods with NVIDIA GPUs**

## Quick Start on RunPod:

1. **Launch a GPU Pod** (RTX 3090, 4090, or A5000 recommended)
2. **Upload this notebook** to the pod
3. **Upload test data** (`test_run_20251106.jsonl`) to `/workspace/data/`
4. **Run cells in order** - evaluation should complete in ~5-10 minutes

## Expected Performance:
- **GPU**: RTX 3090/4090 ‚Üí ~0.5-1 sec/sample (~5 min total)
- **GPU**: RTX A5000 ‚Üí ~1-2 sec/sample (~10 min total)
- **Full evaluation**: 300 samples

---

# Medical NER Model Evaluation

This notebook evaluates the fine-tuned Llama 3.2 3B medical NER model.

## ‚úÖ DATASET VERIFIED & READY FOR EVALUATION

**Current Dataset Distribution** (from `both_rel_instruct_all.jsonl`):
- **1,000 Chemical extraction** examples (25%)
- **2,000 Disease extraction** examples (50%) ‚ö†Ô∏è Intentionally 2x more
- **1,000 Relationship extraction** examples (25%)

**Data Splits Status**: ‚úÖ Properly stratified using `stratify=` parameter
- Training (2,400): 25% chemical, 50% disease, 25% relationship
- Validation (300+): 25% chemical, 50% disease, 25% relationship
- Test (300+): 25% chemical, 50% disease, 25% relationship

**Why Disease is 2x more**:
- The original dataset has twice as many disease extraction examples
- Stratified splitting preserves this 25/50/25 distribution
- This appears intentional for better disease NER performance
- All splits are properly balanced relative to the source data

**Next Steps**:
1. ‚úÖ Training data is properly split with stratification
2. ‚úÖ No data leakage between train/val/test
3. ‚úÖ Update `HF_MODEL_ID` below with your trained model ID
4. ‚úÖ Run this evaluation notebook on the balanced test set

---

## Prerequisites:
1. Complete training in `Medical_NER_Fine_Tuning.ipynb` (uses stratified splits!)
2. Model saved to `./final_model` or uploaded to HuggingFace Hub
3. Test data available in `notebooks/test.jsonl` or `../data/test.jsonl`

## Evaluation Tasks:
1. Load the fine-tuned model
2. Evaluate on test set (25% chem, 50% disease, 25% relationship)
3. Calculate precision, recall, F1 scores per task type
4. Test on custom medical texts
5. Analyze errors and false positives

## 0. Environment Variables Setup

‚ö†Ô∏è **IMPORTANT**: Set your credentials before running the notebook!

**Note**: `hf_transfer` is enabled for faster downloads from HuggingFace Hub.

In [1]:
import os
from getpass import getpass

# Enable hf_transfer for faster downloads from HuggingFace Hub
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# HuggingFace Token (required to download your model from Hub)
# Get your token from: https://huggingface.co/settings/tokens
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    print("HF_TOKEN not found in environment variables")
    hf_token = getpass("Enter your HuggingFace token: ")
    os.environ["HF_TOKEN"] = hf_token
else:
    print("‚úì HF_TOKEN loaded from environment")

# Weights & Biases API Key (optional - only if tracking evaluation metrics)
# Get your key from: https://wandb.ai/authorize
wandb_key = os.getenv("WANDB_API_KEY")
if wandb_key:
    print("‚úì WANDB_API_KEY loaded from environment")
else:
    print("‚Ñπ WANDB_API_KEY not set (optional)")

print("\n‚úì Environment variables configured")
print(f"  HF_HUB_ENABLE_HF_TRANSFER: {os.getenv('HF_HUB_ENABLE_HF_TRANSFER')}")

HF_TOKEN not found in environment variables


Enter your HuggingFace token:  ¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


‚Ñπ WANDB_API_KEY not set (optional)

‚úì Environment variables configured
  HF_HUB_ENABLE_HF_TRANSFER: 1


## 1. Setup and Installation


In [2]:
# Install PyTorch and other required packages
!pip install -q transformers datasets peft accelerate bitsandbytes
!pip install -q huggingface-hub tokenizers hf-transfer

print("‚úì All packages installed successfully!")
print("  - transformers (HuggingFace models)")
print("  - peft (LoRA adapters)")
print("  - accelerate (device management)")
print("  - bitsandbytes (quantization)")
print("  - hf-transfer (fast downloads)")

‚úì All packages installed successfully!
  - transformers (HuggingFace models)
  - peft (LoRA adapters)
  - accelerate (device management)
  - bitsandbytes (quantization)
  - hf-transfer (fast downloads)


## 2. Import Libraries


In [3]:

import json
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from huggingface_hub import login

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.8.0+cu128
CUDA available: True
GPU: NVIDIA RTX A4000


## 0) Reusable Utilities

‚ö†Ô∏è **IMPORTANT**: Run this cell BEFORE running evaluation cells below!

These utility functions provide text normalization, hashing, parsing, and validation for the evaluation pipeline.

In [4]:
# ===== Utilities: normalization, hashing, parsing =====
import re, json, hashlib
from collections import Counter

def dehyphenate(s: str) -> str:
    # Join words broken across lines with hyphens + whitespace
    return re.sub(r"(\w+)-\s+(\w+)", r"\1\2", s)

def normalize_text(s: str) -> str:
    s = dehyphenate(s or "")
    s = s.lower()
    s = re.sub(r"[\u00A0\t\r\n]+", " ", s)     # spaces/newlines
    s = re.sub(r"\s+", " ", s).strip()
    return s

def prompt_hash(prompt: str) -> str:
    return hashlib.md5(normalize_text(prompt).encode("utf-8")).hexdigest()

def parse_bullets(text: str):
    items = []
    for line in (text or "").splitlines():
        m = re.match(r"^\s*[-*]\s*(.+?)\s*$", line)
        if m:
            items.append(m.group(1))
    return items

def normalize_item(s: str) -> str:
    s = (s or "").lower()
    # Keep hyphens intact (e.g., "type-2 diabetes" stays "type-2 diabetes")
    s = re.sub(r"\s+", " ", s)  # Only normalize whitespace
    s = re.sub(r"[\.,;:]+$", "", s).strip()
    return s

def in_text(item: str, text: str) -> bool:
    """Check if item appears in text using word boundaries to avoid partial matches."""
    item_norm = normalize_item(item)
    text_norm = normalize_text(text)
    # Use word boundaries to avoid matching "aspirin" in "aspirinate"
    pattern = r'\b' + re.escape(item_norm) + r'\b'
    return bool(re.search(pattern, text_norm))

def unique_preserve_order(seq):
    seen = set()
    out = []
    for x in seq:
        if x not in seen:
            seen.add(x); out.append(x)
    return out

print("‚úì Utility functions loaded")

‚úì Utility functions loaded


## 3. Configuration

‚ö†Ô∏è **Update these paths** to match your model location!


In [5]:
# Model configuration
BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# ‚ö†Ô∏è IMPORTANT: Update with YOUR HuggingFace model ID
# Find it at: https://huggingface.co/your-username
# Format: "your-username/llama3-medical-ner-lora-YYYYMMDD_HHMMSS"
HF_MODEL_ID = "albyos/llama3-medical-ner-checkpoint-450-20251108_114135"  # ‚Üê UPDATE THIS!

# Alternative: Use local model if you prefer
USE_HF_HUB = True  # Set to False to use local ../final_model
PROJECT_ROOT = Path.cwd().parent
LOCAL_MODEL_PATH = PROJECT_ROOT / "final_model"

ADAPTER_PATH = HF_MODEL_ID if USE_HF_HUB else str(LOCAL_MODEL_PATH)

# Data configuration
# For RunPod: Upload test data to /workspace/data/test.jsonl
# For local: Use your local path
try:
    # Try current directory first (for RunPod/workspace)
    TEST_DATA_PATH = Path("test_run-20251108.jsonl")
    if not TEST_DATA_PATH.exists():
        # Fallback to parent data directory (for local)
        TEST_DATA_PATH = Path.cwd().parent / "test_run-20251108.jsonl"
        if not TEST_DATA_PATH.exists():
            # Another fallback - notebooks directory test.jsonl
            TEST_DATA_PATH = Path.cwd() / "test_run-20251108.jsonl"
except Exception:
    TEST_DATA_PATH = Path("test_run-20251108.jsonl")

# Verify test data exists
if not TEST_DATA_PATH.exists():
    print(f"‚ùå Test data not found at {TEST_DATA_PATH}")
    print(f"üí° RunPod: Upload to /workspace/data/test.jsonl")
    print(f"üí° Local: Place in ../data/test.jsonl or notebooks/test.jsonl")
    raise FileNotFoundError(f"Test data file not found: {TEST_DATA_PATH}")

print("‚úì Configuration loaded")
print(f"  Base model: {BASE_MODEL}")
print(f"  Adapter source: {'HuggingFace Hub' if USE_HF_HUB else 'Local filesystem'}")
print(f"  Adapter path: {ADAPTER_PATH}")
print(f"  Test data: {TEST_DATA_PATH}")
print(f"  Test data exists: {TEST_DATA_PATH.exists()}")

‚úì Configuration loaded
  Base model: meta-llama/Llama-3.2-3B-Instruct
  Adapter source: HuggingFace Hub
  Adapter path: albyos/llama3-medical-ner-checkpoint-450-20251108_114135
  Test data: test_run-20251108.jsonl
  Test data exists: True


## 4. Authenticate with Hugging Face

Log into Hugging Face to download the LoRA adapter when `USE_HF_HUB` is enabled.

In [6]:
# Login to HuggingFace Hub to access your model
import os
from huggingface_hub import login

hf_token = os.environ.get("HF_TOKEN")

if not hf_token:
    print("‚ùå HF_TOKEN not found in environment")
    print("   Please run cell #3 first to set your HF token")
    raise ValueError("HF_TOKEN is required to download model from HuggingFace Hub")

# Login to HuggingFace
login(token=hf_token, add_to_git_credential=True)

print("‚úì Logged into Hugging Face Hub")
print(f"  Will load model from: {HF_MODEL_ID}")

Token has not been saved to git credential helper.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.[0m
‚úì Logged into Hugging Face Hub
  Will load model from: albyos/llama3-medical-ner-checkpoint-450-20251108_114135


## 5. Load the Fine-Tuned Model

Load the base model and attach the LoRA adapter from either Hugging Face Hub or your local filesystem.

**Note**: Using `hf_transfer` for faster downloads from HuggingFace Hub.

In [7]:
# Ensure hf_transfer is enabled for faster downloads
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Load the fine-tuned model for inference
print("="*80)
print("LOADING FINE-TUNED MODEL")
print("="*80)

print(f"\nLoading base model: {BASE_MODEL}...")

#Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

print(f"‚úì Tokenizer loaded")

# Check for GPU support (optimized for RunPod/CUDA)
if torch.cuda.is_available():
    device = "cuda"
    print(f"üöÄ NVIDIA GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
elif torch.backends.mps.is_available():
    device = "mps"
    print(f"üöÄ Apple Silicon GPU (MPS) detected")
else:
    device = "cpu"
    print(f"‚ö†Ô∏è  No GPU detected, using CPU (very slow)")

# Load base model with GPU acceleration
# On RunPod: Uses CUDA with float16 for optimal performance
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",  # Automatically distribute model across available GPUs
    low_cpu_mem_usage=True,
)

print(f"\n‚úì Base model loaded: {BASE_MODEL}")
print(f"  Device: {device.upper()}")
print(f"  Precision: {base_model.dtype}")
if device == "cuda":
    print(f"  GPU Memory Used: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Load LoRA adapter from HuggingFace Hub or local path
print(f"\nLoading LoRA adapter from: {ADAPTER_PATH}...")
print(f"  Using hf_transfer for faster downloads...")

model = PeftModel.from_pretrained(
    base_model,
    ADAPTER_PATH,
)
model.eval()

print(f"\n‚úì Fine-tuned model loaded successfully!")
print(f"  Base: {BASE_MODEL}")
print(f"  LoRA adapter: {ADAPTER_PATH}")
print(f"  Source: {'HuggingFace Hub' if USE_HF_HUB else 'Local filesystem'}")

LOADING FINE-TUNED MODEL

Loading base model: meta-llama/Llama-3.2-3B-Instruct...


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

‚úì Tokenizer loaded
üöÄ NVIDIA GPU detected: NVIDIA RTX A4000
   GPU Memory: 16.9 GB


config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]


‚úì Base model loaded: meta-llama/Llama-3.2-3B-Instruct
  Device: CUDA
  Precision: torch.float16
  GPU Memory Used: 6.43 GB

Loading LoRA adapter from: albyos/llama3-medical-ner-checkpoint-450-20251108_114135...
  Using hf_transfer for faster downloads...


adapter_config.json:   0%|          | 0.00/944 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/97.3M [00:00<?, ?B/s]


‚úì Fine-tuned model loaded successfully!
  Base: meta-llama/Llama-3.2-3B-Instruct
  LoRA adapter: albyos/llama3-medical-ner-checkpoint-450-20251108_114135
  Source: HuggingFace Hub


In [8]:
# ===== Deterministic generation for evaluation =====
def generate_response(prompt_text, max_new_tokens=128):
    """
    Generate a response for a given prompt - DETERMINISTIC for precision.
    
    Key changes from training version:
    - do_sample=False: Greedy decoding prevents hallucinations
    - temperature=0.0: No randomness
    - Removes sampling parameters (top_k, top_p)
    """
    formatted_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Return ONLY entities that appear verbatim in the article.
Output one item per line, each starting with '- '.
If none exist, return nothing.
Do not add explanations or examples.<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Greedy decoding (deterministic)
            temperature=0.0,  # No randomness
            top_p=1.0,  # Not used with do_sample=False, but set for clarity
            num_beams=1,  # No beam search (faster)
            repetition_penalty=1.15,  # Slight penalty to avoid repetition
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,  # Enable KV cache for faster generation
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract assistant's response
    if "assistant\n\n" in response:
        response = response.split("assistant\n\n")[-1]
    elif "assistant" in response:
        response = response.split("assistant")[-1].strip()
    
    return response.strip()

print("‚úì Deterministic inference function ready")
print("  Generation parameters:")
print("    - do_sample: False (greedy decoding)")
print("    - temperature: 0.0 (no randomness)")
print("    - max_new_tokens: 128 (optimal for NER tasks)")
print("    - use_cache: True (KV cache for speed)")
print("\n  Benefits:")
print("    - Reproducible results (same input ‚Üí same output)")
print("    - Reduced hallucinations and false positives")
print("    - Faster inference (no sampling overhead)")
print("\n  Expected speed on RunPod GPU:")
print("    - RTX 3090/4090: ~0.5-1 second per sample")
print("    - RTX A5000: ~1-2 seconds per sample")
print("    - Full evaluation (300 samples): ~5-10 minutes")

‚úì Deterministic inference function ready
  Generation parameters:
    - do_sample: False (greedy decoding)
    - temperature: 0.0 (no randomness)
    - max_new_tokens: 128 (optimal for NER tasks)
    - use_cache: True (KV cache for speed)

  Benefits:
    - Reproducible results (same input ‚Üí same output)
    - Reduced hallucinations and false positives
    - Faster inference (no sampling overhead)

  Expected speed on RunPod GPU:
    - RTX 3090/4090: ~0.5-1 second per sample
    - RTX A5000: ~1-2 seconds per sample
    - Full evaluation (300 samples): ~5-10 minutes


## 6. Task Classification and Post-Filters

These functions classify tasks from prompts and filter predictions to ensure they appear in the source text, reducing false positives.

In [9]:
# ===== Task classification and post-filters =====

# Task classifier
def task_from_prompt(prompt: str) -> str:
    """Classify task type from prompt text."""
    p = normalize_text(prompt)
    if "list of extracted chemicals" in p: return "chemicals"
    if "list of extracted diseases"  in p: return "diseases"
    if "list of extracted influences" in p: return "influences"
    # Fallback patterns
    if "chemicals mentioned" in p: return "chemicals"
    if "diseases mentioned" in p: return "diseases"
    if "influences between" in p: return "influences"
    return "other"

# Entity extraction and filtering
def extract_list_from_generation(gen_text):
    """Parse bullets from the model output."""
    return parse_bullets(gen_text)

def filter_items_against_text(pred_items, prompt_text):
    """Keep only items that appear in the source text (after normalization). Deduplicate."""
    keep = []
    for it in pred_items:
        if in_text(it, prompt_text):
            keep.append(normalize_item(it))
    return unique_preserve_order(keep)

# Influences/Relationships - parse as pairs
def parse_pairs(gen_text):
    """Parse 'chemical | disease' pairs from generation output."""
    pairs = []
    for line in parse_bullets(gen_text):
        parts = [p.strip() for p in line.split("|")]
        if len(parts)==2:
            pairs.append(tuple(parts))
    return unique_preserve_order(pairs)

def parse_pairs_from_sentence(gen_text):
    """Parse OLD format: 'chemical X influences disease Y' from generation."""
    pairs = []
    for line in parse_bullets(gen_text):
        # Match pattern: "chemical NAME influences disease NAME"
        m = re.match(r'^\s*chemical\s+(.+?)\s+influences\s+disease\s+(.+?)\s*$', line, re.I)
        if m:
            pairs.append((m.group(1).strip(), m.group(2).strip()))
    return unique_preserve_order(pairs)

def filter_pairs_against_text(pairs, prompt_text):
    """Keep the pair only if BOTH sides appear in the prompt."""
    kept = []
    for chem, dis in pairs:
        if in_text(chem, prompt_text) and in_text(dis, prompt_text):
            kept.append((normalize_item(chem), normalize_item(dis)))
    # Deduplicate normalized pairs
    seen=set(); out=[]
    for p in kept:
        if p not in seen:
            seen.add(p); out.append(p)
    return out

# Temporary fallback if you still have sentence outputs
def sentence_to_pair(line):
    """Parse sentence-style influences: 'Chemical X influences disease Y'"""
    m = re.match(r"^\s*chemical\s+(.+?)\s+influences\s+disease\s+(.+?)\s*$", line, re.I)
    return (m.group(1), m.group(2)) if m else None

print("‚úì Task classification and filter functions loaded")
print("  Functions:")
print("    - task_from_prompt(): Classify task type")
print("    - filter_items_against_text(): Keep only entities in source text")
print("    - parse_pairs(): Parse 'chemical | disease' pairs")
print("    - filter_pairs_against_text(): Keep pairs where both sides exist")

‚úì Task classification and filter functions loaded
  Functions:
    - task_from_prompt(): Classify task type
    - filter_items_against_text(): Keep only entities in source text
    - parse_pairs(): Parse 'chemical | disease' pairs
    - filter_pairs_against_text(): Keep pairs where both sides exist


## 7. Evaluate on the Held-Out Test Set

Run inference on the test set with deterministic generation and post-filters.

**Key Features**:
- **Deterministic generation**: No sampling (do_sample=False)
- **Post-filters**: Keep only entities that appear in source text
- **Per-task metrics**: Separate P/R/F1 for chemicals, diseases, influences
- **Sanity checks**: Show examples of false positives and false negatives

## üîß Critical Fixes Applied

**Format Mismatch Issue Resolved:**

The test data uses OLD format for influences:
```
"- chemical cyclophosphamide influences disease urinary bladder cancer"
```

But the model may output NEW format:
```
"- cyclophosphamide | urinary bladder cancer"
```

**Solution:** The evaluation now handles BOTH formats automatically by:
1. Parsing gold data from OLD sentence format
2. Trying to parse model output from NEW format first, then OLD format as fallback
3. Normalizing both to `"chemical | disease"` format for comparison

This ensures accurate metrics regardless of which format the model learned!

In [None]:
# ===== Evaluation with per-task metrics and filters =====
from statistics import mean

def f1(p, r): 
    return 0.0 if (p+r)==0 else 2*p*r/(p+r)

# Load test data
with open(TEST_DATA_PATH, 'r', encoding='utf-8') as f:
    test_data = [json.loads(line) for line in f]

print(f"‚úì Loaded test set: {len(test_data)} samples")
print(f"\n‚ö†Ô∏è  IMPORTANT:")
print(f"  - Training set (80%): Used for fine-tuning")
print(f"  - Validation set (10%): Monitored during training (W&B)")
print(f"  - Test set (10%): Used ONLY NOW for final evaluation")
print(f"\nRunning evaluation with deterministic generation + post-filters...")

# Initialize per-task counters
gold_total = {"chemicals":0, "diseases":0, "influences":0}
pred_total = {"chemicals":0, "diseases":0, "influences":0}
tp_total   = {"chemicals":0, "diseases":0, "influences":0}

examples_fp = []  # False positives
examples_fn = []  # False negatives

# Process each test sample
for idx, row in enumerate(test_data):
    if (idx + 1) % 50 == 0:
        print(f"  Progress: {idx + 1}/{len(test_data)} samples...")
    
    prompt = row["prompt"]
    gold_items = [normalize_item(x) for x in parse_bullets(row.get("completion",""))]
    task = task_from_prompt(prompt)
    
    # Generate prediction
    gen = generate_response(prompt, max_new_tokens=128)
    pred_raw = extract_list_from_generation(gen)
    
    # Apply filters based on task type
    if task in {"chemicals", "diseases"}:
        pred = filter_items_against_text(pred_raw, prompt)
    elif task == "influences":
        # Parse gold data (OLD format: "chemical X influences disease Y")
        gold_pairs = []
        for item in parse_bullets(row.get("completion","")):
            # Try parsing sentence format
            m = re.match(r'^\s*chemical\s+(.+?)\s+influences\s+disease\s+(.+?)\s*$', item, re.I)
            if m:
                chem = normalize_item(m.group(1))
                dis = normalize_item(m.group(2))
                gold_pairs.append(f"{chem} | {dis}")
        gold_items = gold_pairs
        
        # Parse model output (could be NEW format "chem | disease" OR OLD format)
        pairs_new = parse_pairs(gen)  # Try new format first
        pairs_old = parse_pairs_from_sentence(gen)  # Try old format as fallback
        all_pairs = pairs_new if pairs_new else pairs_old
        
        # Normalize both sides of the pair for consistent comparison
        pred = [f"{normalize_item(c)} | {normalize_item(d)}" 
                for (c,d) in filter_pairs_against_text(all_pairs, prompt)]
    else:
        pred = []
    
    # Convert to sets for metrics
    gs = set(gold_items)
    ps = set(pred)
    
    tp = len(gs & ps)
    fp = len(ps - gs)
    fn = len(gs - ps)
    
    gold_total[task] += len(gs)
    pred_total[task] += len(ps)
    tp_total[task]   += tp
    
    # Collect examples for analysis
    if fp and len(examples_fp) < 8:
        examples_fp.append({
            "task": task,
            "prompt_preview": prompt[:120]+"...",
            "pred_extras": list(ps-gs)[:5]
        })
    if fn and len(examples_fn) < 8:
        examples_fn.append({
            "task": task,
            "prompt_preview": prompt[:120]+"...",
            "missed": list(gs-ps)[:5]
        })

print(f"\n‚úì Evaluation complete!")
print(f"\n{'='*80}")
print("PER-TASK METRICS (with post-filters)")
print(f"{'='*80}\n")

# Calculate and display metrics for each task
for t in ["chemicals", "diseases", "influences"]:
    P = 0.0 if pred_total[t]==0 else tp_total[t]/pred_total[t]
    R = 0.0 if gold_total[t]==0 else tp_total[t]/gold_total[t]
    F = f1(P,R)
    print(f"{t.upper()}")
    print(f"  Precision: {P*100:5.1f}%  (TP={tp_total[t]}, Pred={pred_total[t]})")
    print(f"  Recall:    {R*100:5.1f}%  (TP={tp_total[t]}, Gold={gold_total[t]})")
    print(f"  F1 Score:  {F*100:5.1f}%")
    print()

# Overall metrics
total_tp = sum(tp_total.values())
total_pred = sum(pred_total.values())
total_gold = sum(gold_total.values())
overall_P = 0.0 if total_pred==0 else total_tp/total_pred
overall_R = 0.0 if total_gold==0 else total_tp/total_gold
overall_F = f1(overall_P, overall_R)

print(f"{'='*80}")
print("OVERALL METRICS")
print(f"{'='*80}")
print(f"  Precision: {overall_P*100:5.1f}%")
print(f"  Recall:    {overall_R*100:5.1f}%")
print(f"  F1 Score:  {overall_F*100:5.1f}%")
print(f"\n  Total TP: {total_tp}, Total Pred: {total_pred}, Total Gold: {total_gold}")

# Show example errors
if examples_fp:
    print(f"\n{'='*80}")
    print("EXAMPLE FALSE POSITIVES (model predicted, but not in gold)")
    print(f"{'='*80}")
    for e in examples_fp[:5]:
        print(f"\nTask: {e['task']}")
        print(f"Prompt: {e['prompt_preview']}")
        print(f"Extra predictions: {e['pred_extras']}")

if examples_fn:
    print(f"\n{'='*80}")
    print("EXAMPLE FALSE NEGATIVES (in gold, but model missed)")
    print(f"{'='*80}")
    for e in examples_fn[:5]:
        print(f"\nTask: {e['task']}")
        print(f"Prompt: {e['prompt_preview']}")
        print(f"Missed items: {e['missed']}")

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


‚úì Loaded test set: 300 samples

‚ö†Ô∏è  IMPORTANT:
  - Training set (80%): Used for fine-tuning
  - Validation set (10%): Monitored during training (W&B)
  - Test set (10%): Used ONLY NOW for final evaluation

Running evaluation with deterministic generation + post-filters...


## 8. Custom Test Cases ‚Äî Comprehensive NER Evaluation

Test the model's ability to:
1. **Extract Chemicals** - Identify drug names and chemical compounds
2. **Extract Diseases** - Identify medical conditions and diseases
3. **Extract Relationships** - Identify which chemicals are related to which diseases

In [None]:
# Test 1: Chemical Extraction
print("="*80)
print("TEST 1: CHEMICAL EXTRACTION")
print("="*80)

chemical_test = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

A patient was treated with aspirin and ibuprofen for pain relief. The combination of these NSAIDs proved effective in reducing inflammation. Additionally, metformin was prescribed for glucose control.

List of extracted chemicals:
"""

print(f"\nüìù Prompt:\n{chemical_test}")
print("\nü§ñ Model Output:")
print(generate_response(chemical_test))

In [None]:
# Test 2: Disease Extraction
print("\n" + "="*80)
print("TEST 2: DISEASE EXTRACTION")
print("="*80)

disease_test = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the diseases mentioned.

The patient presented with hypertension, diabetes mellitus, and chronic kidney disease. Laboratory findings revealed proteinuria and elevated creatinine levels, suggesting diabetic nephropathy.

List of extracted diseases:
"""

print(f"\nüìù Prompt:\n{disease_test}")
print("\nü§ñ Model Output:")
print(generate_response(disease_test))

In [None]:
# Test 3: Chemical-Disease Relationship Extraction
print("\n" + "="*80)
print("TEST 3: RELATIONSHIP EXTRACTION - BASIC")
print("="*80)

relationship_test_1 = """The following article contains technical terms including diseases, drugs and chemicals. Extract the relationships between chemicals and diseases mentioned in the text.

Metformin is commonly prescribed for type 2 diabetes by improving insulin sensitivity and reducing hepatic glucose production. Aspirin is used in cardiovascular disease management in high-risk patients.

List the chemical-disease relationships:
"""

print(f"\nüìù Prompt:\n{relationship_test_1}")
print("\nü§ñ Model Output:")
print(generate_response(relationship_test_1, max_new_tokens=600))

In [None]:
# Test 4: Multiple Relationship Extraction
print("\n" + "="*80)
print("TEST 4: RELATIONSHIP EXTRACTION - MULTIPLE PAIRS")
print("="*80)

relationship_test_2 = """The following article contains technical terms including diseases, drugs and chemicals. Identify all chemical-disease pairs and their relationships.

Long-term use of corticosteroids is associated with osteoporosis and increases the risk of bone fractures. NSAIDs are linked to chronic kidney disease and gastrointestinal bleeding in susceptible patients.

List of chemical-disease relationships:
"""

print(f"\nüìù Prompt:\n{relationship_test_2}")
print("\nü§ñ Model Output:")
print(generate_response(relationship_test_2, max_new_tokens=600))

In [None]:
# Test 5: Complex Multi-Entity Relationship Extraction
print("\n" + "="*80)
print("TEST 5: COMPREHENSIVE EXTRACTION - ALL ENTITIES & RELATIONSHIPS")
print("="*80)

relationship_test_3 = """The following article contains technical terms including diseases, drugs and chemicals. Extract:
1. All chemicals mentioned
2. All diseases mentioned
3. All relationships between chemicals and diseases

The patient with rheumatoid arthritis was started on methotrexate for inflammatory joint disease. However, methotrexate is associated with hepatotoxicity and requires monitoring. The patient also has hypertension managed with lisinopril. Statins were prescribed for cardiovascular disease prevention given elevated cholesterol levels.

Extracted information:
"""

print(f"\nüìù Prompt:\n{relationship_test_3}")
print("\nü§ñ Model Output:")
print(generate_response(relationship_test_3, max_new_tokens=800))

## 10. Suggested Next Steps

- Evaluate the full test set (set `num_test_samples = len(test_data)`) to capture complete performance.
- Compare with the base model to quantify the lift from fine-tuning.
- Log metrics to Weights & Biases or another tracker for experiment history.
- Export predictions for manual spot checks with subject-matter experts.

## 11. Usage Example (Optional)

How to load the model in a production script or service.

In [None]:
# Example: How to load and use the model later
usage_code = '''
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter from Hub
model = PeftModel.from_pretrained(
    base_model,
    "your-username/llama3-medical-ner-lora"  # Your model ID
)
model.eval()

# Use the model
prompt = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

Patient was treated with metformin and insulin for diabetes management.

List of extracted chemicals:
"""

# Generate response
# ... (use the generate_response function from above)
'''

print("Usage Example:")
print("="*80)
print(usage_code)

---

## Summary

This notebook:
1. ‚úÖ Configured environment variables and authentication for Hugging Face and W&B.
2. ‚úÖ Installed required evaluation dependencies.
3. ‚úÖ Loaded the fine-tuned medical NER model (base + LoRA adapter).
4. ‚úÖ Evaluated performance on unseen test samples with detailed metrics.
5. ‚úÖ Aggregated precision, recall, and F1 across all evaluated examples.
6. ‚úÖ Validated behaviour on curated chemical, disease, and relationship prompts.
7. ‚úÖ Outlined next steps and provided a ready-to-use inference snippet.

**Your medical NER evaluation workflow is ready! üöÄ**