# Medical NER Model Evaluation

This notebook evaluates the fine-tuned Llama 3.2 3B medical NER model.

## ✅ DATASET VERIFIED & READY FOR EVALUATION

**Current Dataset Distribution** (from `both_rel_instruct_all.jsonl`):
- **1,000 Chemical extraction** examples (25%)
- **2,000 Disease extraction** examples (50%) ⚠️ Intentionally 2x more
- **1,000 Relationship extraction** examples (25%)

**Data Splits Status**: ✅ Properly stratified using `stratify=` parameter
- Training (2,400): 25% chemical, 50% disease, 25% relationship
- Validation (300+): 25% chemical, 50% disease, 25% relationship
- Test (300+): 25% chemical, 50% disease, 25% relationship

**Why Disease is 2x more**:
- The original dataset has twice as many disease extraction examples
- Stratified splitting preserves this 25/50/25 distribution
- This appears intentional for better disease NER performance
- All splits are properly balanced relative to the source data
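
One way to check the 25/50/25 claim is to count task types in each split. The sketch below assumes every record carries a `task` field, which is hypothetical: the actual JSONL files may expose only `prompt` and `completion`, in which case the task type would have to be inferred from the prompt text. The helper name `task_distribution` is also illustrative.

```python
from collections import Counter

def task_distribution(records, key="task"):
    """Return the share of each task type as a fraction of all records."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {task: count / total for task, count in counts.items()}

# Toy stand-in for one split: 1 chemical, 2 disease, 1 relationship example
toy_split = [
    {"task": "chemical"},
    {"task": "disease"},
    {"task": "disease"},
    {"task": "relationship"},
]
print(task_distribution(toy_split))
```

Running the same function on the train, validation, and test records should give roughly the same proportions if the stratification worked as described.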

**Next Steps**:
1. ✅ Training data is properly split with stratification
2. ✅ No data leakage between train/val/test
3. ✅ Update `HF_MODEL_ID` below with your trained model ID
4. ✅ Run this evaluation notebook on the balanced test set

---

## Prerequisites:
1. Complete training in `Medical_NER_Fine_Tuning.ipynb` (uses stratified splits!)
2. Model saved to `./final_model` or uploaded to HuggingFace Hub
3. Test data available in `notebooks/test.jsonl` or `../data/test.jsonl`

## Evaluation Tasks:
1. Load the fine-tuned model
2. Evaluate on test set (25% chem, 50% disease, 25% relationship)
3. Calculate precision, recall, F1 scores per task type
4. Test on custom medical texts
5. Analyze errors and false positives

## 0. Environment Variables Setup

⚠️ **IMPORTANT**: Set your credentials before running the notebook!

**Note**: `hf_transfer` is enabled for faster downloads from HuggingFace Hub.

In [None]:
import os

# Enable hf_transfer for faster downloads from HuggingFace Hub
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# HuggingFace Token (required to download your model from Hub)
# Get your token from: https://huggingface.co/settings/tokens
os.environ["HF_TOKEN"] = "hf_YOUR_TOKEN_HERE"  # ← UPDATE THIS! Never commit a real token.

# Weights & Biases API Key (optional - only if tracking evaluation metrics)
# Get your key from: https://wandb.ai/authorize
# os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"  # Uncomment if needed

print("✓ Environment variables set")
print(f"  HF_HUB_ENABLE_HF_TRANSFER: {os.getenv('HF_HUB_ENABLE_HF_TRANSFER')} (Fast downloads enabled!)")
print(f"  HF_TOKEN: {'✓ Set' if os.getenv('HF_TOKEN') and os.getenv('HF_TOKEN') != 'hf_YOUR_TOKEN_HERE' else '✗ Not set - UPDATE THIS!'}")
if os.getenv("WANDB_API_KEY"):
    print(f"  WANDB_API_KEY: ✓ Set")
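
Hardcoding credentials in a notebook cell makes them easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helpers below read the token from an environment variable set before Jupyter starts and treat the placeholder as "not set". The names `resolve_token` and `mask` are hypothetical and not part of any library.

```python
import os

PLACEHOLDER = "hf_YOUR_TOKEN_HERE"

def resolve_token(env_name="HF_TOKEN", placeholder=PLACEHOLDER, environ=None):
    """Return the token from the environment, or None if unset or still the placeholder."""
    environ = os.environ if environ is None else environ
    token = environ.get(env_name, "").strip()
    if not token or token == placeholder:
        return None
    return token

def mask(token, visible=4):
    """Show only the first few characters so logs never contain the full secret."""
    return token[:visible] + "..." if token else "(not set)"

print(f"HF_TOKEN: {mask(resolve_token())}")
```

With this pattern the notebook itself never contains the secret, and the status line prints only a masked prefix.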

## 1. Setup and Installation


In [None]:
# Install required packages
# Install PyTorch first (for GPU support on remote pod)
!pip install torch 
# Install other required packages
!pip install -q transformers datasets peft accelerate bitsandbytes
!pip install -q huggingface-hub tokenizers hf-transfer

print("✓ All packages installed successfully!")

## 2. Import Libraries


In [None]:
%pip install torch transformers peft huggingface-hub

import json
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from huggingface_hub import login

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 3. Configuration

⚠️ **Update these paths** to match your model location!


In [None]:
# Model configuration
BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# ⚠️ IMPORTANT: Update with YOUR HuggingFace model ID
# Find it at: https://huggingface.co/your-username
# Format: "your-username/llama3-medical-ner-lora-YYYYMMDD_HHMMSS"
HF_MODEL_ID = "albyos/llama3-medical-ner-lora-20251029_143110"  # ← UPDATE THIS!

# Alternative: Use local model if you prefer
USE_HF_HUB = True  # Set to False to use local ../final_model
PROJECT_ROOT = Path.cwd().parent
LOCAL_MODEL_PATH = PROJECT_ROOT / "final_model"

ADAPTER_PATH = HF_MODEL_ID if USE_HF_HUB else str(LOCAL_MODEL_PATH)

# Data configuration - Use the test file in notebooks directory
NOTEBOOKS_DIR = Path.cwd()  # Current notebooks directory
TEST_DATA_PATH = NOTEBOOKS_DIR / "test.jsonl"

# Verify test data exists
if not TEST_DATA_PATH.exists():
    print(f"❌ Test data not found at {TEST_DATA_PATH}")
    print(f"💡 Expected location: /workspace/ch_10_fine_tuning/notebooks/test.jsonl")
    raise FileNotFoundError(f"Test data file not found: {TEST_DATA_PATH}")

print("✓ Configuration loaded")
print(f"  Base model: {BASE_MODEL}")
print(f"  Adapter source: {'HuggingFace Hub' if USE_HF_HUB else 'Local filesystem'}")
print(f"  Adapter path: {ADAPTER_PATH}")
print(f"  Test data: {TEST_DATA_PATH}")
print(f"  Test data exists: {TEST_DATA_PATH.exists()}")

## 4. Authenticate with Hugging Face

Log into Hugging Face to download the LoRA adapter when `USE_HF_HUB` is enabled.

In [None]:
if USE_HF_HUB:
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token and hf_token != "hf_YOUR_TOKEN_HERE":
        login(token=hf_token, add_to_git_credential=True)
        print("✓ Logged into Hugging Face Hub")
    else:
        raise ValueError("HF_TOKEN is not set. Update the Environment Variables cell before continuing.")
else:
    print("Skipping Hugging Face login because USE_HF_HUB is False.")

## 5. Load the Fine-Tuned Model

Load the base model and attach the LoRA adapter from either Hugging Face Hub or your local filesystem.

**Note**: Using `hf_transfer` for faster downloads from HuggingFace Hub.

In [None]:
# Ensure hf_transfer is enabled for faster downloads
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Load the fine-tuned model for inference
print("="*80)
print("LOADING FINE-TUNED MODEL")
print("="*80)

print(f"\nLoading base model: {BASE_MODEL}...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

print(f"✓ Tokenizer loaded")

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)

print(f"✓ Base model loaded: {BASE_MODEL}")

# Load LoRA adapter from HuggingFace Hub or local path
print(f"\nLoading LoRA adapter from: {ADAPTER_PATH}...")
print(f"  Using hf_transfer for faster downloads...")

model = PeftModel.from_pretrained(
    base_model,
    ADAPTER_PATH,
)
model.eval()

print(f"\n✓ Fine-tuned model loaded successfully!")
print(f"  Base: {BASE_MODEL}")
print(f"  LoRA adapter: {ADAPTER_PATH}")
print(f"  Source: {'HuggingFace Hub' if USE_HF_HUB else 'Local filesystem'}")

In [None]:
def generate_response(prompt_text, max_new_tokens=512):
    """Generate a response for a given prompt."""
    formatted_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a medical NER expert. Extract the requested entities from medical texts accurately.<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    
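    # The template above already starts with <|begin_of_text|>, so skip the tokenizer's automatic BOS
    inputs = tokenizer(formatted_prompt, return_tensors="pt", add_special_tokens=False).to(model.device)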
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,  # Increased from 0.1 to allow more diversity
            do_sample=True,
            top_p=0.95,
            repetition_penalty=1.2,  # ✅ ADDED: Penalize repetition
            no_repeat_ngram_size=3,  # ✅ ADDED: Prevent 3-gram repetition
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the newly generated tokens so the prompt is not echoed back.
    # This avoids splitting on the word "assistant", which may also occur in the answer.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    return response.strip()

print("✓ Inference function ready")
print("  Generation parameters updated:")
print("    - temperature: 0.7 (more diverse)")
print("    - repetition_penalty: 1.2 (prevents loops)")
print("    - no_repeat_ngram_size: 3 (prevents 3-gram repetition)")


## 6. Evaluate on the Held-Out Test Set

Run inference on the unseen test set (all samples by default; lower `num_test_samples` for a quicker run) and compute per-sample precision, recall, and F1 scores.

**Note**: The evaluation now uses case-insensitive matching to handle capitalization differences.

In [None]:
# Test on COMPLETELY UNSEEN test samples
# The test set was not used for training OR validation monitoring
# Load test data
with open(TEST_DATA_PATH, 'r', encoding='utf-8') as f:
    test_data = [json.loads(line) for line in f]


num_test_samples = len(test_data)
print(f"Testing on {num_test_samples} of {len(test_data)} samples from TEST SET")
print(f"\n⚠️  IMPORTANT:")
print(f"  - Training set (80%): Used for fine-tuning")
print(f"  - Validation set (10%): Monitored during training (W&B)")
print(f"  - Test set (10%): Used ONLY NOW for final evaluation")

# Aggregate metrics
total_correct = 0
total_predicted = 0
total_expected = 0

def normalize_text(text):
    """Normalize text for comparison: lowercase and strip whitespace."""
    return text.lower().strip()

for i, sample in enumerate(test_data[:num_test_samples]):
    print("\n" + "="*80)
    print(f"FINAL TEST EXAMPLE {i+1}/{num_test_samples}")
    print("="*80)
    
    # Show prompt (truncated for readability)
    print(f"\n📝 PROMPT:")
    prompt_preview = sample['prompt'][:250] + "..." if len(sample['prompt']) > 250 else sample['prompt']
    print(f"{prompt_preview}")
    
    # Show expected output
    print(f"\n✅ EXPECTED OUTPUT:")
    print(f"{sample['completion']}")
    
    # Generate prediction
    print(f"\n🤖 MODEL PREDICTION:")
    prediction = generate_response(sample['prompt'])
    print(f"{prediction}")
    
    # Calculate metrics with normalization for case-insensitive comparison
    expected_items = set([normalize_text(item) for item in sample['completion'].split('\n') if item.strip()])
    predicted_items = set([normalize_text(item) for item in prediction.split('\n') if item.strip()])
    
    common = expected_items & predicted_items
    missing = expected_items - predicted_items
    extra = predicted_items - expected_items
    
    # Update aggregate counts
    total_correct += len(common)
    total_predicted += len(predicted_items)
    total_expected += len(expected_items)
    
    # Per-sample metrics
    accuracy = len(common) / len(expected_items) * 100 if len(expected_items) > 0 else 0
    precision = len(common) / len(predicted_items) * 100 if len(predicted_items) > 0 else 0
    recall = len(common) / len(expected_items) * 100 if len(expected_items) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    print(f"\n📊 EVALUATION METRICS:")
    print(f"  ✓ Correct extractions: {len(common)}/{len(expected_items)}")
    print(f"  ✗ Missed extractions: {len(missing)}")
    print(f"  ⚠ Extra extractions: {len(extra)}")
    print(f"\n  📈 Per-Sample Metrics:")
    print(f"    Accuracy:  {accuracy:.1f}%")
    print(f"    Precision: {precision:.1f}%")
    print(f"    Recall:    {recall:.1f}%")
    print(f"    F1 Score:  {f1:.1f}%")
    
    if missing:
        print(f"\n  Missed items: {list(missing)[:3]}")
    if extra:
        print(f"  Extra items: {list(extra)[:3]}")
    
    # Show matched items for verification
    if common:
        print(f"\n  ✓ Matched items: {list(common)[:5]}")

## 7. Aggregate Metrics

Summarize performance across the evaluated samples to understand overall precision, recall, and F1 score.

In [None]:
# Aggregate Metrics across all test samples
print("\n" + "="*80)
print("AGGREGATE METRICS ACROSS TEST SAMPLES")
print("="*80)

# Calculate aggregate metrics
aggregate_precision = total_correct / total_predicted * 100 if total_predicted > 0 else 0
aggregate_recall = total_correct / total_expected * 100 if total_expected > 0 else 0
aggregate_f1 = 2 * (aggregate_precision * aggregate_recall) / (aggregate_precision + aggregate_recall) if (aggregate_precision + aggregate_recall) > 0 else 0
aggregate_accuracy = total_correct / total_expected * 100 if total_expected > 0 else 0

print(f"\nEvaluated on {num_test_samples} test samples:")
print(f"\n📊 Overall Performance:")
print(f"  Total expected entities:  {total_expected}")
print(f"  Total predicted entities: {total_predicted}")
print(f"  Correctly predicted:      {total_correct}")

print(f"\n📈 Aggregate Metrics:")
print(f"  Accuracy:  {aggregate_accuracy:.2f}%")
print(f"  Precision: {aggregate_precision:.2f}% (fewer false positives)")
print(f"  Recall:    {aggregate_recall:.2f}% (fewer false negatives)")
print(f"  F1 Score:  {aggregate_f1:.2f}% (balanced metric)")

print(f"\n💡 Interpretation:")
print(f"  - Accuracy: {aggregate_accuracy:.1f}% of expected entities were found")
print(f"  - Precision: Of all entities predicted, {aggregate_precision:.1f}% were correct")
print(f"  - Recall: Of all actual entities, {aggregate_recall:.1f}% were found")
print(f"  - F1: Harmonic mean balancing precision and recall")

print(f"\n🎯 What these metrics mean:")
print(f"  - High Precision, Low Recall → Model is conservative (misses entities)")
print(f"  - Low Precision, High Recall → Model is aggressive (predicts too many)")
print(f"  - High F1 Score → Good balance between precision and recall")
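
The aggregate numbers above pool all three task types, while the evaluation plan calls for scores per task type. The sketch below shows one way to break them down. It assumes the task can be guessed from prompt wording similar to the custom prompts in Section 9; `infer_task_type` and `per_task_metrics` are hypothetical helpers, and the keyword rules may need adjusting to the actual test prompts.

```python
from collections import defaultdict

def infer_task_type(prompt):
    """Guess the task from prompt wording (heuristic; adjust to your prompts)."""
    text = prompt.lower()
    if "relationship" in text:
        return "relationship"
    if "list only of the chemicals" in text:
        return "chemical"
    if "list only of the diseases" in text:
        return "disease"
    return "unknown"

def per_task_metrics(results):
    """Micro-averaged precision/recall/F1 per task.

    `results` is an iterable of (prompt, expected_set, predicted_set) tuples.
    """
    counts = defaultdict(lambda: {"correct": 0, "predicted": 0, "expected": 0})
    for prompt, expected, predicted in results:
        bucket = counts[infer_task_type(prompt)]
        bucket["correct"] += len(expected & predicted)
        bucket["predicted"] += len(predicted)
        bucket["expected"] += len(expected)

    metrics = {}
    for task, c in counts.items():
        precision = c["correct"] / c["predicted"] if c["predicted"] else 0.0
        recall = c["correct"] / c["expected"] if c["expected"] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[task] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

To use it, the evaluation loop would need to store each sample's prompt together with its `expected_items` and `predicted_items` sets, then pass that list to `per_task_metrics`.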

## 7.5 False Positive Analysis

Analyze the types of errors the model is making to understand and improve performance.

In [None]:
# Detailed False Positive Analysis
print("="*80)
print("FALSE POSITIVE ANALYSIS")
print("="*80)

# Re-analyze test data to collect all false positives
all_false_positives = []
all_false_negatives = []
all_true_positives = []

for i, sample in enumerate(test_data[:num_test_samples]):
    prediction = generate_response(sample['prompt'])
    
    # Normalize for comparison
    expected_items = set([normalize_text(item) for item in sample['completion'].split('\n') if item.strip()])
    predicted_items = set([normalize_text(item) for item in prediction.split('\n') if item.strip()])
    
    common = expected_items & predicted_items
    false_positives = predicted_items - expected_items  # Model predicted but not in ground truth
    false_negatives = expected_items - predicted_items  # In ground truth but model missed
    
    all_false_positives.extend(false_positives)
    all_false_negatives.extend(false_negatives)
    all_true_positives.extend(common)

print(f"\n📊 Error Distribution:")
print(f"  True Positives:   {len(all_true_positives)} (Correct predictions)")
print(f"  False Positives:  {len(all_false_positives)} (Extra/wrong predictions)")
print(f"  False Negatives:  {len(all_false_negatives)} (Missed entities)")

# Calculate error rates
total_predictions = len(all_true_positives) + len(all_false_positives)
total_ground_truth = len(all_true_positives) + len(all_false_negatives)

false_positive_rate = len(all_false_positives) / total_predictions * 100 if total_predictions > 0 else 0
false_negative_rate = len(all_false_negatives) / total_ground_truth * 100 if total_ground_truth > 0 else 0

print(f"\n📈 Error Rates:")
print(f"  False Positive Rate: {false_positive_rate:.1f}% (of all predictions)")
print(f"  False Negative Rate: {false_negative_rate:.1f}% (of all expected)")

# Show example false positives
print(f"\n⚠️  Example False Positives (Extra predictions):")
for i, fp in enumerate(all_false_positives[:10], 1):
    print(f"  {i}. {fp}")

# Show example false negatives
print(f"\n❌ Example False Negatives (Missed entities):")
for i, fn in enumerate(all_false_negatives[:10], 1):
    print(f"  {i}. {fn}")

# Analysis insights
print(f"\n💡 Insights:")
if false_positive_rate > 20:
    print(f"  ⚠️  High false positive rate ({false_positive_rate:.1f}%)")
    print(f"     → Model is too aggressive, predicting entities that aren't in ground truth")
    print(f"     → Consider: More conservative prompting, post-processing filters, or additional training")
elif false_positive_rate < 10:
    print(f"  ✓ Low false positive rate ({false_positive_rate:.1f}%)")
    print(f"     → Model is conservative and precise")

if false_negative_rate > 20:
    print(f"  ⚠️  High false negative rate ({false_negative_rate:.1f}%)")
    print(f"     → Model is missing many expected entities")
    print(f"     → Consider: More training data, longer context, or prompt engineering")
elif false_negative_rate < 10:
    print(f"  ✓ Low false negative rate ({false_negative_rate:.1f}%)")
    print(f"     → Model has good recall")

print(f"\n🎯 Recommendations:")
if false_positive_rate > false_negative_rate:
    print(f"  Primary issue: TOO MANY FALSE POSITIVES")
    print(f"  Solutions:")
    print(f"    1. Add post-processing to filter common false positives")
    print(f"    2. Adjust generation parameters (lower temperature, lower top_p)")
    print(f"    3. Fine-tune with more negative examples")
    print(f"    4. Use stricter prompt instructions")
else:
    print(f"  Primary issue: TOO MANY FALSE NEGATIVES")
    print(f"  Solutions:")
    print(f"    1. Increase training data quantity")
    print(f"    2. Improve prompt clarity")
    print(f"    3. Check if test data format matches training data")
    print(f"    4. Consider ensemble methods")
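
Because `generate_response` samples (`do_sample=True`), calling it again in this cell can produce predictions that differ from those scored in Section 6, so the error counts here may not match the aggregate metrics. One option is to store each prediction once and run the error analysis on the stored strings. The sketch below is a hedged illustration using toy data; `collect_errors` is a hypothetical helper, and `predictions` stands in for outputs saved during the Section 6 loop.

```python
from collections import Counter

def collect_errors(samples, predictions, normalize=lambda s: s.lower().strip()):
    """Split cached predictions into true positives, false positives, false negatives."""
    tp, fp, fn = [], [], []
    for sample, prediction in zip(samples, predictions):
        expected = {normalize(x) for x in sample["completion"].split("\n") if x.strip()}
        predicted = {normalize(x) for x in prediction.split("\n") if x.strip()}
        tp.extend(expected & predicted)
        fp.extend(predicted - expected)
        fn.extend(expected - predicted)
    return tp, fp, fn

# Toy data standing in for test_data and for predictions saved during Section 6
samples = [{"completion": "aspirin\nibuprofen"}, {"completion": "metformin"}]
predictions = ["Aspirin\npain", "metformin\npain"]

tp, fp, fn = collect_errors(samples, predictions)
print(Counter(fp).most_common(5))  # most frequent spurious items first
```

Counting false positives with `Counter` also surfaces items the model over-predicts repeatedly, which is more informative than listing the first ten.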

## 8. Interpret the Metrics

### Accuracy
- **Formula**: `Correct / Total Expected`
- **Meaning**: Percentage of expected entities that were correctly predicted
- **Limitation**: Doesn't account for false positives (extra predictions). As computed in this notebook, it is identical to recall.

### Precision
- **Formula**: `Correct / Total Predicted`
- **Meaning**: Of all entities the model predicted, how many were correct?
- **High Precision**: Model rarely makes false positive errors (rarely predicts wrong entities)

### Recall
- **Formula**: `Correct / Total Expected`
- **Meaning**: Of all actual entities, how many did the model find?
- **High Recall**: Model rarely makes false negative errors (rarely misses entities)

### F1 Score
- **Formula**: `2 × (Precision × Recall) / (Precision + Recall)`
- **Meaning**: Harmonic mean that balances precision and recall
- **Best metric**: When you care equally about false positives and false negatives

**Example**:
```
Ground truth: ['aspirin', 'ibuprofen', 'NSAIDs']
Prediction:   ['aspirin', 'ibuprofen']

Accuracy:  66.7% (2/3 found)
Precision: 100% (2/2 predicted were correct)
Recall:    66.7% (2/3 actual entities found)
F1 Score:  80.0% (balanced metric)
```
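
The example can be reproduced with a few lines of set arithmetic. This is a minimal sketch of the same case-insensitive, exact-match scoring used in the evaluation loop; the function name `score` is illustrative, and it returns fractions rather than percentages.

```python
def score(expected, predicted):
    """Set-based precision, recall, and F1 with case-insensitive matching."""
    expected = {item.lower().strip() for item in expected}
    predicted = {item.lower().strip() for item in predicted}
    correct = len(expected & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

precision, recall, f1 = score(
    ["aspirin", "ibuprofen", "NSAIDs"],
    ["aspirin", "ibuprofen"],
)
print(f"Precision: {precision:.1%}  Recall: {recall:.1%}  F1: {f1:.1%}")
# Precision: 100.0%  Recall: 66.7%  F1: 80.0%
```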

## 9. Custom Test Cases: Comprehensive NER Evaluation

Test the model's ability to:
1. **Extract Chemicals** - Identify drug names and chemical compounds
2. **Extract Diseases** - Identify medical conditions and diseases
3. **Extract Relationships** - Identify which chemicals are related to which diseases

In [None]:
# Test 1: Chemical Extraction
print("="*80)
print("TEST 1: CHEMICAL EXTRACTION")
print("="*80)

chemical_test = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

A patient was treated with aspirin and ibuprofen for pain relief. The combination of these NSAIDs proved effective in reducing inflammation. Additionally, metformin was prescribed for glucose control.

List of extracted chemicals:
"""

print(f"\n📝 Prompt:\n{chemical_test}")
print("\n🤖 Model Output:")
print(generate_response(chemical_test))

In [None]:
# Test 2: Disease Extraction
print("\n" + "="*80)
print("TEST 2: DISEASE EXTRACTION")
print("="*80)

disease_test = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the diseases mentioned.

The patient presented with hypertension, diabetes mellitus, and chronic kidney disease. Laboratory findings revealed proteinuria and elevated creatinine levels, suggesting diabetic nephropathy.

List of extracted diseases:
"""

print(f"\n📝 Prompt:\n{disease_test}")
print("\n🤖 Model Output:")
print(generate_response(disease_test))

In [None]:
# Test 3: Chemical-Disease Relationship Extraction
print("\n" + "="*80)
print("TEST 3: RELATIONSHIP EXTRACTION - BASIC")
print("="*80)

relationship_test_1 = """The following article contains technical terms including diseases, drugs and chemicals. Extract the relationships between chemicals and diseases mentioned in the text.

Metformin is commonly prescribed for type 2 diabetes by improving insulin sensitivity and reducing hepatic glucose production. Aspirin is used in cardiovascular disease management in high-risk patients.

List the chemical-disease relationships:
"""

print(f"\n📝 Prompt:\n{relationship_test_1}")
print("\n🤖 Model Output:")
print(generate_response(relationship_test_1, max_new_tokens=600))

In [None]:
# Test 4: Multiple Relationship Extraction
print("\n" + "="*80)
print("TEST 4: RELATIONSHIP EXTRACTION - MULTIPLE PAIRS")
print("="*80)

relationship_test_2 = """The following article contains technical terms including diseases, drugs and chemicals. Identify all chemical-disease pairs and their relationships.

Long-term use of corticosteroids is associated with osteoporosis and increases the risk of bone fractures. NSAIDs are linked to chronic kidney disease and gastrointestinal bleeding in susceptible patients.

List of chemical-disease relationships:
"""

print(f"\n📝 Prompt:\n{relationship_test_2}")
print("\n🤖 Model Output:")
print(generate_response(relationship_test_2, max_new_tokens=600))

In [None]:
# Test 5: Complex Multi-Entity Relationship Extraction
print("\n" + "="*80)
print("TEST 5: COMPREHENSIVE EXTRACTION - ALL ENTITIES & RELATIONSHIPS")
print("="*80)

relationship_test_3 = """The following article contains technical terms including diseases, drugs and chemicals. Extract:
1. All chemicals mentioned
2. All diseases mentioned
3. All relationships between chemicals and diseases

The patient with rheumatoid arthritis was started on methotrexate for inflammatory joint disease. However, methotrexate is associated with hepatotoxicity and requires monitoring. The patient also has hypertension managed with lisinopril. Statins were prescribed for cardiovascular disease prevention given elevated cholesterol levels.

Extracted information:
"""

print(f"\n📝 Prompt:\n{relationship_test_3}")
print("\n🤖 Model Output:")
print(generate_response(relationship_test_3, max_new_tokens=800))

## 10. Suggested Next Steps

- Keep `num_test_samples = len(test_data)` (the default above) so the metrics cover the full test set; lower it only for quick smoke tests.
- Compare with the base model to quantify the lift from fine-tuning.
- Log metrics to Weights & Biases or another tracker for experiment history.
- Export predictions for manual spot checks with subject-matter experts.
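
For the export step, a simple option is one JSON object per line, matching the JSONL format of the test data. The sketch below assumes each sample has `prompt` and `completion` keys, as in the evaluation loop, and that predictions were collected in a list in the same order. The helper name `export_predictions` and the output field names are hypothetical.

```python
import json
from pathlib import Path

def export_predictions(samples, predictions, path):
    """Write one JSON object per line with prompt, expected, and predicted text."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for sample, prediction in zip(samples, predictions):
            row = {
                "prompt": sample["prompt"],
                "expected": sample["completion"],
                "predicted": prediction,
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path
```

A reviewer can then open the file in any JSONL viewer or load it with `json.loads` line by line to compare expected and predicted entities side by side.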

## 11. Usage Example (Optional)

How to load the model in a production script or service.

In [None]:
# Example: How to load and use the model later
usage_code = '''
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter from Hub
model = PeftModel.from_pretrained(
    base_model,
    "your-username/llama3-medical-ner-lora"  # Your model ID
)
model.eval()

# Use the model
prompt = """The following article contains technical terms including diseases, drugs and chemicals. Create a list only of the chemicals mentioned.

Patient was treated with metformin and insulin for diabetes management.

List of extracted chemicals:
"""

# Generate response
# ... (use the generate_response function from above)
'''

print("Usage Example:")
print("="*80)
print(usage_code)

---

## Summary

This notebook:
1. ✅ Configured environment variables and authentication for Hugging Face and W&B.
2. ✅ Installed required evaluation dependencies.
3. ✅ Loaded the fine-tuned medical NER model (base + LoRA adapter).
4. ✅ Evaluated performance on unseen test samples with detailed metrics.
5. ✅ Aggregated precision, recall, and F1 across all evaluated examples.
6. ✅ Validated behaviour on curated chemical, disease, and relationship prompts.
7. ✅ Outlined next steps and provided a ready-to-use inference snippet.

**Your medical NER evaluation workflow is ready! 🚀**