# QLoRA Fine-Tuned Model Performance on AG News Dataset

## Overview

This notebook evaluates the **QLoRA fine-tuned Qwen2.5-7B-Instruct model** on the AG News test set and compares it against the untuned base model.

| Aspect | Details |
|--------|--------|
| **Base Model** | unsloth/Qwen2.5-7B-Instruct |
| **Fine-Tuning Method** | QLoRA (4-bit quantized base + LoRA adapters) |
| **Adapter Name** | qlora-ag-news |
| **Task** | 4-class text classification (AG News) |
| **Test Set Size** | 7,600 samples |
| **Max Concurrency** | 64 workers |

## Base Model Performance (Baseline)

| Metric | Base Model |
|--------|------------|
| **Accuracy** | 78.76% |
| **F1 (macro)** | 77.97% |
| **F1 (weighted)** | 77.97% |
| **Sci/Tech F1** | 62.06% |
| **Business Precision** | 63.66% |
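
Macro and weighted F1 are identical in the table above because the AG News test set is balanced (1,900 articles per class), so the support-weighted average of the per-class scores equals the plain average. A minimal sketch of that identity, using a made-up confusion matrix (the counts are hypothetical and are not the base model's results):

```python
# Hypothetical 4x4 confusion matrix (rows = true class, columns = predicted).
# Every row sums to 100, so the classes are balanced like the AG News test set.
cm = [
    [90, 2, 5, 3],
    [1, 97, 1, 1],
    [6, 1, 80, 13],
    [4, 1, 25, 70],
]

def per_class_f1(cm):
    """F1 for each class k: 2*TP / (2*TP + FP + FN)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # rest of row k
        fp = sum(cm[i][k] for i in range(n)) - tp   # rest of column k
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores

f1 = per_class_f1(cm)
support = [sum(row) for row in cm]

macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro={macro_f1:.4f}  weighted={weighted_f1:.4f}")
```

With unequal class sizes the two averages would generally differ, which is why both are reported.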

## Target Performance

| Metric | Target |
|--------|---------|
| **Accuracy** | >85% |
| **Sci/Tech F1** | >75% |
| **Business Precision** | >75% |

---

## Prerequisites

This notebook uses **vLLM with QLoRA adapter support** for fast parallel inference.

Start the server with QLoRA adapter enabled:

```bash
cd 6-open-source
./start_docker.sh start qwen7b-qlora
```

Wait for the server to be ready, then run this notebook.
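
One way to wait for readiness is to poll the server until it answers. The sketch below assumes the container exposes vLLM's `/health` endpoint on `localhost:8000` (the same host and port this notebook uses for the API). The helper names `server_is_up` and `wait_until_ready` are illustrative and are not part of `start_docker.sh`.

```python
import time
import urllib.error
import urllib.request

def server_is_up(url="http://localhost:8000/health", timeout_s=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False

def wait_until_ready(probe=server_is_up, max_wait_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call probe() every interval_s seconds until it succeeds or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` before the first request returns `True` once the server responds, or `False` if it is still unreachable after ten minutes. Loading a 7B model can take several minutes, so the default wait is deliberately generous.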

## Setup and Dependencies

In [None]:
import asyncio
import time
import json
import httpx
from dataclasses import dataclass
from typing import Optional
from enum import Enum

from openai import OpenAI, AsyncOpenAI
from pydantic import BaseModel
from datasets import load_dataset
from tqdm.asyncio import tqdm_asyncio

# Fix for Jupyter's event loop - allows nested async calls
import nest_asyncio
nest_asyncio.apply()

print("Libraries loaded successfully!")

## Configuration

In [None]:
# vLLM Server Configuration
VLLM_BASE_URL = "http://localhost:8000/v1"

# Model names - use the LoRA adapter name for fine-tuned inference
BASE_MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct"  # Base model
QLORA_MODEL_NAME = "qlora-ag-news"  # LoRA adapter name (registered with vLLM)

# Use the QLoRA adapter for this evaluation
MODEL_NAME = QLORA_MODEL_NAME

# Inference configuration
MAX_WORKERS = 64  # Maximum parallel requests (matches vLLM max_num_seqs)

# Label mapping for AG News
LABEL_NAMES = {
    0: "World",
    1: "Sports", 
    2: "Business",
    3: "Sci/Tech"
}

NAME_TO_LABEL = {v: k for k, v in LABEL_NAMES.items()}

print(f"vLLM Server: {VLLM_BASE_URL}")
print(f"Model (QLoRA adapter): {MODEL_NAME}")
print(f"Max parallel workers: {MAX_WORKERS}")

## Verify vLLM Server Connection

In [None]:
# Create client with extended timeout for batch processing
client = OpenAI(
    base_url=VLLM_BASE_URL,
    api_key="not-needed",  # placeholder: vLLM ignores it unless the server was started with --api-key
    timeout=httpx.Timeout(120.0, connect=10.0)
)

# Verify connection
try:
    models = client.models.list()
    print("Connected to vLLM server!")
    print("\nAvailable models:")
    for model in models.data:
        print(f"  - {model.id}")
    
    # Check if our LoRA adapter is available
    model_ids = [m.id for m in models.data]
    if QLORA_MODEL_NAME in model_ids:
        print(f"\n✓ QLoRA adapter '{QLORA_MODEL_NAME}' is available!")
    else:
        print(f"\n✗ QLoRA adapter '{QLORA_MODEL_NAME}' not found!")
        print("Make sure to start vLLM with: ./start_docker.sh start qwen7b-qlora")
except Exception as e:
    print(f"✗ Failed to connect to vLLM server: {e}")
    print("\nMake sure the server is running:")
    print("  cd 6-open-source && ./start_docker.sh start qwen7b-qlora")

## Load Test Dataset

In [None]:
# Load AG News dataset from Hugging Face
dataset = load_dataset("ag_news")
test_data = dataset["test"]

print(f"Test dataset loaded!")
print(f"  Total test samples: {len(test_data):,}")

# Show distribution
from collections import Counter
test_counts = Counter(test_data["label"])
print(f"\nCategory distribution:")
for label in sorted(LABEL_NAMES.keys()):
    print(f"  {LABEL_NAMES[label]}: {test_counts[label]:,}")

## Define Prompts and Structured Output Schema

In [None]:
# System prompt (same as used for training and base model evaluation)
SYSTEM_PROMPT = """You are a news article classifier. Your task is to categorize news articles into exactly one of four categories:

- World: News about politics, government, elections, diplomacy, conflicts, and public affairs (domestic or international)
- Sports: News about athletic events, games, players, teams, coaches, tournaments, and championships
- Business: News about companies, markets, finance, economy, trade, corporate activities, and business services
- Sci/Tech: News about technology products, software, hardware, scientific research, gadgets, and tech innovations

Rules:
- Focus on the PRIMARY topic of the article
- Ignore HTML artifacts (like #39; or &lt;b&gt;) - they are formatting errors
- If an article is truncated, classify based on the available content
- When a topic spans multiple categories, choose the one that best represents the main focus"""

# Pydantic schema for structured output
class NewsCategory(str, Enum):
    WORLD = "World"
    SPORTS = "Sports"
    BUSINESS = "Business"
    SCI_TECH = "Sci/Tech"

class ClassificationResponse(BaseModel):
    category: NewsCategory

print("Prompts and schema defined.")
print(f"\nValid categories: {[c.value for c in NewsCategory]}")

## Classification Function with Structured Output

In [None]:
# Create async client
async_client = AsyncOpenAI(
    base_url=VLLM_BASE_URL,
    api_key="not-needed",
    timeout=httpx.Timeout(120.0, connect=10.0)
)

@dataclass
class ClassificationOutput:
    """Result of a single classification"""
    index: int
    ground_truth: int
    predicted: Optional[int] = None
    predicted_name: Optional[str] = None
    ground_truth_name: str = ""
    raw_output: str = ""
    error: Optional[str] = None
    latency_ms: float = 0.0

async def classify_article(index: int, article_text: str, ground_truth: int) -> ClassificationOutput:
    """Classify a single article using structured output"""
    start_time = time.perf_counter()
    
    try:
        response = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Classify the following news article:\n\n{article_text}"}
            ],
            max_tokens=20,
            temperature=0.1,
            extra_body={
                "guided_json": ClassificationResponse.model_json_schema()
            }
        )
        
        latency = (time.perf_counter() - start_time) * 1000
        raw_output = response.choices[0].message.content
        
        # Parse structured output
        parsed = ClassificationResponse.model_validate_json(raw_output)
        predicted_name = parsed.category.value
        predicted_label = NAME_TO_LABEL.get(predicted_name)
        
        return ClassificationOutput(
            index=index,
            ground_truth=ground_truth,
            predicted=predicted_label,
            predicted_name=predicted_name,
            ground_truth_name=LABEL_NAMES[ground_truth],
            raw_output=raw_output,
            latency_ms=latency
        )
        
    except Exception as e:
        latency = (time.perf_counter() - start_time) * 1000
        return ClassificationOutput(
            index=index,
            ground_truth=ground_truth,
            ground_truth_name=LABEL_NAMES[ground_truth],
            error=str(e),
            latency_ms=latency
        )

print("Classification function ready!")
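
The parsing step in `classify_article` turns the model's raw JSON text into an integer label. The sketch below mirrors that step using only the standard library, as a stand-in for the pydantic validation above. `parse_category` is a hypothetical helper. Unlike the notebook code, which raises and records an error, it returns `None` for unusable output.

```python
import json

LABEL_NAMES = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
NAME_TO_LABEL = {name: idx for idx, name in LABEL_NAMES.items()}

def parse_category(raw_output):
    """Map raw JSON text such as '{"category": "Sports"}' to a label id, or None."""
    try:
        payload = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None  # truncated or non-JSON output, or raw_output is None
    if not isinstance(payload, dict):
        return None
    category = payload.get("category")
    if not isinstance(category, str):
        return None
    return NAME_TO_LABEL.get(category)  # None for names outside the four classes

print(parse_category('{"category": "Sci/Tech"}'))  # 3
print(parse_category('{"category": "Politics"}'))  # None
print(parse_category('{"category": "Sci'))         # None (truncated JSON)
```

With guided decoding enabled, the second and third cases should be rare. Truncation can still occur if `max_tokens` is too small for the JSON object.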

## Quick Test: Single Article Classification

In [None]:
# Test with a single article
test_article = test_data[0]
print(f"Test article:")
print(f"Text: {test_article['text'][:200]}...")
print(f"Ground Truth: {LABEL_NAMES[test_article['label']]}")

# Run single test
result = asyncio.get_event_loop().run_until_complete(
    classify_article(0, test_article["text"], test_article["label"])
)

print(f"\n--- Result ---")
print(f"Predicted: {result.predicted_name}")
print(f"Ground Truth: {result.ground_truth_name}")
print(f"Correct: {'✓' if result.predicted == result.ground_truth else '✗'}")
print(f"Latency: {result.latency_ms:.0f}ms")
if result.error:
    print(f"Error: {result.error}")

## Batch Classification with Parallel Workers

In [None]:
async def classify_batch(
    data,
    max_workers: int = MAX_WORKERS,
    show_progress: bool = True
) -> list[ClassificationOutput]:
    """
    Classify a batch of articles with parallel workers.
    
    Args:
        data: Dataset with 'text' and 'label' fields
        max_workers: Maximum concurrent requests
        show_progress: Show progress bar
    
    Returns:
        List of ClassificationOutput results
    """
    # Create semaphore to limit concurrency
    semaphore = asyncio.Semaphore(max_workers)
    
    async def classify_with_semaphore(index: int, article_text: str, ground_truth: int):
        async with semaphore:
            return await classify_article(index, article_text, ground_truth)
    
    # Create all tasks
    tasks = [
        classify_with_semaphore(i, example["text"], example["label"])
        for i, example in enumerate(data)
    ]
    
    # Run with progress bar
    if show_progress:
        results = await tqdm_asyncio.gather(*tasks, desc="Classifying")
    else:
        results = await asyncio.gather(*tasks)
    
    return results

def run_classification(data, max_workers: int = MAX_WORKERS) -> list[ClassificationOutput]:
    """Synchronous wrapper for batch classification"""
    return asyncio.get_event_loop().run_until_complete(
        classify_batch(data, max_workers)
    )

print("Batch classification function ready!")
print(f"  - Max parallel workers: {MAX_WORKERS}")
print(f"  - Test set size: {len(test_data):,} articles")
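
The semaphore pattern in `classify_batch` can be checked without a server. The sketch below replaces the HTTP call with `asyncio.sleep` and records how many fake requests are in flight at once. `run_bounded` and `fake_request` are illustrative names, not part of the notebook.

```python
import asyncio

async def run_bounded(n_tasks, limit, work_s=0.01):
    """Run n_tasks fake requests with at most `limit` in flight.

    Returns (peak concurrency observed, results in submission order).
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_request(i):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_s)  # stands in for the HTTP round trip
            in_flight -= 1
            return i

    # gather() returns results in the order the coroutines were passed in
    results = await asyncio.gather(*(fake_request(i) for i in range(n_tasks)))
    return peak, results
```

In a plain script this can be run with `asyncio.run(run_bounded(20, 4))`. Inside the notebook, `await run_bounded(20, 4)` works directly in a cell. All coroutines are created up front, but only `limit` of them hold the semaphore at any moment, which is what keeps the request count at or below `MAX_WORKERS`.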

## Run Full Test Set Classification

In [None]:
print("=" * 70)
print("RUNNING CLASSIFICATION ON FULL TEST SET (QLoRA Fine-Tuned)")
print("=" * 70)
print(f"\nTest set size: {len(test_data):,} articles")
print(f"Max parallel workers: {MAX_WORKERS}")
print(f"Model: {MODEL_NAME}")
print("\nStarting classification...\n")

# Start the wall-clock timer
start_time = time.time()

# Run classification
results = run_classification(test_data, max_workers=MAX_WORKERS)

# Calculate total time
total_time = time.time() - start_time

# Summary
successful = [r for r in results if r.error is None]
failed = [r for r in results if r.error is not None]

print("\n" + "=" * 70)
print("CLASSIFICATION COMPLETE")
print("=" * 70)
print(f"Total time: {total_time:.1f} seconds ({total_time/60:.1f} minutes)")
print(f"Throughput: {len(test_data)/total_time:.1f} articles/second")
print(f"Successful: {len(successful):,} ({len(successful)/len(results)*100:.1f}%)")
print(f"Failed: {len(failed):,} ({len(failed)/len(results)*100:.1f}%)")

## Model Performance Evaluation

In [None]:
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score, 
    classification_report,
    confusion_matrix
)

# Extract predictions (only from successful results)
y_true = [r.ground_truth for r in successful]
y_pred = [r.predicted for r in successful]

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision_macro = precision_score(y_true, y_pred, average='macro')
precision_weighted = precision_score(y_true, y_pred, average='weighted')
recall_macro = recall_score(y_true, y_pred, average='macro')
recall_weighted = recall_score(y_true, y_pred, average='weighted')
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_weighted = f1_score(y_true, y_pred, average='weighted')

print("\n" + "="*60)
print("PERFORMANCE METRICS")
print("="*60)
print(f"\n{'Metric':<25} {'Value':>15}")
print("-" * 40)
print(f"{'Accuracy':<25} {accuracy:>14.2%}")
print(f"{'Precision (macro)':<25} {precision_macro:>14.2%}")
print(f"{'Precision (weighted)':<25} {precision_weighted:>14.2%}")
print(f"{'Recall (macro)':<25} {recall_macro:>14.2%}")
print(f"{'Recall (weighted)':<25} {recall_weighted:>14.2%}")
print(f"{'F1 Score (macro)':<25} {f1_macro:>14.2%}")
print(f"{'F1 Score (weighted)':<25} {f1_weighted:>14.2%}")

In [None]:
# Detailed classification report
print("\n" + "="*60)
print("CLASSIFICATION REPORT BY CATEGORY")
print("="*60 + "\n")

target_names = [LABEL_NAMES[i] for i in range(4)]
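# Pass labels explicitly so the report always has four rows, even if a class is never predicted
print(classification_report(y_true, y_pred, labels=list(range(4)), target_names=target_names, digits=4))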

## Confusion Matrix Visualization

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Compute confusion matrix
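# Pass labels explicitly so the matrix is always 4x4 (the annotation loop below assumes that shape)
cm = confusion_matrix(y_true, y_pred, labels=list(range(4)))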

# Plot
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(cm, interpolation='nearest', cmap='Blues')
ax.figure.colorbar(im, ax=ax)

# Labels
ax.set(
    xticks=np.arange(4),
    yticks=np.arange(4),
    xticklabels=target_names,
    yticklabels=target_names,
    ylabel='True Label',
    xlabel='Predicted Label',
    title='QLoRA Fine-Tuned Model - Confusion Matrix'
)

# Rotate x labels
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Add text annotations
thresh = cm.max() / 2.
for i in range(4):
    for j in range(4):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black",
                fontsize=14)

plt.tight_layout()
plt.savefig('qlora_confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✓ Confusion matrix saved to 'qlora_confusion_matrix.png'")

## Performance Comparison: Base Model vs QLoRA

In [None]:
# Base model results (from base_model_performance.ipynb)
base_model_metrics = {
    "accuracy": 0.7876,
    "precision_macro": 0.8198,
    "recall_macro": 0.7876,
    "f1_macro": 0.7797,
    "f1_weighted": 0.7797,
}

# QLoRA fine-tuned results
qlora_metrics = {
    "accuracy": accuracy,
    "precision_macro": precision_macro,
    "recall_macro": recall_macro,
    "f1_macro": f1_macro,
    "f1_weighted": f1_weighted,
}

print("="*70)
print("PERFORMANCE COMPARISON: BASE MODEL vs QLoRA FINE-TUNED")
print("="*70)
print(f"\n{'Metric':<25} {'Base Model':>15} {'QLoRA':>15} {'Δ Change':>15}")
print("-" * 70)

for metric in ["accuracy", "precision_macro", "recall_macro", "f1_macro", "f1_weighted"]:
    base_val = base_model_metrics[metric]
    qlora_val = qlora_metrics[metric]
    delta = qlora_val - base_val
    delta_str = f"{delta * 100:+.2f} pp"  # absolute change in percentage points
    
    metric_name = metric.replace("_", " ").title()
    print(f"{metric_name:<25} {base_val:>14.2%} {qlora_val:>14.2%} {delta_str:>15}")

# Calculate overall improvement
accuracy_improvement = (qlora_metrics["accuracy"] - base_model_metrics["accuracy"]) / base_model_metrics["accuracy"] * 100
f1_improvement = (qlora_metrics["f1_macro"] - base_model_metrics["f1_macro"]) / base_model_metrics["f1_macro"] * 100

print(f"\n" + "="*70)
print(f"RELATIVE IMPROVEMENT")
print(f"  Accuracy:  {accuracy_improvement:+.1f}%")
print(f"  F1 (macro): {f1_improvement:+.1f}%")

## Save Results

In [None]:
# Create comprehensive results dictionary
qlora_results = {
    "model_name": BASE_MODEL_NAME,
    "adapter_name": QLORA_MODEL_NAME,
    "model_type": "qlora_fine_tuned",
    "test_set_size": len(test_data),
    "successful_predictions": len(successful),
    "failed_predictions": len(failed),
    "total_time_seconds": total_time,
    "throughput_per_second": len(test_data) / total_time,
    "avg_latency_ms": sum(r.latency_ms for r in successful) / len(successful) if successful else 0,
    "metrics": {
        "accuracy": accuracy,
        "precision_macro": precision_macro,
        "precision_weighted": precision_weighted,
        "recall_macro": recall_macro,
        "recall_weighted": recall_weighted,
        "f1_macro": f1_macro,
        "f1_weighted": f1_weighted,
    },
    "comparison_vs_base": {
        "accuracy_delta": accuracy - base_model_metrics["accuracy"],
        "f1_macro_delta": f1_macro - base_model_metrics["f1_macro"],
        "accuracy_improvement_pct": accuracy_improvement,
        "f1_improvement_pct": f1_improvement,
    }
}

# Save to JSON
with open("qlora_fine_tuned_results.json", "w") as f:
    json.dump(qlora_results, f, indent=2)

print("Results saved to 'qlora_fine_tuned_results.json'")

# Print summary
print("\n" + "="*70)
print("FINAL SUMMARY")
print("="*70)
print(f"""
Model: {BASE_MODEL_NAME}
Adapter: {QLORA_MODEL_NAME}

PERFORMANCE METRICS:
--------------------
Accuracy:           {accuracy:.2%}
F1 Score (macro):   {f1_macro:.2%}
F1 Score (weighted):{f1_weighted:.2%}

RELATIVE IMPROVEMENT vs BASE MODEL:
-----------------------------------
Accuracy:   {accuracy_improvement:+.1f}%
F1 (macro): {f1_improvement:+.1f}%

INFERENCE STATS:
----------------
Test samples:       {len(test_data):,}
Total time:         {total_time:.1f}s
Throughput:         {len(test_data)/total_time:.1f} articles/sec
""")

## Conclusions

### QLoRA Fine-Tuning Results

| Metric | Base Model | QLoRA Fine-Tuned | Target | Status |
|--------|------------|------------------|--------|--------|
| **Accuracy** | 78.76% | **95.14%** | >85% | **Exceeded** |
| **F1 (macro)** | 77.97% | **95.13%** | - | **+17.16 pp** |
| **Sci/Tech F1** | 62.06% | **~93.2%** | >75% | **Exceeded** |
| **Business Precision** | 63.66% | **~95.8%** | >75% | **Exceeded** |
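
The Status column can be reproduced with a small check of the reported values against the targets. This is a sketch only: the dictionary keys and `check_targets` are illustrative names, and the Sci/Tech F1 and Business precision values are the approximate ("~") figures from the table.

```python
# Targets from the "Target Performance" table (strict lower bounds)
TARGETS = {"accuracy": 0.85, "sci_tech_f1": 0.75, "business_precision": 0.75}

# Values from the results table; the last two are approximate
REPORTED = {"accuracy": 0.9514, "sci_tech_f1": 0.932, "business_precision": 0.958}

def check_targets(reported, targets):
    """Return {metric: (value, target, margin, passed)} for every target."""
    return {
        name: (reported[name], target, reported[name] - target, reported[name] > target)
        for name, target in targets.items()
    }

for name, (value, target, margin, passed) in check_targets(REPORTED, TARGETS).items():
    status = "exceeded" if passed else "missed"
    print(f"{name:<20} {value:.2%} vs >{target:.0%}  margin {margin * 100:+.1f} pp  {status}")
```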

### Performance Improvement Summary

| Metric | Absolute Improvement (percentage points) | Relative Improvement |
|--------|------------------------------------------|----------------------|
| Accuracy | +16.38 | **+20.8%** |
| F1 (macro) | +17.16 | **+22.0%** |
| Precision (macro) | +13.25 | +16.2% |
| Recall (macro) | +16.38 | +20.8% |
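
The two columns measure different things: the absolute improvement is a difference in percentage points, while the relative improvement divides that difference by the base model's value. A short worked example with the accuracy and macro-F1 figures reported in this notebook (`improvement` is an illustrative helper name):

```python
def improvement(base, tuned):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = (tuned - base) * 100
    relative_pct = (tuned - base) / base * 100
    return absolute_pp, relative_pct

acc_abs, acc_rel = improvement(0.7876, 0.9514)  # accuracy: base vs QLoRA
f1_abs, f1_rel = improvement(0.7797, 0.9513)    # macro F1: base vs QLoRA

print(f"Accuracy:   {acc_abs:+.2f} pp, {acc_rel:+.1f}% relative")  # +16.38 pp, +20.8% relative
print(f"F1 (macro): {f1_abs:+.2f} pp, {f1_rel:+.1f}% relative")    # +17.16 pp, +22.0% relative
```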

### Per-Category Performance (from Confusion Matrix)

| Category | Correct | Total | Recall | Key Observations |
|----------|---------|-------|--------|------------------|
| **World** | 1,818 | 1,900 | 95.7% | Excellent classification |
| **Sports** | 1,894 | 1,900 | **99.7%** | Near-perfect performance |
| **Business** | 1,690 | 1,900 | 88.9% | Most challenging category |
| **Sci/Tech** | 1,829 | 1,900 | 96.3% | Up from 46.8% recall with the base model |
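
Each percentage in this table is a diagonal count of the confusion matrix divided by its row total, that is, per-class recall. Because every class has 1,900 test articles, the four counts also add up to the overall accuracy. A quick check of that arithmetic, using only the counts from the table:

```python
# Correct predictions per true class, as listed in the table above
correct = {"World": 1818, "Sports": 1894, "Business": 1690, "Sci/Tech": 1829}
total_per_class = 1900  # AG News test set: 1,900 articles per class

recall = {name: hits / total_per_class for name, hits in correct.items()}
overall_accuracy = sum(correct.values()) / (total_per_class * len(correct))

for name, value in recall.items():
    print(f"{name:<10} {value:.1%}")
print(f"{'Overall':<10} {overall_accuracy:.2%}")  # 7,231 / 7,600 = 95.14%
```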

### Key Observations

1. **All Targets Exceeded**: Accuracy reached 95.14% against the 85% target, and Sci/Tech F1 (~93.2%) and Business precision (~95.8%) both cleared their 75% targets.

2. **Sci/Tech Category Transformed**: 
   - Base model: 62.06% F1 (major weakness with only 46.84% recall)
   - QLoRA: ~93.2% F1 with 96.3% recall
   - This was the biggest improvement area

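3. **Business-Sci/Tech Confusion Reduced but Persists**:
   - 155 of the 210 misclassified Business articles were labeled Sci/Tech
   - This reflects genuine tech-business overlap (e.g., news about tech companies)
   - Confusion between the two classes is still far lower than for the base model (Business precision 63.66%, Sci/Tech recall 46.84%)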

4. **Sports Nearly Perfect**: 99.7% of Sports articles classified correctly (1,894/1,900), the clearest category

5. **Inference Speed**: 
   - 33.6 articles/second with vLLM parallel processing
   - Total evaluation: 226 seconds (~3.8 minutes) for 7,600 articles
   - 100% success rate
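
The inference figures in the last item follow directly from the article count and the wall-clock time. As a quick check (226 seconds is the rounded total reported above):

```python
n_articles = 7600
total_seconds = 226  # rounded wall-clock time for the full test set

throughput = n_articles / total_seconds  # articles per second
total_minutes = total_seconds / 60

print(f"Throughput: {throughput:.1f} articles/second")  # 33.6
print(f"Total time: {total_minutes:.1f} minutes")        # 3.8
```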

### QLoRA Training Summary

| Aspect | Value |
|--------|-------|
| Training Time | ~6 hours |
| Final Loss | 0.4625 |
| Adapter Size | 177.42 MB |
| Trainable Parameters | 40.4M (0.53% of model) |
| Memory Efficiency | 4-bit quantized base model |
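
The trainable-parameter share and the adapter's size relative to the full model can be checked with a few lines of arithmetic. The 7.6B total parameter count is the figure quoted in the Conclusion. The 2 bytes per parameter (16-bit weights) is an assumption used here to estimate the size of the full checkpoint.

```python
trainable_params = 40.4e6  # LoRA adapter parameters (from the table above)
total_params = 7.6e9       # total model parameters, as quoted in the Conclusion
adapter_mb = 177.42        # adapter size on disk

trainable_fraction = trainable_params / total_params

# Assumption: the full model is stored as 16-bit weights (2 bytes per parameter)
full_model_gb = total_params * 2 / 1e9
size_ratio = (adapter_mb / 1000) / full_model_gb

print(f"Trainable share:      {trainable_fraction:.2%}")  # 0.53%
print(f"Full model (16-bit):  {full_model_gb:.1f} GB")     # 15.2 GB
print(f"Adapter / full model: {size_ratio:.1%}")           # 1.2%
```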

### Conclusion

QLoRA fine-tuning of Qwen2.5-7B-Instruct gave **strong results** on AG News classification:
- **95.14% accuracy** (vs 78.76% for the base model), a gain of 16.38 percentage points or **20.8% relative**
- All three target metrics exceeded
- Only **0.53%** of the parameters were trained (40.4M of 7.6B)
- The adapter is only **177 MB**, versus ~15 GB for the full model

These results indicate that QLoRA is an effective way to adapt a general instruction-tuned model to a specific classification task: accuracy rose to 95.14% while training only a small adapter on top of a 4-bit quantized base model.