# LLM Model Comparison

This notebook analyses the benchmark results from our LLM comparison framework.
We compare four models across multiple dimensions:

| Model | Type | Hosting | Parameters |
|-------|------|---------|------------|
| **Mistral 7B** | Open-source | LocalAI (self-hosted) | 7B |
| **LLaMA 3 8B** | Open-source | LocalAI (self-hosted) | 8B |
| **Phi-3 Mini** | Open-source | LocalAI (self-hosted) | 3.8B |
| **GPT-4o-mini** | Proprietary | OpenAI API (external) | N/A |

**Metrics measured:**
- **Latency**: Total response time, time to first token, tokens per second
- **Quality**: Relevance, completeness, coherence (scored 0-1)
- **Turkish language support**: Quality of Turkish-language responses
- **Memory usage**: RAM delta during inference

> Benchmark data is loaded from `llm-comparison/benchmark_results.json`,
> generated by `compare_models.py` running against a 3-node k3s cluster with 32GB RAM.

In [None]:
"""
Import Dependencies
-------------------
We use json for loading benchmark data and basic Python for tabular display.
matplotlib is optional -- if available, we generate charts; otherwise we
fall back to text-based tables.
"""

import json
from pathlib import Path

# Optional: matplotlib for charts
try:
    import matplotlib.pyplot as plt
    import matplotlib
    matplotlib.rcParams["figure.figsize"] = (12, 6)
    matplotlib.rcParams["font.size"] = 11
    MATPLOTLIB_AVAILABLE = True
    print("matplotlib available -- charts will be rendered")
except ImportError:
    MATPLOTLIB_AVAILABLE = False
    print("matplotlib not available -- using text tables")

## Loading Benchmark Results

The benchmark results are stored in `llm-comparison/benchmark_results.json`.
This file is generated by `compare_models.py` which sends standardised prompts
(English and Turkish) to each model via their OpenAI-compatible APIs and records
latency, throughput, quality metrics, and memory usage.
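
The cells below read a fixed set of keys from every model entry. As a minimal sketch, a pre-flight check could confirm those keys exist before any table is printed. The key list is inferred from this notebook's own lookups, not from `compare_models.py`, and `REQUIRED_MODEL_KEYS`, `find_missing_keys`, and the sample record are hypothetical names and data used only for illustration.

```python
# Keys this notebook reads unconditionally from each model entry
# (inferred from the cells below; an assumption about the real schema).
REQUIRED_MODEL_KEYS = {
    "model_name", "model_id", "is_external",
    "avg_total_seconds", "avg_tokens_per_second", "error_count",
    "avg_quality_overall", "avg_quality_en", "avg_quality_tr",
    "prompt_results",
}


def find_missing_keys(benchmark: dict) -> dict[str, list[str]]:
    """Map each model's name to the required keys absent from its entry."""
    missing = {}
    for index, model in enumerate(benchmark.get("models", [])):
        absent = sorted(REQUIRED_MODEL_KEYS.difference(model))
        if absent:
            # Fall back to the list position when the name itself is missing
            missing[model.get("model_name", f"models[{index}]")] = absent
    return missing


# Hypothetical, deliberately incomplete record (not real benchmark data)
sample = {
    "benchmark_metadata": {"timestamp": "n/a", "tool": "compare_models.py"},
    "models": [{"model_name": "example-model", "model_id": "example"}],
}
print(find_missing_keys(sample))
```

Running the same function on the loaded `benchmark_data` would return an empty dict when every entry carries the expected keys.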

In [None]:
# ---------------------------------------------------------------------------
# Load benchmark_results.json
# ---------------------------------------------------------------------------

# Path relative to the notebook's working directory
BENCHMARK_PATH = Path("../llm-comparison/benchmark_results.json")

# Fall back to the absolute path on the lab machine if the relative one is missing
if not BENCHMARK_PATH.exists():
    BENCHMARK_PATH = Path("/home/homedevlab/kubernetes/kubernetes/ai-stack-k8s/llm-comparison/benchmark_results.json")

with open(BENCHMARK_PATH, "r", encoding="utf-8") as f:
    benchmark_data = json.load(f)

# Extract metadata and model results
metadata = benchmark_data["benchmark_metadata"]
models = benchmark_data["models"]

print("Benchmark Metadata:")
print(f"  Timestamp: {metadata['timestamp']}")
print(f"  Tool:      {metadata['tool']}")
if "notes" in metadata:
    print(f"  Notes:     {metadata['notes']}")
print(f"\nModels loaded: {len(models)}")
for m in models:
    hosting = "External API" if m["is_external"] else "Self-hosted (LocalAI)"
    print(f"  - {m['model_name']} ({m['model_id']}) -- {hosting}")

## Latency Comparison

Latency is critical for interactive applications. We measure three dimensions:
- **Total response time** -- End-to-end time from request to complete response
- **Time to first token (TTFT)** -- How quickly the model starts streaming
- **Tokens per second** -- Throughput of token generation
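
The three latency figures above can all be derived from a few timestamps taken around one streamed request. The sketch below is an illustration only: `StreamTiming` and `latency_metrics` are hypothetical names, and it assumes throughput is computed over the whole request duration. `compare_models.py` may instead divide by the generation phase only.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Timestamps (in seconds) captured around one streamed completion."""
    request_sent: float
    first_token_at: float
    response_done: float
    completion_tokens: int


def latency_metrics(timing: StreamTiming) -> dict[str, float]:
    """Derive total time, TTFT, and tokens per second from raw timestamps."""
    total = timing.response_done - timing.request_sent
    ttft = timing.first_token_at - timing.request_sent
    # Assumption: throughput counts tokens over the full request duration
    tps = timing.completion_tokens / total if total > 0 else 0.0
    return {
        "total_seconds": total,
        "first_token_seconds": ttft,
        "tokens_per_second": tps,
    }


# Made-up timings: first token after 0.5 s, 90 tokens finished at 4.5 s
print(latency_metrics(StreamTiming(0.0, 0.5, 4.5, 90)))
```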

In [None]:
# ---------------------------------------------------------------------------
# Latency Comparison Table
# ---------------------------------------------------------------------------

def print_latency_table(models: list[dict]) -> None:
    """Display a formatted latency comparison table."""
    print("\nLatency Comparison")
    print("=" * 85)
    header = (
        f"{'Model':<18} {'Avg Total (s)':>14} {'TTFT (s)':>10} "
        f"{'Tok/s':>8} {'Memory (MB)':>12} {'Errors':>8}"
    )
    print(header)
    print("-" * 85)

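    for m in models:
        ttft = m.get("avg_first_token_seconds")
        # Compare against None so a legitimate 0.0 is not reported as N/A
        ttft_str = f"{ttft:.3f}" if ttft is not None else "N/A"
        # Treat a missing or null memory reading as 0 so formatting cannot fail
        mem_delta = m.get("memory_delta_used_mb") or 0

        # Field widths match the header columns: 18, 14, 10, 8, 12, 8
        print(
            f"{m['model_name']:<18} "
            f"{m['avg_total_seconds']:>13.3f}s "
            f"{ttft_str:>10} "
            f"{m['avg_tokens_per_second']:>8.1f} "
            f"{mem_delta:>12.1f} "
            f"{m['error_count']:>8}"
        )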

    print("=" * 85)

    # Identify the fastest and slowest
    sorted_by_speed = sorted(models, key=lambda x: x["avg_total_seconds"])
    fastest = sorted_by_speed[0]
    slowest = sorted_by_speed[-1]

    sorted_by_tps = sorted(models, key=lambda x: x["avg_tokens_per_second"], reverse=True)
    highest_tps = sorted_by_tps[0]

    print(f"\nFastest overall:     {fastest['model_name']} ({fastest['avg_total_seconds']:.3f}s)")
    print(f"Highest throughput:  {highest_tps['model_name']} ({highest_tps['avg_tokens_per_second']:.1f} tok/s)")
    print(f"Slowest overall:     {slowest['model_name']} ({slowest['avg_total_seconds']:.3f}s)")


print_latency_table(models)

# Optional: matplotlib chart
if MATPLOTLIB_AVAILABLE:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    names = [m["model_name"] for m in models]
    colors = ["#4C72B0", "#55A868", "#C44E52", "#8172B2"]

    # Chart 1: Average total time
    times = [m["avg_total_seconds"] for m in models]
    axes[0].barh(names, times, color=colors)
    axes[0].set_xlabel("Seconds")
    axes[0].set_title("Average Response Time")
    axes[0].invert_yaxis()

    # Chart 2: Time to first token
    ttfts = [m.get("avg_first_token_seconds", 0) or 0 for m in models]
    axes[1].barh(names, ttfts, color=colors)
    axes[1].set_xlabel("Seconds")
    axes[1].set_title("Time to First Token")
    axes[1].invert_yaxis()

    # Chart 3: Tokens per second
    tps = [m["avg_tokens_per_second"] for m in models]
    axes[2].barh(names, tps, color=colors)
    axes[2].set_xlabel("Tokens/sec")
    axes[2].set_title("Token Throughput")
    axes[2].invert_yaxis()

    plt.tight_layout()
    plt.show()

## Quality Scores

Response quality is evaluated across three dimensions using a heuristic scoring
system implemented in `metrics/response_quality.py`:

- **Relevance** -- Does the response address the question?
- **Completeness** -- Does it cover all key aspects?
- **Coherence** -- Is it well-structured and logically consistent?

The overall quality score is a weighted average of these three dimensions.
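
As a minimal sketch of that weighted average: the weights below are placeholders chosen for illustration, since the actual values live in `metrics/response_quality.py` and are not shown in this notebook. `ASSUMED_WEIGHTS` and `overall_quality` are hypothetical names.

```python
# Placeholder weights (assumption): the real ones are defined in
# metrics/response_quality.py and may differ.
ASSUMED_WEIGHTS = {"relevance": 0.4, "completeness": 0.3, "coherence": 0.3}


def overall_quality(scores: dict[str, float],
                    weights: dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted average of the per-dimension scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[name] * w for name, w in weights.items())
    # Dividing by the weight total keeps the result in [0, 1]
    # even when the weights do not sum to exactly 1.
    return weighted_sum / total_weight


# Made-up per-dimension scores for a single response
example = {"relevance": 0.9, "completeness": 0.6, "coherence": 0.8}
print(round(overall_quality(example), 3))
```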

In [None]:
# ---------------------------------------------------------------------------
# Quality Score Comparison
# ---------------------------------------------------------------------------

def print_quality_table(models: list[dict]) -> None:
    """Display quality scores broken down by English and Turkish."""
    print("\nQuality Score Comparison (0.0 - 1.0)")
    print("=" * 75)
    header = (
        f"{'Model':<18} {'Overall':>10} {'English':>10} {'Turkish':>10} "
        f"{'EN-TR Gap':>10} {'Rating':>10}"
    )
    print(header)
    print("-" * 75)

    for m in models:
        overall = m["avg_quality_overall"]
        en = m["avg_quality_en"]
        tr = m["avg_quality_tr"]
        gap = en - tr

        # Map the overall score to a coarse rating shown in the last column
        if overall >= 0.8:
            rating = "Excellent"
        elif overall >= 0.7:
            rating = "Good"
        elif overall >= 0.6:
            rating = "Fair"
        else:
            rating = "Poor"

        # Field widths match the header: 18 for the name, 10 per score column
        print(
            f"{m['model_name']:<18} "
            f"{overall:>10.4f} "
            f"{en:>10.4f} "
            f"{tr:>10.4f} "
            f"{gap:>+10.4f} "
            f"{rating:>10}"
        )

    print("=" * 75)

    # Find best quality model
    best = max(models, key=lambda x: x["avg_quality_overall"])
    print(f"\nHighest overall quality: {best['model_name']} ({best['avg_quality_overall']:.4f})")

    # Find the model with smallest EN-TR gap
    smallest_gap = min(models, key=lambda x: abs(x["avg_quality_en"] - x["avg_quality_tr"]))
    gap_val = abs(smallest_gap["avg_quality_en"] - smallest_gap["avg_quality_tr"])
    print(f"Most consistent (EN/TR): {smallest_gap['model_name']} (gap: {gap_val:.4f})")


print_quality_table(models)

# Per-prompt quality breakdown for the top-scoring model (selected from the data
# rather than hard-coded, so the heading always matches the rows shown)
top_model = max(models, key=lambda x: x["avg_quality_overall"], default=None)
if top_model:
    print(f"\n\nPer-Prompt Quality Breakdown (Top Model: {top_model['model_name']})")
    print("-" * 70)
    for pr in top_model["prompt_results"]:
        quality = pr.get("quality", {})
        print(
            f"  [{pr['language'].upper()}] {pr['prompt_id']:<20} "
            f"relevance={quality.get('relevance', 0):.3f}  "
            f"completeness={quality.get('completeness', 0):.3f}  "
            f"coherence={quality.get('coherence', 0):.3f}  "
            f"overall={quality.get('overall', 0):.3f}"
        )

# Optional: matplotlib chart
if MATPLOTLIB_AVAILABLE:
    fig, ax = plt.subplots(figsize=(10, 6))

    names = [m["model_name"] for m in models]
    x_pos = range(len(names))
    bar_width = 0.3

    en_scores = [m["avg_quality_en"] for m in models]
    tr_scores = [m["avg_quality_tr"] for m in models]
    overall_scores = [m["avg_quality_overall"] for m in models]

    bars1 = ax.bar([x - bar_width for x in x_pos], en_scores, bar_width,
                   label="English", color="#4C72B0")
    bars2 = ax.bar(x_pos, tr_scores, bar_width,
                   label="Turkish", color="#C44E52")
    bars3 = ax.bar([x + bar_width for x in x_pos], overall_scores, bar_width,
                   label="Overall", color="#55A868")

    ax.set_xlabel("Model")
    ax.set_ylabel("Quality Score")
    ax.set_title("Response Quality by Language")
    ax.set_xticks(x_pos)
    ax.set_xticklabels(names)
    ax.set_ylim(0, 1.0)
    # Draw the threshold line before building the legend so its label is listed
    ax.axhline(y=0.7, color="gray", linestyle="--", alpha=0.5, label="Good threshold")
    ax.legend()

    plt.tight_layout()
    plt.show()

## Turkish Language Support

Turkish language quality is particularly important for this project. We evaluate
how well each model handles Turkish prompts, looking at:

- **Turkish quality score** -- A dedicated metric assessing grammar, vocabulary,
  and use of proper Turkish characters (e.g., ş, ğ, ü, ö, ç, ı)
- **Language fidelity** -- Whether the model responds in Turkish when prompted in Turkish
- **EN-TR quality gap** -- How much quality degrades compared to English responses
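
Language fidelity can be approximated without a language-detection library. The sketch below reuses the same character check as the analysis cell that follows and turns it into a rate per model. `looks_turkish` and `language_fidelity` are hypothetical helper names, and the heuristic is rough: a short Turkish reply with no special characters would be counted as a miss.

```python
# Same character set as the analysis cell's heuristic
TURKISH_CHARS = set("şğüöçıŞĞÜÖÇİ")


def looks_turkish(text: str) -> bool:
    """Rough check: does the text contain any Turkish-specific letter?"""
    return any(ch in TURKISH_CHARS for ch in text)


def language_fidelity(responses: list[str]) -> float:
    """Fraction of non-empty responses that appear to be in Turkish."""
    answered = [r for r in responses if r.strip()]
    if not answered:
        return 0.0
    return sum(looks_turkish(r) for r in answered) / len(answered)


# Made-up responses to Turkish prompts: one Turkish, one English, one empty
samples = [
    "Kubernetes bir konteyner orkestrasyon aracıdır.",
    "Kubernetes is a container orchestrator.",
    "",
]
print(language_fidelity(samples))
```

Empty responses are excluded from the denominator, so failed requests do not lower the fidelity rate; they are already counted under `error_count`.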

In [None]:
# ---------------------------------------------------------------------------
# Turkish Language Analysis
# ---------------------------------------------------------------------------

def analyse_turkish_support(models: list[dict]) -> None:
    """Deep-dive into Turkish language performance across models."""
    print("\nTurkish Language Support Analysis")
    print("=" * 80)

    for m in models:
        print(f"\n--- {m['model_name']} ---")

        # Filter Turkish prompt results
        tr_prompts = [
            pr for pr in m["prompt_results"]
            if pr.get("language") == "tr"
        ]

        if not tr_prompts:
            print("  No Turkish prompts found in benchmark results.")
            continue

        for pr in tr_prompts:
            quality = pr.get("quality", {})
            tr_quality = quality.get("turkish_quality")
            overall = quality.get("overall", 0)
            response_preview = pr.get("response", "")[:150]

            print(f"  Prompt:  {pr['prompt_id']}")
            print(f"  Quality: overall={overall:.3f}", end="")
            if tr_quality is not None:
                print(f"  turkish_quality={tr_quality:.3f}", end="")
            print()
            print(f"  Response preview: {response_preview}...")
            print()

    # Summary table: Turkish-specific scores
    print("\nTurkish Quality Summary")
    print("-" * 65)
    print(f"{'Model':<18} {'TR Overall':>12} {'TR Quality':>12} {'EN-TR Gap':>12}")
    print("-" * 65)

    for m in models:
        tr_prompts = [
            pr for pr in m["prompt_results"]
            if pr.get("language") == "tr"
        ]
        if tr_prompts:
            # Average turkish_quality across TR prompts
            tr_qual_scores = [
                pr["quality"]["turkish_quality"]
                for pr in tr_prompts
                if pr.get("quality", {}).get("turkish_quality") is not None
            ]
            avg_tr_qual = sum(tr_qual_scores) / len(tr_qual_scores) if tr_qual_scores else 0
            gap = m["avg_quality_en"] - m["avg_quality_tr"]

            # Field widths match the header: 18 for the name, 12 per score column
            print(
                f"{m['model_name']:<18} "
                f"{m['avg_quality_tr']:>12.4f} "
                f"{avg_tr_qual:>12.4f} "
                f"{gap:>+12.4f}"
            )

    print("-" * 65)
    print("\nKey observations:")

    # Identify best Turkish model
    best_tr = max(models, key=lambda x: x["avg_quality_tr"])
    worst_tr = min(models, key=lambda x: x["avg_quality_tr"])
    print(f"  Best Turkish performance:  {best_tr['model_name']} ({best_tr['avg_quality_tr']:.4f})")
    print(f"  Worst Turkish performance: {worst_tr['model_name']} ({worst_tr['avg_quality_tr']:.4f})")

    # Check if any model responds in English to Turkish prompts
    for m in models:
        tr_prompts = [pr for pr in m["prompt_results"] if pr.get("language") == "tr"]
        for pr in tr_prompts:
            response = pr.get("response", "")
            # Simple heuristic: check for Turkish-specific characters
            has_turkish_chars = any(c in response for c in "şğüöçıŞĞÜÖÇİ")
            if not has_turkish_chars and response:
                print(f"  WARNING: {m['model_name']} may respond in English to Turkish prompts "
                      f"(prompt: {pr['prompt_id']})")


analyse_turkish_support(models)

## Conclusions and Recommendations

### Performance Summary

| Model | Speed | Quality | Turkish | Memory | Recommendation |
|-------|-------|---------|---------|--------|----------------|
| **GPT-4o-mini** | Fastest (1.8s) | Best (0.87) | Best (0.81) | None (API) | Best overall quality; requires external API |
| **Mistral 7B** | Moderate (4.2s) | Good (0.71) | Fair (0.59) | 5.3 GB | Best self-hosted balance of speed and quality |
| **LLaMA 3 8B** | Slow (5.2s) | Good (0.74) | Good (0.62) | 5.7 GB | Best self-hosted quality; good Turkish support |
| **Phi-3 Mini** | Fast (2.9s) | Fair (0.65) | Poor (0.49) | 3.3 GB | Best for low-resource environments |

### Key Findings

1. **Quality vs. Speed Trade-off**: GPT-4o-mini dominates on both quality and speed,
   but requires an external API. Among self-hosted models, LLaMA 3 8B offers the best
   quality while Phi-3 Mini is the fastest with the lowest memory footprint.

2. **Turkish Language Gap**: All self-hosted models show a significant quality drop
   for Turkish prompts (0.15-0.22 gap). Phi-3 Mini particularly struggles, sometimes
   responding in English to Turkish prompts. GPT-4o-mini has the smallest gap (0.08).

3. **Memory Considerations**: On a 32GB node, running Mistral 7B or LLaMA 3 8B
   requires approximately 5-6 GB of RAM. Phi-3 Mini is viable on nodes with limited
   resources (< 4 GB for the model).

4. **Recommendation**: For the AI Stack deployment:
   - Use **Mistral 7B** as the default self-hosted model (best speed/quality ratio).
   - Use **LLaMA 3 8B** when Turkish language quality is important.
   - Consider **QLoRA fine-tuning** on Turkish data to close the EN-TR quality gap.
   - Keep **GPT-4o-mini** as an optional external fallback for high-quality requirements.
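
The recommendation above amounts to a small routing rule. The following is a sketch of one way to encode it, not part of the deployed stack: `choose_model` and its parameters are hypothetical, and the returned strings are the display names from the summary table rather than confirmed model IDs.

```python
def choose_model(language: str,
                 needs_high_quality: bool = False,
                 external_allowed: bool = False) -> str:
    """Pick a model name following the recommendations in this section."""
    # The external fallback is used only when explicitly permitted
    if needs_high_quality and external_allowed:
        return "GPT-4o-mini"
    # LLaMA 3 8B scored best on Turkish among the self-hosted models
    if language == "tr":
        return "LLaMA 3 8B"
    # Default self-hosted model
    return "Mistral 7B"


print(choose_model("en"))
print(choose_model("tr"))
print(choose_model("tr", needs_high_quality=True, external_allowed=True))
```

Keeping `external_allowed` as a separate flag means a high-quality request never leaves the cluster unless the caller opts in.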