# Why Qwen Needed Output Normalization for RAG JSON Compliance

This notebook documents why `ollama:qwen2.5:3b` required output normalization before its responses could be validated against our strict `RagResponse` schema.

## Goals

1. Show real output examples from the 5 golden questions.
2. Explain why raw outputs fail strict Pydantic validation.
3. Describe the validation and normalization pipeline added for Qwen.
4. Demonstrate how normalized outputs become schema-compliant and usable.

## Context

- Raw evaluation artifact: `docs/rag_eval_results_qwen_raw.json`
- Canonical schema: `app/schemas/rag_response.py`
- Evaluation script with normalization: `scripts/eval_rag_quality_raw_qwen.py`


In [3]:
import json
from pathlib import Path
from pprint import pprint

RESULTS_PATH = Path("../docs/rag_eval_results_qwen_raw.json")

def load_results(path: Path):
    data = json.loads(path.read_text(encoding="utf-8"))
    assert isinstance(data, list) and len(data) > 0, "Unexpected result format"
    return data[0]

report = load_results(RESULTS_PATH)
report.keys()

dict_keys(['model', 'json_parse_rate', 'native_schema_valid_rate', 'normalized_schema_valid_rate', 'avg_latency_native', 'avg_latency_normalized', 'details'])

In [4]:
summary = {
    "model": report["model"],
    "json_parse_rate": report["json_parse_rate"],
    "native_schema_valid_rate": report["native_schema_valid_rate"],
    "normalized_schema_valid_rate": report["normalized_schema_valid_rate"],
    "avg_latency_native": report["avg_latency_native"],
    "avg_latency_normalized": report["avg_latency_normalized"],
    "num_questions": len(report["details"]),
}

pprint(summary)

{'avg_latency_native': 0,
 'avg_latency_normalized': 4.658951663970948,
 'json_parse_rate': 100.0,
 'model': 'ollama:qwen2.5:3b',
 'native_schema_valid_rate': 0.0,
 'normalized_schema_valid_rate': 100.0,
 'num_questions': 5}


## 1) Real Qwen Outputs for the Golden Questions

The next cell prints a concise view of each question:

- `question_id`
- Native and normalized validation status
- Raw parsed keys from Qwen (`parsed_json`)
- A short preview of the raw output string

This section is intentionally focused on *what Qwen actually returned*, before any schema adaptation.


In [5]:
for d in report["details"]:
    parsed = d.get("parsed_json")
    parsed_keys = list(parsed.keys()) if isinstance(parsed, dict) else type(parsed).__name__
    raw_preview = (d.get("raw_output") or "").replace("\n", " ")[:220]

    print(f"--- {d['question_id']} ---")
    print("native_schema_ok:", d.get("native_schema_ok"))
    print("normalized_schema_ok:", d.get("normalized_schema_ok"))
    print("parsed_json keys:", parsed_keys)
    print("raw_output preview:", raw_preview)
    print()

--- q1_synthesis ---
native_schema_ok: False
normalized_schema_ok: True
parsed_json keys: ['answer']
raw_output preview: {   "answer": "Processes are unique entities that can execute different applications concurrently, they usually involve its own resources like memory space and file handles but share CPU scheduling. Threads, on the other

--- q2_json_list ---
native_schema_ok: False
normalized_schema_ok: True
parsed_json keys: ['answer']
raw_output preview: {   "answer": "Atomicity, Consistency, Isolation" }

--- q3_faithfulness ---
native_schema_ok: False
normalized_schema_ok: True
parsed_json keys: ['output']
raw_output preview: {   "output": "The provided guide does not cover the configuration for AMD Radeon GPUs and has no information regarding such a setup. It specifically focuses on CUDA drivers for NVIDIA GPUs, particularly mentioning GTX 1

--- q4_multihop ---
native_schema_ok: False
normalized_schema_ok: True
parsed_json keys: ['output']
raw_output preview: {   "output": {  

## 2) Why Raw Qwen Outputs Fail Strict Pydantic Validation

Our canonical schema (`RagResponse`) requires these core fields:

- `answer` (required)
- `confidence_score` (required, float in `[0, 1]`)
- `sources_used` (required, bool)

`key_terms` and `reasoning` are more permissive, but the three required fields above must always exist.

Qwen often returns semantically correct content in alternative shapes such as:

- `{"output": "..."}`
- `{"output": {"model_chosen": "...", "rationale": "..."}}`
- `{"response": "..."}`

Those structures are valid JSON, but they are not directly valid `RagResponse` payloads.
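
The failure mode above can be reproduced with a minimal sketch, assuming Pydantic v2 (the evaluator's use of `model_validate` suggests it). `RagResponseSketch` is a hypothetical stand-in for the real class in `app/schemas/rag_response.py`: only the three required fields come from the description above, while the types and defaults for `key_terms` and `reasoning` are assumptions.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class RagResponseSketch(BaseModel):
    """Hypothetical stand-in for RagResponse, not the project's actual model."""

    answer: str
    confidence_score: float = Field(ge=0.0, le=1.0)
    sources_used: bool
    key_terms: list[str] = Field(default_factory=list)  # assumed to be optional
    reasoning: Optional[str] = None  # assumed to be optional


def count_native_errors(payload: dict) -> int:
    """Return the number of validation errors for a payload (0 means valid)."""
    try:
        RagResponseSketch.model_validate(payload)
    except ValidationError as exc:
        return exc.error_count()
    return 0


# Shapes similar to the ones Qwen returned for the golden questions.
answer_only = count_native_errors({"answer": "Atomicity, Consistency, Isolation"})
output_only = count_native_errors({"output": "No information on AMD GPUs."})
print(answer_only, output_only)
```

Under these assumptions, an `answer`-only payload is missing two required fields and an `output`-only payload is missing all three, which mirrors the error counts reported by the next cell.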


In [6]:
required_fields = ["answer", "confidence_score", "sources_used"]

for d in report["details"]:
    parsed = d.get("parsed_json") if isinstance(d.get("parsed_json"), dict) else {}
    missing = [f for f in required_fields if f not in parsed]

    print(f"{d['question_id']}: missing native required fields -> {missing}")
    if d.get("native_schema_error"):
        first_line = str(d["native_schema_error"]).split("\n")[0]
        print("  native_schema_error:", first_line)
    print()

q1_synthesis: missing native required fields -> ['confidence_score', 'sources_used']
  native_schema_error: 2 validation errors for RagResponse

q2_json_list: missing native required fields -> ['confidence_score', 'sources_used']
  native_schema_error: 2 validation errors for RagResponse

q3_faithfulness: missing native required fields -> ['answer', 'confidence_score', 'sources_used']
  native_schema_error: 3 validation errors for RagResponse

q4_multihop: missing native required fields -> ['answer', 'confidence_score', 'sources_used']
  native_schema_error: 3 validation errors for RagResponse

q5_spanish_instruction: missing native required fields -> ['confidence_score', 'sources_used']
  native_schema_error: 2 validation errors for RagResponse



## 3) New Qwen Validation + Normalization Pipeline

The updated evaluator runs **three explicit gates** per question:

1. **JSON parse gate**
   - `json_ok`: verifies raw text can be parsed with `json.loads`.

2. **Native schema gate**
   - `native_schema_ok`: validates `parsed_json` directly with `RagResponse.model_validate(...)`.
   - This measures strict, out-of-the-box compliance.

3. **Normalized schema gate**
   - `normalize_qwen_payload(parsed_json)` maps heterogeneous shapes into canonical keys.
   - `normalized_schema_ok`: validates the normalized payload with `RagResponse.model_validate(...)`.

Why this is necessary:

- Qwen frequently returns JSON that is structurally different but semantically meaningful.
- A deterministic adapter preserves useful content while enforcing one stable API contract for downstream components.
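
The three gates can be sketched as follows. This is an illustration under stated assumptions, not the code in `scripts/eval_rag_quality_raw_qwen.py`: `normalize_payload_sketch`, the key priority order, and the default values for `confidence_score` and `sources_used` are all hypothetical, and `require_core_fields` is a stdlib stand-in for `RagResponse.model_validate` so the example runs without the project schema.

```python
import json
from typing import Any, Callable

REQUIRED_FIELDS = ("answer", "confidence_score", "sources_used")
ANSWER_KEYS = ("answer", "output", "response")  # priority order is an assumption
DEFAULT_CONFIDENCE = 0.5     # hypothetical placeholder for an omitted field
DEFAULT_SOURCES_USED = True  # hypothetical placeholder for an omitted field


def require_core_fields(payload: Any) -> None:
    """Stand-in validator: raise if the payload lacks a required field."""
    if not isinstance(payload, dict):
        raise ValueError("payload is not a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")


def flatten_to_text(value: Any) -> str:
    """Turn a string or a nested dict into a single answer string."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, dict):
        return "; ".join(f"{k}: {v}" for k, v in value.items())
    return str(value)


def normalize_payload_sketch(parsed: Any) -> dict[str, Any]:
    """Map a heterogeneous payload onto the canonical keys."""
    source = parsed if isinstance(parsed, dict) else {"answer": parsed}
    answer = next((flatten_to_text(source[k]) for k in ANSWER_KEYS if k in source), "")
    return {
        "answer": answer,
        "confidence_score": source.get("confidence_score", DEFAULT_CONFIDENCE),
        "sources_used": source.get("sources_used", DEFAULT_SOURCES_USED),
    }


def passes(validate: Callable[[Any], Any], payload: Any) -> bool:
    """Return True if the validator accepts the payload without raising."""
    try:
        validate(payload)
    except Exception:  # covers ValueError here and Pydantic's ValidationError
        return False
    return True


def run_gates(raw_output: str, validate: Callable[[Any], Any] = require_core_fields) -> dict[str, bool]:
    """Apply the JSON, native-schema, and normalized-schema gates in order."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_ok": False, "native_schema_ok": False, "normalized_schema_ok": False}
    return {
        "json_ok": True,
        "native_schema_ok": passes(validate, parsed),
        "normalized_schema_ok": passes(validate, normalize_payload_sketch(parsed)),
    }


gates = run_gates('{"output": {"model_chosen": "A", "rationale": "fits the budget"}}')
print(gates)
```

Passing the validator as a parameter keeps the gate logic independent of the schema; in the real pipeline the argument would presumably be `RagResponse.model_validate`. Keeping the native gate separate from the normalized one is what lets the report distinguish out-of-the-box compliance from compliance after adaptation.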


In [7]:
rows = []
for d in report["details"]:
    rows.append({
        "question_id": d["question_id"],
        "json_ok": d["json_ok"],
        "native_schema_ok": d["native_schema_ok"],
        "normalized_schema_ok": d["normalized_schema_ok"],
    })

pprint(rows)

[{'json_ok': True,
  'native_schema_ok': False,
  'normalized_schema_ok': True,
  'question_id': 'q1_synthesis'},
 {'json_ok': True,
  'native_schema_ok': False,
  'normalized_schema_ok': True,
  'question_id': 'q2_json_list'},
 {'json_ok': True,
  'native_schema_ok': False,
  'normalized_schema_ok': True,
  'question_id': 'q3_faithfulness'},
 {'json_ok': True,
  'native_schema_ok': False,
  'normalized_schema_ok': True,
  'question_id': 'q4_multihop'},
 {'json_ok': True,
  'native_schema_ok': False,
  'normalized_schema_ok': True,
  'question_id': 'q5_spanish_instruction'}]
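
The per-question flags above relate to the summary rates as sketched below, assuming each rate is the percentage of questions that pass the corresponding gate. The notebook does not show the evaluator's actual formula; `pass_rate` is a hypothetical helper, and the `rows` literal simply restates the flags printed above.

```python
def pass_rate(details: list[dict], flag: str) -> float:
    """Percentage of entries whose boolean `flag` is true (0.0 for an empty list)."""
    if not details:
        return 0.0
    passed = sum(1 for d in details if d.get(flag))
    return 100.0 * passed / len(details)


# The five per-question results printed above, reduced to their gate flags.
rows = [
    {"question_id": qid, "json_ok": True, "native_schema_ok": False, "normalized_schema_ok": True}
    for qid in (
        "q1_synthesis",
        "q2_json_list",
        "q3_faithfulness",
        "q4_multihop",
        "q5_spanish_instruction",
    )
]

rates = {flag: pass_rate(rows, flag) for flag in ("json_ok", "native_schema_ok", "normalized_schema_ok")}
print(rates)
```

With these flags the helper yields 100.0, 0.0, and 100.0, matching `json_parse_rate`, `native_schema_valid_rate`, and `normalized_schema_valid_rate` in the summary cell.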


## 4) Before vs After Normalization (Per Question)

The next cell prints one compact comparison per golden question:

- **Raw shape** (`parsed_json` keys)
- **Raw answer candidate** (if obvious)
- **Normalized final answer** (`normalized_json["answer"]`)
- **Final canonical payload** (the `response` field stored by the evaluator)

This is the practical evidence that normalization converts non-canonical but useful outputs into valid `RagResponse` objects.


In [8]:
def raw_answer_hint(parsed):
    if not isinstance(parsed, dict):
        return None
    if isinstance(parsed.get("answer"), str):
        return parsed["answer"]
    if isinstance(parsed.get("output"), str):
        return parsed["output"]
    if isinstance(parsed.get("response"), str):
        return parsed["response"]
    if isinstance(parsed.get("output"), dict):
        out = parsed["output"]
        if isinstance(out.get("value"), str):
            return out["value"]
        if isinstance(out.get("rationale"), str):
            return out["rationale"]
    return None

for d in report["details"]:
    parsed = d.get("parsed_json")
    normalized = d.get("normalized_json") or {}

    print(f"=== {d['question_id']} ===")
    print("raw keys:", list(parsed.keys()) if isinstance(parsed, dict) else type(parsed).__name__)

    hint = raw_answer_hint(parsed)
    if isinstance(hint, str):
        print("raw answer hint:", hint[:200], "..." if len(hint) > 200 else "")
    else:
        print("raw answer hint: <not directly accessible from a single field>")

    print("normalized answer:", normalized.get("answer", "<missing>"))
    print("canonical response object:")
    pprint(d.get("response"))
    print()

=== q1_synthesis ===
raw keys: ['answer']
raw answer hint: Processes are unique entities that can execute different applications concurrently, they usually involve its own resources like memory space and file handles but share CPU scheduling. Threads, on the  ...
normalized answer: Processes are unique entities that can execute different applications concurrently, they usually involve its own resources like memory space and file handles but share CPU scheduling. Threads, on the other hand, are smaller units of work than processes; however, unlike processes, a single thread belongs to a process sharing resources such as memory and files. Furthermore, context switching among threads consumes less time compared to switching between independent threads or processes.
canonical response object:
{'answer': 'Processes are unique entities that can execute different '
           'applications concurrently, they usually involve its own resources '
           'like memory space and file handles bu

## Final Interpretation

The evaluation of `ollama:qwen2.5:3b` on the golden questions supports three conclusions:

- On this five-question set, Qwen always produced **parseable JSON** (`json_parse_rate = 100%`).
- Qwen does **not** natively satisfy strict `RagResponse` requirements (`native_schema_valid_rate = 0%`).
- After deterministic normalization, outputs become fully compatible with the canonical schema (`normalized_schema_valid_rate = 100%`).

This justifies keeping:

- a single canonical API schema (`RagResponse`), and
- a model-specific normalization layer for Qwen.

This preserves contract stability for downstream systems while still making use of outputs from models, such as Qwen, that do not follow the schema natively.
