# 03 - Qualitative Review

This notebook provides tools for reviewing flagged low-consistency examples:
- Display prompts and responses side-by-side across languages
- Annotate drift types (semantic, format, factual, style, hallucination)
- Generate a qualitative evidence set for the report

In [1]:
# Add parent directory to path for imports
import sys
sys.path.insert(0, '..')

In [2]:
from src.infer_ollama import load_responses, get_responses_by_prompt
from src.load_prompts import load_prompts
from src.similarity import load_metrics, get_flagged_examples
from src.task_checks import load_task_metrics, get_mismatched_examples
import pandas as pd
from pathlib import Path
from IPython.display import display, HTML, Markdown

## 1. Load Data

In [3]:
# Load all data
responses = load_responses()
prompts_df = load_prompts(prepend_control_line=False)
metrics_df = load_metrics()
task_metrics_df = load_task_metrics()

print(f"Loaded {len(responses)} responses")
print(f"Loaded {len(prompts_df)} prompts")
print(f"Loaded {len(metrics_df)} metric records")
print(f"Loaded {len(task_metrics_df)} task metric records")

Loaded 720 responses
Loaded 60 prompts
Loaded 288 metric records
Loaded 144 task metric records


## 2. Helper Functions

In [4]:
def display_prompt(prompt_id: int):
    """Display a prompt in all three languages."""
    prompt_rows = prompts_df[prompts_df['prompt_id'] == prompt_id]
    
    if len(prompt_rows) == 0:
        print(f"Prompt {prompt_id} not found")
        return
    
    task_type = prompt_rows.iloc[0]['task_type']
    print(f"=== Prompt {prompt_id} ({task_type}) ===\n")
    
    for lang in ['EN', 'DE', 'TR']:
        row = prompt_rows[prompt_rows['language'] == lang]
        if len(row) > 0:
            print(f"[{lang}]")
            print(row.iloc[0]['text'])
            print()


def display_responses(prompt_id: int, model_id: str, run_id: int = 1):
    """Display responses for a prompt across all languages."""
    resp_dict = {}
    for r in responses:
        if r['prompt_id'] == prompt_id and r['model_id'] == model_id and r['run_id'] == run_id:
            resp_dict[r['language']] = r['response_text']
    
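    if not resp_dict:
        print(f"No responses found for prompt {prompt_id} (model: {model_id}, run: {run_id})")
        return

    print(f"=== Responses (Model: {model_id}, Run: {run_id}) ===\n")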
    
    for lang in ['EN', 'DE', 'TR']:
        if lang in resp_dict:
            print(f"[{lang}]")
            print(resp_dict[lang])
            print()


def display_full_comparison(prompt_id: int, model_id: str, run_id: int = 1):
    """Display a prompt in all languages, followed by the matching responses."""
    display_prompt(prompt_id)
    print("-" * 80)
    display_responses(prompt_id, model_id, run_id)
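
`display_full_comparison` prints the prompt and the responses one after another. For a literal side-by-side view, one option is an HTML table rendered with the already imported `IPython.display.HTML`. The sketch below is an assumption, not part of the project code: `build_side_by_side_html` is a hypothetical helper, and it expects records shaped like the dicts read in `display_responses`. The sample records reuse the EN and DE responses shown later in this notebook and omit TR on purpose to show the placeholder.

```python
import html


def build_side_by_side_html(records, prompt_id, model_id, run_id=1,
                            languages=("EN", "DE", "TR")):
    """Return an HTML table with one column per language (hypothetical helper)."""
    # Collect the response text per language for the requested combination.
    by_lang = {
        r["language"]: r["response_text"]
        for r in records
        if r["prompt_id"] == prompt_id
        and r["model_id"] == model_id
        and r["run_id"] == run_id
    }
    header = "".join(f"<th>{lang}</th>" for lang in languages)
    # Escape the model output so stray markup cannot break the table.
    cells = "".join(
        "<td style='vertical-align: top; white-space: pre-wrap'>"
        f"{html.escape(by_lang.get(lang, '(no response)'))}</td>"
        for lang in languages
    )
    return f"<table><tr>{header}</tr><tr>{cells}</tr></table>"


# Sample records for illustration only.
sample_records = [
    {"prompt_id": 19, "model_id": "phi3:latest", "run_id": 2,
     "language": "EN", "response_text": '"Unlock Potential, Anywhere."'},
    {"prompt_id": 19, "model_id": "phi3:latest", "run_id": 2,
     "language": "DE", "response_text": '"Lernen, Fortschritt, Weltweiter Zugang!"'},
]
table_html = build_side_by_side_html(sample_records, 19, "phi3:latest", 2)
```

In the notebook, `display(HTML(table_html))` would render the table; with the real data, pass `responses` instead of `sample_records`.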

## 3. Review Flagged Open-Text Examples

These are the open-text cases (summarization, creative) whose cross-language cosine similarity falls in the bottom 10%.
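
How `get_flagged_examples` selects its cases is defined in `src.similarity` and not shown in this notebook. As a minimal sketch of the general idea, assuming the rule is "keep the lowest 10% of records by `cosine_similarity`", the selection could look like this. `flag_bottom_fraction` is a hypothetical name, the rounding rule is an assumption, and the sample values are illustrative.

```python
import math


def flag_bottom_fraction(records, fraction=0.10, key="cosine_similarity"):
    """Return the lowest `fraction` of records by `key`, lowest first.

    Assumption: the count is rounded up and at least one record is kept.
    The project's real rule may differ.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda r: r[key])
    n_flagged = max(1, math.ceil(len(ordered) * fraction))
    return ordered[:n_flagged]


# Illustrative similarity values, not taken from the experiment.
sample_scores = [0.91, 0.24, 0.78, 0.85, 0.66, 0.95, 0.88, 0.72, 0.81, 0.69]
sample_metrics = [
    {"prompt_id": i + 1, "cosine_similarity": score}
    for i, score in enumerate(sample_scores)
]
flagged_sample = flag_bottom_fraction(sample_metrics)
```

With a DataFrame such as `metrics_df`, the same records can be obtained with `metrics_df.to_dict("records")`.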

In [5]:
# Get flagged examples
flagged = get_flagged_examples(metrics_df, responses)
print(f"Total flagged examples: {len(flagged)}")

if len(flagged) > 0:
    print("\nFlagged cases sorted by similarity:")
    display(flagged[['model_id', 'prompt_id', 'task_type', 'pair', 'run_id', 'cosine_similarity']].head(10))

Total flagged examples: 30

Flagged cases sorted by similarity:


| | model_id | prompt_id | task_type | pair | run_id | cosine_similarity |
|---|---|---|---|---|---|---|
| 232 | phi3:latest | 19 | creative | EN-TR | 2 | 0.24037 |
| 139 | llama3.2:1b | 20 | creative | EN-TR | 1 | 0.285214 |
| 226 | phi3:latest | 18 | creative | EN-TR | 2 | 0.297483 |
| 229 | phi3:latest | 19 | creative | EN-TR | 1 | 0.31001 |
| 136 | llama3.2:1b | 19 | creative | EN-TR | 2 | 0.323369 |
| 223 | phi3:latest | 18 | creative | EN-TR | 1 | 0.332236 |
| 230 | phi3:latest | 19 | creative | DE-TR | 1 | 0.343559 |
| 40 | gemma3:1b | 19 | creative | EN-TR | 2 | 0.346787 |
| 183 | llama3.2:3b | 19 | creative | EN-DE | 2 | 0.361225 |
| 129 | llama3.2:1b | 18 | creative | EN-DE | 2 | 0.367049 |


In [6]:
# Review the lowest similarity case
if len(flagged) > 0:
    worst = flagged.iloc[0]
    print("Reviewing lowest similarity case:")
    print(f"  Similarity: {worst['cosine_similarity']:.4f}")
    print(f"  Pair: {worst['pair']}")
    print()
    display_full_comparison(worst['prompt_id'], worst['model_id'], worst['run_id'])

Reviewing lowest similarity case:
  Similarity: 0.2404
  Pair: EN-TR

=== Prompt 19 (creative) ===

[EN]
Write a slogan for an online education platform. It must be 6 to 8 words. Output only the slogan.

[DE]
Schreibe einen Slogan für eine Online-Bildungsplattform. Er muss aus 6 bis 8 Wörtern bestehen. Gib nur den Slogan aus.

[TR]
Bir çevrimiçi eğitim platformu için slogan yaz. 6 ile 8 kelime arasında olmalı. Sadece sloganı yaz.

--------------------------------------------------------------------------------
=== Responses (Model: phi3:latest, Run: 2) ===

[EN]
"Unlock Potential, Anywhere."

[DE]
"Lernen, Fortschritt, Weltweiter Zugang!"

[TR]
"Çalışma, Öncelikler ve Destek Yapmak!" - Bir çevrimiçi eğitim platformu için bir slogan oluyoruz. Bu kelimenin arasında "Çalışma", özellikle başlangıc atılması ve gelişmeyi belirtir; "Öncelikler", önceden eğitim altyapısı, yazan bir süreçte olan konuya geri kalma; "Destek Yapmak" de özgü bağlamlarda ve geliştirmeyi belirtir.

Çünkü 6-8 kelime a

## 4. Review Mismatched Discrete-Answer Examples

These are cases where the extracted answer differs across languages.
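
The mismatch check itself lives in `src.task_checks` and is not shown here. A minimal sketch of the idea, assuming each record carries the extracted keys in `key_en`, `key_de`, and `key_tr` (the column names displayed below): a record counts as mismatched when the three keys are not all equal. `has_language_mismatch` is a hypothetical name, and the case and whitespace normalization is an assumption. The first two sample rows mirror cases from the output below; the third is made up to show a match.

```python
def has_language_mismatch(record, key_columns=("key_en", "key_de", "key_tr")):
    """Return True when the extracted keys are not all identical.

    Assumption: keys are compared as lowercase strings with surrounding
    whitespace removed.
    """
    keys = {str(record[col]).strip().lower() for col in key_columns}
    return len(keys) > 1


sample_task_metrics = [
    {"prompt_id": 11, "key_en": "1", "key_de": "78", "key_tr": "1"},
    {"prompt_id": 5, "key_en": "negative", "key_de": "positive", "key_tr": "positive"},
    {"prompt_id": 3, "key_en": "B", "key_de": "B", "key_tr": "b "},  # hypothetical row
]
mismatched_ids = [
    r["prompt_id"] for r in sample_task_metrics if has_language_mismatch(r)
]
```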

In [7]:
# Get mismatched examples
mismatched = get_mismatched_examples(task_metrics_df, responses)
print(f"Total mismatched examples: {len(mismatched)}")

if len(mismatched) > 0:
    print("\nMismatched cases:")
    display(mismatched[['model_id', 'prompt_id', 'task_type', 'run_id', 'key_en', 'key_de', 'key_tr']].head(10))

Total mismatched examples: 51

Mismatched cases:


| | model_id | prompt_id | task_type | run_id | key_en | key_de | key_tr |
|---|---|---|---|---|---|---|---|
| 12 | gemma3:4b | 11 | reasoning | 1 | 1 | 78 | 1 |
| 13 | gemma3:4b | 11 | reasoning | 2 | 1 | 78 | 1 |
| 34 | gemma3:1b | 10 | reasoning | 1 | 22.5 | 60 | 18 |
| 35 | gemma3:1b | 10 | reasoning | 2 | 22.5 | 60 | 90 |
| 36 | gemma3:1b | 11 | reasoning | 1 | 75 | 1 | 45 |
| 37 | gemma3:1b | 11 | reasoning | 2 | 75 | 60 | 1 |
| 38 | gemma3:1b | 12 | reasoning | 1 | A | B | B |
| 39 | gemma3:1b | 12 | reasoning | 2 | A | B | B |
| 48 | llama3.2:1b | 5 | classification | 1 | negative | positive | positive |
| 49 | llama3.2:1b | 5 | classification | 2 | negative | positive | positive |


In [8]:
# Review a mismatched case
if len(mismatched) > 0:
    case = mismatched.iloc[0]
    print("Reviewing mismatched case:")
    print(f"  Task: {case['task_type']}")
    print(f"  Extracted: EN={case['key_en']}, DE={case['key_de']}, TR={case['key_tr']}")
    print()
    display_full_comparison(case['prompt_id'], case['model_id'], case['run_id'])

Reviewing mismatched case:
  Task: reasoning
  Extracted: EN=1, DE=78, TR=1

=== Prompt 11 (reasoning) ===

[EN]
A meeting started at 9:15 and ended at 10:45. How many minutes did it last? Output only the number of minutes.

[DE]
Ein Meeting begann um 9:15 und endete um 10:45. Wie viele Minuten dauerte es? Gib nur die Minutenanzahl aus.

[TR]
Toplantı 9:15'te başlayıp 10:45'te bitti. Kaç dakika sürdü? Sadece dakika sayısını yaz.

--------------------------------------------------------------------------------
=== Responses (Model: gemma3:4b, Run: 1) ===

[EN]
1 hour and 30 minutes
90 minutes

[DE]
78


[TR]
1 saat 30 dakika
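
In this case the extracted EN and TR keys are both `1`, although the EN response ends with `90 minutes`. That is consistent with an extractor that picks the first number in the response, but the actual logic lives in `src.task_checks` and is not shown here. The sketch below is hypothetical (`extract_number` and `NUMBER_PATTERN` are illustrative names) and only shows how choosing the first or the last number changes the extracted key for the responses above.

```python
import re

# Matches integers and simple decimals such as "90" or "22.5".
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def extract_number(text, which="first"):
    """Return the first or last number found in `text` as a string, or None."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return matches[0] if which == "first" else matches[-1]


en_response = "1 hour and 30 minutes\n90 minutes"
tr_response = "1 saat 30 dakika"

en_first = extract_number(en_response, "first")
en_last = extract_number(en_response, "last")
tr_first = extract_number(tr_response, "first")
tr_last = extract_number(tr_response, "last")
```

Here `en_first` is `"1"` and `en_last` is `"90"`, while the TR response yields `"1"` or `"30"`. Neither strategy recovers 90 from the TR response, so this case would be reported as a mismatch either way.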




## 5. Interactive Review

Use this cell to review any specific prompt/model/run combination.
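
Before picking values, it can help to list which combinations actually exist in the loaded data. This is a minimal sketch, assuming the records carry the `prompt_id`, `model_id`, and `run_id` keys that `display_responses` reads; `available_combinations` is a hypothetical helper and the sample records are illustrative.

```python
def available_combinations(records):
    """Return the sorted, de-duplicated (prompt_id, model_id, run_id) tuples."""
    return sorted({(r["prompt_id"], r["model_id"], r["run_id"]) for r in records})


# Illustrative records; in the notebook, pass `responses` instead.
sample_responses = [
    {"prompt_id": 1, "model_id": "gemma3:1b", "run_id": 1, "language": "EN"},
    {"prompt_id": 1, "model_id": "gemma3:1b", "run_id": 1, "language": "DE"},
    {"prompt_id": 19, "model_id": "phi3:latest", "run_id": 2, "language": "EN"},
]
combinations = available_combinations(sample_responses)
```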

In [9]:
# Change these values to review different cases
PROMPT_ID = 1
MODEL_ID = "gemma3:1b"  # or "gemma3:4b", "llama3.2:1b", "llama3.2:3b", "phi3:latest"
RUN_ID = 1

display_full_comparison(PROMPT_ID, MODEL_ID, RUN_ID)

=== Prompt 1 (summarization) ===

[EN]
Summarize the following paragraph in one sentence: "Artificial intelligence is changing many industries by automating repetitive work and helping people make faster decisions. Companies use AI to detect patterns in large datasets, but the results depend on data quality and careful evaluation. While productivity can increase, some tasks may be replaced and employees may need reskilling. Clear policies are also needed to reduce privacy risks and unfair outcomes."

[DE]
Fasse den folgenden Absatz in einem Satz zusammen: „Künstliche Intelligenz verändert viele Branchen, indem sie repetitive Arbeit automatisiert und Menschen hilft, schneller Entscheidungen zu treffen. Unternehmen nutzen KI, um Muster in großen Datensätzen zu erkennen, aber die Ergebnisse hängen von Datenqualität und sorgfältiger Bewertung ab. Obwohl die Produktivität steigen kann, können einige Tätigkeiten ersetzt werden und Beschäftigte benötigen möglicherweise Umschulungen. Außerdem 

## 6. Create Qualitative Evidence Set

Generate a structured set of flagged examples for the report.

In [10]:
# Create qualitative evidence set
evidence_rows = []

# Add open-text flagged examples (the 3 lowest-similarity cases)
if len(flagged) > 0:
    for _, row in flagged.head(3).iterrows():
        evidence_rows.append({
            'type': 'open_text',
            'model_id': row['model_id'],
            'prompt_id': row['prompt_id'],
            'run_id': row['run_id'],  # Distinguishes repeated runs of the same prompt
            'task_type': row['task_type'],
            'pair': row['pair'],
            'similarity': row['cosine_similarity'],
            'drift_type': '',  # To be filled manually
            'notes': ''  # To be filled manually
        })

# Add discrete mismatched examples (first 3)
if len(mismatched) > 0:
    for _, row in mismatched.head(3).iterrows():
        evidence_rows.append({
            'type': 'discrete',
            'model_id': row['model_id'],
            'prompt_id': row['prompt_id'],
            'run_id': row['run_id'],  # Distinguishes repeated runs of the same prompt
            'task_type': row['task_type'],
            'pair': 'EN-DE-TR',
            'similarity': None,
            'drift_type': '',  # To be filled manually
            'notes': f"EN={row['key_en']}, DE={row['key_de']}, TR={row['key_tr']}"
        })

evidence_df = pd.DataFrame(evidence_rows)
print("Qualitative Evidence Set (fill in drift_type column manually):")
display(evidence_df)

Qualitative Evidence Set (fill in drift_type column manually):


| | type | model_id | prompt_id | run_id | task_type | pair | similarity | drift_type | notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | open_text | phi3:latest | 19 | 2 | creative | EN-TR | 0.24037 | | |
| 1 | open_text | llama3.2:1b | 20 | 1 | creative | EN-TR | 0.285214 | | |
| 2 | open_text | phi3:latest | 18 | 2 | creative | EN-TR | 0.297483 | | |
| 3 | discrete | gemma3:4b | 11 | 1 | reasoning | EN-DE-TR | | | EN=1, DE=78, TR=1 |
| 4 | discrete | gemma3:4b | 11 | 2 | reasoning | EN-DE-TR | | | EN=1, DE=78, TR=1 |
| 5 | discrete | gemma3:1b | 10 | 1 | reasoning | EN-DE-TR | | | EN=22.5, DE=60, TR=18 |


In [11]:
# Save evidence set
output_path = Path('../outputs/reports/qualitative_evidence.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)
evidence_df.to_csv(output_path, index=False)
print(f"Saved to {output_path}")

Saved to ../outputs/reports/qualitative_evidence.csv


## 7. Drift Type Categories

When annotating examples, use these categories:

1. **Semantic Drift**: Core meaning differs (e.g., different conclusions, missing key points)
2. **Format Drift**: Output format differs (e.g., bullet points vs prose, different length)
3. **Factual Drift**: Factual content differs (e.g., wrong answer, different label)
4. **Style Drift**: Tone or style differs significantly
5. **Hallucination**: One language includes fabricated content
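
Since `drift_type` is filled in by hand, a small check can catch empty or misspelled labels before the evidence set goes into the report. The sketch below makes several assumptions: the short lowercase labels in `ALLOWED_DRIFT_TYPES` are one possible encoding of the five categories above, `find_invalid_annotations` is a hypothetical helper, and the CSV text is illustrative rather than real annotation data.

```python
import csv
import io

# Assumed short labels for the five categories listed above.
ALLOWED_DRIFT_TYPES = {"semantic", "format", "factual", "style", "hallucination"}


def find_invalid_annotations(rows):
    """Return (row_index, label) pairs whose drift_type is empty or unknown."""
    problems = []
    for index, row in enumerate(rows):
        label = (row.get("drift_type") or "").strip().lower()
        if label not in ALLOWED_DRIFT_TYPES:
            problems.append((index, label))
    return problems


# Illustrative annotated file; the drift_type values are made up.
annotated_csv = io.StringIO(
    "type,model_id,prompt_id,drift_type\n"
    "open_text,phi3:latest,19,format\n"
    "open_text,llama3.2:1b,20,\n"
    "discrete,gemma3:4b,11,Factual\n"
    "discrete,gemma3:1b,10,typo\n"
)
rows = list(csv.DictReader(annotated_csv))
problems = find_invalid_annotations(rows)
```

Here `problems` lists row 1 (empty label) and row 3 (unknown label `typo`). To check the real file, open the saved `qualitative_evidence.csv` in place of the in-memory sample.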

---

**End of Qualitative Review Notebook**