# 02 - Metrics and Plots

This notebook computes all evaluation metrics and generates visualizations:
- LaBSE embeddings and cross-lingual cosine similarity for open-text tasks
- Task-aware agreement checks for discrete-answer tasks
- Intra-language stability baselines (run1 vs run2)
- Heatmaps, distribution plots, and summary tables

In [1]:
# Add parent directory to path for imports
import sys
sys.path.insert(0, '..')

In [2]:
from src.infer_ollama import load_responses, get_models
from src.similarity import (
    compute_all_metrics,
    load_metrics,
    load_stability,
    aggregate_by_language_pair,
    aggregate_by_task_type,
    get_flagged_examples
)
from src.task_checks import (
    compute_task_metrics,
    load_task_metrics,
    aggregate_task_metrics_by_task_type,
    get_mismatched_examples
)
from src.plots import generate_all_plots
import pandas as pd
from pathlib import Path

## 1. Load Responses

In [3]:
# Load all responses
responses = load_responses()
responses_df = pd.DataFrame(responses)

print(f"Total responses: {len(responses_df)}")
print("\nBy task type:")
print(responses_df.groupby('task_type').size())

# Separate open-text and discrete responses
open_text_tasks = ['summarization', 'creative']
discrete_tasks = ['classification', 'reasoning', 'factual']

open_text_df = responses_df[responses_df['task_type'].isin(open_text_tasks)]
discrete_df = responses_df[responses_df['task_type'].isin(discrete_tasks)]

print(f"\nOpen-text responses: {len(open_text_df)}")
print(f"Discrete-answer responses: {len(discrete_df)}")

Total responses: 720

By task type:
task_type
classification    144
creative          144
factual           144
reasoning         144
summarization     144
dtype: int64

Open-text responses: 288
Discrete-answer responses: 432


## 2. Compute Open-Text Metrics (LaBSE Similarity)

This section:
1. Loads the LaBSE model
2. Generates embeddings for all open-text responses
3. Computes cross-lingual cosine similarity (EN-DE, EN-TR, DE-TR)
4. Computes intra-language stability (run1 vs run2)
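The similarity step above can be sketched without loading LaBSE. This is only an illustration, not the implementation in `src.similarity`: the helper names `cosine_similarity` and `pairwise_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real LaBSE embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(embeddings_by_lang):
    """Map pair labels such as 'EN-DE' to the cosine similarity of the two embeddings."""
    return {
        f"{a}-{b}": cosine_similarity(embeddings_by_lang[a], embeddings_by_lang[b])
        for a, b in combinations(embeddings_by_lang, 2)
    }


# Toy vectors (assumption): one embedding per language for the same prompt and run.
toy = {"EN": [1.0, 0.0, 0.0], "DE": [1.0, 1.0, 0.0], "TR": [0.0, 1.0, 0.0]}
scores = pairwise_similarity(toy)
for pair, value in scores.items():
    print(f"{pair}: {value:.4f}")
```

With dictionary keys in the order EN, DE, TR, the pair labels come out as EN-DE, EN-TR, and DE-TR, matching the pairs listed above.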

In [4]:
# Compute open-text metrics
# This will load LaBSE and compute embeddings (may take a few minutes on first run)

metrics_df, stability_df = compute_all_metrics(show_progress=True)

print(f"\nMetrics computed: {len(metrics_df)} cross-lingual comparisons")
print(f"Stability computed: {len(stability_df)} intra-language comparisons")

Processing 288 open-text responses...
Loading LaBSE model: sentence-transformers/LaBSE



Computing embeddings for 288 responses...



Saved metrics to /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../data/metrics.csv
Saved stability to /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../data/stability.csv

Metrics computed: 288 cross-lingual comparisons
Stability computed: 144 intra-language comparisons


In [5]:
# View metrics summary
print("Cross-lingual similarity by language pair:")
for model_id in metrics_df['model_id'].unique():
    print(f"\n{model_id}:")
    print(aggregate_by_language_pair(metrics_df, model_id))

Cross-lingual similarity by language pair:

gemma3:1b:
      cosine_similarity              
                   mean     std count
pair                                 
DE-TR            0.6982  0.1028    16
EN-DE            0.6564  0.1030    16
EN-TR            0.6293  0.1114    16

gemma3:4b:
      cosine_similarity              
                   mean     std count
pair                                 
DE-TR            0.7336  0.0928    16
EN-DE            0.7186  0.0972    16
EN-TR            0.6385  0.1254    16

llama3.2:1b:
      cosine_similarity              
                   mean     std count
pair                                 
DE-TR            0.6559  0.1707    16
EN-DE            0.5833  0.1114    16
EN-TR            0.5746  0.1445    16

llama3.2:3b:
      cosine_similarity              
                   mean     std count
pair                                 
DE-TR            0.6525  0.0728    16
EN-DE            0.5879  0.1033    16
EN-TR            0.6110  0.1019

In [6]:
# View by task type
print("Cross-lingual similarity by task type:")
for model_id in metrics_df['model_id'].unique():
    print(f"\n{model_id}:")
    print(aggregate_by_task_type(metrics_df, model_id))

Cross-lingual similarity by task type:

gemma3:1b:
              cosine_similarity              
                           mean     std count
task_type                                    
creative                 0.6077  0.1114    24
summarization            0.7150  0.0720    24

gemma3:4b:
              cosine_similarity              
                           mean     std count
task_type                                    
creative                 0.6550  0.1343    24
summarization            0.7388  0.0629    24

llama3.2:1b:
              cosine_similarity              
                           mean     std count
task_type                                    
creative                 0.4965  0.1002    24
summarization            0.7128  0.0950    24

llama3.2:3b:
              cosine_similarity              
                           mean     std count
task_type                                    
creative                 0.5815  0.0978    24
summarization            0.6529  0.

## 3. Compute Discrete-Answer Metrics (Task-Aware Checks)

This section:
1. Extracts labels/answers from discrete-answer responses
2. Checks cross-lingual agreement (same run_id across languages)
3. Checks intra-language stability (run1 vs run2)
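As a rough illustration of the agreement check, the sketch below compares already-extracted answers across the three languages. It is not the code in `src.task_checks`: the function names are hypothetical, and the rules used here (numeric answers compared as numbers, text compared case-insensitively, a missing answer counted as "uncertain") are assumptions.

```python
def normalize_key(answer):
    """Canonicalize an extracted answer: parse numbers as floats, lowercase everything else."""
    text = str(answer).strip().lower()
    try:
        return float(text)
    except ValueError:
        return text


def cross_lingual_agreement(keys_by_lang):
    """Return 'match', 'mismatch', or 'uncertain' for one prompt and run.

    'uncertain' (assumed meaning) covers cases where no answer could be
    extracted for at least one language.
    """
    values = list(keys_by_lang.values())
    if any(v is None or str(v).strip() == "" for v in values):
        return "uncertain"
    normalized = {normalize_key(v) for v in values}
    return "match" if len(normalized) == 1 else "mismatch"


# Illustrative inputs (assumption), shaped like the key_en / key_de / key_tr columns.
print(cross_lingual_agreement({"EN": "1", "DE": "78", "TR": "1"}))
print(cross_lingual_agreement({"EN": "Negative", "DE": "negative", "TR": " negative "}))
print(cross_lingual_agreement({"EN": "22.5", "DE": None, "TR": "18"}))
```

Normalizing before comparing keeps formatting differences such as `1` versus `1.0` from being counted as disagreements.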

In [7]:
# Compute task-aware metrics for discrete-answer tasks
task_metrics_df, discrete_stability_df = compute_task_metrics(show_progress=True)

print(f"\nTask metrics computed: {len(task_metrics_df)} cross-lingual comparisons")

Processing 432 discrete-answer responses...
Saved task metrics to /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../data/task_metrics.csv
Updated stability in /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../data/stability.csv

Task metrics computed: 144 cross-lingual comparisons


In [8]:
# View task metrics summary
print("Cross-lingual agreement by task type:")
for model_id in task_metrics_df['model_id'].unique():
    print(f"\n{model_id}:")
    print(aggregate_task_metrics_by_task_type(task_metrics_df, model_id))

Cross-lingual agreement by task type:

gemma3:4b:
                total  matches  mismatches  uncertain  match_rate  \
task_type                                                           
classification    8.0      8.0         0.0        0.0        1.00   
factual           8.0      8.0         0.0        0.0        1.00   
reasoning         8.0      6.0         2.0        0.0        0.75   

                mismatch_rate  
task_type                      
classification           0.00  
factual                  0.00  
reasoning                0.25  

gemma3:1b:
                total  matches  mismatches  uncertain  match_rate  \
task_type                                                           
classification    8.0      8.0         0.0        0.0        1.00   
factual           8.0      8.0         0.0        0.0        1.00   
reasoning         8.0      2.0         6.0        0.0        0.25   

                mismatch_rate  
task_type                      
classification        

  return df.groupby("task_type").apply(compute_rates).round(4)


## 4. View Stability Baseline

Compare cross-lingual consistency against intra-language stability to separate language effects from model randomness.
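One simple way to make this comparison concrete is to subtract the mean cross-lingual score from the mean intra-language stability score. The sketch below is an assumption about how the comparison could be done, not a function from this project, and the input numbers are made up for illustration.

```python
from statistics import mean


def consistency_gap(cross_lingual_scores, stability_scores):
    """Mean intra-language stability minus mean cross-lingual similarity.

    A gap near zero suggests cross-lingual differences are about the size of
    run-to-run variation; a clearly positive gap points to a language effect.
    """
    return mean(stability_scores) - mean(cross_lingual_scores)


# Hypothetical values for one model (not taken from the results in this notebook).
cross_lingual = [0.60, 0.70]
stability = [0.85, 0.87]
gap = consistency_gap(cross_lingual, stability)
print(f"Stability minus cross-lingual similarity: {gap:.2f}")
```

Both inputs must be on the same scale for the difference to be meaningful, so open-text cosine scores and discrete match rates should be compared separately.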

In [9]:
# Load combined stability data
stability_df = load_stability()

print("Intra-language stability by type:")
print(stability_df.groupby(['stability_type', 'language'])['stability_value'].agg(['mean', 'std', 'count']))

Intra-language stability by type:
                               mean       std  count
stability_type   language                           
discrete_match   DE        0.888889  0.316475     72
                 EN        0.972222  0.165489     72
                 TR        0.897059  0.306141     68
open_text_cosine DE        0.853530  0.130855     48
                 EN        0.863358  0.104380     48
                 TR        0.827966  0.148075     48


## 5. Generate Visualizations

In [10]:
# Generate all plots and save them
saved_paths = generate_all_plots(
    metrics_df=metrics_df,
    task_metrics_df=task_metrics_df,
    stability_df=stability_df,
    show=False  # Set to True to display inline
)

print("Generated plots:")
for plot_type, paths in saved_paths.items():
    print(f"\n{plot_type}:")
    for path in paths:
        print(f"  - {path}")


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=df, x="pair", y="cosine_similarity", ax=ax1, palette="Set2")

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=stab_df, x="language", y="stability_value", ax=ax1, palette="Set3")

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=met_df, x="pair", y="cosine_similarity", ax=ax2, palette="Set2")




Generated plots:

heatmaps:
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/heatmap_phi3_latest.png
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/heatmap_phi4-mini_3.8b.png
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/heatmap_gemma3_1b.png
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/heatmap_gemma3_4b.png
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/heatmap_llama3.2_3b.png
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/heatmap_llama3.2_1b.png

distributions:
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/distribution_phi3_latest.png
  - /home/baran/project-courses/llm-crossslingual-prompt-consistency/notebooks/../outputs/plots/distribution_phi

## 6. View Flagged Low-Similarity Examples

In [11]:
# Get flagged examples (bottom 10% by cosine similarity)
flagged = get_flagged_examples(metrics_df, responses)
print(f"Flagged low-similarity examples: {len(flagged)}")

if len(flagged) > 0:
    print("\nLowest similarity cases:")
    print(flagged[['model_id', 'prompt_id', 'task_type', 'pair', 'cosine_similarity']].head(10))

Flagged low-similarity examples: 30

Lowest similarity cases:
        model_id  prompt_id task_type   pair  cosine_similarity
232  phi3:latest         19  creative  EN-TR           0.240370
139  llama3.2:1b         20  creative  EN-TR           0.285214
226  phi3:latest         18  creative  EN-TR           0.297483
229  phi3:latest         19  creative  EN-TR           0.310010
136  llama3.2:1b         19  creative  EN-TR           0.323369
223  phi3:latest         18  creative  EN-TR           0.332236
230  phi3:latest         19  creative  DE-TR           0.343559
40     gemma3:1b         19  creative  EN-TR           0.346787
183  llama3.2:3b         19  creative  EN-DE           0.361225
129  llama3.2:1b         18  creative  EN-DE           0.367049
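For readers curious how a "bottom 10%" filter can work, here is a minimal sketch. It is not the implementation of `get_flagged_examples`, whose exact cutoff rule is not shown in this notebook: `flag_lowest` is a hypothetical helper, and the records are synthetic.

```python
def flag_lowest(records, fraction=0.10, key="cosine_similarity"):
    """Return the lowest-scoring `fraction` of records, sorted ascending by `key`.

    At least one record is returned whenever `records` is non-empty.
    """
    if not records:
        return []
    n_flagged = max(1, round(len(records) * fraction))
    return sorted(records, key=lambda r: r[key])[:n_flagged]


# Synthetic records (assumption): 20 comparisons with scores 0.00, 0.05, ..., 0.95.
records = [{"id": i, "cosine_similarity": i / 20} for i in range(20)]
flagged_demo = flag_lowest(records)
print([r["id"] for r in flagged_demo])
```

A percentile-based cutoff of this kind flags a fixed share of comparisons regardless of their absolute scores, so the flagged set is a starting point for manual review, not a list of failures.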


In [12]:
# Get mismatched discrete-answer examples
mismatched = get_mismatched_examples(task_metrics_df, responses)
print(f"Mismatched discrete-answer examples: {len(mismatched)}")

if len(mismatched) > 0:
    print("\nMismatched cases:")
    print(mismatched[['model_id', 'prompt_id', 'task_type', 'key_en', 'key_de', 'key_tr']].head(10))

Mismatched discrete-answer examples: 51

Mismatched cases:
       model_id  prompt_id       task_type    key_en    key_de    key_tr
12    gemma3:4b         11       reasoning         1        78         1
13    gemma3:4b         11       reasoning         1        78         1
34    gemma3:1b         10       reasoning      22.5        60        18
35    gemma3:1b         10       reasoning      22.5        60        90
36    gemma3:1b         11       reasoning        75         1        45
37    gemma3:1b         11       reasoning        75        60         1
38    gemma3:1b         12       reasoning         A         B         B
39    gemma3:1b         12       reasoning         A         B         B
48  llama3.2:1b          5  classification  negative  positive  positive
49  llama3.2:1b          5  classification  negative  positive  positive


## Summary

All metrics have been computed and saved:
- `data/metrics.csv` - Cross-lingual cosine similarities for open-text tasks
- `data/stability.csv` - Intra-language stability (run1 vs run2)
- `data/task_metrics.csv` - Discrete-answer agreement results
- `outputs/plots/` - Heatmaps and distribution plots
- `outputs/reports/` - Summary tables

Proceed to `03_qualitative_review.ipynb` to examine flagged low-consistency examples in detail.