# AI Translation Evaluation Pipeline

This notebook evaluates machine translation outputs produced by multiple **AI models** across a set of `.txt` files.  
It performs:

1. **Data ingestion & parsing** of each `.txt` file (splitting source vs. model output).
2. **Hypothesis/source preparation** by writing concatenated `src.txt` and `hypX.txt` files per model.
3. **Quality Estimation (QE)** of each file with a reference-free COMET QE model.
4. **Statistical testing** (paired t‑tests) to compare per-file QE scores between each pair of models.

> Notes:
> - The code is generalized to **AI models** (no vendor-specific names).
> - File/folder structure is auto‑detected from a base directory that contains one subfolder per AI model.
> - Each model subfolder is expected to contain `.txt` files with a *Source* and an *Output* section.
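
To make the expected layout concrete, the sketch below builds a toy base directory in a temporary folder. The folder names `ModelA` and `ModelB`, the file name `doc1.txt`, and the sample sentences are hypothetical; only the `Output:` marker is taken from the notebook's configuration.

```python
import os
import tempfile

def make_toy_layout(base_dir: str) -> None:
    """Create <base_dir>/<model>/doc1.txt with a source part and an 'Output:' part."""
    # Hypothetical sample content: source text, the marker, then the model output.
    samples = {
        "ModelA": "Guten Morgen.\nOutput:\nGood morning.",
        "ModelB": "Guten Morgen.\nOutput:\nGood day.",
    }
    for model, text in samples.items():
        model_dir = os.path.join(base_dir, model)
        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, "doc1.txt"), "w", encoding="utf-8") as fh:
            fh.write(text)

toy_base = tempfile.mkdtemp()
make_toy_layout(toy_base)
for model in sorted(os.listdir(toy_base)):
    print(model, sorted(os.listdir(os.path.join(toy_base, model))))
```

Pointing `BASE_DIR` at a folder shaped like `toy_base` should be enough to exercise the parsing and file-preparation steps before running on real data.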


## 1. Environment Setup

In [None]:
# If running on Google Colab, mount Drive once. Safe to skip on local environments.
try:
    from google.colab import drive  # type: ignore
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    drive.mount('/content/drive')
    BASE_DIR = "/content/drive/MyDrive/AI"  # <-- Set your base directory here
else:
    # Fallback: use a local path when not on Colab
    BASE_DIR = "./AI"  # <-- Set your local base directory here

print("IN_COLAB:", IN_COLAB)
print("BASE_DIR:", BASE_DIR)

In [None]:
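# Install required packages. %pip installs into the active kernel's environment,
# so the same line works on Colab and locally.
# openpyxl is needed later by DataFrame.to_excel for the .xlsx export.
%pip install -q unbabel-comet openpyxl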

In [None]:
import itertools
import os
from typing import Tuple

import pandas as pd
from comet import download_model, load_from_checkpoint
from scipy import stats

## 2. Configuration

In [None]:
# Configure patterns for parsing, expected file suffix, and naming rules.
PARSE_KEYS = ["Output:", "output:"]  # Split markers between source and model output (case-sensitive, tried in this order)
TXT_SUFFIX = ".txt"

# Optional: map each model folder to a human-readable hypothesis label.
# By default, labels are inferred as hyp1, hyp2, ... in alphabetical order of folders.
HYP_LABELS = {}  # e.g., {"ModelA": "hyp1", "ModelB": "hyp2"}

## 3. Discover Model Folders & Files

In [None]:
def is_model_dir(path: str) -> bool:
    # Heuristic: a model folder is a directory with at least one .txt file
    if not os.path.isdir(path):
        return False
    files = [f for f in os.listdir(path) if f.endswith(TXT_SUFFIX)]
    return len(files) > 0

model_dirs = sorted([os.path.join(BASE_DIR, d) for d in os.listdir(BASE_DIR) if is_model_dir(os.path.join(BASE_DIR, d))])
if not model_dirs:
    raise RuntimeError(f"No model directories with {TXT_SUFFIX} files found under {BASE_DIR}.")

print("Discovered model folders (alphabetical):")
for d in model_dirs:
    print(" -", os.path.basename(d))

## 4. Parse .txt Files (Source vs. Hypothesis)

In [None]:
def split_source_hyp(text: str, keys=PARSE_KEYS) -> Tuple[str, str]:
    """Split a file into (source, hypothesis) using the first matching key.
    If no key is found, returns (text, "").
    """
    for k in keys:
        if k in text:
            parts = text.split(k, 1)
            return parts[0].strip(), parts[1].strip()
    return text.strip(), ""

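# Files written by step 5 (src.txt, hypX.txt) sit next to the inputs, so skip them;
# otherwise re-running the notebook would parse them as if they were model outputs.
generated = {"src.txt"} | {f"{label}.txt" for label in HYP_LABELS.values()}

def is_generated(name: str) -> bool:
    stem = name[:-len(TXT_SUFFIX)]
    return name in generated or (stem.startswith("hyp") and stem[3:].isdigit())

# Build an index of files per model with extracted (src, hyp)
corpus = {}
for mdir in model_dirs:
    model_name = os.path.basename(mdir)
    files = sorted(f for f in os.listdir(mdir) if f.endswith(TXT_SUFFIX) and not is_generated(f))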
    entries = []
    for f in files:
        with open(os.path.join(mdir, f), 'r', encoding='utf-8', errors='ignore') as fh:
            raw = fh.read()
        src, hyp = split_source_hyp(raw)
        entries.append({"file": f, "source": src, "hypothesis": hyp})
    corpus[model_name] = entries

# Quick sanity check
print("Models parsed:", list(corpus.keys()))
for k, v in corpus.items():
    print(f"{k}: {len(v)} files")
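
`split_source_hyp` tries the keys in list order and matches them case-sensitively, so a file that uses `OUTPUT:` comes back with an empty hypothesis. If that matters for your data, one possible variant (an assumption, not part of the pipeline above) splits at the earliest marker regardless of case. The names `MARKER_RE` and `split_source_hyp_ci` are hypothetical.

```python
import re
from typing import Tuple

# Matches "output:" in any casing; a hypothetical alternative to PARSE_KEYS.
MARKER_RE = re.compile(r"output:", re.IGNORECASE)

def split_source_hyp_ci(text: str) -> Tuple[str, str]:
    """Split at the earliest 'output:' marker, ignoring case.

    Returns (text, "") when no marker is present, mirroring split_source_hyp.
    """
    match = MARKER_RE.search(text)
    if match is None:
        return text.strip(), ""
    return text[:match.start()].strip(), text[match.end():].strip()

print(split_source_hyp_ci("Bonjour.\nOUTPUT:\nHello."))  # ('Bonjour.', 'Hello.')
print(split_source_hyp_ci("No marker here"))             # ('No marker here', '')
```

Like the original, this splits wherever the marker text first appears, so a source passage that itself contains the word `output:` would be cut early.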

## 5. Prepare `src.txt` and `hypX.txt` per Model

In [None]:
# Determine hypothesis labels in a stable order
labels = {}
for i, mdir in enumerate(model_dirs, start=1):
    mname = os.path.basename(mdir)
    labels[mname] = HYP_LABELS.get(mname, f"hyp{i}")

# Create concatenated files inside each model directory
for mdir in model_dirs:
    mname = os.path.basename(mdir)
    hyp_label = labels[mname]
    src_out = os.path.join(mdir, "src.txt")
    hyp_out = os.path.join(mdir, f"{hyp_label}.txt")

    with open(src_out, "w", encoding="utf-8") as fsrc, open(hyp_out, "w", encoding="utf-8") as fhyp:
        for rec in corpus[mname]:
            # Each record is flattened to one line so src and hyp stay line-aligned; change this if sentence-level granularity is needed
            fsrc.write(rec["source"].replace("\n", " ").strip() + "\n")
            fhyp.write(rec["hypothesis"].replace("\n", " ").strip() + "\n")

    print(f"Wrote: {src_out} and {hyp_out}")
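
Since `src.txt` and `hypX.txt` are meant to be line-aligned, checking that both files have the same number of lines can catch writing problems early. This is a minimal sketch; the helper names `count_lines` and `is_aligned` and the demo files are hypothetical.

```python
import os
import tempfile

def count_lines(path: str) -> int:
    """Count the lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as fh:
        return sum(1 for _ in fh)

def is_aligned(src_path: str, hyp_path: str) -> bool:
    """True when both files hold the same number of lines."""
    return count_lines(src_path) == count_lines(hyp_path)

# Demo on throwaway files written the same way as in the loop above.
demo_dir = tempfile.mkdtemp()
src_path = os.path.join(demo_dir, "src.txt")
hyp_path = os.path.join(demo_dir, "hyp1.txt")
with open(src_path, "w", encoding="utf-8") as fsrc, open(hyp_path, "w", encoding="utf-8") as fhyp:
    # The second pair has an empty hypothesis; it still produces one (blank) line.
    for source, hypothesis in [("Hallo Welt", "Hello world"), ("Danke", "")]:
        fsrc.write(source + "\n")
        fhyp.write(hypothesis + "\n")

print(is_aligned(src_path, hyp_path))  # True
```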

## 6. Load QE Model

In [None]:
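# Reference-free Quality Estimation (QE) checkpoint: it scores (source, translation)
# pairs without needing a reference translation.
# Note: "wmt21-comet-qe-mqm" is a short model name from earlier COMET releases. If
# download_model rejects it in your installed version, check the COMET documentation
# for the current name of a QE checkpoint.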
qe_model_path = download_model("wmt21-comet-qe-mqm")
qe_model = load_from_checkpoint(qe_model_path)
print("Loaded QE model from:", qe_model_path)

## 7. Score Files with QE (per Model, per File)

In [None]:
records = []

for mdir in model_dirs:
    mname = os.path.basename(mdir)
    for rec in corpus[mname]:
        src = rec["source"]
        hyp = rec["hypothesis"]
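        # Build one sample per file for QE (reference-free); the whole file is scored as a single segment
        data = [{"src": src, "mt": hyp}]
        try:
            if not hyp:
                raise ValueError("empty hypothesis (no 'Output:' marker found?)")
            output = qe_model.predict(data, batch_size=8, gpus=0)
            score = float(output["scores"][0])
        except Exception as e:
            # Record NaN, but report the failure so it is not silently lost
            print(f"[warn] {mname}/{rec['file']}: scoring failed ({e})")
            score = float("nan")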
        records.append({
            "model": mname,
            "file": rec["file"],
            "qe_score": score
        })

qe_df = pd.DataFrame(records)
print(qe_df.head())
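
The loop above calls `predict` once per file, which repeats the prediction setup for every record. A possible alternative is to collect all non-empty pairs and score them in a single call. The sketch below takes the scoring call as a parameter (`predict_fn`, a hypothetical name) so it can be tried without loading the model; `fake_predict` is a stand-in used only for the demo and returns meaningless numbers.

```python
import math
from typing import Callable, Dict, List

def score_records(
    records: List[Dict[str, str]],
    predict_fn: Callable[[List[Dict[str, str]]], List[float]],
) -> List[float]:
    """Score all records with a single predict_fn call.

    Records with an empty source or hypothesis are not sent to the model
    and get NaN instead.
    """
    scores = [math.nan] * len(records)
    positions = [i for i, r in enumerate(records) if r["source"] and r["hypothesis"]]
    samples = [{"src": records[i]["source"], "mt": records[i]["hypothesis"]} for i in positions]
    if samples:
        for i, s in zip(positions, predict_fn(samples)):
            scores[i] = float(s)
    return scores

def fake_predict(samples):
    # Stand-in for the QE model: returns the hypothesis length, NOT a quality score.
    return [float(len(s["mt"])) for s in samples]

demo = [
    {"source": "Hallo", "hypothesis": "Hello"},
    {"source": "Danke", "hypothesis": ""},
]
print(score_records(demo, fake_predict))  # [5.0, nan]
```

With the real model, `predict_fn` could be something like `lambda s: qe_model.predict(s, batch_size=8, gpus=0)["scores"]`, assuming the installed COMET version exposes scores under that key as the loop above does.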

## 8. Aggregate Scores & Prepare for Statistical Testing

In [None]:
# Map model names to hypothesis labels and pivot to one row per file, one column per hypothesis.
# Rows are matched across models by file name.
qe_df["hypothesis"] = qe_df["model"].map(labels)
wide = qe_df.pivot_table(index="file", columns="hypothesis", values="qe_score")

# Persist scores
scores_csv = os.path.join(BASE_DIR, "qe_scores_per_file.csv")
qe_df.to_csv(scores_csv, index=False)
print("Saved per-file scores:", scores_csv)

wide_csv = os.path.join(BASE_DIR, "qe_scores_wide.csv")
wide.to_csv(wide_csv)
print("Saved wide scores:", wide_csv)

wide
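
The pivot aligns scores by file name, so a file only contributes to a comparison if it exists under the same name in both model folders. One quick way to see which files are shared is sketched below with plain sets; `files_by_model` and its contents are hypothetical stand-ins for the parsed corpus.

```python
from typing import Dict, List, Set

def common_files(files_by_model: Dict[str, List[str]]) -> Set[str]:
    """Return the file names present under every model."""
    name_sets = [set(names) for names in files_by_model.values()]
    return set.intersection(*name_sets) if name_sets else set()

def missing_files(files_by_model: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """For each model, return the files that other models have but it lacks."""
    all_names = set().union(*files_by_model.values())
    return {model: all_names - set(names) for model, names in files_by_model.items()}

# Hypothetical example: ModelB has no doc2.txt.
files_by_model = {
    "ModelA": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "ModelB": ["doc1.txt", "doc3.txt"],
}
print(sorted(common_files(files_by_model)))  # ['doc1.txt', 'doc3.txt']
print(missing_files(files_by_model))         # {'ModelA': set(), 'ModelB': {'doc2.txt'}}
```

In this notebook the input could be built from the parsed data, for example `{m: [r["file"] for r in recs] for m, recs in corpus.items()}`.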

## 9. Statistical Evaluation (Paired t‑tests)

In [None]:
# Perform paired t-tests between all pairs of hypotheses on matched files.
# Interpretation: tests whether the mean QE score differs between two hypotheses across the same file set.

results = []
cols = [c for c in wide.columns if c is not None]
for a, b in itertools.combinations(cols, 2):
    # Drop rows where either hypothesis is NaN to keep pairs matched
    ab = wide[[a, b]].dropna()
    if len(ab) < 2:
        t_stat, p_val = float("nan"), float("nan")
    else:
        t_stat, p_val = stats.ttest_rel(ab[a], ab[b])
    results.append({"Hypothesis 1": a, "Hypothesis 2": b, "t-statistic": t_stat, "p-value": p_val, "n": len(ab)})

stats_df = pd.DataFrame(results).sort_values(["Hypothesis 1", "Hypothesis 2"]).reset_index(drop=True)

stats_xlsx = os.path.join(BASE_DIR, "paired_ttest_results.xlsx")
stats_df.to_excel(stats_xlsx, index=False)
print("Paired t-test results saved to:", stats_xlsx)

stats_df
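
For reference, the paired t-statistic is the mean of the per-file score differences divided by its standard error, with n − 1 degrees of freedom. The sketch below computes it with the standard library only, as a way to see what the test measures; the helper name `paired_t` is hypothetical and the scores are made-up illustration values, not results from this pipeline.

```python
import math
import statistics
from typing import Sequence, Tuple

def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t-statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    # Note: if all differences are identical, sd_d is 0 and this division fails.
    t_stat = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1

# Made-up scores for two hypotheses on the same four files.
hyp1 = [0.10, 0.12, 0.08, 0.15]
hyp2 = [0.07, 0.10, 0.09, 0.11]
t_stat, dof = paired_t(hyp1, hyp2)
print(f"t = {t_stat:.3f}, df = {dof}")
```

The p-value then comes from the t distribution with `dof` degrees of freedom, which is the part `stats.ttest_rel` handles in the cell above.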

## 10. Methods Summary

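- **Parsing:** Each `.txt` is split into *Source* and *Output* at the first occurrence of `Output:`; if that marker is absent, the lowercase variant `output:` is tried. Other casings are not recognized, and a file without a marker yields an empty output.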
- **Hypotheses:** For each AI model folder, all sources are concatenated into `src.txt` and all corresponding model outputs into `hypX.txt` (where `X` is the hypothesis index).
- **Quality Estimation:** A segment-level, reference-free **COMET QE** model (`wmt21-comet-qe-mqm`) assigns one score to each (source, hypothesis) pair; each file is passed to the model as a single segment.
- **Statistical Testing:** Two-tailed **paired t‑tests** compare per-file QE scores between each pair of hypotheses, using only the files that have a score for both.