# ASR Data Augmentation: Analysis of Results

This notebook provides a **reproducible framework** to evaluate, analyze, and compare Automatic Speech Recognition (ASR) models on **pathological voices**.  
It accompanies the thesis *“Improving Automatic Speech Recognition for Pathological Voices Using Data Augmentation”* and supports both running a baseline evaluation and analyzing custom experiments.

### Goals
- Establish a **baseline WER** for pathological speech using a pretrained ASR model.  
- Analyze and compare different augmentation and fine-tuning experiments.  
- Provide visual and quantitative summaries of model performance.  

### Input
- Collected pathological dataset (used for baseline evaluation).  
- `metrics.csv` files from training runs (NeMo logs).  
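
A note on the shape of these logs: PyTorch Lightning's CSV logger usually writes one row per logging call, so a row holding training metrics leaves the validation columns empty, and vice versa. The sketch below uses a made-up three-row excerpt to show why rows without a WER value have to be dropped before plotting. The `train_loss` column and all numbers are illustrative assumptions, not taken from a real run.

```python
import csv
import io

# Hypothetical excerpt of a metrics.csv; values are invented for illustration.
SAMPLE = """epoch,step,train_loss,collected_pathological_eval_val_wer
0,49,1.92,
0,99,,0.431
1,199,,0.402
"""

def wer_per_epoch(csv_text, wer_col="collected_pathological_eval_val_wer"):
    """Return (epoch, wer) pairs, skipping rows where the WER cell is empty."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [(int(r["epoch"]), float(r[wer_col])) for r in rows if r[wer_col]]

points = wer_per_epoch(SAMPLE)
print(points)
```

The helper functions in this notebook achieve the same effect with pandas by selecting the epoch and WER columns and calling `dropna()`.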

### Process
1. **Baseline Evaluation**  
   - Compute Word Error Rate (WER) of a pretrained NeMo ASR model (e.g., `stt_it_conformer_ctc_large`).  
   - Use results as a reference point before applying augmentation or fine-tuning.  

2. **Experiment Analysis**  
   - Load experiment logs (`metrics.csv`).  
   - Plot learning curves (WER vs. epoch) with confidence intervals.  
   - Compare multiple experiment groups side by side.  
   - Summarize best performance (lowest mean WER, confidence intervals, relative improvements).  
   - Ensure that recognition accuracy for healthy speech does not degrade.  
   - Visualize and compare multiple runs against the baseline.  
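
The "mean ± 95% CI" and "relative improvement" figures mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the notebook obtains the t quantile from `scipy.stats.t.ppf`, whereas the sketch hard-codes the standard two-sided 95% critical values for up to six runs. The function names and the three WER values are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t for 1..5 degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def mean_ci95(values):
    """Return (mean, half_width) of a 95% confidence interval for the mean."""
    n = len(values)
    if n < 2:
        # A single run has no spread estimate; report a zero-width interval.
        return (values[0], 0.0)
    sem = stdev(values) / sqrt(n)  # standard error of the mean
    return (mean(values), T_975[n - 1] * sem)

def relative_improvement(baseline, best):
    """Relative WER reduction with respect to the baseline, in percent."""
    return (baseline - best) / baseline * 100

# Invented WERs of three repeated runs at the same epoch.
m, half = mean_ci95([0.40, 0.42, 0.44])
print(f"mean={m:.3f} +/- {half:.3f}")
print(f"relative improvement vs. 0.457: {relative_improvement(0.457, m):.1f}%")
```

With only two or three runs per experiment the t quantile is large, so wide intervals are expected and should not be read as instability of the model.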

 


In [None]:
#!/usr/bin/env python3
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
from matplotlib.ticker import PercentFormatter
from scipy import stats
import torch
from jiwer import wer, Compose, ToLowerCase, RemovePunctuation, RemoveMultipleSpaces, Strip
from nemo.collections.asr.models import EncDecCTCModel, EncDecCTCModelBPE


## CONFIG — set your paths and parameters
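
Before editing the paths, it may help to see how a glob pattern of the form used for `METRICS_PATTERN` resolves against an output directory. The sketch below builds a throwaway directory tree and searches it. The job IDs `101` and `102`, the `version_0` folder, and the helper name `find_metrics` are hypothetical, and your trainer may lay out its logs differently.

```python
import tempfile
from pathlib import Path

PATTERN = "logs/job_*/lightning_logs/**/metrics.csv"

def find_metrics(output_dir, job_id=None, pattern=PATTERN):
    """Return sorted metrics.csv paths, optionally restricted to one job ID."""
    if job_id is not None:
        pattern = pattern.replace("job_*", f"job_{job_id}")
    return sorted(Path(output_dir).glob(pattern))

# Build a throwaway directory tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for job in ("101", "102"):
        run_dir = Path(tmp) / "logs" / f"job_{job}" / "lightning_logs" / "version_0"
        run_dir.mkdir(parents=True)
        (run_dir / "metrics.csv").write_text("epoch,step\n")
    all_runs = [p.relative_to(tmp).as_posix() for p in find_metrics(tmp)]
    one_run = [p.relative_to(tmp).as_posix() for p in find_metrics(tmp, job_id="101")]

print(all_runs)
print(one_run)
```

If a job ID returns no files, check first that `OUTPUT_DIR` points at the directory that contains `logs/`.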

In [None]:

# =======================================
# Configs for the experiment analysis
# =======================================

# Root directory of your trainer outputs (matches your config.yaml paths.output_dir)
OUTPUT_DIR = Path("./nemo_out") 

# Path to your metrics CSV files (relative to OUTPUT_DIR)
METRICS_PATTERN = "logs/job_*/lightning_logs/**/metrics.csv"

# Columns in metrics.csv
EPOCH_COL   = "epoch"
VAL_WER_COL = "collected_pathological_eval_val_wer"


# ================================================
# Configs for the Baseline (NeMo pretrained model)
# ================================================

# The dataset CSV (CSV_PATH) must contain at least these columns:
#   - "is_health"      : "f", "false", "0", or "no" (case-insensitive) marks a pathological utterance;
#                        any other value, including a missing one, is treated as healthy
#   - "text"           : the ground-truth transcript of the utterance
#   - "filename_path"  : path to the audio file relative to AUDIO_ROOT
#                        (alternatively "audio_filepath", whose value is used as-is)
CSV_PATH   = Path("/data/speech-project/datasets/collected_dataset/dataset.csv")

# folder with the audio files
AUDIO_ROOT = Path("/data/speech-project/datasets/collected_dataset")   

MODEL_NAME = "stt_it_conformer_ctc_large"  

# Baseline WER of the pretrained model on the pathological subset.
# If you have already computed it, set it here to avoid recomputation;
# otherwise compute it with baseline_pathological_wer() in the next cell.
BASELINE = 0.457



## Optional: Compute the Baseline for Comparison

### 🔎 Baseline Evaluation (Pretrained NeMo Model, No Fine-Tuning)

This step computes the **baseline Word Error Rate (WER)** of a pretrained ASR model on your *pathological speech dataset*.  

**What happens here:**
1. Load your dataset metadata (`dataset.csv`), which must contain at least:
   - `is_health` → True/False flag for healthy speech  
   - `text` → Reference transcript of the utterance  
   - `filename_path` → Relative path to the audio file  
2. Filter for pathological recordings only (`is_health = False`).  
3. Load the corresponding audio files from `AUDIO_ROOT`.  
4. Run the pretrained ASR model (without fine-tuning) to generate transcriptions.  
5. Compute the WER against the reference transcripts.  
6. Print the baseline WER and show sample predictions.  

👉 This establishes a **reference performance** before applying fine-tuning, augmentation, or synthesized data.
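
For readers unfamiliar with the metric in step 5: WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following pure-Python sketch illustrates the computation, with errors pooled over all utterances instead of averaged per utterance. It is not the jiwer implementation used in this notebook. Its `normalize` helper strips only ASCII punctuation, and the two Italian sentence pairs are invented.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def word_errors(ref, hyp):
    """Word-level Levenshtein distance between reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1]

def corpus_wer(refs, hyps):
    """Total word errors divided by total reference words."""
    refs = [normalize(r) for r in refs]
    hyps = [normalize(h) for h in hyps]
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

# One substitution plus one deletion over six reference words.
score = corpus_wer(["Il gatto dorme.", "Buongiorno a tutti"],
                   ["il cane dorme", "buongiorno tutti"])
print(f"{score:.4f}")
```

Because insertions count as errors, WER can exceed 1.0 when the model produces many more words than the reference contains.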


In [None]:

_normalize = Compose([ToLowerCase(), RemovePunctuation(), RemoveMultipleSpaces(), Strip()])

def baseline_pathological_wer(
    csv_path: str | Path = CSV_PATH,
    audio_root: str | Path = AUDIO_ROOT,
    model_name: str = MODEL_NAME,
    batch_size: int = 4,      
    num_workers: int = 0,   
    preview_k: int = 10,
) -> float:
    """
    Compute the baseline WER on pathological utterances using a pretrained (NOT fine-tuned) NeMo CTC model.
    Prints the WER and the first `preview_k` reference/hypothesis pairs, then returns the WER as a float.
    """
    # ==== LOAD DATA ====
    df = pd.read_csv(csv_path)

    def _is_pathological(v):
        if pd.isna(v): 
            return False
        s = str(v).strip().lower()
        return s in {"f", "false", "0", "no"}

    df = df[df["is_health"].apply(_is_pathological)].copy()

    if "audio_filepath" in df.columns:
        audio_paths = df["audio_filepath"].astype(str).tolist()
    elif "filename_path" in df.columns:
        audio_paths = [str(Path(audio_root) / p) for p in df["filename_path"].astype(str)]
    else:
        raise ValueError("CSV must include either 'audio_filepath' or 'filename_path'.")

    refs = [_normalize(t) for t in df["text"].astype(str).tolist()]
    print(f"Evaluating {len(audio_paths)} pathological utterances.")

    # ==== LOAD MODEL ====
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = EncDecCTCModelBPE.from_pretrained(model_name, map_location=device)
    model.eval()

    # ==== TRANSCRIBE ====
    hyp_objects = model.transcribe(audio_paths, batch_size=batch_size, num_workers=num_workers)
    hyps = [_normalize(h.text) if hasattr(h, "text") else _normalize(h) for h in hyp_objects]

    # ==== COMPUTE WER ====
    baseline = wer(refs, hyps)
    print(f"\n🔎 Baseline WER (pretrained {model_name}, pathological only): {baseline:.4f}\n")

    # ==== SHOW PREDICTIONS ====
    print("🗒 Sample Predictions:\n")
    for i in range(min(preview_k, len(refs))):
        print(f"[{i+1}]")
        print(f"🗣 REF: {refs[i]}")
        print(f"🤖 HYP: {hyps[i]}")
        print()

    return baseline
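

# Optional: this loads the pretrained model and transcribes every pathological file.
# Storing the result in BASELINE lets the analysis cells below use the freshly computed value.
BASELINE = baseline_pathological_wer()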


## Helper Functions to Analyze Experiments
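
One of the helpers in this section, `healthy_never_rises_multi`, compares each epoch's healthy WER with the lowest value seen at earlier epochs. It does not compare against the immediately preceding epoch. The core rule can be sketched in a few lines of plain Python. The function name `first_rise` and the example values are assumptions for illustration.

```python
def first_rise(wers, tolerance=1e-4):
    """Index of the first value exceeding the best earlier value by more than tolerance, else None."""
    best = None
    for i, w in enumerate(wers):
        if best is not None and w - best > tolerance:
            return i
        best = w if best is None else min(best, w)
    return None

print(first_rise([0.10, 0.09, 0.09]))   # no rise above the running minimum
print(first_rise([0.10, 0.09, 0.095]))  # third value is above the best so far
```

Under this rule a run fails as soon as the WER moves above its best earlier value by more than the tolerance, even if it is still below the value at the first epoch. A larger `tolerance` may be appropriate for noisy validation sets.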

In [None]:
# -- discovery ---------------------------------------------------------------
def debug_list_job(job_id, output_dir=OUTPUT_DIR, pattern=METRICS_PATTERN):
    pat = pattern.replace("job_*", f"job_{str(job_id).strip()}")
    found = sorted(output_dir.glob(pat))
    print(f"job_{job_id}: {len(found)} file(s)")
    for p in found[:10]:
        print(" -", p)
    return found

def job_paths_from_ids(*job_ids, output_dir=OUTPUT_DIR, pattern=METRICS_PATTERN):
    """Return list of metrics.csv paths for the given job IDs."""
    out = []
    for j in job_ids:
        out += debug_list_job(j, output_dir=output_dir, pattern=pattern)
    return out

# -- 1) Per-run plot of all evaluation splits --------------------------------
def plot_result(csv_path, title=''):
    """Plot pathological/healthy/synth WER curves for a single metrics.csv (display only)."""
    df = pd.read_csv(csv_path)

    plt.figure(figsize=(12, 6))

    # Pathological
    if "collected_pathological_eval_val_wer" in df.columns:
        tmp = df[['epoch', 'collected_pathological_eval_val_wer']].dropna()
        plt.plot(tmp['epoch'], tmp['collected_pathological_eval_val_wer'], marker='o', label='Collected Pathological WER')

    # Healthy
    if "collected_healthy_eval_val_wer" in df.columns:
        tmp = df[['epoch', 'collected_healthy_eval_val_wer']].dropna()
        plt.plot(tmp['epoch'], tmp['collected_healthy_eval_val_wer'], marker='o', label='Collected Healthy WER')

    # Synth
    if "synth_eval_val_wer" in df.columns:
        tmp = df[['epoch', 'synth_eval_val_wer']].dropna()
        plt.plot(tmp['epoch'], tmp['synth_eval_val_wer'], marker='o', linestyle='--', label='Synthesized Eval WER')

    plt.xlabel('Epoch')
    plt.ylabel('Validation WER')
    plt.title(title or str(csv_path))
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # Print lowest pathological WER if present
    col = 'collected_pathological_eval_val_wer'
    if col in df.columns:
        tmp = df[['epoch', col]].dropna()
        if not tmp.empty:
            min_wer = tmp[col].min()
            min_epoch = int(tmp.loc[tmp[col].idxmin(), 'epoch'])
            print(f"Lowest Pathological WER: {min_wer:.4f} at epoch {min_epoch}")

# -- 2) Multi-run overlay of pathological split ------------------------------
def plot_pathological_wer_multi(csv_paths, title='Collected Pathological Eval WER'):
    """Each run as a separate line; legend shows job_<ID>."""
    plt.figure(figsize=(12, 6))
    plotted = 0
    for path in csv_paths:
        df = pd.read_csv(path)
        col = "collected_pathological_eval_val_wer"
        if col not in df.columns:
            continue
        tmp = df[['epoch', col]].dropna()
        # derive label like "job_52355"
        parts = Path(path).parts
        lab = next((p for p in parts if p.startswith("job_")), Path(path).name)
        plt.plot(tmp['epoch'], tmp[col], marker='o', label=lab)
        plotted += 1

    if plotted == 0:
        print("[WARN] None of the CSVs had 'collected_pathological_eval_val_wer'.")
        return

    plt.xlabel('Epoch')
    plt.ylabel('Validation WER on Real Pathological Speech')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# -- 3) Best WER per file (pathological) -------------------------------------
def lowest_pathological_wer(csv_paths):
    """Print min pathological WER per file (display only)."""
    col = "collected_pathological_eval_val_wer"
    any_printed = False
    for path in csv_paths:
        df = pd.read_csv(path)
        if col not in df.columns: 
            continue
        tmp = df[['epoch', col]].dropna()
        if tmp.empty: 
            continue
        min_wer = tmp[col].min()
        min_epoch = int(tmp.loc[tmp[col].idxmin(), 'epoch'])
        print(f"{path}: Lowest Pathological WER: {min_wer:.4f} at epoch {min_epoch}")
        any_printed = True
    if not any_printed:
        print("[INFO] No pathological WER column found in provided CSVs.")

# -- 4) Healthy WER monotonic check ------------------------------------------
def healthy_never_rises_multi(csv_paths, tolerance=1e-4, verbose=True):
    """
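    Check that the collected healthy validation WER never rises above its running
    minimum (the best value seen at earlier epochs) by more than `tolerance`.
    Returns {csv_path: bool}; files without a healthy WER column are skipped.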
    """
    results = {}
    for path in csv_paths:
        df = pd.read_csv(path)
        if "collected_healthy_eval_val_wer" in df.columns:
            hcol = "collected_healthy_eval_val_wer"
        elif "healthy_val_wer" in df.columns:
            hcol = "healthy_val_wer"
        else:
            if verbose:
                print(f"[WARN] No healthy WER column in {path}")
            continue

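        x = df[['epoch', hcol]].dropna().sort_values('epoch').reset_index(drop=True)
        if x.empty:
            # No logged healthy WER values: nothing to check for this file.
            if verbose:
                print(f"[WARN] No healthy WER values in {path}")
            continue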
        running_min = x[hcol].cummin()
        prev_min = running_min.shift(1).fillna(x[hcol].iloc[0])
        rise = x[hcol] - prev_min
        ok = not (rise > tolerance).any()
        results[path] = ok

        if verbose:
            if ok:
                print(f"✅ {path}: PASS (never rises, tol={tolerance})")
            else:
                first = x.loc[(rise > tolerance)].iloc[0]
                print(f"❌ {path}: FAIL at epoch {int(first['epoch'])} (rise {float(rise.loc[first.name]):.6f})")
    return results

# -- 5) Grouped mean ± 95% CI for ONE experiment (pathological) --------------
def analyze_single_experiment(
    csv_paths,
    label='Experiment',
    max_epoch=None,
    show_plot=True,
    starting_wer=BASELINE,                           
    wer_col='collected_pathological_eval_val_wer',
    min_runs_per_epoch=1,
    show_baseline=True,
    show_ci=True,
    color=None
):
    """Aggregate repeated runs (same experiment) → mean ± 95% CI over epochs (display only)."""
    runs = []
    for i, path in enumerate(csv_paths):
        df = pd.read_csv(path)
        if wer_col not in df.columns:
            continue
        d = df.dropna(subset=['epoch', wer_col])
        if max_epoch is not None:
            d = d[d['epoch'] <= max_epoch]
        d = d[['epoch', wer_col]].copy()
        d['run_id'] = i
        runs.append(d)
    if not runs:
        raise ValueError("No valid CSVs/columns found for this experiment.")

    combined = pd.concat(runs, ignore_index=True)

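    if show_baseline:
        # Anchor every run at the baseline WER at epoch 0.
        # Note: Lightning numbers epochs from 0, so if the logs already contain epoch-0 rows,
        # the baseline values are averaged together with them.
        n_runs = combined['run_id'].nunique()
        baselines = pd.DataFrame({'epoch':[0]*n_runs, wer_col:[starting_wer]*n_runs, 'run_id':list(range(n_runs))})
        combined = pd.concat([baselines, combined], ignore_index=True)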

    summary = (
        combined.groupby('epoch')[wer_col]
        .agg(['mean','std','count'])
        .reset_index()
        .sort_values('epoch')
    )
    summary = summary[summary['count'] >= min_runs_per_epoch]
    summary['sem']  = summary['std'] / np.sqrt(summary['count'].clip(lower=1))
    t_quant         = stats.t.ppf(0.975, np.maximum(summary['count'] - 1, 1))
    summary['ci95'] = summary['sem'].fillna(0) * t_quant

    if show_plot:
        plt.figure(figsize=(10, 5))
        x = summary['epoch'].values
        y = summary['mean'].values
        e = summary['ci95'].values
        plt.plot(x, y, linewidth=2.5, label=f'{label} (mean)', color=color)
        if show_ci:
            plt.fill_between(x, y - e, y + e, alpha=0.20, linewidth=0, label=f'{label} 95% CI', color=color)
        if show_baseline:
            plt.axhline(starting_wer, linestyle=':', linewidth=1.2, alpha=0.8, label=f'Baseline {starting_wer:.0%}')
        plt.xlabel('Epoch')
        plt.ylabel('Validation WER on Real Pathological Speech')
        plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1.0, decimals=0))
        plt.grid(True, alpha=0.25)
        plt.legend(frameon=False, loc='upper right')
        plt.tight_layout()
        plt.show()

    # Text summary
    min_row = summary.loc[summary['mean'].idxmin()]
    min_epoch = int(min_row['epoch'])
    min_mean_wer = float(min_row['mean'])
    abs_impr = starting_wer - min_mean_wer
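    # Guard against a zero baseline, as analyze_experiments_grouped does.
    rel_impr = (abs_impr / starting_wer) * 100 if starting_wer else float('nan')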
    print("📊 Summary")
    print("----------")
    print(f"🔽 Best mean WER: {min_mean_wer:.4f} at epoch {min_epoch}")
    print(f"✅ Absolute improvement from {starting_wer:.2f}: {abs_impr:.4f}")
    print(f"📈 Relative improvement: {rel_impr:.2f}%")

    return summary

# -- 6) Multi-experiment comparison (pathological, display only) --------------
def analyze_experiments_grouped(
    group_paths_dict,            # {"Exp A": [...paths...], "Exp B": [...paths...] }
    max_epoch=None,
    show_plot=True,
    starting_wer=None,
    show_shadow=True,
    color_map=None,              # Optional: {"Exp A": "red", "Exp B": "blue"}
    wer_col=None                 # Specify which column to compare
):
    """
    Overlay experiments: mean ± 95% CI per epoch (display only).
    wer_col: column name to compare (default: VAL_WER_COL)
    starting_wer: baseline for improvement
    """
    if wer_col is None:
        wer_col = VAL_WER_COL
    if starting_wer is None:
        starting_wer = BASELINE

    all_data = []

    # collect runs
    for label, paths in group_paths_dict.items():
        for rid, path in enumerate(paths):
            df = pd.read_csv(path)
            if wer_col not in df.columns:
                continue
            d = df.dropna(subset=[EPOCH_COL, wer_col])
            if max_epoch is not None:
                d = d[d[EPOCH_COL] <= max_epoch]
            d = d[[EPOCH_COL, wer_col]].copy()
            d['run_id'] = rid
            d['label'] = label
            all_data.append(d)

    if not all_data:
        raise ValueError("No valid CSVs found for any group.")

    df_all = pd.concat(all_data, ignore_index=True)

    # summary stats
    summary = (
        df_all.groupby(['label', EPOCH_COL])[wer_col]
        .agg(['mean', 'std', 'count'])
        .reset_index()
        .sort_values(['label', EPOCH_COL])
    )
    summary['sem']  = summary['std'] / np.sqrt(summary['count'].clip(lower=1))
    t_quant         = stats.t.ppf(0.975, np.maximum(summary['count'] - 1, 1))
    summary['ci95'] = summary['sem'].fillna(0) * t_quant

    # plotting
    if show_plot:
        plt.figure(figsize=(12, 6))
        ax = plt.gca()
        for label in summary['label'].unique():
            g = summary[summary['label'] == label]
            x = g[EPOCH_COL].values
            y = g['mean'].values
            e = g['ci95'].values
            color = color_map[label] if color_map and label in color_map else None
            (ln,) = ax.plot(x, y, linewidth=2, label=label, color=color)
            if show_shadow:
                ax.fill_between(x, y - e, y + e, alpha=0.20, linewidth=0, color=ln.get_color())
        ax.set_xlabel('Epoch')
        ax.set_ylabel(f'Validation {wer_col} (%)')
        ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0, decimals=0))
        ax.grid(True, alpha=0.25)
        ax.legend(frameon=False, loc='upper right')
        plt.tight_layout()
        plt.show()

    # textual summary
    print("📊 Summary per Group")
    print("--------------------")
    for label in summary['label'].unique():
        g = summary[summary['label'] == label]
        m = g.loc[g['mean'].idxmin()]
        best = float(m['mean'])
        ep   = int(m[EPOCH_COL])
        abs_impr = starting_wer - best
        rel_impr = (abs_impr / starting_wer) * 100 if starting_wer else float('nan')
        print(f"🔎 {label}: best {best:.4f} at epoch {ep} | Δ {abs_impr:.4f} ({rel_impr:.2f}%)")

    return summary


## Examples — paste your job IDs and run

In [None]:
# A) From job IDs → paths → your plots & summaries

# sanity: see what files a job id has
debug_list_job("52354")
debug_list_job("52355")

# collect paths
expA_paths = job_paths_from_ids("52354", "52355")
expB_paths = job_paths_from_ids("52356", "52357")

# show all important columns for one run
plot_result(expA_paths[0], title="Exp A — single run")

# shows all runs of one experiment in one plot
plot_pathological_wer_multi(
    expA_paths,
    title="Exp A — per-run pathological WER",           
    )

# best WER per file (pathological)
lowest_pathological_wer(expA_paths)

# healthy WER monotonic check
healthy_never_rises_multi(expA_paths, tolerance=1e-4, verbose=True)

# shows the mean WER over multiple runs of one experiment with mean ± 95% CI + summary of findings
analyze_single_experiment(
    expA_paths,
    label="Exp A (w=1, aug=on)",
    max_epoch=30,
    wer_col='collected_pathological_eval_val_wer',
    show_baseline=True,
    show_ci=True,
    color='red'
)

# compares the WER across different experiments with mean ± 95% CI + summary of findings
analyze_experiments_grouped(
    {"Exp A (w=1, aug=on)": expA_paths,
     "Exp B (w=40, aug=off)": expB_paths},
    max_epoch=30,
    wer_col='collected_pathological_eval_val_wer',
    show_shadow=True,
    color_map={"Exp A (w=1, aug=on)": "red",
               "Exp B (w=40, aug=off)": "blue"}
)

