# 📓 Eval Reproducibility Notebook — Comparative Benchmark of Moroccan Darija Toxicity Detection (Typica.ai vs. Major LLM Moderation APIs)

## 🔧 0. Setup Environment

Install the required Python packages so the notebook runs on a fresh Colab or local environment.

In [1]:
%%capture
!pip install pandas scikit-learn altair
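Because the install command above does not pin versions, it can help to record which versions were actually installed alongside the results. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# The three packages installed by the setup cell.
versions = installed_versions(["pandas", "scikit-learn", "altair"])
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

The printed `name==version` lines can be pasted into a requirements file if exact pins are wanted later.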

## 📦 1. Import Libraries

We import:
- `pandas` for data handling.
- `scikit-learn` for evaluation metrics.
- `altair` for visualization.

In [2]:
import pandas as pd
from sklearn.metrics import classification_report
import altair as alt

print("✅ Libraries imported successfully.")

✅ Libraries imported successfully.


## 📂 2. Load Baseline and Predictions

Load:
- Baseline evaluation dataset (`eval_baseline.csv`).
- Model prediction files: OpenAI, Typica.ai, Mistral, Claude.

In [3]:
eval_baseline_df = pd.read_csv('./data/eval_baseline.csv', encoding='utf-8-sig')
df_openai = pd.read_csv("./pred/openai_results.csv")
df_typicaai = pd.read_csv("./pred/typica_ai_results.csv")
df_mistral = pd.read_csv("./pred/mistral_results.csv")
df_claud = pd.read_csv("./pred/claud_results.csv")

print("✅ Data loaded.")

✅ Data loaded.
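The later cells read specific columns from these files (`sentence` and `gold_flagged` in the baseline, `Sentence` and `Flagged` in the Mistral and Claude files), so a quick column check right after loading can surface a mismatched file early. This is a sketch on a toy frame; `missing_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [name for name in required if name not in df.columns]


# Toy stand-in for one of the prediction files (illustrative values only).
toy_predictions = pd.DataFrame({"Sentence": ["example"], "Flagged": [True]})

print(missing_columns(toy_predictions, ["Sentence", "Flagged"]))      # nothing missing
print(missing_columns(toy_predictions, ["sentence", "gold_flagged"]))  # both missing
```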


## 🧹 3. Clean and Prepare Data

- Remove empty rows.
- Standardize column names.
- Drop duplicate sentences.

In [4]:
eval_baseline_df = eval_baseline_df[eval_baseline_df['sentence'].str.strip() != ""].drop_duplicates(subset='sentence')
df_openai = df_openai[df_openai['sentence'].str.strip() != ""].drop_duplicates(subset='sentence')
df_typicaai = df_typicaai[df_typicaai['sentence'].str.strip() != ""].drop_duplicates(subset='sentence')

df_mistral = df_mistral.rename(columns={'Sentence': 'sentence', 'Flagged': 'Mistral Flagged'})
df_mistral = df_mistral[df_mistral['sentence'].str.strip() != ""].drop_duplicates(subset='sentence')

df_claud = df_claud.rename(columns={'Sentence': 'sentence', 'Flagged': 'Claud Flagged'})
df_claud = df_claud[df_claud['sentence'].str.strip() != ""].drop_duplicates(subset='sentence')

print("✅ Data cleaned and prepared.")

✅ Data cleaned and prepared.
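The same filter is repeated for every DataFrame above, and `.str.strip() != ""` evaluates to `True` for missing values, so rows whose sentence is `NaN` would pass through. A possible consolidation, sketched on toy data, is shown here; `clean_sentences` is a hypothetical helper, and treating missing sentences as empty is an assumption about the intended behavior.

```python
import pandas as pd


def clean_sentences(df, column="sentence"):
    """Drop rows whose text is missing or blank, then drop duplicate sentences."""
    # Assumption: a missing sentence should be treated like an empty one.
    text = df[column].fillna("").astype(str).str.strip()
    return df[text != ""].drop_duplicates(subset=column)


# Toy data: one blank row, one missing row, one duplicate.
toy = pd.DataFrame({"sentence": ["hello", "   ", None, "hello", "world"]})
cleaned = clean_sentences(toy)
print(cleaned["sentence"].tolist())
```

With this helper, each cleaning line in the cell could become a single call such as `clean_sentences(df_openai)`.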


## 🔗 4. Merge DataFrames

Merge all system predictions with the baseline gold labels.

In [5]:
df_all = (
    eval_baseline_df
    .merge(df_openai[['sentence', 'OpenAI Flagged']], on='sentence', how='left')
    .merge(df_typicaai[['sentence', 'typica_ai Flagged']], on='sentence', how='left')
    .merge(df_mistral[['sentence', 'Mistral Flagged']], on='sentence', how='left')
    .merge(df_claud[['sentence', 'Claud Flagged']], on='sentence', how='left')
)

print(f"✅ Final merged shape: {df_all.shape}")

✅ Final merged shape: (631, 6)
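A left merge keeps every baseline row, so a sentence with no match in a prediction file ends up with `NaN` in that system's column, and `classification_report` may then fail or give misleading numbers. The sketch below, on toy data, counts such gaps and uses the `validate` argument of `DataFrame.merge` to guard against duplicate keys; the toy frames are illustrative and not taken from the evaluation set.

```python
import pandas as pd

# Toy baseline and predictions; sentence "c" has no prediction.
gold = pd.DataFrame({
    "sentence": ["a", "b", "c"],
    "gold_flagged": [True, False, True],
})
preds = pd.DataFrame({
    "sentence": ["a", "b"],
    "OpenAI Flagged": [True, False],
})

# validate="one_to_one" raises MergeError if either side has duplicate keys.
merged = gold.merge(preds, on="sentence", how="left", validate="one_to_one")

# Count rows where the system produced no prediction.
missing_counts = merged[["OpenAI Flagged"]].isna().sum()
print(missing_counts)
```

Running the same `isna().sum()` on the prediction columns of `df_all` would show whether any of the 631 rows lack a prediction before the metrics are computed.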


## 📊 5. Compute Classification Reports

For each system, generate precision, recall, F1, and support metrics.

In [6]:
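def extract_report(label, preds):
    """Return per-class and averaged metrics for one system, with columns prefixed by `label`."""
    report = classification_report(
        df_all['gold_flagged'],
        preds,
        output_dict=True,
        zero_division=0
    )
    # Rows are classes/averages and columns are metrics after the transpose.
    return pd.DataFrame(report).transpose().add_prefix(f'{label}_')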

report_openai = extract_report("OpenAI", df_all["OpenAI Flagged"])
report_typicaai = extract_report("typica_ai", df_all["typica_ai Flagged"])
report_mistral = extract_report("Mistral", df_all["Mistral Flagged"])
report_claud = extract_report("Claud", df_all["Claud Flagged"])

report_all = pd.concat([report_openai, report_typicaai, report_mistral, report_claud], axis=1).round(3)
report_all

| | OpenAI_precision | OpenAI_recall | OpenAI_f1-score | OpenAI_support | typica_ai_precision | typica_ai_recall | typica_ai_f1-score | typica_ai_support | Mistral_precision | Mistral_recall | Mistral_f1-score | Mistral_support | Claud_precision | Claud_recall | Claud_f1-score | Claud_support |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| False | 0.596 | 0.847 | 0.7 | 301.0 | 0.805 | 0.85 | 0.827 | 301.0 | 0.594 | 0.837 | 0.695 | 301.0 | 0.847 | 0.349 | 0.494 | 301.0 |
| True | 0.773 | 0.476 | 0.589 | 330.0 | 0.856 | 0.812 | 0.834 | 330.0 | 0.763 | 0.479 | 0.588 | 330.0 | 0.613 | 0.942 | 0.743 | 330.0 |
| accuracy | 0.653 | 0.653 | 0.653 | 0.653 | 0.83 | 0.83 | 0.83 | 0.83 | 0.65 | 0.65 | 0.65 | 0.65 | 0.659 | 0.659 | 0.659 | 0.659 |
| macro avg | 0.685 | 0.661 | 0.644 | 631.0 | 0.831 | 0.831 | 0.83 | 631.0 | 0.679 | 0.658 | 0.642 | 631.0 | 0.73 | 0.646 | 0.619 | 631.0 |
| weighted avg | 0.689 | 0.653 | 0.642 | 631.0 | 0.832 | 0.83 | 0.831 | 631.0 | 0.683 | 0.65 | 0.639 | 631.0 | 0.725 | 0.659 | 0.624 | 631.0 |
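The `weighted avg` row is the support-weighted mean of the per-class scores. As a small worked check, the sketch below recomputes the weighted F1 for each system from the rounded per-class F1 values and supports reported above; the helper name `weighted_f1` is hypothetical, and because the inputs are already rounded to three decimals, agreement is only expected at that precision.

```python
def weighted_f1(f1_by_class, support_by_class):
    """Support-weighted mean of per-class F1 scores."""
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total


# Supports and per-class F1 values copied from the report above.
support = {"False": 301, "True": 330}
per_class_f1 = {
    "OpenAI": {"False": 0.7, "True": 0.589},
    "typica_ai": {"False": 0.827, "True": 0.834},
    "Mistral": {"False": 0.695, "True": 0.588},
    "Claud": {"False": 0.494, "True": 0.743},
}

recomputed = {
    name: round(weighted_f1(scores, support), 3)
    for name, scores in per_class_f1.items()
}
print(recomputed)
```

The recomputed values (0.642, 0.831, 0.639, 0.624) agree with the `weighted avg` F1 entries in the report.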


## 🎨 6. Plot Weighted F1 Comparison

Visualize weighted average F1-scores across all systems.

In [7]:
f1_scores = {
    'typica_ai': report_typicaai.loc['weighted avg', 'typica_ai_f1-score'],
    'OpenAI': report_openai.loc['weighted avg', 'OpenAI_f1-score'],
    'Mistral': report_mistral.loc['weighted avg', 'Mistral_f1-score'],
    'Anthropic': report_claud.loc['weighted avg', 'Claud_f1-score']
}

plot_df = pd.DataFrame({
    'Model': list(f1_scores.keys()),
    'F1 Score': list(f1_scores.values())
}).sort_values('F1 Score', ascending=False)
plot_df['F1 Score'] = plot_df['F1 Score'].round(3)

color_scale = alt.Scale(
    domain=["typica_ai", "OpenAI", "Mistral", "Anthropic"],
    range=["#8e44ad", "#1f77b4", "#e67e22", "#27ae60"]
)

chart = alt.Chart(plot_df).mark_bar(
    cornerRadiusTopLeft=6,
    cornerRadiusTopRight=6
).encode(
    x=alt.X("Model:N", sort='-y'),
    y=alt.Y("F1 Score:Q", scale=alt.Scale(domain=[0, 1])),
    color=alt.Color("Model:N", scale=color_scale, legend=alt.Legend(title="Model")),
    tooltip=[alt.Tooltip("Model"), alt.Tooltip("F1 Score")]
).properties(
    width=550,
    height=400,
    title=alt.TitleParams(text="Weighted F1 Score Comparison", fontSize=18, fontWeight="bold")
).configure_axis(
    labelFontSize=13,
    titleFontSize=14
).configure_title(
    anchor='start'
).configure_legend(
    labelFontSize=12,
    titleFontSize=14,
    orient="right"
)

chart

## 💾 7. Export Final Report

Save the combined metrics as a CSV for the reproducibility package.

In [8]:
output_report_path = './data/final_evaluation_report.csv'
report_all.to_csv(output_report_path, index=True, encoding='utf-8-sig')

print(f"✅ Final report exported to: {output_report_path}")

✅ Final report exported to: ./data/final_evaluation_report.csv
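When the exported file is read back, the row labels come back as an ordinary column named `Unnamed: 0` unless `index_col=0` is passed to `pd.read_csv`. The sketch below illustrates this round trip with a small subset of the report written to a temporary directory; the temporary path and the three-row frame are illustrative, not the real export.

```python
import os
import tempfile

import pandas as pd

# Small subset of the report (values copied from the table above).
report = pd.DataFrame(
    {"typica_ai_f1-score": [0.827, 0.834, 0.831]},
    index=["False", "True", "weighted avg"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "report.csv")
    report.to_csv(path, index=True, encoding="utf-8-sig")

    # Without index_col, the row labels become a column called "Unnamed: 0".
    without_index = pd.read_csv(path, encoding="utf-8-sig")
    # With index_col=0, the row labels are restored as the index.
    with_index = pd.read_csv(path, encoding="utf-8-sig", index_col=0)

print(without_index.columns.tolist())
print(with_index.index.tolist())
```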
