# 04 — Translation-based Multilingual Evaluation

**Goals:**
- Translate the val/test split EN→ES and EN→FR (silver translations)
- Evaluate the trained model on EN, ES, and FR
- Measure robustness: performance drop and label flip rate

**Why only val/test are translated:**
- The original dataset's en/fr/es language metadata is noisy; the text is actually English
- Silver translations are created to test cross-lingual robustness
- Only 1-2k samples are translated (enough to measure robustness)

In [1]:
# Imports
from pathlib import Path
import pandas as pd
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    MarianMTModel,
    MarianTokenizer
)
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from tqdm.auto import tqdm

print('torch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())



torch: 2.7.1+cu118
CUDA available: True


## 1) Config

In [2]:
# Paths
DATA_DIR = Path('data_splits')
MODEL_DIR = Path('model_output/final_model')
OUTPUT_DIR = Path('translation_eval')
OUTPUT_DIR.mkdir(exist_ok=True)

# Columns
TEXT_COL = 'cleaned_text'  # actual column name from CSV
LABEL_COL = 'sentiment'

# Translation models (MarianMT)
TRANS_MODEL_ES = 'Helsinki-NLP/opus-mt-en-es'
TRANS_MODEL_FR = 'Helsinki-NLP/opus-mt-en-fr'

# Evaluation config
EVAL_SPLIT = 'test'  # 'val' or 'test'
SAMPLE_SIZE = 2000  # upper bound on translated samples; stratified sampling may return slightly fewer (reduce to 1000 if too slow)
BATCH_SIZE = 16  # translation batch size
MAX_LENGTH = 128

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {DEVICE}')

Device: cuda


## 2) Load data

In [3]:
df = pd.read_csv(DATA_DIR / f'{EVAL_SPLIT}.csv')
print(f'{EVAL_SPLIT.capitalize()} shape:', df.shape)

# Sample for translation (stratified)
if len(df) > SAMPLE_SIZE:
    df = df.groupby(LABEL_COL, group_keys=False).apply(
        lambda x: x.sample(min(SAMPLE_SIZE // df[LABEL_COL].nunique(), len(x)), random_state=42)
    ).reset_index(drop=True)
    print(f'Sampled {len(df)} for translation')

print('Label distribution:')
display(df[LABEL_COL].value_counts())

Test shape: (6751, 2)
Sampled 1998 for translation
Label distribution:




sentiment
negative    666
neutral     666
positive    666
Name: count, dtype: int64
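
In recent pandas versions, calling `groupby(...).apply` on a frame that still contains the grouping column can emit a deprecation warning. One possible alternative is `DataFrameGroupBy.sample`, sketched below on toy data. The helper name `stratified_sample` is hypothetical, and unlike the cell above it caps every class at the size of the smallest class, so it may select different rows than the original sampling.

```python
import pandas as pd

def stratified_sample(df, label_col, total, seed=42):
    """Draw the same number of rows per class, capped by the smallest class."""
    counts = df[label_col].value_counts()
    per_class = min(total // len(counts), int(counts.min()))
    return (
        df.groupby(label_col)
          .sample(n=per_class, random_state=seed)
          .reset_index(drop=True)
    )

# Toy frame that mimics the notebook's column names (not real data)
toy = pd.DataFrame({
    "cleaned_text": [f"text {i}" for i in range(30)],
    "sentiment": ["negative"] * 12 + ["neutral"] * 10 + ["positive"] * 8,
})

sampled = stratified_sample(toy, "sentiment", total=15)
print(sampled["sentiment"].value_counts())
```

With three classes, integer division explains why a target of 2000 yields 666 rows per class, or 1998 in total.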

## 3) Translation function

In [4]:
def translate_texts(texts, model_name, device='cpu', batch_size=16, max_length=256):
    """
    Translate texts using MarianMT model.
    Returns list of translated texts.
    """
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()
    
    translated = []
    
    for i in tqdm(range(0, len(texts), batch_size), desc='Translating'):
        batch = texts[i:i+batch_size]
        
        # Tokenize
        inputs = tokenizer(
            batch,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=max_length
        ).to(device)
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=max_length)
        
        # Decode
        batch_trans = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translated.extend(batch_trans)
    
    return translated
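
A possible refinement, not part of this notebook's pipeline: grouping texts of similar length into the same batch tends to reduce padding, which may speed up generation. The sketch below shows only the reordering logic in plain Python. The names `batched_by_length`, `apply_in_length_order`, and `fake_translate` are hypothetical, and `fake_translate` stands in for a real batch translation call.

```python
def batched_by_length(texts, batch_size):
    """Yield (original_indices, batch) pairs, shortest texts first."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [texts[i] for i in idx]

def apply_in_length_order(texts, fn, batch_size=16):
    """Run fn on length-sorted batches and return results in the input order."""
    results = [None] * len(texts)
    for idx, batch in batched_by_length(texts, batch_size):
        for i, out in zip(idx, fn(batch)):
            results[i] = out
    return results

def fake_translate(batch):
    # Stand-in for a model call: one output string per input string
    return [text.upper() for text in batch]

texts = ["a long sentence here", "hi", "medium text", "x"]
out = apply_in_length_order(texts, fake_translate, batch_size=2)
print(out)
```

Writing results back by original index keeps the translations aligned with the rows of `df`, which matters because they are later assigned as a new column.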

## 4) Translate EN→ES

In [5]:
texts_en = df[TEXT_COL].astype(str).tolist()

print('Translating EN→ES...')
texts_es = translate_texts(
    texts_en,
    model_name=TRANS_MODEL_ES,
    device=DEVICE,
    batch_size=BATCH_SIZE,
    max_length=256
)

df['text_es'] = texts_es
print('✅ ES translation complete')

# Show examples
print('\nExamples (EN→ES):')
for i in range(3):
    print(f'EN: {texts_en[i][:100]}...')
    print(f'ES: {texts_es[i][:100]}...')
    print()

Translating EN→ES...


Translating: 100%|██████████| 125/125 [01:16<00:00,  1.64it/s]

✅ ES translation complete

Examples (EN→ES):
EN: Phishing breached our database. Reality kitchen serious next nature receive but focus....
ES: Phishing rompió nuestra base de datos la cocina de la realidad seria próxima naturaleza recibir pero...

EN: Theory I song down concern upon air grow cause billion prove national. Million firm score remain....
ES: La teoría de la preocupación por el aire crece, porque miles de millones de personas se convierten e...

EN: Away apply budget believe cold half answer themselves positive pass will ever....
ES: Lejos de aplicar el presupuesto creen que la mitad fría respuesta ellos mismos positivo pase alguna ...






## 5) Translate EN→FR

In [6]:
print('Translating EN→FR...')
texts_fr = translate_texts(
    texts_en,
    model_name=TRANS_MODEL_FR,
    device=DEVICE,
    batch_size=BATCH_SIZE,
    max_length=256
)

df['text_fr'] = texts_fr
print('✅ FR translation complete')

# Show examples
print('\nExamples (EN→FR):')
for i in range(3):
    print(f'EN: {texts_en[i][:100]}...')
    print(f'FR: {texts_fr[i][:100]}...')
    print()

Translating EN→FR...


Translating: 100%|██████████| 125/125 [01:08<00:00,  1.81it/s]

✅ FR translation complete

Examples (EN→FR):
EN: Phishing breached our database. Reality kitchen serious next nature receive but focus....
FR: L'hameçonnage a violé notre base de données....

EN: Theory I song down concern upon air grow cause billion prove national. Million firm score remain....
FR: La théorie que je chante s'inquiète de la croissance de l'air parce que des milliards de dollars s'a...

EN: Away apply budget believe cold half answer themselves positive pass will ever....
FR: Loin d'appliquer le budget croire froide moitié répondre eux-mêmes positif passe jamais....
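
In the FR examples above, the first translation appears to cover only the first sentence of the source. A simple character-length ratio can flag such cases for manual review. This is a minimal sketch: the helper `flag_short_translations` and the `min_ratio=0.6` threshold are assumptions, not values taken from the notebook, and a low ratio is only a hint, not proof of a dropped sentence.

```python
def flag_short_translations(sources, translations, min_ratio=0.6):
    """Return (index, ratio) for translations much shorter than their source."""
    flagged = []
    for idx, (src, tgt) in enumerate(zip(sources, translations)):
        ratio = len(tgt) / max(len(src), 1)  # guard against empty sources
        if ratio < min_ratio:
            flagged.append((idx, round(ratio, 2)))
    return flagged

# Two EN/FR pairs copied from the example output above
sources = [
    "Phishing breached our database. Reality kitchen serious next nature receive but focus.",
    "Away apply budget believe cold half answer themselves positive pass will ever.",
]
translations = [
    "L'hameçonnage a violé notre base de données.",
    "Loin d'appliquer le budget croire froide moitié répondre eux-mêmes positif passe jamais.",
]

flagged = flag_short_translations(sources, translations)
print(flagged)
```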






## 6) Save translated dataset

In [7]:
df.to_csv(OUTPUT_DIR / f'{EVAL_SPLIT}_translated.csv', index=False)
print(f'✅ Saved to {OUTPUT_DIR / f"{EVAL_SPLIT}_translated.csv"}')

✅ Saved to translation_eval\test_translated.csv


## 7) Load trained model

In [8]:
tokenizer = AutoTokenizer.from_pretrained(str(MODEL_DIR))
model = AutoModelForSequenceClassification.from_pretrained(str(MODEL_DIR)).to(DEVICE)
model.eval()

print(f'✅ Model loaded from {MODEL_DIR}')
print(f'Num labels: {model.config.num_labels}')

✅ Model loaded from model_output\final_model
Num labels: 3


## 8) Encode labels

In [9]:
# Load label map if exists
import json
label_map_path = Path('model_output/label_map.json')

if label_map_path.exists():
    with open(label_map_path) as f:
        label_map = json.load(f)
    df['label_id'] = df[LABEL_COL].map(label_map)
    print('Label map loaded:', label_map)
else:
    # Assume already numeric
    df['label_id'] = df[LABEL_COL]
    print('Labels assumed numeric')

Label map loaded: {'negative': 0, 'neutral': 1, 'positive': 2}
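
`Series.map` returns `NaN` for any label that is missing from the map, and those values would only surface later as errors in the metrics. A small guard can fail early instead. The sketch below assumes a hypothetical helper `encode_labels` and uses toy labels; the map matches the one printed above.

```python
import pandas as pd

def encode_labels(labels, label_map):
    """Map string labels to ids and fail loudly on labels missing from the map."""
    ids = labels.map(label_map)
    if ids.isna().any():
        unknown = sorted(labels[ids.isna()].unique())
        raise ValueError(f"Labels missing from label_map: {unknown}")
    return ids.astype(int)

label_map = {"negative": 0, "neutral": 1, "positive": 2}

ids = encode_labels(pd.Series(["negative", "positive", "neutral"]), label_map)
print(ids.tolist())

try:
    encode_labels(pd.Series(["negative", "mixed"]), label_map)
except ValueError as err:
    print(err)
```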


## 9) Prediction function

In [10]:
def predict_batch(texts, model, tokenizer, device='cpu', batch_size=32, max_length=128):
    """
    Predict labels for a list of texts.
    Returns array of predicted label ids.
    """
    preds = []
    
    for i in tqdm(range(0, len(texts), batch_size), desc='Predicting'):
        batch = texts[i:i+batch_size]
        
        inputs = tokenizer(
            batch,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=max_length
        ).to(device)
        
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
        
        batch_preds = torch.argmax(logits, dim=-1).cpu().numpy()
        preds.extend(batch_preds)
    
    return np.array(preds)

## 10) Evaluate on EN

In [11]:
print('Evaluating on EN...')
preds_en = predict_batch(
    df[TEXT_COL].astype(str).tolist(),
    model, tokenizer,
    device=DEVICE,
    batch_size=32,
    max_length=MAX_LENGTH
)

df['pred_en'] = preds_en
acc_en = accuracy_score(df['label_id'], preds_en)
f1_en = f1_score(df['label_id'], preds_en, average='macro')

print(f'\n✅ EN — Accuracy: {acc_en:.4f}  |  F1-macro: {f1_en:.4f}')
print('\nClassification report (EN):')
print(classification_report(df['label_id'], preds_en, digits=4))

Evaluating on EN...


Predicting: 100%|██████████| 63/63 [00:06<00:00, 10.43it/s]


✅ EN — Accuracy: 0.9515  |  F1-macro: 0.9510

Classification report (EN):
              precision    recall  f1-score   support

           0     0.9545    0.9775    0.9659       666
           1     0.9582    0.8949    0.9255       666
           2     0.9424    0.9820    0.9618       666

    accuracy                         0.9515      1998
   macro avg     0.9517    0.9515    0.9510      1998
weighted avg     0.9517    0.9515    0.9510      1998






## 11) Evaluate on ES

In [12]:
print('Evaluating on ES...')
preds_es = predict_batch(
    df['text_es'].astype(str).tolist(),
    model, tokenizer,
    device=DEVICE,
    batch_size=32,
    max_length=MAX_LENGTH
)

df['pred_es'] = preds_es
acc_es = accuracy_score(df['label_id'], preds_es)
f1_es = f1_score(df['label_id'], preds_es, average='macro')

print(f'\n✅ ES — Accuracy: {acc_es:.4f}  |  F1-macro: {f1_es:.4f}')
print('\nClassification report (ES):')
print(classification_report(df['label_id'], preds_es, digits=4))

Evaluating on ES...


Predicting: 100%|██████████| 63/63 [00:08<00:00,  7.67it/s]


✅ ES — Accuracy: 0.6211  |  F1-macro: 0.6159

Classification report (ES):
              precision    recall  f1-score   support

           0     0.8107    0.3859    0.5229       666
           1     0.4823    0.8979    0.6275       666
           2     0.8753    0.5796    0.6974       666

    accuracy                         0.6211      1998
   macro avg     0.7228    0.6211    0.6159      1998
weighted avg     0.7228    0.6211    0.6159      1998






## 12) Evaluate on FR

In [13]:
print('Evaluating on FR...')
preds_fr = predict_batch(
    df['text_fr'].astype(str).tolist(),
    model, tokenizer,
    device=DEVICE,
    batch_size=32,
    max_length=MAX_LENGTH
)

df['pred_fr'] = preds_fr
acc_fr = accuracy_score(df['label_id'], preds_fr)
f1_fr = f1_score(df['label_id'], preds_fr, average='macro')

print(f'\n✅ FR — Accuracy: {acc_fr:.4f}  |  F1-macro: {f1_fr:.4f}')
print('\nClassification report (FR):')
print(classification_report(df['label_id'], preds_fr, digits=4))

Evaluating on FR...


Predicting: 100%|██████████| 63/63 [00:08<00:00,  7.24it/s]


✅ FR — Accuracy: 0.6171  |  F1-macro: 0.6137

Classification report (FR):
              precision    recall  f1-score   support

           0     0.8797    0.3844    0.5350       666
           1     0.4709    0.8859    0.6149       666
           2     0.8524    0.5811    0.6911       666

    accuracy                         0.6171      1998
   macro avg     0.7343    0.6171    0.6137      1998
weighted avg     0.7343    0.6171    0.6137      1998
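
In both the ES and FR reports, class 1 has high recall but low precision, which suggests that many translated texts from the other classes are predicted as class 1. A confusion table makes that kind of shift visible. The sketch below uses `pandas.crosstab` on toy arrays, not the notebook's predictions; `confusion_table` is a hypothetical helper, and `id_to_name` is assumed to be the inverse of the label map.

```python
import pandas as pd

def confusion_table(y_true, y_pred, id_to_name):
    """Rows are true labels, columns are predicted labels."""
    true_names = pd.Series(y_true, name="true").map(id_to_name)
    pred_names = pd.Series(y_pred, name="pred").map(id_to_name)
    table = pd.crosstab(true_names, pred_names)
    names = list(id_to_name.values())
    # Keep a fixed label order and show classes that were never predicted
    return table.reindex(index=names, columns=names, fill_value=0)

id_to_name = {0: "negative", 1: "neutral", 2: "positive"}

# Toy labels and predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 1, 2]

table = confusion_table(y_true, y_pred, id_to_name)
print(table)
```

On the real data, the same call could take `df['label_id']` and `df['pred_es']` or `df['pred_fr']`.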






## 13) Robustness summary

In [14]:
results = pd.DataFrame({
    'Language': ['EN', 'ES', 'FR'],
    'Accuracy': [acc_en, acc_es, acc_fr],
    'F1-macro': [f1_en, f1_es, f1_fr],
})

# Difference from the EN baseline (negative values mean the model does worse than on EN)
results['Acc_drop'] = results['Accuracy'] - acc_en
results['F1_drop'] = results['F1-macro'] - f1_en

print('\n' + '='*60)
print('ROBUSTNESS SUMMARY')
print('='*60)
display(results)

# Save results
results.to_csv(OUTPUT_DIR / 'robustness_results.csv', index=False)
print(f'\n✅ Results saved to {OUTPUT_DIR / "robustness_results.csv"}')


ROBUSTNESS SUMMARY


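|   | Language | Accuracy | F1-macro | Acc_drop  | F1_drop   |
|---|----------|----------|----------|-----------|-----------|
| 0 | EN       | 0.951451 | 0.951035 | 0.0       | 0.0       |
| 1 | ES       | 0.621121 | 0.615921 | -0.33033  | -0.335115 |
| 2 | FR       | 0.617117 | 0.61366  | -0.334334 | -0.337375 |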



✅ Results saved to translation_eval\robustness_results.csv
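
The drops in the summary are absolute differences. Expressing them relative to the EN baseline can make runs with different baselines easier to compare. A minimal sketch, using the accuracy values from the summary; `relative_drop` is a hypothetical helper.

```python
def relative_drop(score, baseline):
    """Change relative to the baseline; negative means worse than the baseline."""
    return (score - baseline) / baseline

# Accuracy values copied from the robustness summary
accuracy = {"EN": 0.951451, "ES": 0.621121, "FR": 0.617117}

for lang, acc in accuracy.items():
    print(f"{lang}: {relative_drop(acc, accuracy['EN']):+.1%}")
```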


## 14) Label flip rate (EN vs ES/FR)

In [15]:
flip_en_es = (df['pred_en'] != df['pred_es']).mean()
flip_en_fr = (df['pred_en'] != df['pred_fr']).mean()

print('\nLabel flip rate (prediction changes):')
print(f'  EN→ES: {flip_en_es:.2%}')
print(f'  EN→FR: {flip_en_fr:.2%}')

# Save flip stats
flip_stats = pd.DataFrame({
    'comparison': ['EN→ES', 'EN→FR'],
    'flip_rate': [flip_en_es, flip_en_fr]
})
flip_stats.to_csv(OUTPUT_DIR / 'flip_rate.csv', index=False)


Label flip rate (prediction changes):
  EN→ES: 38.69%
  EN→FR: 38.29%
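
The flip rate counts every change of prediction, whether the translated prediction became wrong, became correct, or moved between two wrong classes. Splitting flips by direction is one way to see how much of the rate reflects lost accuracy. The sketch below runs on toy arrays; `flip_breakdown` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def flip_breakdown(y_true, pred_src, pred_tgt):
    """Split the flip rate by whether each side of the flip was correct."""
    y_true, pred_src, pred_tgt = map(np.asarray, (y_true, pred_src, pred_tgt))
    flipped = pred_src != pred_tgt
    src_ok = pred_src == y_true
    tgt_ok = pred_tgt == y_true
    return {
        "flip_rate": float(flipped.mean()),
        "correct_to_wrong": float((flipped & src_ok).mean()),
        "wrong_to_correct": float((flipped & tgt_ok).mean()),
        "wrong_to_wrong": float((flipped & ~src_ok & ~tgt_ok).mean()),
    }

# Toy labels and predictions, for illustration only
y_true   = [0, 1, 2, 0, 1, 2]
pred_en  = [0, 1, 2, 1, 1, 0]
pred_es  = [1, 1, 2, 0, 2, 1]

stats = flip_breakdown(y_true, pred_en, pred_es)
for name, value in stats.items():
    print(f"{name}: {value:.2%}")
```

The three directional rates sum to the overall flip rate, so on the real data they could be computed from `df['label_id']`, `df['pred_en']`, and `df['pred_es']` or `df['pred_fr']`.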


## 15) Save predictions for XAI analysis

In [None]:
# Save full df with predictions
df.to_csv(OUTPUT_DIR / f'{EVAL_SPLIT}_with_predictions.csv', index=False)
print(f'✅ Full predictions saved to {OUTPUT_DIR / f"{EVAL_SPLIT}_with_predictions.csv"}')

✅ Full predictions saved to translation_eval\test_with_predictions.csv



---
## Next steps
- Notebook 05: XAI (Integrated Gradients) + explanation consistency metrics (CTAM, overlap)