# Week 4 - Activity 1: Evaluating Machine Translation Outputs

In this activity, we'll analyze machine translation outputs using different evaluation metrics (BLEU and chrF) and compare them with human evaluations. We'll:

1. Load and examine WMT shared task data
2. Calculate different automatic metrics
3. Compare metric rankings with human evaluations
4. Analyze cases where automatic and human evaluations differ significantly

In [None]:
import evaluate
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import load_dataset

## 1. Load WMT Data

We'll use the WMT metrics shared task data, which includes machine translations and human judgments:

In [None]:
# Load WMT metrics data
print("\nLoading WMT metrics dataset...")
dataset = load_dataset("nllg/wmt-metrics-data")
print("Successfully loaded WMT metrics dataset")

# Create a DataFrame with source, reference, and system outputs
data = []
max_samples = 100  # Limit samples for faster processing

print("\nProcessing dataset entries...")
for item in dataset['test']:  # Using test split as it contains human evaluations
    try:
        data.append({
            'source': item['src'],
            'reference': item['ref'],
            'system_output': item['mt'],
            'human_score': item['score'],
            'language_pair': item['lp'],
            'score_type': item['score_type']
        })

        if len(data) >= max_samples:
            break
    except KeyError as e:
        # Skip entries that lack one of the expected fields
        print(f"Skipping item with missing field: {e}")

df = pd.DataFrame(data)
print("Dataset size:", len(df))
print("\nExample entry:")
print(df.iloc[0])
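
Taking the first `max_samples` entries can leave the sample dominated by a single language pair, depending on how the dataset is ordered, which gives the per-language analysis later in this activity little to compare. A minimal sketch of one alternative, capping the number of entries kept per language pair, is shown below. The helper name `sample_per_group` and the toy records are hypothetical; only the `lp` field name comes from the loading code above.

```python
from collections import defaultdict

def sample_per_group(items, key, per_group):
    """Keep at most `per_group` items for each distinct value of item[key]."""
    counts = defaultdict(int)
    kept = []
    for item in items:
        group = item[key]
        if counts[group] < per_group:
            counts[group] += 1
            kept.append(item)
    return kept

# Toy records standing in for dataset entries (not real WMT data)
toy_items = [
    {"lp": "en-de", "mt": "a"},
    {"lp": "en-de", "mt": "b"},
    {"lp": "en-de", "mt": "c"},
    {"lp": "zh-en", "mt": "d"},
]
balanced = sample_per_group(toy_items, key="lp", per_group=2)
print([item["mt"] for item in balanced])
```

Unlike the early `break`, this approach scans every entry it is given, so on a large split it trades speed for coverage.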

## 2. Calculate Automatic Metrics

Let's compute BLEU and chrF scores for each translation:

In [None]:
# Load metrics
print("\nLoading evaluation metrics...")
bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

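def calculate_metrics(row):
    """Compute sentence-level BLEU and chrF for one translation."""
    predictions = [row['system_output']]
    references = [[row['reference']]]  # one list of references per prediction

    # Sentence-level BLEU is noisy on short segments; treat single scores with care
    bleu_score = bleu.compute(predictions=predictions, references=references)

    # chrF compares character n-grams, so partial word matches still earn credit
    chrf_score = chrf.compute(predictions=predictions, references=references)

    return pd.Series({
        'bleu': bleu_score['score'],
        'chrf': chrf_score['score']
    })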

# Calculate metrics for each row
print("\nCalculating metrics...")
metrics = df.apply(calculate_metrics, axis=1)
df = pd.concat([df, metrics], axis=1)

print("\nMetrics summary:")
print(df[['bleu', 'chrf', 'human_score']].describe())
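
To build intuition for why the two metrics can diverge, the sketch below computes a word n-gram precision (the core ingredient of BLEU) and a single-order character n-gram F-score (the core ingredient of chrF) in plain Python. It is a simplified illustration, not the sacreBLEU implementation: real BLEU combines precisions for n = 1 to 4 with a brevity penalty, and real chrF averages over several character n-gram orders. The function names and the toy sentences are hypothetical.

```python
from collections import Counter

def ngrams(seq, n):
    """Count all contiguous n-grams in a sequence (list of words or string)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def overlap(hyp_counts, ref_counts):
    """Clipped match count: each n-gram is credited at most as often as it occurs in the reference."""
    return sum((hyp_counts & ref_counts).values())

def word_ngram_precision(hyp, ref, n):
    """Fraction of hypothesis word n-grams that also appear in the reference."""
    hyp_counts = ngrams(hyp.split(), n)
    ref_counts = ngrams(ref.split(), n)
    total = sum(hyp_counts.values())
    return overlap(hyp_counts, ref_counts) / total if total else 0.0

def char_ngram_fscore(hyp, ref, n, beta=2.0):
    """F-beta score over character n-grams, ignoring spaces (single order only)."""
    hyp_counts = ngrams(hyp.replace(" ", ""), n)
    ref_counts = ngrams(ref.replace(" ", ""), n)
    matches = overlap(hyp_counts, ref_counts)
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp_counts.values())
    recall = matches / sum(ref_counts.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy pair: the hypothesis differs from the reference by one plural ending
hyp = "the cats sat"
ref = "the cat sat"
print("word 1-gram precision:", word_ngram_precision(hyp, ref, 1))
print("word 2-gram precision:", word_ngram_precision(hyp, ref, 2))
print("char 3-gram F-score:  ", char_ngram_fscore(hyp, ref, 3))
```

In this toy pair a single inflection error removes every matching word bigram, while most character trigrams still match. This is one reason chrF tends to be more forgiving of morphological variation.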

## 3. Analyze Translation Examples

Let's look at some example translations and their scores:

In [None]:
print("\nSample translations with metrics:")
for idx, row in df.head(3).iterrows():
    print(f"\nExample {idx+1}:")
    print(f"Language pair: {row['language_pair']}")
    print(f"Source: {row['source']}")
    print(f"Reference: {row['reference']}")
    print(f"System Output: {row['system_output']}")
    print("\nScores:")
    print(f"Human Score ({row['score_type']}): {row['human_score']:.3f}")
    print(f"BLEU: {row['bleu']:.3f}")
    print(f"chrF: {row['chrf']:.3f}")

## 4. Analyze Metric Correlations

Let's examine how well the automatic metrics correlate with human judgments:

In [None]:
# Calculate correlations
correlations = df[['bleu', 'chrf', 'human_score']].corr()
print("\nPearson correlation matrix:")
print(correlations)

# Calculate Spearman rank correlations
rank_correlations = df[['bleu', 'chrf', 'human_score']].corr(method='spearman')
print("\nSpearman rank correlation matrix:")
print(rank_correlations)

# Visualize correlations
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
sns.heatmap(correlations, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pearson Correlations')

plt.subplot(1, 2, 2)
sns.heatmap(rank_correlations, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Spearman Rank Correlations')

plt.tight_layout()
plt.show()
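
Pearson measures linear association between the raw scores, while Spearman is Pearson applied to ranks, so it only asks whether two score lists order the translations the same way. The stdlib-only sketch below illustrates the difference on made-up numbers. The helper names are hypothetical, and ties receive average ranks, which is also the default ranking method in pandas.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; returns 0.0 if either list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy scores (not real data): y grows monotonically with x, but not linearly
toy_x = [1, 2, 3, 4, 5]
toy_y = [1, 4, 9, 16, 25]
print("Pearson: ", pearson(toy_x, toy_y))
print("Spearman:", spearman(toy_x, toy_y))
```

When the two matrices above differ noticeably for a metric, it suggests the metric orders translations similarly to humans without tracking the size of the quality differences, or the reverse.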

## 5. Analyze Disagreements

Let's look at cases where automatic metrics strongly disagree with human judgments:

In [None]:
# Calculate normalized scores (z-scores) so the metrics share a common scale
score_cols = ['bleu', 'chrf', 'human_score']
df_norm = df[score_cols].apply(lambda x: (x - x.mean()) / x.std())

# Find the largest disagreement between each automatic metric and the human score
for metric in ['bleu', 'chrf']:
    diff = (df_norm[metric] - df_norm['human_score']).abs()
    worst_idx = diff.idxmax()
    
    print(f"\nLargest disagreement for {metric.upper()}:")
    row = df.loc[worst_idx]
    print(f"Language pair: {row['language_pair']}")
    print(f"Source: {row['source']}")
    print(f"Reference: {row['reference']}")
    print(f"System Output: {row['system_output']}")
    print(f"Human score ({row['score_type']}): {row['human_score']:.3f}")
    print(f"BLEU score: {row['bleu']:.3f}")
    print(f"chrF score: {row['chrf']:.3f}")
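
The loop above reports only the single largest gap per metric, and not its direction. A small stdlib-only sketch of the same z-score comparison that keeps the sign is shown below: a positive gap means the metric rated the translation higher than the annotators did, relative to each score's own distribution. The function names and toy scores are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values using the sample standard deviation (as pandas' .std() does)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        # All values identical: no spread to normalize, so treat every z-score as 0
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def signed_gaps(metric_scores, human_scores):
    """Return (index, gap) pairs sorted by absolute gap, largest first."""
    gaps = [m - h for m, h in zip(z_scores(metric_scores), z_scores(human_scores))]
    return sorted(enumerate(gaps), key=lambda pair: abs(pair[1]), reverse=True)

# Toy scores (not real data): the last translation has a high metric score
# but a low human score
toy_metric = [10.0, 20.0, 30.0, 40.0]
toy_human = [0.1, 0.2, 0.3, -0.5]
ranked = signed_gaps(toy_metric, toy_human)
worst_index, worst_gap = ranked[0]
print(f"Largest gap at index {worst_index}: {worst_gap:+.2f}")
```

Inspecting the top few entries of `ranked`, rather than only the first, makes it easier to see whether disagreements share a pattern.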

## 6. Language-Specific Analysis

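Let's examine how metrics perform for different language pairs. Pairs with five or fewer samples are skipped, so with `max_samples = 100` some pairs may not appear: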

In [None]:
print("\nCorrelations by language pair:")
for lp in df['language_pair'].unique():
    lp_data = df[df['language_pair'] == lp]
    if len(lp_data) > 5:  # Only show if we have enough samples
        print(f"\n{lp} ({len(lp_data)} samples):")
        lp_corr = lp_data[['bleu', 'chrf', 'human_score']].corr()['human_score'][['bleu', 'chrf']]
        print("Pearson correlations with human scores:")
        print(lp_corr)

## Discussion Points

1. Which metric correlates better with human judgments? Why might this be?
   - Compare the Pearson and Spearman correlations
   - Consider the differences in how BLEU and chrF work

2. What types of translations tend to show high disagreement between automatic metrics and human judgments?
   - Look at the examples with largest disagreements
   - Consider factors like:
     * Literal vs. natural translations
     * Complex vs. simple sentences
     * Cultural adaptations

3. How do metrics perform across different language pairs?
   - Look at the per-language correlations
   - Consider linguistic differences between languages

4. What are the limitations of each metric?
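   - BLEU: Rewards only exact word n-gram matches, so synonyms and paraphrases earn no credit
   - chrF: Character-level matching rewards surface similarity, which need not reflect meaning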
   - Consider what aspects of translation quality they might miss

5. How could we improve automatic evaluation?
   - Combining multiple metrics
   - Task-specific metrics
   - Better alignment with human judgments
   - Neural metrics like COMET