# Improved Methodology: Translation and Evaluation Study

## Key Improvements Addressing Reviewer Concerns

1. **Separate Translation and Evaluation Tasks** - Addresses Reviewer 1's concern about handling translation and evaluation in a single prompt
2. **Blind Evaluation** - Translations are labeled A/B/C, so GPT does not know which source produced which translation
3. **Explicit FD Framework** - Force Dynamics (FD) theory is provided to GPT
4. **Clear Evaluation Criteria** - Criteria are defined for lexis, syntax, semantics, and FD

## Data Source
https://docs.google.com/spreadsheets/d/1gU-Y_meE7A-re4LCA3Bq24K7b7vs9U3ZFXy7oS6ftdQ/edit?usp=sharing

In [1]:
import pandas as pd
import json
import random
import os

# Reproducibility
SEED = 42
random.seed(SEED)
print(f'[INFO] SEED={SEED}')

# Load API key from .env file (optional)
try:
    from dotenv import load_dotenv
    load_dotenv(dotenv_path='.env', override=False)
    print('[OK] .env loaded')
except (ImportError, AssertionError):
    pass

# Fallback: manual .env read to populate environment
if 'OPENAI_API_KEY' not in os.environ and os.path.exists('.env'):
    with open('.env', 'r', encoding='utf-8') as f:
        for line in f:
            if line.startswith('OPENAI_API_KEY='):
                # Drop the trailing newline and any surrounding quotes
                value = line.strip().split('=', 1)[1].strip('"\'')
                os.environ['OPENAI_API_KEY'] = value
                print('[OK] OPENAI_API_KEY loaded from .env')
                break



[INFO] SEED=42
[OK] .env loaded


In [2]:
# Import improved methodology functions
from scripts.legacy.improved_methodology import process_sentence_blind, process_dataframe_blind, analyze_blind_results
import ast
from pathlib import Path

# Offline mode: reuse cached outputs when API access is blocked
USE_CACHED_RESULTS = True
OUTPUT_DIR = Path("output")
CACHE_DIR = OUTPUT_DIR

OUTPUT_DIR.mkdir(exist_ok=True)

def load_cached_results(path: Path):
    """Load cached results from Excel and restore dict-valued columns."""
    df = pd.read_excel(path)
    results = []
    for _, row in df.iterrows():
        result = row.to_dict()
        # Dict columns come back from Excel as strings; parse them safely
        for key in ("blind_evaluation", "mapping"):
            if isinstance(result.get(key), str):
                result[key] = ast.literal_eval(result[key])
        results.append(result)
    return results
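
The `ast.literal_eval` calls above suggest that the cached `blind_evaluation` and `mapping` cells hold the Python string form of each dict rather than JSON. The following is a minimal sketch of that round trip, assuming this is how the cells were stored; the `mapping` value mirrors the one printed later in this notebook.

```python
import ast
import json

# A mapping like the one printed later in this notebook.
mapping = {'A': 'Human', 'B': 'GPT', 'C': 'Google'}

# Assumption: the spreadsheet cell holds the dict's Python string form,
# which uses single quotes and is therefore not valid JSON.
cell_text = str(mapping)

# literal_eval parses Python literals only; it does not execute code.
restored = ast.literal_eval(cell_text)

try:
    json.loads(cell_text)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

print(restored == mapping, json_ok)
```

If the cache were ever written as JSON instead, `json.loads` would be the matching parser, and `literal_eval` would fail on JSON-only tokens such as `true` or `null`.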


## Data Import

In [3]:
SOURCE_SHEET_EN = "data/seed"

def import_data():
    df_fi = pd.read_csv("data/seed/finnish.csv", encoding="utf-8")
    df_pl = pd.read_csv("data/seed/polish.csv", encoding="utf-8")

    df = df_fi.merge(
        df_pl,
        on=["id", "original_text"],
        suffixes=("_fi", "_pl"),
        validate="one_to_one",
    )

    df = df.rename(columns={
        "original_text": "English Original",
        "translation_google_fi": "Finnish MT (Google Translate)",
        "translation_human_fi": "Finnish Human Reference",
        "translation_google_pl": "Polish MT (Google Translate)",
        "translation_human_pl": "Polish Human Reference",
    })

    return df

df = import_data()
df.head()


   id  ...                       Polish MT (Google Translate)
0   1  ...                Taki piękny dzień niecierpliwił ją.
1   2  ...  W środku lata mojej siostrze i mnie nie wolno ...
2   3  ...                               Zakazano jej pomocy.
3   4  ...  Jakby to ona poddawała bezbronnemu chłopczykow...
4   5  ...                         Upuścił płaszcz na ziemię.

[5 rows x 8 columns]

## Data Provenance Documentation

**IMPORTANT**: Document data sources clearly (addresses Reviewer 2's concern)

In [4]:
# Data Provenance (aligned with manuscript)
SOURCE_SHEET_HR = "data/seed/croatian.csv"

DATA_PROVENANCE = {
    'english_sentences': {
        'source': 'Ian McEwan, Atonement (2001)',
        'selection_criteria': 'Sentences containing causative/permissive verb constructions relevant to FD',
        'count': 10,
        'from_novel': True,
        'from_film': False,
        'sheet_url': SOURCE_SHEET_EN
    },
    'finnish_reference': {
        'source': 'Published translation (see Wiśniewska, 2022a for bibliographic details)',
        'rationale': 'Primary published source used; details documented in dissertation',
        'sheet_url': SOURCE_SHEET_EN
    },
    'polish_reference': {
        'source': 'Published translation (see Wiśniewska, 2022a for bibliographic details)',
        'rationale': 'Primary published source used; details documented in dissertation',
        'sheet_url': SOURCE_SHEET_EN
    },
    'croatian_reference': {
        'source': 'PRIMARY SOURCE: Published Croatian translation',
        'title': 'Okajanje',
        'author': 'Ian McEwan',
        'publisher': 'Celeber',
        'year': '2003',
        'place': 'Zagreb',
        'pages': '307',
        'quality_control': 'Translations verified against Okajanje (Celeber, 2003, Zagreb)',
        'rationale': 'Croatian reference translations were checked against the published Croatian translation to ensure consistency across languages',
        'sheet_url': SOURCE_SHEET_HR
    }
}

print('Data Provenance:')
# ensure_ascii=False keeps non-ASCII characters readable in the printout
print(json.dumps(DATA_PROVENANCE, indent=2, ensure_ascii=False))


Data Provenance:
{
  "english_sentences": {
    "source": "Ian McEwan, Atonement (2001)",
    "selection_criteria": "Sentences containing causative/permissive verb constructions relevant to FD",
    "count": 10,
    "from_novel": true,
    "from_film": false,
    "sheet_url": "data/seed"
  },
  "finnish_reference": {
    "source": "Published translation (see Wiśniewska, 2022a for bibliographic details)",
    "rationale": "Primary published source used; details documented in dissertation",
    "sheet_url": "data/seed"
  },
  "polish_reference": {
    "source": "Published translation (see Wiśniewska, 2022a for bibliographic details)",
    "rationale": "Primary published source used; details documented in dissertation",
    "sheet_url": "data/seed"
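  },
  "croatian_reference": {
    "source": "PRIMARY SOURCE: Published Croatian translation",
    "title": "Okajanje",
    "author": "Ian McEwan",
    "publisher": "Celeber",
    "year": "2003",
    "place": "Zagreb",
    "pages": "307",
    "quality_control": "Translations verified against Okajanje (Celeber, 2003, Zagreb)",
    "rationale": "Croatian reference translations were checked against the published Croatian translation to ensure consistency across languages",
    "sheet_url": "data/seed/croatian.csv"
  }
}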

## Improved Methodology: Two-Phase Approach

### Phase 1: Translation (Separate Task)
- GPT translates sentence independently
- No evaluation, no comparison

### Phase 2: Blind Evaluation (Separate Task)
- Three translations evaluated blindly (A/B/C labels)
- GPT doesn't know which is which
- Explicit FD framework provided
- Clear evaluation criteria
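
The blind labeling step can be sketched as follows. This is an illustration only: `assign_blind_labels` is a hypothetical helper, and the notebook does not show how `process_sentence_blind` actually assigns labels. The sketch shuffles the three sources onto the labels A/B/C with a seeded generator and keeps the mapping aside for later analysis.

```python
import random

def assign_blind_labels(translations, rng):
    """Shuffle sources onto labels A/B/C; return (labeled texts, mapping)."""
    sources = list(translations)
    rng.shuffle(sources)
    mapping = dict(zip(['A', 'B', 'C'], sources))
    # Only the labeled texts would be shown to the evaluator.
    labeled = {label: translations[source] for label, source in mapping.items()}
    return labeled, mapping

translations = {
    'Human': 'human reference text',      # placeholder strings, not study data
    'GPT': 'GPT translation text',
    'Google': 'Google Translate text',
}

# A dedicated Random instance keeps the shuffle reproducible for a given seed.
labeled, mapping = assign_blind_labels(translations, random.Random(42))
print(mapping)
```

Keeping `mapping` out of the evaluation prompt is what makes the evaluation blind; it is only needed afterwards to attribute each score to its source.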

In [5]:
# Test on single sentence first
random.seed(SEED)
print(f'[INFO] Test seed={SEED}')

cache_path = CACHE_DIR / "Finnish_evaluation_IMPROVED.xlsx"
if USE_CACHED_RESULTS and cache_path.exists():
    print(f"[INFO] Using cached results: {cache_path}")
    cached = load_cached_results(cache_path)
    result = cached[0]
else:
    test_sentence = df.iloc[0]
    result = process_sentence_blind(
        source_language='English',
        target_language='Finnish',
        original_text=test_sentence['English Original'],
        translation_google=test_sentence['Finnish MT (Google Translate)'],
        translation_human=test_sentence['Finnish Human Reference'],
        model='gpt-4o'
    )

print('GPT Translation:', result['translation_gpt'])
print('\nMapping (for analysis):', result['mapping'])
print('\nBlind Evaluation:')
print(json.dumps(result["blind_evaluation"], indent=2, ensure_ascii=False))


[INFO] Test seed=42
[INFO] Using cached results: output\Finnish_evaluation_IMPROVED.xlsx
GPT Translation: Tällainen hieno päivä teki hänet kärsimättömäksi.

Mapping (for analysis): {'A': 'Human', 'B': 'GPT', 'C': 'Google'}

Blind Evaluation:
{
  "translation_A_description": "Lexis: The verb phrase 'sai hänet kärsimättömäksi' uses 'sai' (got/made) which is appropriate and idiomatic in Finnish for causation. Syntax: The structure is grammatically correct and follows typical Finnish word order. Semantics: The Agonist is 'hänet' (her) and the Antagonist is 'tällainen upea päivä' (a fine day like this). The force relation is causation, with the day causing her to become impatient. The force dynamics are well-preserved.",
  "translation_A_score": 1.0,
  "translation_B_description": "Lexis: The verb phrase 'teki hänet kärsimättömäksi' uses 'teki' (made) which is also appropriate but slightly less idiomatic than 'sai' for expressing causation in this context. Syntax: The structure is grammatic

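Once an evaluation is returned, the stored mapping can be used to attribute each blind score to its source. The sketch below assumes that every label has a `translation_<label>_score` key, by analogy with `translation_A_score` in the output above. The helper name `unblind_scores` and the B and C values are hypothetical, since the printed output is cut off before those scores.

```python
def unblind_scores(evaluation, mapping):
    """Return {source: score} from a blind evaluation keyed by label."""
    return {
        source: evaluation[f'translation_{label}_score']
        for label, source in mapping.items()
    }

mapping = {'A': 'Human', 'B': 'GPT', 'C': 'Google'}
evaluation = {
    'translation_A_score': 1.0,   # as printed above
    'translation_B_score': 0.9,   # hypothetical value
    'translation_C_score': 0.8,   # hypothetical value
}

scores_by_source = unblind_scores(evaluation, mapping)
print(scores_by_source)
```
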
## Process All Finnish Sentences

In [6]:
# Process all Finnish sentences with improved methodology
random.seed(SEED + 1)
print(f'[INFO] Finnish seed={SEED + 1}')

print('='*80)
print('PROCESSING FINNISH SENTENCES')
print('='*80)
print(f'Total sentences to process: {len(df)}')
print('\nProcessing with improved methodology:')
print('- Separate translation task')
print('- Blind evaluation (A/B/C labels)')
print('- Explicit FD framework')
print('\nStarting...\n')

cache_path = OUTPUT_DIR / "Finnish_evaluation_IMPROVED.xlsx"
if USE_CACHED_RESULTS and cache_path.exists():
    print(f"[INFO] Using cached results: {cache_path}")
    results_finnish = load_cached_results(cache_path)
else:
    results_finnish = process_dataframe_blind(
        df,
        source_language='English',
        target_language='Finnish',
        original_col='English Original',
        google_col='Finnish MT (Google Translate)',
        human_col='Finnish Human Reference',
        model='gpt-4o'
    )
    df_finnish_results = pd.DataFrame(results_finnish)
    df_finnish_results.to_excel(cache_path, index=False)
    print(f"[OK] Results saved to: {cache_path}")

print(f'\n[OK] Processed {len(results_finnish)} Finnish sentences')


[INFO] Finnish seed=43
PROCESSING FINNISH SENTENCES
Total sentences to process: 10

Processing with improved methodology:
- Separate translation task
- Blind evaluation (A/B/C labels)
- Explicit FD framework

Starting...

[INFO] Using cached results: output\Finnish_evaluation_IMPROVED.xlsx

[OK] Processed 10 Finnish sentences


## Analyze Results: Bias Detection

Compare how GPT rates its own translations against the Google and human translations, now that the evaluation is blind.

In [7]:
# Analyze blind evaluation results
print("\n" + "="*80)
print("FINNISH ANALYSIS")
print("="*80)
analysis_finnish = analyze_blind_results(results_finnish)

print(f"Average GPT self-score: {analysis_finnish.get('avg_gpt_self', 'N/A')}")
print(f"Average GPT score for Google: {analysis_finnish.get('avg_gpt_google', 'N/A')}")
print(f"Average GPT score for Human: {analysis_finnish.get('avg_gpt_human', 'N/A')}")
print(f"Total sentences analyzed: {analysis_finnish.get('total_sentences', 'N/A')}")

# Bias detection
self_score = analysis_finnish.get('avg_gpt_self')
human_score = analysis_finnish.get('avg_gpt_human')
# Compare against None so that a legitimate average of 0.0 is not skipped
if self_score is not None and human_score is not None:
    diff = self_score - human_score
    print("\nBias Analysis:")
    # abs() avoids printing a negative number followed by "lower"
    print(f"  GPT rates itself {abs(diff):.3f} {'higher' if diff > 0 else 'lower'} than Human translations")
    if diff > 0.05:
        print("  [NOTE] GPT shows self-scoring bias (rates itself significantly higher)")
    elif diff < -0.05:
        print("  [NOTE] GPT rates itself lower than Human (interesting finding)")
    else:
        print("  [NOTE] GPT scoring is relatively neutral")


FINNISH ANALYSIS
Average GPT self-score: 0.9400000000000001
Average GPT score for Google: 0.905
Average GPT score for Human: 0.735
Total sentences analyzed: 10

Bias Analysis:
  GPT rates itself 0.205 higher than Human translations
  [NOTE] GPT shows self-scoring bias (rates itself significantly higher)
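
The 0.05 cut-off used in this check is a fixed threshold, not a statistical test, so "significantly" is informal here. With only 10 sentences per language, one possible way to test the gap is an exact sign-flip permutation test on the per-sentence differences. The sketch below is a suggestion rather than part of the study's method: `paired_sign_flip_p_value` is a hypothetical helper and the score lists are invented placeholders, not the study's data.

```python
from itertools import product

def paired_sign_flip_p_value(self_scores, human_scores):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [s - h for s, h in zip(self_scores, human_scores)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have
    # either sign, so enumerate all 2**n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * d for sign, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical per-sentence scores, NOT the study's data.
gpt_self = [1.0, 0.9, 1.0, 0.9, 1.0, 0.9, 1.0, 0.9, 0.9, 0.9]
human = [0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.7, 0.8, 0.6, 0.75]

p_value = paired_sign_flip_p_value(gpt_self, human)
print(f'p = {p_value:.4f}')
```

For 10 pairs the enumeration covers 1,024 sign assignments, so the test is exact and needs no extra libraries.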


## Process Polish Sentences

In [8]:
# Process all Polish sentences
random.seed(SEED + 2)
print(f'[INFO] Polish seed={SEED + 2}')

print('\n' + '='*80)
print('PROCESSING POLISH SENTENCES')
print('='*80)
print(f'Total sentences to process: {len(df)}')
print('\nStarting...\n')

cache_path = OUTPUT_DIR / "Polish_evaluation_IMPROVED.xlsx"
if USE_CACHED_RESULTS and cache_path.exists():
    print(f"[INFO] Using cached results: {cache_path}")
    results_polish = load_cached_results(cache_path)
else:
    results_polish = process_dataframe_blind(
        df,
        source_language='English',
        target_language='Polish',
        original_col='English Original',
        google_col='Polish MT (Google Translate)',
        human_col='Polish Human Reference',
        model='gpt-4o'
    )
    df_polish_results = pd.DataFrame(results_polish)
    df_polish_results.to_excel(cache_path, index=False)
    print(f"[OK] Results saved to: {cache_path}")

print(f'\n[OK] Processed {len(results_polish)} Polish sentences')

# Analyze
print('\n' + '='*80)
print('POLISH ANALYSIS')
print('='*80)
analysis_polish = analyze_blind_results(results_polish)
print(f"Average GPT self-score: {analysis_polish.get('avg_gpt_self', 'N/A')}")
print(f"Average GPT score for Google: {analysis_polish.get('avg_gpt_google', 'N/A')}")
print(f"Average GPT score for Human: {analysis_polish.get('avg_gpt_human', 'N/A')}")
print(f"Total sentences analyzed: {analysis_polish.get('total_sentences', 'N/A')}")


[INFO] Polish seed=44

PROCESSING POLISH SENTENCES
Total sentences to process: 10

Starting...

[INFO] Using cached results: output\Polish_evaluation_IMPROVED.xlsx

[OK] Processed 10 Polish sentences

POLISH ANALYSIS
Average GPT self-score: 0.9349999999999999
Average GPT score for Google: 0.8300000000000001
Average GPT score for Human: 0.675
Total sentences analyzed: 10


## Croatian Sentences

**✅ Croatian translations checked against published source**

Croatian reference translations were checked against the published Croatian translation of Atonement: **Okajanje** (Celeber, Zagreb, 2003). All 10 sentences were verified against this primary source to ensure accuracy and to maintain methodological consistency with Finnish and Polish translations.

This addresses Reviewer 2's concern about data source consistency.

In [9]:
SOURCE_SHEET_HR = "data/seed/croatian.csv"

# Croatian data
df_hr = pd.read_csv(SOURCE_SHEET_HR, encoding="utf-8")
df_hr = df_hr.rename(columns={
    "original_text": "English Original",
    "translation_google": "Croatian MT (Google Translate)",
    "translation_human": "Croatian Human Reference",
})
df_hr.head()

# Croatian references are verified against the published translation:
# Okajanje (Celeber, Zagreb, 2003)


   id  ...                     Croatian MT (Google Translate)
0   1  ...   Lijep dan poput ovoga učinio ju je nestrpljivom.
1   2  ...  U jeku ljeta sestra i ja nismo smjele izlaziti...
2   3  ...                   Bilo joj je zabranjeno pomagati.
3   4  ...  Kao da je ona ta koja je bespomoćnom mališanu ...
4   5  ...                   Pustio je kaput da padne na tlo.

[5 rows x 5 columns]

In [10]:
print('Reproducibility')
print(f'SEED={SEED}')
print(f'SOURCE_SHEET_EN={SOURCE_SHEET_EN}')
print(f'SOURCE_SHEET_HR={SOURCE_SHEET_HR}')
print(f'USE_CACHED_RESULTS={USE_CACHED_RESULTS}')
print(f'CACHE_DIR={CACHE_DIR}')
print('MODEL=gpt-4o')


Reproducibility
SEED=42
SOURCE_SHEET_EN=data/seed
SOURCE_SHEET_HR=data/seed/croatian.csv
USE_CACHED_RESULTS=True
CACHE_DIR=output
MODEL=gpt-4o


In [11]:
# Process Croatian sentences with improved methodology
random.seed(SEED + 3)
print(f'[INFO] Croatian seed={SEED + 3}')

print('\n' + '='*80)
print('PROCESSING CROATIAN SENTENCES')
print('='*80)
print(f'Total sentences to process: {len(df_hr)}')
print('\nStarting...\n')

cache_path = OUTPUT_DIR / "Croatian_evaluation_IMPROVED.xlsx"
if USE_CACHED_RESULTS and cache_path.exists():
    print(f"[INFO] Using cached results: {cache_path}")
    results_croatian = load_cached_results(cache_path)
else:
    results_croatian = process_dataframe_blind(
        df_hr,
        source_language='English',
        target_language='Croatian',
        original_col='English Original',
        google_col='Croatian MT (Google Translate)',
        human_col='Croatian Human Reference',
        model='gpt-4o'
    )
    df_croatian_results = pd.DataFrame(results_croatian)
    df_croatian_results.to_excel(cache_path, index=False)
    print(f"[OK] Results saved to: {cache_path}")

print(f'\n[OK] Processed {len(results_croatian)} Croatian sentences')

# Analyze results
print('\n' + '='*80)
print('CROATIAN ANALYSIS')
print('='*80)
analysis_croatian = analyze_blind_results(results_croatian)
print(f"Average GPT self-score: {analysis_croatian.get('avg_gpt_self', 'N/A')}")
print(f"Average GPT score for Google: {analysis_croatian.get('avg_gpt_google', 'N/A')}")
print(f"Average GPT score for Human: {analysis_croatian.get('avg_gpt_human', 'N/A')}")
print(f"Total sentences analyzed: {analysis_croatian.get('total_sentences', 'N/A')}")


[INFO] Croatian seed=45

PROCESSING CROATIAN SENTENCES
Total sentences to process: 10

Starting...

[INFO] Using cached results: output\Croatian_evaluation_IMPROVED.xlsx

[OK] Processed 10 Croatian sentences

CROATIAN ANALYSIS
Average GPT self-score: 0.885
Average GPT score for Google: 0.93
Average GPT score for Human: 0.915
Total sentences analyzed: 10
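
The bias check in this notebook is applied to Finnish only. As a rough sketch, the same fixed 0.05 cut-off can be applied to the averages reported for all three languages. The numbers are copied from the analysis outputs above; `classify_gap` and its labels are hypothetical names introduced for illustration.

```python
# Averages copied from the analysis outputs above: (self, Google, human).
averages = {
    'Finnish': (0.94, 0.905, 0.735),
    'Polish': (0.935, 0.83, 0.675),
    'Croatian': (0.885, 0.93, 0.915),
}

THRESHOLD = 0.05  # same cut-off as the Finnish bias check

def classify_gap(self_score, human_score, threshold=THRESHOLD):
    """Label the self-minus-human gap using a fixed threshold."""
    diff = self_score - human_score
    if diff > threshold:
        return diff, 'self-favoring'
    if diff < -threshold:
        return diff, 'human-favoring'
    return diff, 'neutral'

labels = {}
for language, (self_score, _google, human_score) in averages.items():
    diff, label = classify_gap(self_score, human_score)
    labels[language] = label
    print(f'{language}: {diff:+.3f} ({label})')
```

Under this cut-off, the Finnish and Polish gaps fall on the self-favoring side, while the Croatian gap stays inside the neutral band. These labels describe averages over 10 sentences per language and are not a significance test.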


## Summary: All Languages Processed

Results are saved in the `output/` directory:
- `Finnish_evaluation_IMPROVED.xlsx`
- `Polish_evaluation_IMPROVED.xlsx`
- `Croatian_evaluation_IMPROVED.xlsx`

### Key Improvements in New Methodology:
1. ✅ Translation separated from evaluation
2. ✅ Blind evaluation (GPT doesn't know which is which)
3. ✅ Explicit FD framework provided
4. ✅ Bias detection possible

### Next Steps:
1. Review results in Excel files
2. Compare with old methodology (if available)
3. Update paper methodology section
4. Analyze findings