# Improved Methodology: Translation and Evaluation Study

## Key Improvements Addressing Reviewer Concerns

1. **Separate Translation and Evaluation Tasks** - Addresses Reviewer 1's concern about single prompt
2. **Blind Evaluation** - GPT doesn't know which translation is which (A/B/C labels)
3. **Explicit FD Framework** - Force Dynamics theory provided to GPT
4. **Clear Evaluation Criteria** - Defined criteria for lexis, syntax, semantics, and FD

## Data Source
https://docs.google.com/spreadsheets/d/1gU-Y_meE7A-re4LCA3Bq24K7b7vs9U3ZFXy7oS6ftdQ/edit?usp=sharing

In [None]:
import pandas as pd
import json
import random
import os

# Reproducibility
SEED = 42
random.seed(SEED)
print(f'[INFO] SEED={SEED}')

# Load API key from .env file (optional)
try:
    from dotenv import load_dotenv
    load_dotenv()
    print('[OK] .env loaded')
except ImportError:
    pass

# Fallback: manual .env read to populate environment
if 'OPENAI_API_KEY' not in os.environ and os.path.exists('.env'):
    with open('.env', 'r') as f:
        for line in f:
            if line.startswith('OPENAI_API_KEY='):
                os.environ['OPENAI_API_KEY'] = line.split('=', 1)[1].strip()
                print('[OK] API key loaded from .env manually')
                break

if 'OPENAI_API_KEY' not in os.environ:
    print('[WARNING] OPENAI_API_KEY not set; set it in .env before running API calls')



In [None]:
# Import improved methodology functions
from scripts.legacy.improved_methodology import process_sentence_blind, process_dataframe_blind, analyze_blind_results


## Data Import

In [None]:
SOURCE_SHEET_EN = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRpFclgUNhF1CSHDHPyQCaSUojcPALed14HFzCMFiseOe1izFPyWOXeDgRABKLhH0ckTW3ffFa1xjRv/pub?output=xlsx"

def import_data():
    """Load the project data from the published Google Sheet (xlsx export)."""
    return pd.read_excel(SOURCE_SHEET_EN)

df = import_data()
df.head()
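
Before any API calls are made, it may help to confirm that the sheet exposes the columns the processing calls in this notebook rely on. The following is a minimal sketch, not part of the project code: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and only the column labels themselves are taken from this notebook.

```python
# Column labels are the ones this notebook passes to the processing calls.
# REQUIRED_COLUMNS and missing_columns are hypothetical helpers for illustration.
REQUIRED_COLUMNS = {
    "Finnish": [
        "English Original",
        "Finnish MT (Google Translate)",
        "Finnish Human Reference",
    ],
    "Polish": [
        "English Original",
        "Polish MT (Google Translate)",
        "Polish Human Reference",
    ],
    "Croatian": [
        "English Original",
        "Croatian MT (Google Translate)",
        "Croatian Human Reference",
    ],
}


def missing_columns(available, language):
    """Return the required columns for `language` that are absent from `available`."""
    present = set(available)
    return [col for col in REQUIRED_COLUMNS[language] if col not in present]
```

Calling `missing_columns(df.columns, 'Finnish')` and stopping when the returned list is non-empty would surface a renamed or missing column before a long run starts.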


## Data Provenance Documentation

**IMPORTANT**: Document data sources clearly (addresses Reviewer 2's concern)

In [None]:
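# Data Provenance (aligned with manuscript)
# SOURCE_SHEET_HR is referenced in the Croatian entry, so it must be defined before the dict is built.
SOURCE_SHEET_HR = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSCb5y3L7siAReDJ2QgaJCoR2b_GnuEciIoGRfdzigi1J0-7IJuSTc2K3WlmH-87czMtKNQsB0oUmAj/pub?output=xlsx"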
DATA_PROVENANCE = {
    'english_sentences': {
        'source': 'Ian McEwan, Atonement (2001)',
        'selection_criteria': 'Sentences containing causative/permissive verb constructions relevant to FD',
        'count': 10,
        'from_novel': True,
        'from_film': False,
        'sheet_url': SOURCE_SHEET_EN
    },
    'finnish_reference': {
        'source': 'Published translation (see Wiśniewska, 2022a for bibliographic details)',
        'rationale': 'Primary published source used; details documented in dissertation',
        'sheet_url': SOURCE_SHEET_EN
    },
    'polish_reference': {
        'source': 'Published translation (see Wiśniewska, 2022a for bibliographic details)',
        'rationale': 'Primary published source used; details documented in dissertation',
        'sheet_url': SOURCE_SHEET_EN
    },
    'croatian_reference': {
        'source': 'PRIMARY SOURCE: Published Croatian translation',
        'title': 'Okajanje',
        'author': 'Ian McEwan',
        'publisher': 'Celeber',
        'year': '2003',
        'place': 'Zagreb',
        'pages': '307',
        'quality_control': 'Translations verified against Okajanje (Celeber, 2003, Zagreb)',
        'rationale': 'Croatian reference translations were checked against the published Croatian translation to ensure consistency across languages',
        'sheet_url': SOURCE_SHEET_HR
    }
}

print('Data Provenance:')
print(json.dumps(DATA_PROVENANCE, indent=2))


## Improved Methodology: Three-Phase Approach

### Phase 1: Translation (Separate Task)
- GPT translates sentence independently
- No evaluation, no comparison

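### Phase 2: Blind Evaluation (Separate Task)
- Three translations (GPT, Google Translate, human reference) evaluated under A/B/C labels
- GPT is not told which label belongs to which source
- Explicit FD framework provided
- Clear evaluation criteria

### Phase 3: Bias Analysis
- A/B/C labels mapped back to their sources after scoring
- GPT's scores for its own translations compared with its scores for the Google and human translations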
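
The blinding step can be illustrated with a short sketch. It is not the implementation in `improved_methodology`; `assign_blind_labels` is a hypothetical helper, and the placeholder strings stand in for real translations. It assumes blinding amounts to shuffling the three candidates, labeling them A/B/C, and keeping the label-to-source mapping out of the evaluation prompt.

```python
import random


def assign_blind_labels(translations, rng):
    """Shuffle translations and assign the labels A, B, C, ...

    translations: dict mapping a source name (e.g. 'gpt') to its text.
    Returns (labeled, mapping):
      labeled maps label -> text and is what the evaluator would see;
      mapping maps label -> source name and is kept for later analysis only.
    """
    sources = list(translations)
    rng.shuffle(sources)  # order depends only on the rng state
    labels = [chr(ord("A") + i) for i in range(len(sources))]
    labeled = {label: translations[src] for label, src in zip(labels, sources)}
    mapping = dict(zip(labels, sources))
    return labeled, mapping


# A dedicated Random instance keeps the shuffle reproducible without
# touching the global random state (the seed value mirrors SEED above).
rng = random.Random(42)
labeled, mapping = assign_blind_labels(
    {
        "gpt": "<GPT translation>",
        "google": "<Google translation>",
        "human": "<human reference>",
    },
    rng,
)
```

Only `labeled` would be placed in the evaluation prompt; `mapping` is stored alongside the scores so the labels can be resolved afterwards.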

In [None]:
# Test on single sentence first
random.seed(SEED)
print(f'[INFO] Test seed={SEED}')

test_sentence = df.iloc[0]

# Process with improved methodology
result = process_sentence_blind(
    source_language='English',
    target_language='Finnish',
    original_text=test_sentence['English Original'],
    translation_google=test_sentence['Finnish MT (Google Translate)'],
    translation_human=test_sentence['Finnish Human Reference'],
    model='gpt-4o'
)

print('GPT Translation:', result['translation_gpt'])
print('\nMapping (for analysis):', result['mapping'])
print('\nBlind Evaluation:')
print(json.dumps(result['blind_evaluation'], indent=2))


## Process All Finnish Sentences

In [None]:
# Process all Finnish sentences with improved methodology
random.seed(SEED + 1)
print(f'[INFO] Finnish seed={SEED + 1}')

print('='*80)
print('PROCESSING FINNISH SENTENCES')
print('='*80)
print(f'Total sentences to process: {len(df)}')
print('\nProcessing with improved methodology:')
print('- Separate translation task')
print('- Blind evaluation (A/B/C labels)')
print('- Explicit FD framework')
print('\nStarting...\n')

results_finnish = process_dataframe_blind(
    df,
    source_language='English',
    target_language='Finnish',
    original_col='English Original',
    google_col='Finnish MT (Google Translate)',
    human_col='Finnish Human Reference',
    model='gpt-4o'
)

df_finnish_results = pd.DataFrame(results_finnish)
df_finnish_results.to_excel('Finnish_evaluation_IMPROVED.xlsx', index=False)
print(f"\n[OK] Processed {len(results_finnish)} Finnish sentences")
print('[OK] Results saved to: Finnish_evaluation_IMPROVED.xlsx')


## Analyze Results: Bias Detection

Compare the scores GPT assigns to its own translations with the scores it assigns to the Google and human translations, now that the evaluation is blind.
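
The comparison depends on resolving the blind labels back to their sources before averaging. The sketch below illustrates that idea only; it is not the code of `analyze_blind_results`. The structure of `blind_evaluation` is not shown in this notebook, so the `scores` field (label to numeric score) and the helper name `average_scores_by_source` are assumptions.

```python
from collections import defaultdict


def average_scores_by_source(results):
    """Average blind scores per source after resolving the A/B/C labels.

    results: list of dicts, each with
      'mapping': label -> source name (e.g. {'A': 'human', ...})
      'scores':  label -> numeric score (assumed field, see lead-in)
    Returns a dict mapping source name -> mean score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in results:
        for label, source in item["mapping"].items():
            score = item["scores"].get(label)
            if score is None:
                continue  # skip labels the evaluator did not score
            totals[source] += score
            counts[source] += 1
    return {source: totals[source] / counts[source] for source in totals}
```

Because the label order differs from sentence to sentence, averaging by label would mix sources; averaging by resolved source is what makes a self-versus-human comparison meaningful.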

In [None]:
# Analyze blind evaluation results
print("\n" + "="*80)
print("FINNISH ANALYSIS")
print("="*80)
analysis_finnish = analyze_blind_results(results_finnish)

print(f"Average GPT self-score: {analysis_finnish.get('avg_gpt_self', 'N/A')}")
print(f"Average GPT score for Google: {analysis_finnish.get('avg_gpt_google', 'N/A')}")
print(f"Average GPT score for Human: {analysis_finnish.get('avg_gpt_human', 'N/A')}")
print(f"Total sentences analyzed: {analysis_finnish.get('total_sentences', 'N/A')}")

# Bias detection (compare against None so a legitimate score of 0.0 is not skipped)
self_score = analysis_finnish.get('avg_gpt_self')
human_score = analysis_finnish.get('avg_gpt_human')
if self_score is not None and human_score is not None:
    diff = self_score - human_score
    print("\nBias Analysis:")
    print(f"  GPT rates itself {abs(diff):.3f} {'higher' if diff > 0 else 'lower'} than Human translations")
    if diff > 0.05:
        print("  [NOTE] GPT shows self-scoring bias (rates itself higher by more than 0.05)")
    elif diff < -0.05:
        print("  [NOTE] GPT rates itself lower than Human by more than 0.05")
    else:
        print("  [NOTE] GPT scoring is relatively neutral (difference within 0.05)")

## Process Polish Sentences

In [None]:
# Process all Polish sentences
random.seed(SEED + 2)
print(f'[INFO] Polish seed={SEED + 2}')

print('\n' + '='*80)
print('PROCESSING POLISH SENTENCES')
print('='*80)
print(f'Total sentences to process: {len(df)}')
print('\nStarting...\n')

results_polish = process_dataframe_blind(
    df,
    source_language='English',
    target_language='Polish',
    original_col='English Original',
    google_col='Polish MT (Google Translate)',
    human_col='Polish Human Reference',
    model='gpt-4o'
)

df_polish_results = pd.DataFrame(results_polish)
df_polish_results.to_excel('Polish_evaluation_IMPROVED.xlsx', index=False)
print(f"\n[OK] Processed {len(results_polish)} Polish sentences")
print('[OK] Results saved to: Polish_evaluation_IMPROVED.xlsx')

# Analyze
print('\n' + '='*80)
print('POLISH ANALYSIS')
print('='*80)
analysis_polish = analyze_blind_results(results_polish)
print(f"Average GPT self-score: {analysis_polish.get('avg_gpt_self', 'N/A')}")
print(f"Average GPT score for Google: {analysis_polish.get('avg_gpt_google', 'N/A')}")
print(f"Average GPT score for Human: {analysis_polish.get('avg_gpt_human', 'N/A')}")
print(f"Total sentences analyzed: {analysis_polish.get('total_sentences', 'N/A')}")


## Croatian Sentences

**✅ Croatian translations checked against published source**

Croatian reference translations were checked against the published Croatian translation of Atonement: **Okajanje** (Celeber, Zagreb, 2003). All 10 sentences were verified against this primary source to ensure accuracy and to maintain methodological consistency with Finnish and Polish translations.

This addresses Reviewer 2's concern about data source consistency.

In [None]:
SOURCE_SHEET_HR = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSCb5y3L7siAReDJ2QgaJCoR2b_GnuEciIoGRfdzigi1J0-7IJuSTc2K3WlmH-87czMtKNQsB0oUmAj/pub?output=xlsx"

# Croatian data
df_hr = pd.read_excel(SOURCE_SHEET_HR)
df_hr.head()

# Croatian references are verified against the published translation:
# Okajanje (Celeber, Zagreb, 2003)


In [None]:
print('Reproducibility')
print(f'SEED={SEED}')
print(f'SOURCE_SHEET_EN={SOURCE_SHEET_EN}')
print(f'SOURCE_SHEET_HR={SOURCE_SHEET_HR}')
print('MODEL=gpt-4o')


In [None]:
# Process Croatian sentences with improved methodology
random.seed(SEED + 3)
print(f'[INFO] Croatian seed={SEED + 3}')

print('\n' + '='*80)
print('PROCESSING CROATIAN SENTENCES')
print('='*80)
print(f'Total sentences to process: {len(df_hr)}')
print('\nStarting...\n')

results_croatian = process_dataframe_blind(
    df_hr,
    source_language='English',
    target_language='Croatian',
    original_col='English Original',
    google_col='Croatian MT (Google Translate)',
    human_col='Croatian Human Reference',
    model='gpt-4o'
)

df_croatian_results = pd.DataFrame(results_croatian)
df_croatian_results.to_excel('Croatian_evaluation_IMPROVED.xlsx', index=False)
print(f"\n[OK] Processed {len(results_croatian)} Croatian sentences")
print('[OK] Results saved to: Croatian_evaluation_IMPROVED.xlsx')

# Analyze results
print('\n' + '='*80)
print('CROATIAN ANALYSIS')
print('='*80)
analysis_croatian = analyze_blind_results(results_croatian)
print(f"Average GPT self-score: {analysis_croatian.get('avg_gpt_self', 'N/A')}")
print(f"Average GPT score for Google: {analysis_croatian.get('avg_gpt_google', 'N/A')}")
print(f"Average GPT score for Human: {analysis_croatian.get('avg_gpt_human', 'N/A')}")
print(f"Total sentences analyzed: {analysis_croatian.get('total_sentences', 'N/A')}")


## Summary: All Languages Processed

Results saved to:
- `Finnish_evaluation_IMPROVED.xlsx`
- `Polish_evaluation_IMPROVED.xlsx`
- `Croatian_evaluation_IMPROVED.xlsx`
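
The bias check printed for Finnish can be applied uniformly to all three languages. The following is a minimal sketch under stated assumptions: `bias_summary` is a hypothetical helper, it expects the `avg_gpt_self` and `avg_gpt_human` keys to hold numbers (or be absent), and it reuses the 0.05 threshold from the Finnish check as a descriptive cut-off, not a statistical test.

```python
def bias_summary(analyses, threshold=0.05):
    """Classify the self-versus-human score gap for each language.

    analyses: dict mapping language -> analysis dict containing the keys
              'avg_gpt_self' and 'avg_gpt_human' (numeric when present).
    Returns a list of (language, rounded difference or None, verdict) tuples.
    """
    rows = []
    for language, analysis in analyses.items():
        self_score = analysis.get("avg_gpt_self")
        human_score = analysis.get("avg_gpt_human")
        if self_score is None or human_score is None:
            rows.append((language, None, "insufficient data"))
            continue
        diff = self_score - human_score
        if diff > threshold:
            verdict = "self-favoring"
        elif diff < -threshold:
            verdict = "human-favoring"
        else:
            verdict = "neutral"
        rows.append((language, round(diff, 3), verdict))
    return rows
```

In this notebook it could be called as `bias_summary({'Finnish': analysis_finnish, 'Polish': analysis_polish, 'Croatian': analysis_croatian})`, which gives one row per language for the comparison in the paper.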

### Key Improvements in New Methodology:
1. ✅ Translation separated from evaluation
2. ✅ Blind evaluation (GPT doesn't know which is which)
3. ✅ Explicit FD framework provided
4. ✅ Bias detection possible

### Next Steps:
1. Review results in Excel files
2. Compare with old methodology (if available)
3. Update paper methodology section
4. Analyze findings