## Setup and Installation

In [1]:
# Install required packages
!pip install -q -U google-generativeai pandas

In [2]:
import os
import pandas as pd
import google.generativeai as genai
from enum import Enum
from typing import Dict, List
import json
import time
from pathlib import Path


In [3]:
# Language Configuration
class Language(Enum):
    RUSSIAN = {"code": "RU", "name": "Russian", "native": "Русский"}
    FRENCH = {"code": "FR", "name": "French", "native": "Français"}
    CHINESE = {"code": "ZH", "name": "Chinese", "native": "中文"}
    ARABIC = {"code": "AR", "name": "Arabic", "native": "العربية"}

# ===========================
# SELECT TARGET LANGUAGE HERE
# ===========================
TARGET_LANGUAGE = Language.RUSSIAN
# Change to: Language.FRENCH, Language.CHINESE, or Language.ARABIC as needed

print(f"Target Language: {TARGET_LANGUAGE.value['name']} ({TARGET_LANGUAGE.value['native']})")
print(f"Language Code: {TARGET_LANGUAGE.value['code']}")

Target Language: Russian (Русский)
Language Code: RU


In [4]:
# Configure Gemini API
# Set your API key as environment variable: GOOGLE_API_KEY
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')

if not GOOGLE_API_KEY:
    print("⚠️ WARNING: GOOGLE_API_KEY not found in environment variables!")
    print("Please set it using: os.environ['GOOGLE_API_KEY'] = 'your-api-key-here'")
else:
    genai.configure(api_key=GOOGLE_API_KEY)
    print("✓ Gemini API configured successfully")

✓ Gemini API configured successfully


In [5]:
# Initialize Gemini Model
MODEL_NAME = 'gemini-2.0-flash'
model = genai.GenerativeModel(MODEL_NAME)

print(f"Using model: {MODEL_NAME}")

Using model: gemini-2.0-flash


In [6]:
# Load English game data
input_csv = "EN/disinformer_full_games_clues.csv"

if not os.path.exists(input_csv):
    print(f"❌ Error: Input file not found: {input_csv}")
else:
    df = pd.read_csv(input_csv)
    print(f"✓ Loaded {len(df)} rows from {input_csv}")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nFirst few rows:")
    display(df.head())

✓ Loaded 3000 rows from EN/disinformer_full_games_clues.csv

Columns: ['test_run', 'topic_category', 'round', 'answer', 'choices', 'clue_type', 'clue_number', 'clue_text', 'word_count', 'length_ok', 'manual_score / comment']

First few rows:


Unnamed: 0,test_run,topic_category,round,answer,choices,clue_type,clue_number,clue_text,word_count,length_ok,manual_score / comment
0,1,Books,1,Fantasy,"Fantasy, Sci-Fi, Adventure",informed,1,"This genre often features magic, mythical crea...",16,YES,
1,1,Books,1,Fantasy,"Fantasy, Sci-Fi, Adventure",informed,2,"It typically involves quests, battles against ...",18,YES,
2,1,Books,1,Fantasy,"Fantasy, Sci-Fi, Adventure",informed,3,The narrative often includes characters with s...,15,YES,
3,1,Books,1,Fantasy,"Fantasy, Sci-Fi, Adventure",informed,4,"These narratives often feature heroes, their j...",20,YES,
4,1,Books,1,Fantasy,"Fantasy, Sci-Fi, Adventure",informed,5,Readers are often transported to realms where ...,17,YES,


In [7]:
# Define translation prompt template
def create_translation_prompt(target_lang_info: Dict[str, str]) -> str:
    """
    Create a system prompt for translation.
    
    Args:
        target_lang_info: Dictionary with language 'code', 'name', and 'native' name
    
    Returns:
        Formatted prompt string
    """
    return f"""You are a professional translator specializing in game content localization.

Your task is to translate Disinformer game clues from English to {target_lang_info['name']} ({target_lang_info['native']}).

CRITICAL REQUIREMENTS:
1. **TRANSLATE ALL THREE FIELDS** - This is absolutely mandatory:
   - "answer": MUST be translated from English to {target_lang_info['name']}
   - "choices": MUST translate each choice from English to {target_lang_info['name']}, separated by commas
   - "clue_text": MUST be translated and maintain 15-20 words in {target_lang_info['name']}
   
2. Maintain the EXACT word count (15-20 words) in the clue_text translation

3. Preserve the tone and intent of each clue type:
   - INFORMED clues: Accurate, helpful hints pointing to the correct answer
   - MISINFORMED clues: Vague, generic statements that create ambiguity, pointing to multiple options
   - FAKE clues: Point to the wrong answer options
   - EXTRA clues: Additional helpful clues for the correct answer

4. Keep proper nouns (names, titles, places) in their original form or use standard translations when appropriate

5. Ensure natural, fluent {target_lang_info['name']} that sounds native

6. Preserve game mechanics and clarity

VALIDATION CHECKLIST:
- answer field is NOT in English ✓
- choices field is NOT in English (all choices translated) ✓
- clue_text is NOT in English ✓
- clue_text has 15-20 words ✓
- All fields are valid JSON strings ✓

Return ONLY a valid JSON object with this exact structure:
{{
  "answer": "translated answer (NOT in English)",
  "choices": "translated choice1, translated choice2, translated choice3 (NOT in English)",
  "clue_text": "translated clue in {target_lang_info['name']} (15-20 words, NOT in English)"
}}

Do not include any explanations, markdown formatting, or additional text outside the JSON."""

TRANSLATION_PROMPT = create_translation_prompt(TARGET_LANGUAGE.value)
print("Translation prompt created:")
print("="*80)
print(TRANSLATION_PROMPT)
print("="*80)

Translation prompt created:
You are a professional translator specializing in game content localization.

Your task is to translate Disinformer game clues from English to Russian (Русский).

CRITICAL REQUIREMENTS:
1. **TRANSLATE ALL THREE FIELDS** - This is absolutely mandatory:
   - "answer": MUST be translated from English to Russian
   - "choices": MUST translate each choice from English to Russian, separated by commas
   - "clue_text": MUST be translated and maintain 15-20 words in Russian

2. Maintain the EXACT word count (15-20 words) in the clue_text translation

3. Preserve the tone and intent of each clue type:
   - INFORMED clues: Accurate, helpful hints pointing to the correct answer
   - MISINFORMED clues: Vague, generic statements that create ambiguity, pointing to multiple options
   - FAKE clues: Point to the wrong answer options
   - EXTRA clues: Additional helpful clues for the correct answer

4. Keep proper nouns (names, titles, places) in their original form or use s

In [8]:
def translate_round_data(answer: str, choices: str, clue_text: str, clue_type: str) -> Dict[str, str]:
    """
    Translate a single round's data using Gemini.
    
    Args:
        answer: The correct answer
        choices: Comma-separated choices
        clue_text: The clue text to translate
        clue_type: Type of clue (informed/misinformed/fake/extra)
    
    Returns:
        Dictionary with translated answer, choices, and clue_text
    """
    user_message = f"""Translate this game round data - TRANSLATE ALL THREE FIELDS:

Answer (MUST translate): {answer}
Choices (MUST translate all): {choices}
Clue Type: {clue_type}
Clue Text (MUST translate, 15-20 words): {clue_text}

IMPORTANT: 
- The answer MUST be in the target language, NOT English
- All choices MUST be in the target language, NOT English  
- The clue text MUST be in the target language (15-20 words), NOT English
- Return only JSON with all three fields fully translated."""
    
    try:
        response = model.generate_content(
            [TRANSLATION_PROMPT, user_message],
            generation_config=genai.types.GenerationConfig(
                temperature=0.3,
                max_output_tokens=500,
            )
        )
        
        # Extract JSON from response
        response_text = response.text.strip()
        
        # Remove markdown code blocks if present
        if response_text.startswith('```'):
            response_text = response_text.split('```')[1]
            if response_text.startswith('json'):
                response_text = response_text[4:]
            response_text = response_text.strip()
        
        # Parse JSON
        translation = json.loads(response_text)
        
        # Validate required fields
        if not all(k in translation for k in ['answer', 'choices', 'clue_text']):
            raise ValueError("Missing required fields in translation")
        
        return translation
        
    except Exception as e:
        print(f"❌ Translation error: {e}")
        print(f"Response: {response_text if 'response_text' in locals() else 'N/A'}")
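        # Fall back to the untranslated English fields so the batch keeps
        # running; affected rows stay in English and need a later retry.
        return {
            "answer": answer,
            "choices": choices,
            "clue_text": clue_text
        }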

print("✓ Translation function defined")

✓ Translation function defined
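
The fence-stripping step in `translate_round_data` only handles a response that starts with a code fence. A slightly more tolerant parser might look like the sketch below. The helper name `extract_json_object` is hypothetical and not part of the notebook; it assumes the response contains a single JSON object, possibly wrapped in a fence or surrounded by stray prose.

```python
import json
import re

# Built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json_object(response_text: str) -> dict:
    """Parse the JSON object found in a model response.

    Handles bare JSON, JSON inside a code fence, and JSON surrounded by
    extra prose. Raises ValueError if no object can be parsed.
    """
    text = response_text.strip()
    # Prefer the contents of a fenced block when one is present.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # Keep only the outermost {...} span to skip any surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in response")
    return json.loads(text[start:end + 1])
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can treat both the "nothing found" and the "malformed JSON" cases with a single `except ValueError`.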


In [9]:
def translate_dataframe(df: pd.DataFrame, delay_seconds: float = 2.0) -> pd.DataFrame:
    """
    Translate all rows in the dataframe.
    
    Args:
        df: Input dataframe with English content
        delay_seconds: Delay between API calls to avoid rate limits
    
    Returns:
        New dataframe with translated content
    """
    translated_rows = []
    total_rows = len(df)
    
    print(f"Starting translation of {total_rows} rows...\n")
    
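    # Count positions with enumerate: iterrows() yields index labels, which
    # do not start at 0 for a filtered frame (e.g. labels starting at 600)
    # and would push the reported progress above 100%.
    for position, (_, row) in enumerate(df.iterrows()):
        if position % 10 == 0:
            print(f"Progress: {position}/{total_rows} ({position/total_rows*100:.1f}%)")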
        
        # Translate the row
        translation = translate_round_data(
            answer=row['answer'],
            choices=row['choices'],
            clue_text=row['clue_text'],
            clue_type=row['clue_type']
        )
        
        # Create new row with translated content
        new_row = row.copy()
        new_row['answer'] = translation['answer']
        new_row['choices'] = translation['choices']
        new_row['clue_text'] = translation['clue_text']
        
        # Update word count
        new_row['word_count'] = len(translation['clue_text'].split())
        new_row['length_ok'] = 'YES' if 15 <= new_row['word_count'] <= 20 else 'NO'
        
        translated_rows.append(new_row)
        
        # Rate limiting
        time.sleep(delay_seconds)
    
    print(f"\n✓ Translation complete: {total_rows}/{total_rows} (100%)")
    return pd.DataFrame(translated_rows)

print("✓ Batch translation function defined")

✓ Batch translation function defined
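
A fixed `delay_seconds` reduces the chance of hitting the rate limit but does not recover from a failed call. One common complement is to retry with exponential backoff. The sketch below is a generic helper, not part of the notebook: the name `call_with_backoff` and its parameters are assumptions, and it retries on any exception rather than on a specific quota error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... plus up to 1 s of random jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))
    raise RuntimeError("unreachable")
```

Inside `translate_round_data`, the request could then be wrapped as `call_with_backoff(lambda: model.generate_content(...))`, so that a transient failure is retried before the function falls back to the English text.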


In [10]:
# Test translation with a single row
print("Testing translation with first row...\n")

test_row = df.iloc[0]
print(f"Original:")
print(f"  Answer: {test_row['answer']}")
print(f"  Choices: {test_row['choices']}")
print(f"  Clue: {test_row['clue_text']}")
print(f"  Type: {test_row['clue_type']}\n")

test_translation = translate_round_data(
    answer=test_row['answer'],
    choices=test_row['choices'],
    clue_text=test_row['clue_text'],
    clue_type=test_row['clue_type']
)

print(f"Translated:")
print(f"  Answer: {test_translation['answer']}")
print(f"  Choices: {test_translation['choices']}")
print(f"  Clue: {test_translation['clue_text']}")
print(f"  Word count: {len(test_translation['clue_text'].split())}")

Testing translation with first row...

Original:
  Answer: Fantasy
  Choices: Fantasy, Sci-Fi, Adventure
  Clue: This genre often features magic, mythical creatures, and imaginative worlds that defy the laws of reality.
  Type: informed



Translated:
  Answer: Фэнтези
  Choices: Фэнтези, Научная фантастика, Приключения
  Clue: Этот жанр часто включает в себя магию, мифических существ и воображаемые миры, которые бросают вызов законам реальности.
  Word count: 17
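
The word count above comes from `str.split()`, which works for space-delimited languages such as Russian or French but would report a single "word" for most Chinese sentences. A rough workaround, sketched below under stated assumptions, is to count CJK ideographs for Chinese instead. The helper `count_words` and the set `UNSPACED_LANGS` are hypothetical names, and a character count is only a proxy for a word count, so the 15-20 range would likely need recalibrating for that language.

```python
import re

# Assumption: of the codes in this notebook's Language enum, only "ZH"
# is written without spaces between words.
UNSPACED_LANGS = {"ZH"}

# CJK Unified Ideographs (basic block); punctuation is deliberately excluded.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_words(text: str, lang_code: str) -> int:
    """Count whitespace-separated words, or ideographs for unspaced languages."""
    if lang_code in UNSPACED_LANGS:
        return len(CJK_CHAR.findall(text))
    return len(text.split())
```

With this helper, the `word_count` assignment in `translate_dataframe` could call `count_words(translation['clue_text'], TARGET_LANGUAGE.value['code'])` in place of the plain `split()`.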


In [11]:
from enum import Enum

class GameTopic(Enum):
    BOOKS = "Books"
    BROADCAST_MEDIA = "Broadcast Media"
    FOOD = "Food"
    INVENTIONS = "Inventions"
    NATURE = "Nature"
    PLACES = "Places"
    SONGS = "Songs"
    SPORTS = "Sports"
    TECHNOLOGY = "Technology"
    VIDEO_GAMES = "Video Games"

    def __str__(self):
        return self.value

In [16]:
# Translate all data
# WARNING: This will make many API calls and may take considerable time
# Consider translating in batches or filtering by topic first

# Option 1: Translate everything (uncomment to use)
# translated_df = translate_dataframe(df, delay_seconds=2.0)

# Option 2: Translate a specific topic (recommended for testing)
# topic_to_translate = GameTopic.BOOKS.value
# topic_to_translate = GameTopic.BROADCAST_MEDIA.value
topic_to_translate = GameTopic.FOOD.value
# topic_to_translate = GameTopic.INVENTIONS.value
# topic_to_translate = GameTopic.NATURE.value
# topic_to_translate = GameTopic.PLACES.value
# topic_to_translate = GameTopic.SONGS.value
# topic_to_translate = GameTopic.SPORTS.value
# topic_to_translate = GameTopic.TECHNOLOGY.value
# topic_to_translate = GameTopic.VIDEO_GAMES.value

df_topic = df[df['topic_category'] == topic_to_translate].copy()
print(f"Translating topic: {topic_to_translate} ({len(df_topic)} rows)\n")

translated_df = translate_dataframe(df_topic, delay_seconds=1.0)

Translating topic: Food (300 rows)

Starting translation of 300 rows...

Progress: 600/300 (200.0%)
Progress: 610/300 (203.3%)
Progress: 620/300 (206.7%)
Progress: 630/300 (210.0%)
Progress: 640/300 (213.3%)
Progress: 650/300 (216.7%)
Progress: 660/300 (220.0%)
Progress: 670/300 (223.3%)
Progress: 680/300 (226.7%)
Progress: 690/300 (230.0%)
Progress: 700/300 (233.3%)
Progress: 710/300 (236.7%)
Progress: 720/300 (240.0%)
❌ Translation error: 429 Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.
Response: N/A
Progress: 730/300 (243.3%)
Progress: 740/300 (246.7%)
Progress: 750/300 (250.0%)
Progress: 760/300 (253.3%)
Progress: 770/300 (256.7%)
Progress: 780/300 (260.0%)
Progress: 790/300 (263.3%)
Progress: 800/300 (266.7%)
Progress: 810/300 (270.0%)
Progress: 820/300 (273.3%)
Progress: 830/300 (276.7%)
Progress: 840/300 (280.0%)
Progress: 850/300 (283.3%)
Progress: 860/300 (286.7%)
Progress: 87

In [17]:
# Review translation statistics
print("Translation Statistics:")
print("="*80)
print(f"Total rows translated: {len(translated_df)}")
print(f"\nWord count distribution:")
print(translated_df['word_count'].value_counts().sort_index())
print(f"\nLength compliance:")
print(translated_df['length_ok'].value_counts())
print(f"\nClues by type:")
print(translated_df['clue_type'].value_counts())

# Show sample translations
print("\n" + "="*80)
print("Sample translations:")
print("="*80)
display(translated_df.head(10))

Translation Statistics:
Total rows translated: 300

Word count distribution:
word_count
15     2
16    18
17    48
18    73
19    87
20    39
21    22
22    11
Name: count, dtype: int64

Length compliance:
length_ok
YES    267
NO      33
Name: count, dtype: int64

Clues by type:
clue_type
informed       180
fake            60
misinformed     40
extra           20
Name: count, dtype: int64

Sample translations:


Unnamed: 0,test_run,topic_category,round,answer,choices,clue_type,clue_number,clue_text,word_count,length_ok,manual_score / comment
600,1,Food,1,Итальянская,"Итальянская, Средиземноморская, Европейская",informed,1,"Влияние этой кухни заметно в блюдах из пасты, ...",19,YES,
601,1,Food,1,Итальянский,"Итальянский, Средиземноморский, Европейский",informed,2,"Эта кухня имеет богатую историю, кулинарные тр...",20,YES,
602,1,Food,1,Итальянская,"Итальянская, Средиземноморская, Европейская",informed,3,"Эта кухня славится использованием свежих трав,...",22,NO,
603,1,Food,1,Итальянский,"Итальянский, Средиземноморский, Европейский",informed,4,"Этот кулинарный стиль часто включает блюда, в ...",20,YES,
604,1,Food,1,Итальянский,"Итальянский, Средиземноморский, Европейский",informed,5,"Это кулинарная традиция, подчеркивающая важнос...",18,YES,
605,1,Food,1,Итальянский,"Итальянский, Средиземноморский, Европейский",informed,6,Кулинарная культура делает акцент на региональ...,20,YES,
606,1,Food,1,Итальянский,"Итальянский, Средиземноморский, Европейский",informed,7,Эта кухня известна во всем мире своим разнообр...,18,YES,
607,1,Food,1,Итальянская,"Итальянская, Средиземноморская, Европейская",informed,8,"Эта кулинарная традиция часто включает блюда, ...",21,NO,
608,1,Food,1,Итальянская,"Итальянская, Средиземноморская, Европейская",informed,9,Эта кухня славится своей способностью превраща...,19,YES,
609,1,Food,1,Итальянский,"Итальянский, Средиземноморский, Европейский",misinformed,1,Эта кулинарная традиция предлагает разнообрази...,19,YES,
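
The sample above shows the same answer translated two ways within one round ("Итальянская" and "Итальянский"), because each row is translated in a separate API call. A minimal check for this, assuming rows are available as dictionaries (for example via `translated_df.to_dict("records")`), could look like the following. The function name `find_inconsistent_answers` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

RoundKey = Tuple[object, object, object]


def find_inconsistent_answers(rows: Iterable[dict]) -> Dict[RoundKey, Counter]:
    """Return the rounds whose rows carry more than one translated answer.

    Rows are grouped by (test_run, topic_category, round); each returned
    Counter maps an answer variant to the number of rows that use it.
    """
    answers: Dict[RoundKey, Counter] = defaultdict(Counter)
    for row in rows:
        key = (row["test_run"], row["topic_category"], row["round"])
        answers[key][row["answer"]] += 1
    return {key: counts for key, counts in answers.items() if len(counts) > 1}
```

Each flagged round can then be reviewed by hand, or normalized to its most frequent variant with `counts.most_common(1)[0][0]`, depending on which grammatical form fits the game's interface.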


In [18]:
# Save translated data
lang_code = TARGET_LANGUAGE.value['code']
output_dir = Path(lang_code)
output_dir.mkdir(exist_ok=True)

# Generate output filename
if 'topic_category' in translated_df.columns and len(translated_df['topic_category'].unique()) == 1:
    topic = translated_df['topic_category'].iloc[0]
    output_file = output_dir / f"disinformer_clues({topic}).csv"
else:
    output_file = output_dir / "disinformer_full_games_clues(topic).csv"

# Save to CSV
translated_df.to_csv(output_file, index=False, encoding='utf-8-sig')
print(f"✓ Translated data saved to: {output_file}")
print(f"  Total rows: {len(translated_df)}")
print(f"  File size: {output_file.stat().st_size / 1024:.1f} KB")

# Accumulate into master file
master_file = output_dir / "disinformer_full_games_clues.csv"

# Check if master file exists and load it
if master_file.exists():
    master_df = pd.read_csv(master_file)
    print(f"\n✓ Loaded existing master file with {len(master_df)} rows")
    
    # Append new translated data
    master_df = pd.concat([master_df, translated_df], ignore_index=True)
    print(f"✓ Appended {len(translated_df)} new rows")
else:
    master_df = translated_df.copy()
    print(f"\n✓ Creating new master file")

# Remove duplicates based on all columns (keep first occurrence)
initial_count = len(master_df)
master_df = master_df.drop_duplicates(keep='first')
duplicates_removed = initial_count - len(master_df)
if duplicates_removed > 0:
    print(f"⚠️ Removed {duplicates_removed} duplicate rows")

# Save master file
master_df.to_csv(master_file, index=False, encoding='utf-8-sig')
print(f"\n✓ Master file saved to: {master_file}")
print(f"  Total rows: {len(master_df)}")
print(f"  File size: {master_file.stat().st_size / 1024:.1f} KB")

✓ Translated data saved to: RU/disinformer_clues(Food).csv
  Total rows: 300
  File size: 106.6 KB

✓ Loaded existing master file with 600 rows
✓ Appended 300 new rows

✓ Master file saved to: RU/disinformer_full_games_clues.csv
  Total rows: 900
  File size: 338.3 KB


In [19]:
# Quality check: Compare original and translated
def quality_check(original_df: pd.DataFrame, translated_df: pd.DataFrame, num_samples: int = 5):
    """
    Display side-by-side comparison of original and translated content.
    """
    print("Quality Check - Side-by-Side Comparison")
    print("="*120)
    
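    for idx in range(min(num_samples, len(translated_df))):
        trans_row = translated_df.iloc[idx]
        # Look up the source row by index label: translated_df may be a
        # filtered subset, so positional indexing would pair unrelated rows.
        orig_row = original_df.loc[trans_row.name]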
        
        print(f"\nSample {idx + 1}:")
        print("-"*120)
        print(f"Topic: {orig_row['topic_category']} | Round: {orig_row['round']} | Clue Type: {orig_row['clue_type']}")
        print(f"\nOriginal Answer: {orig_row['answer']}")
        print(f"Translated Answer: {trans_row['answer']}")
        print(f"\nOriginal Clue ({orig_row['word_count']} words):")
        print(f"  {orig_row['clue_text']}")
        print(f"\nTranslated Clue ({trans_row['word_count']} words):")
        print(f"  {trans_row['clue_text']}")
        print("-"*120)

# Run quality check if we have the original data
if 'df' in locals() and 'translated_df' in locals():
    quality_check(df, translated_df, num_samples=5)

Quality Check - Side-by-Side Comparison

Sample 1:
------------------------------------------------------------------------------------------------------------------------
Topic: Books | Round: 1 | Clue Type: informed

Original Answer: Fantasy
Translated Answer: Итальянская

Original Clue (16 words):
  This genre often features magic, mythical creatures, and imaginative worlds that defy the laws of reality.

Translated Clue (19 words):
  Влияние этой кухни заметно в блюдах из пасты, известных своей универсальностью и способностью включать разнообразные ингредиенты со всего мира.
------------------------------------------------------------------------------------------------------------------------

Sample 2:
------------------------------------------------------------------------------------------------------------------------
Topic: Books | Round: 1 | Clue Type: informed

Original Answer: Fantasy
Translated Answer: Итальянский

Original Clue (18 words):
  It typically involves quests,

## Usage Instructions

### To translate to a different language:

1. **Change the target language** in the "Language Configuration" cell:
   ```python
   TARGET_LANGUAGE = Language.FRENCH   # or RUSSIAN, CHINESE, ARABIC
   ```

2. **Restart kernel and run all cells** to translate with the new language

### To translate specific topics:

Modify the "Translate all data" cell:
```python
topic_to_translate = "Sports"  # Change to desired topic
df_topic = df[df['topic_category'] == topic_to_translate].copy()
translated_df = translate_dataframe(df_topic, delay_seconds=2.0)
```

### Available Topics:
- Books
- Broadcast Media
- Food
- Inventions
- Nature
- Places
- Songs
- Sports
- Technology
- Video Games

### Notes:
- Each topic has **10 games** × **2 rounds** × **15 clues** = **300 rows**
- Full dataset translation will make **~3000 API calls**
- Recommended: Translate one topic at a time
- Adjust `delay_seconds` based on API rate limits
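
To pick a `delay_seconds` value and estimate how long a run will take, the arithmetic can be sketched as below. Both helper names are hypothetical, and the requests-per-minute figure is a placeholder: check the quota that applies to your own API key and model.

```python
def min_delay_seconds(requests_per_minute: int) -> float:
    """Smallest fixed delay that keeps sequential calls under a per-minute quota."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


def estimated_minutes(num_rows: int, delay_seconds: float,
                      seconds_per_call: float = 0.0) -> float:
    """Lower-bound run time: one API call plus one delay per row."""
    return num_rows * (delay_seconds + seconds_per_call) / 60.0


# Placeholder quota of 30 requests per minute (an assumption, not a documented limit).
print(min_delay_seconds(30))           # 2.0
print(estimated_minutes(300, 1.0))     # 5.0   (one topic, delay only)
print(estimated_minutes(3000, 2.0))    # 100.0 (full dataset, delay only)
```

These figures ignore the latency of each request, so actual runs take longer; pass a measured `seconds_per_call` for a closer estimate.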