A large-scale pipeline for translating the TinyStories dataset into Modern Greek and evaluating translation quality using AI agents.
TinyStories-GR processes ~2.1 million short English children's stories, translating each one to Greek via Google Gemini 3.1 Flash Lite Preview and scoring the result with OpenAI GPT-4o-mini. Output is written incrementally to a single Parquet file, with full resume support across restarts.
```
Input (English Parquet)
          │
          ▼
┌───────────────────┐       ┌──────────────────────┐
│  TranslateNode    │       │  Combined Pipeline   │
│  (Gemini)         │  or   │  translate → grade   │
│        │          │       │  (atomic, no partial │
│  SaveTranslation  │       │  saves)              │
└───────────────────┘       └──────────────────────┘
          │                            │
          ┌────────────────────────────┘
          │
          ▼
┌───────────────────┐
│  GradeNode        │
│  (GPT-4o-mini)    │
│        │          │
│  SaveGraded       │
└───────────────────┘
          │
          ▼
data/tinystories_greek/tinystories_greek.parquet
```
| Component | Technology | Purpose |
|---|---|---|
| Translation | Google Gemini `gemini-3.1-flash-lite-preview` | English → Greek |
| Evaluation | OpenAI `gpt-4o-mini` | Score translation quality (1–5) |
| Workflow | Pydantic Graph | DAG-based node orchestration |
| Storage | Polars + Apache Parquet | Incremental partitioned writes |
| Concurrency | `asyncio.Semaphore(32)` | 32 concurrent API calls |
| Resume | JSON progress file + `row_id` | Fault-tolerant restarts |
| Retry | tenacity (5 attempts, exp. backoff) | Transient API failures |
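The concurrency and retry rows above combine into a single pattern: every API call runs under one shared semaphore and retries transient failures with exponential backoff. A minimal stdlib sketch of that pattern (the project itself uses tenacity for retries; `call_with_retry` and `base_delay` are hypothetical names for illustration):

```python
import asyncio
import random

MAX_CONCURRENT_REQUESTS = 32
RETRY_ATTEMPTS = 5

# Global cap on in-flight API calls, matching the Concurrency row above.
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def call_with_retry(coro_fn, *args, base_delay: float = 1.0):
    """Run one API call under the semaphore, retrying transient failures
    with exponential backoff (base, 2x base, 4x base, ... plus jitter)."""
    async with semaphore:
        for attempt in range(1, RETRY_ATTEMPTS + 1):
            try:
                return await coro_fn(*args)
            except Exception:
                if attempt == RETRY_ATTEMPTS:
                    raise  # out of attempts: surface the error
                await asyncio.sleep(base_delay * 2 ** (attempt - 1)
                                    + random.random() * base_delay)
```

With tenacity this collapses to a decorator, but the control flow is the same: the semaphore bounds throughput while the backoff loop absorbs rate-limit spikes.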
```
git clone https://github.com/alexliap/tinystories-gr
cd tinystories-gr
uv sync
```

Create a `.env` file in the project root:
```
OPENAI_API_KEY=...
GEMINI_API_KEY=...
```

Optional settings (with defaults):

```
EVALUATION_MODEL=gpt-4o-mini
TRANSLATOR_MODEL=gemini-3.1-flash-lite-preview
MAX_CONCURRENT_REQUESTS=32
PARTITION_SIZE=100
RETRY_ATTEMPTS=5
```

Follow these steps from zero to a complete translated and graded dataset.
```
uv run python scripts/get_tinystories.py
```

Downloads all parquet shards from HuggingFace into `data/tinystories/parquet/`. Add `HF_TOKEN=...` to `.env` to avoid anonymous rate limits; the script waits 120 s between files automatically.
```
# Recommended: translate and grade atomically in one pass
uv run python main.py --pipeline all

# Run in background and capture logs
nohup uv run python main.py --pipeline all > nohup.out 2>&1 &
```

Or run translation and grading as separate passes:

```
uv run python main.py --pipeline translate   # saves with null evaluation fields
uv run python main.py --pipeline grade       # grades the ungraded rows
```

The pipeline is resumable: restart the same command at any time and it picks up from where it left off. Progress is logged to `translation.log`.
After the pipeline finishes (or at any checkpoint), run the quality control scripts to find and fix problems.
Check for missing rows:

```
uv run python scripts/check_completeness.py
```

Compares the output parquet against every row in the source files, prints a per-file breakdown, and saves `missing_row_ids.csv`.
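At its core the completeness check is a set difference on `row_id`. A minimal sketch of that step (`find_missing_row_ids` is a hypothetical helper name; the real script also produces the per-file breakdown):

```python
def find_missing_row_ids(source_ids, output_ids):
    """Return the sorted row_ids that appear in the source files but are
    absent from the merged output parquet (the script writes these to
    missing_row_ids.csv)."""
    return sorted(set(source_ids) - set(output_ids))
```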
Re-translate missing rows:

```
uv run python scripts/regenerate_missing.py
```

Reads `missing_row_ids.csv`, looks up the originals, and re-translates and grades them. Progress is logged to `regenerate_missing.log`.
Check translation length ratios:

```
uv run python scripts/check_translation_length.py
```

Flags rows where the Greek character count is more than 2× or less than 20% of the English character count (likely garbled output). Saves `flagged_translations.parquet`.
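The length heuristic is a simple character-ratio test. A sketch of the rule under the thresholds stated above (`is_suspect` is a hypothetical name; the real script applies this over the merged parquet with Polars):

```python
def is_suspect(original: str, translation: str,
               upper: float = 2.0, lower: float = 0.2) -> bool:
    """Flag a translation whose character count is more than `upper` times,
    or less than `lower` of, the English original's length."""
    if not original:
        return False  # empty originals are handled by replace_empty.py
    ratio = len(translation) / len(original)
    return ratio > upper or ratio < lower
```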
Re-translate flagged rows:

```
uv run python scripts/regenerate_flagged.py
```

Removes flagged rows from the partition files, then re-translates and grades them. Progress is logged to `regenerate_flagged.log`.
Fix empty-original rows:

```
uv run python scripts/replace_empty.py
```

For any row where `original_text == ""`, sets `greek_translation = ""`.
| Mode | Input | Output | Use Case |
|---|---|---|---|
| `all` (default) | English parquet files | Fully graded records | Single-pass translate + grade |
| `translate` | English parquet files | Records with `evaluation_score=null` | Translate first, grade later |
| `grade` | Existing output parquet (ungraded rows) | Graded records | Grade a previously translated dataset |
All three modes write to the same output file: `data/tinystories_greek/tinystories_greek.parquet`.

Resume is automatic on restart:

- `all`/`translate`: reads the max `row_id` from the merged parquet and skips all rows up to that point
- `grade`: filters the output parquet for rows where `evaluation_score IS NULL` and cross-checks against `grading_progress.json` to skip any rows graded in a previous interrupted run
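The resume rules can be sketched as a single filter. This is a hypothetical helper, not the pipeline's actual code: the real implementation reads the merged parquet with Polars, and the sketch assumes `grading_progress.json` stores a JSON list of graded row ids.

```python
import json
from pathlib import Path

def rows_to_process(source_rows, merged_row_ids, mode,
                    progress_path=Path("grading_progress.json")):
    """Select the rows a restarted run still needs to handle.

    `source_rows` is an iterable of dicts with at least `row_id` and,
    for grade mode, `evaluation_score`."""
    if mode in ("all", "translate"):
        # Skip everything up to the highest row_id already written.
        max_done = max(merged_row_ids, default=-1)
        return [r for r in source_rows if r["row_id"] > max_done]
    if mode == "grade":
        already_graded = set()
        if progress_path.exists():
            already_graded = set(json.loads(progress_path.read_text()))
        return [r for r in source_rows
                if r.get("evaluation_score") is None
                and r["row_id"] not in already_graded]
    raise ValueError(f"unknown mode: {mode}")
```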
File: `data/tinystories_greek/tinystories_greek.parquet`

| Column | Type | Nullable | Description |
|---|---|---|---|
| `row_id` | int64 | No | Unique row identifier |
| `original_text` | string | No | English source story |
| `greek_translation` | string | No | Greek translation |
| `evaluation_score` | int8 | Yes | Quality score 1–5 (null if ungraded) |
| `evaluation_reasoning` | string | Yes | Evaluation explanation (null if ungraded) |
| `processing_timestamp` | datetime | No | Completion time |
| `processing_attempts` | int32 | No | Number of API attempts |
| `source_file` | string | No | Source parquet filename |
| `source_row_index` | int64 | No | Row index in source file |
Scores use a 1–5 scale where 5 = excellent translation.
```
data/tinystories_greek/
├── tinystories_greek.parquet    # Merged output (all processed rows)
├── partitions/
│   ├── batch_00001.parquet      # Raw partition files (100 rows each)
│   ├── batch_00002.parquet
│   └── ...
├── errors.parquet               # Merged error log
└── errors/
    ├── error_00001.parquet      # Error partition files
    └── ...
```
Partitions are flushed every `PARTITION_SIZE` rows (default 100) and immediately merged into the single consolidated `tinystories_greek.parquet`. Deduplication keeps graded rows over translate-only rows when the same `row_id` appears in multiple partitions.
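The deduplication rule can be sketched in plain Python (a hypothetical `merge_partitions` helper; the real merge operates on parquet partitions with Polars):

```python
def merge_partitions(rows):
    """Deduplicate partition rows by row_id, preferring graded rows
    (non-null evaluation_score) over translate-only ones."""
    best = {}
    for row in rows:
        rid = row["row_id"]
        current = best.get(rid)
        # Keep the new row if we have nothing yet, or if it is graded
        # while the stored one is not.
        if current is None or (
            row.get("evaluation_score") is not None
            and current.get("evaluation_score") is None
        ):
            best[rid] = row
    return [best[rid] for rid in sorted(best)]
```

This ordering-insensitive rule is what lets a `grade` pass safely re-emit rows that an earlier `translate` pass already wrote.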
| File | Written by | Contents |
|---|---|---|
| `translation.log` | `main.py` | Main pipeline progress, row-level info/errors |
| `regenerate_missing.log` | `scripts/regenerate_missing.py` | Missing-row regeneration progress |
| `regenerate_flagged.log` | `scripts/regenerate_flagged.py` | Flagged-row regeneration progress |
| Script | Input | Output | Purpose |
|---|---|---|---|
| `scripts/check_completeness.py` | output parquet + source parquets | `missing_row_ids.csv` | Find gaps in `row_id` coverage |
| `scripts/check_translation_length.py` | output parquet | `flagged_translations.parquet` | Detect garbled/truncated translations |
| `scripts/regenerate_missing.py` | `missing_row_ids.csv` + source parquets | appended to output parquet | Re-translate missing stories |
| `scripts/regenerate_flagged.py` | `flagged_translations.parquet` | appended to output parquet | Re-translate bad translations |
| `scripts/replace_empty.py` | output parquet (in-place) | — | Set `greek_translation=""` for empty originals |