TinyStories-GR

A large-scale pipeline for translating the TinyStories dataset into Modern Greek and evaluating translation quality using AI agents.

Overview

TinyStories-GR processes ~2.1 million short English children's stories, translating each one to Greek via Google Gemini 3.1 Flash Lite Preview and scoring the result with OpenAI GPT-4o-mini. Output is written incrementally to a single Parquet file, with full resume support across restarts.

Architecture

Input (English Parquet)
        │
        ▼
┌───────────────────┐     ┌──────────────────────┐
│  TranslateNode    │     │  Combined Pipeline   │
│  (Gemini)         │  or │  translate → grade   │
│        │          │     │  (atomic, no partial │
│  SaveTranslation  │     │   saves)             │
└───────────────────┘     └──────────────────────┘
                                    │
        ┌───────────────────────────┘
        │
        ▼
┌───────────────────┐
│  GradeNode        │
│  (GPT-4o-mini)    │
│        │          │
│  SaveGraded       │
└───────────────────┘
        │
        ▼
data/tinystories_greek/tinystories_greek.parquet

| Component   | Technology                                    | Purpose                         |
|-------------|-----------------------------------------------|---------------------------------|
| Translation | Google Gemini `gemini-3.1-flash-lite-preview` | English → Greek                 |
| Evaluation  | OpenAI `gpt-4o-mini`                          | Score translation quality (1–5) |
| Workflow    | Pydantic Graph                                | DAG-based node orchestration    |
| Storage     | Polars + Apache Parquet                       | Incremental partitioned writes  |
| Concurrency | `asyncio.Semaphore(32)`                       | 32 concurrent API calls         |
| Resume      | JSON progress file + `row_id`                 | Fault-tolerant restarts         |
| Retry       | tenacity (5 attempts, exp. backoff)           | Transient API failures          |

Setup

git clone https://github.com/alexliap/tinystories-gr
cd tinystories-gr
uv sync

Create a .env file in the project root:

OPENAI_API_KEY=...
GEMINI_API_KEY=...

Optional settings (with defaults):

EVALUATION_MODEL=gpt-4o-mini
TRANSLATOR_MODEL=gemini-3.1-flash-lite-preview
MAX_CONCURRENT_REQUESTS=32
PARTITION_SIZE=100
RETRY_ATTEMPTS=5
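
These settings might be consumed along these lines — a minimal sketch using `os.getenv` with the documented defaults; the repo's actual config loading may differ:

```python
import os

# Optional settings, falling back to the documented defaults when the
# corresponding .env variable is unset.
EVALUATION_MODEL = os.getenv("EVALUATION_MODEL", "gpt-4o-mini")
TRANSLATOR_MODEL = os.getenv("TRANSLATOR_MODEL", "gemini-3.1-flash-lite-preview")
MAX_CONCURRENT_REQUESTS = int(os.getenv("MAX_CONCURRENT_REQUESTS", "32"))
PARTITION_SIZE = int(os.getenv("PARTITION_SIZE", "100"))
RETRY_ATTEMPTS = int(os.getenv("RETRY_ATTEMPTS", "5"))
```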

End-to-End Procedure

Follow these steps to go from a fresh clone to a complete translated and graded dataset.

Step 1 — Download the dataset

uv run python scripts/get_tinystories.py

Downloads all parquet shards from HuggingFace into data/tinystories/parquet/. Add HF_TOKEN=... to .env to avoid anonymous rate limits. The script waits 120 s between files automatically.

Step 2 — Run the pipeline

# Recommended: translate and grade atomically in one pass
uv run python main.py --pipeline all

# Run in background and capture logs
nohup uv run python main.py --pipeline all > nohup.out 2>&1 &

Or run translation and grading as separate passes:

uv run python main.py --pipeline translate   # saves with null evaluation fields
uv run python main.py --pipeline grade       # grades the ungraded rows

The pipeline is resumable: restart the same command at any time and it picks up from where it left off.

Progress is logged to translation.log.

Step 3 — Quality control

After the pipeline finishes (or at any checkpoint), run the quality control scripts to find and fix problems.

Check for missing rows:

uv run python scripts/check_completeness.py

Compares the output parquet against every row in the source files. Prints a per-file breakdown and saves missing_row_ids.csv.

Re-translate missing rows:

uv run python scripts/regenerate_missing.py

Reads missing_row_ids.csv, looks up the originals, and re-translates + grades them. Progress logged to regenerate_missing.log.

Check translation length ratios:

uv run python scripts/check_translation_length.py

Flags rows where the Greek character count is > 2× or < 20% of the English character count (likely garbled output). Saves flagged_translations.parquet.

Re-translate flagged rows:

uv run python scripts/regenerate_flagged.py

Removes flagged rows from partition files, then re-translates + grades them. Progress logged to regenerate_flagged.log.

Fix empty-original rows:

uv run python scripts/replace_empty.py

For any row where original_text == "", sets greek_translation = "".

Pipeline Modes

| Mode            | Input                                  | Output                                 | Use Case                             |
|-----------------|----------------------------------------|----------------------------------------|--------------------------------------|
| `all` (default) | English parquet files                  | Fully graded records                   | Single-pass translate + grade        |
| `translate`     | English parquet files                  | Records with `evaluation_score=null`   | Translate first, grade later         |
| `grade`         | Existing output parquet (ungraded rows)| Graded records                         | Grade a previously translated dataset|

All three modes write to the same output file: data/tinystories_greek/tinystories_greek.parquet.

Resume is automatic on restart:

  • all / translate: reads the max row_id from the merged parquet and skips all rows up to that point
  • grade: filters the output parquet for rows where evaluation_score IS NULL and cross-checks against grading_progress.json to skip any rows graded in a previous interrupted run

Output Schema

File: data/tinystories_greek/tinystories_greek.parquet

| Column                 | Type     | Nullable | Description                           |
|------------------------|----------|----------|---------------------------------------|
| `row_id`               | int64    | No       | Unique row identifier                 |
| `original_text`        | string   | No       | English source story                  |
| `greek_translation`    | string   | No       | Greek translation                     |
| `evaluation_score`     | int8     | Yes      | Quality score 1–5 (null if ungraded)  |
| `evaluation_reasoning` | string   | Yes      | Evaluation explanation (null if ungraded) |
| `processing_timestamp` | datetime | No       | Completion time                       |
| `processing_attempts`  | int32    | No       | Number of API attempts                |
| `source_file`          | string   | No       | Source parquet filename               |
| `source_row_index`     | int64    | No       | Row index in source file              |

Scores use a 1–5 scale where 5 = excellent translation.

Storage Layout

data/tinystories_greek/
├── tinystories_greek.parquet     # Merged output (all processed rows)
├── partitions/
│   ├── batch_00001.parquet       # Raw partition files (100 rows each)
│   ├── batch_00002.parquet
│   └── ...
├── errors.parquet                # Merged error log
└── errors/
    ├── error_00001.parquet       # Error partition files
    └── ...

Partitions are flushed every PARTITION_SIZE rows (default 100) and immediately merged into the single consolidated tinystories_greek.parquet. Deduplication keeps graded rows over translate-only rows when the same row_id appears in multiple partitions.

Logging

| File                    | Written by                      | Contents                                |
|-------------------------|---------------------------------|-----------------------------------------|
| `translation.log`       | `main.py`                       | Main pipeline progress, row-level info/errors |
| `regenerate_missing.log`| `scripts/regenerate_missing.py` | Missing-row regeneration progress       |
| `regenerate_flagged.log`| `scripts/regenerate_flagged.py` | Flagged-row regeneration progress       |

Quality Control Scripts

| Script                                | Input                                   | Output                          | Purpose                                |
|---------------------------------------|-----------------------------------------|---------------------------------|----------------------------------------|
| `scripts/check_completeness.py`       | output parquet + source parquets        | `missing_row_ids.csv`           | Find gaps in `row_id` coverage         |
| `scripts/check_translation_length.py` | output parquet                          | `flagged_translations.parquet`  | Detect garbled/truncated translations  |
| `scripts/regenerate_missing.py`       | `missing_row_ids.csv` + source parquets | appended to output parquet      | Re-translate missing stories           |
| `scripts/regenerate_flagged.py`       | `flagged_translations.parquet`          | appended to output parquet      | Re-translate bad translations          |
| `scripts/replace_empty.py`            | output parquet                          | (in-place)                      | Set `greek_translation=""` for empty originals |
