TinyStories-GR

A large-scale pipeline for translating the TinyStories dataset into Modern Greek and evaluating translation quality using AI agents.

Overview

TinyStories-GR processes ~2.1 million short English children's stories, translating each one to Greek via Google Gemini 3.1 Flash Lite Preview and scoring the result with OpenAI GPT-4o-mini. Output is written incrementally to a single Parquet file, with full resume support across restarts.

Architecture

Input (English Parquet)
        │
        ▼
┌───────────────────┐     ┌──────────────────────┐
│  TranslateNode    │     │  Combined Pipeline   │
│  (Gemini)         │  or │  translate → grade   │
│        │          │     │  (atomic, no partial │
│  SaveTranslation  │     │   saves)             │
└───────────────────┘     └──────────────────────┘
                                    │
        ┌───────────────────────────┘
        │
        ▼
┌───────────────────┐
│  GradeNode        │
│  (GPT-4o-mini)    │
│        │          │
│  SaveGraded       │
└───────────────────┘
        │
        ▼
data/tinystories_greek/tinystories_greek.parquet

| Component   | Technology                                    | Purpose                         |
|-------------|-----------------------------------------------|---------------------------------|
| Translation | Google Gemini `gemini-3.1-flash-lite-preview` | English → Greek                 |
| Evaluation  | OpenAI `gpt-4o-mini`                          | Score translation quality (1–5) |
| Workflow    | Pydantic Graph                                | DAG-based node orchestration    |
| Storage     | Polars + Apache Parquet                       | Incremental partitioned writes  |
| Concurrency | `asyncio.Semaphore(32)`                       | 32 concurrent API calls         |
| Resume      | JSON progress file + `row_id`                 | Fault-tolerant restarts         |
| Retry       | tenacity (5 attempts, exp. backoff)           | Transient API failures          |

Setup

git clone https://github.com/alexliap/tinystories-gr
cd tinystories-gr
uv sync

Create a .env file in the project root:

OPENAI_API_KEY=...
GEMINI_API_KEY=...

Optional settings (with defaults):

EVALUATION_MODEL=gpt-4o-mini
TRANSLATOR_MODEL=gemini-3.1-flash-lite-preview
MAX_CONCURRENT_REQUESTS=32
PARTITION_SIZE=100
RETRY_ATTEMPTS=5
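
These settings might be consumed along these lines — a minimal sketch using `os.getenv` with the documented defaults; the repo's actual config loading may differ:

```python
import os

# Optional settings, falling back to the documented defaults when the
# corresponding .env variable is unset.
EVALUATION_MODEL = os.getenv("EVALUATION_MODEL", "gpt-4o-mini")
TRANSLATOR_MODEL = os.getenv("TRANSLATOR_MODEL", "gemini-3.1-flash-lite-preview")
MAX_CONCURRENT_REQUESTS = int(os.getenv("MAX_CONCURRENT_REQUESTS", "32"))
PARTITION_SIZE = int(os.getenv("PARTITION_SIZE", "100"))
RETRY_ATTEMPTS = int(os.getenv("RETRY_ATTEMPTS", "5"))
```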

End-to-End Procedure

Follow these steps to go from a fresh clone to a complete translated and graded dataset.

Step 1 — Download the dataset

uv run python scripts/get_tinystories.py

Downloads all parquet shards from HuggingFace into data/tinystories/parquet/. Add HF_TOKEN=... to .env to avoid anonymous rate limits. The script waits 120 s between files automatically.

Step 2 — Run the pipeline

# Recommended: translate and grade atomically in one pass
uv run python main.py --pipeline all

# Run in background and capture logs
nohup uv run python main.py --pipeline all > nohup.out 2>&1 &

Or run translation and grading as separate passes:

uv run python main.py --pipeline translate   # saves with null evaluation fields
uv run python main.py --pipeline grade       # grades the ungraded rows

The pipeline is resumable: restart the same command at any time and it picks up from where it left off.

Progress is logged to translation.log.

Step 3 — Quality control

After the pipeline finishes (or at any checkpoint), run the quality control scripts to find and fix problems.

Check for missing rows:

uv run python scripts/check_completeness.py

Compares the output parquet against every row in the source files. Prints a per-file breakdown and saves missing_row_ids.csv.

Re-translate missing rows:

uv run python scripts/regenerate_missing.py

Reads missing_row_ids.csv, looks up the originals, and re-translates + grades them. Progress logged to regenerate_missing.log.

Check translation length ratios:

uv run python scripts/check_translation_length.py

Flags rows where the Greek character count is > 2× or < 20% of the English character count (likely garbled output). Saves flagged_translations.parquet.

Re-translate flagged rows:

uv run python scripts/regenerate_flagged.py

Removes flagged rows from partition files, then re-translates + grades them. Progress logged to regenerate_flagged.log.

Fix empty-original rows:

uv run python scripts/replace_empty.py

For any row where original_text == "", sets greek_translation = "".

Pipeline Modes

| Mode            | Input                                  | Output                                 | Use Case                             |
|-----------------|----------------------------------------|----------------------------------------|--------------------------------------|
| `all` (default) | English parquet files                  | Fully graded records                   | Single-pass translate + grade        |
| `translate`     | English parquet files                  | Records with `evaluation_score=null`   | Translate first, grade later         |
| `grade`         | Existing output parquet (ungraded rows)| Graded records                         | Grade a previously translated dataset|

All three modes write to the same output file: data/tinystories_greek/tinystories_greek.parquet.

Resume is automatic on restart:

  • all / translate: reads the max row_id from the merged parquet and skips all rows up to that point
  • grade: filters the output parquet for rows where evaluation_score IS NULL and cross-checks against grading_progress.json to skip any rows graded in a previous interrupted run

Output Schema

File: data/tinystories_greek/tinystories_greek.parquet

| Column                 | Type     | Nullable | Description                           |
|------------------------|----------|----------|---------------------------------------|
| `row_id`               | int64    | No       | Unique row identifier                 |
| `original_text`        | string   | No       | English source story                  |
| `greek_translation`    | string   | No       | Greek translation                     |
| `evaluation_score`     | int8     | Yes      | Quality score 1–5 (null if ungraded)  |
| `evaluation_reasoning` | string   | Yes      | Evaluation explanation (null if ungraded) |
| `processing_timestamp` | datetime | No       | Completion time                       |
| `processing_attempts`  | int32    | No       | Number of API attempts                |
| `source_file`          | string   | No       | Source parquet filename               |
| `source_row_index`     | int64    | No       | Row index in source file              |

Scores use a 1–5 scale where 5 = excellent translation.

Storage Layout

data/tinystories_greek/
├── tinystories_greek.parquet     # Merged output (all processed rows)
├── partitions/
│   ├── batch_00001.parquet       # Raw partition files (100 rows each)
│   ├── batch_00002.parquet
│   └── ...
├── errors.parquet                # Merged error log
└── errors/
    ├── error_00001.parquet       # Error partition files
    └── ...

Partitions are flushed every PARTITION_SIZE rows (default 100) and immediately merged into the single consolidated tinystories_greek.parquet. Deduplication keeps graded rows over translate-only rows when the same row_id appears in multiple partitions.

Logging

| File                    | Written by                      | Contents                                |
|-------------------------|---------------------------------|-----------------------------------------|
| `translation.log`       | `main.py`                       | Main pipeline progress, row-level info/errors |
| `regenerate_missing.log`| `scripts/regenerate_missing.py` | Missing-row regeneration progress       |
| `regenerate_flagged.log`| `scripts/regenerate_flagged.py` | Flagged-row regeneration progress       |

Quality Control Scripts

| Script                                | Input                                   | Output                          | Purpose                                |
|---------------------------------------|-----------------------------------------|---------------------------------|----------------------------------------|
| `scripts/check_completeness.py`       | output parquet + source parquets        | `missing_row_ids.csv`           | Find gaps in `row_id` coverage         |
| `scripts/check_translation_length.py` | output parquet                          | `flagged_translations.parquet`  | Detect garbled/truncated translations  |
| `scripts/regenerate_missing.py`       | `missing_row_ids.csv` + source parquets | appended to output parquet      | Re-translate missing stories           |
| `scripts/regenerate_flagged.py`       | `flagged_translations.parquet`          | appended to output parquet      | Re-translate bad translations          |
| `scripts/replace_empty.py`            | output parquet                          | (in-place)                      | Set `greek_translation=""` for empty originals |
