taskgen

A fast, concurrent SFT (Supervised Fine-Tuning) task generator for distillation datasets. Generates diverse, difficulty-weighted prompts across math, coding, science, computer science, creative writing, and conversational domains — via any OpenAI-compatible API.

Features

  • 45+ domains, 200+ subdomains across 6 categories
  • Weighted difficulty sampling (1–10 scale)
  • Configurable category distribution
  • Concurrent generation with lock-free atomic stats and pre-sampled task batches
  • Live progress bar with speed, token count, and error tracking
  • OpenAI-compatible API (works with OpenAI, Together, Mistral, local vLLM, etc.)
  • Free model discovery via OpenRouter with automatic health checks and periodic rescanning
  • Proxy support with round-robin or sticky rotation
  • Multiple API key rotation for load balancing across routers
  • Post-run deduplication (exact + semantic similarity via word-trigram Jaccard)
  • Graceful shutdown on API outages (5 consecutive timeouts) or billing errors
  • Automatic retry with exponential backoff on rate limits (429)
  • JSONL output with metadata per task, flushed to disk after each write
  • Optional budget cap with per-token cost tracking
  • Append mode to resume interrupted runs
  • Auto-generated dataset README on completion
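The "lock-free atomic stats and pre-sampled task batches" item can be pictured with the following minimal sketch. It is not taken from the taskgen source: the `Stats` struct, the `run` function, and the use of plain OS threads are assumptions made purely for illustration, and each `u64` in a batch stands in for the token count of one finished task.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Shared counters that workers update without taking a lock.
/// (Hypothetical layout; the real tool may track different fields.)
#[derive(Default)]
pub struct Stats {
    pub completed: AtomicU64,
    pub tokens: AtomicU64,
}

/// Spawn one worker per pre-sampled batch and wait for all of them.
pub fn run(batches: &[Vec<u64>], stats: &Stats) {
    thread::scope(|s| {
        for batch in batches {
            s.spawn(move || {
                for &tokens in batch {
                    // Relaxed ordering suffices for independent counters.
                    stats.completed.fetch_add(1, Ordering::Relaxed);
                    stats.tokens.fetch_add(tokens, Ordering::Relaxed);
                }
            });
        }
    });
}
```

Because every batch is fixed before the workers start, the only shared mutable state is the counters, and a progress bar can read them at any time with `load(Ordering::Relaxed)`.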

Install

git clone https://github.com/empero-org/taskgen.git
cd taskgen
cargo build --release

The binary is written to target/release/taskgen.

Usage

taskgen [OPTIONS]

Required

| Flag | Env | Description |
|------|-----|-------------|
| `--api-key <KEY>` | `OPENAI_API_KEY` | API key for the target provider (not needed if using `--keyfile`) |

Options

| Flag | Default | Description |
|------|---------|-------------|
| `--api-base <URL>` | https://api.openai.com/v1 | API base URL |
| `-m, --model <MODEL>` | gpt-4o-mini | Model to use |
| `-c, --count <N>` | 250 | Number of tasks to generate |
| `-w, --workers <N>` | 5 | Concurrent workers |
| `-o, --output <FILE>` | output.jsonl | Output file path |
| `-t, --temperature <F>` | 0.9 | Sampling temperature |
| `--append` | | Append to existing output file |
| `--distribution <STR>` | balanced | Category weights (see below) |
| `--difficulty <STR>` | bell curve | Difficulty weights (see below) |
| `--multilingual` | | Generate tasks in 8 languages and split output by language |
| `--system-prompt <STR>` | built-in | Override the system prompt |
| `--input-price <F>` | | Input token price per 1M tokens (for cost tracking) |
| `--output-price <F>` | | Output token price per 1M tokens |
| `--budget <F>` | | Hard cost cap in USD (requires price flags) |

Proxy & Key Rotation

| Flag | Default | Description |
|------|---------|-------------|
| `--proxies <FILE>` | | Proxy list file, one per line: host:port or host:port:user:pass |
| `--rotating-proxy` | | Use a single random proxy for all requests (sticky mode) |
| `--keyfile <FILE>` | | API key file, one key per line, rotated round-robin |
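To make the two accepted proxy line formats concrete, here is a minimal parsing sketch. The `Proxy` struct and `parse_proxy_line` function are hypothetical names, not taskgen's actual types, and the handling of malformed lines (returning `None`) is an assumption.

```rust
/// One entry from a proxy list file (field names are illustrative).
#[derive(Debug, PartialEq)]
pub struct Proxy {
    pub host: String,
    pub port: u16,
    /// Present only for the `host:port:user:pass` form.
    pub auth: Option<(String, String)>,
}

/// Parse a single line in `host:port` or `host:port:user:pass` form.
/// Returns `None` for blank or malformed lines.
pub fn parse_proxy_line(line: &str) -> Option<Proxy> {
    let parts: Vec<&str> = line.trim().split(':').collect();
    match parts.as_slice() {
        [host, port] if !host.is_empty() => Some(Proxy {
            host: host.to_string(),
            port: port.parse::<u16>().ok()?,
            auth: None,
        }),
        [host, port, user, pass] if !host.is_empty() => Some(Proxy {
            host: host.to_string(),
            port: port.parse::<u16>().ok()?,
            auth: Some((user.to_string(), pass.to_string())),
        }),
        _ => None,
    }
}
```

Note that this sketch splits on every colon, so it would reject a password that itself contains a colon; whether taskgen accepts such passwords is not stated here.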

Free Models (OpenRouter)

| Flag | Default | Description |
|------|---------|-------------|
| `--free-models` | | Auto-discover and use free models from OpenRouter |
| `--free-rescan <MIN>` | 10 | Rescan interval in minutes for free model availability |

When --free-models is set, taskgen will:

  1. Override --api-base to https://openrouter.ai/api/v1
  2. Fetch all available models and filter for free, text-capable models with 16k+ context
  3. Health-check each candidate with a test request: a 429 response counts as live (reachable but rate-limited), while a 502 or a timeout counts as offline
  4. Rotate verified models round-robin across tasks
  5. Track per-model failures — if a model errors 3+ times, it triggers an immediate rescan
  6. Periodically rescan on --free-rescan interval to pick up newly available models

Each task records the actual model name in the taskgen_model metadata field.
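Steps 3 to 5 above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's real code: `Health`, `classify`, and `ModelPool` are hypothetical names, and treating 2xx responses as live and every other status as offline goes beyond what this README specifies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Outcome of probing one candidate model.
#[derive(Debug, PartialEq)]
pub enum Health {
    Live,
    Offline,
}

/// Classify a probe result; `status` is `None` when the request timed out.
/// 429 = live and 502/timeout = offline come from the README; the 2xx and
/// catch-all arms are assumptions.
pub fn classify(status: Option<u16>) -> Health {
    match status {
        Some(429) => Health::Live,
        Some(s) if (200..300).contains(&s) => Health::Live,
        _ => Health::Offline,
    }
}

/// Round-robin selector over verified models, shareable between workers.
pub struct ModelPool {
    models: Vec<String>,
    failures: Vec<AtomicUsize>,
    next: AtomicUsize,
}

impl ModelPool {
    pub fn new(models: Vec<String>) -> Self {
        let failures = models.iter().map(|_| AtomicUsize::new(0)).collect();
        Self { models, failures, next: AtomicUsize::new(0) }
    }

    /// Return the index and name of the next model in rotation.
    pub fn pick(&self) -> Option<(usize, &str)> {
        if self.models.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.models.len();
        Some((i, self.models[i].as_str()))
    }

    /// Record a failure; returns true once the model has failed 3 or more
    /// times, the point at which a rescan would be triggered.
    pub fn record_failure(&self, idx: usize) -> bool {
        self.failures[idx].fetch_add(1, Ordering::Relaxed) + 1 >= 3
    }
}
```

A caller would rebuild the pool from a fresh scan whenever `record_failure` returns true or the `--free-rescan` interval elapses.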

Multilingual

When --multilingual is set, each task is randomly assigned one of 8 languages:

| Code | Language |
|------|----------|
| en | English |
| de | German |
| fr | French |
| es | Spanish |
| nl | Dutch |
| zh | Chinese |
| ar | Arabic |
| ru | Russian |

The LLM is instructed to write the task in the assigned language. A "language" field is added to each JSON entry's metadata. After generation (and dedup if enabled), the output is split into per-language files:

output_en.jsonl
output_de.jsonl
output_fr.jsonl
...

The generated dataset README includes a language distribution table with per-language task counts.
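The per-language file names follow a simple pattern: the language code is inserted between the file stem and the extension. A small sketch of that derivation, using a hypothetical helper `language_path` (the fallback stem for paths without a file name is an assumption):

```rust
use std::path::{Path, PathBuf};

/// Derive the per-language path for an output file, e.g.
/// `output.jsonl` + `de` -> `output_de.jsonl`.
pub fn language_path(output: &Path, lang: &str) -> PathBuf {
    let stem = output
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("output");
    let name = match output.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{stem}_{lang}.{ext}"),
        None => format!("{stem}_{lang}"),
    };
    // Keep the parent directory, replace only the final component.
    output.with_file_name(name)
}
```

With `-o data/run.jsonl`, this sketch would yield `data/run_zh.jsonl` for Chinese tasks.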

Deduplication

| Flag | Default | Description |
|------|---------|-------------|
| `--dedup` | | Run deduplication after generation |
| `--dedup-threshold <F>` | 0.6 | Semantic similarity threshold (0.0–1.0) |

Two-pass dedup:

  1. Exact match — normalized (lowercase, whitespace-collapsed) string comparison
  2. Semantic match — word-trigram Jaccard similarity, removes entries above the threshold
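The two passes can be sketched in a few lines. This is an illustrative reading of the description above, not the project's implementation: the function names are hypothetical, and keeping the first occurrence of each near-duplicate group is an assumption.

```rust
use std::collections::HashSet;

/// Lowercase and collapse runs of whitespace (pass 1 normalization).
pub fn normalize(text: &str) -> String {
    text.to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Set of word trigrams of the normalized text.
pub fn word_trigrams(text: &str) -> HashSet<String> {
    let norm = normalize(text);
    let words: Vec<&str> = norm.split(' ').filter(|w| !w.is_empty()).collect();
    words.windows(3).map(|w| w.join(" ")).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty.
pub fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Keep the first occurrence of each prompt; drop exact and near duplicates.
pub fn dedup(prompts: &[&str], threshold: f64) -> Vec<String> {
    let mut seen_exact: HashSet<String> = HashSet::new();
    let mut kept: Vec<(String, HashSet<String>)> = Vec::new();
    for p in prompts {
        // Pass 1: exact match on the normalized string.
        if !seen_exact.insert(normalize(p)) {
            continue;
        }
        // Pass 2: similarity above the threshold against anything kept so far.
        let grams = word_trigrams(p);
        if kept.iter().any(|(_, g)| jaccard(g, &grams) > threshold) {
            continue;
        }
        kept.push((p.to_string(), grams));
    }
    kept.into_iter().map(|(p, _)| p).collect()
}
```

Comparing each candidate against every kept entry is quadratic in the number of tasks, which is simple but may be slow for very large runs. A lower `--dedup-threshold` removes more aggressively, since fewer shared trigrams are needed to count as a duplicate.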

Error Handling

  • 429 Rate Limits — exponential backoff with up to 5 retries, respects Retry-After header
  • Billing Errors (402, insufficient_quota, etc.) — immediate graceful shutdown
  • Timeouts — retries with backoff; 5 consecutive timeouts trigger graceful shutdown
  • Graceful Shutdown — all workers drain, completed tasks are saved, dedup runs if enabled, dataset README is written
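The retry and shutdown rules above might look roughly like this. The base delay, the cap, and the names `backoff_delay` and `TimeoutTracker` are assumptions for illustration; only the limits of 5 retries and 5 consecutive timeouts and the precedence of `Retry-After` come from the list above.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

pub const MAX_RETRIES: u32 = 5;

/// Delay before retry number `attempt` (0-based) after a 429.
/// A server-supplied `Retry-After` value in seconds takes precedence;
/// otherwise the delay doubles from `base` and is capped at `max`.
/// Returns `None` once the retry budget is exhausted.
pub fn backoff_delay(
    attempt: u32,
    retry_after_secs: Option<u64>,
    base: Duration,
    max: Duration,
) -> Option<Duration> {
    if attempt >= MAX_RETRIES {
        return None;
    }
    if let Some(secs) = retry_after_secs {
        return Some(Duration::from_secs(secs));
    }
    let factor = 2u32.saturating_pow(attempt);
    Some(base.saturating_mul(factor).min(max))
}

/// Counts timeouts in a row; any success resets the streak.
#[derive(Default)]
pub struct TimeoutTracker {
    consecutive: AtomicU32,
}

impl TimeoutTracker {
    /// Returns true when the fifth consecutive timeout is observed.
    pub fn on_timeout(&self) -> bool {
        self.consecutive.fetch_add(1, Ordering::Relaxed) + 1 >= 5
    }

    pub fn on_success(&self) {
        self.consecutive.store(0, Ordering::Relaxed);
    }
}
```

With a 1 s base and a 30 s cap, the delays in this sketch run 1, 2, 4, 8, 16 seconds before giving up. Production code would usually add random jitter so that workers do not retry in lockstep.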

Examples

Basic — generate 500 tasks with GPT-4o-mini:

taskgen --api-key $OPENAI_API_KEY -c 500

Free models via OpenRouter (no cost):

taskgen --free-models --api-key $OPENROUTER_KEY -c 5000 -w 10

Free models with faster rescan and dedup:

taskgen --free-models --api-key $OPENROUTER_KEY -c 10000 -w 20 \
  --free-rescan 5 --dedup --dedup-threshold 0.5

Multilingual dataset — tasks in 8 languages:

taskgen --api-key $OPENAI_API_KEY -c 2000 -w 10 --multilingual --dedup

Local vLLM / Ollama:

taskgen --api-base http://localhost:8000/v1 --api-key none -m mistral-7b-instruct -c 1000 -w 10

Together AI with cost tracking and budget cap:

taskgen \
  --api-base https://api.together.xyz/v1 \
  --api-key $TOGETHER_API_KEY \
  -m meta-llama/Llama-3-8b-chat-hf \
  -c 2000 -w 20 \
  --input-price 0.20 --output-price 0.20 \
  --budget 1.00
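For intuition on how the price flags and the budget cap interact, here is a minimal cost-tracking sketch. `Pricing` and `Budget` are hypothetical types, and stopping as soon as the accumulated cost reaches the cap is an assumption about the tool's behaviour.

```rust
/// Per-1M-token prices in USD, as passed via --input-price / --output-price.
pub struct Pricing {
    pub input_per_m: f64,
    pub output_per_m: f64,
}

impl Pricing {
    /// Cost in USD of one response with the given token usage.
    pub fn cost(&self, input_tokens: u64, output_tokens: u64) -> f64 {
        (input_tokens as f64 * self.input_per_m
            + output_tokens as f64 * self.output_per_m)
            / 1_000_000.0
    }
}

/// Running total checked against a hard cap.
pub struct Budget {
    pricing: Pricing,
    cap_usd: f64,
    spent_usd: f64,
}

impl Budget {
    pub fn new(pricing: Pricing, cap_usd: f64) -> Self {
        Self { pricing, cap_usd, spent_usd: 0.0 }
    }

    /// Add one response's usage; returns false once the cap is reached.
    pub fn record(&mut self, input_tokens: u64, output_tokens: u64) -> bool {
        self.spent_usd += self.pricing.cost(input_tokens, output_tokens);
        self.spent_usd < self.cap_usd
    }

    pub fn spent(&self) -> f64 {
        self.spent_usd
    }
}
```

At $0.20 per 1M tokens in both directions, a $1.00 budget covers about 5M tokens in total under this model.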

With proxies and multiple API keys:

taskgen \
  --api-key none \
  --keyfile keys.txt \
  --proxies proxies.txt \
  -c 5000 -w 20

Custom distribution — 50% coding, 30% math, 20% science:

taskgen --api-key $OPENAI_API_KEY --distribution "coding=0.5,math=0.3,science=0.2" -c 500

Custom difficulty — only hard tasks (levels 7–10):

taskgen --api-key $OPENAI_API_KEY --difficulty "7=0.25,8=0.25,9=0.25,10=0.25" -c 500
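Both `--distribution` and `--difficulty` take the same `key=weight` list. The sketch below shows one plausible way to parse such a spec and sample from it with cumulative weights. The function names are hypothetical, and normalising weights that do not sum to 1 is an assumption, not documented behaviour.

```rust
/// Parse a spec such as "coding=0.5,math=0.3,science=0.2" or
/// "7=0.25,8=0.25" into (key, weight) pairs.
pub fn parse_weights(spec: &str) -> Result<Vec<(String, f64)>, String> {
    let mut out = Vec::new();
    for pair in spec.split(',') {
        let (key, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in {pair:?}"))?;
        let w: f64 = value
            .trim()
            .parse()
            .map_err(|_| format!("bad weight in {pair:?}"))?;
        if !w.is_finite() || w < 0.0 {
            return Err(format!("negative or non-finite weight in {pair:?}"));
        }
        out.push((key.trim().to_string(), w));
    }
    if out.iter().map(|(_, w)| w).sum::<f64>() <= 0.0 {
        return Err("weights sum to zero".to_string());
    }
    Ok(out)
}

/// Pick a key given `u`, a uniform random number in [0, 1).
/// `weights` must be non-empty, as guaranteed by `parse_weights`.
pub fn pick(weights: &[(String, f64)], u: f64) -> &str {
    let total: f64 = weights.iter().map(|(_, w)| w).sum();
    let mut acc = 0.0;
    for (key, w) in weights {
        acc += w / total;
        if u < acc {
            return key;
        }
    }
    // Floating-point rounding can leave `acc` slightly below 1.0.
    &weights[weights.len() - 1].0
}
```

Passing `u` in from the caller keeps the sketch free of external crates; in practice it would come from a random number generator, and drawing all samples up front gives the pre-sampled batches mentioned under Features.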

Append mode — resume a previous run:

taskgen --api-key $OPENAI_API_KEY -c 1000 --append -o my_dataset.jsonl

Output Format

Each line in the JSONL file is a self-contained task record:

{
  "prompt": "Prove that the sum of two odd integers is always even.",
  "domain": "math::Number Theory",
  "subdomain": "primes",
  "difficulty": 4,
  "language": "en",
  "taskgen_model": "gpt-4o-mini",
  "temperature": 0.9
}

The language field is only present when --multilingual is used.

A README.md summarising run parameters, token usage, and cost is written alongside the output file on completion.

Domains

| Category | Domains |
|----------|---------|
| math | Algebra, Calculus, Probability, Statistics, Geometry, Number Theory, Discrete Math, Linear Algebra |
| coding | Python, Rust, Go, JavaScript, C, C++, C#, Java, Ruby, Lua, SQL, Web Development |
| science | Physics, Chemistry, Biology, Earth Science, Astronomy |
| cs | Algorithms, Data Structures, OS, Networking, Databases, Compilers, Distributed Systems, ML, Cybersecurity, Software Engineering |
| creative | Fiction, Poetry, Screenwriting, Journalism, Songwriting, Game Narrative, Copywriting, Blogging |
| conversation | Debate, Advice, Interview, Teaching, Roleplay |

Difficulty Scale

| Level | Label |
|-------|-------|
| 1 | Very Easy (child-level) |
| 2 | Easy (elementary) |
| 3 | Basic (middle school) |
| 4 | Intermediate (high school) |
| 5 | Standard (undergraduate intro) |
| 6 | Skilled (undergraduate advanced) |
| 7 | Proficient (graduate level) |
| 8 | Advanced (professional / researcher) |
| 9 | Expert (top specialist) |
| 10 | Polymath (1-in-a-million genius) |

Support

If this tool has been useful, consider supporting the project:

  • BTC: bc1qx6zepu6sfkvshgdmc4ewu6pk6rpadvpgffpp7v
  • LTC: ltc1qv2mefzps2vtjcpwfx8xxdrpplrcvltswm68r7x
  • XMR: 42Dbm5xg5Nq26fdyzfEU7KBnAJfhi7Cvz5J2ex5CzHXkfKuNEJzYCcmJ1GTbgjFZ5MBx72sdG1G9239Cd6rsZfv4QeDkYJY

by empero-ai

About

A task generator for distillation data, written in native Rust.
