# taskgen

A fast, concurrent SFT (Supervised Fine-Tuning) task generator for distillation datasets. Generates diverse, difficulty-weighted prompts across math, coding, science, computer science, creative writing, and conversational domains — via any OpenAI-compatible API.
## Features

- 45+ domains, 200+ subdomains across 6 categories
- Weighted difficulty sampling (1–10 scale)
- Configurable category distribution
- Concurrent generation with lock-free atomic stats and pre-sampled task batches
- Live progress bar with speed, token count, and error tracking
- OpenAI-compatible API (works with OpenAI, Together, Mistral, local vLLM, etc.)
- Free model discovery via OpenRouter with automatic health checks and periodic rescanning
- Proxy support with round-robin or sticky rotation
- Multiple API key rotation for load balancing across routers
- Post-run deduplication (exact + semantic similarity via word-trigram Jaccard)
- Graceful shutdown on API outages (5+ timeouts) or billing errors
- Automatic retry with exponential backoff on rate limits (429)
- JSONL output with metadata per task, flushed to disk after each write
- Optional budget cap with per-token cost tracking
- Append mode to resume interrupted runs
- Auto-generated dataset README on completion
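The weighted sampling behind the difficulty and category features can be sketched in a few lines of Python. The weight tables below are illustrative assumptions (an approximate bell curve and a "balanced" split), not taskgen's actual defaults:

```python
import random

# Assumed weight tables for illustration only; taskgen's real defaults may differ.
# Bell-curve difficulty: mid levels are most likely.
DIFFICULTY_WEIGHTS = {1: 0.03, 2: 0.07, 3: 0.12, 4: 0.16, 5: 0.18,
                      6: 0.16, 7: 0.12, 8: 0.08, 9: 0.05, 10: 0.03}
# "balanced" distribution: equal weight per category.
CATEGORY_WEIGHTS = {"math": 1, "coding": 1, "science": 1,
                    "cs": 1, "creative": 1, "conversation": 1}

def sample_task(rng: random.Random) -> tuple[str, int]:
    """Draw one (category, difficulty) pair according to the weight tables."""
    category = rng.choices(list(CATEGORY_WEIGHTS),
                           weights=list(CATEGORY_WEIGHTS.values()))[0]
    difficulty = rng.choices(list(DIFFICULTY_WEIGHTS),
                             weights=list(DIFFICULTY_WEIGHTS.values()))[0]
    return category, difficulty
```

Pre-sampling a batch of such pairs up front is what lets the workers run lock-free: each worker only consumes its slice of the batch.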
## Installation

```sh
git clone https://github.com/empero-org/taskgen.git
cd taskgen
cargo build --release
```

The binary will be at `target/release/taskgen`.

## Usage

```sh
taskgen [OPTIONS]
```

### Required

| Flag | Env | Description |
|---|---|---|
| `--api-key <KEY>` | `OPENAI_API_KEY` | API key for the target provider (not needed if using `--keyfile`) |
### Core options

| Flag | Default | Description |
|---|---|---|
| `--api-base <URL>` | `https://api.openai.com/v1` | API base URL |
| `-m, --model <MODEL>` | `gpt-4o-mini` | Model to use |
| `-c, --count <N>` | `250` | Number of tasks to generate |
| `-w, --workers <N>` | `5` | Concurrent workers |
| `-o, --output <FILE>` | `output.jsonl` | Output file path |
| `-t, --temperature <F>` | `0.9` | Sampling temperature |
| `--append` | — | Append to existing output file |
| `--distribution <STR>` | balanced | Category weights (see below) |
| `--difficulty <STR>` | bell curve | Difficulty weights (see below) |
| `--multilingual` | — | Generate tasks in 8 languages and split output by language |
| `--system-prompt <STR>` | built-in | Override the system prompt |
| `--input-price <F>` | — | Input token price per 1M tokens (for cost tracking) |
| `--output-price <F>` | — | Output token price per 1M tokens |
| `--budget <F>` | — | Hard cost cap in USD (requires price flags) |
### Proxies and API keys

| Flag | Default | Description |
|---|---|---|
| `--proxies <FILE>` | — | Proxy list file, one per line: `host:port` or `host:port:user:pass` |
| `--rotating-proxy` | — | Use a single random proxy for all requests (sticky mode) |
| `--keyfile <FILE>` | — | API key file, one key per line, rotated round-robin |
### Free models (OpenRouter)

| Flag | Default | Description |
|---|---|---|
| `--free-models` | — | Auto-discover and use free models from OpenRouter |
| `--free-rescan <MIN>` | `10` | Rescan interval in minutes for free model availability |
When `--free-models` is set, taskgen will:

- Override `--api-base` to `https://openrouter.ai/api/v1`
- Fetch all available models and filter for free, text-capable models with 16k+ context
- Health-check each candidate with a test request (429 = live, 502/timeout = offline)
- Rotate verified models round-robin across tasks
- Track per-model failures — if a model errors 3+ times, it triggers an immediate rescan
- Periodically rescan on the `--free-rescan` interval to pick up newly available models

Each task records the actual model name in the `taskgen_model` metadata field.
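The rotation and failure-tracking behaviour described above can be sketched as follows. This is a simplified illustration; the class and method names are assumptions, not taskgen's API:

```python
import itertools
from collections import Counter

class ModelRotator:
    """Round-robin over verified models; repeated failures on one model
    signal that a full rescan should run (sketch of the behaviour above)."""
    MAX_FAILURES = 3  # "if a model errors 3+ times, it triggers an immediate rescan"

    def __init__(self, models: list[str]):
        self.models = list(models)
        self._cycle = itertools.cycle(self.models)
        self.failures = Counter()

    def next_model(self) -> str:
        """Return the next model in round-robin order."""
        return next(self._cycle)

    def record_failure(self, model: str) -> bool:
        """Count a failure; returns True when a rescan should be triggered."""
        self.failures[model] += 1
        return self.failures[model] >= self.MAX_FAILURES
```

In taskgen the rescan also runs periodically on the `--free-rescan` timer, independently of failures.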
## Multilingual mode

When `--multilingual` is set, each task is randomly assigned one of 8 languages:

| Code | Language |
|---|---|
| `en` | English |
| `de` | German |
| `fr` | French |
| `es` | Spanish |
| `nl` | Dutch |
| `zh` | Chinese |
| `ar` | Arabic |
| `ru` | Russian |
The LLM is instructed to write the task in the assigned language. A `language` field is added to each JSON entry's metadata. After generation (and dedup if enabled), the output is split into per-language files:

```
output_en.jsonl
output_de.jsonl
output_fr.jsonl
...
```
The generated dataset README includes a language distribution table with per-language task counts.
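The per-language split is also easy to reproduce downstream if you ever need to re-split a merged file. A sketch of the behaviour, not taskgen's own code:

```python
import json
from collections import defaultdict
from pathlib import Path

def split_by_language(path: str) -> dict[str, int]:
    """Split a taskgen JSONL file into <stem>_<lang>.jsonl files,
    keyed on each record's `language` field; returns per-language counts."""
    src = Path(path)
    counts: dict[str, int] = defaultdict(int)
    handles = {}
    try:
        for line in src.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            lang = record.get("language", "en")  # assume English if the field is absent
            if lang not in handles:
                out = src.with_name(f"{src.stem}_{lang}{src.suffix}")
                handles[lang] = out.open("w", encoding="utf-8")
            handles[lang].write(line + "\n")
            counts[lang] += 1
    finally:
        for h in handles.values():
            h.close()
    return dict(counts)
```

The returned counts correspond to the language distribution table taskgen writes into the dataset README.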
## Deduplication

| Flag | Default | Description |
|---|---|---|
| `--dedup` | — | Run deduplication after generation |
| `--dedup-threshold <F>` | `0.6` | Semantic similarity threshold (0.0–1.0) |
Two-pass dedup:

1. Exact match — normalized (lowercase, whitespace-collapsed) string comparison
2. Semantic match — word-trigram Jaccard similarity, removes entries above the threshold
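The two passes can be illustrated in a few lines of Python. This is a sketch of the technique, not taskgen's Rust implementation:

```python
def normalize(text: str) -> str:
    """Pass 1 key: lowercase, whitespace-collapsed."""
    return " ".join(text.lower().split())

def word_trigrams(text: str) -> set[tuple[str, str, str]]:
    """All consecutive 3-word windows; empty for prompts under 3 words."""
    words = normalize(text).split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two prompts' word-trigram sets."""
    ta, tb = word_trigrams(a), word_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def dedup(prompts: list[str], threshold: float = 0.6) -> list[str]:
    """Keep a prompt only if it is neither an exact nor a near duplicate."""
    kept: list[str] = []
    seen_exact: set[str] = set()
    for p in prompts:
        norm = normalize(p)
        if norm in seen_exact:
            continue  # pass 1: exact match
        if any(jaccard(p, q) > threshold for q in kept):
            continue  # pass 2: semantic match above the threshold
        seen_exact.add(norm)
        kept.append(p)
    return kept
```

Note the quadratic pairwise comparison in pass 2: fine for a post-run cleanup of tens of thousands of prompts, which is why dedup runs after generation rather than inline.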
## Error handling

- 429 rate limits — exponential backoff with up to 5 retries, respects the `Retry-After` header
- Billing errors (402, `insufficient_quota`, etc.) — immediate graceful shutdown
- Timeouts — retries with backoff; 5 consecutive timeouts trigger graceful shutdown
- Graceful shutdown — all workers drain, completed tasks are saved, dedup runs if enabled, and the dataset README is written
## Examples

Basic — generate 500 tasks with GPT-4o-mini:

```sh
taskgen --api-key $OPENAI_API_KEY -c 500
```

Free models via OpenRouter (no cost):

```sh
taskgen --free-models --api-key $OPENROUTER_KEY -c 5000 -w 10
```

Free models with faster rescan and dedup:

```sh
taskgen --free-models --api-key $OPENROUTER_KEY -c 10000 -w 20 \
  --free-rescan 5 --dedup --dedup-threshold 0.5
```

Multilingual dataset — tasks in 8 languages:

```sh
taskgen --api-key $OPENAI_API_KEY -c 2000 -w 10 --multilingual --dedup
```

Local vLLM / Ollama:

```sh
taskgen --api-base http://localhost:8000/v1 --api-key none -m mistral-7b-instruct -c 1000 -w 10
```

Together AI with cost tracking and budget cap:

```sh
taskgen \
  --api-base https://api.together.xyz/v1 \
  --api-key $TOGETHER_API_KEY \
  -m meta-llama/Llama-3-8b-chat-hf \
  -c 2000 -w 20 \
  --input-price 0.20 --output-price 0.20 \
  --budget 1.00
```

With proxies and multiple API keys:

```sh
taskgen \
  --api-key none \
  --keyfile keys.txt \
  --proxies proxies.txt \
  -c 5000 -w 20
```

Custom distribution — 50% coding, 30% math, 20% science:

```sh
taskgen --api-key $OPENAI_API_KEY --distribution "coding=0.5,math=0.3,science=0.2" -c 500
```

Custom difficulty — only hard tasks (levels 7–10):

```sh
taskgen --api-key $OPENAI_API_KEY --difficulty "7=0.25,8=0.25,9=0.25,10=0.25" -c 500
```

Append mode — resume a previous run:

```sh
taskgen --api-key $OPENAI_API_KEY -c 1000 --append -o my_dataset.jsonl
```

## Output format

Each line in the JSONL file is a self-contained task record:
```json
{
  "prompt": "Prove that the sum of two odd integers is always even.",
  "domain": "math::Number Theory",
  "subdomain": "primes",
  "difficulty": 4,
  "language": "en",
  "taskgen_model": "gpt-4o-mini",
  "temperature": 0.9
}
```

The `language` field is only present when `--multilingual` is used.
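Because each line is a self-contained JSON object, the output streams cleanly into downstream tooling. A usage sketch (not part of taskgen) that filters records by difficulty:

```python
import json
from typing import Iterator

def load_tasks(path: str, min_difficulty: int = 1) -> Iterator[dict]:
    """Stream task records from a taskgen JSONL file, keeping only
    those at or above `min_difficulty`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("difficulty", 0) >= min_difficulty:
                yield record
```

For example, `load_tasks("output.jsonl", min_difficulty=7)` yields only the graduate-level-and-above tasks.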
A README.md summarising run parameters, token usage, and cost is written alongside the output file on completion.
## Domains

| Category | Domains |
|---|---|
| `math` | Algebra, Calculus, Probability, Statistics, Geometry, Number Theory, Discrete Math, Linear Algebra |
| `coding` | Python, Rust, Go, JavaScript, C, C++, C#, Java, Ruby, Lua, SQL, Web Development |
| `science` | Physics, Chemistry, Biology, Earth Science, Astronomy |
| `cs` | Algorithms, Data Structures, OS, Networking, Databases, Compilers, Distributed Systems, ML, Cybersecurity, Software Engineering |
| `creative` | Fiction, Poetry, Screenwriting, Journalism, Songwriting, Game Narrative, Copywriting, Blogging |
| `conversation` | Debate, Advice, Interview, Teaching, Roleplay |
## Difficulty levels

| Level | Label |
|---|---|
| 1 | Very Easy (child-level) |
| 2 | Easy (elementary) |
| 3 | Basic (middle school) |
| 4 | Intermediate (high school) |
| 5 | Standard (undergraduate intro) |
| 6 | Skilled (undergraduate advanced) |
| 7 | Proficient (graduate level) |
| 8 | Advanced (professional / researcher) |
| 9 | Expert (top specialist) |
| 10 | Polymath (1-in-a-million genius) |
## Support

If this tool has been useful, consider supporting the project:

- BTC: `bc1qx6zepu6sfkvshgdmc4ewu6pk6rpadvpgffpp7v`
- LTC: `ltc1qv2mefzps2vtjcpwfx8xxdrpplrcvltswm68r7x`
- XMR: `42Dbm5xg5Nq26fdyzfEU7KBnAJfhi7Cvz5J2ex5CzHXkfKuNEJzYCcmJ1GTbgjFZ5MBx72sdG1G9239Cd6rsZfv4QeDkYJY`
by empero-ai