TL;DR: We built a benchmark to test how well LLMs convert natural language to cron expressions. Claude Sonnet 4 and GPT-4o both hit ~79% accuracy. But a fine-tuned 7B model hits 87.8%, beating all the big API models. Training cost $1.76 on Lambda.
You're setting up a scheduled job for your AI agent. You need it to run "every weekday at 9 AM." Simple enough, right?
You stare at the cron syntax. Is it 0 9 * * 1-5 or 0 9 * * MON-FRI? Wait, does Sunday start at 0 or 1? You Google it. Again. For the third time this month.
Cron expressions are the assembly language of scheduling—powerful, ubiquitous, and perpetually confusing:
"Run at 2:30 AM on the 15th of every month" → 30 2 15 * * ✓ Easy
"Every 15 minutes during business hours" → */15 9-17 * * 1-5 ...okay
"First Monday of each month at 9 AM" → 🤯
So we wondered: Can LLMs just do this for us?
CronGen is a benchmark for testing how well LLMs convert natural language to cron expressions:
- 2,710 examples ranging from "every hour" to "third Thursday of months ending in 'ber'"
- Evaluation harness that checks both exact matches and semantic equivalence
- Full results for 6 models including fine-tuned and base Qwen 7B
Everything's open source: data/, src/evaluation/, results/

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Semantic Match Accuracy (%) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Crongen-Qwen-7B ████████████████████████████████████████████▏ 87.8% │
│ │
│ Claude Sonnet 4 ████████████████████████████████████████▎ 79.5% │
│ │
│ GPT-4o ███████████████████████████████████████▋ 78.8% │
│ │
│ GPT-4o-mini ██████████████████████████████████████▌ 76.3% │
│ │
│ Claude 3 Haiku ████████████████████████████████████▍ 72.4% │
│ │
│ Base Qwen 7B █████████████████████████▍ 50.7% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Model | Exact Match | Semantic Match | Parse Rate |
|---|---|---|---|
| Crongen-Qwen-7B | 85.9% | 87.8% | 100.0% |
| Claude Sonnet 4 | 74.6% | 79.5% | 98.3% |
| GPT-4o | 73.7% | 78.8% | 99.8% |
| GPT-4o-mini | 70.7% | 76.3% | 98.8% |
| Claude 3 Haiku | 66.6% | 72.4% | 98.3% |
| Base Qwen 2.5 7B | 46.8% | 50.7% | 96.8% |
What "Semantic Match" means: Sometimes */30 * * * * and 0,30 * * * * both correctly express "every 30 minutes." We check if the next 10 scheduled times match, not just the string.
Full results: results/
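The equivalence check itself is simple with croniter. Here's a minimal sketch of the idea (illustrative; the actual harness lives in src/evaluation/):

```python
from croniter import croniter
from datetime import datetime

def schedules_match(expr_a: str, expr_b: str, n: int = 10) -> bool:
    """Two cron expressions are equivalent if their next n run times agree."""
    start = datetime(2026, 1, 1)  # fixed start keeps the comparison deterministic
    iter_a, iter_b = croniter(expr_a, start), croniter(expr_b, start)
    return all(iter_a.get_next(datetime) == iter_b.get_next(datetime)
               for _ in range(n))

assert schedules_match("*/30 * * * *", "0,30 * * * *")  # same "every 30 minutes"
```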

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Semantic Match by Complexity (%) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BASIC (n=173) │
│ Crongen-Qwen ██████████████████████████████████████████████████ 98.8% │
│ Sonnet 4 ██████████████████████████████████████████████████ 99.4% │
│ GPT-4o █████████████████████████████████████████████████▏ 97.1% │
│ GPT-4o-mini █████████████████████████████████████████████████▏ 97.7% │
│ Haiku █████████████████████████████████████████████████▏ 97.7% │
│ │
│ INTERMEDIATE (n=149) │
│ Crongen-Qwen █████████████████████████████████████████▊ 84.6% │
│ Sonnet 4 ███████████████████████████████▏ 62.4% │
│ GPT-4o ███████████████████████████████▏ 62.4% │
│ GPT-4o-mini ████████████████████████████▌ 57.0% │
│ Haiku ████████████████████████████▊ 57.7% │
│ │
│ ADVANCED (n=55) │
│ Crongen-Qwen ███████████████████████████████████████████▋ 87.3% │
│ Sonnet 4 ███████████████████████████████████████▏ 78.2% │
│ GPT-4o ████████████████████████████████████████▉ 81.8% │
│ GPT-4o-mini ██████████████████████████████████████▎ 76.4% │
│ Haiku ██████████████████████████████▉ 61.8% │
│ │
│ EDGE CASES (n=33) │
│ Crongen-Qwen ██████████████████████▋ 45.5% │
│ Sonnet 4 ███████████████████████████▎ 54.5% │
│ GPT-4o █████████████████████████▊ 51.5% │
│ GPT-4o-mini █████████████████████████▊ 51.5% │
│ Haiku ████████████▏ 24.2% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Complexity | Crongen-Qwen | Sonnet 4 | GPT-4o | GPT-4o-mini | Haiku |
|---|---|---|---|---|---|
| Basic | 98.8% | 99.4% | 97.1% | 97.7% | 97.7% |
| Intermediate | 84.6% | 62.4% | 62.4% | 57.0% | 57.7% |
| Advanced | 87.3% | 78.2% | 81.8% | 76.4% | 61.8% |
| Edge Cases | 45.5% | 54.5% | 51.5% | 51.5% | 24.2% |
Key observations:
- Basic is solved. All models hit 97%+ on simple patterns like "every day at 3 PM."
- Intermediate is where fine-tuning shines. The fine-tuned model gains +22 percentage points over API models on combinations like "every 15 minutes on weekdays."
- Edge cases remain hard for everyone. These include genuinely ambiguous inputs like "biweekly" and things cron can't express, like "every other Tuesday."

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Semantic Match by Category (%) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIME-BASED (n=117) │
│ Crongen-Qwen ████████████████████████████████████████████████▎ 96.6% │
│ Sonnet 4 █████████████████████████████████████████████████▋ 99.2% │
│ GPT-4o █████████████████████████████████████████████████▏ 98.3% │
│ │
│ DAY-OF-WEEK (n=82) │
│ Crongen-Qwen ██████████████████████████████████████████████████ 100% │
│ Sonnet 4 ██████████████████████████████████████████████████ 100% │
│ GPT-4o ████████████████████████████████████████████████▎ 96.3% │
│ │
│ DATE-BASED (n=72) │
│ Crongen-Qwen ███████████████████████████████████████████████▏ 94.4% │
│ Sonnet 4 ████████████████████▊ 41.7% │
│ GPT-4o ████████████████████▊ 41.7% │
│ │
│ INTERVALS (n=31) │
│ Crongen-Qwen █████████████████████████████████████████▉ 83.9% │
│ Sonnet 4 ████████████████████████████████████████████████▍ 96.8% │
│ GPT-4o ██████████████████████████████████████████████████ 100% │
│ │
│ COMBINATIONS (n=75) │
│ Crongen-Qwen █████████████████████████████████████▎ 74.7% │
│ Sonnet 4 █████████████████████████████████▎ 66.7% │
│ GPT-4o ██████████████████████████████████ 68.0% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Category | Crongen-Qwen | Sonnet 4 | GPT-4o | GPT-4o-mini | Haiku |
|---|---|---|---|---|---|
| Time-based | 96.6% | 99.2% | 98.3% | 97.4% | 95.7% |
| Day-of-week | 100% | 100% | 96.3% | 96.3% | 100% |
| Date-based | 94.4% | 41.7% | 41.7% | 41.7% | 41.7% |
| Intervals | 83.9% | 96.8% | 100% | 93.5% | 83.9% |
| Combinations | 74.7% | 66.7% | 68.0% | 58.7% | 52.0% |
| Edge cases | 45.5% | 54.5% | 51.5% | 51.5% | 24.2% |
The date-based surprise: All four API models score exactly 41.7% on date-based patterns ("on the 15th of each month"), which suggests they share the same systematic mistake. The fine-tuned model jumps to 94.4%: this category follows a specific format in our dataset that fine-tuning captures.
We built examples across four complexity tiers. Dataset files: data/train.json, data/test.json

Basic:
"Every hour" → 0 * * * *
"Daily at midnight" → 0 0 * * *
"Every Monday at 9 AM" → 0 9 * * 1
Models ace these: 97%+ accuracy across the board.

Intermediate:
"Every 15 minutes during work hours" → */15 9-17 * * *
"Weekdays at 8 AM and 5 PM" → 0 8,17 * * 1-5
"Every Tuesday and Thursday at noon" → 0 12 * * 2,4
Accuracy drops to 57-62% for API models. The fine-tuned model hits 84.6%.

Advanced:
"Every quarter (Jan, Apr, Jul, Oct) on the 1st at 6 AM" → 0 6 1 1,4,7,10 *
"Every 20 minutes on weekdays, but only from 9-5" → */20 9-17 * * 1-5
"Twice daily at 6 AM and 6 PM on the 1st and 15th" → 0 6,18 1,15 * *
GPT-4o does better on advanced than intermediate: it handles explicit complexity better than implicit combinations.

Edge cases:
"Biweekly" → ??? (twice a week? every two weeks?)
"End of month" → ??? (28th? 30th? 31st?)
"Every other Friday" → ??? (cron can't express this!)
"Business days" → ??? (does this exclude holidays?)
These inputs are genuinely ambiguous—even humans disagree on the "correct" answer. Nobody's crushing it because there often isn't one right answer.
We expected few-shot prompting to help. It didn't.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Zero-Shot vs Few-Shot (Semantic Match %) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GPT-4o Zero-shot ████████████████████████████████████████ 78.8% │
│ Few-shot ██████████████████████████████████████▋ 76.8% ↓ │
│ │
│ GPT-4o-mini Zero-shot ██████████████████████████████████████▌ 76.3% │
│ Few-shot ██████████████████████████████████████▋ 76.6% ≈ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Model | Zero-shot | Few-shot | Delta |
|---|---|---|---|
| GPT-4o | 78.8% | 76.8% | -2.0% |
| GPT-4o-mini | 76.3% | 76.6% | +0.3% |
GPT-4o got worse with examples. Our theory: examples constrain reasoning. When you show it patterns, it tries to match those patterns rather than reasoning from first principles.
Bottom line: Just use zero-shot unless you have a specific reason not to.
Full few-shot results: results/gpt-4o_few_shot_test_20260130_223144_summary.json
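For reference, a zero-shot query is just the task instruction plus the input. Here's a sketch with our own assumed prompt wording (the benchmark's actual prompts live in src/evaluation/):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,  # the benchmark runs all models at temperature 0
    messages=[
        {"role": "system", "content": "Convert the description to a standard "
         "5-field cron expression. Output only the expression."},
        {"role": "user", "content": "Every weekday at 9 AM"},
    ],
)
print(response.choices[0].message.content)  # expected: 0 9 * * 1-5
```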
Let's look at some actual mistakes. These are educational and entertaining.
Input: "Daily at noon"
Expected: 0 12 * * *
GPT-4o-mini: 0 0 * * * ← Midnight, not noon!
"Noon" → hour 0? Someone's internal training data had a bad day.
Input: "Every Sunday at 10 AM"
Expected: 0 10 * * 0
Claude 3 Haiku: 0 10 * * 7 ← Works in some crons, not others
Is Sunday 0 or 7? POSIX says days run 0-6 with 0 = Sunday, but many implementations also accept 7 for Sunday. Both are arguably correct.
Input: "Every other Tuesday"
Expected: ???
All models: Various wrong answers
Standard cron can't express "every other week." You'd need a stateful scheduler. All models gamely try anyway, producing things like 0 0 * * 2 (every Tuesday) or 0 0 1,15 * 2 (creative but wrong).
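If you genuinely need "every other Tuesday," the usual workaround is to schedule every Tuesday and keep the state in the job itself. A sketch (the anchor date is our own arbitrary choice of a starting Tuesday):

```python
from datetime import date

def is_on_week(anchor: date = date(2026, 1, 6)) -> bool:
    """Gate for a biweekly job: True on alternating weeks from an anchor Tuesday."""
    weeks_since = (date.today() - anchor).days // 7
    return weeks_since % 2 == 0

# Cron fires every Tuesday (0 0 * * 2); the job exits early on off weeks.
if not is_on_week():
    raise SystemExit(0)
# ... actual job logic runs here on the "on" weeks
```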
Input: "Biweekly"
Expected: 0 0 1,15 * * (our interpretation: twice monthly)
Claude Sonnet 4: 0 0 * * 1,4 ← Twice weekly (Monday and Thursday)
GPT-4o: 0 0 */14 * * ← Intended as every 14 days (but */14 in the day-of-month field actually means the 1st, 15th, and 29th of each month)
Three different interpretations, all defensible. "Biweekly" is cursed—just don't use it.
We fine-tuned Qwen 2.5 7B on our training set using QLoRA. The improvement was dramatic:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Fine-Tuning Impact: Base vs Trained (Qwen 7B) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SEMANTIC MATCH │
│ Base █████████████████████████▍ 50.7% │
│ Trained ███████████████████████████████████████████▉ 87.8% +37.1% │
│ │
│ PARSE RATE │
│ Base ████████████████████████████████████████████████▍ 96.8% │
│ Trained ██████████████████████████████████████████████████ 100.0% │
│ │
│ EXACT MATCH │
│ Base ███████████████████████▍ 46.8% │
│ Trained ██████████████████████████████████████████▉ 85.9% +39.1% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Metric | Base Qwen 7B | Fine-tuned | Improvement |
|---|---|---|---|
| Semantic Match | 50.7% | 87.8% | +37.1% |
| Exact Match | 46.8% | 85.9% | +39.1% |
| Parse Rate | 96.8% | 100.0% | +3.2% |
The fine-tuned 7B model beats GPT-4o by 9 percentage points and achieves perfect parse rate.
- Base Model: Qwen 2.5 7B Instruct
- Method: QLoRA (4-bit quantization, rank 16)
- Training Data: 1,895 examples (data/train.json)
- Hardware: Single A100 on Lambda Labs
- Training Time: ~10 minutes (1.3 hours total with setup/eval)
- Total Cost: $1.76
- Model Weights: models/crongen-qwen-7b/
Training code: src/training/train.py
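For the curious, the core of a QLoRA setup at these hyperparameters looks roughly like this. A sketch assuming the Hugging Face transformers/peft stack; lora_alpha and the target modules are our assumptions, since the post only pins quantization and rank (see src/training/train.py for the real configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the 7B base model within a single A100's memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# Rank-16 LoRA adapter on the attention projections (module list is an assumption)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,  # assumption: alpha is not stated in the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B weights train
```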
- Format consistency: The fine-tuned model always outputs just the cron expression—no explanations, no markdown, no "Here's your cron:"
- Domain specialization: It learned the specific patterns in our dataset
- Date-based mastery: 94.4% vs 41.7% on date patterns—a category where all API models struggled equally
- Self-hosted: No API costs, no latency, no data leaving your servers
- Consistent: Always outputs valid cron syntax (100% parse rate)
- Better: Actually more accurate than API models costing 100x more per token
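To try the adapter locally, loading it looks roughly like this. A sketch: it assumes the adapter was trained on chat-formatted data, which prepare_data --format chat suggests:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "models/crongen-qwen-7b")  # LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [{"role": "user", "content": "Every weekday at 9 AM"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# expected: 0 9 * * 1-5
```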
For your weekend project: Any model works. GPT-4o-mini handles basic patterns at 97.7% accuracy.
For production:
- Use the fine-tuned model if you can self-host (87.8% accuracy, $0 per query)
- Always validate outputs with a cron parser
- Show users what their cron will actually do: "This will run every Monday at 9 AM"
For the paranoid:
```python
from croniter import croniter
from datetime import datetime

def validate_cron(expression: str) -> list[datetime] | None:
    """Validate a cron expression and return its next 5 run times."""
    try:
        cron = croniter(expression, datetime.now())
        return [cron.get_next(datetime) for _ in range(5)]
    except ValueError:  # croniter's bad-expression errors subclass ValueError
        return None
```

Ideas we'd love to explore (PRs welcome):
- Cron → English: The reverse direction. "What does 0 */2 * * 1-5 mean?"
- Smaller models: Can we match this with a 3B or 1B model?
- Other formats: Quartz cron, systemd timers, cloud scheduler syntax
- Better edge cases: Handling genuinely ambiguous inputs more gracefully
To cite this work:

```bibtex
@misc{crongpt2026,
  title={CronGPT: Benchmarking LLMs on Natural Language to Cron Expression Generation},
  author={Diamond Bishop and Claude},
  year={2026},
  url={https://github.com/dbish/crongpt}
}
```

```bash
# Clone and install
git clone https://github.com/dbish/crongpt
cd crongpt
pip install -e .
# Run evaluation on any model
export OPENAI_API_KEY="your-key"
python -m src.evaluation.evaluate --model gpt-4o-mini --strategy zero_shot
# Run evaluation with the fine-tuned model (requires GPU)
python -m src.evaluation.evaluate --model crongen-qwen-7b --strategy zero_shot
```

```
crongen/
├── data/
│ ├── train.json # 1,895 training examples
│ ├── test.json # 410 test examples
│ ├── validation.json # 405 validation examples
│ └── dataset_stats.json # Dataset statistics
├── src/
│ ├── dataset/ # Dataset generation
│ ├── evaluation/ # Evaluation harness
│ ├── training/ # Fine-tuning code
│ └── analysis/ # Visualization tools
├── models/
│ └── crongen-qwen-7b/ # Fine-tuned model (LoRA adapter)
├── results/                  # All evaluation results (JSON)
└── demo/
    └── app.py                # Gradio demo
```

```bash
# Generate the full dataset
python -m src.dataset.generate --output-dir data

# Prepare for fine-tuning
python -m src.training.prepare_data --format chat
```

```bash
# Train your own model
python -m src.training.train \
--model Qwen/Qwen2.5-7B-Instruct \
--output-dir models/my-crongen \
--epochs 3 \
--lora-r 16
```

See src/training/train.py for full configuration options.
| Metric | Description |
|---|---|
| Exact Match | Cron string matches ground truth exactly |
| Semantic Match | Same schedule (next 10 run times match) |
| Parse Rate | % of outputs that are valid cron syntax |
| Field Accuracy | Per-field correctness breakdown |
- Test set: 410 examples, stratified across complexity and category
- Temperature: 0.0 (deterministic outputs)
- Semantic matching: Two crons are equivalent if their next 10 run times match
- Validation: All ground truth verified with croniter
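Field accuracy is presumably the simplest of these: split both expressions into their five fields and compare position by position. An illustrative sketch (the real logic is in src/evaluation/):

```python
CRON_FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def field_accuracy(predicted: str, truth: str) -> dict[str, bool]:
    """Per-field correctness: compare the five cron fields position by position."""
    pred, gold = predicted.split(), truth.split()
    if len(pred) != 5 or len(gold) != 5:
        raise ValueError("expected standard 5-field cron expressions")
    return {name: p == g for name, p, g in zip(CRON_FIELDS, pred, gold)}

# "0 9 * * MON-FRI" vs "0 9 * * 1-5": four fields match, but day_of_week
# differs as a string even though the schedules are identical, which is
# exactly why the benchmark also reports semantic match.
print(field_accuracy("0 9 * * MON-FRI", "0 9 * * 1-5"))
```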
Per-field accuracy in existing results: The result files in results/ contain incorrect per-field accuracy numbers due to a bug where exact-match predictions weren't counted in per-field stats. This has been fixed in the code, but the existing JSON files still have the old (low) numbers. The overall metrics (exact match, semantic match, parse rate) are correct. Re-run evaluations to get accurate per-field breakdowns.
- croniter for cron parsing and validation
- Qwen for the base model
- Hugging Face for the transformers ecosystem
Built because we were tired of Googling cron syntax.