TL;DR: We built a benchmark to test how well LLMs convert natural language to cron expressions. Claude Sonnet 4 and GPT-4o both hit ~79% accuracy. But a fine-tuned 7B model hits 87.8%, beating all the big API models. Training cost $1.76 on Lambda.
You're setting up a scheduled job for your AI agent. You need it to run "every weekday at 9 AM." Simple enough, right?
You stare at the cron syntax. Is it 0 9 * * 1-5 or 0 9 * * MON-FRI? Wait, does Sunday start at 0 or 1? You Google it. Again. For the third time this month.
Cron expressions are the assembly language of scheduling—powerful, ubiquitous, and perpetually confusing:
"Run at 2:30 AM on the 15th of every month" → 30 2 15 * * ✓ Easy
"Every 15 minutes during business hours" → */15 9-17 * * 1-5 ...okay
"First Monday of each month at 9 AM" → 🤯
So we wondered: Can LLMs just do this for us?
CronGen is a benchmark for testing how well LLMs convert natural language to cron expressions:
- 2,710 examples ranging from "every hour" to "third Thursday of months ending in 'ber'"
- Evaluation harness that checks both exact matches and semantic equivalence
- Full results for 6 models including fine-tuned and base Qwen 7B
Everything's open source: data/, src/evaluation/, results/

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Semantic Match Accuracy (%) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Crongen-Qwen-7B ████████████████████████████████████████████▏ 87.8% │
│ │
│ Claude Sonnet 4 ████████████████████████████████████████▎ 79.5% │
│ │
│ GPT-4o ███████████████████████████████████████▋ 78.8% │
│ │
│ GPT-4o-mini ██████████████████████████████████████▌ 76.3% │
│ │
│ Claude 3 Haiku ████████████████████████████████████▍ 72.4% │
│ │
│ Base Qwen 7B █████████████████████████▍ 50.7% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Model | Exact Match | Semantic Match | Parse Rate |
|---|---|---|---|
| Crongen-Qwen-7B | 85.9% | 87.8% | 100.0% |
| Claude Sonnet 4 | 74.6% | 79.5% | 98.3% |
| GPT-4o | 73.7% | 78.8% | 99.8% |
| GPT-4o-mini | 70.7% | 76.3% | 98.8% |
| Claude 3 Haiku | 66.6% | 72.4% | 98.3% |
| Base Qwen 2.5 7B | 46.8% | 50.7% | 96.8% |
What "Semantic Match" means: Sometimes */30 * * * * and 0,30 * * * * both correctly express "every 30 minutes." We check if the next 10 scheduled times match, not just the string.
Full results: results/
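The equivalence check itself is simple with croniter. Here's a minimal sketch of the idea (illustrative; the actual harness lives in src/evaluation/):

```python
from croniter import croniter
from datetime import datetime

def schedules_match(expr_a: str, expr_b: str, n: int = 10) -> bool:
    """Two cron expressions are equivalent if their next n run times agree."""
    start = datetime(2026, 1, 1)  # fixed start keeps the comparison deterministic
    iter_a, iter_b = croniter(expr_a, start), croniter(expr_b, start)
    return all(iter_a.get_next(datetime) == iter_b.get_next(datetime)
               for _ in range(n))

assert schedules_match("*/30 * * * *", "0,30 * * * *")  # same "every 30 minutes"
```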

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Semantic Match by Complexity (%) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BASIC (n=173) │
│ Crongen-Qwen ██████████████████████████████████████████████████ 98.8% │
│ Sonnet 4 ██████████████████████████████████████████████████ 99.4% │
│ GPT-4o █████████████████████████████████████████████████▏ 97.1% │
│ GPT-4o-mini █████████████████████████████████████████████████▏ 97.7% │
│ Haiku █████████████████████████████████████████████████▏ 97.7% │
│ │
│ INTERMEDIATE (n=149) │
│ Crongen-Qwen █████████████████████████████████████████▊ 84.6% │
│ Sonnet 4 ███████████████████████████████▏ 62.4% │
│ GPT-4o ███████████████████████████████▏ 62.4% │
│ GPT-4o-mini ████████████████████████████▌ 57.0% │
│ Haiku ████████████████████████████▊ 57.7% │
│ │
│ ADVANCED (n=55) │
│ Crongen-Qwen ███████████████████████████████████████████▋ 87.3% │
│ Sonnet 4 ███████████████████████████████████████▏ 78.2% │
│ GPT-4o ████████████████████████████████████████▉ 81.8% │
│ GPT-4o-mini ██████████████████████████████████████▎ 76.4% │
│ Haiku ██████████████████████████████▉ 61.8% │
│ │
│ EDGE CASES (n=33) │
│ Crongen-Qwen ██████████████████████▋ 45.5% │
│ Sonnet 4 ███████████████████████████▎ 54.5% │
│ GPT-4o █████████████████████████▊ 51.5% │
│ GPT-4o-mini █████████████████████████▊ 51.5% │
│ Haiku ████████████▏ 24.2% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Complexity | Crongen-Qwen | Sonnet 4 | GPT-4o | GPT-4o-mini | Haiku |
|---|---|---|---|---|---|
| Basic | 98.8% | 99.4% | 97.1% | 97.7% | 97.7% |
| Intermediate | 84.6% | 62.4% | 62.4% | 57.0% | 57.7% |
| Advanced | 87.3% | 78.2% | 81.8% | 76.4% | 61.8% |
| Edge Cases | 45.5% | 54.5% | 51.5% | 51.5% | 24.2% |
Key observations:
- Basic is solved. All models hit 97%+ on simple patterns like "every day at 3 PM."
- Intermediate is where fine-tuning shines. The fine-tuned model gains +22 percentage points over API models on combinations like "every 15 minutes on weekdays."
- Edge cases remain hard for everyone. These include genuinely ambiguous inputs like "biweekly" and things cron can't express, like "every other Tuesday."

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Semantic Match by Category (%) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIME-BASED (n=117) │
│ Crongen-Qwen ████████████████████████████████████████████████▎ 96.6% │
│ Sonnet 4 █████████████████████████████████████████████████▋ 99.2% │
│ GPT-4o █████████████████████████████████████████████████▏ 98.3% │
│ │
│ DAY-OF-WEEK (n=82) │
│ Crongen-Qwen ██████████████████████████████████████████████████ 100% │
│ Sonnet 4 ██████████████████████████████████████████████████ 100% │
│ GPT-4o ████████████████████████████████████████████████▎ 96.3% │
│ │
│ DATE-BASED (n=72) │
│ Crongen-Qwen ███████████████████████████████████████████████▏ 94.4% │
│ Sonnet 4 ████████████████████▊ 41.7% │
│ GPT-4o ████████████████████▊ 41.7% │
│ │
│ INTERVALS (n=31) │
│ Crongen-Qwen █████████████████████████████████████████▉ 83.9% │
│ Sonnet 4 ████████████████████████████████████████████████▍ 96.8% │
│ GPT-4o ██████████████████████████████████████████████████ 100% │
│ │
│ COMBINATIONS (n=75) │
│ Crongen-Qwen █████████████████████████████████████▎ 74.7% │
│ Sonnet 4 █████████████████████████████████▎ 66.7% │
│ GPT-4o ██████████████████████████████████ 68.0% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Category | Crongen-Qwen | Sonnet 4 | GPT-4o | GPT-4o-mini | Haiku |
|---|---|---|---|---|---|
| Time-based | 96.6% | 99.2% | 98.3% | 97.4% | 95.7% |
| Day-of-week | 100% | 100% | 96.3% | 96.3% | 100% |
| Date-based | 94.4% | 41.7% | 41.7% | 41.7% | 41.7% |
| Intervals | 83.9% | 96.8% | 100% | 93.5% | 83.9% |
| Combinations | 74.7% | 66.7% | 68.0% | 58.7% | 52.0% |
| Edge cases | 45.5% | 54.5% | 51.5% | 51.5% | 24.2% |
The date-based surprise: All four API models score exactly 41.7% on date-based patterns ("on the 15th of each month"), which suggests they share the same systematic mistake. The fine-tuned model jumps to 94.4%: this category follows a specific format in our dataset that fine-tuning captures.
We built examples across four complexity tiers. Dataset files: data/train.json, data/test.json

Basic:
"Every hour" → 0 * * * *
"Daily at midnight" → 0 0 * * *
"Every Monday at 9 AM" → 0 9 * * 1
Models ace these: 97%+ accuracy across the board.

Intermediate:
"Every 15 minutes during work hours" → */15 9-17 * * *
"Weekdays at 8 AM and 5 PM" → 0 8,17 * * 1-5
"Every Tuesday and Thursday at noon" → 0 12 * * 2,4
Accuracy drops to 57-62% for API models. The fine-tuned model hits 84.6%.

Advanced:
"Every quarter (Jan, Apr, Jul, Oct) on the 1st at 6 AM" → 0 6 1 1,4,7,10 *
"Every 20 minutes on weekdays, but only from 9-5" → */20 9-17 * * 1-5
"Twice daily at 6 AM and 6 PM on the 1st and 15th" → 0 6,18 1,15 * *
GPT-4o does better on advanced than intermediate: it handles explicit complexity better than implicit combinations.

Edge cases:
"Biweekly" → ??? (twice a week? every two weeks?)
"End of month" → ??? (28th? 30th? 31st?)
"Every other Friday" → ??? (cron can't express this!)
"Business days" → ??? (does this exclude holidays?)
These inputs are genuinely ambiguous—even humans disagree on the "correct" answer. Nobody's crushing it because there often isn't one right answer.
We expected few-shot prompting to help. It didn't.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Zero-Shot vs Few-Shot (Semantic Match %) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GPT-4o Zero-shot ████████████████████████████████████████ 78.8% │
│ Few-shot ██████████████████████████████████████▋ 76.8% ↓ │
│ │
│ GPT-4o-mini Zero-shot ██████████████████████████████████████▌ 76.3% │
│ Few-shot ██████████████████████████████████████▋ 76.6% ≈ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Model | Zero-shot | Few-shot | Delta |
|---|---|---|---|
| GPT-4o | 78.8% | 76.8% | -2.0% |
| GPT-4o-mini | 76.3% | 76.6% | +0.3% |
GPT-4o got worse with examples. Our theory: examples constrain reasoning. When you show it patterns, it tries to match those patterns rather than reasoning from first principles.
Bottom line: Just use zero-shot unless you have a specific reason not to.
Full few-shot results: results/gpt-4o_few_shot_test_20260130_223144_summary.json
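For reference, a zero-shot query is just the task instruction plus the input. Here's a sketch with our own assumed prompt wording (the benchmark's actual prompts live in src/evaluation/):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,  # the benchmark runs all models at temperature 0
    messages=[
        {"role": "system", "content": "Convert the description to a standard "
         "5-field cron expression. Output only the expression."},
        {"role": "user", "content": "Every weekday at 9 AM"},
    ],
)
print(response.choices[0].message.content)  # expected: 0 9 * * 1-5
```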
Let's look at some actual mistakes. These are educational and entertaining.
Input: "Daily at noon"
Expected: 0 12 * * *
GPT-4o-mini: 0 0 * * * ← Midnight, not noon!
"Noon" → hour 0? Someone's internal training data had a bad day.
Input: "Every Sunday at 10 AM"
Expected: 0 10 * * 0
Claude 3 Haiku: 0 10 * * 7 ← Works in some crons, not others
Is Sunday 0 or 7? POSIX says days run 0-6 with 0 = Sunday, but many implementations also accept 7 for Sunday. Both are arguably correct.
Input: "Every other Tuesday"
Expected: ???
All models: Various wrong answers
Standard cron can't express "every other week." You'd need a stateful scheduler. All models gamely try anyway, producing things like 0 0 * * 2 (every Tuesday) or 0 0 1,15 * 2 (creative but wrong).
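If you genuinely need "every other Tuesday," the usual workaround is to schedule every Tuesday and keep the state in the job itself. A sketch (the anchor date is our own arbitrary choice of a starting Tuesday):

```python
from datetime import date

def is_on_week(anchor: date = date(2026, 1, 6)) -> bool:
    """Gate for a biweekly job: True on alternating weeks from an anchor Tuesday."""
    weeks_since = (date.today() - anchor).days // 7
    return weeks_since % 2 == 0

# Cron fires every Tuesday (0 0 * * 2); the job exits early on off weeks.
if not is_on_week():
    raise SystemExit(0)
# ... actual job logic runs here on the "on" weeks
```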
Input: "Biweekly"
Expected: 0 0 1,15 * * (our interpretation: twice monthly)
Claude Sonnet 4: 0 0 * * 1,4 ← Twice weekly (Monday and Thursday)
GPT-4o: 0 0 */14 * * ← Intended as every 14 days (but */14 in the day-of-month field actually means the 1st, 15th, and 29th of each month)
Three different interpretations, all defensible. "Biweekly" is cursed—just don't use it.
We fine-tuned Qwen 2.5 7B on our training set using QLoRA. The improvement was dramatic:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Fine-Tuning Impact: Base vs Trained (Qwen 7B) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SEMANTIC MATCH │
│ Base █████████████████████████▍ 50.7% │
│ Trained ███████████████████████████████████████████▉ 87.8% +37.1% │
│ │
│ PARSE RATE │
│ Base ████████████████████████████████████████████████▍ 96.8% │
│ Trained ██████████████████████████████████████████████████ 100.0% │
│ │
│ EXACT MATCH │
│ Base ███████████████████████▍ 46.8% │
│ Trained ██████████████████████████████████████████▉ 85.9% +39.1% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

| Metric | Base Qwen 7B | Fine-tuned | Improvement |
|---|---|---|---|
| Semantic Match | 50.7% | 87.8% | +37.1% |
| Exact Match | 46.8% | 85.9% | +39.1% |
| Parse Rate | 96.8% | 100.0% | +3.2% |
The fine-tuned 7B model beats GPT-4o by 9 percentage points and achieves perfect parse rate.
- Base Model: Qwen 2.5 7B Instruct
- Method: QLoRA (4-bit quantization, rank 16)
- Training Data: 1,895 examples (data/train.json)
- Hardware: Single A100 on Lambda Labs
- Training Time: ~10 minutes (1.3 hours total with setup/eval)
- Total Cost: $1.76
- Model Weights: models/crongen-qwen-7b/
Training code: src/training/train.py
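For the curious, the core of a QLoRA setup at these hyperparameters looks roughly like this. A sketch assuming the Hugging Face transformers/peft stack; lora_alpha and the target modules are our assumptions, since the post only pins quantization and rank (see src/training/train.py for the real configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the 7B base model within a single A100's memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# Rank-16 LoRA adapter on the attention projections (module list is an assumption)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,  # assumption: alpha is not stated in the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B weights train
```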
- Format consistency: The fine-tuned model always outputs just the cron expression—no explanations, no markdown, no "Here's your cron:"
- Domain specialization: It learned the specific patterns in our dataset
- Date-based mastery: 94.4% vs 41.7% on date patterns—a category where all API models struggled equally
- Self-hosted: No API costs, no latency, no data leaving your servers
- Consistent: Always outputs valid cron syntax (100% parse rate)
- Better: Actually more accurate than API models costing 100x more per token
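To try the adapter locally, loading it looks roughly like this. A sketch: it assumes the adapter was trained on chat-formatted data, which prepare_data --format chat suggests:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "models/crongen-qwen-7b")  # LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [{"role": "user", "content": "Every weekday at 9 AM"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# expected: 0 9 * * 1-5
```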
For your weekend project: Any model works. GPT-4o-mini handles basic patterns at 97.7% accuracy.
For production:
- Use the fine-tuned model if you can self-host (87.8% accuracy, $0 per query)
- Always validate outputs with a cron parser
- Show users what their cron will actually do: "This will run every Monday at 9 AM"
For the paranoid:
```python
from croniter import croniter
from datetime import datetime

def validate_cron(expression: str) -> list[datetime] | None:
    """Validate a cron expression and return its next 5 run times."""
    try:
        cron = croniter(expression, datetime.now())
        return [cron.get_next(datetime) for _ in range(5)]
    except ValueError:  # croniter's bad-expression errors subclass ValueError
        return None
```

Ideas we'd love to explore (PRs welcome):
- Cron → English: The reverse direction. "What does 0 */2 * * 1-5 mean?"
- Smaller models: Can we match this with a 3B or 1B model?
- Other formats: Quartz cron, systemd timers, cloud scheduler syntax
- Better edge cases: Handling genuinely ambiguous inputs more gracefully
To cite this work:

```bibtex
@misc{crongpt2026,
  title={CronGPT: Benchmarking LLMs on Natural Language to Cron Expression Generation},
  author={Diamond Bishop and Claude},
  year={2026},
  url={https://github.com/dbish/crongpt}
}
```

```bash
# Clone and install
git clone https://github.com/dbish/crongpt
cd crongpt
pip install -e .
# Run evaluation on any model
export OPENAI_API_KEY="your-key"
python -m src.evaluation.evaluate --model gpt-4o-mini --strategy zero_shot
# Run evaluation with the fine-tuned model (requires GPU)
python -m src.evaluation.evaluate --model crongen-qwen-7b --strategy zero_shot
```

```
crongen/
├── data/
│ ├── train.json # 1,895 training examples
│ ├── test.json # 410 test examples
│ ├── validation.json # 405 validation examples
│ └── dataset_stats.json # Dataset statistics
├── src/
│ ├── dataset/ # Dataset generation
│ ├── evaluation/ # Evaluation harness
│ ├── training/ # Fine-tuning code
│ └── analysis/ # Visualization tools
├── models/
│ └── crongen-qwen-7b/ # Fine-tuned model (LoRA adapter)
├── results/                  # All evaluation results (JSON)
└── demo/
    └── app.py                # Gradio demo
```

```bash
# Generate the full dataset
python -m src.dataset.generate --output-dir data

# Prepare for fine-tuning
python -m src.training.prepare_data --format chat
```

```bash
# Train your own model
python -m src.training.train \
--model Qwen/Qwen2.5-7B-Instruct \
--output-dir models/my-crongen \
--epochs 3 \
--lora-r 16
```

See src/training/train.py for full configuration options.
| Metric | Description |
|---|---|
| Exact Match | Cron string matches ground truth exactly |
| Semantic Match | Same schedule (next 10 run times match) |
| Parse Rate | % of outputs that are valid cron syntax |
| Field Accuracy | Per-field correctness breakdown |
- Test set: 410 examples, stratified across complexity and category
- Temperature: 0.0 (deterministic outputs)
- Semantic matching: Two crons are equivalent if their next 10 run times match
- Validation: All ground truth verified with croniter
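Field accuracy is presumably the simplest of these: split both expressions into their five fields and compare position by position. An illustrative sketch (the real logic is in src/evaluation/):

```python
CRON_FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def field_accuracy(predicted: str, truth: str) -> dict[str, bool]:
    """Per-field correctness: compare the five cron fields position by position."""
    pred, gold = predicted.split(), truth.split()
    if len(pred) != 5 or len(gold) != 5:
        raise ValueError("expected standard 5-field cron expressions")
    return {name: p == g for name, p, g in zip(CRON_FIELDS, pred, gold)}

# "0 9 * * MON-FRI" vs "0 9 * * 1-5": four fields match, but day_of_week
# differs as a string even though the schedules are identical, which is
# exactly why the benchmark also reports semantic match.
print(field_accuracy("0 9 * * MON-FRI", "0 9 * * 1-5"))
```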
Per-field accuracy in existing results: The result files in results/ contain incorrect per-field accuracy numbers due to a bug where exact-match predictions weren't counted in per-field stats. This has been fixed in the code, but the existing JSON files still have the old (low) numbers. The overall metrics (exact match, semantic match, parse rate) are correct. Re-run evaluations to get accurate per-field breakdowns.
- croniter for cron parsing and validation
- Qwen for the base model
- Hugging Face for the transformers ecosystem
Built because we were tired of Googling cron syntax.