CronGen: Can LLMs Write Cron Expressions?


TL;DR: We built a benchmark to test how well LLMs convert natural language to cron expressions. Claude Sonnet 4 and GPT-4o both hit ~79% accuracy. But a fine-tuned 7B model hits 87.8%, beating all the big API models. Training cost $1.76 on Lambda.


The Problem Every Developer Knows

You're setting up a scheduled job for your AI agent. You need it to run "every weekday at 9 AM." Simple enough, right?

You stare at the cron syntax. Is it 0 9 * * 1-5 or 0 9 * * MON-FRI? Wait, does Sunday start at 0 or 1? You Google it. Again. For the third time this month.

Cron expressions are the assembly language of scheduling—powerful, ubiquitous, and perpetually confusing:

"Run at 2:30 AM on the 15th of every month"  →  30 2 15 * *   ✓ Easy
"Every 15 minutes during business hours"    →  */15 9-17 * * 1-5   ...okay
"First Monday of each month at 9 AM"        →  🤯

So we wondered: Can LLMs just do this for us?


What We Built

CronGen is a benchmark for testing how well LLMs convert natural language to cron expressions:

  • 2,710 examples ranging from "every hour" to "third Thursday of months ending in 'ber'"
  • Evaluation harness that checks both exact matches and semantic equivalence
  • Full results for 6 models including fine-tuned and base Qwen 7B

Everything's open source: data/, src/evaluation/, results/


Results

Overall Performance

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Semantic Match Accuracy (%)                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Crongen-Qwen-7B ████████████████████████████████████████████▏ 87.8%       │
│                                                                             │
│  Claude Sonnet 4 ████████████████████████████████████████▎     79.5%       │
│                                                                             │
│  GPT-4o          ███████████████████████████████████████▋      78.8%       │
│                                                                             │
│  GPT-4o-mini     ██████████████████████████████████████▌       76.3%       │
│                                                                             │
│  Claude 3 Haiku  ████████████████████████████████████▍         72.4%       │
│                                                                             │
│  Base Qwen 7B    █████████████████████████▍                    50.7%       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Model              Exact Match   Semantic Match   Parse Rate
Crongen-Qwen-7B    85.9%         87.8%            100.0%
Claude Sonnet 4    74.6%         79.5%            98.3%
GPT-4o             73.7%         78.8%            99.8%
GPT-4o-mini        70.7%         76.3%            98.8%
Claude 3 Haiku     66.6%         72.4%            98.3%
Base Qwen 2.5 7B   46.8%         50.7%            96.8%

What "Semantic Match" means: Sometimes */30 * * * * and 0,30 * * * * both correctly express "every 30 minutes." We check if the next 10 scheduled times match, not just the string.

Full results: results/


Performance by Complexity

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Semantic Match by Complexity (%)                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ BASIC (n=173)                                                               │
│   Crongen-Qwen ██████████████████████████████████████████████████ 98.8%    │
│   Sonnet 4     ██████████████████████████████████████████████████ 99.4%    │
│   GPT-4o       █████████████████████████████████████████████████▏ 97.1%    │
│   GPT-4o-mini  █████████████████████████████████████████████████▏ 97.7%    │
│   Haiku        █████████████████████████████████████████████████▏ 97.7%    │
│                                                                             │
│ INTERMEDIATE (n=149)                                                        │
│   Crongen-Qwen █████████████████████████████████████████▊        84.6%    │
│   Sonnet 4     ███████████████████████████████▏                  62.4%    │
│   GPT-4o       ███████████████████████████████▏                  62.4%    │
│   GPT-4o-mini  ████████████████████████████▌                     57.0%    │
│   Haiku        ████████████████████████████▊                     57.7%    │
│                                                                             │
│ ADVANCED (n=55)                                                             │
│   Crongen-Qwen ███████████████████████████████████████████▋      87.3%    │
│   Sonnet 4     ███████████████████████████████████████▏          78.2%    │
│   GPT-4o       ████████████████████████████████████████▉         81.8%    │
│   GPT-4o-mini  ██████████████████████████████████████▎           76.4%    │
│   Haiku        ██████████████████████████████▉                   61.8%    │
│                                                                             │
│ EDGE CASES (n=33)                                                           │
│   Crongen-Qwen ██████████████████████▋                           45.5%    │
│   Sonnet 4     ███████████████████████████▎                      54.5%    │
│   GPT-4o       █████████████████████████▊                        51.5%    │
│   GPT-4o-mini  █████████████████████████▊                        51.5%    │
│   Haiku        ████████████▏                                     24.2%    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Complexity     Crongen-Qwen   Sonnet 4   GPT-4o   GPT-4o-mini   Haiku
Basic          98.8%          99.4%      97.1%    97.7%         97.7%
Intermediate   84.6%          62.4%      62.4%    57.0%         57.7%
Advanced       87.3%          78.2%      81.8%    76.4%         61.8%
Edge Cases     45.5%          54.5%      51.5%    51.5%         24.2%

Key observations:

  1. Basic is solved. All models hit 97%+ on simple patterns like "every day at 3 PM."

  2. Intermediate is where fine-tuning shines. The fine-tuned model gains +22 percentage points over API models on combinations like "every 15 minutes on weekdays."

  3. Edge cases remain hard for everyone. These include genuinely ambiguous inputs like "biweekly" or things cron can't express like "every other Tuesday."


Performance by Category

┌─────────────────────────────────────────────────────────────────────────────┐
│                      Semantic Match by Category (%)                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ TIME-BASED (n=117)                                                          │
│   Crongen-Qwen ████████████████████████████████████████████████▎  96.6%    │
│   Sonnet 4     █████████████████████████████████████████████████▋ 99.2%    │
│   GPT-4o       █████████████████████████████████████████████████▏ 98.3%    │
│                                                                             │
│ DAY-OF-WEEK (n=82)                                                          │
│   Crongen-Qwen ██████████████████████████████████████████████████ 100%     │
│   Sonnet 4     ██████████████████████████████████████████████████ 100%     │
│   GPT-4o       ████████████████████████████████████████████████▎  96.3%    │
│                                                                             │
│ DATE-BASED (n=72)                                                           │
│   Crongen-Qwen ███████████████████████████████████████████████▏   94.4%    │
│   Sonnet 4     ████████████████████▊                              41.7%    │
│   GPT-4o       ████████████████████▊                              41.7%    │
│                                                                             │
│ INTERVALS (n=31)                                                            │
│   Crongen-Qwen █████████████████████████████████████████▉         83.9%    │
│   Sonnet 4     ████████████████████████████████████████████████▍  96.8%    │
│   GPT-4o       ██████████████████████████████████████████████████ 100%     │
│                                                                             │
│ COMBINATIONS (n=75)                                                         │
│   Crongen-Qwen █████████████████████████████████████▎             74.7%    │
│   Sonnet 4     █████████████████████████████████▎                 66.7%    │
│   GPT-4o       ██████████████████████████████████                 68.0%    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Category       Crongen-Qwen   Sonnet 4   GPT-4o   GPT-4o-mini   Haiku
Time-based     96.6%          99.2%      98.3%    97.4%         95.7%
Day-of-week    100%           100%       96.3%    96.3%         100%
Date-based     94.4%          41.7%      41.7%    41.7%         41.7%
Intervals      83.9%          96.8%      100%     93.5%         83.9%
Combinations   74.7%          66.7%      68.0%    58.7%         52.0%
Edge cases     45.5%          54.5%      51.5%    51.5%         24.2%

The date-based surprise: All API models score exactly 41.7% on date-based patterns ("on the 15th of each month"). The fine-tuned model jumps to 94.4%. This category follows a narrow set of output formats that fine-tuning captures almost perfectly.


The Dataset

We built examples across four complexity tiers. Dataset files: data/train.json, data/test.json

Basic (42%) — The Warm-Up

"Every hour"                    → 0 * * * *
"Daily at midnight"             → 0 0 * * *
"Every Monday at 9 AM"          → 0 9 * * 1

Models ace these. 97%+ accuracy across the board.

Intermediate (36%) — Where Things Get Interesting

"Every 15 minutes during work hours"     → */15 9-17 * * *
"Weekdays at 8 AM and 5 PM"              → 0 8,17 * * 1-5
"Every Tuesday and Thursday at noon"     → 0 12 * * 2,4

Accuracy drops to 57-62% for API models. The fine-tuned model hits 84.6%.

Advanced (13%) — Now We're Cooking

"Every quarter (Jan, Apr, Jul, Oct) on the 1st at 6 AM"  → 0 6 1 1,4,7,10 *
"Every 20 minutes on weekdays, but only from 9-5"        → */20 9-17 * * 1-5
"Twice daily at 6 AM and 6 PM on the 1st and 15th"       → 0 6,18 1,15 * *

GPT-4o does better on advanced than intermediate—it handles explicit complexity better than implicit combinations.

Edge Cases (9%) — Here Be Dragons

"Biweekly"                      → ??? (twice a week? every two weeks?)
"End of month"                  → ??? (28th? 30th? 31st?)
"Every other Friday"            → ??? (cron can't express this!)
"Business days"                 → ??? (does this exclude holidays?)

These inputs are genuinely ambiguous—even humans disagree on the "correct" answer. Nobody's crushing it because there often isn't one right answer.


Few-Shot Learning: Surprisingly Unhelpful

We expected few-shot prompting to help. It didn't.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Zero-Shot vs Few-Shot (Semantic Match %)                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  GPT-4o      Zero-shot ████████████████████████████████████████ 78.8%      │
│              Few-shot  ██████████████████████████████████████▋  76.8%  ↓   │
│                                                                             │
│  GPT-4o-mini Zero-shot ██████████████████████████████████████▌  76.3%      │
│              Few-shot  ██████████████████████████████████████▋  76.6%  ≈   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Model         Zero-shot   Few-shot   Delta
GPT-4o        78.8%       76.8%      -2.0%
GPT-4o-mini   76.3%       76.6%      +0.3%

GPT-4o got worse with examples. Our theory: examples constrain reasoning. When you show it patterns, it tries to match those patterns rather than reasoning from first principles.

Bottom line: Just use zero-shot unless you have a specific reason not to.

Full few-shot results: results/gpt-4o_few_shot_test_20260130_223144_summary.json
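For reference, here's a minimal sketch of the kind of zero-shot call involved, assuming the OpenAI Python SDK. The system prompt is illustrative; the benchmark's actual prompts live in src/evaluation/:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def nl_to_cron(description: str, model: str = "gpt-4o-mini") -> str:
    """Ask a chat model to translate a schedule description into cron syntax."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # deterministic outputs, matching the methodology below
        messages=[
            {"role": "system", "content": "Convert the schedule to a standard "
             "5-field cron expression. Output only the expression."},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content.strip()

print(nl_to_cron("every weekday at 9 AM"))  # ideally: 0 9 * * 1-5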


The Failure Museum

Let's look at some actual mistakes. These are educational and entertaining.

The Noon Problem

Input: "Daily at noon"
Expected: 0 12 * * *
GPT-4o-mini: 0 0 * * *  ← Midnight, not noon!

"Noon" → hour 0? Someone's internal training data had a bad day.

The Off-By-One Classic

Input: "Every Sunday at 10 AM"
Expected: 0 10 * * 0
Claude 3 Haiku: 0 10 * * 7  ← Works in some crons, not others

Is Sunday 0 or 7? Depends on the implementation! Both are arguably correct.

The Impossible Request

Input: "Every other Tuesday"
Expected: ???
All models: Various wrong answers

Standard cron can't express "every other week." You'd need a stateful scheduler. All models gamely try anyway, producing things like 0 0 * * 2 (every Tuesday) or 0 0 1,15 * 2 (creative but wrong).
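Since cron itself has no memory, the usual workaround moves the state into the job: schedule 0 0 * * 2 (every Tuesday) and have the script bail out on alternate weeks. A sketch, where the "even ISO week" anchor is an arbitrary assumption, and exactly the ambiguity the models are forced to guess at:

import sys
from datetime import date

# Paired with the crontab line: 0 0 * * 2
if date.today().isocalendar().week % 2 != 0:
    sys.exit(0)  # odd ISO week: not our Tuesday, do nothing

# ...the actual job runs here, on Tuesdays in even ISO weeks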

The Ambiguity Champion

Input: "Biweekly"
Expected: 0 0 1,15 * * (our interpretation: twice monthly)
Claude Sonnet 4: 0 0 * * 1,4  ← Twice weekly (Monday and Thursday)
GPT-4o: 0 0 */14 * *  ← Every 14 days

Three different interpretations, all defensible. "Biweekly" is cursed—just don't use it.


Fine-Tuning: The Plot Twist

We fine-tuned Qwen 2.5 7B on our training set using QLoRA. The improvement was dramatic:

┌─────────────────────────────────────────────────────────────────────────────┐
│               Fine-Tuning Impact: Base vs Trained (Qwen 7B)                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  SEMANTIC MATCH                                                             │
│    Base      █████████████████████████▍                    50.7%           │
│    Trained   ███████████████████████████████████████████▉  87.8%   +37.1%  │
│                                                                             │
│  PARSE RATE                                                                 │
│    Base      ████████████████████████████████████████████████▍  96.8%      │
│    Trained   ██████████████████████████████████████████████████ 100.0%     │
│                                                                             │
│  EXACT MATCH                                                                │
│    Base      ███████████████████████▍                      46.8%           │
│    Trained   ██████████████████████████████████████████▉   85.9%   +39.1%  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Metric           Base Qwen 7B   Fine-tuned   Improvement
Semantic Match   50.7%          87.8%        +37.1%
Exact Match      46.8%          85.9%        +39.1%
Parse Rate       96.8%          100.0%       +3.2%

The fine-tuned 7B model beats GPT-4o by 9 percentage points and achieves perfect parse rate.

Training Details

  • Base Model: Qwen 2.5 7B Instruct
  • Method: QLoRA (4-bit quantization, rank 16)
  • Training Data: 1,895 examples (data/train.json)
  • Hardware: Single A100 on Lambda Labs
  • Training Time: ~10 minutes (1.3 hours total with setup/eval)
  • Total Cost: $1.76
  • Model Weights: models/crongen-qwen-7b/

Training code: src/training/train.py
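For orientation, a minimal sketch of that QLoRA setup with transformers and peft. The rank matches the run above; the other hyperparameters and target modules are assumptions, and the real configuration lives in src/training/train.py:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters; only these small matrices get trained.
lora_config = LoraConfig(
    r=16,                  # rank 16, as in the run above
    lora_alpha=32,         # assumed (a common 2*r default)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()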

Why Fine-Tuning Wins

  1. Format consistency: The fine-tuned model always outputs just the cron expression—no explanations, no markdown, no "Here's your cron:" (see the extraction sketch after this list)
  2. Domain specialization: It learned the specific patterns in our dataset
  3. Date-based mastery: 94.4% vs 41.7% on date patterns—a category where all API models struggled equally
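On the first point: API-model replies often wrap the expression in prose, so evaluating them needs an extraction step before parsing. A hypothetical sketch—the regex and helper name are assumptions, and the real logic lives in src/evaluation/:

import re

# Match a standalone 5-field cron expression on its own line.
CRON_RE = re.compile(r"(?m)^\s*([\d*,/-]+(?:\s+[\dA-Za-z*,/-]+){4})\s*$")

def extract_cron(reply: str) -> str | None:
    """Pull a cron expression out of a possibly chatty model reply."""
    match = CRON_RE.search(reply)
    return match.group(1) if match else None

print(extract_cron("Here's your cron:\n0 9 * * 1-5"))  # -> 0 9 * * 1-5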

Production Benefits

  • Self-hosted: No API costs, no latency, no data leaving your servers
  • Consistent: Always outputs valid cron syntax (100% parse rate)
  • Better: Actually more accurate than API models costing 100x more per token

Recommendations

For your weekend project: Any model works. GPT-4o-mini handles basic patterns at 97.7% accuracy.

For production:

  1. Use the fine-tuned model if you can self-host (87.8% accuracy, $0 per query)
  2. Always validate outputs with a cron parser
  3. Show users what their cron will actually do: "This will run every Monday at 9 AM"

For the paranoid:

from croniter import croniter
from datetime import datetime

def validate_cron(expression: str) -> list[datetime] | None:
    """Validate a cron expression and return its next 5 run times, or None if invalid."""
    try:
        cron = croniter(expression, datetime.now())
        return [cron.get_next(datetime) for _ in range(5)]
    except ValueError:  # croniter signals bad expressions with ValueError subclasses
        return None
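For recommendation 3 (showing users what their cron will actually do), one option is the third-party cron-descriptor package, sketched here:

from cron_descriptor import get_description

# Render a cron expression in plain English before asking the user to confirm it.
print(get_description("0 9 * * 1-5"))  # e.g. "At 09:00 AM, Monday through Friday"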

What's Next

Ideas we'd love to explore (PRs welcome):

  • Cron → English: The reverse direction. "What does 0 */2 * * 1-5 mean?"
  • Smaller models: Can we match this with a 3B or 1B model?
  • Other formats: Quartz cron, systemd timers, cloud scheduler syntax
  • Better edge cases: Handling genuinely ambiguous inputs more gracefully

Citation

@misc{crongpt2026,
  title={CronGPT: Benchmarking LLMs on Natural Language to Cron Expression Generation},
  author={Diamond Bishop and Claude},
  year={2026},
  url={https://github.com/dbish/crongpt}
}

Appendix: Installation & Usage

Quick Start

# Clone and install
git clone https://github.com/dbish/crongpt
cd crongpt
pip install -e .

# Run evaluation on any model
export OPENAI_API_KEY="your-key"
python -m src.evaluation.evaluate --model gpt-4o-mini --strategy zero_shot

# Run evaluation with the fine-tuned model (requires GPU)
python -m src.evaluation.evaluate --model crongen-qwen-7b --strategy zero_shot

Repository Structure

crongen/
├── data/
│   ├── train.json           # 1,895 training examples
│   ├── test.json            # 410 test examples
│   ├── validation.json      # 405 validation examples
│   └── dataset_stats.json   # Dataset statistics
├── src/
│   ├── dataset/             # Dataset generation
│   ├── evaluation/          # Evaluation harness
│   ├── training/            # Fine-tuning code
│   └── analysis/            # Visualization tools
├── models/
│   └── crongen-qwen-7b/     # Fine-tuned model (LoRA adapter)
├── results/                 # All evaluation results (JSON)
└── demo/
    └── app.py               # Gradio demo

Generate Dataset

# Generate the full dataset
python -m src.dataset.generate --output-dir data

# Prepare for fine-tuning
python -m src.training.prepare_data --format chat

Fine-Tuning

# Train your own model
python -m src.training.train \
    --model Qwen/Qwen2.5-7B-Instruct \
    --output-dir models/my-crongen \
    --epochs 3 \
    --lora-r 16

See src/training/train.py for full configuration options.

Evaluation Metrics

Metric           Description
Exact Match      Cron string matches ground truth exactly
Semantic Match   Same schedule (next 10 run times match)
Parse Rate       % of outputs that are valid cron syntax
Field Accuracy   Per-field correctness breakdown
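The first three are computed from the output string or schedule as a whole; Field Accuracy splits the 5-field string and compares field by field. A sketch of that breakdown, assuming plain per-field string equality:

def field_accuracy(pred: str, truth: str) -> dict[str, bool]:
    """Compare a predicted 5-field cron string to ground truth, field by field."""
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    return dict(zip(names, (p == t for p, t in zip(pred.split(), truth.split()))))

# 0 10 * * 0 vs 0 10 * * 7: only the day-of-week field differs
print(field_accuracy("0 10 * * 0", "0 10 * * 7"))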

Methodology

  • Test set: 410 examples, stratified across complexity and category
  • Temperature: 0.0 (deterministic outputs)
  • Semantic matching: Two crons are equivalent if their next 10 run times match
  • Validation: All ground truth verified with croniter

Known Issues

Per-field accuracy in existing results: The result files in results/ contain incorrect per-field accuracy numbers due to a bug where exact-match predictions weren't counted in per-field stats. This has been fixed in the code, but the existing JSON files still have the old (low) numbers. The overall metrics (exact match, semantic match, parse rate) are correct. Re-run evaluations to get accurate per-field breakdowns.


Acknowledgments

  • croniter for cron parsing and validation
  • Qwen for the base model
  • Hugging Face for the transformers ecosystem

Built because we were tired of Googling cron syntax.
