Text-to-CAD Evaluation with CADTests

This repository contains CADTestBench, the first test-based benchmark for Text-to-CAD, along with code for evaluating any Text-to-CAD method against CADTestBench and for computing dataset metrics.

CADTestBench is described in the paper:

Text-to-CAD Evaluation with CADTests.

The benchmark data is loaded from Hugging Face by default (dimitrismallis/CADTestBench).

CADTestBench

A test-based benchmark for Text-to-CAD evaluation.

CADTestBench replaces geometric-similarity metrics (e.g., Chamfer Distance) with executable software tests that directly verify whether a generated CAD model satisfies the geometric and topological requirements of its prompt. Each prompt is paired with a suite of CADTests — Python predicates over the generated B-rep — and a model is correct only if it passes every test.
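The concrete cadtests live in the benchmark data; purely as an illustration of the idea, a predicate over a generated CadQuery model might look like the following (the checked requirements, tolerance, and function name are invented for this example):

import cadquery as cq

def cadtest_cylinder_height(model: cq.Workplane) -> bool:
    # Hypothetical cadtest: the part should be a plain cylinder of height 20.
    solid = model.val()
    # Topological check: a plain cylinder has exactly three faces
    # (two planar caps plus one lateral face).
    if len(solid.Faces()) != 3:
        return False
    # Geometric check: overall height along Z within a small tolerance.
    return abs(solid.BoundingBox().zlen - 20) < 1e-6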

CADTest Overview

Why test-based evaluation?

Text prompts are inherently ambiguous: many valid CAD models can satisfy the same prompt. Comparing a generation to a single reference via geometric similarity penalizes valid design variations and rewards superficially close but structurally wrong outputs. Test-based evaluation is:

  • Reference-free — assesses prompt compliance directly, not similarity to one ground-truth solution.
  • Explicit — every dimensional and structural requirement is a checkable constraint.
  • Interpretable — failing tests return diagnostic messages explaining what went wrong.
  • Efficient — deterministic, fast, no inference cost from learned evaluators.

What's in the benchmark

  • 200 CAD programs sourced from CADPrompt
  • Two prompt variants per program: abstract (high-level description) and detailed (with explicit geometric constraints)
  • A suite of CADTests per prompt, synthesized and refined through mutation analysis
  • Per-test metadata: requirement linkage, test type, classification rationale

Installation

Use a dedicated virtual environment for this repo so that CadQuery and its dependencies stay isolated from your system Python.

Requires Python 3.10+. From the repository root:

python3 -m venv cadtestenv
source cadtestenv/bin/activate 
pip install --upgrade pip setuptools wheel
pip install -e . --timeout 120 --retries 10
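
To confirm the CLI installed correctly, the evaluate help described below should print the available flags:

cadtestbench evaluate --help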

Quick Start in Browser

Open the CADTestBench Viewer to browse the four paper baselines (GPT-5.2 and Claude 4.6 Sonnet, abstract and detailed splits) with preloaded cadtest results, directly in the browser.

CADTestBench Viewer — paper baselines and per-sample cadtests

Run Paper Baselines

The repo includes four paper baselines (GPT-5.2 and Claude 4.6 Sonnet, abstract and detailed splits). From the project root, after pip install -e ., run:

cadtestbench evaluate baselines/GPT-5.2/Abstract --partition abstract
cadtestbench evaluate baselines/GPT-5.2/Detailed --partition detailed
cadtestbench evaluate baselines/Claude-4.6-Sonnet/Abstract --partition abstract
cadtestbench evaluate baselines/Claude-4.6-Sonnet/Detailed --partition detailed

Each command loads prompts and cadtests from dimitrismallis/CADTestBench. Use --limit N for a short run.

Usage

To evaluate a Text-to-CAD method with CADTestBench, load the benchmark prompts from Hugging Face (dimitrismallis/CADTestBench), then produce your method’s outputs in the directory layout described below.
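
To inspect the prompts yourself, the benchmark can be pulled with the Hugging Face datasets library; the exact splits and columns are not documented here, so this minimal sketch just loads the dataset and prints its schema:

# Minimal sketch: inspect the benchmark data on Hugging Face.
# Split and column names are not assumed; print the dataset to see the schema.
from datasets import load_dataset

ds = load_dataset("dimitrismallis/CADTestBench")
print(ds)  # available splits and columns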

Method result layout (generated models)

Pass a method run root as the positional BASELINE_DIR. Under it, each benchmark sample id is a folder:

{BASELINE_DIR}/
└── generated_models/
    └── {sample_id}/
        └── gpt_generated.py      # default; other names via metric kwargs

Convention in this repo’s baselines: implement geometry in create_cad() (returns the model, usually cq.Workplane), then assign final_result at module scope so the runner can load it:

import cadquery as cq

def create_cad() -> cq.Workplane:
    result = cq.Workplane("XY").circle(10).extrude(20)
    return result

final_result = create_cad()
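
As an optional local sanity check before running the evaluator, the same script can mesh final_result with CadQuery's exporter (the output file name is arbitrary):

# Optional sanity check, not part of the evaluation pipeline:
# mesh the module-level result to STL for inspection in any mesh viewer.
cq.exporters.export(final_result, "sanity_check.stl")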

Evaluation

  • The --partition flag is required: abstract or detailed.
cadtestbench evaluate path/to/method --partition abstract

More flags: --limit, --sample-ids, --dataset, --run-label, --eval-root — see cadtestbench evaluate --help.

cadtestbench evaluate path/to/method --partition detailed --limit 10

Evaluation output

Each run is written under eval_root (default {project_root}/eval) in a timestamped folder:

{eval_root}/
└── {YYYYMMDD_HHMMSS}_…_{partition}/
    ├── config.json                 # resolved run configuration
    ├── evaluation_summary.json     # aggregate metrics and timing
    └── samples/
        └── {sample_id}/
            ├── summary.json          # per-sample metrics and metadata
            ├── generated.stl         # mesh from the generated model
            └── code_with_cadtests.py # optional replay bundle (debug)

Inspect runs in CADTestBench Viewer by dragging the timestamped folder ({YYYYMMDD_HHMMSS}_…_{partition}/, the one that contains config.json, evaluation_summary.json, and samples/) into the page.
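
The same folders can also be inspected programmatically. The keys inside the JSON files are not listed here, so this sketch only loads and prints them (the run folder name is illustrative):

import json
from pathlib import Path

run_dir = Path("eval/20250101_120000_run_abstract")  # illustrative timestamped folder

# Aggregate metrics and timing for the run.
summary = json.loads((run_dir / "evaluation_summary.json").read_text())
print(summary)

# Per-sample metrics and metadata.
for sample_file in sorted(run_dir.glob("samples/*/summary.json")):
    print(sample_file.parent.name, json.loads(sample_file.read_text()))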

The cadtest metrics

For each sample, cadtest:

  1. Loads the cadtest suite for the sample from the configured Hugging Face dataset.
  2. Executes the generated model code.
  3. Runs each cadtest against the live CadQuery object.
  4. Combines results to compute reported metrics.
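
The reported metric names are defined by cadtestbench itself; as a rough illustration of step 4, the all-tests-must-pass rule can be aggregated like this (function names are ours, not the tool's):

# Illustrative aggregation only; cadtestbench defines the actual metrics.
def sample_passes(test_results: list[bool]) -> bool:
    # A generated model counts as correct only if every cadtest passes.
    return bool(test_results) and all(test_results)

def suite_pass_rate(per_sample_results: list[list[bool]]) -> float:
    # Fraction of benchmark samples whose full cadtest suite passes.
    passed = sum(sample_passes(r) for r in per_sample_results)
    return passed / len(per_sample_results)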

Hugging Face

CADTestBench is loaded from dimitrismallis/CADTestBench by default.

License

MIT — see LICENSE.
