[CVPR 2026] Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
GeoCodeBench is the first PhD-level benchmark designed to evaluate how well LLMs understand and implement complex 3D geometric vision code from scientific papers. Each problem is a fill-in-the-function implementation task curated from representative papers at recent top-tier venues. To solve a task, an LLM is given the structured full paper content, the partially masked source code, and a standardized execution template. Performance is measured with diverse unit tests that probe edge cases.
The benchmark organizes coding tasks into a two-level capability hierarchy:
- General 3D Capability: Evaluates foundational geometric reasoning, including Geometric Transformations (e.g., coordinate conversions, projections) and Mechanics/Optics Formulation (e.g., radiometric operators, differentiable equations).
- Research Capability: Evaluates higher-level procedural reasoning, including Novel Algorithm Implementation (translating mathematical principles into code) and Geometric Logic Routing (composing existing operators into complete pipelines).
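To make the "Geometric Transformations" category concrete, here is a minimal illustrative sketch of the kind of operation such tasks cover: a pinhole projection of camera-frame 3D points to pixel coordinates. This example is not taken from the benchmark; the function name and intrinsics are invented for illustration.

```python
import numpy as np

def project_points(points_3d, K):
    """Project Nx3 camera-frame points to pixels with a 3x3 intrinsics matrix K."""
    pts = np.asarray(points_3d, dtype=float)
    proj = (K @ pts.T).T               # homogeneous pixel coordinates, shape (N, 3)
    return proj[:, :2] / proj[:, 2:3]  # perspective divide

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_points([[0.0, 0.0, 2.0]], K))  # point on the optical axis → [[320. 240.]]
```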
.
├── GeoCodeBench/ # all benchmark tasks
│ ├── NN_ShortName/ # one directory per paper / task
│ │ ├── questions/
│ │ │ ├── code*.py # masked source (fill-in-the-function)
│ │ │ ├── template*.py # execution template for unit tests
│ │ │ ├── paper-full.json # structured full paper
│ │ │ └── paper-method.json # Intro–Method sections only
│ │ ├── answers/ # raw LLM outputs (.txt)
│ │ └── unittest*/ # unittest1, unittest2, … (per sub-problem)
│ │ ├── test_generator.py
│ │ ├── test_runner.py
│ │ ├── reference_implementation.py
│ │ └── llm_implementations/ # extracted runnable .py files
│ └── repos.csv # task metadata (venue, type, …)
├── batch_generate_answers.py # batch LLM inference
├── extract_code_from_answers.py # extract code from answers → llm_implementations/
├── unittest.sh # run all unit tests
├── unittest_summary_to_csv.py # summarize pass rates (overall)
└── unittest_summary_by_type.py # summarize pass rates (by question type)
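The `repos.csv` file holds per-task metadata (venue, type, …). Its exact column names are not documented here, so the sketch below parses a hypothetical sample with `csv.DictReader` to show how the metadata might be grouped; swap in the real file and column names when using the repo.

```python
import csv
import io

# Hypothetical repos.csv content: real column names and values may differ.
sample = """repo_id,short_name,venue,type
01,NeRF,CVPR,Novel Algorithm Implementation
02,GaussianSplat,ICCV,Geometric Transformations
"""

rows = list(csv.DictReader(io.StringIO(sample)))
by_venue = {}
for r in rows:
    by_venue.setdefault(r["venue"], []).append(r["short_name"])
print(by_venue)  # {'CVPR': ['NeRF'], 'ICCV': ['GaussianSplat']}
```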
Create and activate the environment:
conda env create -f environment.yml
conda activate geocodebench
Create .env from .env.example:
cp .env.example .env
Set LLM_API_KEY and LLM_BASE_URL.
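A minimal sketch of checking that both variables are visible to the scripts, assuming they are exported into the process environment (the repo may instead load `.env` with a helper such as python-dotenv):

```python
import os

# Read the two credentials the README says the scripts need.
api_key = os.environ.get("LLM_API_KEY", "")
base_url = os.environ.get("LLM_BASE_URL", "")
ready = bool(api_key and base_url)
print("credentials configured:", ready)
```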
Use the provided script:
bash generation.sh
Change the following args:
- --models sets which LLM(s) to use;
- --suffixes picks the paper prompt mode (paper-full, paper-method, or nopaper);
- --repo-ids optionally limits which benchmark repos to run (e.g. 01,03,10-12).
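A --repo-ids spec mixes single IDs and ranges. The helper below is a hypothetical sketch of how such a spec could be expanded; the actual parser inside batch_generate_answers.py may differ.

```python
def expand_repo_ids(spec):
    """Expand a spec like '01,03,10-12' into ['01', '03', '10', '11', '12']."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. '01'
            ids.extend(str(i).zfill(width) for i in range(int(lo), int(hi) + 1))
        else:
            ids.append(part)
    return ids

print(expand_repo_ids("01,03,10-12"))  # ['01', '03', '10', '11', '12']
```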
For example:
python3 batch_generate_answers.py \
--models gpt-5.4,claude-opus-4-6,gemini-3.1-pro-preview \
--suffixes paper-full,paper-method,nopaper \
--repo-ids 01-03 \
--max-workers 10 \
    --skip-existing
LLM answers will be stored in answers/answer_{idx}/{model}-{suffix}.txt.
python3 extract_code_from_answers.py
This script:
- reads answers/answer_*/{model}-{suffix}.txt,
- extracts content inside <answering>...</answering> if available (falls back to full text otherwise),
- writes runnable code into unittest*/llm_implementations/.
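The extraction rule described above can be sketched with a short regex helper; this is an illustration of the stated behavior, not the repo's actual implementation.

```python
import re

def extract_answer(text):
    """Return the content inside <answering>...</answering>,
    falling back to the full text when the tags are absent."""
    m = re.search(r"<answering>(.*?)</answering>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else text.strip()

print(extract_answer("<answering>def f():\n    return 1</answering>"))
print(extract_answer("no tags here"))  # falls back to the full text
```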
bash unittest.sh
By default, each unittest runs with --num-tests 10.
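Pass rates are then averaged over these per-problem runs. The sketch below shows one plausible way such a summary could be computed from per-implementation results; the row fields and model name are hypothetical, and the real logic lives in the summary scripts below.

```python
# Hypothetical per-implementation results (each run uses 10 tests by default).
results = [
    {"model": "model-a", "type": "Geometric Transformations", "passed": 8, "total": 10},
    {"model": "model-a", "type": "Novel Algorithm Implementation", "passed": 5, "total": 10},
]

overall = sum(r["passed"] for r in results) / sum(r["total"] for r in results)
print(f"overall pass rate: {overall:.2f}")  # 0.65
```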
Overall implementation average pass rate:
python3 unittest_summary_to_csv.py
Average pass rate grouped by question type:
python3 unittest_summary_by_type.py
If GeoCodeBench helps your research, please cite:
@article{li2026benchmarking,
title={Benchmarking PhD-Level Coding in 3D Geometric Computer Vision},
  author={Li, Wenyi and Luo, Renkai and Yu, Yue and Gao, Huan-ang and Gao, Mingju and Yuan, Li and Fu, Chaoyou and Zhao, Hao},
journal={arXiv preprint arXiv:2603.30038},
year={2026},
}