- [2026.01] Code and dataset released!
The rise of reasoning models necessitates large-scale verifiable data, for which programming tasks serve as an ideal source. However, while competitive programming platforms provide abundant problems and solutions, high-quality test cases for verification remain scarce.
CodeContests-O addresses this challenge with a novel Feedback-Driven Iterative Framework. Unlike existing approaches that rely solely on an LLM's intrinsic generation capabilities, our method does the following (a pseudocode sketch of the loop appears after the list):
- Leverages execution feedback from both correct and incorrect solutions
- Iteratively refines test cases toward high fidelity and discriminability
- Supports both refining existing generators and creating new ones from scratch
- Achieves significant improvements in test case quality
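At a high level, each problem goes through a generate-execute-refine loop. The following is a minimal sketch of that loop; `llm`, `sandbox`, and their methods are hypothetical placeholders for illustration, not the actual `codecontests_o` API:

```python
# Minimal sketch of the feedback-driven loop. `llm` and `sandbox` are
# hypothetical placeholders, not the actual codecontests_o API.
def refine_test_cases(llm, sandbox, problem, max_iterations=3):
    """Iteratively refine test cases until they separate correct from incorrect solutions."""
    generator = problem.generator  # may be None: start from scratch
    tests = []
    for _ in range(max_iterations):
        # 1. Ask the LLM to write (or revise) a testlib-based generator and produce test cases.
        generator = llm.propose_generator(problem, previous=generator)
        tests = sandbox.run_generator(generator)

        # 2. Execute known-correct and known-incorrect solutions against the generated tests.
        accepted_correct = [sandbox.accepts(sol, tests) for sol in problem.correct_solutions]
        rejected_incorrect = [not sandbox.accepts(sol, tests) for sol in problem.incorrect_solutions]

        # 3. Stop once the tests accept every correct solution and reject every incorrect one.
        if all(accepted_correct) and all(rejected_incorrect):
            break

        # 4. Otherwise, turn the failures into execution feedback for the next LLM iteration.
        llm.record_feedback(problem, tests, accepted_correct, rejected_incorrect)
    return generator, tests
```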
| Feature | Description |
|---|---|
| Feedback-Driven | Utilizes execution results as feedback to guide the LLM in refining test cases |
| High Quality | 89.37% TPR & 90.89% TNR on 11M+ solutions |
| Training Effective | +9.52% improvement on LiveCodeBench after fine-tuning |
| Extensible | Easily adaptable to other competitive programming datasets |
| HuggingFace Ready | Direct integration with HuggingFace Datasets |
| Generator Flexible | Works with existing generators or creates new ones from scratch via LLM |
| Resumable | Both generation and evaluation support checkpoint resume after interruption |
| Dataset | TPR (↑) | TNR (↑) | Avg (↑) |
|---|---|---|---|
| CodeContests | 85.05% | 81.52% | 83.29% |
| CodeContests+ | 79.00% | 83.04% | 81.02% |
| CodeContests-O (Ours) | 89.37% | 90.89% | 90.13% |
Evaluated on 11M+ solutions from the complete solution pool
```bash
# Clone the repository
git clone https://github.com/cai-jianfeng/CodeContests-O.git
cd CodeContests-O

# Install dependencies
pip install -e .
```

Requirements:
- Python ≥ 3.8
- openai ≥ 1.0.0
- pydantic ≥ 2.0.0
- requests ≥ 2.28.0
- tqdm ≥ 4.64.0
- datasets ≥ 2.0.0
Before running the framework, you need to download `testlib.h`, a widely used library for competitive programming test generation:
```bash
# Download testlib.h to your working directory
wget https://raw.githubusercontent.com/MikeMirzayanov/testlib/master/testlib.h

# Or using curl
curl -O https://raw.githubusercontent.com/MikeMirzayanov/testlib/master/testlib.h
```

Note: `testlib.h` is required for compiling and running test case generators. Make sure it is accessible via the `--testlib_path` argument (defaults to `./testlib.h` in the current directory).

Important: If you want to run the generators/checkers from our CodeContests-O dataset, please use the simplified `testlib.h` provided in our HuggingFace repository instead of the official version. The simplified version is optimized for compatibility with our generated code.
For more information about testlib, visit the official repository.
To support our feedback-driven framework, we have extended ByteDance's SandboxFusion with additional features for competitive programming test case generation. Our enhanced version is available at: cai-jianfeng/SandboxFusion
```bash
# Clone and set up the enhanced sandbox
git clone https://github.com/cai-jianfeng/SandboxFusion.git
cd SandboxFusion

# Follow the setup instructions in the repository
```

Please refer to the SandboxFusion README for detailed setup and configuration instructions.
Run on a HuggingFace dataset:

```bash
python -m codecontests_o.main \
  --data_path ByteDance-Seed/Code-Contests-Plus \
  --results_dir ./results \
  --api_key $OPENAI_API_KEY \
  --sandbox_hosts localhost \
  --testlib_path ./testlib.h
```

Run on a local JSON dataset:

```bash
python -m codecontests_o.main \
  --data_path /path/to/codecontests/json \
  --results_dir ./results \
  --api_key $OPENAI_API_KEY \
  --sandbox_hosts localhost \
  --testlib_path ./testlib.h
```

Preset configurations:

```bash
# Development (low parallelism, debug mode)
python -m codecontests_o.main --preset development --data_path ./data

# Production (high parallelism)
python -m codecontests_o.main --preset production --data_path ./data

# Quick test (generate only, skip validation)
python -m codecontests_o.main --preset quick --data_path ./data
```

Python API:

```python
from codecontests_o import Config, CodeContestsReader, ParallelProcessor, get_preset_config
import base64
# 1. Setup configuration
config = Config.from_dict(get_preset_config("development"))
config.openai.api_key = "YOUR_API_KEY" # Replace with your actual OpenAI API key
config.dataset.data_path = "ByteDance-Seed/Code-Contests-Plus"
config.dataset.results_dir = "./results"
# 2. Load testlib.h
with open("testlib.h", "rb") as f:
testlib_files = {"testlib.h": base64.b64encode(f.read()).decode()}
# 3. Create dataset reader (auto-detects HuggingFace vs local)
dataset = CodeContestsReader(data_path=config.dataset.data_path, split="test", start=0, end=10)
# 4. Run generation
processor = ParallelProcessor(config=config, testlib_files=testlib_files)
stats = processor.process_dataset(dataset, config.dataset.results_dir)
print(f"Completed: {stats['completed']}/{stats['total']}")
```

Evaluate solution performance on datasets to analyze their quality (e.g. CodeContests-O, CodeContests+, CodeContests):
```bash
# Example: Evaluate on CodeContests-O (default)
# Note: deepmind/code_contests is needed to fetch solutions, since CodeContests-O does not store them redundantly
python -m codecontests_o.solutions_eval \
  --data_path caijanfeng/CodeContests-O \
  --codecontests_path deepmind/code_contests \
  --results_dir ./results_eval \
  --start 0 --end 100

# Analyze results (TPR/TNR)
python -m codecontests_o.analyze_results --results_dir ./results_eval
```

Note: The `solutions_eval` script uses `caijanfeng/CodeContests-O` as the default test dataset. You can pass other parameters such as `--data_path` and `--split` to evaluate other datasets.
Metrics Explanation
| Metric | Description |
|---|---|
| TPR (True Positive Rate) | Proportion of correct solutions identified as correct (↑ better) |
| TNR (True Negative Rate) | Proportion of incorrect solutions identified as incorrect (↑ better) |
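For concreteness, here is a toy computation of both rates from per-solution verdicts (illustrative only; the actual evaluation output format may differ):

```python
# Verdicts are booleans meaning "the test cases accepted this solution".
# Correct solutions should be accepted, incorrect solutions should be rejected.
correct_verdicts = [True, True, True, False]   # 3 of 4 correct solutions accepted
incorrect_verdicts = [False, False, True]      # 2 of 3 incorrect solutions rejected

tpr = sum(correct_verdicts) / len(correct_verdicts)                      # 0.75
tnr = sum(not v for v in incorrect_verdicts) / len(incorrect_verdicts)   # ~0.67
print(f"TPR={tpr:.2%}, TNR={tnr:.2%}, Avg={(tpr + tnr) / 2:.2%}")
```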
Easily integrate your own datasets by implementing the `DatasetReader` interface:
```python
import json

from codecontests_o.data import DatasetReader, Sample, Solution, TestCase, Language


class MyDatasetReader(DatasetReader):
    def __init__(self, data_path: str, start: int = 0, end: int = -1):
        self.data_path = data_path
        self.start = start
        self.end = end
        self._samples = self._load_data()

    def _load_data(self):
        # Example: load a JSON list of problems and slice it with start/end
        with open(self.data_path) as f:
            items = json.load(f)
        items = items[self.start:] if self.end == -1 else items[self.start:self.end]

        samples = []
        for item in items:
            sample = Sample(
                id=item['id'],
                name=item['name'],
                description=item['description'],
                # Optional: C++ generator using testlib.h
                # - If provided: the framework iteratively refines it based on feedback
                # - If None: the framework generates a new generator from scratch via LLM
                generator=item.get('generator_code'),
                canonical_solutions=[Solution(code=item['solution'], language=Language.PYTHON)],
                correct_solutions=[Solution(code=item['solution'], language=Language.PYTHON)],
                incorrect_solutions=[Solution(code=item['wrong_solution'], language=Language.PYTHON)],
                test_cases=[
                    # Required if using solutions_eval to evaluate existing test cases
                    TestCase(input="1\n", output="2\n")
                ],
            )
            samples.append(sample)
        return samples

    def __iter__(self):
        yield from self._samples

    def __len__(self):
        return len(self._samples)

    @property
    def name(self):
        return "MyDataset"
```

Then run your reader through the CLI:

```bash
python -m codecontests_o.main --custom_reader my_reader.py --data_path /path/to/data
python -m codecontests_o.solutions_eval --custom_reader my_reader.py --data_path /path/to/data --results_dir ./results_eval
```

OpenAI Configuration
| Option | Default | Description |
|---|---|---|
| `api_base` | `https://api.openai.com/v1` | API base URL |
| `api_key` | - | API key |
| `model` | `gpt-4o` | Model name |
| `max_tokens` | `8000` | Maximum tokens |
Sandbox Configuration

| Option | Default | Description |
|---|---|---|
| `hosts` | `["localhost"]` | Sandbox hosts |
| `base_port` | `8080` | Base port |
| `port_range` | `4` | Ports per host |
| `compile_timeout` | `20` | Compilation timeout (s) |
| `run_timeout` | `20` | Execution timeout (s) |
Processing Configuration

| Option | Default | Description |
|---|---|---|
| `max_iterations` | `3` | Max iterations per sample |
| `sample_level_workers` | `4` | Sample-level parallelism |
| `output_generation_workers` | `4` | Output generation parallelism |
| `solution_validation_workers` | `4` | Validation parallelism |
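These options can also be set programmatically on a `Config` object. The snippet below is a sketch: it assumes the sections map to `config.openai`, `config.sandbox`, and `config.processing` attributes whose field names mirror the tables above; check the `config/` module for the exact structure.

```python
from codecontests_o import Config, get_preset_config

config = Config.from_dict(get_preset_config("production"))

# OpenAI settings (attribute names assumed to mirror the table above)
config.openai.model = "gpt-4o"
config.openai.max_tokens = 8000

# Sandbox settings (assumed attribute names)
config.sandbox.hosts = ["localhost"]
config.sandbox.base_port = 8080
config.sandbox.run_timeout = 20

# Processing settings (assumed attribute names)
config.processing.max_iterations = 3
config.processing.sample_level_workers = 4
```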
Project structure:
```
codecontests_o/
├── pyproject.toml           # Build configuration
├── setup.py                 # Installation script
├── src/codecontests_o/      # Source code
│   ├── main.py              # Main entry point
│   ├── solutions_eval.py    # Solution evaluation
│   ├── analyze_results.py   # Result analysis
│   ├── config/              # Configuration management
│   ├── data/                # Data processing & readers
│   ├── clients/             # OpenAI & Sandbox clients
│   ├── core/                # Generator & Validator
│   ├── parallel/            # Parallel processing
│   ├── prompts/             # LLM prompt templates
│   └── utils/               # Utilities & logging
└── README.md
```
- Core feedback-driven iterative framework
- Support for existing generator refinement
- Generator creation from scratch via LLM
- HuggingFace Datasets integration
- Custom dataset reader interface
- Solution evaluation module (TPR/TNR analysis)
- Multi-level parallel processing
- Checker code co-generation and iterative refinement
- Release filtered correct/incorrect solutions dataset
- Support for more programming languages (currently C++/Python/Java)
- Distributed sandbox execution across multiple nodes
Contributing: PRs are welcome! Feel free to open an issue to discuss new features or improvements.
If you find this work useful, please cite our paper:
```bibtex
@misc{cai2026codecontestsopoweringllmsfeedbackdriven,
      title={CodeContests-O: Powering LLMs via Feedback-Driven Iterative Test Case Generation},
      author={Jianfeng Cai and Jinhua Zhu and Ruopei Sun and Kangwen Zhao and Dongyun Xue and Mingxiao Feng and Wengang Zhou and Houqiang Li},
      year={2026},
      eprint={2601.13682},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.13682},
}
```

- CodeContests by DeepMind and CodeContests+ by ByteDance
- testlib.h by Mike Mirzayanov
For questions or issues, please:
- Open an Issue
- Star this repo if you find it helpful!
