CodeFuse-CommitEval is the first benchmark tailored to commit Message-Code Inconsistency (MCI) detection with large language models (LLMs). Building on the ApacheCM dataset for its diversity and quality, we synthesize seven types of inconsistent messages via rule-guided mutations of originally consistent commits, and apply two-fold validation to verify both positive (inconsistent) and negative (consistent) samples. Using this rich, labeled dataset of message–diff pairs, we then evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought (CoT), and extended context.
- Multilingual & large-scale dataset
- Even distribution of samples
- Rich inconsistent commit types
- Modular commit mutation rules
- Effective verification for synthesized samples
- ApacheCM Dataset: from "Contextual Code Retrieval for Commit Message Generation: A Preliminary Study"
Run under Python 3.9.6 and install the dependencies:

```bash
python3 -m pip install langchain langchain_openai langchain_community
```

First, download all the repositories needed for contextual code retrieval:

```bash
python3 evaluation/clone_repos.py <dataset_json_path> <repo_collection_path>
```
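As a purely illustrative example (the paths below are placeholders, not files shipped with this repository), the command might be invoked as:

```bash
# Hypothetical paths: point the first argument at your copy of the dataset JSON
# and the second at a directory where the repositories will be cloned.
python3 evaluation/clone_repos.py data/apachecm_mci.json repos/
```

The second argument is the repository collection directory; the same path is later passed to the evaluator via `-r`.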
Then, deploy the targeted models yourself, or use public APIs (a deployment sketch follows the list). In our paper, we evaluated the following models:

- DeepSeek-V3.1 (Remote API)
- gpt-oss-20b (Local deployment)
- Qwen3-30B-A3B (Local deployment)
- Llama-3.1-8B (Local deployment)
- Mistral-Small-3.2-24B (Local deployment)
- Kimi-K2-Instruct (Remote API)
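The repository does not prescribe a serving stack; as one possible setup (an assumption, not a project requirement), the locally deployed models can be exposed through an OpenAI-compatible endpoint, for example with vLLM:

```bash
# Example only: serve one of the local models behind an OpenAI-compatible API.
# vLLM is one option; any OpenAI-compatible server works.
pip install vllm
vllm serve Qwen/Qwen3-30B-A3B --port 8000
# The endpoint http://localhost:8000/v1 can then be passed as --api-base below.
```

Remote-API models (DeepSeek-V3.1, Kimi-K2-Instruct) only need the provider's base URL and API key.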
Run benchmarking:
```bash
python3 evaluation/evaluate_main.py \
  -s {pure_llm,fewshot_llm,cot_llm} \
  --ctx <context_code_lines> \
  -d <dataset_json_path> \
  -r <repo_collection_path> \
  --api_key <api_key> \
  --api-base <base_url> \
  --model <model_name> \
  -o <output_json_path> \
  --worker <concurrent_worker_num>
```
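For illustration, a fully filled-in invocation might look like the following; every value is a placeholder, and the base URL assumes the local vLLM endpoint sketched above:

```bash
# Illustrative values only: adjust paths, strategy, model name, key, and worker count.
# Local OpenAI-compatible servers typically accept any placeholder API key.
python3 evaluation/evaluate_main.py \
  -s cot_llm \
  --ctx 20 \
  -d data/apachecm_mci.json \
  -r repos/ \
  --api_key EMPTY \
  --api-base http://localhost:8000/v1 \
  --model Qwen/Qwen3-30B-A3B \
  -o results/qwen3_cot.json \
  --worker 8
```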
We welcome and encourage contributions from the community! If you're interested in contributing to this project, please follow these guidelines:

- Identify a Need: Before submitting a pull request (PR), ensure that your contribution addresses a real need or improvement for the project.
- Submit a PR: Create a pull request with a clear description of:
  - The problem or feature request you're addressing
  - How your changes solve the problem or implement the feature
  - Any relevant test cases or documentation updates
- Review Process: Our team will review your PR based on:
  - Whether the contribution addresses a genuine need for the project
  - The quality and correctness of the implementation
  - Adherence to the project's coding standards and architecture
We appreciate your interest in making CodeFuse-CommitEval better.
```bibtex
@misc{zhang2025codefusecommitevalbenchmarkingllmspower,
      title={CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection},
      author={Qingyu Zhang and Puzhuo Liu and Peng Di and Chenxiong Qian},
      year={2025},
      eprint={2511.19875},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.19875},
}
```
CodeFuse-CommitEval is licensed under the Apache License 2.0.
