CodeFuse-CommitEval

CodeFuse-CommitEval is the first benchmark tailored to commit Message-Code Inconsistency (MCI) detection with large language models (LLMs). Building on the ApacheCM dataset for diversity and quality, we synthesize seven types of inconsistent messages via rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive (inconsistent) and negative (consistent) samples. Using this rich and labeled dataset of message–diff pairs, we then evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought (CoT), and extended context.
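
For illustration, each benchmark entry can be pictured as a message–diff pair plus a label. The sketch below is a hypothetical Python structure; the field names and values are illustrative placeholders and do not reflect the actual dataset schema.

# Hypothetical sketch of one labeled message-diff pair (illustrative fields only,
# not the real CodeFuse-CommitEval schema).
sample = {
    "repo": "apache/example-project",          # repository the commit came from (placeholder)
    "commit_message": "Fix null check in request handler",
    "diff": "--- a/Handler.java\n+++ b/Handler.java\n@@ ...",
    "label": "inconsistent",                    # "consistent" or "inconsistent"
    "mutation_type": "rule_1",                  # one of the seven rule-guided mutation types (placeholder name)
}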

Features

  • Multilingual & large-scale dataset
  • Even distribution of samples
  • Rich inconsistent commit types
  • Modular commit mutation rules
  • Effective verification for synthesized samples

Related Project

  • ApacheCM Dataset - Contextual Code Retrieval for Commit Message Generation: A Preliminary Study

Documentation

Environment Setup

Use Python 3.9.6 and install the dependencies:

python3 -m pip install langchain langchain_openai langchain_community

Benchmarking

First, download all the repositories needed for contextual code retrieval:

python3 evaluation/clone_repos.py <dataset_json_path> <repo_collection_path>
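
For example, with placeholder paths (neither the JSON file nor the directory name is shipped with the repository):

python3 evaluation/clone_repos.py data/dataset.json repos/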

Then, deploy the target models yourself or use public APIs. In our paper, we evaluated the following models:

  • DeepSeek-V3.1 (Remote API)
  • gpt-oss-20b (Local deployment)
  • Qwen3-30B-A3B (Local deployment)
  • Llama-3.1-8B (Local deployment)
  • Mistral-Small-3.2-24B (Local deployment)
  • Kimi-K2-Instruct (Remote API)

Run the benchmark:

python3 evaluation/evaluate_main.py \
    -s {pure_llm,fewshot_llm,cot_llm} \
    --ctx <context_code_lines> \
    -d <dataset_json_path> \
    -r <repo_collection_path> \
    --api_key <api_key> \
    --api-base <base_url> \
    --model <model_name> \
    -o <output_json_path> \
    --worker <concurrent_worker_num>
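
For example, a chain-of-thought run against a remote API might look like the following; the model name, endpoint, paths, context size, and worker count are placeholders, not recommended settings:

python3 evaluation/evaluate_main.py \
    -s cot_llm \
    --ctx 10 \
    -d data/dataset.json \
    -r repos/ \
    --api_key $YOUR_API_KEY \
    --api-base https://api.example.com/v1 \
    --model deepseek-chat \
    -o results/cot_deepseek.json \
    --worker 8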

Contribution

We welcome and encourage contributions from the community! If you're interested in contributing to this project, please follow these guidelines:

  1. Identify a Need: Before submitting a pull request (PR), ensure that your contribution addresses a real need or improvement for the project.

  2. Submit a PR: Create a pull request with a clear description of:

    • The problem or feature request you're addressing
    • How your changes solve the problem or implement the feature
    • Any relevant test cases or documentation updates

  3. Review Process: Our team will review your PR based on:

    • Whether the contribution addresses a genuine need for the project
    • The quality and correctness of the implementation
    • Adherence to the project's coding standards and architecture

We appreciate your interest in making CodeFuse-CommitEval better.

Citation

@misc{zhang2025codefusecommitevalbenchmarkingllmspower,
      title={CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection}, 
      author={Qingyu Zhang and Puzhuo Liu and Peng Di and Chenxiong Qian},
      year={2025},
      eprint={2511.19875},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.19875}, 
}

License

CodeFuse-CommitEval is licensed under the Apache License 2.0.
