We introduce SRank, a novel reranking strategy for selecting the best solution from code generation, which focuses on modeling inter-cluster relationships. By quantifying the functional overlap between clusters, our approach provides a better ranking strategy for code solutions. Empirical results show that our method achieves remarkable pass@1 scores. For instance, on the HumanEval benchmark, we achieve 69.66% pass@1 with Codex002, 75.31% with WizardCoder, 53.99% with StarCoder, and 60.55% with CodeGen, surpassing state-of-the-art solution ranking methods such as CodeT and Coder-Reviewer on the same CodeLLMs by a significant margin (≈6.1% improvement on average). Compared to random sampling, we achieve an average improvement of ≈23.07% on HumanEval and 17.64% on MBPP. Even in scenarios with limited test inputs, our approach demonstrates robustness and superiority, marking a new state of the art in code generation reranking.
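As a rough illustration of the idea (not the exact implementation in this repo), solutions can be grouped into clusters by their execution outputs on a shared set of test inputs, and each cluster scored by its size-weighted functional overlap with the other clusters. All function names and the exact scoring rule below are assumptions for illustration only:

```python
# Illustrative sketch of cluster-based reranking via functional overlap.
# Names and the scoring rule are assumptions, not this repo's API.
from collections import defaultdict

def cluster_solutions(solutions, exec_outputs):
    """Group solutions whose outputs agree on every shared test input."""
    clusters = defaultdict(list)
    for sol, outs in zip(solutions, exec_outputs):
        clusters[tuple(outs)].append(sol)
    return list(clusters.items())  # [(output_signature, member_solutions), ...]

def functional_overlap(sig_a, sig_b):
    """Fraction of test inputs on which two clusters produce the same output."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)

def rerank(solutions, exec_outputs):
    """Order solutions so that members of the highest-scoring cluster come first."""
    clusters = cluster_solutions(solutions, exec_outputs)
    scored = []
    for i, (sig_i, members_i) in enumerate(clusters):
        # Score a cluster by its size-weighted overlap with all other clusters.
        score = sum(
            len(members_j) * functional_overlap(sig_i, sig_j)
            for j, (sig_j, members_j) in enumerate(clusters)
            if j != i
        )
        scored.append((score, members_i))
    scored.sort(key=lambda x: -x[0])
    return [sol for _, members in scored for sol in members]
```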
- All experiments are run with python==3.9.17.
- Install pyminifier from source. Installing pyminifier requires reverting setuptools to an older version (pip install setuptools==57.5.0). For other issues when installing pyminifier, check out their issues for potential fixes.
- Install human-eval from source.
- Install the other packages with pip install -r requirements.txt.
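After installing, a quick sanity check that the key dependencies import correctly (module names follow each package's usual layout and are assumptions here):

```python
# Quick environment check; module names are assumed from each package's usual layout.
import sys
print(sys.version)  # expect 3.9.17

import pyminifier                          # installed from source (needs setuptools==57.5.0)
from human_eval.data import read_problems  # human-eval installed from source

problems = read_problems()                 # HumanEval problems bundled with the package
print(f"Loaded {len(problems)} HumanEval problems")
```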
Available models:
- wizardcoder34B
- wizardcoder15B
- codegen25
- starcoder
- davinci002
- codegen16B
Available datasets:
- humaneval
- mbpp
- apps
The processed results will be saved at the following locations with pre-defined file names:
- Post-processed code solutions:
generation/gen_code/preds/${dataset}/${model}/postprocessed_T${temperature}_N${num_samples}.jsonl
- Post-processed test cases:
generation/gen_test/preds/${dataset}/${model}/postprocessed_T${temperature}_N${num_samples}.jsonl
- Execution results:
execution/results/${dataset}/${model}/T${temperature}_N${num_samples}/*
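To inspect the post-processed solutions, a small loader along these lines should work. The path below is assembled from the pattern above for one model/dataset pair, and the record fields are repo-specific, so inspect the keys rather than assuming a schema:

```python
import json

path = "generation/gen_code/preds/humaneval/wizardcoder/postprocessed_T0.8_N100.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} post-processed solutions")
print(records[0].keys())  # inspect the available fields rather than assuming them
```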
cd generation/gen_code/sh
./run.sh ${device_ids} ${model} ${dataset} ${max_sequence_length} ${number_of_sequences} ${running_script}
For example, running wizardcoder on humaneval:
cd generation/gen_code/sh
./run.sh 0,1,2,3 wizardcoder humaneval 2048 8 wizardcoder.py
Results are saved to generation/gen_code/preds/${dataset}/${model}/T${temperature}_N${num_samples}/
cd generation/gen_test/sh
./run.sh ${device_ids} ${model} ${dataset} ${max_sequence_length} ${number_of_sequences} ${running_script}
For example, running wizardcoder on humaneval:
cd generation/gen_test/sh
./run.sh 0,1,2,3 wizardcoder humaneval 2048 8 wizardcoder.py
Results are saved to generation/gen_test/preds/${dataset}/${model}/T${temperature}_N${num_samples}/
cd generation/gen_code/sh
./postprocess.sh ${model} ${dataset}
Results are saved to generation/gen_code/preds/${dataset}/${model}/postprocessed_T${temperature}_N${num_samples}.jsonl
cd generation/gen_test/sh
./postprocess.sh ${model} ${dataset}
Results are saved to generation/gen_test/preds/${dataset}/${model}/postprocessed_T${temperature}_N${num_samples}.jsonl
cd execution/sh
./run.sh ${model} ${dataset}
Execution results are saved to execution/results/${dataset}/${model}/T${temperature}_N${num_samples}/. The folder contains the following files:
- ground_truth_exec_result.pkl: Execution results of code solutions on ground-truth test cases, as provided by the benchmark datasets.
- model_generated_test_cases.pkl: Processed model-generated test cases, excluding those with syntactic and partially semantic inaccuracies.
- test_inputs_exec_result.pkl: Execution outputs of code solutions on model-generated test cases.
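These .pkl files can be inspected with pickle. Their internal structure is produced by this repo's execution code, so the sketch below only prints types and sizes rather than assuming a schema; the directory path is an example for one model/dataset pair:

```python
import pickle

result_dir = "execution/results/humaneval/wizardcoder/T0.8_N100"
for name in ("ground_truth_exec_result.pkl",
             "model_generated_test_cases.pkl",
             "test_inputs_exec_result.pkl"):
    with open(f"{result_dir}/{name}", "rb") as f:
        obj = pickle.load(f)
    # Structure is repo-specific; start by inspecting the type and size.
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    print(name, type(obj).__name__, size)
```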
Available reranking methods:
- attention
- random
cd reranking/sh
./run.sh ${model} ${dataset} ${temperature} ${num_samples} ${reranking_method}
For example, running reranking for wizardcoder15B on humaneval:
cd reranking/sh
./run.sh wizardcoder humaneval 0.8 100 attention
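The reported metric is pass@1 over the reranked solutions, i.e. the fraction of problems whose top-ranked solution passes the ground-truth tests. A minimal sketch of that computation (the function and argument names are illustrative, not the repo's interface):

```python
def rerank_pass_at_1(ranked_solutions, passes_ground_truth):
    """Fraction of problems whose top-ranked solution passes the hidden tests.

    ranked_solutions: {problem_id: [solution_id, ...]} with the best solution first.
    passes_ground_truth: {(problem_id, solution_id): bool} from ground-truth execution.
    """
    hits = sum(
        passes_ground_truth[(pid, ranked[0])]
        for pid, ranked in ranked_solutions.items()
        if ranked
    )
    return hits / len(ranked_solutions)
```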
This code base is adapted from