Code for our paper "Learning to Generate Unit Tests for Automated Debugging"

Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal

[Figure: motivation for UTGen]

Overview

This repository contains the code for our paper Learning to Generate Unit Tests for Automated Debugging. We present UTGen, a data curation and training method for teaching models to generate unit tests (inputs and outputs for a given function), and UTDebug, a debugging pipeline that uses generated unit tests for automated code debugging with LLMs. In this repo, we provide the code for UTDebug, which evaluates unit tests extrinsically, and a script to evaluate attack rate, output accuracy, and acc $\cap$ attack on three debugging datasets: HE+Fix, MBPP+Fix, and MBPP+Fix (Hard).

[Figure: UTDebug motivation and overview]

Dependencies

This project is built on Python 3.10.11. All dependencies can be installed via:

pip install -r requirements.txt
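
Optionally, set up an isolated environment first. A minimal sketch (the environment name utgen-env is a placeholder; any Python 3.10 installation should work):

python3.10 -m venv utgen-env
source utgen-env/bin/activate
pip install -r requirements.txt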

Scripts and Running UTDebug

For Intrinsic Evaluation

  • Run he_intrinsic_eval.py for HE+Fix and mbpp_intrinsic_eval.py for MBPP+Fix and MBPP+Fix (Hard); for the latter, additionally pass --hard. When evaluating off-the-shelf models, pass --eval-base; otherwise, provide a trained LoRA checkpoint directory via --ckpt-dir. For randomly-sampled UTs, pass --random. Examples follow below.

Example for evaluating a trained Qwen2.5 model on MBPP+Fix (Hard), as done in the paper:

python mbpp_intrinsic_eval.py --model qwen --ckpt-dir <path to ckpt> --hard --use-temp --num-units 3

Example for evaluating randomly-sampled UTs from Qwen2.5 on MBPP+Fix (Hard), as done in the paper:

python mbpp_intrinsic_eval.py --model qwen --eval-base --random --hard --use-temp --num-units 3
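
For HE+Fix, the invocation is analogous. A sketch, assuming he_intrinsic_eval.py accepts the same flags apart from --hard:

python he_intrinsic_eval.py --model qwen --ckpt-dir <path to ckpt> --use-temp --num-units 3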

For Running UTDebug

The UT generation method is selected via the --unit-mode argument: use --unit-mode joint_sc for prompted failing UTs, --unit-mode train_joint_sc for UTGen or any other trained model, and --unit-mode random_joint_sc for randomly-sampled UTs. Note that, by default, round 0 evaluates the given erroneous code from the dataset, so set --max-turns to <number of debug rounds> + 1.

Example for running UTDebug with a UTGen-trained model on MBPP+Fix (Hard) with 3 UTs and 3 debug rounds (hence --max-turns 4), as done in the paper:

python mbpp_utdebug.py --model qwen --ckpt-dir <path to ckpt> --units 3 --max-turns 4 --backtrack --unit-mode train_joint_sc --dataset mbpp_plus_fix_hard
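
For the randomly-sampled UT baseline, a hypothetical invocation (assuming mbpp_utdebug.py also accepts --eval-base for off-the-shelf models, as the intrinsic-eval scripts do) would be:

python mbpp_utdebug.py --model qwen --eval-base --units 3 --max-turns 4 --backtrack --unit-mode random_joint_sc --dataset mbpp_plus_fix_hard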

The no-UT-feedback baseline is implemented in the no_ut_{he/mbpp}.py files and takes only the --dataset, --model, and --max-turns arguments.
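
For instance, a sketch of a baseline run on MBPP+Fix (Hard), assuming the same dataset identifier and model flag values as above:

python no_ut_mbpp.py --model qwen --dataset mbpp_plus_fix_hard --max-turns 4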

Reference

Please cite our paper as:

@article{prasad2025unit,
  title   = {Learning to Generate Unit Tests for Automated Debugging},
  author  = {Prasad, Archiki and Stengel-Eskin, Elias and Chen, Justin Chih-Yao and Khan, Zaid and Bansal, Mohit},
  journal = {arXiv preprint arXiv:2502.01619},
  year    = {2025}
}
