This is the replication package for the paper entitled "Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub", published at the International Conference on Evaluation and Assessment in Software Engineering (EASE 2025).
You need the Poetry package manager. The main app is built with Streamlit; see the Streamlit documentation for available extra UI components.
Then install the dependencies with `poetry install`, and activate the environment with `poetry shell` every time before running a script.
We also provide a `requirements.txt`. If you added packages to Poetry and want to update the requirements file, you can use:

```shell
poetry export --without-hashes --format=requirements.txt > requirements.txt
```
For the GitHub API you'll need to get a token (see your profile settings on GitHub).
Then create a file `.env` in the root directory of this project and add the following line:

```
API_KEY_GITHUB=<your token>
```
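The Python scripts presumably load this token from the environment (e.g. via `python-dotenv`). As an illustration of the expected file format, a minimal stdlib-only parser for such a `.env` file could look like this (`load_env` is a hypothetical helper, not part of the package):

```python
import os

def load_env(path=".env"):
    """Read simple KEY=value lines from a .env file into os.environ.
    Blank lines, comments, and lines without '=' are skipped."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: variables already set in the real environment win
            os.environ.setdefault(key.strip(), value.strip())
```

After `load_env()`, the token is available as `os.environ["API_KEY_GITHUB"]`.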
You'll also need to enter your GitHub token for the PR scraper in `config/application.properties`.
For more information on the scraper, please refer to the main repo. The code is adapted from the ETRC Tool.
Run scripts with `python -m folder.scriptname` (without `.py`).
- To pick which repositories we mine, use the SEART GitHub Search. Download the `.csv` of your desired sample and clean it up with `prepare_sampled_projects.py`:

  ```shell
  python select_repos_prs/prepare_sampled_projects.py
  ```

  We did this for our sample frame and shuffled the repositories randomly (see `selected_projects....csv`).
- Now, select a repository for which to collect data and get its "slug" (e.g. `adap/flower`). Most of the scripts are slug-based, i.e. you pick which repository to process by passing the slug as a command-line argument. For some, we also provide `run_...` scripts to run them directly for a list of projects.
- Determine which pull requests have review comments by running:

  ```shell
  python -m select_repos_prs.collect_prs_with_review_comments <repo_slug>
  python -m select_repos_prs.collect_prs_with_review_comments adap/flower
  ```

  This script queries the GitHub GraphQL API and saves the list of pull requests to `prs_with_review_comments.csv` in the corresponding repo folder under `mined_data`.
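For reference, a request for this kind of data could be shaped as follows. This is an illustrative sketch using field names from GitHub's public GraphQL schema (`pullRequests`, `reviewThreads`); the query the script actually sends may differ:

```python
def build_pr_query(owner, name, page_size=50, cursor=None):
    """Build a GraphQL payload listing PRs and their review-thread counts."""
    query = """
    query($owner: String!, $name: String!, $first: Int!, $after: String) {
      repository(owner: $owner, name: $name) {
        pullRequests(first: $first, after: $after) {
          pageInfo { hasNextPage endCursor }
          nodes {
            number
            reviewThreads(first: 1) { totalCount }
          }
        }
      }
    }"""
    variables = {"owner": owner, "name": name, "first": page_size, "after": cursor}
    return {"query": query, "variables": variables}
```

POST this payload as JSON to `https://api.github.com/graphql` with an `Authorization: bearer <token>` header; pull requests with a review-thread `totalCount > 0` are the ones with review comments.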
- To download the comments and commits for these pull requests, use our scraper tool. TODO: add or refer to instructions on how to set up the DB from the repo of the mining tool.

  ```shell
  java -jar pr-scraper/prscraper-1.0.1.jar --reposToMine=<slug>
  java -jar pr-scraper/prscraper-1.0.1.jar --reposToMine=adap/flower
  ```
- Next, we build the hunks of the diff that the first reviewer saw during their review. This script also selects only those comments that are relevant for our analysis (valid data, first review of a pull request not from the PR author, ...). It produces the CSV that is the basis for our whole later analysis: `hunk_output.csv` in the folder of the respective repository. It also reconstructs the diff visible in the first review round and saves it under `/diffs` in the corresponding repo folder under `mined_data`.

  ```shell
  python -m build_hunks <slug>
  python -m build_hunks adap/flower
  ```

  If you already mined all the compares and diffs in an earlier run, skip mining by adding `-s`:

  ```shell
  python -m build_hunks -s adap/flower
  ```
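As background on what a "hunk" is here: a unified diff groups its changes into hunks, each introduced by an `@@ -start,count +start,count @@` header. A minimal illustrative splitter (not the actual `build_hunks` logic, which additionally reconstructs the reviewer-visible diff via the GitHub API) could look like:

```python
import re

# Matches unified-diff hunk headers such as "@@ -1,2 +1,2 @@"
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")

def split_hunks(diff_text):
    """Split a unified diff for one file into a list of hunk strings."""
    hunks, current = [], None
    for line in diff_text.splitlines():
        if HUNK_HEADER.match(line):
            if current is not None:
                hunks.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        hunks.append("\n".join(current))
    return hunks
```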
- To calculate the file and PR embeddings for a given repository:

  ```shell
  python -m claculate_file_embeddings <repo_slug>
  python -m claculate_file_embeddings adap/flower
  python -m claculate_pr_embeddings <repo_slug>
  python -m claculate_pr_embeddings adap/flower
  ```

  We use the checkpoint `Salesforce/codet5p-110m-embedding` of the CodeT5+ model, with GPU support on Mac via device `mps`. PR embeddings are saved in `pr_embeddings.csv` of the corresponding repo under `mined_data`, while file embeddings are saved under `/file_embeddings`.
- To calculate the cosine similarity between the PR embedding and the diff file embeddings for each PR of a given repo:

  ```shell
  python -m claculate_cosine_similarity <repo_slug> <aggregation_function>
  python -m claculate_cosine_similarity adap/flower mean
  ```

  Available aggregation functions for the diff hunks of a given file are `sum` and `mean`. The output is stored in `cosine_similarity.csv` in the corresponding repo folder under `mined_data`.
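Conceptually, each changed file is scored by the cosine similarity between the PR embedding and the file's per-hunk embeddings, aggregated with `sum` or `mean`. A dependency-free sketch of that computation (illustrative only; the actual script operates on the stored embedding CSVs):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def file_similarity(pr_embedding, hunk_embeddings, aggregation="mean"):
    """Aggregate the per-hunk similarities of one file with `mean` or `sum`."""
    sims = [cosine(pr_embedding, h) for h in hunk_embeddings]
    return sum(sims) / len(sims) if aggregation == "mean" else sum(sims)
```

Files with a higher aggregated score are more semantically similar to the pull request as a whole, which is what the most-similar-first order ranks by.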
To look at our analysis and visualizations, run the Streamlit files in `data_analysis`. Run them in the following order to produce the intermediate files for the analysis:
- Run `compare_orders.py` with:

  ```shell
  streamlit run data_analysis/compare_orders.py
  ```

  When run for the first time, this script filters the comments, calculates the Kendall tau correlation for the various orders (test-first, most-similar-first, largest-diff-first), and produces `kendall_results.csv` and `all_prs_with_orders.csv`. The latter is used for all subsequent visualisations and will not be re-generated if already present.
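For intuition, Kendall tau compares two orderings of the same files by counting concordant versus discordant pairs: +1 means identical order, -1 fully reversed. A minimal tau-a sketch without tie handling (the analysis script may well use `scipy.stats.kendalltau` instead):

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall tau-a between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant if both rankings order x and y the same way
        da = pos_a[x] - pos_a[y]
        db = pos_b[x] - pos_b[y]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n = len(order_a)
    return (concordant - discordant) / (n * (n - 1) / 2)
```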
- Run `kendall_tau_order_corr.py` with one of the command-line arguments `sim`, `test`, `diff`, or `alph` for the corresponding order:

  ```shell
  streamlit run data_analysis/kendall_tau_order_corr.py <order>
  streamlit run data_analysis/kendall_tau_order_corr.py sim
  ```
- Run `review_outcome_analysis.py`: to perform the review outcome analysis, we need to fetch the status of the first review round for each pull request in our filtered list `all_prs_with_orders.csv`:

  ```shell
  python -m select_repos_prs.collect_pr_reviews
  ```

  The script queries the GitHub GraphQL API and saves the review data as `<reponame>___<pr number>.csv` under `/mined_reviews`. The collective list of all reviews is also saved in `pr_reviews.csv`. Then the visualisation can be run with:

  ```shell
  streamlit run data_analysis/review_outcome_analysis.py
  ```
- To get an overview of project preferences, run:

  ```shell
  streamlit run data_analysis/project_preference.py
  ```

- To get an overview of reviewer order preferences, run:

  ```shell
  streamlit run data_analysis/reviewer_analysis.py
  ```
If the data or software in this replication package helps your research, please consider citing it as follows. Thanks!
```bibtex
@inproceedings{EASE_2025,
  title={{Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub}},
  author={Bouraffa, Abir and Brandt, Carolin and Zaidman, Andy and Maalej, Walid},
  booktitle={Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE)},
  pages={To appear},
  publisher={ACM},
  year={2025}
}
```