abiUni/github_reviews_study

Mining pull request review comments to investigate reviewing order

This is the replication package for the paper entitled "Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub", published at the International Conference on Evaluation and Assessment in Software Engineering (EASE 2025).

Setup

You need the poetry package manager. The main app is built with streamlit; see the streamlit component gallery for available extra UI components.

Then install the dependencies with poetry install and, before running any script, activate the environment with poetry shell.

We also provide a requirements.txt. If you added packages via poetry and want to update the requirements file, use:

poetry export --without-hashes --format=requirements.txt > requirements.txt

If using GitHubAPI / Download Scripts

For the GitHub API you'll need to get a token (see your profile settings on GitHub). Then create a file .env in the root directory of this project and add the following line:

API_KEY_GITHUB=<your token>
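
The scripts presumably read this token from the environment (e.g. via python-dotenv). As a minimal stdlib sketch of what parsing such a .env file looks like — `load_env` is an illustrative helper, not part of the package, and assumes simple KEY=value lines:

```python
def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines from a .env file (illustrative sketch)."""
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# token = load_env().get("API_KEY_GITHUB")
```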

You'll also need to fill in the GitHub token for the pr-scraper in config/application.properties. For more information on the scraper, please refer to its main repository. The code is adapted from the ETRC Tool.

Run scripts with python -m folder.scriptname (without .py).

Mine Data

  1. To pick which repositories to mine, use the SEART GitHub search engine. Download the .csv of your desired sample and clean it up with prepare_sampled_projects.py

    python select_repos_prs/prepare_sampled_projects.py
    

    We did this for our sampling frame and shuffled the repositories randomly (see selected_projects....csv).

  2. Now, select a repository for which to collect data and get its "slug" (e.g. adap/flower). Most of the scripts are slug-based, i.e. you pick which repository to process by passing its slug as a command-line argument. For some steps, we also provide run_.... scripts that run directly over a list of projects.
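
A slug-based entry point can be sketched as follows; `parse_slug` is a hypothetical helper for illustration, not taken from the repository:

```python
import sys

def parse_slug(slug: str) -> tuple:
    """Split an 'owner/name' repository slug (hypothetical helper)."""
    owner, sep, name = slug.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected <owner>/<name>, got {slug!r}")
    return owner, name

# Typical usage inside a slug-based script:
# owner, name = parse_slug(sys.argv[1])
```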

  3. Determine which pull requests have review comments by running

    python -m select_repos_prs.collect_prs_with_review_comments <repo_slug>
    python -m select_repos_prs.collect_prs_with_review_comments adap/flower
    

    This script queries the GitHub GraphQL API & saves the list of pull requests to prs_with_review_comments.csv in the corresponding repo folder under mined_data.
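
The shape of such a GraphQL request can be sketched as below. The field names (repository, pullRequests, reviewThreads, pageInfo) follow GitHub's public GraphQL schema, but the actual query issued by collect_prs_with_review_comments may differ:

```python
def build_pr_query(owner: str, name: str, page_size: int = 100) -> str:
    """Sketch of a GraphQL query listing PRs with their review-thread counts."""
    return f"""
    query {{
      repository(owner: "{owner}", name: "{name}") {{
        pullRequests(first: {page_size}, states: [OPEN, CLOSED, MERGED]) {{
          nodes {{
            number
            reviewThreads(first: 1) {{ totalCount }}
          }}
          pageInfo {{ hasNextPage endCursor }}
        }}
      }}
    }}"""

# The query would be POSTed to https://api.github.com/graphql with an
# "Authorization: bearer <API_KEY_GITHUB>" header.
```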

  4. To download the comments and commits for these pull requests, use our scraper tool. TODO add or refer to instructions on how to set up DB from repo of mining tool

    java -jar pr-scraper/prscraper-1.0.1.jar --reposToMine=<slug>
    java -jar pr-scraper/prscraper-1.0.1.jar --reposToMine=adap/flower
    
  5. Next, we build the hunks of the diff that the first reviewer saw during their review. This script also selects only those comments that are relevant for our analysis (valid data, first review of a pull request not from the PR author, ...). It produces the CSV that forms the basis of our whole later analysis: hunk_output.csv in the folder of the respective repository. The script also reconstructs the diff visible in the first review round and saves it under /diffs in the corresponding repo folder under mined_data.

    python -m build_hunks <slug>
    python -m build_hunks adap/flower
    

    If you already mined all the compares & diffs in an earlier run, skip mining by adding -s:

    python -m build_hunks -s adap/flower
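
Conceptually, building hunks means splitting each file's unified diff at its @@ headers. A minimal illustrative sketch (not the package's actual implementation):

```python
import re

# Unified diff hunk header: @@ -old_start,old_len +new_start,new_len @@
# (a missing length defaults to 1 in the unified diff format)
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def split_hunks(diff_text: str) -> list:
    """Split one file's unified diff into hunks (illustrative sketch)."""
    hunks = []
    for line in diff_text.splitlines():
        m = HUNK_HEADER.match(line)
        if m:
            old_start, old_len, new_start, new_len = (
                int(g) if g else 1 for g in m.groups()
            )
            hunks.append({"old_start": old_start, "old_len": old_len,
                          "new_start": new_start, "new_len": new_len,
                          "lines": []})
        elif hunks:
            hunks[-1]["lines"].append(line)
    return hunks
```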
    
  6. To calculate the file and pr embeddings for the given repository:

    python -m claculate_file_embeddings <repo_slug>
    python -m claculate_file_embeddings adap/flower
    
    python -m claculate_pr_embeddings <repo_slug>
    python -m claculate_pr_embeddings adap/flower
    

    We use the checkpoint Salesforce/codet5p-110m-embedding of the CodeT5+ model, with GPU support on macOS via the mps device. PR embeddings are saved in pr_embeddings.csv of the corresponding repo under mined_data, while file embeddings are saved under /file_embeddings.

  7. To calculate cosine similarity between PR embedding and diff file embeddings for each PR of a given repo:

    python -m claculate_cosine_similarity <repo_slug> <aggregation_function>
    python -m claculate_cosine_similarity adap/flower mean
    

    Available aggregation functions for the diff hunks of a given file are sum or mean. The output is stored in cosine_similarity.csv in the corresponding repo folder under mined_data.
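
A pure-Python sketch of this computation: score each hunk embedding of a file against the PR embedding, then aggregate with mean or sum. Whether the script aggregates similarities or embeddings is an assumption here; `file_similarity` is an illustrative stand-in, not the script's code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def file_similarity(pr_emb, hunk_embs, aggregation="mean"):
    """Score each hunk against the PR embedding, then aggregate (sum or mean)."""
    sims = [cosine(pr_emb, h) for h in hunk_embs]
    return sum(sims) / len(sims) if aggregation == "mean" else sum(sims)
```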

Data Analysis

To look at our analysis and visualizations, run the streamlit files in data_analysis.

Run the scripts in the following order to produce the intermediate files for the analysis:

  1. Run compare_orders.py with:

    streamlit run data_analysis/compare_orders.py
    

    When run for the first time, this script filters the comments, calculates the Kendall tau correlation for the various orders (test-first, most-similar-first, largest-diff-first), and produces kendall_results.csv and all_prs_with_orders.csv. The latter is used for all subsequent visualisations and will not be re-generated if already present.
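
Kendall tau compares the observed comment order against a candidate order by counting concordant and discordant pairs. The analysis presumably relies on a library implementation such as scipy.stats.kendalltau; a pure-Python tau-a sketch for intuition:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equal-length rankings (illustrative sketch)."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1    # pair ordered the same way in both rankings
        elif s < 0:
            discordant += 1    # pair ordered oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Identical orders give 1.0, fully reversed orders give -1.0, and unrelated orders land near 0.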

  2. Run kendall_tau_order_corr.py with one of the command-line arguments sim, test, diff, or alph for the corresponding order:

    streamlit run data_analysis/kendall_tau_order_corr.py <order>
    streamlit run data_analysis/kendall_tau_order_corr.py sim
    
  3. Run review_outcome_analysis.py: to perform the review outcome analysis, we first need to fetch the status of the first review round for each pull request in our filtered list all_prs_with_orders.csv:

    python -m select_repos_prs.collect_pr_reviews 
    

    The script queries the GitHub GraphQL API and saves review data as <reponame>___<pr number>.csv under /mined_reviews. The collective list of all reviews is also saved in pr_reviews.csv.

    Then the visualisation can be run with:

    streamlit run data_analysis/review_outcome_analysis.py 
    
  4. To get an overview of project preferences, run:

    streamlit run data_analysis/project_preference.py 
    
  5. To get an overview of reviewer order preferences, run:

    streamlit run data_analysis/reviewer_analysis.py 
    

How to cite:

If the data or software contained in this replication package helps your research, please consider citing it as follows. Thanks!

@inproceedings{EASE_2025,
  title={{Not One to Rule Them All:  Mining Meaningful Code Review Orders From GitHub}},
  author={Bouraffa, Abir and Brandt, Carolin and Zaidman, Andy and Maalej, Walid},
  booktitle={Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE)},
  pages={To appear},
  publisher={ACM},
  year={2025}
}
