abiUni/github_reviews_study

Mining pull request review comments to investigate reviewing order

This is the replication package for the paper entitled "Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub", published at the International Conference on Evaluation and Assessment in Software Engineering (EASE 2025).

Setup

You need the poetry package manager. The main app is built with streamlit; see the streamlit component gallery for available extra UI components.

Then install the dependencies with poetry install and, before running any script, activate the environment with poetry shell.

We also provide a requirements.txt. If you added packages via poetry and want to update the requirements file, use:

poetry export --without-hashes --format=requirements.txt > requirements.txt

If using GitHubAPI / Download Scripts

For the GitHub API you'll need to get a token (see your profile settings on GitHub). Then create a file .env in the root directory of this project and add the following line:

API_KEY_GITHUB=<your token>
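
The scripts presumably read this token from the environment (e.g. via python-dotenv). As a minimal stdlib sketch of what parsing such a .env file looks like — `load_env` is an illustrative helper, not part of the package, and assumes simple KEY=value lines:

```python
def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines from a .env file (illustrative sketch)."""
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# token = load_env().get("API_KEY_GITHUB")
```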

You'll also need to fill in the GitHub token for the pr-scraper in config/application.properties. For more information on the scraper, please refer to its main repository. The code is adapted from the ETRC Tool.

Run scripts with python -m folder.scriptname (without .py).

Mine Data

  1. To pick which repositories to mine, use the SEART GitHub search engine. Download the .csv of your desired sample and clean it up with prepare_sampled_projects.py

    python select_repos_prs/prepare_sampled_projects.py
    

    We did this for our sampling frame and shuffled the repositories randomly (see selected_projects....csv).

  2. Now, select a repository for which to collect data and get its "slug" (e.g. adap/flower). Most of the scripts are slug-based, i.e. you pick which repository to process by passing its slug as a command-line argument. For some steps, we also provide run_.... scripts that run directly over a list of projects.
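
A slug-based entry point can be sketched as follows; `parse_slug` is a hypothetical helper for illustration, not taken from the repository:

```python
import sys

def parse_slug(slug: str) -> tuple:
    """Split an 'owner/name' repository slug (hypothetical helper)."""
    owner, sep, name = slug.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected <owner>/<name>, got {slug!r}")
    return owner, name

# Typical usage inside a slug-based script:
# owner, name = parse_slug(sys.argv[1])
```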

  3. Determine which pull requests have review comments by running

    python -m select_repos_prs.collect_prs_with_review_comments <repo_slug>
    python -m select_repos_prs.collect_prs_with_review_comments adap/flower
    

    This script queries the GitHub GraphQL API & saves the list of pull requests to prs_with_review_comments.csv in the corresponding repo folder under mined_data.
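
The shape of such a GraphQL request can be sketched as below. The field names (repository, pullRequests, reviewThreads, pageInfo) follow GitHub's public GraphQL schema, but the actual query issued by collect_prs_with_review_comments may differ:

```python
def build_pr_query(owner: str, name: str, page_size: int = 100) -> str:
    """Sketch of a GraphQL query listing PRs with their review-thread counts."""
    return f"""
    query {{
      repository(owner: "{owner}", name: "{name}") {{
        pullRequests(first: {page_size}, states: [OPEN, CLOSED, MERGED]) {{
          nodes {{
            number
            reviewThreads(first: 1) {{ totalCount }}
          }}
          pageInfo {{ hasNextPage endCursor }}
        }}
      }}
    }}"""

# The query would be POSTed to https://api.github.com/graphql with an
# "Authorization: bearer <API_KEY_GITHUB>" header.
```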

  4. To download the comments and commits for these pull requests, use our scraper tool. TODO add or refer to instructions on how to set up DB from repo of mining tool

    java -jar pr-scraper/prscraper-1.0.1.jar --reposToMine=<slug>
    java -jar pr-scraper/prscraper-1.0.1.jar --reposToMine=adap/flower
    
  5. Next, we build the hunks of the diff that the first reviewer saw during their review. This script also selects only those comments that are relevant for our analysis (valid data, first review of a pull request not from the PR author, ...). It produces the CSV that forms the basis of our whole later analysis: hunk_output.csv in the folder of the respective repository. The script also reconstructs the diff visible in the first review round and saves it under /diffs in the corresponding repo folder under mined_data.

    python -m build_hunks <slug>
    python -m build_hunks adap/flower
    

    If you already mined all the compares & diffs in an earlier run, skip mining by adding -s:

    python -m build_hunks -s adap/flower
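
Conceptually, building hunks means splitting each file's unified diff at its @@ headers. A minimal illustrative sketch (not the package's actual implementation):

```python
import re

# Unified diff hunk header: @@ -old_start,old_len +new_start,new_len @@
# (a missing length defaults to 1 in the unified diff format)
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def split_hunks(diff_text: str) -> list:
    """Split one file's unified diff into hunks (illustrative sketch)."""
    hunks = []
    for line in diff_text.splitlines():
        m = HUNK_HEADER.match(line)
        if m:
            old_start, old_len, new_start, new_len = (
                int(g) if g else 1 for g in m.groups()
            )
            hunks.append({"old_start": old_start, "old_len": old_len,
                          "new_start": new_start, "new_len": new_len,
                          "lines": []})
        elif hunks:
            hunks[-1]["lines"].append(line)
    return hunks
```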
    
  6. To calculate the file and pr embeddings for the given repository:

    python -m claculate_file_embeddings <repo_slug>
    python -m claculate_file_embeddings adap/flower
    
    python -m claculate_pr_embeddings <repo_slug>
    python -m claculate_pr_embeddings adap/flower
    

    We use the checkpoint Salesforce/codet5p-110m-embedding of the CodeT5+ model, with GPU support on macOS via the mps device. PR embeddings are saved in pr_embeddings.csv of the corresponding repo under mined_data, while file embeddings are saved under /file_embeddings.

  7. To calculate cosine similarity between PR embedding and diff file embeddings for each PR of a given repo:

    python -m claculate_cosine_similarity <repo_slug> <aggregation_function>
    python -m claculate_cosine_similarity adap/flower mean
    

    Available aggregation functions for the diff hunks of a given file are sum or mean. The output is stored in cosine_similarity.csv in the corresponding repo folder under mined_data.
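
A pure-Python sketch of this computation: score each hunk embedding of a file against the PR embedding, then aggregate with mean or sum. Whether the script aggregates similarities or embeddings is an assumption here; `file_similarity` is an illustrative stand-in, not the script's code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def file_similarity(pr_emb, hunk_embs, aggregation="mean"):
    """Score each hunk against the PR embedding, then aggregate (sum or mean)."""
    sims = [cosine(pr_emb, h) for h in hunk_embs]
    return sum(sims) / len(sims) if aggregation == "mean" else sum(sims)
```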

Data Analysis

To look at our analysis and visualizations, run the streamlit files in data_analysis.

Run the scripts in the following order to produce the intermediate files for the analysis:

  1. Run compare_orders.py with:

    streamlit run data_analysis/compare_orders.py
    

    When run for the first time, this script filters the comments, calculates the Kendall tau correlation for the various orders (test-first, most-similar-first, largest-diff-first), and produces kendall_results.csv and all_prs_with_orders.csv. The latter is used for all subsequent visualisations and will not be re-generated if already present.
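
Kendall tau compares the observed comment order against a candidate order by counting concordant and discordant pairs. The analysis presumably relies on a library implementation such as scipy.stats.kendalltau; a pure-Python tau-a sketch for intuition:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equal-length rankings (illustrative sketch)."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1    # pair ordered the same way in both rankings
        elif s < 0:
            discordant += 1    # pair ordered oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Identical orders give 1.0, fully reversed orders give -1.0, and unrelated orders land near 0.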

  2. Run kendall_tau_order_corr.py with one of the command-line arguments sim, test, diff, or alph for the corresponding order:

    streamlit run data_analysis/kendall_tau_order_corr.py <order>
    streamlit run data_analysis/kendall_tau_order_corr.py sim
    
  3. Run review_outcome_analysis.py: to perform the review outcome analysis, we first need to fetch the status of the first review round for each pull request in our filtered list all_prs_with_orders.csv:

    python -m select_repos_prs.collect_pr_reviews 
    

    The script queries the GitHub GraphQL API and saves review data as <reponame>___<pr number>.csv under /mined_reviews. The collective list of all reviews is also saved in pr_reviews.csv.

    Then the visualisation can be run with:

    streamlit run data_analysis/review_outcome_analysis.py 
    
  4. To get an overview of project preferences, run:

    streamlit run data_analysis/project_preference.py 
    
  5. To get an overview of reviewer order preferences, run:

    streamlit run data_analysis/reviewer_analysis.py 
    

How to cite:

If the data or software contained in this replication package helps your research, please consider citing it as follows. Thanks!

@inproceedings{EASE_2025,
  title={{Not One to Rule Them All:  Mining Meaningful Code Review Orders From GitHub}},
  author={Bouraffa, Abir and Brandt, Carolin and Zaidman, Andy and Maalej, Walid},
  booktitle={Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE)},
  pages={To appear},
  publisher={ACM},
  year={2025}
}
