ML Request: Predict time to merge of a PR #236

MichaelClifford · 2021-04-19T19:49:19Z

We would like to create a GitHub bot that ingests information from a PR, including the written description, author, number of files, etc, in addition to the diff, and return a prediction for how long it will take to be merged into the master branch.

Given that the OCP repo currently has over 17,000 closed PR's that could be used as our historical training dataset, this project strikes me as feasible. We can approach this as a regression problem; where given sets of features derived from the PR's that are labeled with their historical time to merge, we should be able to train a model to return an estimated time to merge for future PR's.

This ML request depends on #146 (collecting data from GitHub), but should be independent of any other data sources.

MichaelClifford · 2021-04-21T18:24:27Z

cc @pacospace @xtuchyna

xtuchyna · 2021-04-23T12:14:47Z

Hello @MichaelClifford , I am tracking issues related to this, forget me but I was unavailable this week.
So, this request is then related to this thoth-station/mi-scheduler#130

Currently, some of the features are analysed in mi (look at https://github.com/thoth-station/mi/blob/b322d0a71db4fd6aa7f6f4ca4ce740c30fda0dc0/srcopsmetrics/entities/pull_request.py#L87), but some of them not - therefore I will create a custom data aggregation for you so that all features are satisfied. Let me just assure I got all of them right, and I can add them to the mi:

For a PR collect its:

title
body
author
changed files (identified by their fullpath)
PR reviews ( https://github.com/thoth-station/mi/blob/b322d0a71db4fd6aa7f6f4ca4ce740c30fda0dc0/srcopsmetrics/entities/pull_request.py#L34 )
PR commits (identified by their sha)
time features: create, merge, close time
PR labels

And for all referenced commits within PRs, collect features of diffs (this is standalone aggregation, because it's not API dependent as a PR data extraction - you can just drill commits with repository cloning)?

Is this everything you need? If so, I can launch the data aggregation today!

MichaelClifford · 2021-04-23T12:33:45Z

@xtuchyna This sounds good. Let's start with this set of items and see if it suffices for our needs : )

collect features of diffs (this is standalone aggregation, because it's not API dependent as a PR data extraction - you can just drill commits with repository cloning)?

Does this mean the diffs are not available through the API and that we would need to pull them through a separate process by relying on the PR commit sha's? Which is fine, just want to make sure I understand correctly : )

xtuchyna · 2021-04-23T12:40:35Z

@MichaelClifford the commits can be still gathered by API, but they can be also drilled by cloning the repo and doing it locally - which is much faster then requesting it from API

xtuchyna · 2021-04-23T12:41:43Z

I'm gonna launch the aggregation and provide you the data, should be ready by weekend

xtuchyna · 2021-04-26T13:42:30Z

Hello, just for better organization I am referencing here this Issue thoth-station/mi-scheduler#130, so have a look at the dataset and see if it's valuable :)

chauhankaranraj · 2021-05-25T17:13:20Z

Hey all, @oindrillac and I have a few clarifying questions regarding this ML request:

Should the output of the model be “number of seconds to merge”? Or should this prediction horizon be split into categories like “1-3 days”, “3-6 days”, “1-2weeks”, “>2weeks”, the same way as we have size labels?
Should the time to merge prediction made only once when PR is created, or should it be updated regularly as reviews, labels, etc. are added/removed?
Are reviewers not considered interactors? Because it seems possible to have a username in the reviewers column but not in interactions column.
The labels feature shows the final state of labels in the PR. But is there a way to determine what labels existed at a given point in time? E.g. the XS/S/M/L/XL labels will be available right after PR creation so we can use those in model training; but would labels like bugzilla/severity-high be also available or are they added later by SMEs?

MichaelClifford · 2021-05-25T18:39:46Z

Hey @chauhankaranraj here are my thoughts, but happy to discuss further if you disagree 😄

Should the output of the model be “number of seconds to merge”?

I think seconds is probably too fine grained and a bit over kill to be practically useful, but I also don't think we should predict wide categories like you've suggested. Have you dug into the data and determined a reasonable time frame to chunk the time units by? Maybe some additional EDA could give a clue into this? I was thinking that maybe predicting merge time to the hour might make the most sense from a practical point of view. WDYT?

Should the time to merge prediction made only once when PR is created

For now yes. Let's focus on the simplest case first.

The labels feature shows the final state of labels in the PR. But is there a way to determine what labels existed at a given point in time?

I dont know. @xtuchyna Do you know if we can capture label time stamps as well?

chauhankaranraj · 2021-05-25T19:41:47Z

Have you dug into the data and determined a reasonable time frame to chunk the time units by? Maybe some additional EDA could give a clue into this? I was thinking that maybe predicting merge time to the hour might make the most sense from a practical point of view. WDYT?

AFAIK we haven't done EDA on this, so I'm not sure what a reasonable granularity would be, but we can look into it and find out :)

oindrillac · 2022-03-07T13:29:11Z

@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362

chauhankaranraj · 2022-03-07T13:32:20Z

@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362

Sounds good to me :)

/close

sesheta · 2022-03-07T13:32:23Z

@chauhankaranraj: Closing this issue.

In response to this:

@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362

Sounds good to me :)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MichaelClifford created this issue from a note in AI-4-CI (Backlog) Apr 19, 2021

xtuchyna mentioned this issue Apr 23, 2021

Change PR structure thoth-station/mi#377

Merged

2 tasks

MichaelClifford moved this from Backlog to To Do in AI-4-CI May 19, 2021

MichaelClifford moved this from To Do to In Progress in AI-4-CI May 20, 2021

Shreyanand self-assigned this May 20, 2021

oindrillac assigned oindrillac and unassigned Shreyanand May 20, 2021

chauhankaranraj self-assigned this May 20, 2021

MichaelClifford moved this from In Progress to To Do in AI-4-CI Jun 3, 2021

chauhankaranraj mentioned this issue Jun 9, 2021

Add EDA for distribtion of time to merge #287

Merged

2 tasks

oindrillac moved this from To Do to Done (Sprint 08/12/21 - 08/26/21) in AI-4-CI Aug 26, 2021

oindrillac moved this from Done (Sprint 08/12/21 - 08/26/21) to Done (7/30/21 - 8/12/21) in AI-4-CI Aug 26, 2021

oindrillac moved this from Done (7/30/21 - 8/12/21) to Done (Sprint 08/12/21 - 08/26/21) in AI-4-CI Aug 26, 2021

chauhankaranraj mentioned this issue Aug 26, 2021

Integrate Time-to-Merge Model with Thoth Bots #362

Open

2 tasks

chauhankaranraj mentioned this issue Sep 10, 2021

Update time-to-merge Service to Accept Raw Data as Input #380

Closed

1 task

chauhankaranraj mentioned this issue Dec 15, 2021

Update GitHub PR TTM Service Deployment #457

Closed

3 tasks

sesheta closed this as completed Mar 7, 2022

AI-4-CI automation moved this from Done (Sprint 08/12/21 - 08/26/21) to Done (10/07/21 - 10/21/21) Mar 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ML Request: Predict time to merge of a PR #236

ML Request: Predict time to merge of a PR #236

MichaelClifford commented Apr 19, 2021

MichaelClifford commented Apr 21, 2021

xtuchyna commented Apr 23, 2021

MichaelClifford commented Apr 23, 2021

xtuchyna commented Apr 23, 2021 •

edited

xtuchyna commented Apr 23, 2021

xtuchyna commented Apr 26, 2021

chauhankaranraj commented May 25, 2021

MichaelClifford commented May 25, 2021

chauhankaranraj commented May 25, 2021

oindrillac commented Mar 7, 2022

chauhankaranraj commented Mar 7, 2022

sesheta commented Mar 7, 2022

ML Request: Predict time to merge of a PR #236

ML Request: Predict time to merge of a PR #236

Comments

MichaelClifford commented Apr 19, 2021

MichaelClifford commented Apr 21, 2021

xtuchyna commented Apr 23, 2021

MichaelClifford commented Apr 23, 2021

xtuchyna commented Apr 23, 2021 • edited

xtuchyna commented Apr 23, 2021

xtuchyna commented Apr 26, 2021

chauhankaranraj commented May 25, 2021

MichaelClifford commented May 25, 2021

chauhankaranraj commented May 25, 2021

oindrillac commented Mar 7, 2022

chauhankaranraj commented Mar 7, 2022

sesheta commented Mar 7, 2022

xtuchyna commented Apr 23, 2021 •

edited