Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML Request: Predict time to merge of a PR #236

Closed
MichaelClifford opened this issue Apr 19, 2021 · 12 comments
Closed

ML Request: Predict time to merge of a PR #236

MichaelClifford opened this issue Apr 19, 2021 · 12 comments
Assignees
Projects

Comments

@MichaelClifford
Copy link
Member

We would like to create a GitHub bot that ingests information from a PR, including the written description, author, number of files, etc, in addition to the diff, and return a prediction for how long it will take to be merged into the master branch.

Given that the OCP repo currently has over 17,000 closed PR's that could be used as our historical training dataset, this project strikes me as feasible. We can approach this as a regression problem; where given sets of features derived from the PR's that are labeled with their historical time to merge, we should be able to train a model to return an estimated time to merge for future PR's.

This ML request depends on #146 (collecting data from GitHub), but should be independent of any other data sources.

@MichaelClifford MichaelClifford created this issue from a note in AI-4-CI (Backlog) Apr 19, 2021
@MichaelClifford
Copy link
Member Author

cc @pacospace @xtuchyna

@xtuchyna
Copy link
Collaborator

Hello @MichaelClifford , I am tracking issues related to this, forget me but I was unavailable this week.
So, this request is then related to this thoth-station/mi-scheduler#130

Currently, some of the features are analysed in mi (look at https://github.com/thoth-station/mi/blob/b322d0a71db4fd6aa7f6f4ca4ce740c30fda0dc0/srcopsmetrics/entities/pull_request.py#L87), but some of them not - therefore I will create a custom data aggregation for you so that all features are satisfied. Let me just assure I got all of them right, and I can add them to the mi:

For a PR collect its:

And for all referenced commits within PRs, collect features of diffs (this is standalone aggregation, because it's not API dependent as a PR data extraction - you can just drill commits with repository cloning)?

Is this everything you need? If so, I can launch the data aggregation today!

@MichaelClifford
Copy link
Member Author

@xtuchyna This sounds good. Let's start with this set of items and see if it suffices for our needs : )

collect features of diffs (this is standalone aggregation, because it's not API dependent as a PR data extraction - you can just drill commits with repository cloning)?

Does this mean the diffs are not available through the API and that we would need to pull them through a separate process by relying on the PR commit sha's? Which is fine, just want to make sure I understand correctly : )

@xtuchyna
Copy link
Collaborator

xtuchyna commented Apr 23, 2021

@MichaelClifford the commits can be still gathered by API, but they can be also drilled by cloning the repo and doing it locally - which is much faster then requesting it from API

@xtuchyna
Copy link
Collaborator

I'm gonna launch the aggregation and provide you the data, should be ready by weekend

@xtuchyna
Copy link
Collaborator

Hello, just for better organization I am referencing here this Issue thoth-station/mi-scheduler#130, so have a look at the dataset and see if it's valuable :)

@MichaelClifford MichaelClifford moved this from Backlog to To Do in AI-4-CI May 19, 2021
@MichaelClifford MichaelClifford moved this from To Do to In Progress in AI-4-CI May 20, 2021
@Shreyanand Shreyanand self-assigned this May 20, 2021
@oindrillac oindrillac assigned oindrillac and unassigned Shreyanand May 20, 2021
@chauhankaranraj chauhankaranraj self-assigned this May 20, 2021
@chauhankaranraj
Copy link
Member

Hey all, @oindrillac and I have a few clarifying questions regarding this ML request:

  1. Should the output of the model be “number of seconds to merge”? Or should this prediction horizon be split into categories like “1-3 days”, “3-6 days”, “1-2weeks”, “>2weeks”, the same way as we have size labels?
  2. Should the time to merge prediction made only once when PR is created, or should it be updated regularly as reviews, labels, etc. are added/removed?
  3. Are reviewers not considered interactors? Because it seems possible to have a username in the reviewers column but not in interactions column.
  4. The labels feature shows the final state of labels in the PR. But is there a way to determine what labels existed at a given point in time? E.g. the XS/S/M/L/XL labels will be available right after PR creation so we can use those in model training; but would labels like bugzilla/severity-high be also available or are they added later by SMEs?

@MichaelClifford
Copy link
Member Author

Hey @chauhankaranraj here are my thoughts, but happy to discuss further if you disagree 😄

Should the output of the model be “number of seconds to merge”?

I think seconds is probably too fine grained and a bit over kill to be practically useful, but I also don't think we should predict wide categories like you've suggested. Have you dug into the data and determined a reasonable time frame to chunk the time units by? Maybe some additional EDA could give a clue into this? I was thinking that maybe predicting merge time to the hour might make the most sense from a practical point of view. WDYT?

Should the time to merge prediction made only once when PR is created

For now yes. Let's focus on the simplest case first.

The labels feature shows the final state of labels in the PR. But is there a way to determine what labels existed at a given point in time?

I dont know. @xtuchyna Do you know if we can capture label time stamps as well?

@chauhankaranraj
Copy link
Member

Have you dug into the data and determined a reasonable time frame to chunk the time units by? Maybe some additional EDA could give a clue into this? I was thinking that maybe predicting merge time to the hour might make the most sense from a practical point of view. WDYT?

AFAIK we haven't done EDA on this, so I'm not sure what a reasonable granularity would be, but we can look into it and find out :)

@MichaelClifford MichaelClifford moved this from In Progress to To Do in AI-4-CI Jun 3, 2021
@oindrillac oindrillac moved this from To Do to Done (Sprint 08/12/21 - 08/26/21) in AI-4-CI Aug 26, 2021
@oindrillac oindrillac moved this from Done (Sprint 08/12/21 - 08/26/21) to Done (7/30/21 - 8/12/21) in AI-4-CI Aug 26, 2021
@oindrillac oindrillac moved this from Done (7/30/21 - 8/12/21) to Done (Sprint 08/12/21 - 08/26/21) in AI-4-CI Aug 26, 2021
@oindrillac
Copy link
Member

@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362

@chauhankaranraj
Copy link
Member

@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362

Sounds good to me :)

/close

@sesheta
Copy link
Contributor

sesheta commented Mar 7, 2022

@chauhankaranraj: Closing this issue.

In response to this:

@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362

Sounds good to me :)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta closed this as completed Mar 7, 2022
AI-4-CI automation moved this from Done (Sprint 08/12/21 - 08/26/21) to Done (10/07/21 - 10/21/21) Mar 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
AI-4-CI
  
Done (10/07/21 - 10/21/21)
Development

No branches or pull requests

6 participants