ML Request: Predict time to merge of a PR #236
Comments
Hello @MichaelClifford, I am tracking issues related to this; forgive me, but I was unavailable this week. Currently, some of the features are analysed in For a PR collect its:
And for all referenced commits within PRs, collect features of diffs (this is a standalone aggregation because, unlike PR data extraction, it does not depend on the API; you can just drill commits by cloning the repository)? Is this everything you need? If so, I can launch the data aggregation today!
@xtuchyna This sounds good. Let's start with this set of items and see if it suffices for our needs : )
Does this mean the diffs are not available through the API and that we would need to pull them through a separate process by relying on the PR commit SHAs? Which is fine, I just want to make sure I understand correctly : )
@MichaelClifford the commits can still be gathered through the API, but they can also be drilled by cloning the repo and doing it locally, which is much faster than requesting them from the API
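To illustrate the local approach, here is a minimal sketch of drilling commit diff features from an already-cloned repository. It assumes `git` is on the PATH; the function names `parse_numstat` and `drill_commits` and the chosen features (files touched, lines added, lines deleted) are hypothetical and are not taken from the actual aggregation code.

```python
import subprocess
from collections import defaultdict


def parse_numstat(log_text):
    """Aggregate per-commit diff features from `git log --numstat` output."""
    features = defaultdict(lambda: {"files": 0, "added": 0, "deleted": 0})
    sha = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            sha = line.split()[1]
            features[sha]  # register the commit even if it touches no files
        elif line.strip() and sha is not None:
            # numstat lines look like "<added>\t<deleted>\t<path>"
            added, deleted, _path = line.split("\t", 2)
            stats = features[sha]
            stats["files"] += 1
            # git prints "-" for binary files; count those as zero lines
            stats["added"] += int(added) if added.isdigit() else 0
            stats["deleted"] += int(deleted) if deleted.isdigit() else 0
    return dict(features)


def drill_commits(repo_path):
    """Run git locally on a cloned repository, with no API requests."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=commit %H"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_numstat(out)
```

The resulting dictionary is keyed by commit SHA, so it can be joined with PR data that lists the commits each PR references.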
I'm gonna launch the aggregation and provide you the data, should be ready by the weekend
Hello, just for better organization I am referencing issue thoth-station/mi-scheduler#130 here, so have a look at the dataset and see if it's valuable :)
Hey all, @oindrillac and I have a few clarifying questions regarding this ML request:
Hey @chauhankaranraj here are my thoughts, but happy to discuss further if you disagree 😄
I think seconds is probably too fine-grained and a bit overkill to be practically useful, but I also don't think we should predict wide categories like you've suggested. Have you dug into the data and determined a reasonable time frame to chunk the time units by? Maybe some additional EDA could give a clue here? I was thinking that predicting merge time to the hour might make the most sense from a practical point of view. WDYT?
For now yes. Let's focus on the simplest case first.
I don't know. @xtuchyna Do you know if we can capture label timestamps as well?
AFAIK we haven't done EDA on this, so I'm not sure what a reasonable granularity would be, but we can look into it and find out :)
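As a starting point for that EDA, here is a minimal sketch that converts PR timestamps into time to merge in hours and summarizes the spread with quartiles. It assumes timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form that the GitHub API uses for `created_at` and `merged_at`; the helper names are hypothetical and the layout of the collected dataset may differ.

```python
import statistics
from datetime import datetime

# Assumed timestamp layout, e.g. "2020-05-01T12:00:00Z"
GITHUB_TS = "%Y-%m-%dT%H:%M:%SZ"


def time_to_merge_hours(created_at, merged_at):
    """Return the time between PR creation and merge, in hours."""
    created = datetime.strptime(created_at, GITHUB_TS)
    merged = datetime.strptime(merged_at, GITHUB_TS)
    return (merged - created).total_seconds() / 3600


def granularity_summary(hours):
    """Quartiles of time to merge, to help judge a sensible unit."""
    q1, median, q3 = statistics.quantiles(hours, n=4)
    return {"q1": q1, "median": median, "q3": q3}
```

If most PRs merge within a few hours, hourly targets keep useful resolution; if the quartiles span days or weeks, coarser units may be more practical.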
@chauhankaranraj shall we close this initial ML Request issue given we have a service up and running? We can address the "bots" aspect in #362
Sounds good to me :) /close
@chauhankaranraj: Closing this issue.
We would like to create a GitHub bot that ingests information from a PR, including the written description, author, number of files, etc., in addition to the diff, and returns a prediction of how long it will take to be merged into the master branch.
Given that the OCP repo currently has over 17,000 closed PRs that could be used as our historical training dataset, this project strikes me as feasible. We can approach this as a regression problem: given sets of features derived from PRs that are labeled with their historical time to merge, we should be able to train a model that returns an estimated time to merge for future PRs.
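To make the regression framing concrete, here is a minimal sketch of a one-feature least-squares baseline. The feature (number of changed lines), the function names, and the toy numbers are all illustrative assumptions; a real model would use many more PR features and likely a different learner.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate time to merge for a new PR with feature value x."""
    slope, intercept = model
    return slope * x + intercept


# Synthetic training data: changed lines per PR and hours until merge
changed_lines = [10, 20, 30]
hours_to_merge = [5, 9, 13]
model = fit_line(changed_lines, hours_to_merge)
estimate = predict(model, 40)
```

A baseline like this mainly gives a reference error that richer models should beat.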
This ML request depends on #146 (collecting data from GitHub), but should be independent of any other data sources.