Remove ensemble split and indices in AutoMLSearch #2260
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #2260     +/-  ##
=========================================
- Coverage   100.0%   100.0%    -0.0%
=========================================
  Files         280      280
  Lines       24382    24274     -108
=========================================
- Hits        24360    24252     -108
  Misses         22       22
Continue to review full report at Codecov.
Hm. Most recent attempt passed, but noting these random Windows failures:
LGTM!
@angela97lin Good job chasing this issue down! I'm glad we're removing our ensemble split for the time being, as it cuts down the complexity of our automl and engine code.
It seems like the root cause of #2093 was that we weren't fairly comparing the performance of our ensemble pipeline to the other pipelines. I agree with your analysis that removing the ensemble split moves the rankings in the direction we want, but I feel like we're not really fixing the root cause, because the cv score for ensembles only considers the first fold as opposed to all folds. I imagine this could produce an unrepresentative score on datasets with variance across the folds, i.e., datasets that would fail the high-variance-cv check.
I think as long as the ensemble pipeline is not scored on the same data as the other pipelines, we leave the door open for people to question the validity of our leaderboard, and for similar tricky open-ended questions in the future, e.g., "The ensemble shows up first in the leaderboard but does not outperform xgboost on holdout data. Why?"
I'll file a separate issue to see if we can compute the score for ensembles the same way as the other pipelines.
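To make that concrete, here is a minimal sketch of scoring a pipeline on every CV fold rather than only the first. This is an illustration following sklearn conventions, not evalml's actual API; `pipeline`, `scorer`, and `cross_val_score_all_folds` are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_val_score_all_folds(pipeline, X, y, scorer, n_splits=3):
    """Mean validation score across all folds, not just the first.

    `scorer` follows the sklearn scorer(estimator, X, y) convention.
    """
    scores = []
    for train_idx, valid_idx in KFold(n_splits=n_splits).split(X):
        pipeline.fit(X[train_idx], y[train_idx])
        scores.append(scorer(pipeline, X[valid_idx], y[valid_idx]))
    # Averaging over every fold keeps high-variance datasets from being
    # summarized by a single, possibly unrepresentative, fold.
    return np.mean(scores)
```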
@freddyaboulton Yes, I definitely agree that even here, we're not doing a fair apples-to-apples comparison. I've talked to @dsherry about this before, and he mentioned that he and @kmax12 have discussed a separate "model-selection" split: we'd hold out some data that is then used to validate the models and determine their actual ranking on the leaderboard, rather than relying on the training cv score we currently use. I know work is being done on the automl algo right now, and I'm not sure whether this would step on those toes, but I've filed #2284 to track this. Feel free to add more there! :D
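For reference, a minimal sketch of what that "model-selection" split could look like. `run_search` is a hypothetical stand-in for the CV search and ensembling step, and every name here is a placeholder, not evalml's API.

```python
from sklearn.model_selection import train_test_split

def search_then_rank(X, y, run_search, scorer, holdout_size=0.2):
    """Hold out data before the search, then rank all pipelines on it.

    `run_search(X, y)` is assumed to return a list of fitted pipelines;
    `scorer` follows the sklearn scorer(estimator, X, y) convention.
    """
    X_search, X_rank, y_search, y_rank = train_test_split(
        X, y, test_size=holdout_size, random_state=0
    )
    pipelines = run_search(X_search, y_search)
    # Every pipeline, ensemble included, is scored on the same unseen
    # data, so the leaderboard ordering is directly comparable.
    return sorted(
        pipelines,
        key=lambda p: scorer(p, X_rank, y_rank),
        reverse=True,  # assumes a greater-is-better metric
    )
```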
Looks good! 🥳
Closes #2093