Add baseline models for a given dataset #746
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master    #746   +/-   ##
=======================================
  Coverage   99.52%   99.52%
=======================================
  Files         159      159
  Lines        6257     6257
=======================================
  Hits         6227     6227
  Misses         30       30
=======================================

Continue to review the full report at Codecov.
@angela97lin RE your questions in the PR body above
Good thinking! I'd actually consider this case to be a bug in our data splitting. In the future, I think we should add protections against this, or at least detect and warn when it happens. In fact, I'll file an issue for that now so we don't forget: #760. There are two things that would be bad about this case:

1) Mechanically, we could have code that checks the number of unique values in the training data in order to do something with that info, and that could fail.
2) The code could work fine, but we'd end up training a model with one class missing, and when users evaluate on new labeled data the model would never predict the missing class and they'd see poor performance.

But for this baseline work, I'd say we can ignore that problem for now.
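For context, here is a minimal sketch of the kind of detect-and-warn check an issue like #760 might add. The function name and signature are illustrative only, not evalml's actual API:

```python
import warnings

import numpy as np


def warn_if_split_drops_classes(y_full, y_train):
    """Warn if the training split is missing classes present in the full label set."""
    missing = set(np.unique(y_full)) - set(np.unique(y_train))
    if missing:
        warnings.warn(f"Training split is missing classes: {sorted(missing)}. "
                      "Models trained on this split will never predict those classes.")
```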
Nah, I agree, let's keep it simple. The baseline models don't need to perform well; they represent a naive approach.
Doesn't have to be this PR, but I think it'd be very good to have a model that predicts randomly based on the class frequencies. It is still naive enough to be considered a baseline model and would likely perform better than a straight random guess. I think it'd be reasonable to include both strategies. The other thought that comes to mind is that maybe we can name this model something different.
@kmax12 That's a good idea, I'll augment this PR since it shouldn't be too much work :) I had seen the name ZeroR floating around while I was looking into this, but it seems pretty specific (i.e. using the majority class), so I like Baseline much more! Will update!
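For reference, a minimal sketch of what a frequency-weighted random baseline classifier could look like. The class and method names here are illustrative, not evalml's actual implementation:

```python
import numpy as np


class FrequencyBaselineClassifier:
    """Illustrative baseline that predicts labels at random, weighted by
    the class frequencies observed during fit."""

    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)
        self._classes = None
        self._frequencies = None

    def fit(self, X, y):
        y = np.asarray(y)
        self._classes, counts = np.unique(y, return_counts=True)
        self._frequencies = counts / counts.sum()
        return self

    def predict(self, X):
        if self._classes is None:
            raise RuntimeError("You must fit the baseline classifier before calling predict!")
        # Sample one label per row, weighted by the observed class frequencies.
        return self._rng.choice(self._classes, size=len(X), p=self._frequencies)

    def predict_proba(self, X):
        if self._classes is None:
            raise RuntimeError("You must fit the baseline classifier before calling predict_proba!")
        # Every row gets the same probability vector: the training class frequencies.
        return np.tile(self._frequencies, (len(X), 1))
```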
evalml/automl/auto_search_base.py (outdated)

@@ -226,7 +236,8 @@ def _check_stopping_condition(self, start):
             return False

         should_continue = True
-        num_pipelines = len(self.results['pipeline_results'])
+        num_pipelines = len(list(filter(lambda result: result['pipeline_class'].model_family != ModelFamily.BASELINE, self.results['pipeline_results'].values())))
Now that I think about it, do we really need special logic to not count the baseline towards max_pipelines? I don't think it's weird to only return the baseline pipeline if the user asks for 1 pipeline. It is a pipeline, just a simple one.
Hmmm, I don't know the right answer to this, but from my conversation with @dsherry I think we came to the conclusion that if a user asks for one pipeline, they'd expect one non-trivial pipeline, not just the baseline. I treat the baseline as something extra we provide to help users better understand how their trained pipelines performed.
That is what I originally assumed, but I wonder if we're overcomplicating it. I'm not sure that, as a user, I'd be bothered by evalml counting the baseline as one pipeline. That's easier to understand than why we treat it as a special case compared to other pipelines, and if I were bothered, it's pretty easy to request n+1 pipelines. If anything, I'd rather have it counted as a pipeline and then be given the option to turn that functionality on or off.
I also like treating it more like the other pipelines because it makes our code easier to maintain and less prone to off-by-one errors, since we can get rid of the special handling for the baseline pipeline case.
Given we don't have any users yet to make an informed decision, I'm leaning towards easier to maintain.
Gotcha. I agree with what you said about maintainability, especially since I felt that worry about off-by-one errors while updating the codebase. I also agree that it'd be easier to just increment max_pipelines on the user-facing side. In that case, I'll update this PR to count baseline pipelines as part of the number of pipelines! By the same logic, I'll also have them count towards the max time we spend searching for pipelines.
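For illustration, a rough sketch of what the simplified stopping check could look like once the baseline counts like any other pipeline. The attribute names (max_pipelines, max_time, results) follow the diff excerpt above, but this is an assumption, not the final evalml code:

```python
import time


def _check_stopping_condition(self, start):
    """Return True while the search should keep evaluating pipelines."""
    # Count every evaluated pipeline, baseline included, so there is no
    # special-case bookkeeping to keep in sync.
    num_pipelines = len(self.results['pipeline_results'])
    if self.max_pipelines is not None and num_pipelines >= self.max_pipelines:
        return False

    # Elapsed wall-clock time also includes the time spent fitting the baseline.
    if self.max_time is not None and time.time() - start >= self.max_time:
        return False

    return True
```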
@kmax12 @angela97lin nice, I like this discussion, and I agree: let's not special-case the number of pipelines. @angela97lin if it's easier for you to merge this PR and then address that as a follow-on PR, I suggest you do that.
My hope is that the max_pipelines parameter becomes less important for users as we continue to improve the automl algorithm!
strategy = self.parameters["strategy"]
if strategy == "mode":
    if self._mode is None:
        raise RuntimeError("You must fit Baseline classifier before calling predict!")
Might be nice to smoosh all these checks into one, i.e. if self._mode is None and self._classes is None: raise RuntimeError(...). Could refactor into a def _check_fitted helper and call it in predict and predict_proba.
Great idea, updated! I also updated baseline_regressor for consistency, even though there's only one place to check there :)
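A minimal sketch of what that refactor could look like, assuming a mode-strategy baseline; the names follow the snippet and suggestion above, but the exact evalml implementation may differ:

```python
import numpy as np


class BaselineClassifier:
    """Illustrative sketch of the shared fit-check refactor (not evalml's exact code)."""

    def __init__(self, strategy="mode"):
        self.parameters = {"strategy": strategy}
        self._mode = None
        self._classes = None

    def fit(self, X, y):
        y = np.asarray(y)
        values, counts = np.unique(y, return_counts=True)
        self._classes = values
        self._mode = values[np.argmax(counts)]
        return self

    def _check_fitted(self):
        # One guard shared by predict and predict_proba instead of repeating the check.
        if self._mode is None and self._classes is None:
            raise RuntimeError("You must fit Baseline classifier before calling predict!")

    def predict(self, X):
        self._check_fitted()
        # Always predict the most frequent class seen during fit.
        return np.full(len(X), self._mode)

    def predict_proba(self, X):
        self._check_fitted()
        # Probability 1 for the mode class, 0 for all others.
        proba = np.zeros((len(X), len(self._classes)))
        proba[:, list(self._classes).index(self._mode)] = 1.0
        return proba
```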
@angela97lin looks great! I left one comment about punting on applying column labels and returning pandas DataFrames to #236. Otherwise LGTM.
Closes #409