
Add baseline models for a given dataset #746

Merged
merged 76 commits into from May 27, 2020

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented May 6, 2020

Closes #409

@angela97lin angela97lin self-assigned this May 6, 2020
@codecov

codecov bot commented May 6, 2020

Codecov Report

Merging #746 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #746   +/-   ##
=======================================
  Coverage   99.52%   99.52%           
=======================================
  Files         159      159           
  Lines        6257     6257           
=======================================
  Hits         6227     6227           
  Misses         30       30           


@dsherry
Collaborator

dsherry commented May 8, 2020

@angela97lin RE your questions in the PR body above

what to do if training data doesn't contain all classes?

Good thinking! I'd actually consider this case to be a bug in our data splitting. In the future, I think we should add protections against this, or at least detect and warn when it happens. In fact, I'll file an issue for that now so we don't forget. #760

There are two things which would be bad about this case: 1) Mechanically, we could have code which checks the number of unique classes in the training data in order to do something with that info, and that could fail. 2) The code could work fine, but we'd end up with a model trained with one class missing, and when users try to evaluate on new labeled data they'd never get the missing class and thus see poor performance.

But for this baseline work, I'd say we can ignore that problem for now.
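The "detect and warn" protection described above could be sketched roughly like this (an illustrative example only, not evalml code; the function name and signature are hypothetical):

```python
# Hypothetical sketch: warn if a training split lacks classes that appear
# in the full target, as the proposed data-splitting protection might do.
import warnings

import numpy as np

def warn_on_missing_classes(y_full, y_train):
    """Return (and warn about) classes present in y_full but absent from y_train."""
    missing = np.setdiff1d(np.unique(y_full), np.unique(y_train))
    if missing.size:
        warnings.warn(f"Training split is missing classes: {missing.tolist()}")
    return missing.tolist()
```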

random: should it depend on distribution / frequency? --> Intuition says no.

Nah, I agree, let's keep it simple. The baseline models don't need to perform well; they represent a naive approach.

@kmax12
Contributor

kmax12 commented May 12, 2020

Nah, I agree, let's keep it simple. The baseline models don't need to perform well; they represent a naive approach.

Doesn't have to be this PR, but I think it'd be very good to have a model that predicts randomly based on the frequency. It is still naive enough to be considered a baseline model and likely would perform better than a straight random guess. I think it'd be reasonable to include both, maybe random and random_weighted?

To implement, we can use the p parameter in numpy.random.choice (https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html?highlight=random%20choice#numpy.random.choice)
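A minimal sketch of that random_weighted idea (the class name and interface here are hypothetical, not evalml's API): fit records each class's training-set frequency, and predict samples labels with those frequencies via numpy.random.choice's p parameter.

```python
import numpy as np

class RandomWeightedBaseline:
    """Hypothetical frequency-weighted random baseline (illustration only)."""

    def fit(self, y):
        # Record the unique classes and their observed relative frequencies.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.p_ = counts / counts.sum()
        return self

    def predict(self, n):
        # Sample n labels, weighted by training-set class frequency.
        return np.random.choice(self.classes_, size=n, p=self.p_)
```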

The other thought that comes to mind is maybe we can name this model something different than ZeroR. What if we just called it BaselineClassifier or BaselineRegressor? The docstring can then explain what that means. I had never heard the name ZeroR before this PR, so I think we'd only improve things by going with a name like that.

@angela97lin
Contributor Author

angela97lin commented May 12, 2020

@kmax12 That's a good idea, I'll augment this PR since it shouldn't be too much work :)

I had seen the name ZeroR floating around while I was looking into this, but it seems pretty specific (i.e., using the majority class), so I like Baseline much more! Will update!

@angela97lin angela97lin changed the title Add "baseline" models for a given dataset Add baseline models for a given dataset May 12, 2020
@@ -226,7 +236,8 @@ def _check_stopping_condition(self, start):
return False

should_continue = True
-        num_pipelines = len(self.results['pipeline_results'])
+        num_pipelines = len(list(filter(lambda result: result['pipeline_class'].model_family != ModelFamily.BASELINE, self.results['pipeline_results'].values())))
Contributor

@kmax12 kmax12 May 26, 2020


now that I think about it, do we really need to have special logic to not count the baseline towards max_pipelines? I don't think it's weird to only return the baseline pipeline if the user asks for 1 pipeline. It is a pipeline, just a simple one.

Contributor Author

@angela97lin angela97lin May 26, 2020


Hmmm, I don't know the right answer to this, but from my conversation with @dsherry I think we came to the conclusion that if a user asks for one pipeline, they'd expect one non-trivial pipeline, not just the baseline. I treat the baseline as just something we provide as extra functionality to help users better understand how their trained pipelines performed.

Contributor

@kmax12 kmax12 May 26, 2020


That is what I originally assumed, but I wonder if we're overcomplicating it. I'm not sure that, as a user, I'd be bothered by evalml considering the baseline to be one pipeline; that's easier to understand than why we treat it as a special case compared to other pipelines. If I was bothered, it is pretty easy to request n+1 pipelines. If anything, I'd rather have it counted as a pipeline and then be given the option to turn that functionality on or off.

I also like the approach of treating it more like the other pipelines because it makes our code easier to maintain and less prone to off-by-one errors, since we can get rid of the special handling for the baseline pipeline case.

given we don't have any users yet to make an informed decision, I'm leaning towards easier to maintain

Contributor Author

@angela97lin angela97lin May 26, 2020


Gotcha. I agree with what you said about maintainability, especially since I felt that worry of off-by-one errors while updating the codebase. Also agree that it'd be easier to just increment max_pipelines on the user-facing side. In that case, I'll update this PR to include base pipelines as part of the number of pipelines! I'll also update it to count towards the max time we spend searching for pipelines too, then, by the same logic.
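The simplification being agreed on here could look something like the following (a sketch under the assumed shape of the results dict shown in the diff above, not the actual merged code): once the baseline counts like any other pipeline, the ModelFamily filter disappears.

```python
# Hypothetical sketch: with the baseline counted as an ordinary pipeline,
# the stopping-condition count reduces to a plain length check.
def num_pipelines(results):
    # Every entry in pipeline_results, baseline included, counts
    # toward max_pipelines.
    return len(results['pipeline_results'])
```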

Collaborator

@dsherry dsherry May 26, 2020


@kmax12 @angela97lin nice, I like this discussion, and I agree: let's not special-case the number of pipelines. @angela97lin if it's easier for you to merge this PR and then address that as a follow-on PR, I suggest you do that.

My hope is that the max_pipelines parameter becomes less important for users as we continue to improve the automl algorithm!

Contributor Author

@angela97lin angela97lin May 27, 2020


@kmax12 @dsherry I went ahead and updated this PR based on these comments; I figured it's easiest to do so rather than introduce some potential off-by-one logic. If you could take one last look at this PR before I merge, that'd be fantastic!

@angela97lin angela97lin requested a review from dsherry May 26, 2020
@angela97lin angela97lin requested review from dsherry and removed request for dsherry May 27, 2020
evalml/automl/auto_search_base.py Outdated Show resolved Hide resolved
strategy = self.parameters["strategy"]
if strategy == "mode":
if self._mode is None:
raise RuntimeError("You must fit Baseline classifier before calling predict!")
Collaborator

@dsherry dsherry May 27, 2020


Might be nice to smoosh all these checks into one, i.e. if self._mode is None and self._classes is None: raise RuntimeError(...). Could refactor into def _check_fitted helper and call in predict and predict_proba
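One way the suggested _check_fitted refactor might look (a hypothetical sketch built around the snippet quoted above, not the code as merged; fit and predict here are simplified stand-ins):

```python
import numpy as np

class BaselineClassifier:
    """Hypothetical sketch of the _check_fitted refactor (illustration only)."""

    def __init__(self, strategy="mode"):
        self.strategy = strategy
        self._mode = None
        self._classes = None

    def fit(self, y):
        classes, counts = np.unique(y, return_counts=True)
        self._classes = classes
        self._mode = classes[np.argmax(counts)]
        return self

    def _check_fitted(self):
        # Single fitted-state check, shared by predict and predict_proba.
        if self._mode is None and self._classes is None:
            raise RuntimeError("You must fit Baseline classifier before calling predict!")

    def predict(self, n):
        self._check_fitted()
        # "mode" strategy: always predict the most frequent training class.
        return np.full(n, self._mode)
```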

Contributor Author

@angela97lin angela97lin May 27, 2020


Great idea, updated! Also updated baseline_regressor for consistency, even though there's only one place for that :)

Collaborator

@dsherry dsherry left a comment

@angela97lin looks great! I left one comment about punting applying column labels and returning pandas DFs to #236. Otherwise LGTM

@angela97lin angela97lin merged commit 03673ab into master May 27, 2020
2 checks passed
@angela97lin angela97lin deleted the 409_base branch May 27, 2020
Successfully merging this pull request may close these issues.

Functionality to provide a "baseline" model for a given dataset