Update automl search API: AutoMLSearch class #825

Closed
SydneyAyx opened this issue Jun 1, 2020 · 6 comments · Fixed by #871
Labels
enhancement An improvement to an existing feature.

Comments


SydneyAyx commented Jun 1, 2020

Instantiating AutoClassificationSearch() without specifying multiclass=True and without providing an objective results in an error when search is called on multiclass data.

ValueError                                Traceback (most recent call last)
<ipython-input-23-5d3db2adac13> in <module>
      1 automl = AutoClassificationSearch()
----> 2 automl.search(X, y)

~\AppData\Local\Continuum\anaconda3\envs\evalml\lib\site-packages\evalml\automl\auto_base.py in search(self, X, y, feature_types, raise_errors, show_iteration_plot)
    135 
    136         if self.problem_type != ProblemTypes.REGRESSION:
--> 137             self._check_multiclass(y)
    138 
    139         logger.log_title("Beginning pipeline search")

~\AppData\Local\Continuum\anaconda3\envs\evalml\lib\site-packages\evalml\automl\auto_base.py in _check_multiclass(self, y)
    230             return
    231         if self.objective.problem_type != ProblemTypes.MULTICLASS:
--> 232             raise ValueError("Given objective {} is not compatible with a multiclass problem.".format(self.objective.name))
    233         for obj in self.additional_objectives:
    234             if obj.problem_type != ProblemTypes.MULTICLASS:

ValueError: Given objective Log Loss Binary is not compatible with a multiclass problem.

Code to reproduce:

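import pandas as pd
from evalml import AutoClassificationSearch

# Load the iris dataset; the "class" column is the multiclass target
data = pd.read_csv("./iris.csv")
target = "class"
X = data.drop([target], axis=1)
y = data[target]
# Use the defaults: no multiclass flag and no objective
automl = AutoClassificationSearch()
automl.search(X, y)  # raises the ValueError shown above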

The error message is clear and this is super easy to work around, but it is probably not the expected behavior for a user trying to use defaults and auto-model in easy mode.

@SydneyAyx SydneyAyx added bug Issues tracking problems with existing features. good first issue Issues which would be a good starting point for new hires. labels Jun 1, 2020
Contributor

dsherry commented Jun 5, 2020

Thanks @SydneyAyx ! This is great feedback to have. I agree this is nonintuitive and that we can improve it.

The reason this usage triggers an error is that the provided data is multiclass but AutoClassificationSearch wasn't given the multiclass=True option, so it fell back to the default binary objective.

The objective and problem type are set in AutoClassificationSearch.__init__ and AutoRegressionSearch.__init__. We don't use the objective directly until search, although it does appear in __str__. We need the problem_type in AutoSearchBase.__init__ so that we can compute self.allowed_pipelines.
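To make the failure concrete, here is a minimal stand-alone model of the check that raises in the traceback above. It is a sketch, not the evalml source: the `Objective` class and `check_multiclass` function are hypothetical stand-ins, and the early return for targets with two or fewer classes is an assumption about what the real `_check_multiclass` does before the lines shown in the traceback.

```python
from enum import Enum


class ProblemTypes(Enum):
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


class Objective:
    """Hypothetical stand-in for an objective with a name and a problem type."""

    def __init__(self, name, problem_type):
        self.name = name
        self.problem_type = problem_type


def check_multiclass(objective, y):
    """Raise if y has more than two classes but the objective is not multiclass."""
    if len(set(y)) <= 2:
        return  # assumed early exit: a binary target needs no multiclass check
    if objective.problem_type != ProblemTypes.MULTICLASS:
        raise ValueError(
            "Given objective {} is not compatible with a multiclass problem.".format(objective.name)
        )


# The default objective is binary, but an iris-like target has three classes.
default_objective = Objective("Log Loss Binary", ProblemTypes.BINARY)
iris_like_target = ["setosa", "versicolor", "virginica", "setosa"]

try:
    check_multiclass(default_objective, iris_like_target)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Because the objective is fixed at construction time and the target only arrives at search time, the mismatch cannot be detected any earlier with the current API.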

Options which come to mind:

  1. Define AutoMulticlassClassificationSearch and AutoBinaryClassificationSearch instead of having the multiclass flag. This would line up well with how we're organizing our pipelines and objectives.
  2. Delete the multiclass flag and infer whether a problem is multiclass vs binary from the provided target. @SydneyAyx provided an example of this in Automl: infer problem type from target data #826
  3. We could move the multiclass flag and the computation of self.allowed_pipelines into search.
  4. Do nothing.

I'm split between options 1 and 2. I don't feel great about options 3 or 4.
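For reference, option 2 could look roughly like the sketch below. The `infer_problem_type` helper is hypothetical (it is not an existing evalml function) and it only distinguishes binary from multiclass by counting distinct target values; it assumes the caller already knows the problem is classification rather than regression.

```python
def infer_problem_type(y):
    """Hypothetical helper: guess 'binary' or 'multiclass' from the target values."""
    n_classes = len(set(y))
    if n_classes < 2:
        raise ValueError("Target must contain at least two distinct classes.")
    return "binary" if n_classes == 2 else "multiclass"


print(infer_problem_type([0, 1, 1, 0]))
print(infer_problem_type(["setosa", "versicolor", "virginica"]))
```

A count of distinct values cannot tell an integer-valued regression target apart from a classification target, which is one reason inference alone may not be enough.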

@kmax12 what do you think?

@dsherry dsherry removed the good first issue Issues which would be a good starting point for new hires. label Jun 5, 2020
@ctduffy ctduffy self-assigned this Jun 12, 2020
Contributor

dsherry commented Jun 12, 2020

Looking at this and #826 again, here's what I'd like us to do:

  • Go with option 1 from the list above. Define AutoMulticlassClassificationSearch and AutoBinaryClassificationSearch instead of having the multiclass flag. This would line up well with how we're organizing our pipelines and objectives. Then delete AutoClassificationSearch.
  • After this issue is merged, we can use Automl: infer problem type from target data #826 to think about adding a helper method to infer the problem type from the target data. But I think for now, it's best if users determine the problem type in advance. I'll update that issue to match.

@dsherry dsherry added this to the June 2020 milestone Jun 12, 2020
Contributor

kmax12 commented Jun 12, 2020

Rather than have 3 different classes with such long names, what if we had one class with a required problem_type argument?

# could use enum instead, but i bet most users wouldn't
AutoMLSearch(problem_type="regression")
AutoMLSearch(problem_type="binary")
AutoMLSearch(problem_type="multiclass")

If a user would have to look up the name of a complicated class anyway, they can just as easily look up the parameter.

I also think this structure better presents what is going on. The searches are more similar than different, which was part of the motivation for lumping binary and multiclass together in the first place.
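A rough sketch of how a single class with a required problem_type could pick a matching default objective is below. Everything here is illustrative: `AutoMLSearchSketch`, `ProblemType`, and the `DEFAULT_OBJECTIVES` mapping are hypothetical names, and apart from "Log Loss Binary" (which appears in the traceback above) the default objective names are assumptions.

```python
from enum import Enum


class ProblemType(Enum):
    REGRESSION = "regression"
    BINARY = "binary"
    MULTICLASS = "multiclass"


# Assumed defaults, for illustration only.
DEFAULT_OBJECTIVES = {
    ProblemType.REGRESSION: "R2",
    ProblemType.BINARY: "Log Loss Binary",
    ProblemType.MULTICLASS: "Log Loss Multiclass",
}


class AutoMLSearchSketch:
    """Hypothetical single entry point with a required problem_type."""

    def __init__(self, problem_type, objective=None):
        # Accept either a string such as "binary" or a ProblemType member.
        if isinstance(problem_type, str):
            problem_type = problem_type.lower()
        self.problem_type = ProblemType(problem_type)
        # With the problem type known up front, the default objective always matches it.
        self.objective = objective or DEFAULT_OBJECTIVES[self.problem_type]


for name in ("regression", "binary", "multiclass"):
    search = AutoMLSearchSketch(problem_type=name)
    print(search.problem_type.value, "->", search.objective)
```

With this shape, the multiclass case from the original report would get a multiclass default objective instead of the binary one, and an unknown string fails at construction with a ValueError from the enum lookup.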

This also sets us up better in the future if we don't want to make problem type required. The dynamic pipeline in #841 will make this change easier, since we don't have to determine the pipelines at init any more.

Since we're trying to tackle this this month, lmk if talking live would be better.

Contributor

dsherry commented Jun 12, 2020

That's a cool idea @kmax12 . That could be a nice simplification over what we have now. Yeah, since this API is the first thing most users will see, let's take some time and talk it over. I just sent you and @ctduffy an invite for Tues afternoon.

Worth noting that if the scope creeps on this, we may wanna get a short-term fix in for June and file the API update as a separate issue.

Contributor

dsherry commented Jun 16, 2020

@ctduffy, @kmax12 and I just met to discuss. Here are our notes.

Next steps

  • @ctduffy write a design doc, goal is to have a draft out EOD tomorrow (2020/06/17 Weds)
  • Confirm implementation plan and who will implement it
    • We should aim to have steps 1 and 2 (tracked by this issue) in for the June release, at minimum.
    • We can do steps 3 (Automl: infer problem type from target data #826) and 4 (more validation) in the future if needed.
    • Theoretically, someone could do the implementation for step 3 in parallel, meaning we could get it in for the release too. So Clara could do 1+2, and Dylan could do 3, for example
    • We have 1.5 weeks to make this happen.
    • Question to be resolved: will @ctduffy have enough bandwidth to meet the release deadline while also working on notebooks with @gsheni ?

Contributor

dsherry commented Jun 19, 2020

@ctduffy and I synced an hour ago. The design doc is done! As a next step, @ctduffy is going to make an epic for this and #826, and we'll get this issue done for the June milestone and the rest for July.

We estimated this issue will take 6 days to complete. So we have just enough time to get it done before the June release on Tues the 30th.

@dsherry dsherry changed the title from "Default Objective in AutoClassificationSearch Fails for Multiclass" to "Update automl search API: AutoMLSearch class" Jun 26, 2020
@dsherry dsherry added enhancement An improvement to an existing feature. and removed bug Issues tracking problems with existing features. labels Jun 26, 2020
4 participants