Refactor AutoML Search for Parallel Workers #1295

Closed
3 tasks
christopherbunn opened this issue Oct 9, 2020 · 2 comments · Fixed by #1337 or #1410

@christopherbunn
Contributor

  • Define an evaluation method (name can be whatever) which takes a pipeline plus data as input and outputs scores/results/CV data etc., with no side effects. A sketch of this shape follows the list.
  • To do this: update the calling code to invoke _add_result after _evaluate returns, rather than inside _evaluate, so that _evaluate has no side effects.
  • May have to update _compute_cv_scores too.
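
A minimal sketch of the intended shape, with toy stand-in signatures (the real _evaluate, _compute_cv_scores, and _add_result take more arguments and do actual cross-validation):

```python
class AutoMLSearchSketch:
    """Toy stand-in for AutoMLSearch; only the evaluation flow is modeled."""

    def __init__(self):
        self._results = []

    def _compute_cv_scores(self, pipeline, X, y):
        # Cross-validate the pipeline and return its scores/CV data.
        # Nothing on self is mutated here.
        return {"pipeline": pipeline, "mean_cv_score": 0.0, "cv_data": []}

    def _evaluate(self, pipeline, X, y):
        # Side-effect-free: compute and return the evaluation, nothing else.
        return self._compute_cv_scores(pipeline, X, y)

    def _add_result(self, evaluation):
        # The single place where search results get recorded.
        self._results.append(evaluation)

    def search(self, batch, X, y):
        for pipeline in batch:
            evaluation = self._evaluate(pipeline, X, y)
            # _add_result is called after _evaluate returns, not inside it.
            self._add_result(evaluation)
```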
@christopherbunn christopherbunn added the enhancement An improvement to an existing feature. label Oct 12, 2020
@christopherbunn christopherbunn self-assigned this Oct 12, 2020
@christopherbunn
Contributor Author

I have a first pass at removing the side effects from _evaluate. In doing this, I realized that _evaluate is essentially just _compute_cv_scores plus the side effect rolled into one function. Once the side effect is removed, we're left with a useless wrapper around _compute_cv_scores.

Rather than stripping the side effect out of _evaluate (or removing _evaluate altogether), I propose that we go in the opposite direction and beef up _evaluate: rename it to _evaluate_pipelines, keep the side effect, and move some of the steps from automl.search() into it. Essentially, this renamed function would:

  • Evaluate the current pipeline batch.
  • Update the AutoML Algorithm.
  • For parallel searches, it would copy the data required by _compute_cv_scores into a dictionary and create Dask futures.

automl.search() will still take care of the AutoML Algorithm setup, creating new batches, and checking for the stopping condition.
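
A rough sketch of what the renamed method could look like, assuming a dask.distributed.Client handles the parallel path; the free-function form, argument names, and the algorithm's add_result call are illustrative stand-ins, not the actual EvalML code:

```python
from dask.distributed import Client  # only needed for the parallel path

def _evaluate_pipelines(automl, batch, X, y, client=None):
    """Evaluate a batch of pipelines, record results, update the algorithm."""
    if client is None:
        # Sequential path: evaluate each pipeline in the batch directly.
        evaluations = [automl._compute_cv_scores(p, X, y) for p in batch]
    else:
        # Parallel path: submit one _compute_cv_scores call per pipeline as a
        # Dask future, shipping the required data to the workers.
        futures = [client.submit(automl._compute_cv_scores, p, X, y) for p in batch]
        evaluations = client.gather(futures)
    for evaluation in evaluations:
        automl._add_result(evaluation)  # the side effect stays in this method
        automl._automl_algorithm.add_result(evaluation)  # update the AutoML Algorithm
    return evaluations
```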

Pros:

  • It centralizes the pipeline evaluation logic into a function.
    • When parallel engines are implemented, this will help keep search from getting too unwieldy. This becomes more important if we add ways of processing pipelines other than iteratively and in batches.

Cons:

  • The KeyboardInterrupt logic might need to be adjusted. I think we can just move it into _evaluate_pipelines, but I'm not sure whether a keyboard interrupt that arrives while the AutoML algorithm is generating the next batch would still be covered.
  • Not sure if separating the AutoML update step into a different function from the AutoML setup and batch generation steps will cause problems down the line.

For the scope of this issue, we would just move the AutoML Algorithm update code (lines 482 to 485) into _evaluate and rename the function to _evaluate_pipelines.

@dsherry
Contributor

dsherry commented Oct 22, 2020

@christopherbunn when you say "Update the AutoML Algorithm" that's referring to sending the evaluation results to the AutoMLAlgorithm object, yeah?

Yep, I think I follow what you've described here. I do think we should still have one private method in AutoMLSearch which takes the training data and a pipeline as input and returns the pipeline's scores; currently, that's _evaluate. Let's keep that encapsulation in some way, shape, or form. Happy to discuss more verbally.
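
For concreteness, here's one shape that would keep that encapsulation after the rename, a minimal sketch with stubbed, simplified signatures rather than EvalML's actual methods:

```python
class AutoMLSearchSketch:
    def _compute_cv_scores(self, pipeline, X, y):
        return {"pipeline": pipeline, "mean_cv_score": 0.0}  # stub

    def _evaluate(self, pipeline, X, y):
        # Encapsulated single-pipeline evaluation: training data + one
        # pipeline in, that pipeline's scores out.
        return self._compute_cv_scores(pipeline, X, y)

    def _evaluate_pipelines(self, batch, X, y):
        # The batch-level method delegates, so the encapsulation survives.
        return [self._evaluate(pipeline, X, y) for pipeline in batch]
```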

Hmm, I hear you about the keyboard interrupt. As long as the behavior of that feature doesn't change, I'm on board. If we're gonna change the behavior there, let's discuss first.
