Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EvalMLAlgorithm #2525

Merged
merged 40 commits into from Aug 2, 2021
Merged

Add EvalMLAlgorithm #2525

merged 40 commits into from Aug 2, 2021

Conversation

jeremyliweishih
Copy link
Contributor

@jeremyliweishih jeremyliweishih commented Jul 19, 2021

Fixes #2347.

Still need do the following:

  • ensure test coverage
  • remove duplicate code
  • remove unnecessary impl (FS custom estimator mainly)

Saving custom_hyperparameter and pipeline params for AutoMLSearch integration.

@codecov
Copy link

codecov bot commented Jul 19, 2021

Codecov Report

Merging #2525 (4f3f3e8) into main (a7af3c3) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2525     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        291     293      +2     
  Lines      26664   26881    +217     
=======================================
+ Hits       26621   26838    +217     
  Misses        43      43             
Impacted Files Coverage Δ
evalml/automl/automl_algorithm/__init__.py 100.0% <100.0%> (ø)
evalml/automl/automl_algorithm/evalml_algorithm.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.2% <100.0%> (ø)
evalml/tests/automl_tests/test_evalml_algorithm.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7af3c3...4f3f3e8. Read the comment docs.

class EvalMLAlgorithm(AutoMLAlgorithm):
"""An automl algorithm that consists of two modes: fast and long. Where fast is a subset of long.

1. Naive pipelines:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought it would be nice to list it out here - can remove as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this current algorithm doesn't have any feature engineering steps as that is blocked by #2439.

pipeline_params=None,
custom_hyperparameters=None,
n_jobs=-1,
number_features=None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if number_features is still necessary anymore. I still see it in IterativeAlgorithm but don't see it mentioned anywhere else in the codebase.

self._custom_hyperparameters = custom_hyperparameters or {}
self._selected_cols = None
self._top_n_pipelines = None
if custom_hyperparameters and not isinstance(custom_hyperparameters, dict):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some duplicate logic from IterativeAlgorithm that we can refactor into a base class later on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please file an issue with specifics!

@jeremyliweishih jeremyliweishih marked this pull request as ready for review July 21, 2021 14:43
@jeremyliweishih jeremyliweishih requested review from a team and removed request for a team July 27, 2021 15:32
@jeremyliweishih jeremyliweishih requested review from a team and removed request for a team July 27, 2021 17:41
@jeremyliweishih jeremyliweishih requested a review from a team July 27, 2021 17:57
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick look-over and it's looking pretty good! Left a few suggestions on ways to clean up the code.

evalml/pipelines/utils.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
for pipeline_dict in self._best_pipeline_info.values()
]
estimators.sort(key=lambda pipeline: pipeline[1])
estimators = estimators[:n]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is choosing the best estimators based on the mean_cv_score, which makes sense. I think the sorting should be done based on what objective is being used, since R2==1 is the best for regression problems but log loss == 0 is the best for classification problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the scores coming in should have been converted to "lower is better" so this order should be ok!

next_batch = self._create_long_top_n(n=3)
elif self.batch_number % 2 != 0:
next_batch = self._create_ensemble()
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like this if/elif/else logic

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%. I think this is really making it clear how this is working. Huge fan.

evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeremy, so sorry I delayed you for such a long time. This is really good work. Not going to lie - I was intimidated by the IterativeAlgorithm, but this really feels and looks a lot cleaner and easier to understand. I think there's a few places that can use some cleanup or refactor.

  1. add_result() maybe needs a breakup. It's a little thick and could use a bit more addition to the docstring.
  2. The hardcoded 10 and 50 should be replaced with variables and possibly exposed as params for the EvalMLAlgorithm
  3. EvalMLAlgorithm should be renamed "JeremyAlgorithm"
  4. The use of the pipeline's new function in various places should be explored and determined whether it's necessary or not. I'd recommend doing a commit that just pulls out the new calls and see what happens. I feel like those are done because they're an ancient and hallowed tradition or we're uncertain whether we need yet another copy of the pipeline.
  5. File an issue for creating another common baseclass with Iterative Algorithm.

But yea, this is really good work and super clean. I can see why it took such a long time as it's polished. Well done.

evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
self._custom_hyperparameters = custom_hyperparameters or {}
self._selected_cols = None
self._top_n_pipelines = None
if custom_hyperparameters and not isinstance(custom_hyperparameters, dict):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please file an issue with specifics!

evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/evalml_algorithm.py Outdated Show resolved Hide resolved
{"Select Columns Transformer": {"columns": self._selected_cols}}
)
next_batch.append(
pipeline.new(parameters=parameters, random_seed=self.random_seed)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how necessary the new pipeline is in all cases throughout this code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not necessary in this case as @bchen1116 pointed out but should be when we're creating new pipelines from a tuner proposed parameter set.

next_batch = self._create_long_top_n(n=3)
elif self.batch_number % 2 != 0:
next_batch = self._create_ensemble()
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%. I think this is really making it clear how this is working. Huge fan.

Comment on lines +322 to +349
if self.batch_number == 2 and self._selected_cols is None:
if is_regression(self.problem_type):
self._selected_cols = pipeline.get_component(
"RF Regressor Select From Model"
).get_names()
else:
self._selected_cols = pipeline.get_component(
"RF Classifier Select From Model"
).get_names()

current_best_score = self._best_pipeline_info.get(
pipeline.model_family, {}
).get("mean_cv_score", np.inf)
if (
score_to_minimize is not None
and score_to_minimize < current_best_score
and pipeline.model_family != ModelFamily.ENSEMBLE
):
self._best_pipeline_info.update(
{
pipeline.model_family: {
"mean_cv_score": score_to_minimize,
"pipeline": pipeline,
"parameters": pipeline.parameters,
"id": trained_pipeline_results["id"],
}
}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about all this being in add_result(), not like you really have much other choice, but I think maybe some of what's happening might want to make its way into the docstring. I'm not 100% sure what _selected_cols is, or why it's being updated and it's also good to know that the _best_pipeline_info is being updated here.

evalml/pipelines/utils.py Outdated Show resolved Hide resolved
evalml/tests/automl_tests/test_evalml_algorithm.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for making the changes!

@jeremyliweishih jeremyliweishih merged commit d886206 into main Aug 2, 2021
@freddyaboulton freddyaboulton deleted the js_2347_algo branch May 13, 2022 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AutoML Algorithm Upgrade: new algorithm
3 participants