Implement RandomSearch and GridSearch Tuners #240
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master     #240   +/-   ##
=======================================
  Coverage   98.58%   98.58%
=======================================
  Files         111      111
  Lines        3744     3744
=======================================
  Hits         3691     3691
  Misses         53       53
```

Continue to review full report at Codecov.
Looks like a good start! Can you also add in the AutoBase changes and some test cases for the next iteration?
Some additional questions:
Looking good, just need a little bit more on testing and more graceful handling of the exception on the AutoML side. Adding a tutorial to the docs and API reference can come next as well.
Looking good, just need tests on the AutoML side for taking in new tuners and handling the exceptions.
I discussed this issue with @jeremyliweishih last Friday, but the current structure makes it difficult to pass in parameters for the tuner. For example, the
The current implementation here hard-codes these parameters for now. If we want to, we could merge this in to show that we have multiple tuners and then refactor them once we solidify #272.
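As a purely hypothetical sketch of the difficulty: if the automl class constructs its tuners internally, user-supplied tuner arguments have nowhere to enter unless something like a kwargs dict is threaded through. All names below are illustrative, not the actual evalml API:

```python
# Hypothetical sketch of threading tuner parameters through an automl class.
class AutoBase:
    def __init__(self, tuner_class, tuner_kwargs=None):
        self._tuner_class = tuner_class
        # Without a hook like this, arguments such as the number of grid
        # points stay hard-coded inside the automl class.
        self._tuner_kwargs = tuner_kwargs or {}

    def _make_tuner(self, space):
        return self._tuner_class(space, **self._tuner_kwargs)
```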
@dsherry aside from these issues, this PR should be ready to be merged. We should discuss how we want to proceed in the next planning meeting.
evalml/models/auto_base.py
Outdated
```python
except Exception as e:
    self.parameter_exception = True
    self.logger.log('\n✘ ' + str(e) + '\n')
    return
```
Does this mean an error would show up in the terminal output, if we're using tqdm?
Force-pushed from 7e635e7 to 3fdb7ac.
evalml/automl/auto_base.py
Outdated
```python
try:
    parameters = self._propose_parameters(pipeline_class)
except Exception as e:
    raise e
```
Why do this?
We can't rethrow here. If the tuner says the search space has been exhausted, we need automl to handle that. I think we need to have that in this PR in order to merge.

I see two options to do that:

- Update the `Tuner` API. Add a `Tuner.is_search_space_exhausted()` member (name could be better). Have `AutoBase._check_stopping_condition` call this and return true if it's true.
- Just update the autobase code. Add an instance var `_search_space_exhausted` to `AutoBase`, and set that to true here when `NoParamsException` comes up the stack. Update `AutoBase._check_stopping_condition()` to return true if `self._search_space_exhausted`.

Option 1 feels cleaner. It does make each tuner's implementation more complex. But that's what I'd do. @christopherbunn lmk if you'd like to discuss this / talk about how to implement it.
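For concreteness, here is a minimal sketch of what option 1 could look like. The class and method names follow the comment above, but the bodies are assumptions, not the merged implementation:

```python
from abc import ABC, abstractmethod

class Tuner(ABC):
    """Hypothetical sketch of the Tuner base class described in option 1."""

    @abstractmethod
    def propose(self):
        """Return the next set of parameters to try."""

    def is_search_space_exhausted(self):
        """Whether this tuner has run out of parameters to propose.

        Tuners over unbounded spaces (e.g. random search) can keep this
        default and always return False.
        """
        return False
```

With a default on the base class, only bounded tuners like grid search need to override it, which keeps the added per-tuner complexity contained.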
I like option 1!
evalml/tuners/grid_search_tuner.py
Outdated
```python
Arguments:
    points: The number of points to uniformly sample from \
        Real dimensions.
    random_state: Not used in grid search, kept for compatibility
```
This is missing a description of the `space` parameter.
evalml/tuners/grid_search_tuner.py
Outdated
```python
class GridSearchTuner:
    """Grid Search Optimizer"""

    def __init__(self, space, points=10, random_state=None):
```
This is a nit-pick, but could we call `points` `n_points` instead? To me, the "n" prefix implies that this is a number, rather than a list of points or something.
evalml/tuners/grid_search_tuner.py
Outdated
```python
Arguments:
    points: The number of points to uniformly sample from \
        Real dimensions.
```
What is meant by "Real dimensions"?
"Real dimensions" refers to a search space dimension that can take on any real value. Maybe it would be better to reword this to "The number of points to uniformly sample from Real search spaces"
Ah ok, thanks. Personally I find sklearn's docstring there to be unclear/cryptic. How about "the number of points to sample from along each dimension defined in the `space` argument"?
Yeah I stand by what I said, would be nice to update this docstring a bit
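For context on the terminology in this thread, here is a sketch of a mixed search space using skopt-style dimensions. That the tuners accept `skopt.space` objects is an assumption for illustration, and the values are made up:

```python
from skopt.space import Real, Integer, Categorical

space = [
    Real(1e-4, 1e-1),                  # continuous: sampled at n points per dimension
    Integer(1, 10),                    # discrete range: can be enumerated exactly
    Categorical(['gini', 'entropy']),  # fixed set of choices: used as-is
]
```

Only the `Real` dimension forces a sampling decision; the other two already define a finite grid.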
evalml/tuners/grid_search_tuner.py
Outdated
```python
Arguments:
    points: The number of points to uniformly sample from \
        Real dimensions.
    random_state: Not used in grid search, kept for compatibility
```
I think you can just say "unused in this class"
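Pulling the suggestions from these docstring threads together (the missing `space` description, the `n_points` rename, the "dimension defined in the space argument" wording, and "unused in this class"), the revised docstring might read something like this sketch, not the merged code:

```python
class GridSearchTuner:
    """Grid search optimizer.

    Arguments:
        space (list): the search space of dimensions to iterate over.
        n_points (int): the number of points to sample from along each
            dimension defined in the space argument.
        random_state: unused in this class; kept for compatibility.
    """

    def __init__(self, space, n_points=10, random_state=None):
        ...
```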
evalml/tuners/grid_search_tuner.py
Outdated
```python
    else:
        return Exception("Invalid dimension type in tuner")
    raw_dimensions.append(range_values)
self.grid_points = list(itertools.product(*raw_dimensions))
```
By calling `list()` here, you're coercing the `itertools.product` generator into a list. That will cause all the combinations to be generated here, which could be harmful performance-wise as we increase `n_points`.

Fortunately, iterators are awesome! Here you can just say

```python
self._grid_points = itertools.product(*raw_dimensions)
```

And then:

```python
def propose(self):
    try:
        return next(self._grid_points)
    except StopIteration:
        raise NoParamsException("Grid search has exhausted all possible parameters.")
```

Here's a page from the Python docs which shows the approximate code used to implement `itertools.product`. Note the use of `yield`, which hands back one combination at a time rather than generating all of the output at that moment. Lazy computation ftw :)
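A small illustrative demo of that laziness (not from the PR):

```python
import itertools

# product() returns a lazy iterator; no combinations exist until requested.
grid = itertools.product(range(3), ['a', 'b'])
print(next(grid))   # (0, 'a')
print(next(grid))   # (0, 'b')
# By contrast, list(itertools.product(...)) would materialize all
# 3 * 2 = 6 combinations up front.
```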
evalml/tuners/grid_search_tuner.py
Outdated
```python
    self.grid_points = list(itertools.product(*raw_dimensions))

def add(self, parameters, score):
    # Since this is a grid search, we don't need to store the results.
```
My personal preference is to delete this comment. In general I think code comments (distinct from docstrings) should only exist if they explain something which would be nearly impossible to discern from just reading the code. But that's just my opinion and I respect if you feel differently.
Hey @dsherry and @jeremyliweishih, thanks for taking a look a few days ago. I incorporated most of the changes you all suggested. A few notes:

Re: I gave updating the `Tuner` API a shot so that it handles this stopping condition more gracefully. I'm not sure if it's the most efficient method, especially since it requires moving up getting the type of pipeline in

Re: pipeline mocking, I didn't really push much on this front. While it would definitely speed up performance, I think it would be better to have a consistent mocking strategy with a broader implementation. The current tests are fairly minimal in terms of search & training. IMO, mocking fits better in a separate PR, but again I'm open to thoughts/opinions.
Looks good from the questions I had!
```python
# get new pipeline and check tuner
self._current_pipeline_class = self._select_pipeline()
if self.tuners[self._current_pipeline_class.name].is_search_space_exhausted():
    return False
```
This looks great for now :) good going
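For context, here is a sketch of how this check might slot into the rest of the stopping condition. The `max_time` attribute and the surrounding shape are assumptions for illustration, not the actual diff:

```python
import time

# Sketch of an AutoBase method; `start` is assumed to be the
# time.time() value captured when the search began.
def _check_stopping_condition(self, start):
    # get new pipeline and check tuner
    self._current_pipeline_class = self._select_pipeline()
    if self.tuners[self._current_pipeline_class.name].is_search_space_exhausted():
        return False
    # assumed: stop once the elapsed time exceeds the configured budget
    if self.max_time is not None and time.time() - start > self.max_time:
        return False
    return True
```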
evalml/automl/auto_base.py
Outdated
```diff
@@ -178,6 +178,11 @@ def search(self, X, y, feature_types=None, raise_errors=False, show_iteration_pl
         pbar.close()

+    def _check_stopping_condition(self, start):
+        # get new pipeline and check tuner
+        self._current_pipeline_class = self._select_pipeline()
```
Great. Let's add this to `__init__`:

```python
self._current_pipeline_class = None
```
And can we rename this to `_next_pipeline_class`? I feel like that's a better description because it's the next class we'll be using in `_do_iteration`.
But I acknowledge this is really just a nit-pick haha
This looks great! Awesome work, particularly on updating `auto_base.py`. I agree with @jeremyliweishih's suggestion about mocking in the tests. And I left a couple small comments. Otherwise, 🚢 :)
Resolves #230