Handle index columns in AutoMLSearch and DataChecks#2138
jeremyliweishih merged 30 commits into main from …
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #2138    +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         293      293
  Lines       24009    24056      +47
=========================================
+ Hits        23999    24046      +47
  Misses         10       10
Continue to review full report at Codecov.
chukarsten
left a comment
Looks good to me - feel free to take the nit or leave it!
evalml/automl/automl_search.py
Outdated
    else:
        self.pipeline_parameters['Drop Columns Transformer']['columns'].extend(index_columns)
elif len(index_columns) > 0:
    self.pipeline_parameters['Drop Columns Transformer'] = {}
nit: self.pipeline_parameters['Drop Columns Transformer'] = {"columns": index_columns}
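The nit amounts to initializing the dict with the columns directly rather than starting from an empty dict. A standalone sketch of the merged logic (the names and the 'index_col' value are hypothetical stand-ins, not the real AutoMLSearch attributes):

```python
pipeline_parameters = {}          # stand-in for self.pipeline_parameters
index_columns = ['index_col']     # hypothetical index column found in X

if 'Drop Columns Transformer' in pipeline_parameters:
    # Merge into any user-supplied drop columns.
    pipeline_parameters['Drop Columns Transformer'].setdefault('columns', []).extend(index_columns)
else:
    # The nit: initialize with the columns in one step instead of an empty dict.
    pipeline_parameters['Drop Columns Transformer'] = {"columns": index_columns}

print(pipeline_parameters['Drop Columns Transformer'])  # {'columns': ['index_col']}
```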
bchen1116
left a comment
Looks good! I left a few comments on testing just to be nit-picky heh
def validate(self, X, y):
    return {"warnings": [], "errors": [], "actions": []}

assert MockDataCheck().validate(X, y)
is this assert necessary? Can we just assert that it equals the dict of empty warnings/errors/actions?
I needed it for codecov to pass or else it would fail on return {"warnings": [], "errors": [], "actions": []} 😄
validate_args = MockDataCheck.validate.call_args_list
for arg in validate_args:
    assert 'index_col' not in arg[0][0].columns
nit, but can we assert that col 0 is in the arg cols just to check that we are checking what we expect to be X?
evalml/automl/automl_search.py
Outdated
drop_columns = self.pipeline_parameters['Drop Columns Transformer']['columns'] if 'Drop Columns Transformer' in self.pipeline_parameters else None
if len(index_columns) > 0 and drop_columns is not None:
    index_columns.extend(drop_columns)
    self.pipeline_parameters['Drop Columns Transformer']['columns'] = index_columns
This will cause errors if you run search(). In test_automl_drop_index_columns I get something like this:
ValueError: Default parameters for components in pipeline Decision Tree Classifier w/ Imputer + Drop Columns Transformer not in the hyperparameter ranges
Point (['most_frequent', 'mean', ['index_col'], 'gini', 'auto', 6]) is not within the bounds
of the space ([('most_frequent',), ('mean', 'median', 'most_frequent'), ('index_col',), ('gini', 'entropy'), ('auto', 'sqrt', 'log2'), (4, 10)]).

When you set a list as a hyperparameter value, skopt will treat it as a categorical where the valid values are the elements of the list, not the list itself. This is similar to #2130. Maybe we should think of a general solution for this problem and #2130. Otherwise we'll have to modify IterativeAlgorithm to specifically pass parameters to the drop columns transformer.
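The failure mode can be illustrated without skopt itself (a minimal sketch; random.choice stands in for the tuner's categorical sampling): a list-valued hyperparameter becomes a categorical over its elements, so the tuner proposes the string 'index_col' where the Drop Columns Transformer expects the whole list ['index_col'].

```python
import random

# The parameter the user means: drop this list of columns.
columns_param = ['index_col']

# What the tuner does with a list-valued hyperparameter "range": treat the
# elements as categorical choices and sample one of them.
proposed = random.choice(columns_param)

assert proposed == 'index_col'     # a bare string was proposed...
assert proposed != columns_param   # ...not the list the component expects
```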
allowed_estimators = get_estimators(self.problem_type, self.allowed_model_families)
logger.debug(f"allowed_estimators set to {[estimator.name for estimator in allowed_estimators]}")

index_columns = list(self.X_train.select('index').columns)
What happens when users specify their own pipelines via allowed_pipelines but there isn't a DropColumnsTransformer but they have an index column?
In the current impl the index column will be passed down to the estimator. My thinking was that if a user specifies their own pipelines via allowed_pipelines, they don't want AutoML to come up with preprocessing components/pipelines and will use their own preprocessing components in their own pipelines. This should be consistent (from my understanding) with behavior like when a user has null values but doesn't specify an imputer in allowed_pipelines.
I think that makes sense! I agree it's consistent with our previous design choices. My one concern is that if the index column won't be picked up by the data checks, users may not know that there's anything "wrong" with not explicitly dropping it before AutoMLSearch. But maybe that's too small a set of users to worry about. If someone explicitly sets a woodwork index, they probably know what that means.
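Per the discussion above, a user supplying their own allowed_pipelines without a Drop Columns Transformer would need to drop the index column themselves before calling search. A plain-pandas sketch of that workaround (column names hypothetical):

```python
import pandas as pd

X = pd.DataFrame({'index_col': range(3), 'feature': [10, 20, 30]})

# With user-supplied allowed_pipelines, AutoMLSearch won't inject a
# Drop Columns Transformer, so drop the index column up front.
X = X.drop(columns=['index_col'])

print(list(X.columns))  # ['feature']
```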
@freddyaboulton @angela97lin @dsherry @chukarsten I made the following changes (an easy diff view here):
The fundamental issue I wanted to address with these changes is that user-defined pipeline/component initialization values and user-defined hyperparameters were bundled together in …
@jeremyliweishih Nice! One concern I have with adding the …

I'm not sure if that's good for our tuner? Doing …

I'm wondering if it is better to unblock this PR without introducing an argument to …

I guess, in my mind, I'm hopeful we'll be able to properly treat the concept of …
@freddyaboulton I removed the additional changes and put this PR back in scope of just adding the ability to handle index columns. However, I also removed the handling of additional parameters passed in to …
freddyaboulton
left a comment
@jeremyliweishih Looks great!
@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.3})
@patch('evalml.automl.engine.sequential_engine.train_pipeline')
def test_automl_drop_index_columns(mock_train, mock_binary_score, X_y_binary):
        component_parameters[param_name] = value.rvs(random_state=self.random_seed)
    else:
        component_parameters[param_name] = value
if name in self._pipeline_params and name == 'Drop Columns Transformer' and self._batch_number > 0:
Nice. I think we'll hopefully get rid of this line soon in #2130 but I agree this is the way to get it done for now!
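The special case under discussion can be sketched as a standalone function (a hypothetical helper that mirrors the guard's intent, not the real IterativeAlgorithm signature): after batch 0, the pinned Drop Columns Transformer parameters overwrite whatever the tuner proposed, sidestepping the list-as-categorical sampling problem.

```python
def pin_drop_columns(name, component_parameters, pipeline_params, batch_number):
    # After batch 0, replace tuner-proposed values for the Drop Columns
    # Transformer with the pinned pipeline parameters.
    if name in pipeline_params and name == 'Drop Columns Transformer' and batch_number > 0:
        component_parameters.update(pipeline_params[name])
    return component_parameters

params = pin_drop_columns(
    'Drop Columns Transformer',
    {'columns': 'index_col'},  # tuner-proposed: a sampled string, not a list
    {'Drop Columns Transformer': {'columns': ['index_col']}},
    batch_number=1,
)
print(params)  # {'columns': ['index_col']}
```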
Fixes #1862.