
Handle index columns in AutoMLSearch and DataChecks#2138

Merged
jeremyliweishih merged 30 commits into main from js_1862_drop_index
Apr 21, 2021

Conversation

@jeremyliweishih
Collaborator

Fixes #1862.

@jeremyliweishih jeremyliweishih marked this pull request as ready for review April 14, 2021 19:45
@codecov

codecov bot commented Apr 14, 2021

Codecov Report

Merging #2138 (7883ea8) into main (66cb6a0) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2138     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         293      293             
  Lines       24009    24056     +47     
=========================================
+ Hits        23999    24046     +47     
  Misses         10       10             
Impacted Files Coverage Δ
...lml/automl/automl_algorithm/iterative_algorithm.py 100.0% <100.0%> (ø)
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/data_checks/data_checks.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)


Contributor

@chukarsten chukarsten left a comment
Looks good to me - feel free to take the nit or leave it!

else:
self.pipeline_parameters['Drop Columns Transformer']['columns'].extend(index_columns)
elif len(index_columns) > 0:
self.pipeline_parameters['Drop Columns Transformer'] = {}
Contributor

nit: self.pipeline_parameters['Drop Columns Transformer'] = {"columns": index_columns}

Contributor

@bchen1116 bchen1116 left a comment

Looks good! I left a few comments on testing just to be nit-picky heh

def validate(self, X, y):
return {"warnings": [], "errors": [], "actions": []}

assert MockDataCheck().validate(X, y)
Contributor

is this assert necessary? Can we just assert that it equals the dict of empty warnings/errors/actions?

Collaborator Author

I needed it for codecov to pass or else it would fail on return {"warnings": [], "errors": [], "actions": []} 😄


validate_args = MockDataCheck.validate.call_args_list
for arg in validate_args:
assert 'index_col' not in arg[0][0].columns
Contributor

nit, but can we assert that col 0 is in the arg cols just to check that we are checking what we expect to be X?
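A minimal sketch of the strengthened assertion bchen1116 suggests (FakeFrame and mock_check are stand-ins for illustration, not the test's actual fixtures):

```python
from unittest.mock import MagicMock

class FakeFrame:
    """Minimal stand-in for a DataFrame (hypothetical)."""
    def __init__(self, columns):
        self.columns = list(columns)

mock_check = MagicMock()
X = FakeFrame(columns=[0, 1])  # 'index_col' already dropped upstream
mock_check.validate(X, None)

for args, _kwargs in mock_check.validate.call_args_list:
    # original check: the index column was dropped
    assert "index_col" not in args[0].columns
    # suggested extra check: a known feature column is present,
    # confirming the positional arg really is X
    assert 0 in args[0].columns
```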

drop_columns = self.pipeline_parameters['Drop Columns Transformer']['columns'] if 'Drop Columns Transformer' in self.pipeline_parameters else None
if len(index_columns) > 0 and drop_columns is not None:
index_columns.extend(drop_columns)
self.pipeline_parameters['Drop Columns Transformer']['columns'] = index_columns
Contributor

@freddyaboulton freddyaboulton Apr 15, 2021

This will cause errors if you run search(). In test_automl_drop_index_columns I get something like this:

ValueError: Default parameters for components in pipeline Decision Tree Classifier w/ Imputer + Drop Columns Transformer not in the hyperparameter ranges
Point (['most_frequent', 'mean', ['index_col'], 'gini', 'auto', 6]) is not within the bounds 
of the space ([('most_frequent',), ('mean', 'median', 'most_frequent'), ('index_col',), ('gini', 'entropy'), ('auto', 'sqrt', 'log2'), (4, 10)]).

When you set a list as a hyperparameter value, skopt will treat it as a categorical where the valid values are the elements of the list, not the list itself. This is similar to #2130. Maybe we should think of a general solution for this problem and #2130. Otherwise we'll have to modify IterativeAlgorithm to specifically pass parameters to the drop columns transformer.
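The list-as-categorical pitfall can be illustrated without skopt itself; interpret_hyperparameter below is a hypothetical stand-in for how a skopt-style tuner reads a value, not evalml or skopt code:

```python
import random

def interpret_hyperparameter(value):
    """Mimic how a skopt-style tuner reads a hyperparameter value:
    a list becomes a categorical space whose valid values are the
    list's *elements*, not the list itself (hypothetical helper)."""
    if isinstance(value, list):
        return {"type": "categorical", "choices": value}
    return {"type": "fixed", "value": value}

# Passing ['index_col'] as the Drop Columns Transformer's 'columns'
# parameter yields a categorical over the bare string 'index_col',
# so the tuner proposes the string rather than the list.
space = interpret_hyperparameter(["index_col"])
proposed = random.choice(space["choices"])  # -> "index_col"
```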

allowed_estimators = get_estimators(self.problem_type, self.allowed_model_families)
logger.debug(f"allowed_estimators set to {[estimator.name for estimator in allowed_estimators]}")

index_columns = list(self.X_train.select('index').columns)
Contributor

What happens when users specify their own pipelines via allowed_pipelines but there isn't a DropColumnsTransformer but they have an index column?

Collaborator Author

In the current implementation the index column will be passed down to the estimator. My thinking was that if a user specifies their own pipelines via allowed_pipelines, they don't want AutoML to come up with preprocessing components/pipelines and will use their own preprocessing components in their own pipelines. This should be consistent (from my understanding) with existing behavior, e.g. if a user has null values but doesn't specify an imputer in allowed_pipelines.

Contributor

I think that makes sense! I agree it's consistent with our previous design choices. My one concern is that since the index column won't be picked up by the data checks, users may not know that there's anything "wrong" with not explicitly dropping it before AutoMLSearch. But maybe that's too small a set of users to worry about. If someone explicitly sets a woodwork index, they probably know what that means.

@jeremyliweishih
Collaborator Author

jeremyliweishih commented Apr 19, 2021

@freddyaboulton @angela97lin @dsherry @chukarsten

I made the following changes (an easy diff view here):

  • Added custom_hyperparameters parameter to AutoMLSearch and IterativeAlgorithm - this behaves mostly like how pipeline_parameters behaved previously
  • Changed pipeline_parameters parameter to only initialize pipelines with a singular value

The fundamental issue I wanted to address with these changes is that user-defined pipeline/component initialization values and user-defined hyperparameter ranges were bundled together in pipeline_parameters. In the above changes I changed pipeline_parameters to represent the values we want pipelines and components to be initialized with, and custom_hyperparameters to represent user-defined ranges. However, this change doesn't support pipeline_parameters that fall outside a component's hyperparameter ranges (which may be why these two features were bundled into one previously). What do you guys think of this change? I wanted to make sure we're on board with this before ironing out the details and updating the docs (for example: I think if component_class.name in self._pipeline_params and self._batch_number == 0: needs to be changed to just if component_class.name in self._pipeline_params so that we use those values for every batch).
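The proposed split can be sketched with two plain dicts (the component and parameter names below are illustrative, not taken verbatim from the PR):

```python
# Sketch of the proposed split between initialization values and
# search ranges (illustrative component/parameter names).
pipeline_parameters = {
    # fixed values used to *initialize* components
    "Imputer": {"numeric_impute_strategy": "median"},
}
custom_hyperparameters = {
    # user-defined *ranges* for the tuner to search over
    "Imputer": {"numeric_impute_strategy": ["mean", "median"]},
}
```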

@freddyaboulton
Contributor

@jeremyliweishih Nice! One concern I have with adding the pipeline_parameters is that there are two ways to do the same thing. Users can do pipeline_parameters={"Logistic Regression Classifier": {"C": 0.5}} and custom_hyperparameters={"Logistic Regression Classifier": {"C": Categorical([0.5])}} (assuming we fix the bug with freezing hyperparameters to something other than the default value, e.g. #2131).

I'm not sure if that's good for our tuner? Doing pipeline_parameters={"Logistic Regression Classifier": {"C": 0.5}} means that the tuner will propose values for C (because the hyperparameter ranges have not been updated) but we'll set the value to 0.5 without the tuner knowing. I think the tuner should know the space has been restricted?

I'm wondering if it is better to unblock this PR without introducing an argument to AutoMLSearch? I think we can do this by keeping the custom_hyperparameters and pipeline_params argument to IterativeAlgorithm and setting the arguments to the Drop Columns Transformer in pipeline_params before creating the IterativeAlgorithm. We do something similar with the problem_configuration now.

I guess, in my mind, pipeline_parameters is just a special case of custom_hyperparameters (in the former, the search space only has only element). Rather than exposing that distinction to users, I think it's better to handle it internally.

I'm hopeful we'll be able to properly treat the concept of pipeline_parameters as a search space with one element in the future. But until we get there, I think adding pipeline_parameters to IterativeAlgorithm will suffice (and unblock @angela97lin 's drop columns data check actions work. I'm fairly certain we would have run into this same problem there!).
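The "pipeline_parameters is a search space with one element" idea could be sketched like this (to_search_space is a hypothetical helper for illustration, not part of evalml):

```python
def to_search_space(pipeline_parameters):
    """Treat each fixed parameter value as a categorical space with a
    single element (hypothetical helper sketching the idea above)."""
    return {
        component: {param: [value] for param, value in params.items()}
        for component, params in pipeline_parameters.items()
    }

# A fixed C=0.5 becomes the one-element categorical [0.5], so the
# tuner's notion of the space matches what the pipeline actually uses.
space = to_search_space({"Logistic Regression Classifier": {"C": 0.5}})
# space == {"Logistic Regression Classifier": {"C": [0.5]}}
```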

@jeremyliweishih
Collaborator Author

@freddyaboulton I removed the additional changes and put this PR back in scope: just adding the ability to handle index columns. However, I also removed the handling of additional parameters passed in to pipeline_parameters for Drop Columns Transformer, since #2028 disallows lists as input for pipeline_parameters. I believe we can add this functionality back once we come up with a solution for #2130 and #2131.

Contributor

@freddyaboulton freddyaboulton left a comment

@jeremyliweishih Looks great!


@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.3})
@patch('evalml.automl.engine.sequential_engine.train_pipeline')
def test_automl_drop_index_columns(mock_train, mock_binary_score, X_y_binary):
Contributor

Nice test!

component_parameters[param_name] = value.rvs(random_state=self.random_seed)
else:
component_parameters[param_name] = value
if name in self._pipeline_params and name == 'Drop Columns Transformer' and self._batch_number > 0:
Contributor

Nice. I think we'll hopefully get rid of this line soon in #2130 but I agree this is the way to get it done for now!



Development

Successfully merging this pull request may close these issues.

Automatically drop features tagged as "primary key"

4 participants