Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update components and pipelines to return Woodwork data structures #1668

Merged
merged 118 commits into from Jan 27, 2021

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Jan 8, 2021

Closes #1406

Not too much implementation wise but notes that may be useful:

  • Sometimes it doesn't always work to just call .to_series()/.to_dataframe() on a Woodwork data structure, because the types inferred might not be what we want. Example is: Woodwork could convert to a series w/ Int64. Converting that to numpy would create an array with an object type, which is usually not what we want. 馃槵
  • Partial dependence method update context: previously, we could pass our pipelines directly to scikit-learn's partial dependence method after a bit of finagling (adding attributes expected). This is no longer true because we're returning Woodwork data structures from predict and scikit-learn doesn't know how to handle that. Thus, I needed to wrap our pipeline in scikit_learn_wrapped_estimator (implemented for ensemble), updating the fields now on this wrapped object. IMO kinda cleaner than previous, where we had to delete these ugly attributes we added just for scikit-learn to operate since we don't care for the wrapped obj afterwards :d
  • Imputer handling indices and restoring to original was moved to SimpleImputer. This way, we support the same behavior for SimpleImputer and don't need to maintain it in two separate places since the Imputer is just a combination of SimpleImputers.
  • I noticed there was a lot of duplicate code in fit_features and compute_final_component_features so I refactored using a new helper function, _fit_transform_features_helper
  • Currently, the ComponentGraph object is responsible for keeping track of original logical types passed by a user and transforming the transformed data to the original types as necessary. We will have to update this in Custom woodwork types aren't preserved in components聽#1662
  • Unrelated, but cleaned up docstrings for random state for consistency as I was updating each component.
  • A lot of duplicate code between fit/transform and fit_transform for FeatureSelector. Removed fit_transform.
  • Updated a lot of tests for consistency. Mock fit should still return self. Mock transforms still need to return WW data structures.
  • Combine _compute_predictions for classification and time series classification pipelines.
  • Cleaned up a lot of duplicate impl for components fit/transform/fit_transform and removed if unnecessary.
  • Removed random_state from OutliersDataCheck. No functional change, but since we're currently not using IsolationForest anymore, this isn't necessary.
  • Updated user_guide/components notebook: removed duplicate sections of LinearRegressor code and DropNull code before code generation.

Perf tests graphs (I want to double check on this, but having some issues running the tests):
Time automl takes increased by a bit. In most cases, this was less than a second increase. In the most extreme case, it was a 12 second increase. I think this is expected and not too large.
image

Performance is... odder. I don't know how much of this is due to randomness, and would like to try again but here are the results thus far:
image

More perf test results for the interested:
image

image

@angela97lin angela97lin self-assigned this Jan 8, 2021
Makefile Outdated Show resolved Hide resolved
Copy link
Collaborator

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Angela - wow. I can't believe you went through all that. That was a lot of changes and great attention to detail. My eyes are bleeding a little bit. Hopefully, next time, we can keep it to a little less of a heroic PR? I don't see anything in here that will require a re-review though - just some doc strings to maybe touch up on. Again, great work.

evalml/automl/utils.py Outdated Show resolved Hide resolved
@@ -461,19 +461,20 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
raise ValueError("Pipeline to calculate partial dependence for must be fitted")
if pipeline.model_family == ModelFamily.BASELINE:
raise ValueError("Partial dependence plots are not supported for Baseline pipelines")
if isinstance(pipeline, evalml.pipelines.ClassificationPipeline):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny that you should make this change. I was just looking at this section of the code and questioning it. This certainly explains why it's being done, but it still doesn't feel like the right place to do it. Was there an argument against adding this _estimator_type attribute to the pipeline class itself? I don't think it's in the scope of this PR to do so, but it would be good if we came out of this with an issue determining how to ultimately remove this chunk from partial dependence.

evalml/pipelines/classification_pipeline.py Show resolved Hide resolved
np.testing.assert_allclose(clf.predict(X), get_random_state(0).choice(np.unique(y), len(X)))

predictions = clf.predict(X)
assert_series_equal(pd.Series(get_random_state(0).choice(np.unique(y), len(X)), dtype="Int64"), predictions.to_series())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unclear to me what we're trying to test here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're just trying to test that the baseline classifier's predict method returns as we'd expect, here for the case where the strategy is "random" (just randomly select from target to use as prediction). I'll break this down into two separate lines, so hopefully it'll be a bit more clear :)

evalml/tests/component_tests/test_components.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giant PR! Wow this was a lot of edits.

There's some issues popping up in AutoMLSearch in the docs, here as well. I think this was caused through these changes somewhere, since the errors seem to stem from type errors thrown by ww. Would be useful to fix before merging this PR!

Also, do we want to change the tutorials in the docs to use and return woodwork instead of pandas?

Other than that, left a few comments/questions, but looks good! Hopefully I didn't miss any major errors. This PR was a doozy

evalml/automl/automl_algorithm/automl_algorithm.py Outdated Show resolved Hide resolved
evalml/tests/automl_tests/test_automl.py Outdated Show resolved Hide resolved
evalml/tests/component_tests/test_components.py Outdated Show resolved Hide resolved
evalml/tests/component_tests/test_datetime_featurizer.py Outdated Show resolved Hide resolved
evalml/tests/component_tests/test_lgbm_classifier.py Outdated Show resolved Hide resolved
@angela97lin
Copy link
Contributor Author

@bchen1116 Thank you for reviewing and pointing out the docs issues!! I'll dig into them and see what's up 馃榿

@@ -402,45 +416,6 @@
"AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. "
]
},
{
"cell_type": "code",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is accidental duplicate code, deleting 馃槺

"from evalml.pipelines.components.utils import generate_component_code\n",
"\n",
"class MyDropNullColumns(Transformer):\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to repeat this, I believe the only difference is the name and there's nothing special about this necessary for code gen so deleting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update pipeline and components to return Woodwork data structures
4 participants