Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support custom woodwork types in components #1784

Merged
merged 50 commits into from Feb 11, 2021

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Feb 2, 2021

Closes #1662. Another large-ish PR but not too much logic change--just replacing the final line to convert to WW to new util method, and writing a lot of tests. I initially wanted to iterate through all the components and test like the pattern we already have in test_components.py but realized that each component has its own nuanced behavior, so decided against it and added the tests to their individual files. The parameterization is similar in many cases though. 馃槵

  • Moves Woodwork-related util methods to woodwork_utils.py and move tests to test_woodwork_utils.py
  • Add private utility method to help best reassign logical types after transformation.
  • Merged _convert_to_woodwork_structure with infer_feature_types by combining the two implementations.

Notes on compatible and incompatible types. Mostly for my own understanding, but also helped inform test cases.

Datetime

  • Datetime --> Double: errors
  • Datetime --> Integer: errors
  • Datetime --> Categorical: works
  • Datetime --> Natural Language: works

Categorical

  • Categorical --> Datetime: errors
  • Categorical --> Double: works, sometimes. ("1", "2" okay, "a", "b" not okay)
  • Categorical --> Natural Language: works
  • Categorical --> Integer: errors

Integer

  • Integer --> Datetime: works... but isn't great lol (literally .000000001 ms)
  • Integer --> Double: works
  • Integer --> Categorical: works
  • Integer --> Natural Language: works

Double

  • Double --> Datetime: works... but isn't great lol (literally .000000001 ms)
  • Double --> Integer: works sometimes (if possible to convert)
  • Double --> Categorical: works
  • Double --> Natural Language: works

**Natural Language **

  • Natural Language --> Datetime: errors

  • Natural Language --> Double: works, sometimes

  • Natural Language --> Categorical: works

  • Natural Language --> Integer: works, sometimes

  • There's no good way of confirming when it is safe to convert floats to category or when it is not without trying. For example: can convert ['1', '2'] (categories) to float, but not ['a', 'b']. Hence, current implementation will try to set the type. If it fails, then don't do anything. If it succeeds, add to logical type dictionary to pass along to the initialization of Woodwork DataTable, so that Woodwork does not have to do type inference, saving us some time.

  • Look into: featuretools component which changed the columns of the original input DF?

@angela97lin angela97lin self-assigned this Feb 2, 2021
@codecov
Copy link

codecov bot commented Feb 3, 2021

Codecov Report

Merging #1784 (ad57157) into main (5cba8a9) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1784     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         252      255      +3     
  Lines       20250    20564    +314     
=========================================
+ Hits        20242    20556    +314     
  Misses          8        8             
Impacted Files Coverage 螖
evalml/tests/utils_tests/test_gen_utils.py 100.0% <酶> (酶)
evalml/utils/gen_utils.py 99.6% <酶> (-<0.1%) 猬囷笍
evalml/automl/automl_search.py 99.7% <100.0%> (-<0.1%) 猬囷笍
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (酶)
evalml/data_checks/data_checks.py 100.0% <100.0%> (酶)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (酶)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (酶)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (酶)
evalml/data_checks/multicollinearity_data_check.py 100.0% <100.0%> (酶)
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (酶)
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 5cba8a9...ad57157. Read the comment docs.

@@ -572,3 +572,5 @@ General Utils
drop_rows_with_nans
infer_feature_types
save_plot
is_all_numeric
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change but adding to API ref :)

@@ -17,8 +17,6 @@

logger = get_logger(__file__)

numeric_and_boolean_ww = [ww.logical_types.Integer, ww.logical_types.Double, ww.logical_types.Boolean]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all Woodwork related utils to woodwork_utils.py :)

@@ -117,7 +117,7 @@ def test_simple_imputer_all_bool_return_original(data_type, make_data_type):

@pytest.mark.parametrize("data_type", ['pd', 'ww'])
def test_simple_imputer_bool_dtype_object(data_type, make_data_type):
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object)
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype='boolean')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This update (and many of the others made in the tests) was made because now we try to convert the columns back to the original types. Before, the original dtype was specified as object, so the column would have been stored as a categorical col and we would have expected X_expected_arr to return a categorical col. We need to explicitly set to boolean for the expected outcome to be boolean.

I think these conversions make sense over previous behavior--user input is boolean, so output is boolean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Glad we don't have to do this anymore.

@@ -34,17 +35,17 @@ def transform(self, X, y=None):
Returns:
ww.DataTable: Transformed X
"""
X_ww = _convert_to_woodwork_structure(X)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to move these out of the try/except because I think it's more clear what the try/except is for. If it errors out while doing Woodwork conversions (as it did while I was testing stuff), it shouldn't throw a MethodPropertyNotFoundError :)

@@ -271,153 +268,6 @@ def test_rename_column_names_to_numeric():
assert X_renamed.logical_types == {0: ww.logical_types.Categorical, 1: ww.logical_types.Categorical}


def test_convert_woodwork_types_wrapper_with_nan():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to test_woodwork_utils.py

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin Looks good to me! I have some questions on the unit tests but nothing blocking.

@@ -117,7 +117,7 @@ def test_simple_imputer_all_bool_return_original(data_type, make_data_type):

@pytest.mark.parametrize("data_type", ['pd', 'ww'])
def test_simple_imputer_bool_dtype_object(data_type, make_data_type):
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object)
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype='boolean')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Glad we don't have to do this anymore.

evalml/tests/component_tests/test_standard_scaler.py Outdated Show resolved Hide resolved
evalml/tests/component_tests/test_imputer.py Outdated Show resolved Hide resolved
transformed = dft.transform(X, y)
assert isinstance(transformed, ww.DataTable)
if logical_type == Integer or logical_type == Double or logical_type == Categorical:
assert transformed.logical_types == {0: logical_type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should modify the delayed feature transformer to carry over the ints. I'd be happy to file an issue if you are on board!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that sounds good! I'm guessing shift just returns floats which is why we end up with Doubles right now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

evalml/tests/component_tests/test_datetime_featurizer.py Outdated Show resolved Hide resolved
@angela97lin angela97lin added this to the Sprint 2021 Feb A milestone Feb 10, 2021
Copy link
Collaborator

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man. Great work getting through all this and testing it all...what a feat of endurance. Just a few questions I had here and there. If you want to get back to me on them, just ping me on Slack so I can go check it out, but nothing to really stop merge.

X = _convert_to_woodwork_structure(X)
X = infer_feature_types(X)
if y is not None:
y = _convert_to_woodwork_structure(y)
y = infer_feature_types(y)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my boy right there! I DREAM about this code chunk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I've seen this so often I swear I do too 馃槅

X_ww = infer_feature_types(X)
X = _convert_woodwork_types_wrapper(X_ww.to_dataframe())
if y is not None:
y = infer_feature_types(y)
y = _convert_woodwork_types_wrapper(y.to_series())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There it is <3

X = _convert_to_woodwork_structure(X)
cat_cols = list(X.select('category').columns)
X = infer_feature_types(X)
cat_cols = list(X.select(['category', 'boolean']).columns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prompted this? This seems like a fine change...but just curious. This comment is also a bookmark so I can go make sure this is tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can double check we're testing this but IIRC, it's so that we treat boolean columns like categorical cols. We don't want to pass it to our numeric_imputer which might try to use "mean" as the impute strategy and end up with nonsense for the imputed values.

try:
X = ww.DataTable(X_df, logical_types={0: logical_type})
except TypeError:
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm tired, but what is the intent behind this chunk here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so basically, I'm trying every possible logical type conversion that's allowed (see desc of PR for allowed vs error), but rather than manually keeping track of all possible conversions, I'm just trying. Ex: Int --> Nat Lang is valid conversion and therefore a valid override a user might provide to EvalML for us to test. Nat Lang --> Int might not always be (and will raise a TypeError). If its not, don't try to pass to EvalML to test.

@angela97lin angela97lin merged commit d4e2f54 into main Feb 11, 2021
2 checks passed
@angela97lin angela97lin deleted the 1662_custom_woodwork_in_components branch February 11, 2021 22:53
@chukarsten chukarsten mentioned this pull request Feb 23, 2021
@dsherry dsherry mentioned this pull request Mar 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Custom woodwork types aren't preserved in components
5 participants