Update evalml from selecting using pandas dtypes to selecting using Woodwork logical types #1551

Merged
angela97lin merged 59 commits into main from 1290_woodwork_type_selection
Jan 5, 2021

angela97lin
Contributor

Closes #1290

@angela97lin angela97lin self-assigned this Dec 14, 2020
@codecov

codecov bot commented Dec 14, 2020

Codecov Report

Merging #1551 (13e2737) into main (d4739cf) will not change coverage.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##             main    #1551   +/-   ##
=======================================
  Coverage   100.0%   100.0%           
=======================================
  Files         240      240           
  Lines       18270    18271    +1     
=======================================
+ Hits        18262    18263    +1     
  Misses          8        8           
Impacted Files Coverage Δ
...ts/data_checks_tests/test_id_columns_data_check.py 100.0% <ø> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/target_leakage_data_check.py 100.0% <100.0%> (ø)
evalml/demos/fraud.py 100.0% <100.0%> (ø)
...nents/transformers/dimensionality_reduction/lda.py 100.0% <100.0%> (ø)
...nents/transformers/dimensionality_reduction/pca.py 100.0% <100.0%> (ø)
...components/transformers/encoders/onehot_encoder.py 100.0% <100.0%> (ø)
.../transformers/preprocessing/datetime_featurizer.py 100.0% <100.0%> (ø)
... and 12 more

Contributor

@dsherry dsherry left a comment

@angela97lin looks great! I left a few questions/suggestions

@angela97lin angela97lin requested a review from dsherry December 17, 2020 18:50
@angela97lin
Contributor Author

@dsherry I went through and addressed your comments! I deleted categorical_ww_type and the similar helpers in favor of selecting the correct logical type or semantic tag directly, but I kept numeric_and_boolean_ww around. The alternative would be to check whether the semantic tag is 'numeric' or the logical type is Boolean. In cases where we also want to include categoricals, we would have to check whether either 'numeric' or 'category' is in the semantic tags, or whether the logical type is Boolean. Keeping a list of logical types to check against seemed easier to understand, but lmk if you feel differently!
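
The trade-off described in this comment can be sketched without any library. This is only an illustration: `Column`, the logical type strings, the assumed contents of `numeric_and_boolean_ww`, and both helper functions are hypothetical stand-ins, not Woodwork's or evalml's actual API. Only the name `numeric_and_boolean_ww` comes from the PR.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for a typed column; NOT Woodwork's real class.
@dataclass(frozen=True)
class Column:
    name: str
    logical_type: str
    semantic_tags: frozenset = field(default_factory=frozenset)


# Assumed contents; the thread names this list but does not show it.
numeric_and_boolean_ww = ["Integer", "Double", "Boolean"]


def selected_by_list(col):
    """Approach kept in the PR: one membership test against a list of logical types."""
    return col.logical_type in numeric_and_boolean_ww


def selected_by_tag_or_type(col):
    """Alternative: combine a semantic-tag check with a logical-type check."""
    return "numeric" in col.semantic_tags or col.logical_type == "Boolean"


columns = [
    Column("age", "Integer", frozenset({"numeric"})),
    Column("price", "Double", frozenset({"numeric"})),
    Column("is_fraud", "Boolean"),
    Column("country", "Categorical", frozenset({"category"})),
]

by_list = [c.name for c in columns if selected_by_list(c)]
by_tag = [c.name for c in columns if selected_by_tag_or_type(c)]
```

Under these assumptions both approaches pick the same columns. Extending the first to categoricals means adding one entry to the list, while the second needs another tag clause in the condition.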

@angela97lin angela97lin dismissed dsherry’s stale review January 4, 2021 17:15

Addressed comments

@angela97lin angela97lin merged commit 593b88a into main Jan 5, 2021
@angela97lin angela97lin deleted the 1290_woodwork_type_selection branch January 5, 2021 23:49
Contributor

@dsherry dsherry left a comment

@angela97lin looks great! Sorry I didn't get to this before merge. I left some questions.

X = _convert_woodwork_types_wrapper(X.to_dataframe())
if y.logical_type not in numeric_and_boolean_ww:
    return messages
X_num = X.select(include=numeric_and_boolean_ww)
Contributor

Is there a reason to include boolean too? Perhaps this should just be X.select('numeric')

Contributor Author

I guess this should just depend on whether we want to check for label leakage when the target is boolean, and whether that even makes sense :)
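
Whether a boolean target makes sense here depends on how leakage is measured. As a rough illustration, assuming a Pearson-correlation-based check (an assumption; the check's internals are not shown in this thread), a boolean target can be cast to 0/1 and correlated with a numeric feature. The `pearson` helper and all data below are made up for the example.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up boolean target and two made-up numeric features.
target = [True, False, True, True, False, False]
leaky = [1.0 if t else 0.0 for t in target]   # mirrors the target exactly
unrelated = [2.0, 1.0, 3.0, 1.0, 2.0, 3.0]    # chosen to be uncorrelated

# Cast the boolean target to 0/1 before correlating.
target_as_int = [int(t) for t in target]
leaky_corr = pearson(leaky, target_as_int)
unrelated_corr = pearson(unrelated, target_as_int)
```

Under this assumption the computation is well defined for a boolean target: the feature that mirrors the target correlates perfectly with it, which is the case a leakage check would want to flag.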


def fit(self, X, y=None):
    top_n = self.parameters['top_n']
    X = _convert_to_woodwork_structure(X)
    if self.features_to_encode is None:
        self.features_to_encode = self._get_cat_cols(X)
Contributor

Did moving this code change any behavior? Or was it just cleaner this way, but functionally equivalent?

Contributor Author

It shouldn't change behavior. We moved it because we want to take advantage of Woodwork, so X has to be passed in as a Woodwork structure first and only converted later :)

@@ -77,8 +76,7 @@ def __init__(self, features_to_extract=None, encode_as_categories=False, random_

def fit(self, X, y=None):
    X = _convert_to_woodwork_structure(X)
    X = _convert_woodwork_types_wrapper(X.to_dataframe())
    self._date_time_col_names = X.select_dtypes(include=datetime_dtypes).columns
    self._date_time_col_names = X.select(include=["datetime"]).columns
Contributor

Can't we do X.select('datetime').columns? Just curious, because we're doing that elsewhere for numeric.

@@ -26,9 +25,9 @@ def fit(self, X, y):
"""
X = _convert_to_woodwork_structure(X)
y = _convert_to_woodwork_structure(y)
if "numeric" not in y.semantic_tags:
Contributor

Is there an is_numeric util method in woodwork which we can use? Or is this the recommended way to do this check? @gsheni

Contributor

Yes, there is (though it wasn't meant to be user facing).
https://github.com/alteryx/woodwork/blob/main/woodwork/datacolumn.py#L318
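
For comparison, the inline check in the snippet under review (`"numeric" not in y.semantic_tags`) could be wrapped in a small helper. The sketch below is hypothetical: this `is_numeric` function is written for illustration and is not the internal Woodwork method linked above, and `SimpleNamespace` merely stands in for any object exposing a `semantic_tags` set.

```python
from types import SimpleNamespace


def is_numeric(column):
    """Hypothetical helper: True if the column carries the 'numeric' semantic tag."""
    return "numeric" in column.semantic_tags


# Stand-ins for typed target columns; only the semantic_tags attribute is modeled.
y_numeric = SimpleNamespace(semantic_tags={"numeric"})
y_categorical = SimpleNamespace(semantic_tags={"category"})

numeric_result = is_numeric(y_numeric)
categorical_result = is_numeric(y_categorical)
```

A helper like this keeps the tag name in one place, at the cost of one more utility to maintain alongside whatever Woodwork itself provides.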

ParthivNaresh added a commit that referenced this pull request Jan 6, 2021
* First round

* Removed RMSLE, MSLE, and MAPE from non_core_objectives

* Add objective parameter to all data_check subclasses

* Add new data_check_message_code for a target that is incompatible with an objective

* Add check to invalid_target_data_check for RMSLE, MSLE, and MAPE

* Invalid Target Data Check update

* Fix test case to not include invalid objectives in regression_core_objectives

* Test updates

* Invalid target data check update

* Lint

* Release notes

* Tests and updates to code

* Release notes

* Latest Dependencies Version update

* Lint fix

* Dependency check

* Fixing minor issues

* test none objective update

* Update to reflect changes in #1597

* Make objective mandatory for DefaultDataChecks and InvalidTargetDataCheck

* Mock data check to see if objective is being passed from AutoML

* Fix breaking tests due to mandatory objective param in DefaultDataChecks, and check error output from DefaultDataChecks

* Change label to all positive so RMSLE, MSLE, and MAPE can be included in regression_core_objectives

* Raise error for None objective in InvalidTargetDataCheck

* Docstring example fix

* Lint and docstring example

* Jupyter notebook update for build_docs

* Jupyter notebook update

* remove print statement

* Added test cases and docstring, updated error message

* test case fixes

* Test update

* lint error

* Add positive_only class attribute for ObjectiveBase

* Adding @classproperty to ObjectiveBase for positive_only check

* Lint error

* Fix conflicts with #1551

* Lint
Successfully merging this pull request may close these issues.

Update components from selecting using pandas dtypes to selecting using DataTable semantic or logical types
5 participants