Add DateTimeNaNDataCheck #2039

Merged
merged 32 commits into from Mar 31, 2021


jeremyliweishih
Contributor

Fixes #1999.

codecov bot commented Mar 25, 2021

Codecov Report

Merging #2039 (79fff01) into main (8556450) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2039     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         280      282      +2     
  Lines       22934    23004     +70     
=========================================
+ Hits        22925    22995     +70     
  Misses          9        9             
| Impacted Files | Coverage Δ |
|---|---|
| `evalml/data_checks/__init__.py` | 100.0% <100.0%> (ø) |
| `evalml/data_checks/data_check_message_code.py` | 100.0% <100.0%> (ø) |
| `evalml/data_checks/datetime_nan_data_check.py` | 100.0% <100.0%> (ø) |
| `evalml/data_checks/default_data_checks.py` | 100.0% <100.0%> (ø) |
| `evalml/tests/data_checks_tests/test_data_checks.py` | 100.0% <100.0%> (ø) |
| `.../data_checks_tests/test_datetime_nan_data_check.py` | 100.0% <100.0%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jeremyliweishih jeremyliweishih marked this pull request as ready for review March 25, 2021 21:58
@@ -19,11 +20,12 @@ class DefaultDataChecks(DataChecks):
- `InvalidTargetDataCheck`
- `NoVarianceDataCheck`
- `ClassImbalanceDataCheck` (for classification problem types)
- `DateTimeNaNDataCheck`

"""

_DEFAULT_DATA_CHECK_CLASSES = [HighlyNullDataCheck, IDColumnsDataCheck,
Contributor Author

Since AutoMLSearch will error out if a user provides datetime columns that contain NaN values, I opted to add this to the default data checks as a safeguard.

Contributor Author

Didn't realize that we don't have data checks as part of AutoMLSearch now. Open to removing this as well.

Contributor

I think it's fine to leave in for now, as we can still use the default data checks with DefaultDataChecks().validate(X_train, y_train). As our data check process updates, this might change, but I see no issues in adding it here
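
For readers following this thread, here is a minimal sketch of the pattern being discussed: a container that instantiates a fixed list of check classes and merges their results in a single `validate` call. It is not evalml's implementation. `DefaultChecksSketch`, `NoRowsCheck`, `MissingTargetCheck`, and the dict-of-lists input are hypothetical stand-ins; only the `warnings`/`errors` result shape follows the test expectations quoted in this PR.

```python
class NoRowsCheck:
    """Hypothetical check: report an error for every column with no rows."""

    def validate(self, X, y=None):
        errors = [
            {"message": f"Column {name} has no rows.", "details": {"column": name}}
            for name, values in X.items()
            if len(values) == 0
        ]
        return {"warnings": [], "errors": errors}


class MissingTargetCheck:
    """Hypothetical check: warn when no target is supplied."""

    def validate(self, X, y=None):
        warnings = []
        if y is None:
            warnings.append({"message": "No target provided.", "details": {}})
        return {"warnings": warnings, "errors": []}


class DefaultChecksSketch:
    """Runs every check in a class-level list and merges the results."""

    # Adding a new check to the defaults means appending its class here.
    _DEFAULT_CHECK_CLASSES = [NoRowsCheck, MissingTargetCheck]

    def __init__(self):
        self.checks = [cls() for cls in self._DEFAULT_CHECK_CLASSES]

    def validate(self, X, y=None):
        merged = {"warnings": [], "errors": []}
        for check in self.checks:
            result = check.validate(X, y)
            merged["warnings"].extend(result["warnings"])
            merged["errors"].extend(result["errors"])
        return merged


# Columns are modelled as a plain dict of lists (an assumption of this sketch).
results = DefaultChecksSketch().validate({"a": [1, 2], "b": []}, y=None)
```

Because the container only iterates over the list, a new check can be used on its own or through the defaults without either path depending on AutoMLSearch.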

error_contains_nan = "Input datetime column ({}) contains NaN values. Please impute NaN values or drop this column."


class DateTimeNaNDataCheck(DataCheck):
Contributor Author

Do we have a standardization of NaN vs Null terminology? I decided to use NaN here but can change to null as well.

Collaborator

That's a good question...I don't think nulls fall into the scope of this. We have data checks to alert us to highly null columns and I don't think a null by itself makes the algorithm choke.

Contributor

@bchen1116 bchen1116 left a comment

Code looks good to me! I left a few suggestions on tests to add to solidify our coverage a bit more, and I also agree that adding this datacheck to DefaultDataChecks is fine. PR should be good to go after the added tests though! 👌

"""Checks if datetime columns contain NaN values."""

def __init__(self):
"""Checks each column in the input for datetime features and will issue and error if NaN values are present.
Contributor

typo: will issue an error
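
As a rough illustration of what the docstring describes, the sketch below scans each column, decides whether it holds datetime values, and records an error if any value is missing. It is a plain-Python approximation and not the code in `datetime_nan_data_check.py`: the helper names, the shortened message, and the dict-of-lists input are assumptions, and missing values are modelled as `None` or float NaN (a pandas datetime column would use `NaT` instead).

```python
import math
from datetime import datetime

# Shortened, illustrative message; not the exact string from the PR.
ERROR_CONTAINS_NAN = "Input datetime column ({}) contains NaN values."


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_datetime_column(values):
    """A column counts as datetime if all of its non-missing values are datetimes."""
    present = [v for v in values if not is_missing(v)]
    return bool(present) and all(isinstance(v, datetime) for v in present)


def datetime_nan_check(columns):
    """Return a results dict with one error per datetime column that has missing values."""
    results = {"warnings": [], "errors": []}
    for name, values in columns.items():
        if is_datetime_column(values) and any(is_missing(v) for v in values):
            results["errors"].append(
                {"message": ERROR_CONTAINS_NAN.format(name), "details": {"column": name}}
            )
    return results


X = {
    "signup": [datetime(2021, 3, 25), None, datetime(2021, 3, 31)],
    "last_seen": [datetime(2021, 3, 26), datetime(2021, 3, 27), datetime(2021, 3, 28)],
    "age": [31.0, float("nan"), 45.0],  # has a NaN but is not a datetime column
}
report = datetime_nan_check(X)
```

Only `signup` is reported here: `last_seen` has no missing values, and `age` is numeric, so its NaN is outside the scope of this check.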

@bchen1116
Contributor

Ah, and also add this new data check to the docs

CLAassistant commented Mar 26, 2021

All committers have signed the CLA.

Contributor

@ParthivNaresh ParthivNaresh left a comment

Looks great man

evalml/data_checks/default_data_checks.py (resolved thread)
Contributor

@freddyaboulton freddyaboulton left a comment

@jeremyliweishih Implementation looks solid! I agree with the suggestions to add coverage!

Collaborator

@chukarsten chukarsten left a comment

I think besides the test coverage suggested by @bchen1116, there are a few copy pastas from the highly null data check you used as a template ;). Otherwise, good work!

Contributor

@angela97lin angela97lin left a comment

A bit of copy pasta and testing updates needed, but I also think it might be useful to combine all the columns into a single error. Unlike the other data checks, which return granular info per column, this one only reports which datetime columns have NaN values, so it seems cleaner to just combine them.
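
A small sketch of the difference between the two reporting styles mentioned in this comment, assuming the check has already collected the names of the affected columns. The function names and message templates are hypothetical and are not taken from the merged code.

```python
# Illustrative templates; the wording is an assumption of this sketch.
PER_COLUMN_TEMPLATE = "Input datetime column ({}) contains NaN values."
COMBINED_TEMPLATE = "Input datetime column(s) ({}) contains NaN values."


def per_column_errors(nan_columns):
    """One error message per affected column."""
    return [PER_COLUMN_TEMPLATE.format(name) for name in nan_columns]


def combined_error(nan_columns):
    """A single error message naming every affected column, or none if the list is empty."""
    if not nan_columns:
        return []
    return [COMBINED_TEMPLATE.format(", ".join(nan_columns))]


affected = ["signup", "renewal"]
separate = per_column_errors(affected)
combined = combined_error(affected)
```

With two affected columns, the first style yields two messages and the second yields one, which is the trade-off raised here: the combined form loses nothing because the check carries no per-column detail beyond the name.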

evalml/data_checks/__init__.py (outdated, resolved thread)
evalml/data_checks/datetime_nan_data_check.py (outdated, resolved thread)
data_check_name="NoVarianceDataCheck",
message_code=DataCheckMessageCode.NO_VARIANCE,
details={"column": "Y"}).to_dict()],
"errors": messages[4:7] + [DataCheckError(message="Y has 1 unique value.",
@jeremyliweishih jeremyliweishih dismissed stale reviews from angela97lin and chukarsten March 31, 2021 14:20

stale

@jeremyliweishih jeremyliweishih merged commit d6d2d84 into main Mar 31, 2021
@freddyaboulton freddyaboulton deleted the js_1999_dtf_nan branch May 13, 2022 15:01
Development

Successfully merging this pull request may close these issues.

Datetime featurizer errors when there are nans
7 participants