-
Notifications
You must be signed in to change notification settings - Fork 91
Add DateTimeNaNDataCheck
#2039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DateTimeNaNDataCheck
#2039
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2039 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 280 282 +2
Lines 22934 23004 +70
=========================================
+ Hits 22925 22995 +70
Misses 9 9
Continue to review full report at Codecov.
|
@@ -19,11 +20,12 @@ class DefaultDataChecks(DataChecks): | |||
- `InvalidTargetDataCheck` | |||
- `NoVarianceDataCheck` | |||
- `ClassImbalanceDataCheck` (for classification problem types) | |||
- `DateTimeNaNDataCheck` | |||
|
|||
""" | |||
|
|||
_DEFAULT_DATA_CHECK_CLASSES = [HighlyNullDataCheck, IDColumnsDataCheck, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since AutoMLSearch will error out if a user provides datetime columns and has NaN values, I opted to add to default data checks as a safe guard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't realize that don't have data checks as part of AutoMLSearch now. Open to removing this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to leave in for now, as we can still use the default data checks with DefaultDataChecks().validate(X_train, y_train)
. As our data check process updates, this might change, but I see no issues in adding it here
error_contains_nan = "Input datetime column ({}) contains NaN values. Please input NaN values or drop this column." | ||
|
||
|
||
class DateTimeNaNDataCheck(DataCheck): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a standardization of NaN vs Null terminology? I decided to use NaN here but can change to null as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question...I don't think nulls fall into the scope of this. We have data checks to alert us to highly null columns and I don't think a null by itself makes the algorithm choke.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me! I left a few suggestions on tests to add to solidify our coverage a bit more, and I also agree that adding this datacheck to DefaultDataChecks
is fine. PR should be good to go after the added tests though! 👌
"""Checks if datetime columns contain NaN values.""" | ||
|
||
def __init__(self): | ||
"""Checks each column in the input for datetime features and will issue and error if NaN values are present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: will issue an error
@@ -19,11 +20,12 @@ class DefaultDataChecks(DataChecks): | |||
- `InvalidTargetDataCheck` | |||
- `NoVarianceDataCheck` | |||
- `ClassImbalanceDataCheck` (for classification problem types) | |||
- `DateTimeNaNDataCheck` | |||
|
|||
""" | |||
|
|||
_DEFAULT_DATA_CHECK_CLASSES = [HighlyNullDataCheck, IDColumnsDataCheck, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to leave in for now, as we can still use the default data checks with DefaultDataChecks().validate(X_train, y_train)
. As our data check process updates, this might change, but I see no issues in adding it here
Ah, and also add this new data check to the docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great man
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremyliweishih Implementation looks solid! I agree with the suggestions to add coverage!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think besides the test coverage suggested by @bchen1116 , there's a few copy pastas from the highly null data check you used for a template ;). Otherwise, good work!
error_contains_nan = "Input datetime column ({}) contains NaN values. Please input NaN values or drop this column." | ||
|
||
|
||
class DateTimeNaNDataCheck(DataCheck): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question...I don't think nulls fall into the scope of this. We have data checks to alert us to highly null columns and I don't think a null by itself makes the algorithm choke.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of copy pasta and testing updates needed, but I also think it might be useful to combine all the columns into one for the error. Unlike the other data checks that return granular info per col, this only reports which datetime cols have nan values, so seems cleaner to just combine.
data_check_name="NoVarianceDataCheck", | ||
message_code=DataCheckMessageCode.NO_VARIANCE, | ||
details={"column": "Y"}).to_dict()], | ||
"errors": messages[4:7] + [DataCheckError(message="Y has 1 unique value.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes #1999.