-
Notifications
You must be signed in to change notification settings - Fork 92
Update DatetimeFormatDataCheck to use woodwork
#3425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3425 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 336 336
Lines 33295 33297 +2
=======================================
+ Hits 33163 33165 +2
Misses 132 132
Continue to review full report at Codecov.
|
jeremyliweishih
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM! Just some clarifying questions but might be prudent to wait for a review from someone with more context than me 😄 good work!
| >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") | ||
| >>> assert datetime_format_dc.validate(X, y) == [ | ||
| ... { | ||
| ... "message": "Column 'dates' has datetime values missing between start and end date.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this being removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's no longer detected by woodwork - it purely errors out unable to detect a frequency, so no more missing values. IMO, this is the more accurate output!
evalml/tests/data_checks_tests/test_datetime_format_data_check.py
Outdated
Show resolved
Hide resolved
chukarsten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, looks great. I think just a question about maybe updating the docstring summary to make it a bit more comprehensive. Other than that, super excited here!
|
|
||
| DATETIME_HAS_REDUNDANT_ROW = "datetime_has_redundant_row" | ||
| """Message code for when datetime information has more than one row per datetime""" | ||
| """Message code for when datetime information has more than one row per datetime.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Angela would be proud.
| ) | ||
| inferred_freq = ww_payload[0] | ||
| debug_object = ww_payload[1] | ||
| if inferred_freq is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little confused why you'd want to return here if there's a detected inferable frequency. At this point you've determined whether it's monotonic or not and that it has datetime information...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If woodwork returns an inferred frequency, it's the same as if the pd.infer_freq function returned a frequency. In that case, the datetime column is perfect and we don't need to run the rest of the checks as we know they don't hold.
Now that I talk that through though, I realize we can move that check up higher. We do run inferred_freq = pd.infer_freq(...), so it would make more sense to check if the frequency is perfect up there instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I take that back. After making that change, a bunch of tests that I forgot about fail. pd.infer_freq and woodwork's infer_frequency can detect a backwards frequency (i.e. going backwards in time), but we require forwards frequencies. That's why its necessary that we check for an inferable frequency after checking for monotonically increasing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining that one to me!
| @@ -36,14 +37,6 @@ def validate(self, X, y): | |||
| >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update .validate()'s docstring a little bit more outside of the pydocstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add another more descriptive line!
ParthivNaresh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good thanks for getting this in!
Closes #3385