Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DatetimeFormatDataCheck to use woodwork #3425

Merged
merged 17 commits into from
Apr 6, 2022
Merged

Update DatetimeFormatDataCheck to use woodwork #3425

merged 17 commits into from
Apr 6, 2022

Conversation

eccabay
Copy link
Contributor

@eccabay eccabay commented Mar 30, 2022

Closes #3385

@codecov
Copy link

codecov bot commented Mar 30, 2022

Codecov Report

Merging #3425 (1f8ec3c) into main (80868c4) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3425     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        336     336             
  Lines      33295   33297      +2     
=======================================
+ Hits       33163   33165      +2     
  Misses       132     132             
Impacted Files Coverage Δ
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/datetime_format_data_check.py 100.0% <100.0%> (ø)
...ta_checks_tests/test_datetime_format_data_check.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80868c4...1f8ec3c. Read the comment docs.

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! Just some clarifying questions but might be prudent to wait for a review from someone with more context than me 😄 good work!

@@ -36,14 +37,6 @@ def validate(self, X, y):
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
... {
... "message": "Column 'dates' has datetime values missing between start and end date.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's no longer detected by woodwork - it purely errors out unable to detect a frequency, so no more missing values. IMO, this is the more accurate output!

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, looks great. I think just a question about maybe updating the docstring summary to make it a bit more comprehensive. Other than that, super excited here!

@@ -113,10 +113,13 @@ class DataCheckMessageCode(Enum):
"""Message code for when the datetime values have uneven intervals."""

DATETIME_HAS_REDUNDANT_ROW = "datetime_has_redundant_row"
"""Message code for when datetime information has more than one row per datetime"""
"""Message code for when datetime information has more than one row per datetime."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Angela would be proud.

)
inferred_freq = ww_payload[0]
debug_object = ww_payload[1]
if inferred_freq is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little confused why you'd want to return here if there's a detected inferable frequency. At this point you've determined whether it's monotonic or not and that it has datetime information...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If woodwork returns an inferred frequency, it's the same as if the pd.infer_freq function returned a frequency. In that case, the datetime column is perfect and we don't need to run the rest of the checks as we know they don't hold.

Now that I talk that through though, I realize we can move that check up higher. We do run inferred_freq = pd.infer_freq(...), so it would make more sense to check if the frequency is perfect up there instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I take that back. After making that change, a bunch of tests that I forgot about fail. pd.infer_freq and woodwork's infer_frequency can detect a backwards frequency (i.e. going backwards in time), but we require forwards frequencies. That's why its necessary that we check for an inferable frequency after checking for monotonically increasing values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining that one to me!

@@ -36,14 +37,6 @@ def validate(self, X, y):
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update .validate()'s docstring a little bit more outside of the pydocstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add another more descriptive line!

Copy link
Contributor

@ParthivNaresh ParthivNaresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks for getting this in!

@eccabay eccabay merged commit 5a6c668 into main Apr 6, 2022
@eccabay eccabay deleted the 3385_dtfdc_ww branch April 6, 2022 17:38
@chukarsten chukarsten mentioned this pull request Apr 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update DateTimeFormatDataCheck to use woodwork frequency inference
4 participants