Set threshold to account for too many duplicate or nan values in DateTimeFormatDataCheck #3883
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main    #3883    +/-  ##
=======================================
+ Coverage   99.7%    99.7%   +0.1%
=======================================
  Files        346      346
  Lines      36284    36304     +20
=======================================
+ Hits       36147    36167     +20
  Misses       137      137
```
```python
assert result[2]["code"] == "DATETIME_HAS_UNEVEN_INTERVALS"
```

```python
X.iloc[0, -1] = None
```
Changing this value crosses the threshold into `DATETIME_NO_FREQUENCY_INFERRED`.
Since the threshold was arbitrary, I think it certainly wouldn't hurt to make it a passable argument. Regardless, looks great!
Makes sense to me! +1 to @eccabay's comment on making it a passable argument or even just an internal global variable.
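One way to act on these suggestions is sketched below: the 25% cutoff becomes a module-level default that individual instances can override. The simplified class shape and the parameter name `duplicate_nan_threshold` are assumptions for illustration, not EvalML's actual API.

```python
# Hypothetical sketch: expose the cutoff as both an internal global default
# and a passable constructor argument, per the review suggestions above.
DEFAULT_DUPLICATE_NAN_THRESHOLD = 0.25


class DateTimeFormatDataCheck:
    def __init__(
        self,
        datetime_column="index",
        duplicate_nan_threshold=DEFAULT_DUPLICATE_NAN_THRESHOLD,
    ):
        self.datetime_column = datetime_column
        # Fraction of the time_index allowed to be duplicate/NaN values
        # before the check gives up on frequency inference.
        self.duplicate_nan_threshold = duplicate_nan_threshold
```

Callers who know their data is heavy on duplicates could then relax or tighten the cutoff without touching the check itself.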
Currently, if a `time_index` is chosen that has duplicate or nan values, these values are dropped prior to the creation of an `alias_dict` in Woodwork, which iterates over all identified frequencies in the time series to determine the most likely one. The side effect is that in a dataset with a length of 20,000, if the `time_index` is just 50 consecutive daily datetime values repeated 400 times, we currently return `DATETIME_HAS_UNEVEN_INTERVALS` and an action code to fix this data.

This `time_index` should not be considered for regularization, since regularizing it would reduce the dataset from 20,000 observations to 50: all duplicates and nans would be dropped, and the remaining datetime values would have a frequency of `1 day`. This is more of a multi-series concern, which is outside the scope of this data check.

I've added a check: if 25% of the `time_index` consists of duplicate or nan values, `DATETIME_NO_FREQUENCY_INFERRED` will be returned instead of `DATETIME_HAS_UNEVEN_INTERVALS`. I'm open to changing this threshold or making it a passable argument; it was more or less arbitrary.
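A minimal sketch of the ratio test described above, assuming a helper named `duplicate_nan_ratio` (hypothetical, not EvalML's actual code) and the 25% cutoff:

```python
import pandas as pd

THRESHOLD = 0.25  # the more-or-less arbitrary 25% cutoff


def duplicate_nan_ratio(time_index: pd.Series) -> float:
    """Fraction of the time_index made up of duplicate or NaN values."""
    n_nan = time_index.isna().sum()
    n_duplicate = time_index.dropna().duplicated().sum()
    return (n_nan + n_duplicate) / len(time_index)


# The scenario from the description: 50 consecutive daily timestamps
# repeated 400 times yields a 20,000-row time_index.
days = pd.Series(pd.date_range("2022-01-01", periods=50, freq="D"))
time_index = pd.concat([days] * 400, ignore_index=True)

ratio = duplicate_nan_ratio(time_index)  # 19,950 duplicates / 20,000 rows
code = (
    "DATETIME_NO_FREQUENCY_INFERRED"
    if ratio >= THRESHOLD
    else "DATETIME_HAS_UNEVEN_INTERVALS"
)
print(f"{ratio:.4f} -> {code}")  # 0.9975 -> DATETIME_NO_FREQUENCY_INFERRED
```

Under this sketch, the repeated-daily example sails past the cutoff, so the check reports that no frequency could be inferred rather than suggesting a regularization that would collapse the dataset to 50 rows.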