DatetimeFormatDataCheck for equal interval and sorting #2603

Merged
ParthivNaresh merged 15 commits into main from DataCheck-For-Equal-Interval on Aug 11, 2021

Conversation

ParthivNaresh
Contributor

Fixes: #2124

codecov bot commented Aug 6, 2021

Codecov Report

Merging #2603 (f182076) into main (06d05ed) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2603     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        295     297      +2     
  Lines      26895   27027    +132     
=======================================
+ Hits       26851   26983    +132     
  Misses        44      44             
Impacted Files                                         Coverage Δ
evalml/data_checks/data_checks.py                      100.0% <ø> (ø)
evalml/automl/automl_search.py                         99.9% <100.0%> (+0.1%) ⬆️
evalml/data_checks/__init__.py                         100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py          100.0% <100.0%> (ø)
evalml/data_checks/datetime_format_data_check.py       100.0% <100.0%> (ø)
evalml/data_checks/default_data_checks.py              100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_search.py               100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py     100.0% <100.0%> (ø)
...ta_checks_tests/test_datetime_format_data_check.py  100.0% <100.0%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ParthivNaresh ParthivNaresh self-assigned this Aug 9, 2021
@@ -100,6 +110,22 @@ def search(X_train=None, y_train=None, problem_type=None, objective="auto", **kw
    X_train = infer_feature_types(X_train)
    y_train = infer_feature_types(y_train)
    problem_type = handle_problem_types(problem_type)

    datetime_column = None
Contributor Author

Depending on the outcome of our time series discussions, we can change this to treat the index as the default location of datetime data for time series. Until then, the user needs to pass in at least date_index for time series default objectives.

y = infer_feature_types(y)

if self.datetime_column != "index":
    datetime_values = X[self.datetime_column]
Contributor Author

We're not trying to extract every datetime column, just the one that indexes the datetime information.

datetime_values = X.index
if not isinstance(datetime_values, pd.DatetimeIndex):
    datetime_values = y.index
if not isinstance(datetime_values, pd.DatetimeIndex):
Contributor Author

This lets the user specify just "index"; we then check the X index first, followed by the y index.
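To make the lookup order concrete, here is a minimal sketch of the behavior described above. The helper name `resolve_datetime_values` and its return-`None` fallback are assumptions for illustration, not the code in this PR.

```python
import pandas as pd


def resolve_datetime_values(X, y, datetime_column="index"):
    """Hypothetical helper: return the datetime values to validate, or None."""
    if datetime_column != "index":
        # A named column was requested, so use only that column.
        return X[datetime_column]
    # "index" was requested: prefer the X index, then fall back to the y index.
    for candidate in (X.index, y.index):
        if isinstance(candidate, pd.DatetimeIndex):
            return candidate
    # Neither index carries datetime information.
    return None
```

With a plain integer index on X and a `DatetimeIndex` on y, the sketch returns the y index; if neither index is a `DatetimeIndex`, it returns `None` so the caller can decide how to report the problem.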

-    def __init__(self, problem_type, objective, n_splits=3):
+    def __init__(self, problem_type, objective, n_splits=3, datetime_column="index"):
         default_checks = self._DEFAULT_DATA_CHECK_CLASSES
         data_check_params = {}
Contributor Author

Restructured this so it is clearer which data checks and params are passed for each problem_type.


@pytest.mark.parametrize("input_type", ["pd", "ww"])
@pytest.mark.parametrize(
    "uneven,type_errors", [(True, False), (False, True), (False, False)]
Contributor Author

Covers uneven frequencies and incorrect datetime types.
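As an illustration of what "uneven" means in these cases, the sketch below builds an evenly spaced and an unevenly spaced index and compares consecutive gaps. The helper `has_equal_intervals` is hypothetical, and it assumes fixed-width intervals; calendar frequencies such as month-end would need something like `pandas.infer_freq` instead.

```python
import pandas as pd


def has_equal_intervals(datetime_values):
    """Hypothetical helper: True if consecutive timestamps are equally spaced."""
    # Differences between neighbouring timestamps; the first entry is NaT.
    deltas = pd.Series(pd.to_datetime(datetime_values)).diff().dropna()
    # Exactly one distinct gap means the data is evenly spaced.
    return deltas.nunique() == 1


even = pd.date_range("2021-01-01", periods=5, freq="D")
uneven = even.delete(2)  # dropping one day leaves a two-day gap
```

Here `has_equal_intervals(even)` holds, while `has_equal_intervals(uneven)` does not, because the gaps are one day, two days, and one day.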

@ParthivNaresh ParthivNaresh marked this pull request as ready for review August 9, 2021 19:16
Contributor

@angela97lin left a comment

LGTM! Left a couple of nit-picking comments :)

Contributor

@freddyaboulton left a comment

@ParthivNaresh This looks great! There are two changes I'd like to make before merge, though. The first is disallowing monotonic decreasing columns; the second is returning DataCheckErrors rather than raising TypeErrors in the data check.
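A rough sketch of the two requested changes, under stated assumptions: plain dictionaries stand in for the DataCheckError objects (their shape here is invented for illustration), and `validate_datetime_values` is a hypothetical name, not the data check's real method.

```python
import pandas as pd


def validate_datetime_values(datetime_values):
    """Hypothetical validator: collect problems as records instead of raising."""
    errors = []
    try:
        index = pd.DatetimeIndex(datetime_values)
    except (TypeError, ValueError) as exc:
        # Report the bad input as an error record rather than raising.
        errors.append({"level": "error", "message": f"Not datetime data: {exc}"})
        return errors
    if not index.is_monotonic_increasing:
        # Decreasing or unsorted values are rejected, not silently accepted.
        errors.append(
            {"level": "error", "message": "Datetime values must be sorted in increasing order."}
        )
    return errors
```

Returning records keeps the caller in control: a search routine can gather the results of every check and decide what to surface, whereas a raised exception would stop at the first failing check.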

@ParthivNaresh ParthivNaresh merged commit ee9aabd into main Aug 11, 2021
@chukarsten chukarsten mentioned this pull request Aug 12, 2021
@freddyaboulton freddyaboulton deleted the DataCheck-For-Equal-Interval branch May 13, 2022 15:34
Successfully merging this pull request may close these issues.

DataCheck for TimeSeries problems - Equal interval data
3 participants