Update DateTimeFormatDataCheck with actions and make pipeline from actions #3454

Merged: 16 commits merged into main on Apr 14, 2022

ParthivNaresh (Contributor) commented Apr 8, 2022

Fixes #3437

codecov bot commented Apr 8, 2022

Codecov Report

Merging #3454 (9b17265) into main (eebacf1) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3454     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        336     336             
  Lines      33297   33375     +78     
=======================================
+ Hits       33165   33243     +78     
  Misses       132     132             
| Impacted Files | Coverage Δ |
|---|---|
| evalml/data_checks/default_data_checks.py | 100.0% <ø> (ø) |
| evalml/data_checks/data_check_action_code.py | 100.0% <100.0%> (ø) |
| evalml/data_checks/data_check_message_code.py | 100.0% <100.0%> (ø) |
| evalml/data_checks/datetime_format_data_check.py | 100.0% <100.0%> (ø) |
| ...nsformers/preprocessing/time_series_regularizer.py | 100.0% <100.0%> (ø) |
| evalml/pipelines/utils.py | 99.5% <100.0%> (+0.1%) ⬆️ |
| ...ts/component_tests/test_time_series_regularizer.py | 100.0% <100.0%> (ø) |
| ...ta_checks_tests/test_datetime_format_data_check.py | 100.0% <100.0%> (ø) |
| ..._tests/test_data_checks_and_actions_integration.py | 100.0% <100.0%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eebacf1...9b17265.

TimeSeriesRegularizer(time_index=parameters["time_index"]),
TimeSeriesImputer(),
]
)
Contributor Author

Open question: should we have a break statement or something similar here? If we're adding the time series regularizer and imputer, I'm not sure how relevant the rest of the actions are.

Contributor

I think it's best if we keep this "dumb" (spit out pipeline from actions) and have the caller of this function "smart" (knowing which datacheck actions are relevant for time series).
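
The "dumb builder, smart caller" split suggested above could look roughly like the sketch below. This is only an illustration, not the code from this PR: `Action`, `COMPONENTS_FOR_ACTION`, `build_components_from_actions`, `select_actions_for_time_series`, and the action codes and component names are all hypothetical stand-ins, and the caller-side policy shown is just one possible choice.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Hypothetical stand-in for a data check action (not the EvalML class)."""
    code: str
    parameters: dict = field(default_factory=dict)


# Hypothetical mapping from action codes to component names (illustrative only).
COMPONENTS_FOR_ACTION = {
    "REGULARIZE_AND_IMPUTE": ["Time Series Regularizer", "Time Series Imputer"],
    "DROP_COL": ["Drop Columns Transformer"],
}


def build_components_from_actions(actions):
    """'Dumb' builder: emit components for every action, in order, with no filtering."""
    components = []
    for action in actions:
        components.extend(COMPONENTS_FOR_ACTION[action.code])
    return components


def select_actions_for_time_series(actions):
    """'Smart' caller-side policy (an assumption for this sketch):
    if a regularize-and-impute action is present, keep only that action."""
    regularize = [a for a in actions if a.code == "REGULARIZE_AND_IMPUTE"]
    return regularize if regularize else list(actions)


actions = [Action("DROP_COL"), Action("REGULARIZE_AND_IMPUTE")]
# The caller decides what is relevant; the builder just translates.
pipeline_components = build_components_from_actions(
    select_actions_for_time_series(actions)
)
```

With this split, no break statement is needed inside the builder: any decision about which actions matter for time series lives in the caller.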

freddyaboulton (Contributor) left a comment

Thanks @ParthivNaresh! The code looks good, but I left some comments on the UX implications, plus a suggested refactor so we don't have to infer the frequency twice.

evalml/pipelines/utils.py (outdated, resolved)
)
else:
messages.append(
DataCheckError(
Contributor

We're adding this new error, instead of attaching the action to every one of the existing data check errors, to avoid having duplicate data check actions, right?

I think this may be confusing UX for users: they'll see multiple errors, but only "DATETIME_HAS_UNEVEN_INTERVALS" will appear "fixable" via an action, even though that action will fix all the other errors.

This may be the best we can do for now. Tagging @Cmancuso so we can discuss further.

Contributor

@ParthivNaresh and I talked about this - errors will be consolidated in the future.

"default_value": col_name,
}
},
metadata={"is_target": True},
Contributor

We're not using is_target anywhere, right?

Contributor Author

An EvalML consumer might check is_target when running data check actions, to determine whether the target has been passed and to raise an error if it hasn't while the target is being modified. I felt that case needed to be covered, but if it doesn't, I have no problem taking it out.

Contributor

Happy to keep it! Just wondering why, since I didn't see it being "used".
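
For reference, the kind of consumer-side check described in this thread might be sketched as follows. This is an assumption about how a downstream consumer could use the flag, not code from EvalML: `apply_action` is a hypothetical helper, and the action is modeled as a plain dict with a `metadata` entry.

```python
def apply_action(action, X, y=None):
    """Hypothetical consumer-side guard around running a data check action.

    `action` is a plain dict in this sketch; only its "metadata" entry is read.
    """
    modifies_target = action.get("metadata", {}).get("is_target", False)
    if modifies_target and y is None:
        raise ValueError(
            "This action modifies the target, but no target (y) was passed."
        )
    # The real transformation would happen here; this sketch returns inputs unchanged.
    return X, y
```

A consumer would call `apply_action(action, X, y)` and surface the `ValueError` to the user when the target is missing.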

eccabay (Contributor) left a comment

This is awesome @ParthivNaresh, thanks for doing it! I just left a couple of nits. I also agree with Freddy about the potential confusion with how we tie the errors to actions. It might be helpful to discuss the best way to do this before moving forward.

evalml/data_checks/data_check_message_code.py (outdated, resolved)
chukarsten (Collaborator) left a comment

LGTM, thanks for the follow-up, Parthiv.

evalml/data_checks/data_check_message_code.py (outdated, resolved)
@@ -152,10 +160,24 @@ def test_ts_regularizer_no_issues(ts_data):


 @pytest.mark.parametrize("y_passed", [True, False])
-def test_ts_regularizer_X_only(y_passed, combination_of_faulty_datetime):
+def test_ts_regularizer_X_only_equal_payload(y_passed, combination_of_faulty_datetime):
Collaborator

Just curious what you mean by "equal_payload"?

Contributor Author

This is verifying that if a payload is explicitly passed in through the parameters to the class, it provides an equivalent output to the payload inferred in fit.
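
The idea behind such an "equal payload" test can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the test from this PR: `ToyRegularizer` is a hypothetical stand-in that works on a sorted integer time index, and a single `step` value plays the role of the payload.

```python
from collections import Counter


class ToyRegularizer:
    """Hypothetical stand-in: fills gaps in a sorted integer time index."""

    def __init__(self, step=None):
        # Explicit "payload"; when None it is inferred in fit().
        self.step = step

    def fit(self, times):
        if self.step is None:
            # Infer the step as the most common gap between consecutive points.
            diffs = [b - a for a, b in zip(times, times[1:])]
            self.step = Counter(diffs).most_common(1)[0][0]
        return self

    def transform(self, times):
        # Rebuild an evenly spaced index from the first to the last observation.
        return list(range(times[0], times[-1] + 1, self.step))


def test_equal_payload():
    times = [0, 2, 4, 8, 10]  # the point at 6 is missing
    inferred = ToyRegularizer().fit(times).transform(times)
    explicit = ToyRegularizer(step=2).fit(times).transform(times)
    # Passing the payload explicitly should match what fit() infers.
    assert inferred == explicit
```

The test only checks equivalence of the two paths; it does not assert anything about how the payload itself is computed.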

eccabay (Contributor) left a comment

🚢

freddyaboulton (Contributor) left a comment

Thanks @ParthivNaresh !

ParthivNaresh merged commit 6829737 into main on Apr 14, 2022
chukarsten mentioned this pull request on Apr 29, 2022
Development

Successfully merging this pull request may close these issues.

Integrate Data Check Actions for DateTimeFormatDataCheck to support regularizer and imputer work
4 participants