TargetLeakageDataCheck maintains user logical types #2711

Merged: eccabay merged 7 commits into main from 2683_targetleakage_types on Aug 31, 2021

@eccabay (Contributor) commented Aug 30, 2021

Closes #2683

codecov bot commented Aug 30, 2021

Codecov Report

Merging #2711 (8cac7ff) into main (3ae1500) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2711     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        301     301             
  Lines      27600   27607      +7     
=======================================
+ Hits       27556   27563      +7     
  Misses        44      44             
Impacted Files                                          Coverage Δ
evalml/data_checks/target_leakage_data_check.py         100.0% <100.0%> (ø)
...ata_checks_tests/test_target_leakage_data_check.py   100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@eccabay eccabay marked this pull request as ready for review August 30, 2021 20:15
@freddyaboulton (Contributor) left a comment

Thank you @eccabay !!

cols_to_compare = infer_feature_types(
    pd.DataFrame({col: X[col], str(col) + "y": y})
)
logical_types = {col: type(X.ww.logical_types[col]), str(col) + "y": y_type}

I think we can simplify this to:

cols_to_compare = X.ww[[col]]
cols_to_compare.ww[str(col) + "y"] = y

I also like that this implementation preserves other parts of the schema, such as semantic tags and metadata. For this data check, preserving the logical types is enough, but we should get in the habit of preserving as much of the schema as possible in our implementations.


X.ww.init(logical_types={"A": "Unknown", "B": "Double"})
warnings = TargetLeakageDataCheck().validate(X, y)["warnings"]
assert not any(w["message"].startswith("Column 'A'") for w in warnings)

Maybe we should add a comment explaining that mutual information is not supported for Unknown logical types so they should not be included.

I'm also thinking we should just mock ww.mutual_information() and verify the logical types are consistent there? Not sure how tricky that would be though.

@eccabay eccabay merged commit 635f3e7 into main Aug 31, 2021
@eccabay eccabay deleted the 2683_targetleakage_types branch August 31, 2021 20:45
@chukarsten chukarsten mentioned this pull request Sep 1, 2021
Successfully merging this pull request may close these issues.

TargetLeakageDataCheck wipes user-selected logical types