TargetLeakageDataCheck maintains user logical types#2711

Merged
eccabay merged 7 commits into main from 2683_targetleakage_types on Aug 31, 2021

@eccabay eccabay commented Aug 30, 2021

Closes #2683

codecov Bot commented Aug 30, 2021

Codecov Report

Merging #2711 (8cac7ff) into main (3ae1500) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2711     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        301     301             
  Lines      27600   27607      +7     
=======================================
+ Hits       27556   27563      +7     
  Misses        44      44             
Impacted Files                                          Coverage Δ
evalml/data_checks/target_leakage_data_check.py         100.0% <100.0%> (ø)
...ata_checks_tests/test_target_leakage_data_check.py   100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@eccabay eccabay marked this pull request as ready for review August 30, 2021 20:15

@freddyaboulton freddyaboulton left a comment

Thank you @eccabay !!

cols_to_compare = infer_feature_types(
    pd.DataFrame({col: X[col], str(col) + "y": y})
)
logical_types = {col: type(X.ww.logical_types[col]), str(col) + "y": y_type}

I think we can simplify this to:

cols_to_compare = X.ww[[col]]
cols_to_compare.ww[str(col) + "y"] = y

I also like that this implementation preserves other parts of the schema, like semantic tags and metadata. For the sake of this data check, I think preserving the logical types is enough, but we should get in the habit of preserving as much of the schema as possible in our implementations.


X.ww.init(logical_types={"A": "Unknown", "B": "Double"})
warnings = TargetLeakageDataCheck().validate(X, y)["warnings"]
assert not any(w["message"].startswith("Column 'A'") for w in warnings)

Maybe we should add a comment explaining that mutual information is not supported for Unknown logical types, so those columns should not be included.

I'm also thinking we should just mock ww.mutual_information() and verify the logical types are consistent there? Not sure how tricky that would be though.
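One way the mocking idea could look, sketched with the standard library only. `FakeAccessor` and `check_leakage` are hypothetical stand-ins for the Woodwork accessor and the data check, not evalml or Woodwork code; in a real test the patch target would be the accessor's own `mutual_information` method, whose import path is not shown in this thread:

```python
from unittest.mock import patch


class FakeAccessor:
    """Hypothetical stand-in for a table accessor that stores logical types."""

    def __init__(self, logical_types):
        self.logical_types = dict(logical_types)

    def mutual_information(self):
        raise RuntimeError("should be mocked in the test")


def check_leakage(table, col, target_type):
    """Hypothetical stand-in for the data check's per-column comparison."""
    pair = FakeAccessor(
        {col: table.logical_types[col], str(col) + "y": target_type}
    )
    return pair.mutual_information()


user_table = FakeAccessor({"A": "Unknown", "B": "Double"})

# autospec=True makes the mock receive `self`, so the test can inspect
# the exact table that mutual information was requested on.
with patch.object(
    FakeAccessor, "mutual_information", autospec=True, return_value=[]
) as mock_mi:
    check_leakage(user_table, "B", "Integer")

table_seen = mock_mi.call_args.args[0]
seen_types = table_seen.logical_types
assert seen_types["B"] == user_table.logical_types["B"]
```

The key design choice is `autospec=True`: without it the mock would not record the instance, and the test could only assert that the method was called, not which types it was called with.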

@eccabay eccabay merged commit 635f3e7 into main Aug 31, 2021
@eccabay eccabay deleted the 2683_targetleakage_types branch August 31, 2021 20:45
@chukarsten chukarsten mentioned this pull request Sep 1, 2021

Development

Successfully merging this pull request may close these issues.

TargetLeakageDataCheck wipes user-selected logical types

3 participants