TargetLeakageDataCheck maintains user logical types#2711
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2711 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 301 301
Lines 27600 27607 +7
=======================================
+ Hits 27556 27563 +7
Misses 44 44
Continue to review full report at Codecov.
|
freddyaboulton
left a comment
There was a problem hiding this comment.
Thank you @eccabay !!
| cols_to_compare = infer_feature_types( | ||
| pd.DataFrame({col: X[col], str(col) + "y": y}) | ||
| ) | ||
| logical_types = {col: type(X.ww.logical_types[col]), str(col) + "y": y_type} |
There was a problem hiding this comment.
I think we can simplify this with this:
cols_to_compare = X.ww[[col]]
cols_to_compare.ww[str(col) + "y"] = yWhat I also like about this is that this implementation will also preserve other parts of the schema, like semantic tags and metadata. For the sake of this data check, I think preserving the logical types is enough but we should get in the habit of preserving as much of the schema as possible in our implementation.
|
|
||
| X.ww.init(logical_types={"A": "Unknown", "B": "Double"}) | ||
| warnings = TargetLeakageDataCheck().validate(X, y)["warnings"] | ||
| assert not any(w["message"].startswith("Column 'A'") for w in warnings) |
There was a problem hiding this comment.
Maybe we should add a comment explaining that mutual information is not supported for Unknown logical types so they should not be included.
I'm also thinking we should just mock ww.mutual information() and verify the logical types are consistent there? Not sure how tricky that would be though.
Closes #2683