Update TargetLeakageDataCheck to use Woodwork's dependence_dict #3728
Codecov Report
@@           Coverage Diff           @@
##            main   #3728     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        339     339
  Lines      34786   34810     +24
=======================================
+ Hits       34655   34679     +24
  Misses       131     131
  pct_corr_threshold (float): The correlation threshold to be considered leakage. Defaults to 0.95.
- method (string): The method to determine correlation. Use 'mutual' for mutual information, otherwise 'pearson' for Pearson correlation. Defaults to 'mutual'.
+ method (string): The method to determine correlation. Use 'max' for the maximum correlation, or for specific correlation metrics, use their name (ie 'mutual_info' for mutual information, 'pearson' for Pearson correlation, etc).
+     Defaults to 'mutual_info'.
Defaulting to `mutual_info` for now while we fix a bug in Woodwork. Will change to `max` once the bug is fixed and a new release is out.
Should we consider this a breaking change, since mutual is no longer accepted as an input to this Data Check?
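One way to soften a rename like this is a small alias shim that keeps accepting the old value while warning about it. The sketch below is illustrative only: `normalize_method` and `_DEPRECATED_METHOD_ALIASES` are hypothetical names, and nothing in this PR indicates that such a shim was added.

```python
import warnings

# Hypothetical alias table (old name -> new name); not part of this PR.
_DEPRECATED_METHOD_ALIASES = {"mutual": "mutual_info"}


def normalize_method(method):
    """Map a deprecated method name to its replacement and warn the caller."""
    if method in _DEPRECATED_METHOD_ALIASES:
        new_name = _DEPRECATED_METHOD_ALIASES[method]
        warnings.warn(
            f"method='{method}' is deprecated; use method='{new_name}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    # Any other name is passed through unchanged for later validation.
    return method
```

Without a shim of this kind, callers that still pass `'mutual'` would need to switch to `'mutual_info'`, which is what makes the change breaking.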
- Pearson correlation returns a value in [-1, 1], while mutual information returns a value in [0, 1].
+ If `method='mutual_info'`, this data check uses mutual information and supports all target and feature types.
+ Other correlation metrics only support binary with numeric and boolean dtypes, and return a value in [-1, 1], while mutual information returns a value in [0, 1].
+ Correlation metrics available can be found in Woodwork's `dependence_dict method <https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.dependence_dict.html#woodwork.table_accessor.WoodworkTableAccessor.dependence_dict>`_.
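To make the threshold and the [-1, 1] range concrete, here is a minimal standard-library sketch of how a correlation threshold could flag leaky columns. It is not the EvalML or Woodwork implementation: `pearson` and `flag_leaky_columns` are hypothetical helpers, and comparing the absolute value with an inclusive `>=` is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient; returns a value in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_columns(features, target, pct_corr_threshold=0.95):
    """Return column names whose |correlation| with the target meets the threshold."""
    flagged = []
    for name, values in features.items():
        # Assumption: strong negative correlation counts as leakage too.
        if abs(pearson(values, target)) >= pct_corr_threshold:
            flagged.append(name)
    return flagged


features = {
    "copy_of_target": [0, 1, 1, 0, 1, 0],
    "inverted_target": [1, 0, 0, 1, 0, 1],
    "unrelated": [3, 1, 4, 1, 5, 9],
}
target = [0, 1, 1, 0, 1, 0]
leaky = flag_leaky_columns(features, target)
print(leaky)
```

Because Pearson is signed, the inverted column is only caught when the absolute value is compared. Mutual information lies in [0, 1], so no such step is needed for it.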
ParthivNaresh left a comment
Awesome work! Just left a few comments
jeremyliweishih left a comment
LGTM - just some docstring and error raising nits

Update data checks