New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multicollinearity data check #1515
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1515 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 234 236 +2
Lines 16827 16881 +54
=========================================
+ Hits 16819 16873 +54
Misses 8 8
Continue to review full report at Codecov.
|
@@ -32,7 +32,6 @@ def validate(self, X, y=None): | |||
|
|||
Arguments: | |||
X (ww.DataTable, pd.DataFrame, np.ndarray): The input features to check | |||
threshold (float): The probability threshold to be considered an ID column. Defaults to 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related but removing extraneous argument in docstring :d
} | ||
|
||
|
||
def test_multicollinearity_data_check_input_formats(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would have liked adding ww test here but made more sense to reuse data from test_multicollinearity_nonnumeric_cols
…o 811_multicollinearity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Might be nice to add a np.array
as a test to see whether the output columns are understandable, but not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation looks good - just a couple questions,
-
Is this added to AutoML by default? I'm a bit worried about performance across larger datasets but maybe the WW team would know more about it.
-
Did we go with WW's implementation for convenience? I remember us looking at a VIF implementation in the past. I tried researching but couldn't find if there was a difference in using NMIF vs. just correlation as well.
I think all user facing things should be "multicollinear" but usage of correlation or correlated in the code should be fine! |
@jeremyliweishih Yup, currently not adding to default data checks because of what you mentioned, but open to discussing if we should or not. Would be nice to get some sort of clustering/faster algo first before doing so. And WW's normalized mutual information method handles categoricals, which is nice (and I believe something VIF doesn't) :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin Looks good! Maybe we should benchmark this to know where we are now in terms of performance. Obviously, that shouldn't block this merge :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin LGTM! Had a couple comments. We should definitely lower the default threshold.
if mutual_info_df.empty: | ||
return messages | ||
above_threshold = mutual_info_df.loc[mutual_info_df['mutual_info'] >= self.threshold] | ||
correlated_cols = [(col_1, col_2) for col_1, col_2 in zip(above_threshold['column_1'], above_threshold['column_2'])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this remove duplicate pairs, i.e. ('col_A', 'col_B')
and ('col_B', 'col_A')
? Ideally we should find a way to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Luckily, mutual_information
from Woodwork handles this; that's why we only see one pair in the expected test results 😁
Resolves #811 by adding an initial version of a multicollinearity data check.
I switch between "correlation" and "multicollinear" because mutual information right now is technically just checking for correlation but open to thoughts about consistency 🤷