Add multicollinearity data check #1515

Merged
merged 17 commits into from Dec 16, 2020

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented Dec 7, 2020

Resolves #811 by adding an initial version of a multicollinearity data check.

I switch between "correlation" and "multicollinear" because mutual information right now is technically just checking for correlation, but I'm open to thoughts about consistency 🤷

@angela97lin angela97lin added this to the December 2020 milestone Dec 7, 2020
@angela97lin angela97lin self-assigned this Dec 7, 2020
codecov bot commented Dec 7, 2020

Codecov Report

Merging #1515 (a58ecc7) into main (af44074) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1515     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         234      236      +2     
  Lines       16827    16881     +54     
=========================================
+ Hits        16819    16873     +54     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/data_checks/highly_null_data_check.py 100.0% <ø> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <ø> (ø)
evalml/data_checks/no_variance_data_check.py 100.0% <ø> (ø)
evalml/data_checks/outliers_data_check.py 100.0% <ø> (ø)
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/multicollinearity_data_check.py 100.0% <100.0%> (ø)
..._checks_tests/test_multicollinearity_data_check.py 100.0% <100.0%> (ø)


@@ -32,7 +32,6 @@ def validate(self, X, y=None):

Arguments:
    X (ww.DataTable, pd.DataFrame, np.ndarray): The input features to check
    threshold (float): The probability threshold to be considered an ID column. Defaults to 1.0
Contributor Author

Not related, but removing an extraneous argument from the docstring :D

}


def test_multicollinearity_data_check_input_formats():
Contributor Author

I would have liked to add a Woodwork (ww) test here, but it made more sense to reuse the data from test_multicollinearity_nonnumeric_cols.

Contributor

@bchen1116 bchen1116 left a comment

LGTM! It might be nice to add an np.array input as a test case to see whether the output column names are understandable, but it's not necessary.
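
For context on the np.array suggestion: a NumPy array carries no column names, so once it is wrapped in a DataFrame the columns are labeled with integers. A minimal sketch, assuming the data check converts array input to a pandas DataFrame internally (that conversion is an assumption here, not something shown in this thread):

```python
import numpy as np
import pandas as pd

# Two identical columns and one unrelated column, with no column names.
X = np.array([[1, 1, 7],
              [2, 2, 3],
              [3, 3, 9],
              [4, 4, 1]])

# Wrapping the array in a DataFrame assigns integer labels 0, 1, 2.
X_df = pd.DataFrame(X)
column_labels = list(X_df.columns)

# Any reported pair would therefore be identified by position, e.g. (0, 1).
print(column_labels)
```

Under that assumption, a test with array input would mainly confirm that positional labels in the message are still readable to the user.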

Contributor

@jeremyliweishih jeremyliweishih left a comment

The current implementation looks good. Just a couple of questions:

  1. Is this added to AutoML by default? I'm a bit worried about performance on larger datasets, but maybe the WW team would know more about it.

  2. Did we go with WW's implementation for convenience? I remember us looking at a VIF implementation in the past. I tried researching this but couldn't find whether there is a difference between using NMIF and just correlation.

@jeremyliweishih
Contributor

I switch between "correlation" and "multicollinear" because mutual information right now is technically just checking for correlation, but I'm open to thoughts about consistency 🤷

I think everything user-facing should say "multicollinear", but using "correlation" or "correlated" in the code should be fine!

@angela97lin
Contributor Author

angela97lin commented Dec 8, 2020

@jeremyliweishih Yup, I'm currently not adding this to the default data checks because of the performance concern you mentioned, but I'm open to discussing whether we should. It would be nice to get some sort of clustering or a faster algorithm in place before doing so.

And WW's normalized mutual information method handles categoricals, which is nice (and I believe something VIF doesn't) :)
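
To illustrate why a mutual-information score works for categoricals: it only needs value counts, so string labels can be used directly without numeric encoding. The following is a rough sketch, not Woodwork's implementation. The function names are hypothetical, and normalizing by the arithmetic mean of the two entropies is an assumption (other normalizations exist).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_info(a, b):
    """Mutual information of two discrete columns, scaled to [0, 1].

    Normalizes by the arithmetic mean of the two entropies, which is an
    assumption of this sketch rather than a confirmed library detail.
    """
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))  # joint entropy of the paired values
    mutual_info = h_a + h_b - h_ab
    mean_entropy = (h_a + h_b) / 2
    # Two constant columns have zero entropy; this sketch scores them as 0.
    return mutual_info / mean_entropy if mean_entropy > 0 else 0.0


# Categorical columns: "code" is just a relabeling of "color".
color = ["red", "blue", "red", "green", "blue", "green"]
code = ["r", "b", "r", "g", "b", "g"]
redundant_score = normalized_mutual_info(color, code)

# Two columns whose values are unrelated to each other.
independent_score = normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])

print(redundant_score, independent_score)
```

A relabeled column scores close to 1 and an unrelated one close to 0, which is the behavior a threshold-based check relies on.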

Contributor

@tyler3991 tyler3991 left a comment

LGTM! Nice!!

Contributor

@freddyaboulton freddyaboulton left a comment

@angela97lin Looks good! Maybe we should benchmark this to know where we are now in terms of performance. Obviously, that shouldn't block this merge :)
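
A benchmark along these lines could start from something like the sketch below. It assumes the check scores every pair of columns, so the work grows with n * (n - 1) / 2 pairs. `pairwise_scores` is a hypothetical stand-in with a placeholder score, not the real mutual-information computation, so only the shape of the harness carries over.

```python
import time
from itertools import combinations


def pairwise_scores(columns):
    """Hypothetical stand-in for a pairwise check: one score per column pair."""
    scores = {}
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        # Placeholder score: fraction of rows where the two columns agree.
        matches = sum(x == y for x, y in zip(col_a, col_b))
        scores[(name_a, name_b)] = matches / len(col_a)
    return scores


def benchmark(n_cols_list, n_rows=1000):
    """Time pairwise_scores for several column counts."""
    results = []
    for n_cols in n_cols_list:
        # Synthetic integer columns; the values themselves are arbitrary.
        columns = {
            f"col_{i}": [(i * r) % 7 for r in range(n_rows)]
            for i in range(n_cols)
        }
        start = time.perf_counter()
        scores = pairwise_scores(columns)
        elapsed = time.perf_counter() - start
        results.append((n_cols, len(scores), elapsed))
    return results


results = benchmark([10, 20, 40])
for n_cols, n_pairs, elapsed in results:
    print(f"{n_cols} columns -> {n_pairs} pairs in {elapsed:.4f}s")
```

Doubling the column count roughly quadruples the number of pairs, which is the scaling concern raised earlier in the thread.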

Contributor

@dsherry dsherry left a comment

@angela97lin LGTM! Had a couple comments. We should definitely lower the default threshold.

if mutual_info_df.empty:
    return messages
above_threshold = mutual_info_df.loc[mutual_info_df['mutual_info'] >= self.threshold]
correlated_cols = [(col_1, col_2) for col_1, col_2 in zip(above_threshold['column_1'], above_threshold['column_2'])]
Contributor

Will this remove duplicate pairs, i.e. ('col_A', 'col_B') and ('col_B', 'col_A')? Ideally we should find a way to do that.

Contributor Author

Luckily, mutual_information from Woodwork handles this; that's why we only see one pair in the expected test results 😁
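
The filtering step under review, plus the duplicate-pair question, can be sketched as follows. The frame below is made-up data that reuses the `column_1`, `column_2`, and `mutual_info` column names from the snippet. The threshold value is illustrative only, and `dedupe_pairs` is a hypothetical helper that would matter only if the upstream frame listed both orderings of a pair.

```python
import pandas as pd

# Hypothetical pairwise scores, one row per unordered pair of columns.
mutual_info_df = pd.DataFrame({
    "column_1": ["col_A", "col_A", "col_B"],
    "column_2": ["col_B", "col_C", "col_C"],
    "mutual_info": [0.95, 0.10, 0.92],
})

threshold = 0.9  # illustrative value, not necessarily the check's default

# Keep only the pairs whose score meets the threshold.
above_threshold = mutual_info_df.loc[mutual_info_df["mutual_info"] >= threshold]
correlated_cols = list(zip(above_threshold["column_1"], above_threshold["column_2"]))


def dedupe_pairs(pairs):
    """Drop mirrored pairs, keeping the first ordering seen."""
    seen = set()
    unique = []
    for col_1, col_2 in pairs:
        key = frozenset((col_1, col_2))  # order-insensitive identity of the pair
        if key not in seen:
            seen.add(key)
            unique.append((col_1, col_2))
    return unique


print(correlated_cols)
print(dedupe_pairs([("col_A", "col_B"), ("col_B", "col_A"), ("col_A", "col_C")]))
```

When each pair already appears once in the input frame, as described in the reply above, the extra dedupe step is a no-op.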

@angela97lin angela97lin merged commit 73bff4a into main Dec 16, 2020
@dsherry dsherry mentioned this pull request Dec 29, 2020
@angela97lin angela97lin deleted the 811_multicollinearity branch January 13, 2021 20:37
Successfully merging this pull request may close these issues.

Data check to detect multicollinearity
6 participants