-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow boolean columns to be included in remove_highly_correlated_features #2231
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2231 +/- ##
=======================================
Coverage 99.29% 99.29%
=======================================
Files 146 146
Lines 17577 17590 +13
=======================================
+ Hits 17453 17466 +13
Misses 124 124
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
# Convert boolean column to be float64 | ||
if pd.api.types.is_bool_dtype(less_complex_col): | ||
less_complex_col = less_complex_col.astype("float64") | ||
|
||
if abs(more_complex_col.corr(less_complex_col)) >= pct_corr_threshold: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noob question but is it a normal case for corr to be nan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible for the column type to LatLong or a logical type where you cannot calculate the correlation value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Also, I didn't realize that the feature matrix had columns of all zeros (like bools = diff_ints
), which makes correlation being nan
make perfect sense.
@@ -330,3 +331,42 @@ def test_multi_output_selection(): | |||
for f in unsliced_features: | |||
for f_name in f.get_feature_names(): | |||
assert f_name in matrix_columns | |||
|
|||
|
|||
def test_remove_highly_correlatied_features_on_boolean_cols(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_remove_highly_correlatied_features_on_boolean_cols(): | |
def test_remove_highly_correlated_features_on_boolean_cols(): |
7200f03
to
e61b05b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
Can we squeeze in a docs fix? On line 21 of the release notes, instead of "user:`sbadithe`" it should be ":user:`sbadithe`"
Fixes #2229
Converts boolean columns to
float64
inremove_highly_correlated_features
to avoid the error that comes from callingcorr
on boolean columns