Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow boolean columns to be included in remove_highly_correlated_features #2231

Merged
merged 4 commits into from
Aug 10, 2022

Conversation

tamargrey
Copy link
Contributor

Fixes #2229

Converts boolean columns to float64 in remove_highly_correlated_features to avoid the error that comes from calling corr on boolean columns

@codecov
Copy link

codecov bot commented Aug 8, 2022

Codecov Report

Merging #2231 (03b8933) into main (6c23c14) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2231   +/-   ##
=======================================
  Coverage   99.29%   99.29%           
=======================================
  Files         146      146           
  Lines       17577    17590   +13     
=======================================
+ Hits        17453    17466   +13     
  Misses        124      124           
Impacted Files Coverage Δ
featuretools/selection/selection.py 100.00% <100.00%> (ø)
featuretools/tests/selection/test_selection.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

# Convert boolean column to be float64
if pd.api.types.is_bool_dtype(less_complex_col):
less_complex_col = less_complex_col.astype("float64")

if abs(more_complex_col.corr(less_complex_col)) >= pct_corr_threshold:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noob question but is it a normal case for corr to be nan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible for the column type to LatLong or a logical type where you cannot calculate the correlation value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Also, I didn't realize that the feature matrix had columns of all zeros (like bools = diff_ints), which makes correlation being nan make perfect sense.

@@ -330,3 +331,42 @@ def test_multi_output_selection():
for f in unsliced_features:
for f_name in f.get_feature_names():
assert f_name in matrix_columns


def test_remove_highly_correlatied_features_on_boolean_cols():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_remove_highly_correlatied_features_on_boolean_cols():
def test_remove_highly_correlated_features_on_boolean_cols():

@tamargrey tamargrey requested a review from sbadithe August 9, 2022 20:47
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Can we squeeze in a docs fix? On line 21 of the release notes, instead of "user:`sbadithe`" it should be ":user:`sbadithe`"

@gsheni gsheni merged commit 079c882 into main Aug 10, 2022
@gsheni gsheni deleted the fix-corr-on-bools branch August 10, 2022 16:48
@ozzieD ozzieD mentioned this pull request Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants