Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1164 feat tabular new check correlation between features #1606

Merged
merged 37 commits into from
Jun 16, 2022

Conversation

TheSolY
Copy link
Contributor

@TheSolY TheSolY commented Jun 12, 2022

Reference Issue

1164

What does this implement/fix?

New data integrity check feature-feature correlation to check the pairwise correlations between features

@TheSolY TheSolY requested review from a team as code owners June 12, 2022 12:40
@TheSolY TheSolY linked an issue Jun 12, 2022 that may be closed by this pull request
@TheSolY TheSolY requested review from ItayGabbay, shir22 and a team as code owners June 12, 2022 12:40
@TheSolY TheSolY added the feature Feature update or code change to the package label Jun 12, 2022

return CheckResult(value=full_df, header='Feature-Feature Correlation', display=fig)

def add_condition_max_number_of_pairs_above(self, threshold: float = 0.9, n_pairs: int = 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add threshold in the name? and maybe keep using greater/less

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the name but open for discussion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with JKL. We should keep our usual naming convention + incomplete sentence seems weird to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JKL98ISR @nirhutnik suggest concrete names

…nly categorical data

Added a fixture for the adult dataset
test for the case of data with nans and only numerical
@nirhutnik
Copy link
Contributor

@TheSolY About the display - it's really great :) Would only add a short explanation under the heatmap itself, as is done in FeatureLabelCorrelation (but much, much shorter). It should explain that this is a heatmap of correlation between features.

Co-authored-by: Nir Hutnik <92314933+nirhutnik@users.noreply.github.com>
# along with Deepchecks. If not, see <http://www.gnu.org/licenses/>.
# ----------------------------------------------------------------------------
#
"""module contains Feature-Feature Correlation check."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""module contains Feature-Feature Correlation check."""
"""module contains the Feature-Feature Correlation check."""


encoded_cat_data = dataset.data.loc[:, cat_features].apply(lambda x: pd.factorize(x)[0])

all_features = num_features + cat_features
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are features which are neither (and we have that option), we should say something (perhaps an asterisk noting that n features where not showed because they are neither categorical nor numeric)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noamzbr asterisk added to the display


1. numerical-numerical: `Pearson's correlation coefficient <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`__
2. numerical-categorical: `Correlation ratio <https://en.wikipedia.org/wiki/Correlation_ratio>`__
3. categorical-categorical: `Symmetric Theil's U <https://en.wikipedia.org/wiki/Uncertainty_coefficient>`__
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the symmetric isn't a thing (right?), we should add more detail here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the formula from Wikipedia link

@@ -0,0 +1,65 @@
# -*- coding: utf-8 -*-
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see the plot here, but worth slightly modifying the adult dataset to create a case that is "clearly not OK" and has two features which are clearly redundant.

If the adult dataset already has such a pair, then explain in the "Define a Condition" section what what was found. The condition should alert on that pair.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK just seen that this happens for education and education num. So just need to make sure the condition catches that (define it such that it will), and say something about it

Copy link
Contributor

@Nadav-Barak Nadav-Barak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!!
Would add 2-3 more tests, specifically one that include null values in different features


# Numerical-categorical correlations
if num_features and cat_features:
num_cat_corr = generalized_corrwith(dataset.data.loc[:, num_features], encoded_cat_data,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO places in which the numeric feature has a null value should be ignored (masked)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way null values are handled depends on the method

…_correlation.py

Co-authored-by: Noam Bressler <noamzbr@gmail.com>
full_df.loc[cat_features, num_features] = num_cat_corr.transpose()

# Display
fig = px.imshow(full_df)
Copy link
Contributor

@ItayGabbay ItayGabbay Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we have 1000 features? will the plot be readable?
Maybe we should limit to the top X highest correlations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the highest are not necessarily the most interesting, and the user should just take this into account and run the check on the partial dataset that they like. Showing only the top could create visual perception bias because the color scheme will be spread on a narrow scale.

Copy link
Contributor Author

@TheSolY TheSolY Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ItayGabbay I've implemented show_top_n_columns

this generalized method applies to any two Dataframes with the same number of rows, regardless of the column names.

Parameters
__________
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's in the wrong format. should be ---
Don't know if this is github or for real...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably Github, in the real code it looks like this
image

Added a fixture for a DataFrame with mixed datatypes, NaNs, a whole Nan column, and a datetime column
…een-features

# Conflicts:
#	tests/conftest.py
Mine - added fixtures, Theirs- imports changes and tqdm
@noamzbr noamzbr enabled auto-merge (squash) June 16, 2022 15:00
@noamzbr noamzbr merged commit aed8c9c into main Jun 16, 2022
@delete-merged-branch delete-merged-branch bot deleted the 1164-feat-tabular-new-check-correlation-between-features branch June 16, 2022 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Feature update or code change to the package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] tabular - new check - correlation between features
6 participants