1164 feat tabular new check correlation between features #1606

TheSolY · 2022-06-12T12:40:04Z

Reference Issue

#1164

What does this implement/fix?

New data integrity check, feature-feature correlation, that computes the pairwise correlations between features.

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

JKL98ISR · 2022-06-12T14:10:22Z

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

+
+        return CheckResult(value=full_df, header='Feature-Feature Correlation', display=fig)
+
+    def add_condition_max_number_of_pairs_above(self, threshold: float = 0.9, n_pairs: int = 0):


maybe add threshold in the name? and maybe keep using greater/less

I like the name but open for discussion

Agree with JKL. We should keep our usual naming convention + incomplete sentence seems weird to me

@JKL98ISR @nirhutnik suggest concrete names
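
For readers following the naming discussion, here is a minimal sketch of what a condition with a `threshold` and an `n_pairs` parameter could evaluate. It is not the code from this PR: the helper names are hypothetical, the toy matrix is made up for illustration, and comparing absolute values is an assumption.

```python
import numpy as np
import pandas as pd


def pairs_above_threshold(corr: pd.DataFrame, threshold: float = 0.9):
    """Return unordered feature pairs whose |correlation| exceeds ``threshold``.

    Hypothetical helper for illustration only. ``corr`` is assumed to be a
    square, symmetric DataFrame with identical row and column labels.
    """
    values = corr.to_numpy(dtype=float)
    # Indices strictly above the diagonal: each pair once, no self-pairs.
    rows, cols = np.triu_indices_from(values, k=1)
    mask = np.abs(values[rows, cols]) > threshold  # NaN compares as False
    return [(corr.index[i], corr.columns[j]) for i, j in zip(rows[mask], cols[mask])]


def condition_passes(corr: pd.DataFrame, threshold: float = 0.9, n_pairs: int = 0) -> bool:
    """Pass when at most ``n_pairs`` pairs exceed ``threshold``."""
    return len(pairs_above_threshold(corr, threshold)) <= n_pairs


# Toy correlation matrix (made-up numbers, not results from any dataset).
labels = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 0.95, 0.1],
     [0.95, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=labels, columns=labels,
)
offending = pairs_above_threshold(corr)
```

Working on the upper triangle avoids a double loop over features and counts each pair only once.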

deepchecks/tabular/datasets/classification/adult.py

nirhutnik · 2022-06-13T08:33:43Z

@TheSolY About the display - it's really great :) Would only add a short explanation under the heatmap itself, as is done in FeatureLabelCorrelation (but much, much shorter). It should explain that this is a heatmap of correlation between features.

deepchecks/utils/dataframes.py

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

noamzbr · 2022-06-13T11:21:20Z

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

+# along with Deepchecks.  If not, see <http://www.gnu.org/licenses/>.
+# ----------------------------------------------------------------------------
+#
+"""module contains  Feature-Feature Correlation check."""


Suggested change

"""module contains Feature-Feature Correlation check."""

"""module contains the Feature-Feature Correlation check."""

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

noamzbr · 2022-06-13T11:25:10Z

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

+
+        encoded_cat_data = dataset.data.loc[:, cat_features].apply(lambda x: pd.factorize(x)[0])
+
+        all_features = num_features + cat_features


If there are features which are neither (and we have that option), we should say something (perhaps an asterisk noting that n features were not shown because they are neither categorical nor numeric)

@noamzbr asterisk added to the display
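
A small sketch of the situation discussed above, assuming a toy DataFrame with one numerical, one categorical and one datetime column. The column names, the feature lists and the footnote wording are illustrative assumptions; only the `pd.factorize` encoding line mirrors the diff. Note that `pd.factorize` encodes missing values as `-1` by default.

```python
import pandas as pd

# Toy frame; column names and feature lists are illustrative assumptions.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["a", "b", None, "a"],
    "signup": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01"]),
})
num_features = ["age"]
cat_features = ["city"]

# Same encoding step as in the diff: integer codes per categorical column.
encoded_cat_data = df.loc[:, cat_features].apply(lambda x: pd.factorize(x)[0])

all_features = num_features + cat_features
# Columns that are neither numerical nor categorical are left out of the heatmap.
skipped = [col for col in df.columns if col not in all_features]

# Hypothetical footnote text for the display.
footnote = (
    f"* {len(skipped)} feature(s) not shown because they are neither "
    "categorical nor numerical"
) if skipped else ""
```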

noamzbr · 2022-06-13T11:27:55Z

docs/source/checks/tabular/data_integrity/plot_feature_feature_correlation.py

+
+1. numerical-numerical: `Pearson's correlation coefficient <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`__
+2. numerical-categorical: `Correlation ratio <https://en.wikipedia.org/wiki/Correlation_ratio>`__
+3. categorical-categorical: `Symmetric Theil's U <https://en.wikipedia.org/wiki/Uncertainty_coefficient>`__


Because the symmetric variant isn't a thing (right?), we should add more detail here

I took the formula from the Wikipedia link
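
For reference, the symmetric measure given on the linked Wikipedia page is U(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]. The sketch below computes it with pandas; it is not the implementation from this PR. The function names are hypothetical, the inputs are assumed to have no missing values, and returning 0 when both variables are constant is a convention chosen here.

```python
import numpy as np
import pandas as pd


def entropy(x: pd.Series) -> float:
    """Shannon entropy (natural log) of the empirical distribution of ``x``."""
    p = x.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())


def joint_entropy(x: pd.Series, y: pd.Series) -> float:
    """Shannon entropy of the empirical joint distribution of ``x`` and ``y``."""
    p = pd.crosstab(x, y, normalize=True).to_numpy().ravel()
    p = p[p > 0]  # empty cells contribute nothing to the sum
    return float(-(p * np.log(p)).sum())


def symmetric_theil_u(x: pd.Series, y: pd.Series) -> float:
    """Symmetric uncertainty coefficient, in [0, 1] up to rounding."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables constant; convention assumed here
    mutual_information = h_x + h_y - joint_entropy(x, y)
    return 2.0 * mutual_information / (h_x + h_y)


x = pd.Series(["a", "b", "a", "b"])
dependent = symmetric_theil_u(x, pd.Series(["u", "v", "u", "v"]))
independent = symmetric_theil_u(pd.Series(["a", "a", "b", "b"]), pd.Series(["u", "v", "u", "v"]))
```

Unlike the one-directional U(X|Y), this value does not depend on which feature is treated as the predictor, which is what makes it usable in a symmetric feature-by-feature matrix.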

docs/source/checks/tabular/data_integrity/plot_feature_feature_correlation.py

noamzbr · 2022-06-13T11:29:49Z

docs/source/checks/tabular/data_integrity/plot_feature_feature_correlation.py

@@ -0,0 +1,65 @@
+# -*- coding: utf-8 -*-


I can't see the plot here, but it's worth slightly modifying the adult dataset to create a case that is "clearly not OK" and has two features which are clearly redundant.

If the adult dataset already has such a pair, then explain in the "Define a Condition" section what was found. The condition should alert on that pair.

OK, just seen that this happens for education and education num. So we just need to make sure the condition catches that (define it such that it will), and say something about it

tests/tabular/checks/integrity/feature_feature_correlation_test.py

Nadav-Barak

Looks good!!
I would add 2-3 more tests, specifically one that includes null values in different features

Nadav-Barak · 2022-06-13T11:32:30Z

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

+
+        # Numerical-categorical correlations
+        if num_features and cat_features:
+            num_cat_corr = generalized_corrwith(dataset.data.loc[:, num_features], encoded_cat_data,


IMO places in which the numeric feature has a null value should be ignored (masked)

the way null values are handled depends on the method
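
To make the masking suggestion concrete, here is a sketch of a correlation ratio (eta) that ignores rows where the numerical value is null. It is an illustration, not the PR's implementation: the function name is hypothetical, the categorical variable is assumed to be already integer-encoded, and a `-1` code for missing categories would be treated here as a group of its own.

```python
import numpy as np


def correlation_ratio_masked(categories, values) -> float:
    """Correlation ratio (eta) between encoded categories and numeric values.

    Rows where ``values`` is NaN are masked out before computing anything.
    """
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)

    mask = ~np.isnan(values)  # keep only rows with a numeric value
    categories, values = categories[mask], values[mask]
    if values.size == 0:
        return float("nan")

    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # constant numeric feature; convention assumed here

    # Between-group sum of squares: group size times squared mean offset.
    ss_between = 0.0
    for category in np.unique(categories):
        group = values[categories == category]
        ss_between += group.size * (group.mean() - grand_mean) ** 2

    return float(np.sqrt(ss_between / ss_total))


eta_with_nan = correlation_ratio_masked([0, 0, 1, 1, 1], [1.0, 1.0, 3.0, 3.0, np.nan])
eta_unrelated = correlation_ratio_masked([0, 1, 0, 1], [1.0, 1.0, 3.0, 3.0])
```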

ItayGabbay · 2022-06-14T08:43:00Z

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py

+            full_df.loc[cat_features, num_features] = num_cat_corr.transpose()
+
+        # Display
+        fig = px.imshow(full_df)


What happens if we have 1000 features? will the plot be readable?
Maybe we should limit to the top X highest correlations?

I think that the highest are not necessarily the most interesting, and the user should just take this into account and run the check on the partial dataset that they like. Showing only the top could create visual perception bias because the color scheme will be spread on a narrow scale.

@ItayGabbay I've implemented show_top_n_columns
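
One way such a display filter could work is sketched below. How `show_top_n_columns` is actually implemented is not shown in this thread, so the function name, the ranking rule (strongest absolute off-diagonal correlation per feature) and the toy matrix are all assumptions.

```python
import numpy as np
import pandas as pd


def top_n_columns(corr: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the ``n`` features with the strongest off-diagonal correlation.

    Hypothetical helper; ``corr`` is assumed square with matching labels.
    """
    values = corr.abs().to_numpy(dtype=float, copy=True)
    np.fill_diagonal(values, np.nan)  # ignore each feature's self-correlation
    strongest = pd.Series(np.nanmax(values, axis=1), index=corr.index)
    selected = set(strongest.nlargest(n).index)
    # Preserve the original feature order in the filtered matrix.
    keep = [name for name in corr.index if name in selected]
    return corr.loc[keep, keep]


# Toy correlation matrix (made-up numbers).
labels = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 0.95, 0.1],
     [0.95, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=labels, columns=labels,
)
filtered = top_n_columns(corr, n=2)
```

On the perception-bias concern raised above: passing a fixed color range to the heatmap (for example the `zmin` and `zmax` arguments of `plotly.express.imshow`) would keep the color scale comparable regardless of which columns are shown.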

ItayGabbay · 2022-06-14T08:45:01Z

deepchecks/utils/dataframes.py

+    this generalized method applies to any two Dataframes with the same number of rows, regardless of the column names.
+
+    Parameters
+    __________


I think it's in the wrong format. It should be ---
Don't know if this is GitHub or for real...

It's probably GitHub; in the real code it looks like this
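
For context, a numpydoc-style section header is underlined with hyphens, as in the sketch below. The function is a hypothetical stand-in for a "generalized corrwith" that works on any two DataFrames with the same number of rows regardless of column names. Its name, signature and body are assumptions, not the code in `deepchecks/utils/dataframes.py`.

```python
from typing import Callable

import numpy as np
import pandas as pd


def pairwise_apply(x1: pd.DataFrame, x2: pd.DataFrame,
                   method: Callable[[pd.Series, pd.Series], float]) -> pd.DataFrame:
    """Apply ``method`` to every pair of (column of x1, column of x2).

    Parameters
    ----------
    x1 : pd.DataFrame
        First frame; its columns become the rows of the result.
    x2 : pd.DataFrame
        Second frame with the same number of rows as ``x1``.
    method : Callable[[pd.Series, pd.Series], float]
        Function computing one value for a pair of columns.

    Returns
    -------
    pd.DataFrame
        Result with ``x1.columns`` as index and ``x2.columns`` as columns.
    """
    if len(x1) != len(x2):
        raise ValueError("x1 and x2 must have the same number of rows")
    out = pd.DataFrame(index=x1.columns, columns=x2.columns, dtype=float)
    for c1 in x1.columns:
        for c2 in x2.columns:
            out.loc[c1, c2] = method(x1[c1], x2[c2])
    return out


left = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"b": [2.0, 4.0, 6.0], "c": [3.0, 2.0, 1.0]})
result = pairwise_apply(left, right, lambda s, t: float(np.corrcoef(s, t)[0, 1]))
```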

TheSolY added 16 commits June 8, 2022 18:00

preliminary logic

d7c9c0f

Fixed a bug in the adult dataset in train_test False mode: added the …

81dd9dd

…column names when reading the csv preliminary test for feature_feature_correlation feature_feature_correlation WIP

Added generalized_corrwith because Pandas' corrwith only supports sam…

68a2208

…e column names feature_feature_correlation now returns the dataframe with the expected values

correlation ratio now expects the categorical variable already encode…

d5a6ec6

…d, parameter name and docstring changes

Implemented symmetric Theil U

b2e8106

Implemented value_frequency for entropy calculation for Theil U changed categorical method from Crammers C to symmetric Theil U Added heatmap display to the check result

value_frequency moved to utils/distribution/preprocessing

dcd6d95

Add condition correlation less than

77d8021

condition and tests for condition

934182c

Test expected result structure

9cf8faf

Pylint

added the check to __init__

2f79a62

deleted commented out lines

adjusted correlation_methods_test to the changes in correlation_methods

19401d1

test for value_frequencies

5482c16

test generalized_corrwith

028ee10

the condition name changed due to Github Copilot's much better sugges…

45da2a5

…tion docs WIP

docs add condition example

54f3a0b

Merge branch 'main' into 1164-feat-tabular-new-check-correlation-betw…

679c34f

…een-features

TheSolY requested review from a team as code owners June 12, 2022 12:40

TheSolY linked an issue Jun 12, 2022 that may be closed by this pull request

[FEAT] tabular - new check - correlation between features #1164

Closed

TheSolY requested review from ItayGabbay, shir22 and a team as code owners June 12, 2022 12:40

TheSolY added the feature label (Feature update or code change to the package) Jun 12, 2022

TheSolY added 2 commits June 12, 2022 15:53

Pylint

43aae63

flake8

2dede1f

JKL98ISR reviewed Jun 12, 2022

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py (outdated)

JKL98ISR reviewed Jun 12, 2022

deepchecks/tabular/datasets/classification/adult.py

JKL98ISR approved these changes Jun 12, 2022

fixed a typo in the example name

db05cdb

TheSolY added 2 commits June 13, 2022 10:33

changes based on PR comments

4b8e327

Fixed feature_feature_correlation for the case of only numerical or o…

8c7f8c2

…nly categorical data Added a fixture for the adult dataset test for the case of data with nans and only numerical

test for value_frequency with NaNs

b9880f6

nirhutnik reviewed Jun 13, 2022

deepchecks/utils/dataframes.py (outdated)

nirhutnik reviewed Jun 13, 2022

deepchecks/tabular/checks/data_integrity/feature_feature_correlation.py (outdated)

Apply suggestions from code review

8fc81c6

Co-authored-by: Nir Hutnik <92314933+nirhutnik@users.noreply.github.com>

noamzbr reviewed Jun 13, 2022

Nadav-Barak approved these changes Jun 13, 2022

Update docs/source/checks/tabular/data_integrity/plot_feature_feature…

fece2e2

…_correlation.py Co-authored-by: Noam Bressler <noamzbr@gmail.com>

ItayGabbay reviewed Jun 14, 2022

TheSolY added 7 commits June 14, 2022 23:51

added filtering parameters to the check

3b89610

Added a fixture for a DataFrame with mixed datatypes, NaNs, a whole Nan column, and a datetime column

Merge branch '1164-feat-tabular-new-check-correlation-between-feature…

aee30a2

…s' of https://github.com/deepchecks/deepchecks into 1164-feat-tabular-new-check-correlation-between-features

Merge branch 'main' into 1164-feat-tabular-new-check-correlation-betw…

480fb43

…een-features # Conflicts: # tests/conftest.py Mine - added fixtures, Theirs- imports changes and tqdm

post-merge fixed

55a50bf

display improvements

291c763

Extended docsting + pylint

2332bfe

replaced double-loop in the condition

0b70e8e

TheSolY requested review from nirhutnik, ItayGabbay and noamzbr June 16, 2022 11:36

isort

8d70af9

ItayGabbay approved these changes Jun 16, 2022

nirhutnik approved these changes Jun 16, 2022

TheSolY added 2 commits June 16, 2022 16:49

minor fixes

1071f24

Merge branch 'main' into 1164-feat-tabular-new-check-correlation-betw…

50170a1

…een-features

noamzbr enabled auto-merge (squash) June 16, 2022 15:00

noamzbr merged commit aed8c9c into main Jun 16, 2022

delete-merged-branch bot deleted the 1164-feat-tabular-new-check-correlation-between-features branch June 16, 2022 17:12

TheSolY mentioned this pull request Oct 11, 2022

[FEAT] tabular - new check - correlation between features #1164

Closed
