[FEAT] tabular - new check - correlation between features #1164

Nadav-Barak · 2022-04-03T13:42:25Z

A single-dataset check (as part of integrity) that detects whether two columns are extremely correlated or even duplicated.

To do that, we need to calculate the correlation between different types of columns:
Numeric - numeric: Spearman method (via pandas)
Numeric - categorical: correlation_ratio (via utils\correlation_methods)
Categorical - categorical: Cramér's V (via utils\distribution\drift)
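To make the dispatch concrete, here is a minimal sketch of the three measures. It does not reuse the existing helpers in utils; `correlation_ratio`, `cramers_v` and `pairwise_correlation` below are illustrative re-implementations, and the Cramér's V shown has no bias correction, which may differ from the version in the package.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) between a categorical and a numeric column."""
    values = values.astype(float)
    grand_mean = values.mean()
    groups = values.groupby(categories)
    # Between-group sum of squares, weighted by group size.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V (no bias correction) computed from a contingency table."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


def pairwise_correlation(df: pd.DataFrame, a: str, b: str, cat_features) -> float:
    """Pick the measure according to the types of the two columns."""
    a_cat, b_cat = a in cat_features, b in cat_features
    if a_cat and b_cat:
        return cramers_v(df[a], df[b])
    if a_cat:
        return correlation_ratio(df[a], df[b])
    if b_cat:
        return correlation_ratio(df[b], df[a])
    # DataFrame.corr handles the ranking and excludes nulls pairwise.
    return float(abs(df[[a, b]].corr(method="spearman").iloc[0, 1]))
```

The Spearman value is returned as an absolute value so that all three measures live on the same 0 to 1 scale; that is a choice of this sketch, not something decided in the issue.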

Null filling:
Nulls in categorical features should be mapped to a new category.
For numeric features, for the correlation ratio, calculate the indexes of the nulls in each feature and send them as a parameter to the correlation_ratio method.
For numeric features, for Spearman, nothing is required.
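A rough sketch of the null handling described above. The placeholder label `__null__`, the helper names and the `ignore_mask` parameter are all assumptions made for illustration; the real correlation_ratio signature may look different. Spearman needs no extra step because pandas `DataFrame.corr` already excludes null values pairwise.

```python
import numpy as np
import pandas as pd

NULL_CATEGORY = "__null__"  # hypothetical label for the extra category


def fill_categorical_nulls(col: pd.Series) -> pd.Series:
    """Map nulls in a categorical column to a dedicated new category."""
    return col.astype(object).where(col.notna(), NULL_CATEGORY)


def null_mask(col: pd.Series) -> np.ndarray:
    """Boolean mask marking the positions of nulls in a numeric column."""
    return col.isna().to_numpy()


def correlation_ratio_masked(categories: pd.Series, values: pd.Series,
                             ignore_mask: np.ndarray) -> float:
    """Correlation ratio computed only on the rows not flagged in ignore_mask."""
    keep = ~ignore_mask
    cats = categories.loc[keep]
    vals = values.loc[keep].astype(float)
    grand_mean = vals.mean()
    groups = vals.groupby(cats)
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((vals - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0


cat_col = pd.Series(["a", "a", None, "b", "b", None])
num_col = pd.Series([1.0, 1.0, np.nan, 3.0, 3.0, 7.0])

filled = fill_categorical_nulls(cat_col)
eta = correlation_ratio_masked(filled, num_col, null_mask(num_col))
```

Here the row with a null numeric value is skipped, while the row with a null category is kept and grouped under the new category.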

The plot output is an MxM heatmap, with numeric features appearing first.

Make sure to check the run time on large datasets (many features and many samples) to see whether an issue arises.
If possible, try to implement this in a matrix-based fashion to reduce run time.
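One possible way to get both the ordering and part of the matrix-based computation: build the MxM frame with numeric features first and fill the whole numeric block with a single `DataFrame.corr` call instead of looping over pairs. This is only a sketch; `init_correlation_matrix` is a hypothetical helper, and the numeric-categorical and categorical-categorical blocks are left as NaN to be filled by the other measures.

```python
import numpy as np
import pandas as pd


def init_correlation_matrix(df: pd.DataFrame, cat_features) -> pd.DataFrame:
    """MxM frame, numeric features first, with the numeric block pre-filled."""
    numeric = [c for c in df.columns if c not in cat_features]
    categorical = [c for c in df.columns if c in cat_features]
    ordered = numeric + categorical

    values = np.full((len(ordered), len(ordered)), np.nan)
    np.fill_diagonal(values, 1.0)
    matrix = pd.DataFrame(values, index=ordered, columns=ordered)

    if numeric:
        # One vectorized call covers every numeric-numeric pair.
        matrix.loc[numeric, numeric] = df[numeric].corr(method="spearman").abs()
    return matrix


example = pd.DataFrame({
    "cat": list("abab"),
    "x": [1, 2, 3, 4],
    "y": [4, 3, 2, 1],
})
matrix = init_correlation_matrix(example, cat_features=["cat"])
```

The resulting frame can be handed to whatever heatmap plotting is used, since its row and column order already matches the requested layout.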

The condition should check that there are no more than x column pairs with correlation above y. Also add a condition that allows no column pairs with correlation above y (in my opinion this should be the default one).
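A sketch of how such a condition could be evaluated on the correlation matrix. The function names and the 0.9 default for y are assumptions for illustration only; `max_pairs=0` corresponds to the stricter "no pairs allowed" variant.

```python
import pandas as pd


def pairs_above_threshold(corr: pd.DataFrame, threshold: float = 0.9):
    """List each unordered column pair whose correlation exceeds the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair counts once
            value = corr.loc[a, b]
            if pd.notna(value) and value > threshold:
                pairs.append((a, b, float(value)))
    return pairs


def condition_passes(corr: pd.DataFrame, threshold: float = 0.9,
                     max_pairs: int = 0) -> bool:
    """True when at most max_pairs column pairs exceed the threshold."""
    return len(pairs_above_threshold(corr, threshold)) <= max_pairs


corr = pd.DataFrame(
    [[1.0, 0.95, 0.2],
     [0.95, 1.0, 0.1],
     [0.2, 0.1, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)
strict = condition_passes(corr)                # default: no pairs allowed
lenient = condition_passes(corr, max_pairs=1)  # allow one pair above y
```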

noamzbr · 2022-05-23T08:49:35Z

Preceded by #1421

Nadav-Barak · 2022-05-23T09:33:51Z

Additional suggestion - add a feature flag to use PPS for all calculations instead

TheSolY · 2022-10-11T08:43:21Z

Hallelujah

…

On Thu, 16 Jun 2022 at 20:12 Noam Bressler wrote: Closed #1164 as completed via #1606.

Nadav-Barak added the suggestion label Apr 3, 2022

noamzbr added the feature Feature update or code change to the package label Apr 3, 2022

shir22 added this to the 0.7 milestone Apr 24, 2022

noamzbr modified the milestones: 0.7, 0.6.4 May 8, 2022

noamzbr assigned Nadav-Barak May 10, 2022

noamzbr mentioned this issue May 11, 2022

[FEAT] Feature correlation research #1421

Closed

noamzbr unassigned Nadav-Barak May 11, 2022

noamzbr modified the milestones: 0.7, 0.7.1 May 23, 2022

noamzbr assigned Nadav-Barak May 23, 2022

deepchecks deleted a comment from noamzbr May 23, 2022

deepchecks deleted a comment from nirhutnik May 23, 2022

Nadav-Barak removed their assignment May 25, 2022

JKL98ISR assigned JKL98ISR and unassigned JKL98ISR May 29, 2022

noamzbr added the ds Tasks suited for Data Scientists label May 29, 2022

noamzbr assigned TheSolY May 31, 2022

TheSolY linked a pull request Jun 12, 2022 that will close this issue

1164 feat tabular new check correlation between features #1606

Merged

noamzbr closed this as completed in #1606 Jun 16, 2022
