[FEAT] tabular - new check - correlation between features #1164

Closed
Nadav-Barak opened this issue Apr 3, 2022 · 3 comments · Fixed by #1606
Labels
ds Tasks suited for Data Scientists feature Feature update or code change to the package

Comments

@Nadav-Barak
Contributor

Nadav-Barak commented Apr 3, 2022

A single-dataset check (as part of integrity) that detects whether two columns are extremely correlated or even duplicated.

To do that, we need to calculate the correlation between columns of different types:
Numeric - numeric: Spearman method (via pandas)
Numeric - categorical: correlation_ratio (via utils\correlation_methods)
Categorical - categorical: Cramér's V (via utils\distribution\drift)
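
A minimal sketch of the three measures listed above, for illustration only. The function names and signatures here are assumptions and are not the helpers in the package's utils modules; the Cramér's V shown is the plain, uncorrected form, and the inputs are assumed to contain no nulls.

```python
import numpy as np
import pandas as pd


def spearman_corr(x: pd.Series, y: pd.Series) -> float:
    """Numeric vs. numeric: Spearman rank correlation computed by pandas."""
    pair = pd.concat([x, y], axis=1)
    return float(pair.corr(method="spearman").iloc[0, 1])


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Numeric vs. categorical: eta = sqrt(SS_between / SS_total)."""
    overall_mean = values.mean()
    ss_total = ((values - overall_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # a constant numeric column carries no variance to explain
    groups = values.groupby(categories)
    ss_between = (groups.size() * (groups.mean() - overall_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Categorical vs. categorical: uncorrected Cramér's V from chi-squared."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0
```

All three return a value whose magnitude is at most 1, which is what lets them share one heatmap, although only Spearman carries a sign.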

Null filling:
Nulls in categorical features should be mapped to a new category.
For numeric features used with the correlation ratio, calculate the indexes of the nulls in each feature and pass them as a parameter to the correlation_ratio method.
For numeric features used with Spearman, nothing is required, since pandas excludes nulls when computing correlations.
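
The null handling described above might look roughly like this. The placeholder label, the helper names, and the `skip` parameter are all hypothetical; in particular, the real correlation_ratio signature is not shown in this issue, so the third function is only a stand-in that accepts null indexes as a parameter.

```python
import numpy as np
import pandas as pd

NULL_CATEGORY = "__null__"  # hypothetical label for the extra category


def fill_categorical_nulls(col: pd.Series) -> pd.Series:
    """Map nulls in a categorical column to their own category."""
    return col.astype(object).where(col.notna(), NULL_CATEGORY)


def null_indexes(col: pd.Series) -> np.ndarray:
    """Positional indexes of the null entries in a numeric column."""
    return np.flatnonzero(col.isna().to_numpy())


def correlation_ratio_skipping(
    categories: pd.Series, values: pd.Series, skip: np.ndarray
) -> float:
    """Correlation ratio computed only on the rows not listed in `skip`."""
    keep = np.ones(len(values), dtype=bool)
    keep[skip] = False
    cats = categories.to_numpy()[keep]
    vals = values.to_numpy(dtype=float)[keep]
    overall_mean = vals.mean()
    ss_total = ((vals - overall_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0
    ss_between = sum(
        (cats == c).sum() * (vals[cats == c].mean() - overall_mean) ** 2
        for c in pd.unique(cats)
    )
    return float(np.sqrt(ss_between / ss_total))
```

Passing the indexes in, rather than dropping rows up front, keeps the original columns untouched so the same data can be reused for other pairs.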

The plot output is an MxM heatmap, with numeric features appearing first.

Make sure to check the run time on large datasets (many features and many samples) to see whether any issues arise.
If possible, try to implement this in a matrix-based fashion to reduce run time.
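
One way the heatmap layout and the matrix-based suggestion could fit together is sketched below. It is an assumption-laden illustration: `correlation_matrix` and its callable parameters are hypothetical names, the numeric block is filled with a single pandas call while the remaining pairs are still looped, and taking the absolute value of Spearman (so every cell lies in [0, 1]) is a choice made here, not something the issue specifies.

```python
from itertools import combinations
from typing import Callable

import numpy as np
import pandas as pd

PairFn = Callable[[pd.Series, pd.Series], float]


def correlation_matrix(
    df: pd.DataFrame, num_cat_fn: PairFn, cat_cat_fn: PairFn
) -> pd.DataFrame:
    """Build an MxM matrix with numeric columns first, then categorical."""
    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = [c for c in df.columns if c not in num_cols]
    order = num_cols + cat_cols
    result = pd.DataFrame(np.eye(len(order)), index=order, columns=order)

    # Numeric block: one vectorized call covers every numeric pair.
    if num_cols:
        block = df[num_cols].corr(method="spearman").abs().to_numpy()
        result.loc[num_cols, num_cols] = block

    # Mixed block: num_cat_fn receives (categorical, numeric).
    for num in num_cols:
        for cat in cat_cols:
            value = num_cat_fn(df[cat], df[num])
            result.loc[num, cat] = result.loc[cat, num] = value

    # Categorical block: each unordered pair computed once, mirrored.
    for a, b in combinations(cat_cols, 2):
        value = cat_cat_fn(df[a], df[b])
        result.loc[a, b] = result.loc[b, a] = value

    return result
```

The returned DataFrame can be handed directly to a heatmap plotting function; timing this on wide datasets would show whether the looped blocks need vectorizing too.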

The condition should check that no more than x column pairs have a correlation above y. Also add a condition that allows no column pairs with a correlation above y (in my opinion, this should be the default).
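
The proposed condition can be sketched as a count over the upper triangle of the correlation matrix. The function names and the 0.9 default threshold are hypothetical; `max_pairs=0` corresponds to the suggested default of allowing no pairs above the threshold.

```python
import numpy as np
import pandas as pd


def pairs_above(matrix: pd.DataFrame, threshold: float) -> list:
    """List (column_a, column_b, value) for each pair above the threshold."""
    values = matrix.to_numpy()
    # k=1 skips the diagonal and visits each unordered pair exactly once.
    rows, cols = np.triu_indices_from(values, k=1)
    return [
        (matrix.index[r], matrix.columns[c], float(values[r, c]))
        for r, c in zip(rows, cols)
        if values[r, c] > threshold
    ]


def condition_passes(
    matrix: pd.DataFrame, threshold: float = 0.9, max_pairs: int = 0
) -> bool:
    """Pass when at most `max_pairs` column pairs exceed `threshold`."""
    return len(pairs_above(matrix, threshold)) <= max_pairs
```

Returning the offending pairs, not just a count, lets the condition's failure message name the columns involved.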

@noamzbr noamzbr added the feature Feature update or code change to the package label Apr 3, 2022
@shir22 shir22 added this to the 0.7 milestone Apr 24, 2022
@noamzbr noamzbr modified the milestones: 0.7, 0.6.4 May 8, 2022
@noamzbr
Collaborator

noamzbr commented May 23, 2022

Preceded by #1421

@noamzbr noamzbr modified the milestones: 0.7, 0.7.1 May 23, 2022
@Nadav-Barak
Contributor Author

Additional suggestion: add a feature flag to use PPS (Predictive Power Score) for all calculations instead.

@deepchecks deepchecks deleted a comment from noamzbr May 23, 2022
@deepchecks deepchecks deleted a comment from noamzbr May 23, 2022
@deepchecks deepchecks deleted a comment from nirhutnik May 23, 2022
@Nadav-Barak Nadav-Barak removed their assignment May 25, 2022
@JKL98ISR JKL98ISR assigned JKL98ISR and unassigned JKL98ISR May 29, 2022
@noamzbr noamzbr added the ds Tasks suited for Data Scientists label May 29, 2022
@TheSolY TheSolY linked a pull request Jun 12, 2022 that will close this issue
@TheSolY
Contributor

TheSolY commented Oct 11, 2022 via email

5 participants