A single-dataset check (as part of integrity) that tests whether any two columns are extremely correlated or even duplicated.
To do that, we need to calculate the correlation between each combination of column types:
Corr numeric - numeric: Spearman method (via pandas)
Corr numeric - categorical: correlation_ratio (via utils\correlation_methods)
Corr categorical - categorical: Cramér's V (via utils\distribution\drift)
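To make the three measures concrete, here is a minimal sketch using only pandas and NumPy. The `correlation_ratio` and `cramers_v` functions below are illustrative stand-ins written for this example; they are not the implementations in `utils\correlation_methods` or `utils\distribution\drift`, whose signatures may differ. The Cramér's V here has no bias correction, which is an assumption.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) between a categorical and a numeric column.

    Illustrative stand-in, not the project's own correlation_ratio.
    """
    values = values.astype(float)
    grand_mean = values.mean()
    total_ss = ((values - grand_mean) ** 2).sum()
    if total_ss == 0:
        return 0.0  # a constant numeric column carries no signal
    grouped = values.groupby(categories)
    # Between-group sum of squares: group size * squared distance of group mean
    between_ss = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(np.sqrt(between_ss / total_ss))


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V (without bias correction) between two categorical columns."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0  # one of the columns has a single category
    return float(np.sqrt(chi2 / (n * min_dim)))


df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_squared": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],  # monotonic in x
    "group": ["a", "a", "a", "b", "b", "b"],
    "group_copy": ["p", "p", "p", "q", "q", "q"],    # relabelled duplicate
})

# numeric - numeric: Spearman via pandas
spearman = df[["x", "x_squared"]].corr(method="spearman").loc["x", "x_squared"]
# numeric - categorical: correlation ratio
eta = correlation_ratio(df["group"], df["x"])
# categorical - categorical: Cramér's V
v = cramers_v(df["group"], df["group_copy"])
```

In this toy frame, both the monotonic numeric pair and the relabelled categorical pair score 1.0, which is the kind of "duplicated column" the check is meant to surface.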
Null filling:
Categorical feature nulls should be mapped to a new category.
For numeric features, for the correlation ratio, calculate the indexes of the nulls in each feature and pass them as a parameter to the correlation_ratio method.
For numeric features, for Spearman, nothing is required, since pandas already excludes null values when computing the correlation.
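The null-handling rules above could look roughly like this. The placeholder label `__null__` and the `null_positions` dictionary are assumptions made for illustration; how the real `correlation_ratio` method expects the null indexes is not specified here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", None, "blue", "red", None],
    "height": [1.0, np.nan, 3.0, 4.0, 5.0],
    "width": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# Categorical nulls are mapped to their own category.
# The label itself is an arbitrary choice for this sketch.
NULL_CATEGORY = "__null__"
color_filled = df["color"].fillna(NULL_CATEGORY)

# For the correlation ratio: collect the positional indexes of the nulls in
# each numeric feature, ready to be handed to the correlation method.
null_positions = {
    col: np.flatnonzero(df[col].isna().to_numpy())
    for col in ["height", "width"]
}

# For Spearman: no filling needed, pandas drops null pairs on its own.
spearman = df[["height", "width"]].corr(method="spearman")
```

Mapping nulls to a dedicated category keeps "missingness" visible as a signal, so two categorical columns that are null in the same rows will still show up as correlated.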
The plot output is an MxM heatmap, with numeric features appearing first.
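A small sketch of the "numeric features first" ordering, assuming the correlations are already held in a square pandas DataFrame. The `corr` matrix below is a zero-filled placeholder and the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "a"],
    "age": [30, 41, 25],
    "plan": ["x", "x", "y"],
    "income": [10.0, 20.0, 15.0],
})

# Numeric columns first, then everything else, each in original order.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in df.columns if c not in numeric_cols]
order = numeric_cols + categorical_cols

# Placeholder MxM matrix; in the real check it would hold the correlations.
corr = pd.DataFrame(0.0, index=df.columns, columns=df.columns)

# Reorder rows and columns together so the heatmap stays symmetric.
ordered = corr.loc[order, order]
```

The reordered frame can then be passed to whichever plotting library draws the heatmap.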
Make sure to check the run time on large datasets (many features and many samples) to see whether any issues arise.
If possible, try to implement this in a matrix-based fashion to reduce run time.
The condition should check that there are no more than x column pairs with correlation above y. Also add a condition that allows no column pairs with correlation above y (in my opinion this should be the default one).
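As a rough sketch of the proposed condition, the function below counts the column pairs whose correlation exceeds a threshold and compares that count to an allowed maximum. The function names, the use of the absolute value (so strong negative Spearman correlations also count) and the example matrix are all assumptions for illustration. Setting `max_pairs=0` gives the stricter "no pairs allowed" variant suggested as the default.

```python
import numpy as np
import pandas as pd


def count_pairs_above(corr: pd.DataFrame, threshold: float) -> int:
    """Count distinct column pairs whose |correlation| exceeds the threshold."""
    values = corr.abs().to_numpy()
    # Strict upper triangle: each pair once, diagonal excluded.
    upper = np.triu(np.ones(values.shape, dtype=bool), k=1)
    return int((values[upper] > threshold).sum())


def not_more_than_n_pairs_above(
    corr: pd.DataFrame, threshold: float, max_pairs: int = 0
) -> bool:
    """Condition passes when at most max_pairs pairs exceed the threshold."""
    return count_pairs_above(corr, threshold) <= max_pairs


features = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.00, 0.95, 0.20],
     [0.95, 1.00, -0.99],
     [0.20, -0.99, 1.00]],
    index=features,
    columns=features,
)

n_high = count_pairs_above(corr, threshold=0.9)
passes_default = not_more_than_n_pairs_above(corr, threshold=0.9)
passes_relaxed = not_more_than_n_pairs_above(corr, threshold=0.9, max_pairs=2)
```

Working on the upper triangle as a boolean mask keeps the condition itself matrix-based, so it adds little to the overall run time.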