[FEAT] Feature correlation research #1421

Closed
noamzbr opened this issue May 11, 2022 · 3 comments · Fixed by #1484

@noamzbr
Collaborator

noamzbr commented May 11, 2022

Is your feature request related to a problem? Please describe.
To implement #1164, we need a robust and scalable way to compute the correlation between all pairs of features: not just numeric to numeric, but also categorical to categorical and numeric to categorical.

Describe the solution you'd like
A proven and tested (in the sense that it has tests) util function that can compute these arbitrary correlations.

Source

  1. https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
  2. https://www.statology.org/correlation-between-categorical-variables/
  3. https://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variab
  4. https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365
@Nadav-Barak
Contributor

Nadav-Barak commented May 18, 2022

Categorical to categorical
Theil's U is an asymmetric, entropy-based measure that ranges over [0,1]. Run time is M^2 times the cost of one entropy calculation (which is in turn a function of the number of unique values).
The alternative is Cramér's V; the problem is that it is symmetric and therefore loses some information.
We need to compare them in practice (performance and run time). Both have a solid theoretical backbone.
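
To make the comparison concrete, here is a minimal pure-Python sketch of both measures. The function names (`entropy`, `conditional_entropy`, `theils_u`, `cramers_v`) are hypothetical and this is not necessarily how the linked PR implements them. Assumptions: natural-log entropy, a constant `x` is treated as fully explained (U = 1), and Cramér's V is computed without bias correction.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): entropy left in x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # H(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
    return -sum(
        (c / n) * math.log(c / y_counts[y_val])
        for (_, y_val), c in xy_counts.items()
    )


def theils_u(x, y):
    """U(X | Y) in [0, 1]: fraction of x's entropy that y explains."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # assumed convention: a constant x is fully "explained"
    return (h_x - conditional_entropy(x, y)) / h_x


def cramers_v(x, y):
    """Cramér's V in [0, 1], no bias correction; symmetric in x and y."""
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    xy_counts = Counter(zip(x, y))
    chi2 = 0.0
    for x_val, cx in x_counts.items():
        for y_val, cy in y_counts.items():
            expected = cx * cy / n
            chi2 += (xy_counts[(x_val, y_val)] - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data: city determines country, but country does not determine city.
city = ["paris", "lyon", "rome", "milan", "paris", "rome"]
country = ["fr", "fr", "it", "it", "fr", "it"]

print(theils_u(country, city))  # 1.0: knowing the city fixes the country
print(theils_u(city, country))  # about 0.52: the country only narrows the city down
print(cramers_v(city, country), cramers_v(country, city))  # same value both ways
```

On this toy data Cramér's V reports a perfect association in both directions, while Theil's U shows that only one direction is fully predictive, which is the information the symmetric measure loses.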

@Nadav-Barak
Contributor

Nadav-Barak commented May 18, 2022

Numeric to categorical
Correlation ratio looks most promising. It is an asymmetric (single direction!), variance-based method that determines how much of a numeric feature's variance is explained by a categorical feature. It returns a value in [0,1].
It groups the values of the numeric feature by their corresponding category and measures the weighted variance of the group means divided by the variance of all samples (this ratio is the correlation ratio squared).
The other option is ANOVA, which uses the same grouping mechanism but a simpler (yet similar) metric: it compares the variance between the group means with the average variance inside a group.
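
A minimal pure-Python sketch of both options, to show how closely they are related. The names `correlation_ratio` and `anova_f` are hypothetical, not taken from the linked PR. Assumptions: the correlation ratio is returned as eta (the square root of the between-group share of the total sum of squares), a constant numeric feature scores 0, and the ANOVA variant is the standard one-way F statistic.

```python
import math
from collections import defaultdict


def _group_by_category(categories, values):
    """Collect the numeric values that fall under each category."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    return groups


def _sum_of_squares_between(groups, grand_mean):
    """Size-weighted squared distance of each group mean from the grand mean."""
    return sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )


def correlation_ratio(categories, values):
    """Eta in [0, 1]: sqrt(between-group SS / total SS)."""
    groups = _group_by_category(categories, values)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # assumed convention: a constant numeric feature scores 0
    return math.sqrt(_sum_of_squares_between(groups, grand_mean) / ss_total)


def anova_f(categories, values):
    """One-way ANOVA F: between-group variance over within-group variance."""
    groups = _group_by_category(categories, values)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more samples than groups")
    grand_mean = sum(values) / n
    ss_between = _sum_of_squares_between(groups, grand_mean)
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return math.inf  # groups are perfectly separated
    return (ss_between / (k - 1)) / (ss_within / (n - k))


categories = ["a", "a", "b", "b"]
values = [1.0, 2.0, 5.0, 6.0]
print(correlation_ratio(categories, values))  # about 0.97 (eta^2 = 16/17)
print(anova_f(categories, values))            # 32.0
```

One practical difference: the correlation ratio is bounded in [0,1], so it can sit in the same matrix as the categorical measures, whereas the F statistic is unbounded.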

@Nadav-Barak
Contributor

After discussion with the one and only @noamzbr, it was decided to use Theil's U for cat to cat (bidirectional) and correlation ratio for numeric to cat. The output will look something like this:

  • We need to consider presenting only some of the columns.

Image
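
As a rough illustration of how the decision above could be assembled into one feature-by-feature table, here is a self-contained sketch. Everything in it is an assumption made for illustration: the helper name `association_matrix`, the nested-dict output, the row/column convention (cell `[row][col]` answers "how much does `col` tell us about `row`"), writing the single-direction correlation ratio into both cells of a numeric/categorical pair, and leaving numeric-to-numeric pairs as `None` because the thread does not cover them.

```python
import math
from collections import Counter, defaultdict


def theils_u(x, y):
    """U(x | y): share of x's entropy explained by y (both categorical)."""
    n = len(x)
    h_x = -sum(c / n * math.log(c / n) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # assumed convention for a constant column
    y_counts = Counter(y)
    h_x_given_y = -sum(
        c / n * math.log(c / y_counts[y_val])
        for (_, y_val), c in Counter(zip(x, y)).items()
    )
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, values):
    """Eta in [0, 1] between a categorical and a numeric column."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    mean = sum(values) / len(values)
    ss_total = sum((v - mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0
    ss_between = sum(
        len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


def association_matrix(columns, categorical):
    """columns: {name: list of values}; categorical: set of categorical names.

    Returns a nested dict where matrix[row][col] is the association score.
    """
    names = list(columns)
    matrix = {row: {} for row in names}
    for row in names:
        for col in names:
            row_cat, col_cat = row in categorical, col in categorical
            if row == col:
                score = 1.0
            elif row_cat and col_cat:
                # Asymmetric: both directions are computed separately.
                score = theils_u(columns[row], columns[col])
            elif row_cat != col_cat:
                cat, num = (row, col) if row_cat else (col, row)
                score = correlation_ratio(columns[cat], columns[num])
            else:
                score = None  # numeric to numeric: out of scope here
            matrix[row][col] = score
    return matrix


data = {
    "city": ["paris", "lyon", "rome", "milan", "paris", "rome"],
    "country": ["fr", "fr", "it", "it", "fr", "it"],
    "price": [1.0, 2.0, 5.0, 6.0, 1.5, 5.5],
}
for row, scores in association_matrix(data, {"city", "country"}).items():
    print(row, scores)
```

Showing only some of the columns would then amount to filtering `names` before the loops, for example by keeping the features with the highest scores.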

@Nadav-Barak Nadav-Barak linked a pull request May 19, 2022 that will close this issue