Different Scores for Uncertainty Coefficient #70

Closed
aviasd opened this issue Dec 22, 2020 · 1 comment
Labels
question Further information is requested

Comments


aviasd commented Dec 22, 2020

When showing the categorical variable's report, we have two information zones:

  1. [feature_name] PROVIDES INFORMATION ON...
  2. THESE FEATURES GIVE INFORMATION ON [feature_name]

This is an example from the Kaggle Housing Price dataset:
[image: report screenshot]

We can see that MSZoning provides information on Alley and Neighborhood and many more features.
We can also see that Alley and Neighborhood and many more features provide information on MSZoning.
The uncertainty coefficient of MSZoning on Alley is 0.27 and the uncertainty coefficient of Alley on MSZoning is 0.11.
The uncertainty coefficient of MSZoning on Neighborhood is 0.16 and the uncertainty coefficient of Neighborhood on MSZoning is 0.67.

Why aren't these numbers the same in both directions? If MSZoning affects Neighborhood, the effect should be the same as that of Neighborhood on MSZoning (if we are talking about correlation).

How are these numbers calculated?
Thank you,

@fbdesignpro
Owner

Hi @aviasd, thank you for the detailed question.

The asymmetry is actually a key feature of the uncertainty coefficient. Because these features are categorical (not numerical), the values are not a "correlation" in the usual sense. The uncertainty coefficient is designed for categorical features (with the caveat that the calculation assumes a certain underlying distribution), and it behaves the way the numbers you quoted suggest. I took this from the article: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
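
For reference, the uncertainty coefficient (also known as Theil's U) is usually defined as below. This is the textbook definition, assumed here to match what the linked article describes; it is not copied from this project's code. $H(X)$ is the entropy of $X$ and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$.

```latex
U(X \mid Y) = \frac{H(X) - H(X \mid Y)}{H(X)},
\qquad
H(X) = -\sum_{x} p(x) \log p(x),
\qquad
H(X \mid Y) = -\sum_{x,\,y} p(x, y) \log \frac{p(x, y)}{p(y)}
```

The numerator $H(X) - H(X \mid Y)$ is the same in both directions, but it is divided by $H(X)$ in one direction and by $H(Y)$ in the other, so $U(X \mid Y)$ and $U(Y \mid X)$ generally differ.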

That is, given the numbers above, knowing MSZoning tells us 0.27 about Alley, but knowing Alley tells us less (0.11) about MSZoning. Both values are computed from how often the categories occur together in the data.

It's actually pretty awesome. A lot of the time, it is very symmetrical, but sometimes it is not and that can be interesting!

For example, in the Titanic dataset, Pclass provides information on Survived at 0.09, but Survived provides information on Pclass at only 0.06. In other words, knowing the passenger class tells us more about whether a passenger survived than knowing whether they survived tells us about their class. Some of this can be related to the number of categories (e.g. there are 3 values of Pclass but only 2 (yes/no) for Survived).
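
To make the effect of the number of categories concrete, here is a minimal sketch of the uncertainty coefficient on made-up data. It is not this project's implementation: the function names (`entropy`, `conditional_entropy`, `theils_u`) and the toy `neighborhood` / `zone` lists are hypothetical, and the formula is the standard entropy-based definition.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(X) of a list of category labels."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) from paired category labels."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) = c / n and p(x | y) = c / count(y)
    return -sum((c / n) * log(c / y_counts[yv]) for (_, yv), c in xy_counts.items())


def theils_u(x, y):
    """U(X | Y): fraction of the uncertainty in x that knowing y removes."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy data: each neighborhood lies in exactly one zone,
# but each zone contains two neighborhoods.
neighborhood = ["A", "A", "B", "B", "C", "C", "D", "D"]
zone = ["R", "R", "R", "R", "C", "C", "C", "C"]

u_zone_given_nbhd = theils_u(zone, neighborhood)  # neighborhood fully determines zone
u_nbhd_given_zone = theils_u(neighborhood, zone)  # zone only narrows it down to 2 of 4

print(u_zone_given_nbhd, u_nbhd_given_zone)
```

In this toy case the neighborhood pins down the zone completely (a coefficient of 1.0), while the zone only halves the uncertainty about the neighborhood (about 0.5). That is the same kind of asymmetry as the Neighborhood and MSZoning numbers in the question.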

I hope this helps! If there are any issues with this calculation, definitely let me know as I do want these things to be helpful! :)

fbdesignpro added the question label Feb 15, 2021