Different Scores for Uncertainty Coefficient #70

aviasd · 2020-12-22T11:59:27Z

When showing the categorical variable's report, we have two information zones:

[feature_name] PROVIDES INFORMATION ON...
THESE FEATURES GIVE INFORMATION ON [feature_name]

This is an example from the Kaggle Housing Price dataset:

We can see that MSZoning provides information on Alley and Neighborhood and many more features.
We can also see that Alley and Neighborhood and many more features provide information on MSZoning.
The uncertainty coefficient of MSZoning on Alley is 0.27 and the uncertainty coefficient of Alley on MSZoning is 0.11.
The uncertainty coefficient of MSZoning on Neighborhood is 0.16 and the uncertainty coefficient of Neighborhood on MSZoning is 0.67.

Why aren't these numbers the same in both directions? If MSZoning affects Neighborhood, the effect should be the same as that of Neighborhood on MSZoning (if we are talking about correlation).

How are these numbers calculated?
Thank you,


fbdesignpro · 2020-12-22T13:37:51Z

Hi @aviasd, thank you for the detailed question.

The fact that they are not the same is actually a key feature of the uncertainty coefficient. Because these features are categorical (not numerical), the scores are not a "correlation" in the usual sense. The uncertainty coefficient is designed for categorical features (with some caveats, since the calculation assumes a certain underlying distribution), and it behaves the way the numbers you quote suggest: the score depends on the direction. I took this from the article: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
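For reference, the uncertainty coefficient (Theil's U) is usually defined from entropies as shown here. This is the textbook definition, written with generic symbols X and Y for two categorical variables; it is not copied from this library's code.

```latex
U(X \mid Y) = \frac{H(X) - H(X \mid Y)}{H(X)} = \frac{I(X;Y)}{H(X)},
\qquad
U(Y \mid X) = \frac{I(X;Y)}{H(Y)}
```

The mutual information I(X;Y) in the numerator is symmetric, but the denominators H(X) and H(Y) generally differ, so the two directions give different scores unless both variables have the same entropy.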

i.e. given the numbers above, knowing "MSZoning" tells us 0.27 about "Alley", but Alley tells us less (0.11) about MSZoning, and this is based on occurrences in the categorical data.
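To make the direction-dependence concrete, here is a minimal sketch of Theil's U in plain Python. It assumes the coefficient is the standard entropy-based definition described in the linked article; the function names and the toy `neighborhood`/`zone` lists are made up for illustration and are not the library's implementation or real Housing data.

```python
from collections import Counter
import math


def entropy(values):
    """Shannon entropy H(values) in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(x | y) in nats."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) * log(p(x, y) / p(y)), with p(x, y) / p(y) = count(x, y) / count(y)
    return -sum(
        (c / n) * math.log(c / y_counts[y_val])
        for (_, y_val), c in xy_counts.items()
    )


def theils_u(x, y):
    """Fraction of the uncertainty in x that is removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy data: each neighborhood belongs to exactly one zone,
# but each zone contains two neighborhoods.
neighborhood = ["A", "B", "C", "D"]
zone = ["res", "res", "com", "com"]

u_zone_given_neighborhood = theils_u(zone, neighborhood)
u_neighborhood_given_zone = theils_u(neighborhood, zone)
print(u_zone_given_neighborhood, u_neighborhood_given_zone)
```

In this toy case the neighborhood fully determines the zone (score of 1), while the zone only halves the uncertainty about the neighborhood (score of 0.5), which is the same kind of asymmetry as in the report.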

It's actually pretty awesome. A lot of the time, it is very symmetrical, but sometimes it is not and that can be interesting!

For example, in the Titanic dataset, Pclass provides information on Survived at 0.09, but Survived provides information on Pclass at only 0.06. In other words, knowing the passenger class tells us more about whether a passenger survived than knowing whether they survived tells us about their class. Some of this can be related to the number of categories (e.g. there are 3 values of Pclass but only 2 (yes/no) of Survived).

I hope this helps! If there are any issues with this calculation, definitely let me know as I do want these things to be helpful! :)

fbdesignpro closed this as completed Dec 22, 2020

fbdesignpro added the question Further information is requested label Feb 15, 2021
