In the categorical variable's report for the Kaggle Housing Price dataset, we can see that MSZoning provides information on Alley, Neighborhood, and many more features.
We can also see that Alley, Neighborhood, and many more features provide information on MSZoning.
The uncertainty coefficient of MSZoning on Alley is 0.27 and the uncertainty coefficient of Alley on MSZoning is 0.11.
The uncertainty coefficient of MSZoning on Neighborhood is 0.16 and the uncertainty coefficient of Neighborhood on MSZoning is 0.67.
Why aren't these numbers the same in both directions? If MSZoning affects Neighborhood, the effect should be the same as that of Neighborhood on MSZoning (if we are talking about correlation).
How are these numbers calculated?
Thank you,
The fact that they are not the same is actually a big feature of the uncertainty coefficient. Because these features are categorical (not numerical), this is not a "correlation" in the usual sense. The uncertainty coefficient (with some caveats, as a certain underlying distribution is assumed for the calculations) is designed for such categorical features, and it is expected to be asymmetric, as in the numbers you give. I took this from the article: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
i.e. given the numbers above, knowing "MSZoning" tells us 0.27 about "Alley", but Alley tells us less (0.11) about MSZoning, and this is based on occurrences in the categorical data.
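To make the asymmetry concrete, here is the standard textbook definition of the uncertainty coefficient (Theil's U). This is a sketch of the general formula, and I am assuming the report follows it; the symbols X, Y, H and I are notation introduced here, not taken from the report itself.

```latex
U(X \mid Y) = \frac{H(X) - H(X \mid Y)}{H(X)} = \frac{I(X;Y)}{H(X)},
\qquad
U(Y \mid X) = \frac{I(X;Y)}{H(Y)}

H(X) = -\sum_{x} p(x) \log p(x),
\qquad
H(X \mid Y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}
```

The numerator (the mutual information I(X;Y)) is the same in both directions, but it is divided by H(X) in one case and H(Y) in the other. Whenever the two features have different entropies, the two values differ.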
It's actually pretty awesome. A lot of the time, it is very symmetrical, but sometimes it is not and that can be interesting!
For example, in the Titanic dataset, Pclass provides information on Survived at 0.09, but Survived provides information on Pclass at only 0.06. In other words, knowing the passenger class tells us more about whether a passenger survived than knowing whether they survived tells us about their class. Some of this can be related to the number of categories (e.g. there are 3 values of Pclass but only 2 (yes/no) of Survived).
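To answer "how are these numbers calculated?" more directly, here is a minimal sketch of the uncertainty coefficient in plain Python, assuming the standard entropy-based definition. The function names and the toy `street`/`zone` data are made up for illustration; this is not the library's actual code, and the data is not from either dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y): uncertainty left in x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) = c / n and p(x, y) / p(y) = c / y_counts[y_value]
    return -sum(
        (c / n) * math.log(c / y_counts[y_value])
        for (_, y_value), c in xy_counts.items()
    )


def uncertainty_coefficient(x, y):
    """U(X | Y): fraction of the uncertainty in x that knowing y removes."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy data: each street lies in exactly one zone,
# but each zone contains two streets.
street = ["a", "b", "c", "d"]
zone = ["p", "p", "q", "q"]

print(f"{uncertainty_coefficient(zone, street):.2f}")  # street -> zone: 1.00
print(f"{uncertainty_coefficient(street, zone):.2f}")  # zone -> street: 0.50
```

Knowing the street pins down the zone completely, while knowing the zone only halves the uncertainty about the street. That is the same kind of asymmetry as in the numbers above, driven here purely by the different number of categories.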
I hope this helps! If there are any issues with this calculation, definitely let me know as I do want these things to be helpful! :)