<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#correlation-between-categorical-variables" data-toc-modified-id="correlation-between-categorical-variables-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>correlation between categorical variables</a></span></li><li><span><a href="#Cramer's-V-(cat-vs-cat,-symmetric)" data-toc-modified-id="Cramer's-V-(cat-vs-cat,-symmetric)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Cramer's V (cat vs cat, symmetric)</a></span></li><li><span><a href="#Thiels-U-(cat-vs-cat,-assymetric)" data-toc-modified-id="Thiels-U-(cat-vs-cat,-assymetric)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Thiels U (cat vs cat, assymetric)</a></span></li><li><span><a href="#Correlation-Ratio-(numeric-vs-categorical)" data-toc-modified-id="Correlation-Ratio-(numeric-vs-categorical)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Correlation Ratio (numeric vs categorical)</a></span></li></ul></div>

# Imports

In [1]:
import numpy as np
import pandas as pd
import os,sys,time

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import scipy.stats as stats

# settings
SEED = 100
pd.set_option('display.max_columns', 100)  # full option key; the short form can be ambiguous in newer pandas
pd.set_option('plotting.backend','matplotlib') # matplotlib, bokeh, altair, plotly

%matplotlib inline
%load_ext watermark
%watermark -iv

numpy   1.19.5
pandas  1.1.4
seaborn 0.11.0



# correlation between categorical variables

In [2]:
df = pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})
df

| | a | c | d |
|---|---|---|---|
| 0 | a | a | a |
| 1 | b | b | b |
| 2 | c | c | c |


In [4]:
df1 = df.apply(lambda x : pd.factorize(x)[0])
df1

| | a | c | d |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |


In [5]:
df1.corr(method='pearson', min_periods=1)

| | a | c | d |
|---|---|---|---|
| a | 1.0 | 1.0 | 1.0 |
| c | 1.0 | 1.0 | 1.0 |
| d | 1.0 | 1.0 | 1.0 |


In [6]:
df1 = df.apply(lambda x : pd.factorize(x)[0])+1
df1

| | a | c | d |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 |
| 2 | 3 | 3 | 3 |


In [7]:
pd.DataFrame([stats.chisquare(df1[x].values,f_exp=df1.values.T,axis=1)[0] for x in df1])

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 |


# Cramer's V (cat vs cat, symmetric)

- https://en.wikipedia.org/wiki/Cramér%27s_V
- https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
- https://stackoverflow.com/questions/20892799/using-pandas-calculate-cramérs-coefficient-matrix
- https://github.com/shakedzy/dython

![](images/cramers_symmetry.png)

In [12]:
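import scipy.stats as ss  # the notebook otherwise imports `ss` only in a later cell

def cramers_v(x, y):
    """Bias-corrected Cramer's V between two categorical Series."""
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias correction: adjust phi^2 and the table dimensions for the sample size n.
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    # Note: the denominator is 0 (result is nan) when a corrected dimension equals 1.
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))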

In [13]:
import scipy.stats as ss
import pandas as pd
import numpy as np

def cramers_corrected_stat(x, y):
    """Calculate Cramer's V statistic for categorical-categorical association.

    Uses the bias correction from Wicher Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    Returns -1 if either variable is constant.
    """
    result = -1
    if len(x.value_counts()) == 1:
        print("First variable is constant")
    elif len(y.value_counts()) == 1:
        print("Second variable is constant")
    else:
        conf_matrix = pd.crosstab(x, y)

        # chi2_contingency applies Yates' correction only when the table has
        # 1 degree of freedom (2x2), so this flag switches it off for exactly
        # those tables and has no effect on larger ones.
        correct = conf_matrix.shape[0] != 2

        chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]

        n = sum(conf_matrix.sum())
        phi2 = chi2 / n
        r, k = conf_matrix.shape
        # Bias-corrected phi^2 and table dimensions.
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
        rcorr = r - ((r-1)**2)/(n-1)
        kcorr = k - ((k-1)**2)/(n-1)
        result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
    return round(result, 6)
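
As a quick sanity check of the corrected statistic, here is a minimal, self-contained sketch. The function name `cramers_v_corrected`, the synthetic data, the use of `correction=False`, and the choice to return `nan` when the corrected denominator is zero are all assumptions made for illustration; they are not part of the cells above.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v_corrected(x, y):
    """Bias-corrected Cramer's V; nan when the corrected denominator is zero."""
    table = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    denom = min(kcorr - 1, rcorr - 1)
    return float(np.sqrt(phi2corr / denom)) if denom > 0 else float("nan")


# Synthetic data (assumption): 500 draws from three categories.
rng = np.random.default_rng(100)
x = pd.Series(rng.choice(list("abc"), size=500))
y_dependent = x.map({"a": "u", "b": "v", "c": "w"})           # fully determined by x
y_independent = pd.Series(rng.choice(list("uvw"), size=500))  # drawn independently of x

v_dep = cramers_v_corrected(x, y_dependent)
v_ind = cramers_v_corrected(x, y_independent)
v_sym = cramers_v_corrected(y_independent, x)  # arguments swapped
print(v_dep, v_ind, v_sym)
```

With a one-to-one mapping the statistic should come out at 1, for independent draws it should be close to 0, and swapping the arguments should not change the value. On the 3-row `df` from the top of the notebook the corrected dimensions equal 1, so the denominator is zero; this sketch returns `nan` in that case rather than dividing.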

# Thiels U (cat vs cat, assymetric)

[Theil’s U](https://en.wikipedia.org/wiki/Uncertainty_coefficient), also referred to as the Uncertainty Coefficient, is based on the conditional entropy between x and y. In plain terms: given the value of x, how many possible states does y have, and how often do they occur?


Just like Cramer’s V, the output value lies in the range [0, 1], where 0 means no association and 1 means full association. Unlike Cramer’s V, it is asymmetric: U(x,y)≠U(y,x), whereas V(x,y)=V(y,x), where V is Cramer’s V.

Applying Theil’s U to the simple case above shows that knowing y means we know x, but not vice versa.
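
For reference, the uncertainty coefficient of x given y is commonly written as below. The notation is an assumption chosen to match the variable names in the code (`s_x` for H(x), `s_xy` for H(x|y)), with probabilities estimated from observed frequencies:

$$
U(x \mid y) = \frac{H(x) - H(x \mid y)}{H(x)}, \qquad
H(x) = -\sum_{i} p(x_i)\,\log p(x_i), \qquad
H(x \mid y) = -\sum_{i,j} p(x_i, y_j)\,\log \frac{p(x_i, y_j)}{p(y_j)}
$$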


Implementing the formula as a Python function yields the following. The `conditional_entropy` helper it calls is not defined in this notebook; the full code is on the original author's GitHub page (see the dython link in the Cramer's V section):

In [11]:
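from collections import Counter
import scipy.stats as ss

def theils_u(x, y):
    """Theil's U of x given y: the fraction of x's entropy explained by y."""
    s_xy = conditional_entropy(x, y)  # H(x|y); helper is not defined in this notebook
    x_counter = Counter(x)
    total_occurrences = sum(x_counter.values())
    p_x = [n / total_occurrences for n in x_counter.values()]
    s_x = ss.entropy(p_x)  # H(x), natural log
    if s_x == 0:
        # x is constant, so it is trivially fully determined
        return 1
    return (s_x - s_xy) / s_x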
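
Since `conditional_entropy` is not shown in this notebook, here is a minimal sketch of what such a helper might look like, together with a toy example of the asymmetry. The helper body and the sample lists are assumptions written for illustration (natural log, frequencies as probability estimates); they are not copied from the original author's code.

```python
import math
from collections import Counter

import scipy.stats as ss


def conditional_entropy(x, y):
    """H(x | y) in nats, estimated from paired observations (illustrative sketch)."""
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    n = sum(y_counts.values())
    entropy = 0.0
    for (_, y_val), count in xy_counts.items():
        p_xy = count / n
        p_y = y_counts[y_val] / n
        entropy += p_xy * math.log(p_y / p_xy)
    return entropy


def theils_u(x, y):
    """U(x | y): fraction of the entropy of x explained by y."""
    s_xy = conditional_entropy(x, y)
    x_counts = Counter(x)
    total = sum(x_counts.values())
    s_x = ss.entropy([c / total for c in x_counts.values()])
    return 1.0 if s_x == 0 else (s_x - s_xy) / s_x


# Toy data (assumption): each y value maps to exactly one x value,
# but each x value maps to two different y values.
x = ["a", "a", "a", "a", "b", "b"]
y = ["p", "p", "q", "q", "r", "s"]

u_x_given_y = theils_u(x, y)  # knowing y pins down x
u_y_given_x = theils_u(y, x)  # knowing x only narrows down y
print(u_x_given_y, u_y_given_x)
```

Here `theils_u(x, y)` should come out at 1 because y fully determines x, while `theils_u(y, x)` lands strictly between 0 and 1, which is the asymmetry that Cramer's V cannot express.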

# Correlation Ratio (numeric vs categorical)

So now we have a way to measure the correlation between two continuous features, and two ways of measuring association between two categorical features. But what about a pair of a continuous feature and a categorical feature? For this, we can use the [Correlation Ratio](https://en.wikipedia.org/wiki/Correlation_ratio) (often denoted by the Greek letter eta, η). Mathematically, its square is defined as the weighted variance of the category means divided by the variance of all samples. In plain terms, the Correlation Ratio answers the following question: given a continuous number, how well can you tell which category it belongs to? Just like the two coefficients we’ve seen before, the output lies in the range [0, 1].
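
The definition above can be sketched in a few lines of pandas. This is a minimal illustration, not a reference implementation: the function name `correlation_ratio`, its argument order, and the toy data are assumptions, and returning 0 when the numeric variable has no variance is a convention chosen here.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric variable."""
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    categories = pd.Series(categories).reset_index(drop=True)
    grand_mean = values.mean()
    groups = values.groupby(categories)
    # Between-group sum of squares: group sizes weight the squared mean offsets.
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the grand mean.
    ss_total = ((values - grand_mean) ** 2).sum()
    return 0.0 if ss_total == 0 else float(np.sqrt(ss_between / ss_total))


cats = ["a", "a", "b", "b", "c", "c"]
separated = [1, 1, 5, 5, 9, 9]  # each category has its own value
mixed = [1, 9, 1, 9, 1, 9]      # every category has the same mean

eta_separated = correlation_ratio(cats, separated)
eta_mixed = correlation_ratio(cats, mixed)
print(eta_separated, eta_mixed)
```

When the categories fully separate the values, all variance is between groups and eta is 1; when every category has the same mean, the between-group variance vanishes and eta is 0.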