# Three correlation coefficients

In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns 

import matplotlib.pyplot as plt 
%matplotlib inline
%config InlineBackend.figure_format='retina'

from utils.viz import viz 
viz.get_style()

In [51]:
# data 

x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([0.3, 0.9, 2.7, 2, 3.5, 5])

## 1. Pearson correlation coefficient

Measures the correlation between two Gaussian variables.

That is, the linearity assumption must hold for the coefficient to be a reasonable measure.
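
For reference, the quantity computed in the next cell can be written as follows. The symbols $\bar{x}$, $\bar{y}$ (sample means) and $\sigma_x$, $\sigma_y$ (population standard deviations, dividing by $n$) are notation introduced here and do not appear in the original notebook.

$$
r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \, \sigma_y},
\qquad
\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}
$$

Dividing both the covariance and the standard deviations by $n - 1$ instead of $n$ gives the same value, because the factors cancel.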

In [52]:
def corr(x, y, method='pearson'):
    x, y, n = np.array(x), np.array(y), len(x)
    if method == 'pearson':
        # population covariance (divide by n), matching np.std's default ddof=0
        cov = ((x - x.mean()) * (y - y.mean()) / n).sum()
        return cov / (x.std() * y.std())

In [53]:
x.corr(y, method='pearson'),  corr(x, y, method='pearson')

(0.9481366640102854, 0.9481366640102854)

### Any problem?
The Pearson correlation is derived from the Gaussian PDF. The formula does not apply when the relationship between the two continuous variables is not linear.
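
A small sketch of this limitation, using made-up data rather than the series above: `y_demo` is a deterministic, strictly increasing function of `x_demo`, yet the Pearson coefficient stays well below 1 because the relationship is not linear. The names `x_demo`, `y_demo` and `r_demo` are illustrative only.

```python
import numpy as np

# Hypothetical data: y grows exponentially with x, so the relationship
# is perfectly monotonic but not linear.
x_demo = np.linspace(0, 5, 50)
y_demo = np.exp(x_demo)

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is Pearson's r.
r_demo = np.corrcoef(x_demo, y_demo)[0, 1]
print(r_demo)  # noticeably below 1 despite the deterministic relationship
```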

To derive a correlation measure for non-Gaussian variables, we would first need to find their PDF. This is not easy either, because not every continuous distribution has a closed-form PDF. One idea is to approximate the continuous distribution by sampling: each data point is treated as a sample, and we discretize the sample space to obtain a discrete distribution.

In [65]:
bins = np.linspace(0, 7, 100)
x_digit = bins[np.digitize(x, bins=bins)]
y_digit = bins[np.digitize(y, bins=bins)]
corr(x, y, method='pearson'), corr(x_digit, y_digit, method='pearson')

(0.9481366640102854, 0.9475731981236369)
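
To make the discretization step easier to follow, here is a minimal sketch of what `np.digitize` returns on a coarse, made-up grid (`edges` and `values` are illustrative names, not part of the notebook). With the default `right=False`, each value is mapped to the index `i` for which `edges[i-1] <= value < edges[i]`, so indexing the grid with that result snaps the value to the upper edge of its bin.

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical coarse grid
values = np.array([0.5, 1.0, 2.7])

# With right=False (the default), digitize returns i such that
# edges[i-1] <= value < edges[i].
idx = np.digitize(values, bins=edges)

# Indexing the grid with idx replaces each value by the upper edge of its bin.
snapped = edges[idx]
print(idx, snapped)
```

Note that a value at or above the last edge would produce an index equal to `len(edges)`, which is out of range for this indexing trick. The grid must therefore extend beyond the largest data point, as `np.linspace(0, 7, 100)` does for the data here.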

## 2. Spearman correlation coefficient

The Spearman correlation also applies the sampling idea, except that the method discretizes the space using ranks.
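
When no two observations share the same value (no ties), the rank-based coefficient has a well-known shortcut form. The symbols $R(\cdot)$ for the rank and $d_i$ for the rank difference are notation introduced here.

$$
\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},
\qquad
d_i = R(x_i) - R(y_i)
$$

With ties, this shortcut is no longer exact, and the coefficient should be computed as the Pearson correlation of the (average) ranks.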


In [46]:
def corr(x, y, method='pearson'):

    if method == 'pearson':
        x, y, n = np.array(x), np.array(y), len(x)
        # population covariance (divide by n), matching np.std's default ddof=0
        cov = ((x - x.mean()) * (y - y.mean()) / n).sum()
        return cov / (x.std() * y.std())
    if method == 'spearman':
        n = len(x)
        # rank each variable (1 = smallest) without touching the caller's index
        rx = pd.Series(np.asarray(x)).rank()
        ry = pd.Series(np.asarray(y)).rank()
        # squared rank differences; the shortcut formula assumes no ties
        sumd2 = ((rx - ry)**2).sum()
        return 1 - 6*sumd2 / (n*(n**2 - 1))
    

In [47]:
x.corr(y, method='spearman'), corr(x, y, method='spearman')

(0.942857142857143, 0.9428571428571428)
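
As a cross-check on the rank idea, the sketch below computes the Spearman coefficient as the Pearson correlation of the ranks, using the same `x` and `y` as above. The names `rho_from_ranks` and `rho_pandas` are illustrative only.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([0.3, 0.9, 2.7, 2, 3.5, 5])

# Replace each value by its rank (1 = smallest), then apply Pearson.
rho_from_ranks = x.rank().corr(y.rank(), method='pearson')

# pandas' built-in Spearman coefficient, for comparison.
rho_pandas = x.corr(y, method='spearman')
print(rho_from_ranks, rho_pandas)
```

The two values should agree up to floating-point error. Unlike the shortcut formula, this form also handles ties, because `rank()` assigns average ranks by default.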