## Tutorial 9. Pearson correlation and chi-square test


Created by Emanuel Flores-Bautista 2019.  All content contained in this notebook is licensed under a [Creative Commons License 4.0](https://creativecommons.org/licenses/by/4.0/). The code is licensed under a [MIT license](https://opensource.org/licenses/MIT).

In [None]:
import numpy as np
import numba
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import TCD19_utils as TCD

TCD.set_plotting_style_2()

#Magic command to enable plotting inside notebook
%matplotlib inline

#Magic command to enable svg format in plots
%config InlineBackend.figure_format = 'svg'

## Pearson correlation

As we saw in the presentation, the Pearson correlation coefficient measures the strength of the linear *co-relation* between two variables. It is the covariance of the two variables divided by the product of their standard deviations, as in the formula below. We will go back to the CONAPO data to test whether the correlation between education and income is statistically significant, using both the usual and the hacker methods.

\begin{align}
\text{Pearson's } r = \frac{\mathrm{cov}(X, Y)}{\sigma_{x} \, \sigma_{y}} = \frac{ \sum_{i = 1}^{n}(X_{i}- \overline{X})(Y_{i}- \overline{Y} )}{ n \, \sigma_{x} \, \sigma_{y}}
\end{align}
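
As a quick sanity check of the formula, the sketch below computes $r$ directly from the definition and compares it with NumPy's built-in `np.corrcoef`. The data are synthetic and purely illustrative (they are not the CONAPO values), and the names `x`, `y`, `r_manual` and `r_numpy` are assumptions made for this example.

```python
import numpy as np

# Synthetic, illustrative data (not the CONAPO values).
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# Covariance and standard deviations, all with the 1/n convention.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (np.std(x) * np.std(y))

# NumPy's correlation matrix; the off-diagonal entry is Pearson's r.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The two numbers should agree up to floating-point error, because the normalization constants cancel in the ratio.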


In [None]:
df = pd.read_csv('../data/data_CONAPO_municipal_90-15.csv', encoding = "ISO-8859-1")

In [None]:
df = df.rename(columns = {'SPRIM': '% sin primaria', 
                       'OVSD': '% sin drenaje', 
                       'ANALF': '% analfabeta', 
                       'OVSEE': '% sin energía eléctrica', 
                       'OVPT': '% con piso de tierra', 
                       'GM': 'Grado de marginación', 
                       'PO2SM': '% con ingresos de menos de 2 salarios mín.',
                       'OVSAE': '% sin agua entubada',
                        'IM': 'índice de marginación'})

In [None]:
df = df.apply(pd.to_numeric, errors='coerce')

In [None]:
df.head(3)

In [None]:
df_pearson = df[['% con ingresos de menos de 2 salarios mín.', '% sin primaria']].dropna()

In [None]:
df_pearson.head(3)

In [None]:
df_pearson = df_pearson.apply(pd.to_numeric, errors='coerce')

In [None]:
df_pearson.corr()

In [None]:
sns.heatmap(df_pearson.corr(), cmap = 'magma_r', vmin = 0.5, vmax = 1)

In [None]:
df_pearson['margi_edu'] = (df_pearson['% sin primaria'] >
                    df_pearson['% sin primaria'].median())

df_pearson['margi_econ'] = (df_pearson['% con ingresos de menos de 2 salarios mín.'] >
                      df_pearson['% con ingresos de menos de 2 salarios mín.'].median())

In [None]:
contingency = pd.crosstab(df_pearson['margi_edu'], df_pearson['margi_econ'])

contingency

In [None]:
p_val = stats.chi2_contingency(contingency)[1]

p_val
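
To make the chi-square test less of a black box, here is a minimal sketch of how the statistic can be computed by hand from a contingency table. The 2x2 table of counts is hypothetical (it is not the CONAPO result). Note that `stats.chi2_contingency` applies the Yates continuity correction to 2x2 tables by default, so `correction=False` is passed here to match the plain formula.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2x2 table of counts (not the CONAPO result).
observed = np.array([[90, 30],
                     [30, 90]])

# Expected counts under independence:
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic, without continuity correction.
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# SciPy's version; correction=False disables the Yates correction.
chi2_scipy, p_value, dof, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

print(chi2_manual, chi2_scipy, p_value, dof)
```

A small p-value here means the counts deviate from what independence between the row and column variables would predict.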

In [None]:
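# Take both variables from df_pearson, where rows with a missing value in
# either column were already dropped, so the two arrays stay paired.
edu = df_pearson['% sin primaria'].values

econ = df_pearson['% con ingresos de menos de 2 salarios mín.'].values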

In [None]:
stats.pearsonr(edu, econ)

## Bootstrap test for Pearson correlation

In [None]:
@numba.jit(nopython=True)
def draw_bs_sample(data):
    """
    Draw a bootstrap sample from a 1D data set.
    """
    return np.random.choice(data, size=len(data))

@numba.jit(nopython=True)
def draw_bs_pairs(x, y):
    """
    Draw a pairs bootstrap sample.
    """
    inds = np.arange(len(x))
    bs_inds = draw_bs_sample(inds)
    return x[bs_inds], y[bs_inds]


@numba.jit(nopython=True)
def pearson_r(x, y):
    """
    Compute Pearson correlation coefficient.
    """
    return np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.std(x) / np.std(y) \
                / np.sqrt(len(x)) / np.sqrt(len(y))

In [None]:
@numba.jit(nopython=True)
def draw_bs_pairs_reps_pearson(x, y, size=10000):
    """
    Draw bootstrap pairs replicates.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = pearson_r(*draw_bs_pairs(x, y))
    return out

In [None]:
# Get reps
bs_reps_pearson = draw_bs_pairs_reps_pearson(edu, econ)

# Get the confidence intervals
conf_int_edu_econ = np.percentile(bs_reps_pearson, [2.5, 97.5])

conf_int_edu_econ

We can see that there is a high correlation between education and income, and that the bootstrap replicates of the Pearson correlation coefficient form a quite narrow distribution.

### Pearson r p-value

* $H_{0}$ : Education and income are independent variables. 
* $H_{1}$ : There is a linear relationship between education and income. 
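
One way to simulate $H_{0}$ is to shuffle one of the two variables, which breaks the pairing while keeping each variable's own distribution. The sketch below does this on synthetic data with plain NumPy. The function name `perm_test_pearson`, its parameters and the data are assumptions made for illustration, not part of the tutorial's utilities.

```python
import numpy as np

def perm_test_pearson(x, y, n_perm=2000, seed=0):
    """
    One-sided permutation test for a positive Pearson correlation.
    Returns the observed r and the permutation p-value.
    """
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]

    perm_r = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle y only: the pairing with x is destroyed,
        # which is what independence would look like.
        perm_r[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

    # Fraction of replicates at least as extreme as the observed r.
    p_value = np.mean(perm_r >= r_obs)
    return r_obs, p_value

# Synthetic, illustrative data with a built-in linear relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.7 * x + rng.normal(scale=0.5, size=100)

r_obs, p_value = perm_test_pearson(x, y)
print(r_obs, p_value)
```

A p-value of exactly 0 only means that no replicate reached the observed value; it is better reported as smaller than `1 / n_perm`.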

In [None]:
@numba.jit(nopython=True)
def draw_perm_sample(x, y):
    """
    Generate a permutation sample for a correlation test.

    Only y is shuffled, which breaks the pairing between x and y while
    leaving each variable's own distribution unchanged.
    """
    return x, np.random.permutation(y)

def draw_perm_reps(x, y, stat_fun, size=10000):
    """
    Generate array of permutation replicates.
    """
    return np.array([stat_fun(*draw_perm_sample(x, y)) for _ in range(size)])

In [None]:
pearson_edu_econ = pearson_r(edu, econ)

# Get permutation replicates
perm_pearson_edu_econ = draw_perm_reps(edu, econ, pearson_r, size=1000000)

In [None]:
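# One-sided p-value: fraction of permutation replicates at least as large
# as the observed Pearson r
p_val_pearson = np.sum(perm_pearson_edu_econ >= pearson_edu_econ) / len(perm_pearson_edu_econ)

print('Permutation p-value for Pearson coefficient:', p_val_pearson)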

## Follow-up

What other variables do you think are correlated? You can guide your hypotheses by using a Seaborn `PairGrid`.