# Calculate correlation

In this notebook, we look at different ways to calculate the correlation between variables from two datasets: school and auto-mpg.

Let's first import the pandas library and load the school dataset into a pandas dataframe.


In [1]:
import pandas as pd
df = pd.read_csv('./data/school.csv')
df.head()

Unnamed: 0,sex,age,height,weight
0,m,147,128.270257,35.833768
1,f,143,130.302261,22.906396
2,f,147,130.810262,29.029888
3,m,149,133.350267,36.740952
4,f,139,134.112268,28.803092


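The ```corr()``` method of a pandas dataframe outputs the correlation matrix of all the numerical columns of the dataframe. In pandas 2.0 and later, non-numerical columns such as *sex* are no longer dropped automatically, so pass ```numeric_only=True``` to get the same result.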


In [5]:
# numeric_only=True (pandas 1.5+) skips text columns such as sex
df.corr(numeric_only=True)

Unnamed: 0,age,height,weight
age,1.0,0.648857,0.634636
height,0.648857,1.0,0.774876
weight,0.634636,0.774876,1.0


The correlation of a variable with itself is 1, which is why the diagonal of the correlation matrix is filled with ones. The correlation matrix is also symmetric, because the correlation between x and y is the same as the correlation between y and x.
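Both properties can be checked programmatically. The following is a minimal sketch on synthetic data, not on the school dataset; the dataframe ```demo``` and its columns ```a```, ```b``` and ```c``` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical, not the school dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=50)})
demo["b"] = 2 * demo["a"] + rng.normal(size=50)  # related to a
demo["c"] = rng.normal(size=50)                  # unrelated noise

r_demo = demo.corr().to_numpy()

# The diagonal holds the correlation of each variable with itself
diag_is_one = np.allclose(np.diag(r_demo), 1.0)
# The matrix equals its own transpose
is_symmetric = np.allclose(r_demo, r_demo.T)

print(diag_is_one, is_symmetric)
```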


In fact, the ```corr()``` method returns another dataframe.


In [9]:
r = df.corr(numeric_only=True)
type(r)

pandas.core.frame.DataFrame

So, to get the correlation between 2 specific variables in a dataframe, we can simply **index** the correlation dataframe.

For instance, to get the correlation between the weight and the height variables in the school dataset:


In [8]:
df.corr(numeric_only=True)['weight']['height']

0.7748761066276011
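When only one pair of variables is needed, building the full matrix is optional: a pandas Series has its own ```corr()``` method. The sketch below compares the two routes on a small made-up dataframe (```demo``` and its values are hypothetical and are not taken from the school dataset).

```python
import pandas as pd

# Hypothetical mini-dataset, for illustration only
demo = pd.DataFrame({
    "height": [128.0, 130.0, 131.0, 133.0, 134.0, 140.0],
    "weight": [35.0, 23.0, 29.0, 37.0, 29.0, 41.0],
})

# Route 1: build the correlation matrix, then index it by row and column label
from_matrix = demo.corr().loc["height", "weight"]

# Route 2: correlate one column directly with another (Pearson by default)
direct = demo["height"].corr(demo["weight"])

print(from_matrix, direct)
```

Both routes should give the same value; the second avoids computing correlations that are not needed.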

The ```corr()``` method accepts a ```method``` parameter that you can use to specify the type of correlation you want to calculate.

**Pearson** is the default correlation.

You can also choose Spearman or Kendall by passing ```method='spearman'``` or ```method='kendall'```.


In [10]:
# to see the documentation for the corr() method, precede it with a ? (IPython/Jupyter syntax)
?df.corr
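To see what the ```method``` parameter changes, here is a minimal sketch on synthetic data where ```y``` grows exponentially with ```x```: the relationship is perfectly monotonic but far from linear. The dataframe ```demo``` is hypothetical, and the sketch covers only the Pearson and Spearman options.

```python
import numpy as np
import pandas as pd

# Synthetic data: y increases with x, but not along a straight line
x = np.arange(1, 11, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association
pearson = demo.corr(method="pearson").loc["x", "y"]
# Spearman measures monotonic association (it works on ranks)
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Because the ranks of ```x``` and ```y``` are identical here, the Spearman correlation is 1, while the Pearson correlation is noticeably lower.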

So let's now compare the Pearson and Spearman correlations over the auto-mpg dataset.

Load the *auto-mpg* dataset into a dataframe:


In [12]:
df = pd.read_csv('./data/auto-mpg.csv')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


Then calculate the Pearson correlation of the **outcome** variable *mpg* with the other numerical variables:

In [13]:
df.corr(method='pearson', numeric_only=True)['mpg']

mpg             1.000000
cylinders      -0.777618
displacement   -0.805127
horsepower     -0.778427
weight         -0.832244
acceleration    0.423329
year            0.580541
origin          0.565209
Name: mpg, dtype: float64

Now for the Spearman correlation, we set ```method='spearman'```:

In [15]:
df.corr(method='spearman', numeric_only=True)['mpg']

mpg             1.000000
cylinders      -0.823175
displacement   -0.855234
horsepower     -0.853616
weight         -0.875585
acceleration    0.441539
year            0.574841
origin          0.580482
Name: mpg, dtype: float64

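For this dataset, the Pearson and Spearman correlations are not very different: the signs agree for every variable, and the largest gap is about 0.075, for horsepower (-0.778 vs. -0.854).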
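One way to understand how the two coefficients relate: the Spearman correlation is the Pearson correlation computed on the ranks of the values instead of the values themselves. The sketch below checks this on synthetic data; ```demo``` and its columns are hypothetical, not the auto-mpg dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: a noisy, non-linear but increasing relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 5, size=100)
demo = pd.DataFrame({"x": x, "y": x**3 + rng.normal(scale=5, size=100)})

# Spearman computed directly by pandas
spearman = demo.corr(method="spearman")

# Pearson computed on the ranks of each column
pearson_on_ranks = demo.rank().corr(method="pearson")

same = np.allclose(spearman.to_numpy(), pearson_on_ranks.to_numpy())
print(same)
```

When the two coefficients are close, as they are for auto-mpg, replacing the values by their ranks changes little.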

It is also possible to calculate the Pearson correlation between 2 variables with the numpy library using the ```corrcoef()``` function.

For instance, to calculate the correlation between mpg and acceleration, we would write:

In [18]:
import numpy as np
np.corrcoef(df.mpg, df.acceleration)

array([[1.        , 0.42332854],
       [0.42332854, 1.        ]])
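Note that ```corrcoef()``` returns the full 2x2 correlation matrix, so the coefficient itself is one of the off-diagonal entries. The sketch below extracts it and compares it with the pandas result. It uses only the five rows displayed by ```head()``` above (stored in a dataframe called ```demo```, a name chosen here), so the value differs from the one computed on the whole dataset.

```python
import numpy as np
import pandas as pd

# The five auto-mpg rows shown by head() above, not the full dataset
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "acceleration": [12.0, 11.5, 11.0, 12.0, 10.5],
})

matrix = np.corrcoef(demo["mpg"], demo["acceleration"])

# Off-diagonal entry: correlation between the two variables
r_numpy = matrix[0, 1]

# Same coefficient from the pandas correlation matrix
r_pandas = demo.corr().loc["mpg", "acceleration"]

print(r_numpy, r_pandas)
```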