One way to quantify the relationship between two variables is the *Pearson correlation coefficient*, which measures the strength and direction of the linear association between them. It always takes a value between -1 and 1, where:

* -1 indicates a perfect negative linear correlation between two variables
* 0 indicates no linear correlation between two variables
* 1 indicates a perfect positive linear correlation between two variables
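The coefficient can also be computed directly from its definition: the sum of the products of the two variables' deviations from their means, divided by the square root of the product of their summed squared deviations. The following is a minimal sketch; `pearson_r` is a hypothetical helper name used only for illustration and is not part of NumPy.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from its definition (hypothetical helper, for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])  # points on an increasing line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # points on a decreasing line
print(r_pos, r_neg)
```

The two toy inputs lie exactly on a straight line, so they land on the two extremes of the range described above.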

In [3]:
import numpy as np

np.random.seed(100)

# create array of 50 random integers from 0 to 9 (the upper bound 10 is exclusive)
var1 = np.random.randint(0, 10, 50)

# create a positively correlated array with some random noise
var2 = var1 + np.random.normal(0, 10, 50)

In [4]:
var1

array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2, 2, 2, 1, 0, 8, 4, 0, 9, 6, 2, 4, 1,
       5, 3, 4, 4, 3, 7, 1, 1, 7, 7, 0, 2, 9, 9, 3, 2, 5, 8, 1, 0, 7, 6,
       2, 0, 8, 2, 5, 1])

In [5]:
var2

array([ 13.92575746,   8.37958129,   6.71967572,  10.59753947,
         5.59603535,  -4.43738512,   9.68770162,   7.38488768,
        18.15574179,  16.78978297,   4.21426255,  -9.17894159,
        -0.8499993 ,  -4.31187446,   9.42536252, -10.96393454,
         3.0687456 ,   8.5365277 ,  10.17144275,  -4.03770522,
        10.85967248,   9.59973786,   6.91555442,  14.78227928,
        10.00283569, -14.44659863,   3.4258809 ,   6.75915012,
         4.84057844,   0.50907502,  -9.01691177,   5.61237434,
         0.12086266, -10.98144167,   7.71113962,   6.17078367,
         1.20831746,   8.75751769,  28.80595342,  -3.46700861,
        -1.91143345,  -8.54225772,  15.63406468,   2.8361877 ,
        -7.02225772,  -7.38562393,   4.04661119,   2.84200853,
         3.14377547,  12.08439231])

In [7]:
# calculate the correlation between the two arrays
corr = np.corrcoef(var1, var2)
corr

array([[1.       , 0.3350184],
       [0.3350184, 1.       ]])
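`np.corrcoef` returns the full 2 x 2 correlation matrix rather than a single number: the diagonal holds each array's correlation with itself (always 1), and both off-diagonal entries hold the correlation between the two arrays. A short sketch of pulling out that single coefficient, using small made-up arrays `a` and `b` (illustrative values, not the notebook's data):

```python
import numpy as np

# small illustrative arrays (assumed values, not the data generated above)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

matrix = np.corrcoef(a, b)  # 2 x 2 symmetric matrix
r = matrix[0, 1]            # correlation between a and b (same as matrix[1, 0])
print(matrix.shape, round(r, 3))
```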

##### To test whether this correlation is statistically significant, we can use SciPy's pearsonr() function, which returns both the correlation coefficient and the corresponding two-sided p-value

In [8]:
from scipy.stats import pearsonr  # the scipy.stats.stats namespace is deprecated
pearson = pearsonr(var1, var2)

In [22]:
print("Pearson Correlation Coefficient: %.3f" % pearson[0])
print("P-value: %.3f" % pearson[1])

Pearson Correlation Coefficient: 0.335
P-value: 0.017
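A p-value of 0.017 is below the commonly used 0.05 threshold, so the correlation would usually be called statistically significant at that level. To show where such a p-value comes from, the sketch below recomputes it from the t-statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. This assumes the usual test of the null hypothesis of zero correlation, which should agree with what `pearsonr` reports by default. The data and seed here are arbitrary and will not reproduce the numbers above.

```python
import numpy as np
from scipy import stats

# illustrative data generated the same way as above, but with an arbitrary seed
rng = np.random.default_rng(0)
x = rng.integers(0, 10, 50)
y = x + rng.normal(0, 10, 50)

r, p = stats.pearsonr(x, y)

# manual two-sided p-value from the t-distribution with n - 2 degrees of freedom
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("pearsonr p-value: %.6f" % p)
print("manual p-value:   %.6f" % p_manual)
```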


##### With pandas DataFrame

In [23]:
import pandas as pd

data = pd.DataFrame(np.random.randint(0, 10, size = (5,3)), columns = ["A","B","C"])
data

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 4 | 8 |
| 1 | 8 | 2 | 2 |
| 2 | 7 | 2 | 1 |
| 3 | 2 | 7 | 1 |
| 4 | 0 | 5 | 3 |


In [24]:
# calculate correlation coefficients for all pairwise combinations
data.corr()

|   | A | B | C |
|---|---|---|---|
| A | 1.0 | -0.775567 | -0.493769 |
| B | -0.775567 | 1.0 | 0.0 |
| C | -0.493769 | 0.0 | 1.0 |


In [26]:
# correlation for specific variables
data["A"].corr(data["B"])

-0.7755667343294814
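The pairwise call and the full matrix give the same number, so a single coefficient can also be looked up by label in the result of `data.corr()`. A small sketch, with the DataFrame rebuilt explicitly from the values displayed above so that it does not depend on the random state:

```python
import pandas as pd

# same values as the DataFrame shown above, typed in explicitly
data = pd.DataFrame({
    "A": [1, 8, 7, 2, 0],
    "B": [4, 2, 2, 7, 5],
    "C": [8, 2, 1, 1, 3],
})

corr_matrix = data.corr()                # all pairwise Pearson correlations
r_ab_matrix = corr_matrix.loc["A", "B"]  # look up one pair by row and column label
r_ab_series = data["A"].corr(data["B"])  # same pair computed directly

print(r_ab_matrix, r_ab_series)
```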