# Covariance, Correlation
__Task__ : Given a dataframe, create functions that calculate covariance and correlation without using `numpy.cov`

## Set up a dataframe

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', None)

x = np.random.normal(3, 1, 100)
y = np.random.normal(50, 10, 100)

df = pd.DataFrame({'column1' : x, 'column2' : y})

df.head()

|   | column1  | column2   |
|---|----------|-----------|
| 0 | 3.489598 | 58.787053 |
| 1 | 3.865043 | 46.028437 |
| 2 | 2.339022 | 52.222332 |
| 3 | 2.43044  | 27.490359 |
| 4 | 3.769574 | 55.098161 |


## Covariance
There are two formulas for covariance, depending on whether the data represent a sample or the entire population.
<br>
<br>
<br>
Population Covariance = $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
<br>
Sample Covariance = $\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
<br>
<br>
<br>
I'll use the sample covariance formula in this task, since in practice the data are more often a sample than the entire population.

In [2]:
def cov(df, x, y):
    '''
    Given a dataframe, calculate the covariance between 2 columns
    
    Parameters :
    - df (Dataframe): Dataframe containing the data
    - x (str) : Name of the first column
    - y (str) : Name of the second column
    
    Returns :
    - float : Covariance of the 2 columns
    '''
    x_mean = df[x].mean()
    y_mean = df[y].mean()
    
    # Sample covariance: sum of the products of deviations, divided by (n - 1)
    return ((df[x] - x_mean) * (df[y] - y_mean)).sum() / (len(df) - 1)
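
To see how the two denominators differ in practice, here is a minimal sketch on a four-point toy dataset. The helper name `covariance`, its `ddof` parameter, and the toy values are assumptions made for this illustration and are not part of the task's code.

```python
import numpy as np

def covariance(x, y, ddof=1):
    """Covariance of two equal-length sequences (hypothetical helper).

    ddof=0 divides by n (population), ddof=1 divides by n - 1 (sample).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    # Sum of the products of deviations from each mean
    total = np.sum((x - x.mean()) * (y - y.mean()))
    return float(total / (x.size - ddof))

# Toy data, chosen only so the arithmetic is easy to follow
x_small = [1.0, 2.0, 3.0, 4.0]
y_small = [2.0, 4.0, 6.0, 8.0]

pop_cov = covariance(x_small, y_small, ddof=0)     # 10 / 4
sample_cov = covariance(x_small, y_small, ddof=1)  # 10 / 3

print(f'Population covariance : {pop_cov:.4f}')
print(f'Sample covariance : {sample_cov:.4f}')
```

The sum of products is the same in both cases (10 here); only the divisor changes, so the gap between the two values shrinks as `n` grows.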

## Correlation

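Correlation = $\frac{{\text{cov}(X, Y)}}{{\sigma_X \cdot \sigma_Y}}$, where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, computed with the same $n-1$ denominator as the covariance (the default for pandas `std()`).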

In [3]:
def corr(df, x, y) :
    '''
    Given a dataframe, calculate the correlation between 2 columns
    
    Parameters :
    - df (Dataframe): Dataframe containing the data
    - x (str) : Name of the first column
    - y (str) : Name of the second column
    
    Returns :
    - float : Correlation of the 2 columns
    '''
    return cov(df, x, y)/(df[x].std()*df[y].std())
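
One point about the formula above is worth checking: the correlation does not depend on whether the sample or population denominator is used, as long as the covariance and both standard deviations use the same one, because the factor cancels. The sketch below illustrates this; the helper name `pearson`, its `ddof` parameter, and the toy values are assumptions made for this example.

```python
import numpy as np

def pearson(x, y, ddof=1):
    """Pearson correlation with an explicit denominator choice (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = x.size - ddof
    cov_xy = np.sum(dx * dy) / denom
    std_x = np.sqrt(np.sum(dx ** 2) / denom)
    std_y = np.sqrt(np.sum(dy ** 2) / denom)
    # The 1/denom factors cancel between numerator and denominator
    return float(cov_xy / (std_x * std_y))

# Toy data, chosen only for illustration
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.0, 1.0, 4.0, 3.0, 5.0]

r_sample = pearson(x_toy, y_toy, ddof=1)
r_population = pearson(x_toy, y_toy, ddof=0)

print(f'Correlation (n - 1) : {r_sample:.4f}')
print(f'Correlation (n) : {r_population:.4f}')
```

Mixing the two conventions (for example, a sample covariance with population standard deviations) would break this cancellation and scale the result by `n / (n - 1)`.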

## Call the functions

In [4]:
cov_computed = cov(df, "column1", "column2")
corr_computed = corr(df, "column1", "column2")


print(f'Covariance : {cov_computed:.2f}')
print(f'Correlation : {corr_computed:.2f}')

Covariance : -1.04
Correlation : -0.09


## Sanity Check

In [5]:
np_cov = np.cov(df['column1'], df['column2'])[0, 1]
pd_corr = df['column1'].corr(df['column2'])

cov_diff = cov_computed - np_cov
corr_diff = corr_computed - pd_corr

print(f'Difference between covariance computed by custom function and numpy.cov : {cov_diff:.5f}')
print(f'Difference between correlation computed by custom function and pd.corr : {corr_diff:.5f}')

Difference between covariance computed by custom function and numpy.cov : 0.00000
Difference between correlation computed by custom function and pd.corr : 0.00000


Since both differences are zero to five decimal places, the custom functions agree with NumPy and pandas on this data and are safe to use.
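
Printing a rounded difference works, but a tolerance-based comparison makes the check explicit and less sensitive to floating-point noise. The following is a minimal sketch of that idea using `np.isclose` with its default tolerances; the dataframe `demo`, its column names, and the seed are assumptions made so the example is self-contained and reproducible.

```python
import numpy as np
import pandas as pd

# Seeded generator so the check is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'a': rng.normal(3, 1, 100),
    'b': rng.normal(50, 10, 100),
})

# Manual sample covariance and correlation, written out inline
dev_a = demo['a'] - demo['a'].mean()
dev_b = demo['b'] - demo['b'].mean()
manual_cov = (dev_a * dev_b).sum() / (len(demo) - 1)
manual_corr = manual_cov / (demo['a'].std() * demo['b'].std())

# Compare against the library results within floating-point tolerance
cov_ok = bool(np.isclose(manual_cov, np.cov(demo['a'], demo['b'])[0, 1]))
corr_ok = bool(np.isclose(manual_corr, demo['a'].corr(demo['b'])))

print(f'Covariance matches numpy.cov : {cov_ok}')
print(f'Correlation matches Series.corr : {corr_ok}')
```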