<h1 style="text-align:center; font-size: 2.5em;">Eigenvalues / Eigenvectors, Covariance, Correlation</h1>

# Table of contents
1. [Introduction](#tit1)
2. [Eigenvalues and eigenvectors](#tit2)
3. [Covariance](#tit3)
4. [Correlation](#tit4)
5. [Application with the diabetes dataset](#tit5)
6. [Conclusion](#tit6)



# Introduction <a name="tit1"></a> 
The goal of this notebook is to understand the definitions of eigenvalues and eigenvectors, variance and covariance, and correlation.
<br> We will write functions to compute these quantities and compare them with the NumPy/pandas functions.

In [104]:
# Import numpy library
import numpy as np

# Eigenvalues and eigenvectors <a name="tit2"></a> 

Example of eigenvalues and eigenvectors: A is a square (nxn) matrix, here with n=2


In [23]:
A=np.array([[7,3],[3,-1]])

The identity matrix (nxn)

In [24]:
I=np.identity(len(A))
I

array([[1., 0.],
       [0., 1.]])

#### Eigenvalues and eigenvectors: Numpy function

In [4]:
lambdas,vecs=np.linalg.eig(A)

### Eigenvalues

The eigenvalues of a matrix A are the values of λ that satisfy the equation below:
    <br>***det(A − λI)=0***

In [5]:
lambdas

array([ 8., -2.])

##### The det(A)

In [29]:
np.linalg.det(A)

-15.999999999999998

##### The first eigenvalue

In [30]:
np.linalg.det(A-8*I)

0.0

##### The second eigenvalue

In [31]:
np.linalg.det(A+2*I)

0.0
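
For a 2x2 matrix, the characteristic equation det(A − λI)=0 expands to λ² − tr(A)·λ + det(A) = 0, so the eigenvalues can be cross-checked as polynomial roots. The sketch below is specific to n=2 and is only an illustration (it is not a description of how `np.linalg.eig` works internally). It also checks two standard identities: the eigenvalues sum to the trace and multiply to the determinant.

```python
import numpy as np

A = np.array([[7.0, 3.0], [3.0, -1.0]])

# Coefficients of lambda^2 - tr(A)*lambda + det(A); valid for a 2x2 matrix only
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
roots = np.sort(np.roots(coeffs))

# Reference values from NumPy's eigenvalue solver
lambdas = np.sort(np.linalg.eigvals(A))

print(roots)    # approximately [-2.  8.]
print(lambdas)  # approximately [-2.  8.]

# Sum of eigenvalues = trace, product of eigenvalues = determinant
print(np.isclose(lambdas.sum(), np.trace(A)))
print(np.isclose(lambdas.prod(), np.linalg.det(A)))
```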

### Eigenvectors

For each eigenvalue **λ**, an eigenvector **v** is a non-zero vector that satisfies the equation below

$ {\displaystyle \left(A-\lambda I\right)\mathbf {v} =\mathbf {0} ,} $

In [34]:
vecs.T

array([[ 0.9486833 ,  0.31622777],
       [-0.31622777,  0.9486833 ]])

##### The first eigenvector

In [20]:
(A-8*I).dot(vecs.T[0])

array([ 0.00000000e+00, -2.22044605e-16])

The product is **0** up to floating-point rounding error

##### The second eigenvector

In [37]:
(A+2*I).dot(vecs.T[1])

array([-3.88578059e-16, -5.55111512e-17])

The product is **0** up to floating-point rounding error
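
The two checks above can be folded into a single loop. The sketch below relies only on what `np.linalg.eig` documents: the eigenvectors are returned as the columns of the second output (hence the `.T`), each normalized to unit length. It tests the equivalent form A·v = λ·v with a floating-point tolerance instead of expecting exact zeros.

```python
import numpy as np

A = np.array([[7.0, 3.0], [3.0, -1.0]])
lambdas, vecs = np.linalg.eig(A)

# Iterate over (eigenvalue, eigenvector) pairs; vecs.T yields the columns of vecs
for lam, v in zip(lambdas, vecs.T):
    residual = A @ v - lam * v            # should vanish up to rounding error
    print(lam, np.allclose(residual, 0.0), np.isclose(np.linalg.norm(v), 1.0))
```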

For more details, please see the Wikipedia page: 
   <br> https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors

# Covariance  <a name="tit3"></a> 

The covariance of two vectors ***X*** and ***Y*** is defined by the equation below: 

$ \operatorname {Cov}(X,Y)\equiv \operatorname {E}[(X-\operatorname {E}[X])\,(Y-\operatorname {E}[Y])] $

### Covariance of two vectors

In [39]:
def cov(x,y): 
    return ((x-x.mean())*(y-y.mean())).mean()

We set a random seed so that the same random values are generated on every run

***X*** vector

In [70]:
np.random.seed(1)
X=np.random.uniform(-1,1,5)
X

array([-0.16595599,  0.44064899, -0.99977125, -0.39533485, -0.70648822])

***Y*** vector

In [71]:
np.random.seed(2)
Y=np.random.uniform(-1,1,5)
Y

array([-0.1280102 , -0.94814754,  0.09932496, -0.12935521, -0.1592644 ])

##### The covariance of X,Y

In [72]:
cov(X,Y)

-0.15891454248114759
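
The definition E[(X − E[X])(Y − E[Y])] is algebraically equal to E[XY] − E[X]E[Y]. The sketch below checks both forms on a small pair of vectors chosen so the result can be verified by hand; the vectors and the function names `cov_definition` and `cov_moments` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cov_definition(x, y):
    # Mean of the products of deviations from the mean (divides by n)
    return ((x - x.mean()) * (y - y.mean())).mean()

def cov_moments(x, y):
    # Equivalent moment form: E[XY] - E[X]E[Y]
    return (x * y).mean() - x.mean() * y.mean()

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Deviations: x -> [-1, 0, 1], y -> [-2, 0, 2]; products -> [2, 0, 2]; mean = 4/3
print(cov_definition(x, y))
print(cov_moments(x, y))
```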

##### Comparison with the NumPy covariance function

In [74]:
np.cov(X,Y)[0][1]

-0.19864317810143445

The value returned by NumPy is different because `np.cov` applies a bias correction by default: it divides by n-1 instead of n
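
With n=5, the two values differ by the factor n/(n-1) = 1.25. The sketch below regenerates the same seeded vectors and checks this relation; it also uses the `bias=True` argument of `np.cov`, which switches the normalization to n.

```python
import numpy as np

# Same seeded vectors as in the cells above
np.random.seed(1)
X = np.random.uniform(-1, 1, 5)
np.random.seed(2)
Y = np.random.uniform(-1, 1, 5)
n = len(X)

population_cov = ((X - X.mean()) * (Y - Y.mean())).mean()  # divides by n
sample_cov = np.cov(X, Y)[0, 1]                            # divides by n - 1 (default)

# The two estimates differ exactly by the factor n / (n - 1)
print(np.isclose(sample_cov, population_cov * n / (n - 1)))
# bias=True makes np.cov divide by n
print(np.isclose(np.cov(X, Y, bias=True)[0, 1], population_cov))
```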

##### The covariance function of two vectors with the bias correction

In [52]:
def cov2(x,y,bias_correction=True): 
    n=len(x)
    if bias_correction:
        return ((x-x.mean())*(y-y.mean())).mean()*n/(n-1)
    else: 
        return ((x-x.mean())*(y-y.mean())).mean()

##### X,Y covariance with bias correction: same as numpy function

In [75]:
cov2(X,Y)

-0.19864317810143448

##### X,Y covariance without bias correction

In [107]:
cov2(X,Y,bias_correction=False)

-0.15891454248114759

### The covariance of a matrix (nxm)

Wikipedia definition: for a matrix ***M*** whose rows are the variables $X_1,\dots,X_p$

$ {  M}={\begin{pmatrix}X_{1}\\\vdots \\X_{p}\end{pmatrix}} $

The covariance is defined as: 

$ \operatorname {Cov}({M})={\begin{pmatrix}\operatorname {Var}(X_{1})&\operatorname {Cov}(X_{{1}},X_{{2}})&\cdots &\operatorname {Cov}(X_{{1}},X_{{p}})\\\operatorname {Cov}(X_{{2}},X_{{1}})&\ddots &\cdots &\vdots \\\vdots &\vdots &\ddots &\vdots \\\operatorname {Cov}(X_{{p}},X_{{1}})&\cdots &\cdots &\operatorname {Var}(X_{p})\end{pmatrix}}={\begin{pmatrix}\sigma _{{x_{1}}}^{2}&\sigma _{{x_{{1}}x_{{2}}}}&\cdots &\sigma _{{x_{{1}}x_{{p}}}}\\\sigma _{{x_{{2}}x_{{1}}}}&\ddots &\cdots &\vdots \\\vdots &\vdots &\ddots &\vdots \\\sigma _{{x_{{p}}x_{{1}}}}&\cdots &\cdots &\sigma _{{x_{p}}}^{2}\end{pmatrix}} $

With $\operatorname{Var}(X_i)=\operatorname{Cov}(X_i,X_i)$

##### The covariance function of a matrix M

In [142]:
def CovM(M,bias_correction=True): 
    cov_out=np.zeros((M.shape[0],M.shape[0]))
    for i,x1 in enumerate(M):
        for j,x2 in enumerate(M): 
            cov_out[i,j]=cov2(x1,x2,bias_correction=bias_correction)
    return cov_out

##### M matrix
We set a random seed so that the same random values are generated on every run

In [86]:
np.random.seed(3)
M=np.random.uniform(-10,10,(3,3))
M

array([[ 1.01595805,  4.16295645, -4.18190522],
       [ 0.2165521 ,  7.85893909,  7.92586178],
       [-7.48829379, -5.85514244, -8.97065593]])

##### The covariance of ***M*** with our function

In [87]:
CovM(M)

array([[17.75968298, -2.7633031 ,  6.4738616 ],
       [-2.7633031 , 19.64066885,  0.14078121],
       [ 6.4738616 ,  0.14078121,  2.42850087]])

##### Comparison with the NumPy function

In [88]:
np.cov(M)

array([[17.75968298, -2.7633031 ,  6.4738616 ],
       [-2.7633031 , 19.64066885,  0.14078121],
       [ 6.4738616 ,  0.14078121,  2.42850087]])

Both approaches give the same result 👌
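
The double loop over rows can also be written as one matrix product: center each row, then multiply the centered matrix by its transpose. The sketch below is one possible vectorized variant, with rows as variables and columns as observations, as in the cells above. The name `cov_matrix_vectorized` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def cov_matrix_vectorized(M, bias_correction=True):
    """Covariance matrix of the rows of M (rows = variables, columns = observations)."""
    M = np.asarray(M, dtype=float)
    n = M.shape[1]                                  # number of observations per variable
    centered = M - M.mean(axis=1, keepdims=True)    # subtract each row's mean
    denom = n - 1 if bias_correction else n
    return centered @ centered.T / denom

np.random.seed(3)
M = np.random.uniform(-10, 10, (3, 3))

print(np.allclose(cov_matrix_vectorized(M), np.cov(M)))
print(np.allclose(cov_matrix_vectorized(M, bias_correction=False), np.cov(M, bias=True)))
```

The result is symmetric by construction, and its diagonal holds the variance of each row.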


For more information, please see the Wikipedia pages below:
<br>https://fr.wikipedia.org/wiki/Covariance
<br> https://en.wikipedia.org/wiki/Covariance_matrix

# Correlation  <a name="tit4"></a> 

Wikipedia definition: 

The population correlation coefficient $ {\displaystyle \rho _{X,Y}}$ between two random variables $X$ and $Y$ with expected values $ \mu _{X} $ and $ \mu _{Y} $ and standard deviations $\sigma _{X}$ and $ \sigma_Y $ is defined as:

$ {\displaystyle \rho _{X,Y}=\operatorname {corr} (X,Y)={\operatorname {cov} (X,Y) \over \sigma _{X}\sigma _{Y}}={\operatorname {E} [(X-\mu _{X})(Y-\mu _{Y})] \over \sigma _{X}\sigma _{Y}},\quad {\text{if}}\ \sigma _{X}\sigma _{Y}>0.}$ 

The Pearson correlation is defined only if both standard deviations are finite and positive. An alternative formula purely in terms of moments is:

$ {\displaystyle \rho _{X,Y}={\operatorname {E} (XY)-\operatorname {E} (X)\operatorname {E} (Y) \over {\sqrt {\operatorname {E} (X^{2})-\operatorname {E} (X)^{2}}}\cdot {\sqrt {\operatorname {E} (Y^{2})-\operatorname {E} (Y)^{2}}}}} $

Here E(X) denotes the mean (average) of X.
<br> 
**Unlike the covariance, the correlation coefficient always lies in the range [-1,1]**
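
The contrast can be illustrated by rescaling the variables: multiplying X by a and Y by b (both positive) multiplies the covariance by a·b but leaves the correlation unchanged. The sketch below uses small made-up vectors and scale factors, which are assumptions for illustration only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
a, b = 10.0, 100.0   # arbitrary positive scale factors

cov_xy = np.cov(x, y)[0, 1]
cov_scaled = np.cov(a * x, b * y)[0, 1]
r_xy = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(a * x, b * y)[0, 1]

print(np.isclose(cov_scaled, a * b * cov_xy))  # covariance scales with the units
print(np.isclose(r_scaled, r_xy))              # correlation does not
print(-1.0 <= r_xy <= 1.0)
```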

### The correlation function of two vectors

In [90]:
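def corr(x,y): 
    # Equivalent form: cov2(x,y,bias_correction=False)/(x.std()*y.std())
    # Numerator: E(XY) - E(X)E(Y)
    N=(x*y).mean()-x.mean()*y.mean()
    # Denominator: product of the two population standard deviations
    D=np.sqrt((x**2).mean()-x.mean()**2)
    D=D*np.sqrt((y**2).mean()-y.mean()**2)
    return N/D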

Reminder of X,Y vectors already defined 

In [96]:
X,Y

(array([-0.16595599,  0.44064899, -0.99977125, -0.39533485, -0.70648822]),
 array([-0.1280102 , -0.94814754,  0.09932496, -0.12935521, -0.1592644 ]))

##### The correlation of X, Y: calculation by the defined function 

In [95]:
corr(X,Y)

-0.8982972930960346

##### Comparison with the NumPy function

In [98]:
np.corrcoef(X, Y)[0,1]

-0.8982972930960343

We find the same result, up to floating-point rounding in the last digits
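
The correlation can also be computed as covariance divided by the product of standard deviations, provided the same normalization (n or n-1) is used in the numerator and in the denominator. The sketch below regenerates the seeded vectors and compares a consistent population form, a consistent sample form, and a deliberately mixed form. The variable names are illustrative.

```python
import numpy as np

np.random.seed(1)
X = np.random.uniform(-1, 1, 5)
np.random.seed(2)
Y = np.random.uniform(-1, 1, 5)
n = len(X)

cov_pop = ((X - X.mean()) * (Y - Y.mean())).mean()

# Consistent: divide by n everywhere (np.std uses ddof=0 by default)
r_pop = cov_pop / (X.std() * Y.std())
# Consistent: divide by n - 1 everywhere
r_sample = np.cov(X, Y)[0, 1] / (X.std(ddof=1) * Y.std(ddof=1))
# Inconsistent: n - 1 in the covariance, n in the standard deviations
r_mixed = np.cov(X, Y)[0, 1] / (X.std() * Y.std())

print(np.isclose(r_pop, r_sample))               # the factors cancel
print(np.isclose(r_mixed, r_pop * n / (n - 1)))  # off by n / (n - 1)
```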

### The correlation of a matrix

In [132]:
def corrM(M):
    corr_out=np.zeros((M.shape[0],M.shape[0]))
    for i,x1 in enumerate(M):
        for j,x2 in enumerate(M): 
            corr_out[i,j]=corr(x1,x2)
    return corr_out

Reminder of the M matrix

In [101]:
M

array([[ 1.01595805,  4.16295645, -4.18190522],
       [ 0.2165521 ,  7.85893909,  7.92586178],
       [-7.48829379, -5.85514244, -8.97065593]])

##### M correlation with the local function

In [102]:
corrM(M)

array([[ 1.        , -0.14795607,  0.98577245],
       [-0.14795607,  1.        ,  0.02038438],
       [ 0.98577245,  0.02038438,  1.        ]])

##### M correlation with numpy function

In [103]:
np.corrcoef(M)

array([[ 1.        , -0.14795607,  0.98577245],
       [-0.14795607,  1.        ,  0.02038438],
       [ 0.98577245,  0.02038438,  1.        ]])

We find the same values
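
The correlation matrix can also be derived directly from the covariance matrix by dividing each entry Cov(Xi,Xj) by σi·σj, where the standard deviations are the square roots of the diagonal. The sketch below shows this under the assumption that no row has zero variance. `corr_from_cov` is a hypothetical helper name.

```python
import numpy as np

def corr_from_cov(C):
    """Convert a covariance matrix into a correlation matrix."""
    sigma = np.sqrt(np.diag(C))        # standard deviation of each variable
    return C / np.outer(sigma, sigma)  # entry (i, j) divided by sigma_i * sigma_j

np.random.seed(3)
M = np.random.uniform(-10, 10, (3, 3))

R = corr_from_cov(np.cov(M))
print(np.allclose(R, np.corrcoef(M)))
print(np.allclose(np.diag(R), 1.0))
```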

For more information, please see the Wikipedia page below:
<br> https://en.wikipedia.org/wiki/Correlation

# Application with the diabetes dataset <a name="tit5"></a> 

### The dataset 

For more information about this dataset, please see the project [Linear Logistic Regression with NumPy pure code and a comparison wit Sklearn model](https://bouz1.github.io/fils/NumpyLR.html), or the official link https://scikit-learn.org/stable/datasets/toy_dataset.html

##### Import the dataset

In [115]:
from sklearn import datasets
X,Y=datasets.load_diabetes(return_X_y=True)

##### Organize data in Pandas DataFrame

In [120]:
import pandas as pd

In [121]:
inp=['x'+str(i) for i in range(X.shape[1])]
inp+['y']

['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y']

In [122]:
df=pd.concat([pd.DataFrame(X,columns=inp),pd.DataFrame(Y,columns=['y'])],axis=1)
df.head(3)

| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |


### The covariance

##### The covariance with the pandas function: 

In [139]:
df.cov()

| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x0 | 0.002268 | 0.000394 | 0.00042 | 0.000761 | 0.00059 | 0.000497 | -0.00017 | 0.000462 | 0.000614 | 0.000684 | 0.689758 |
| x1 | 0.000394 | 0.002268 | 0.0002 | 0.000547 | 8e-05 | 0.000323 | -0.00086 | 0.000753 | 0.00034 | 0.000472 | 0.158085 |
| x2 | 0.00042 | 0.0002 | 0.002268 | 0.000897 | 0.000566 | 0.000592 | -0.000832 | 0.000938 | 0.001012 | 0.000881 | 2.152914 |
| x3 | 0.000761 | 0.000547 | 0.000897 | 0.002268 | 0.00055 | 0.000421 | -0.000405 | 0.000584 | 0.000892 | 0.000885 | 1.620722 |
| x4 | 0.00059 | 8e-05 | 0.000566 | 0.00055 | 0.002268 | 0.002033 | 0.000117 | 0.001229 | 0.001169 | 0.000739 | 0.778355 |
| x5 | 0.000497 | 0.000323 | 0.000592 | 0.000421 | 0.002033 | 0.002268 | -0.000445 | 0.001496 | 0.000722 | 0.000659 | 0.638967 |
| x6 | -0.00017 | -0.00086 | -0.000832 | -0.000405 | 0.000117 | -0.000445 | 0.002268 | -0.001675 | -0.000904 | -0.000621 | -1.449309 |
| x7 | 0.000462 | 0.000753 | 0.000938 | 0.000584 | 0.001229 | 0.001496 | -0.001675 | 0.002268 | 0.001401 | 0.000946 | 1.580234 |
| x8 | 0.000614 | 0.00034 | 0.001012 | 0.000892 | 0.001169 | 0.000722 | -0.000904 | 0.001401 | 0.002268 | 0.001054 | 2.077409 |
| x9 | 0.000684 | 0.000472 | 0.000881 | 0.000885 | 0.000739 | 0.000659 | -0.000621 | 0.000946 | 0.001054 | 0.002268 | 1.404133 |


##### The covariance with our function
For readability, we convert the result to a DataFrame

In [144]:
pd.DataFrame(CovM(df.values.T))

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002268 | 0.000394 | 0.00042 | 0.000761 | 0.00059 | 0.000497 | -0.00017 | 0.000462 | 0.000614 | 0.000684 | 0.689758 |
| 1 | 0.000394 | 0.002268 | 0.0002 | 0.000547 | 8e-05 | 0.000323 | -0.00086 | 0.000753 | 0.00034 | 0.000472 | 0.158085 |
| 2 | 0.00042 | 0.0002 | 0.002268 | 0.000897 | 0.000566 | 0.000592 | -0.000832 | 0.000938 | 0.001012 | 0.000881 | 2.152914 |
| 3 | 0.000761 | 0.000547 | 0.000897 | 0.002268 | 0.00055 | 0.000421 | -0.000405 | 0.000584 | 0.000892 | 0.000885 | 1.620722 |
| 4 | 0.00059 | 8e-05 | 0.000566 | 0.00055 | 0.002268 | 0.002033 | 0.000117 | 0.001229 | 0.001169 | 0.000739 | 0.778355 |
| 5 | 0.000497 | 0.000323 | 0.000592 | 0.000421 | 0.002033 | 0.002268 | -0.000445 | 0.001496 | 0.000722 | 0.000659 | 0.638967 |
| 6 | -0.00017 | -0.00086 | -0.000832 | -0.000405 | 0.000117 | -0.000445 | 0.002268 | -0.001675 | -0.000904 | -0.000621 | -1.449309 |
| 7 | 0.000462 | 0.000753 | 0.000938 | 0.000584 | 0.001229 | 0.001496 | -0.001675 | 0.002268 | 0.001401 | 0.000946 | 1.580234 |
| 8 | 0.000614 | 0.00034 | 0.001012 | 0.000892 | 0.001169 | 0.000722 | -0.000904 | 0.001401 | 0.002268 | 0.001054 | 2.077409 |
| 9 | 0.000684 | 0.000472 | 0.000881 | 0.000885 | 0.000739 | 0.000659 | -0.000621 | 0.000946 | 0.001054 | 0.002268 | 1.404133 |


We find the same result
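
Two details make this comparison work: pandas treats columns as variables while `np.cov` treats rows as variables by default (hence the `.T`), and both divide by n-1 by default. The sketch below checks this programmatically with `np.allclose` instead of comparing the tables by eye. To stay self-contained it uses a small random DataFrame as a stand-in for the diabetes data; `df_demo` and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data: 50 observations (rows) of 4 variables (columns)
rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

pandas_cov = df_demo.cov().values

# Option 1: transpose so that each row is a variable
numpy_cov_T = np.cov(df_demo.values.T)
# Option 2: keep the layout and tell NumPy that columns are the variables
numpy_cov_rowvar = np.cov(df_demo.values, rowvar=False)

print(np.allclose(pandas_cov, numpy_cov_T))
print(np.allclose(pandas_cov, numpy_cov_rowvar))
```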

### The correlation

##### The correlation with the pandas function: 

In [124]:
df.corr()

| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x0 | 1.0 | 0.173737 | 0.185085 | 0.335428 | 0.260061 | 0.219243 | -0.075181 | 0.203841 | 0.270774 | 0.301731 | 0.187889 |
| x1 | 0.173737 | 1.0 | 0.088161 | 0.24101 | 0.035277 | 0.142637 | -0.37909 | 0.332115 | 0.149916 | 0.208133 | 0.043062 |
| x2 | 0.185085 | 0.088161 | 1.0 | 0.395411 | 0.249777 | 0.26117 | -0.366811 | 0.413807 | 0.446157 | 0.38868 | 0.58645 |
| x3 | 0.335428 | 0.24101 | 0.395411 | 1.0 | 0.242464 | 0.185548 | -0.178762 | 0.25765 | 0.39348 | 0.39043 | 0.441482 |
| x4 | 0.260061 | 0.035277 | 0.249777 | 0.242464 | 1.0 | 0.896663 | 0.051519 | 0.542207 | 0.515503 | 0.325717 | 0.212022 |
| x5 | 0.219243 | 0.142637 | 0.26117 | 0.185548 | 0.896663 | 1.0 | -0.196455 | 0.659817 | 0.318357 | 0.2906 | 0.174054 |
| x6 | -0.075181 | -0.37909 | -0.366811 | -0.178762 | 0.051519 | -0.196455 | 1.0 | -0.738493 | -0.398577 | -0.273697 | -0.394789 |
| x7 | 0.203841 | 0.332115 | 0.413807 | 0.25765 | 0.542207 | 0.659817 | -0.738493 | 1.0 | 0.617859 | 0.417212 | 0.430453 |
| x8 | 0.270774 | 0.149916 | 0.446157 | 0.39348 | 0.515503 | 0.318357 | -0.398577 | 0.617859 | 1.0 | 0.464669 | 0.565883 |
| x9 | 0.301731 | 0.208133 | 0.38868 | 0.39043 | 0.325717 | 0.2906 | -0.273697 | 0.417212 | 0.464669 | 1.0 | 0.382483 |


##### The correlation with our function
For readability, we convert the result to a DataFrame

In [133]:
pd.DataFrame(corrM(df.values.T))

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.173737 | 0.185085 | 0.335428 | 0.260061 | 0.219243 | -0.075181 | 0.203841 | 0.270774 | 0.301731 | 0.187889 |
| 1 | 0.173737 | 1.0 | 0.088161 | 0.24101 | 0.035277 | 0.142637 | -0.37909 | 0.332115 | 0.149916 | 0.208133 | 0.043062 |
| 2 | 0.185085 | 0.088161 | 1.0 | 0.395411 | 0.249777 | 0.26117 | -0.366811 | 0.413807 | 0.446157 | 0.38868 | 0.58645 |
| 3 | 0.335428 | 0.24101 | 0.395411 | 1.0 | 0.242464 | 0.185548 | -0.178762 | 0.25765 | 0.39348 | 0.39043 | 0.441482 |
| 4 | 0.260061 | 0.035277 | 0.249777 | 0.242464 | 1.0 | 0.896663 | 0.051519 | 0.542207 | 0.515503 | 0.325717 | 0.212022 |
| 5 | 0.219243 | 0.142637 | 0.26117 | 0.185548 | 0.896663 | 1.0 | -0.196455 | 0.659817 | 0.318357 | 0.2906 | 0.174054 |
| 6 | -0.075181 | -0.37909 | -0.366811 | -0.178762 | 0.051519 | -0.196455 | 1.0 | -0.738493 | -0.398577 | -0.273697 | -0.394789 |
| 7 | 0.203841 | 0.332115 | 0.413807 | 0.25765 | 0.542207 | 0.659817 | -0.738493 | 1.0 | 0.617859 | 0.417212 | 0.430453 |
| 8 | 0.270774 | 0.149916 | 0.446157 | 0.39348 | 0.515503 | 0.318357 | -0.398577 | 0.617859 | 1.0 | 0.464669 | 0.565883 |
| 9 | 0.301731 | 0.208133 | 0.38868 | 0.39043 | 0.325717 | 0.2906 | -0.273697 | 0.417212 | 0.464669 | 1.0 | 0.382483 |


We find the same result

# Conclusion <a name="tit6"></a> 
In this notebook we defined eigenvalues and eigenvectors, variance and covariance, and correlation. We then compared our own functions for computing these quantities with the NumPy/pandas functions and found the same result each time.