# Introduction to Multivariate Statistics

In [None]:
"""
1. Expected Value and Mean
2. Variance and Standard Deviation
3. Covariance and Correlation
4. Covariance Matrix
"""

### Expected Value and Mean

In [None]:
"""In probability, the average value of some random variable X is called the expected value or the
expectation.

The expected value uses the notation E with square brackets around the name of
the variable; for example:

E[X]

It is calculated as the probability-weighted sum of the values that can be drawn.

E[X] = x1 × p1 + x2 × p2 + x3 × p3 + · · · + xn × pn

In simple cases, such as flipping a coin or rolling a die, every outcome is equally likely.
Therefore, the expected value can be calculated as the sum of all values multiplied
by the reciprocal of the number of values.

E[X] = 1/n × (x1 + x2 + x3 + · · · + xn)

In statistics, the mean, or more technically the arithmetic mean or sample mean, can be
estimated from a sample of examples drawn from the domain. The terminology can be confusing
because mean, average, and expected value are often used interchangeably. In the abstract, the
mean is denoted by the lowercase Greek letter mu (µ) and is calculated from the sample of
observations, rather than all possible values.

µ = 1/n × (x1 + x2 + x3 + · · · + xn)

or, weighting each value by its probability:

µ = Σ x × P(x)

Where x is the vector of observations and P(x) is the calculated probability for each value.
When calculated for a specific variable, such as x, the mean is denoted as the lowercase variable
name with a line above, called x-bar (x̄).

x̄ = 1/n × Σ (i = 1 to n) xi

The arithmetic mean can be calculated for a vector or matrix in NumPy by using the mean()
function. The example below defines a 6-element vector and calculates the mean.
"""

In [1]:
### Example of calculating a vector mean.
# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate mean
result = mean(v)
print(result)


[1 2 3 4 5 6]
3.5
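
The probability-weighted sum described earlier can also be sketched in NumPy. The snippet below is a minimal illustration, not part of the original lesson: the fair-die probabilities and the `loaded_probs` values are made-up inputs chosen only to show the calculation.

```python
# expected value as a probability-weighted sum
from numpy import array
from numpy import average
from numpy import dot
# outcomes of a six-sided die
values = array([1, 2, 3, 4, 5, 6])
# equal probabilities for a fair die
probs = array([1 / 6] * 6)
# E[X] = x1 * p1 + x2 * p2 + ... + xn * pn
expected = dot(values, probs)
print(expected)
# hypothetical loaded die: illustrative probabilities that sum to 1
loaded_probs = array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
# average() with weights computes the same weighted sum
expected_loaded = average(values, weights=loaded_probs)
print(expected_loaded)
```

With equal probabilities the weighted sum reduces to the ordinary arithmetic mean, which is why the fair-die result agrees with mean(v) above.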


In [2]:
"""
The mean() function can calculate the column or row means of a matrix by setting the
axis argument to 0 or 1 respectively. The example below defines a 2 × 6 matrix and
calculates both column and row means.
"""
### Example of calculating matrix means
# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)


[[1 2 3 4 5 6]
 [1 2 3 4 5 6]]
[1. 2. 3. 4. 5. 6.]
[3.5 3.5]


### Variance and Standard Deviation

In [None]:
"""
In probability, the variance of some random variable X is a measure of how much values in the
distribution vary on average with respect to the mean. The variance is denoted as the function
Var[] applied to the variable.

Var[X]

Variance is calculated as the average squared difference of each value in the distribution
from the expected value, or equivalently as the expected squared difference from the expected
value.

Var[X] = E[(X − E[X])²]

Assuming the expected value of the variable has been calculated (E[X]), the variance of the
random variable can be calculated as the sum of the squared difference of each example from
the expected value multiplied by the probability of that value.

Var[X] = p(x1) × (x1 − E[X])² + p(x2) × (x2 − E[X])² + · · · + p(xn) × (xn − E[X])²

If the probability of each example in the distribution is equal, variance calculation can drop
the individual probabilities and multiply the sum of squared differences by the reciprocal of the
number of examples in the distribution.

Var[X] = 1/n × ((x1 − E[X])² + (x2 − E[X])² + · · · + (xn − E[X])²)


In statistics, the variance can be estimated from a sample of examples drawn from the
domain. In the abstract, the sample variance is denoted by the lowercase sigma with a 2
superscript (e.g. σ²), indicating that the units are squared, not that you must square the final
value. The sum of the squared differences is multiplied by the reciprocal of the number of
examples minus 1 to correct for a bias (the bias relates to a deeper discussion of degrees of
freedom, and I refer you to the references at the end of the lesson).

σ² = 1/(n − 1) × Σ (i = 1 to n) (xi − µ)²

"""

In [3]:
# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)

[1 2 3 4 5 6]
3.5
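
As a check on the formulas, the variance can be computed by hand and compared with var(). This is a minimal sketch using the same vector as above; the variable names `population_var` and `sample_var` are chosen here for illustration.

```python
# variance from the formula, compared with var()
from numpy import array
from numpy import isclose
from numpy import mean
from numpy import var
# define vector
v = array([1, 2, 3, 4, 5, 6])
n = len(v)
# squared differences from the mean
sq_diff = (v - mean(v)) ** 2
# divide by n (all values equally likely)
population_var = sq_diff.sum() / n
# divide by n - 1 (bias-corrected sample variance)
sample_var = sq_diff.sum() / (n - 1)
print(population_var, sample_var)
# var() divides by n unless ddof is given
print(isclose(population_var, var(v)))
print(isclose(sample_var, var(v, ddof=1)))
```

The ddof argument is the value subtracted from n in the divisor, so ddof=1 gives the 1/(n − 1) form.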


In [4]:
# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)

# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)

[[1 2 3 4 5 6]
 [1 2 3 4 5 6]]
[0. 0. 0. 0. 0. 0.]
[3.5 3.5]


#### Standard Deviation
The standard deviation is calculated as the square root of the variance and is denoted as
lowercase s.

In [None]:
# formula: s = √(σ²)

"""
To keep with this notation, the variance is sometimes indicated as s², with 2 as a superscript,
again showing that the units are squared. NumPy also provides a function for calculating
the standard deviation directly via the std() function. As with the var() function, the ddof
argument must be set to 1 to calculate the sample standard deviation, and column and
row standard deviations can be calculated by setting the axis argument to 0 and 1 respectively.
The example below demonstrates how to calculate the sample standard deviation for the rows
and columns of a matrix.
"""

In [5]:
# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)

[[1 2 3 4 5 6]
 [1 2 3 4 5 6]]
[0. 0. 0. 0. 0. 0.]
[1.87082869 1.87082869]
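
The relationship between the two measures can be confirmed directly. The sketch below, which reuses the matrix from the example above, takes the square root of the row variances and compares the result with std().

```python
# standard deviation as the square root of the variance
from numpy import allclose
from numpy import array
from numpy import sqrt
from numpy import std
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
# row variances and row standard deviations, both with ddof=1
row_var = var(M, ddof=1, axis=1)
row_std = std(M, ddof=1, axis=1)
# the square root of each variance should match std()
print(sqrt(row_var))
print(allclose(sqrt(row_var), row_std))
```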


### Covariance and Correlation


In [None]:
"""
In probability, covariance is the measure of the joint variability of two random variables. It
describes how the two variables change together. It is denoted as the function cov(X, Y), where
X and Y are the two random variables being considered.

cov(X, Y )

Covariance is calculated as the expected value or average of the product of the differences of
each random variable from its expected value, where E[X] is the expected value of X and
E[Y] is the expected value of Y.

cov(X, Y) = E[(X − E[X]) × (Y − E[Y])]

cov(X, Y) = 1/n × Σ (x − E[X]) × (y − E[Y])

In statistics, the sample covariance can be calculated in the same way, although with a bias
correction, the same as with the variance.

cov(X, Y) = 1/(n − 1) × Σ (x − E[X]) × (y − E[Y])

The sign of the covariance can be interpreted as whether the two variables increase together
(positive) or whether one decreases as the other increases (negative). The magnitude of the
covariance is not easily interpreted. A covariance value of zero indicates that the two variables
have no linear relationship; on its own it does not guarantee that they are independent.
NumPy does not have a function to calculate the covariance between two variables directly.
Instead, it has a function for calculating a covariance matrix called cov() that we can use to
retrieve the covariance. By default, the cov() function will calculate the unbiased or sample
covariance between the provided random variables.
"""

In [6]:
# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate covariance
Sigma = cov(x,y)[0,1]
print(Sigma)

[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5
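
To connect the result to the sample covariance formula, the same value can be computed by hand. This is a minimal sketch using the vectors from the example above; the name `manual` is chosen here for illustration.

```python
# sample covariance from the formula, compared with cov()
from numpy import array
from numpy import cov
from numpy import isclose
from numpy import mean
# define the two vectors
x = array([1,2,3,4,5,6,7,8,9])
y = array([9,8,7,6,5,4,3,2,1])
n = len(x)
# sum of products of differences from the means, divided by n - 1
manual = ((x - mean(x)) * (y - mean(y))).sum() / (n - 1)
print(manual)
# compare with the off-diagonal entry of the matrix returned by cov()
print(isclose(manual, cov(x, y)[0, 1]))
# the full 2 x 2 matrix holds the variances on its diagonal
print(cov(x, y))
```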


In [None]:
"""
The covariance can be normalized to a score between -1 and 1 to make the magnitude
interpretable by dividing it by the product of the standard deviations of X and Y. The result is
called the correlation of the variables, also called the Pearson correlation coefficient, named for
the developer of the method.

r = cov(X, Y) / (sX × sY)

Where r is the correlation coefficient of X and Y, cov(X, Y) is the sample covariance of X
and Y, and sX and sY are the standard deviations of X and Y respectively. NumPy provides
the corrcoef() function for calculating the correlation between two variables directly. Like
cov(), it returns a matrix, in this case a correlation matrix. As with the results from cov(), we
can access just the correlation of interest via the [0,1] value of the returned square matrix.
"""

In [7]:
# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate correlation
corr = corrcoef(x,y)[0,1]
print(corr)


[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
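
The formula for r can be checked against corrcoef() with a short sketch. It assumes the sample standard deviations are computed with ddof=1 so that they match the default normalization used by cov().

```python
# correlation from covariance and standard deviations
from numpy import array
from numpy import corrcoef
from numpy import cov
from numpy import isclose
from numpy import std
# define the two vectors
x = array([1,2,3,4,5,6,7,8,9])
y = array([9,8,7,6,5,4,3,2,1])
# r = cov(X, Y) / (sX * sY), using sample statistics throughout
r = cov(x, y)[0, 1] / (std(x, ddof=1) * std(y, ddof=1))
print(r)
# compare with the value returned by corrcoef()
print(isclose(r, corrcoef(x, y)[0, 1]))
```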


### Covariance Matrix

In [None]:
"""
The covariance matrix is a square and symmetric matrix that describes the covariance between
two or more random variables. The diagonal of the covariance matrix holds the variances of each
of the random variables; as such, it is often called the variance-covariance matrix. A covariance
matrix is a generalization of the covariance of two variables and captures the way in which all
variables in the dataset may change together. The covariance matrix is denoted as the uppercase
Greek letter Sigma, e.g. Σ. The covariance for each pair of random variables is calculated as
above.

Σi,j = cov(Xi, Xj) = E[(Xi − E[Xi]) × (Xj − E[Xj])]

Below is an example that defines a dataset with 5 observations across 3 features
and calculates the covariance matrix. Because cov() treats each row as a variable by default,
the matrix of observations is transposed before the call.
"""

In [1]:
# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
print(X)
# calculate covariance matrix
Sigma = cov(X.T)
print(Sigma)


[[ 1  5  8]
 [ 3  5 11]
 [ 2  4  9]
 [ 3  6 10]
 [ 1  5 10]]
[[1.   0.25 0.75]
 [0.25 0.5  0.25]
 [0.75 0.25 1.3 ]]
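
The properties described above (symmetry, and variances on the diagonal) can be verified for this dataset. The sketch below also uses the rowvar=False argument of cov() as an alternative to transposing the matrix.

```python
# checking properties of the covariance matrix
from numpy import allclose
from numpy import array
from numpy import cov
from numpy import diag
from numpy import var
# define matrix of observations (rows) across features (columns)
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
# rowvar=False treats each column as a variable, like cov(X.T)
Sigma = cov(X, rowvar=False)
print(allclose(Sigma, cov(X.T)))
# the matrix is symmetric
print(allclose(Sigma, Sigma.T))
# the diagonal holds the sample variance of each column
print(allclose(diag(Sigma), var(X, ddof=1, axis=0)))
```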
