## Variance and covariance

First, let's load some data: R's built-in `iris` data set, with the non-numeric species column (column 5) removed.

In [2]:
data <- iris
data <- data[,-5]

print(head(data))
nrow(data)
ncol(data)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4


### Variance

Variance, the square of the standard deviation, measures the spread of the values in a data set. The squared distances between each data point $x_i$ and the data set mean $\mu$ are summed and then divided by the total number of items $N$ in the set.

$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$

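This is known as the "population variance". The sample variance is more commonly used, because most data sets are samples drawn from a larger population rather than the complete population. It divides by $n - 1$ instead of the sample size $n$ (Bessel's correction), with $\mu$ now denoting the sample mean:

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1} $$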

Thus, the sample variance for the first dimension of our sample data set can be calculated as follows:

In [2]:
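# Squared deviations of the first column from its mean
x <- data[,1]
x <- x - mean(x)
x <- x^2

# Sample variance: divide by n - 1
# (named sampleVar so that R's built-in var() is not masked)
sampleVar <- sum(x) / (length(x) - 1)
print(sampleVar)

# Cross-check: squared standard deviation
ssd <- sd(data[,1])^2
print(ssd)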

[1] 0.6856935
[1] 0.6856935


We can validate this result by calculating the squared standard deviation with R's built-in `sd` function, which gives the second printed value above.
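
For comparison, the population formula given earlier can be sketched in the same way. This is a minimal illustration, not part of the original notebook: the names `popVar` and `sampleVar` are chosen here for the example, and the data is reloaded from `iris` so the snippet runs on its own. It relies on R's built-in `var`, which uses the $n - 1$ denominator.

```r
# Assumption: first numeric column of iris, as used in this section.
x <- iris[, 1]
n <- length(x)

# Population variance: divide the summed squared deviations by n.
popVar <- sum((x - mean(x))^2) / n

# Sample variance from R's built-in var(), which divides by n - 1.
sampleVar <- var(x)

print(popVar)
# Rescaling the sample variance by (n - 1) / n recovers the population value.
print(sampleVar * (n - 1) / n)
```

The two printed values should agree, which shows that the two definitions differ only by the factor $(n - 1)/n$.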

### Covariance

Covariance measures how two distinct variables (data dimensions, in our case) are linearly related. If the covariance is negative, so is the linear relationship between them, i.e., values in one vector tend to decrease as values in the other increase. If it is positive, the values tend to increase together. Sample covariance can be defined as follows:

$$ C_{xy} = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y) $$

The covariance between the first and second dimensions of our sample data set follows directly from this formula, and R's built-in `cov` function computes the same quantity.
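
A minimal sketch of that calculation, written in the same style as the variance cell. The variable names `x`, `y`, and `covXY` are chosen here for illustration, and `data` is rebuilt from `iris` so the snippet is self-contained.

```r
# Assumption: `data` is the numeric iris data frame used in this section.
data <- iris[, -5]

x <- data[, 1]  # Sepal.Length
y <- data[, 2]  # Sepal.Width

# Sum of products of deviations from the means, divided by n - 1.
covXY <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
print(covXY)

# Cross-check against R's built-in cov().
print(cov(x, y))
```

Both lines should print the same small negative value, suggesting a weak negative linear relationship between sepal length and sepal width in this data set.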

### Variance-covariance matrix

Also known simply as the "covariance matrix", this is a matrix representation of the covariances between all pairs of data set dimensions. The main diagonal of this matrix consists of variances, because the covariance of a variable with itself is its variance.

In [3]:
covMatrix <- cov(data)
print(covMatrix)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063


To illustrate this layout, the following cell labels each position of a 4×4 matrix as either a variance or a covariance:

In [4]:
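dims <- c(1,2,3,4)

# Label every cell "cov", then mark the main diagonal as "var".
# Stored under its own name so the numeric covMatrix is not overwritten.
layoutMatrix <- matrix(rep("cov", 16), nrow = 4, ncol = 4)
colnames(layoutMatrix) <- dims
rownames(layoutMatrix) <- dims
diag(layoutMatrix) <- "var"
print(layoutMatrix)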

  1     2     3     4    
1 "var" "cov" "cov" "cov"
2 "cov" "var" "cov" "cov"
3 "cov" "cov" "var" "cov"
4 "cov" "cov" "cov" "var"
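
The same structure can be checked numerically. The sketch below is one possible way to build the covariance matrix by hand from mean-centred columns and compare it with `cov`. The names `X`, `Xc`, and `manualCov` are chosen here for illustration, and `data` is rebuilt from `iris` so the snippet runs on its own.

```r
# Assumption: `data` is the numeric iris data frame used in this section.
data <- iris[, -5]
X <- as.matrix(data)
n <- nrow(X)

# Centre each column by subtracting its mean.
Xc <- sweep(X, 2, colMeans(X))

# Sample covariance matrix: cross-products of centred columns over n - 1.
manualCov <- crossprod(Xc) / (n - 1)

# Agrees with the built-in cov().
print(all.equal(manualCov, cov(data)))

# The diagonal holds the per-column sample variances.
print(all.equal(unname(diag(manualCov)), unname(sapply(data, var))))

# cov(x, y) equals cov(y, x), so the matrix is symmetric.
print(isSymmetric(manualCov))
```

Each check should print `TRUE`, matching the "var" on the diagonal and "cov" elsewhere pattern shown above.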
