# Video: Calculating the Variance

This video shows how to calculate the variance of a data set in pure Python or using the NumPy library. Adjustments for sample variance are also covered.

In [None]:
import numpy as np

In [None]:
data = [3, 4, 4, 5, 7, 2, 10]

Script:
* Let's talk about the variance of a data set now.
* The variance of a data set is a way to measure how spread out the data is around the mean.
* Here's the calculation.

In [None]:
mean = sum(data) / len(data)

In [None]:
variance = sum((d - mean)**2 for d in data) / len(data)

Script:
* Let's go through the details.
* HIGHLIGHT MEAN
* mean is the same calculation that we covered earlier.
* HIGHLIGHT D MINUS MEAN
* d minus mean is the difference between a particular sample d and the mean.
* HIGHLIGHT SQUARED TERM
* d minus mean in parentheses, followed by double asterisk two, takes the square of that difference. The double asterisk in Python is exponentiation, or repeated multiplication.
* HIGHLIGHT FOR D IN DATA
* The next part, for d in data, runs that calculation for every value in the data variable.
* HIGHLIGHT SUM
* Then all of those calculated values, all the squared differences, are passed into the sum function that adds them up.
* HIGHLIGHT LEN
* Then all of that is divided by len(data), which is the number of data points.
* So that's the variance, sometimes called the population variance.
* There's a slightly different calculation that we should be using for samples, the sample variance.
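
The highlighted pieces above can also be written out one step at a time. This is a sketch of the same population variance calculation; the intermediate names `deviations` and `squared_deviations` are illustrative and are not used elsewhere in this notebook.

```python
data = [3, 4, 4, 5, 7, 2, 10]

# Step 1: the mean, as covered earlier.
mean = sum(data) / len(data)

# Step 2: the difference between each data point and the mean.
deviations = [d - mean for d in data]

# Step 3: square each difference with the ** operator.
squared_deviations = [dev**2 for dev in deviations]

# Step 4: add up the squared differences and divide by the number of data points.
variance = sum(squared_deviations) / len(data)

print(mean)
print(squared_deviations)
print(variance)
```

The squared differences add up to 44, and dividing by the 7 data points gives a variance of roughly 6.29.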

In [None]:
sample_variance = sum((d - mean)**2 for d in data) / (len(data) - 1)

Script:
* HIGHLIGHT DENOMINATOR
* The difference is in the denominator, which is decreased by one.
* The technical reason for this change is that the first formula underestimates the variance when the mean is calculated from the same data set, because the data points are, on average, closer to their own mean than to the true mean.
* Practically, this is mostly an issue for small data sets.
* But you will notice that this calculation blows up if you only have one data point, since that turns into a divide by zero in the sample variance calculation.
* Calculating the variance of a single data point is a little silly. Are we calculating its variation from itself, so the answer is zero?
* Or are we calculating its variation from some true mean that we don't know?
* The sample variance adjustment calls out the one-sample variance as shenanigans.
* In a previous video, I promised that the NumPy module would provide functions to handle these calculations, so let's wrap up with those.
* The NumPy function is called var, so using the usual np import, we will call it as `np.var`.
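
One way to see the underestimate described above is to pretend, purely for illustration, that `data` is an entire population, and then average both formulas over every possible sample of size two drawn with replacement. This setup and the helper name `variance_of` are assumptions made for this sketch, not something taken from the video.

```python
from itertools import product

# Assumption for this sketch: treat the data as the whole population.
population = [3, 4, 4, 5, 7, 2, 10]
pop_mean = sum(population) / len(population)
pop_variance = sum((x - pop_mean)**2 for x in population) / len(population)


def variance_of(sample, adjustment):
    """Variance around the sample's own mean, dividing by len(sample) - adjustment."""
    m = sum(sample) / len(sample)
    return sum((x - m)**2 for x in sample) / (len(sample) - adjustment)


# Every ordered sample of size 2, drawn with replacement (7 * 7 = 49 samples).
samples = list(product(population, repeat=2))

# Average each formula over all of the possible samples.
avg_unadjusted = sum(variance_of(s, 0) for s in samples) / len(samples)
avg_adjusted = sum(variance_of(s, 1) for s in samples) / len(samples)

print(pop_variance)    # the true variance of the population
print(avg_unadjusted)  # half of the true variance for samples of size 2
print(avg_adjusted)    # matches the true variance, up to floating point rounding
```

For samples of size n drawn this way, the unadjusted formula averages out to (n - 1) / n times the true variance, which is why the gap matters most for small data sets and shrinks as n grows.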

In [None]:
np.var(data)

6.285714285714286

Script:
* That's our unadjusted population variance calculation.

In [None]:
np.var(data, ddof=1)

7.333333333333333

Script:
* And that's our adjusted sample variance.
* You will often see that adjustment skipped, and it won't be a big difference for large data sets, but now you know the right way to do it.
* As for standard deviation and sample standard deviation, you just take the square root of the variance or sample variance respectively.
* That would be math.sqrt.

In [None]:
import math

In [None]:
math.sqrt(np.var(data))

2.5071326821120348

Script:
* But since you are already using NumPy, use `np.std` to calculate it directly.

In [None]:
np.std(data)

2.5071326821120348

Script:
* And the sample adjustment works the same way, by passing ddof equals one.

In [None]:
np.std(data, ddof=1)

2.70801280154532

Script:
* And now you can calculate the variance and standard deviation.
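
To tie the pure Python and NumPy versions together, here is a minimal sketch of a helper that mirrors the `ddof` argument used above. The function name `variance` and its error handling are illustrative choices, not part of NumPy or the video.

```python
import math


def variance(values, ddof=0):
    """Return the variance of values, dividing by len(values) - ddof.

    ddof=0 gives the population variance, ddof=1 the sample variance.
    """
    n = len(values)
    if n - ddof <= 0:
        # Avoid the divide by zero described earlier by failing with a clear message.
        raise ValueError("need more than ddof data points")
    mean = sum(values) / n
    return sum((v - mean)**2 for v in values) / (n - ddof)


data = [3, 4, 4, 5, 7, 2, 10]

population_variance = variance(data)      # compare with np.var(data)
sample_variance = variance(data, ddof=1)  # compare with np.var(data, ddof=1)
sample_std = math.sqrt(sample_variance)   # compare with np.std(data, ddof=1)

print(population_variance, sample_variance, sample_std)

# A single data point has no sample variance.
try:
    variance([5], ddof=1)
except ValueError as error:
    print(error)
```

Raising an error for a single data point is one possible design choice for this sketch; the plain formula from earlier would raise a `ZeroDivisionError` instead.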
