3.2 - Sample variance
-----------------------

The sample variance is the sum of the squared deviations from the sample mean, divided by the number of observations minus one.

The sample variance formula is defined as:

$$
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
$$

where:

- $s^2$ is the sample variance.
- $n$ is the sample size.
- $x_i$ are the individual values in the sample.
- $\bar{x}$ is the mean of the sample.
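
As a minimal sketch of the formula above, the function below computes $s^2$ step by step in plain Python. The name `sample_variance` and the input values are illustrative assumptions, not part of the original material.

```python
def sample_variance(values):
    """Return the sample variance s^2 of a sequence of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(values) / n                             # sample mean
    squared_devs = [(x - x_bar) ** 2 for x in values]   # (x_i - x_bar)^2
    return sum(squared_devs) / (n - 1)                  # divide by n - 1


# Hypothetical sample, chosen only for illustration
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(sample)
print(s2)
```

For this sample the mean is 5 and the squared deviations sum to 32, so the result is 32/7, or about 4.57.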

The larger the sample variance, the greater the dispersion of the data.

The sample variance cannot be negative, because it is a sum of squares divided by a positive number. It equals zero only if every deviation is zero, that is, if all observations coincide with the mean.

It follows that the variance of a constant is zero: if all observations are equal, there is no dispersion.
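
These two properties can be checked quickly with the standard library's `statistics.variance`, which uses the $n-1$ divisor. The sample values below are made up for illustration.

```python
import statistics

constant_sample = [7.5, 7.5, 7.5, 7.5]   # every observation equals the mean
varied_sample = [7.0, 7.5, 8.0, 9.5]     # observations differ from the mean

var_constant = statistics.variance(constant_sample)
var_varied = statistics.variance(varied_sample)

print(var_constant)   # no dispersion, so the variance is zero
print(var_varied)     # some dispersion, so the variance is positive
```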

The variance of the sum of two independent variables is the sum of their variances:

$$ V(x+y) = V(x) + V(y) $$
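
A rough numerical check of this property is sketched below, assuming two independently generated samples (the distributions, seed, and sample size are arbitrary choices for illustration). For finite samples the identity holds only approximately, because the sample covariance of two independent samples is close to zero but not exactly zero; the exact relation is $V(x+y) = V(x) + V(y) + 2\,\mathrm{Cov}(x, y)$.

```python
import random


def sample_variance(values):
    """Sample variance with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_covariance(a, b):
    """Sample covariance with the n - 1 divisor."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)


rng = random.Random(0)                     # fixed seed so the run is reproducible
n = 20_000
x = [rng.gauss(0, 2) for _ in range(n)]    # standard deviation 2
y = [rng.gauss(0, 3) for _ in range(n)]    # standard deviation 3, drawn independently of x
total = [a + b for a, b in zip(x, y)]

var_of_sum = sample_variance(total)
sum_of_vars = sample_variance(x) + sample_variance(y)
cov_xy = sample_covariance(x, y)

print(var_of_sum, sum_of_vars)             # close to each other
print(sum_of_vars + 2 * cov_xy)            # matches var_of_sum up to rounding
```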

**Two Problems:**

When we calculate the variance, the units of measurement are squared. This can produce quantities that are hard to interpret. For example, if we are evaluating the number of errors that students make on an exam, the variance is expressed in "errors squared", which has no natural meaning.

There is no clear benchmark for judging the size of the variance. A large value may reflect high variability in the data, but it may also arise simply because the values are large in magnitude, since rescaling the data by a factor $c$ multiplies the variance by $c^2$.
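
To illustrate both problems, the sketch below takes a made-up set of lengths and expresses them in two different units. The data and variable names are assumptions for illustration only.

```python
import statistics

lengths_m = [1.2, 1.5, 1.1, 1.8, 1.4]               # meters (hypothetical data)
lengths_cm = [100 * value for value in lengths_m]   # same lengths in centimeters

var_m = statistics.variance(lengths_m)     # units: square meters
var_cm = statistics.variance(lengths_cm)   # units: square centimeters

print(var_m, var_cm)
print(var_cm / var_m)   # ratio is about 100 ** 2 = 10,000
```

The spread of the lengths is the same in both cases. Only the unit changed, yet the variance differs by a factor of about 10,000 and is expressed in squared units.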

**Example 1**

The website www.fueleconomy.gov contains a wealth of information about the fuel characteristics of various vehicles. In addition to Environmental Protection Agency (EPA) mileage ratings, many vehicle owners have reported their own fuel efficiency values in miles per gallon (mpg). Consider the following sample of $n = 11$ efficiencies for the 2009 Ford Focus.

In [34]:
import pandas as pd

In [35]:
data = {'Car': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
       '$x_i$':[27.3, 27.9, 32.9, 35.2, 44.9, 39.9, 30.0, 29.7, 28.5, 32.0, 37.6]}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate the mean of the sample
mean = df['$x_i$'].mean()

# Calculate deviations from the mean
# (raw strings keep the LaTeX backslash from being read as an escape sequence)
df[r'$x_i - \overline{x}$'] = df['$x_i$'] - mean

# Calculate the square of each deviation from the mean
df[r'$(x_i - \overline{x})^2$'] = (df['$x_i$'] - mean) ** 2

# Print the output.
df.head(11)

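| | Car | $x_i$ | $x_i - \overline{x}$ | $(x_i - \overline{x})^2$ |
|---|---|---|---|---|
| 0 | 1 | 27.3 | -5.963636 | 35.564959 |
| 1 | 2 | 27.9 | -5.363636 | 28.768595 |
| 2 | 3 | 32.9 | -0.363636 | 0.132231 |
| 3 | 4 | 35.2 | 1.936364 | 3.749504 |
| 4 | 5 | 44.9 | 11.636364 | 135.404959 |
| 5 | 6 | 39.9 | 6.636364 | 44.041322 |
| 6 | 7 | 30.0 | -3.263636 | 10.651322 |
| 7 | 8 | 29.7 | -3.563636 | 12.699504 |
| 8 | 9 | 28.5 | -4.763636 | 22.692231 |
| 9 | 10 | 32.0 | -1.263636 | 1.596777 |
| 10 | 11 | 37.6 | 4.336364 | 18.804050 |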


In [36]:
# Calculate the sum of the squared deviations
sum_sq_deviations = df[r'$(x_i - \overline{x})^2$'].sum()

print("The sum of the squares of the deviations:", sum_sq_deviations)

The sum of the squares of the deviations: 314.1054545454545


In [37]:
# Calculate the sample variance: sum of squared deviations divided by n - 1
sample_variance = df[r'$(x_i - \overline{x})^2$'].sum() / (len(df) - 1)

print("Sample variance:", sample_variance)

Sample variance: 31.41054545454545


Python's standard library includes the `statistics` module, which performs all of these steps in a single call.

In [38]:
# Import the statistics module from the standard library
import statistics

# statistics.variance computes the sample mean itself
# when the optional xbar argument is not supplied
print(f"Variance of sample set is {statistics.variance(df['$x_i$'])}")

Variance of sample set is 31.410545454545453
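
As a further check on the role of the $n - 1$ divisor, the standard library also provides `statistics.pvariance`, which divides by $n$ instead. The sketch below re-enters the eleven mpg values from Example 1 so that it runs on its own; the variable names are illustrative.

```python
import statistics

mpg = [27.3, 27.9, 32.9, 35.2, 44.9, 39.9, 30.0, 29.7, 28.5, 32.0, 37.6]
n = len(mpg)

sample_var = statistics.variance(mpg)        # divides by n - 1
population_var = statistics.pvariance(mpg)   # divides by n

print(sample_var)
print(population_var)
# The two results differ only by the factor n / (n - 1)
print(population_var * n / (n - 1))
```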
