## Calculating the Mean, Variance and Standard Deviation

In this notebook, we'll discuss how to calculate and estimate three fundamental statistics: the mean, variance, and standard deviation. These statistics summarize your dataset's central tendency and variability. We'll assume familiarity with histograms and statistical distributions, especially the normal distribution.

The main goal is to estimate population parameters. In statistics, a population is the entire group that you want to draw conclusions about, while a sample is the specific group you collect data from. For instance, if you wanted to know the average height of all humans, all humans would be your population, but you might only collect data from a sample of people in one geographic area.

In reality, it's often not feasible to collect data from an entire population. Therefore, we collect data from a sample and use statistical methods to estimate the population parameters. We'll start by discussing how to calculate these parameters if you have data from an entire population and then discuss how to estimate these parameters from a sample.

Let's start by creating a dataset that we'll use to illustrate these concepts.

### References
- [Statistics Fundamentals](https://www.youtube.com/watch?v=74oD9D6tfuw&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR) video series by Josh Starmer.
- [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653/) by Wes McKinney.

In [13]:
import numpy as np

# Seed for reproducibility
np.random.seed(0)

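# Create the population: 240 million draws from a normal distribution
# with mean 20 and standard deviation 10 (about 1.9 GB of float64 values)
data = np.random.normal(loc=20, scale=10, size=240_000_000)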

This dataset represents the number of mRNA transcripts from gene X in each of 240 million liver cells, which we treat as the entire population. We simulated these numbers from a normal distribution with a mean of 20 and a standard deviation of 10. Now let's calculate the population parameters.

## Calculating the Mean

The mean, often called the average, is calculated by summing all the measurements and dividing by the number of measurements. In mathematical notation, the population mean $\mu$ is calculated as:

$$
\mu = \frac{1}{N}\sum_{i=1}^{N}x_{i}
$$

where $N$ is the number of measurements and $x_{i}$ is each individual measurement. Let's calculate the mean for our dataset.

In [14]:
# Calculate the population mean
mu = np.mean(data)
mu

20.00096812478495

As expected, our calculated population mean is very close to 20, the mean of the normal distribution we sampled from.

## Calculating the Variance

Variance is a measure of how spread out the measurements are from the mean. It's calculated as the average of the squared differences from the mean. In mathematical notation, the population variance $\sigma^2$ is calculated as:

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

where $\mu$ is the population mean. Let's calculate the variance.

In [15]:
# Calculate the population variance
variance = np.var(data)
variance

99.98689412135116

As expected, our calculated population variance is very close to 100, the square of the standard deviation of the normal distribution we sampled from.

However, because the variance is calculated by squaring the differences from the mean, its units are the square of the original units. To get a measure of spread in the original units, we take the square root of the variance to get the standard deviation.
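The definitions above can be checked by hand on a tiny array. The sketch below uses a hypothetical eight-value toy population (not part of the dataset in this notebook) and the illustrative names `toy`, `toy_mu`, `toy_var`, and `toy_sigma`. It applies the formulas directly and compares the results with NumPy's defaults.

```python
import numpy as np

# Hypothetical toy population, chosen so the arithmetic is easy to follow
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

N = toy.size
toy_mu = toy.sum() / N                     # population mean
toy_var = ((toy - toy_mu) ** 2).sum() / N  # population variance (divide by N)
toy_sigma = np.sqrt(toy_var)               # population standard deviation

print(toy_mu, toy_var, toy_sigma)  # 5.0 4.0 2.0

# With their default ddof=0, np.var and np.std use the same population formulas
assert np.isclose(toy_var, np.var(toy))
assert np.isclose(toy_sigma, np.std(toy))
```

If the toy values were measured in transcripts per cell, the variance (4.0) would be in squared transcripts per cell, while the standard deviation (2.0) is back in the original units.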

## Calculating the Standard Deviation

The standard deviation is simply the square root of the variance. In mathematical notation, the population standard deviation $\sigma$ is calculated as:

$$
\sigma = \sqrt{\sigma^2}
$$

Let's calculate the standard deviation.

In [16]:
# Calculate the population standard deviation
std_dev = np.std(data)
std_dev

9.999344684595643

As expected, our calculated population standard deviation is very close to 10, the standard deviation of the normal distribution we sampled from.

## Estimating the Mean, Variance and Standard Deviation

In reality, we rarely have data from an entire population. So, we usually need to estimate these parameters from a sample. Let's take a sample of five measurements from our population and then estimate these parameters. First, we'll draw the sample.

In [17]:
# Draw a sample of 5 measurements
sample = np.random.choice(data, 5)
sample

array([23.99498109, 27.19194348, 13.86405634, 14.60517994, 26.90015011])
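Note that `np.random.choice` samples with replacement by default, so the same measurement could in principle be drawn twice. With 240 million values and a sample of five this is very unlikely, but for small populations it matters. As a minimal sketch, assuming the newer `Generator` API and a hypothetical stand-in population of 1,000 values (the names `rng`, `small_population`, `sample_no_repl`, and `sample_repl` are illustrative), sampling without replacement could look like this:

```python
import numpy as np

# Hypothetical seed; this generator is independent of np.random.seed(0) above
rng = np.random.default_rng(0)

# Small stand-in population so the example runs quickly
small_population = rng.normal(loc=20, scale=10, size=1_000)

# replace=False guarantees that no measurement is picked twice
sample_no_repl = rng.choice(small_population, size=5, replace=False)

# replace=True (the default) allows the same measurement to appear more than once
sample_repl = rng.choice(small_population, size=5, replace=True)

print(sample_no_repl)
print(sample_repl)
```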

Now, let's estimate the population parameters from this sample.

## Estimating the Mean

The sample mean $\bar{x}$, which is our estimate of the population mean, is calculated in the same way as the population mean:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}
$$

where $n$ is the number of measurements in the sample and $x_{i}$ is each individual measurement. Let's calculate the sample mean.

In [18]:
# Calculate the sample mean
x_bar = np.mean(sample)
x_bar

21.311262188945257

## Estimating the Variance

The sample variance $s^2$, which is our estimate of the population variance, is calculated similarly to the population variance, but with two important differences. First, we calculate the differences from the sample mean instead of the population mean. Second, we divide by $n - 1$ instead of $n$. This is called Bessel's correction, and it removes the bias in the estimate of the population variance. It's necessary because the sample mean is, by construction, the value that minimizes the sum of squared differences for that particular sample, so the squared differences from the sample mean are on average smaller than the squared differences from the population mean. Without this correction, we would on average underestimate the population variance.
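The effect of dividing by $n - 1$ rather than $n$ can be seen in a small simulation. The sketch below is an illustration, not part of the original analysis: it draws many independent samples of size 5 directly from a normal distribution with mean 20 and standard deviation 10, then averages the two variance estimates. The seed, the number of repeated samples, and the names `samples`, `biased`, and `corrected` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed

n = 5                # sample size, as in this notebook
n_samples = 100_000  # number of repeated samples (an arbitrary choice)

# Each row is one independent sample of n measurements
samples = rng.normal(loc=20, scale=10, size=(n_samples, n))

biased = samples.var(axis=1, ddof=0).mean()     # divide by n
corrected = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(f"divide by n:     {biased:.1f}")
print(f"divide by n - 1: {corrected:.1f}")
```

On average, dividing by $n$ gives $\frac{n-1}{n}\sigma^2$, which is 80 here, so the first number should land near 80 and the second near the true variance of 100.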

In mathematical notation, the sample variance $s^2$ is calculated as:

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

where $\bar{x}$ is the sample mean. Let's calculate the sample variance.

In [19]:
# Calculate the sample variance
s_squared = np.var(sample, ddof=1)
s_squared

43.36321045338236
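To make the formula concrete, the same number can be reproduced step by step. The sketch below retypes the five sample values as printed above (rounded to 8 decimal places, so the last digits may differ slightly from the notebook output); the names `printed_sample`, `mean_by_hand`, `var_by_hand`, and `var_uncorrected` are illustrative.

```python
import numpy as np

# The five sample values as printed above (rounded to 8 decimal places)
printed_sample = np.array(
    [23.99498109, 27.19194348, 13.86405634, 14.60517994, 26.90015011]
)

n = printed_sample.size
mean_by_hand = printed_sample.sum() / n
squared_diffs = (printed_sample - mean_by_hand) ** 2

var_by_hand = squared_diffs.sum() / (n - 1)  # with Bessel's correction
var_uncorrected = squared_diffs.sum() / n    # what dividing by n would give

print(round(mean_by_hand, 4))     # 21.3113
print(round(var_by_hand, 4))      # 43.3632
print(round(var_uncorrected, 4))  # 34.6906

# Same result as NumPy with ddof=1
assert np.isclose(var_by_hand, np.var(printed_sample, ddof=1))
```

Dividing by $n$ instead of $n - 1$ would give an even smaller estimate for this sample, which is the direction of bias that Bessel's correction compensates for.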

## Estimating the Standard Deviation

The sample standard deviation $s$, which is our estimate of the population standard deviation, is simply the square root of the sample variance. In mathematical notation, the sample standard deviation $s$ is calculated as:

$$
s = \sqrt{s^2}
$$

Let's calculate the sample standard deviation.

In [20]:
# Calculate the sample standard deviation
s = np.std(sample, ddof=1)
s

6.585074825192373

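These are our estimates of the population mean, variance, and standard deviation based on a sample of five measurements. The sample mean (about 21.3) is close to the population mean (about 20.0), but the sample variance (about 43.4) and sample standard deviation (about 6.6) are well below the population values (about 100 and 10). With so few measurements, the estimates, especially of the variance, vary a lot from one sample to the next; with more data they would be more accurate and we would have more confidence in them. Even so, sampling saves a significant amount of time and resources compared to collecting data from the entire population.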
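How much accuracy improves with more data can be sketched with another simulation. The example below is an illustration under stated assumptions: it draws repeated samples directly from a normal distribution with mean 20 and standard deviation 10 instead of from `data`, and the seed, the sample sizes, the number of repeats, and the names `spreads` and `sample_means` are all hypothetical choices. For each sample size it measures how much the sample mean varies across repeats and compares that with the theoretical value $\sigma / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(7)  # hypothetical seed
sigma = 10         # population standard deviation used in this notebook
n_repeats = 2_000  # repeated samples per sample size (an arbitrary choice)

spreads = {}
for n in (5, 50, 500):
    # Each row is one sample of size n; take the mean of every row
    sample_means = rng.normal(loc=20, scale=sigma, size=(n_repeats, n)).mean(axis=1)
    spreads[n] = sample_means.std(ddof=1)
    print(
        f"n={n:>3}: spread of sample means = {spreads[n]:.2f}, "
        f"sigma/sqrt(n) = {sigma / np.sqrt(n):.2f}"
    )
```

The spread of the sample means should shrink roughly by a factor of $\sqrt{10}$ each time the sample size grows tenfold.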

In summary, we rarely have data from an entire population, so we usually estimate the population parameters from a sample. The mean provides a measure of central tendency, while the variance and standard deviation provide measures of variability. Understanding these statistics is fundamental to many areas of data analysis and machine learning.

## Additional References

- [Bessel's Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) on Wikipedia.
- [NumPy Documentation](https://numpy.org/doc/stable/) for more information on NumPy functions.