# TASK 3 - RESEARCH OF STANDARD DEVIATION FUNCTIONS

### Brief:
*The standard deviation of an array of numbers x is
calculated using numpy as **np.sqrt(np.sum((x - np.mean(x))^2)/len(x))** .
However, Microsoft Excel has **two different versions** of the standard deviation
calculation, **STDEV.P and STDEV.S** . The STDEV.P function performs the above
calculation but in the STDEV.S calculation **the division is by len(x)-1** rather
than **len(x)**. Research these Excel functions, writing a note in a Markdown cell
about the difference between them. Then use numpy to perform a simulation
demonstrating that the STDEV.S calculation is a better estimate for the standard
deviation of a population when performed on a sample. Note that part of this task
is to figure out the terminology in the previous sentence.*

## Differences between STDEV.P & STDEV.S

### STDEV.P:
STDEV.P is an Excel function that calculates the standard deviation of an entire population. A population data set contains all members of a specified group, that is, every possible data value. Its formula divides by the count **n**.

For example, the population may be "ALL people living in the US."
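
Written out, the calculation described above looks roughly like this. The symbols are notation chosen here for illustration (they are not Excel's): $x_1, \dots, x_n$ are the values and $\mu$ is their mean.

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$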

### STDEV.S:
STDEV.S is an Excel function that calculates the standard deviation of a sample. A sample data set contains a part, or a subset, of a population, so it is smaller than the population from which it is taken. Its formula divides by **n-1**.

Example: The sample may be "SOME people living in the US."
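
As a sketch in the same spirit, the sample version can be written as follows. Again the symbols are notation assumed here: $x_1, \dots, x_n$ are the sample values and $\bar{x}$ is the sample mean.

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$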

### Differences:
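The only difference between the formulae is the divisor: the population formula divides by n, while the sample formula divides by n-1. Dividing by n-1 (known as Bessel's correction) makes the sample variance an unbiased estimate of the population variance. Because the divisor is smaller, the sample standard deviation is always a **larger** number than the population formula gives for the same data, unless every value is identical.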
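
In numpy the same choice of divisor is available through the `ddof` argument of `np.std`, which divides by `n - ddof`. The snippet below is a minimal sketch on a small made-up array (`data` is a hypothetical example, not taken from the task) showing that `ddof=0` matches the STDEV.P calculation and `ddof=1` matches STDEV.S.

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to check by hand.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

stdev_p = np.std(data)          # ddof=0 is the default: divide by n
stdev_s = np.std(data, ddof=1)  # divide by n - 1

# The same two values written out from the formulas.
squared_dev = np.sum((data - np.mean(data)) ** 2)
manual_p = np.sqrt(squared_dev / len(data))
manual_s = np.sqrt(squared_dev / (len(data) - 1))

print("n divisor   (STDEV.P): %.4f" % stdev_p)
print("n-1 divisor (STDEV.S): %.4f" % stdev_s)
```

For this array the mean is 5 and the squared deviations sum to 32, so the n version is exactly 2.0 and the n-1 version is slightly larger.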

The example below computes both versions on a sample taken from a population of 100 random integers.


In [1]:
import numpy as np
import random

# Create dataset of 100 random integers between 1 and 20.
x = []
n = 100
numLow = 1
numHigh = 20

for i in range (0, n):
    x.append(random.randint(numLow, numHigh))
x.sort()

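# Calculating the population standard deviation.
stdevp = np.sqrt(np.sum((x - np.mean(x))**2)/len(x))
print("Entire population with STDEV.P = %1.4f" % stdevp)

# Create a sample set of x from four slices of the sorted list (39 of the 100 values).
y = x[1:10] + x[30:40] + x[60:70] + x[90:100]

# Calculating the sample standard deviation with the n-1 divisor (STDEV.S).
# The divisor is (len(y)-1) as a whole, so it needs its own parentheses.
stdevs_unbiased = np.sqrt(np.sum((y - np.mean(y))**2)/(len(y)-1))
print("Sample with n-1 divisor (STDEV.S) = %1.4f" % stdevs_unbiased)

# Calculating the sample standard deviation with the n divisor (STDEV.P formula).
stdevs_biased = np.sqrt(np.sum((y - np.mean(y))**2)/len(y))
print("Sample with n divisor (STDEV.P) = %1.4f" % stdevs_biased)

Entire population with STDEV.P = 5.3599
Sample with n-1 divisor (STDEV.S) = 6.3140
Sample with n divisor (STDEV.P) = 6.2325


### Results:

In this run the n-1 estimate (6.3140) is larger than the n estimate (6.2325), as it always is for the same data, and both overestimate the population value (5.3599). A single sample therefore cannot show which divisor is better. This sample is also not random: it is built from slices of the sorted list and contains nearly all of the ten lowest values and all of the ten highest, which likely inflates its spread. The case for n-1 is about averages: over many random samples, dividing by n underestimates the population variance by a factor of about (n-1)/n, and dividing by n-1 removes that bias.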
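
One run on one sample cannot settle which divisor is the better estimator, so the comparison is more convincing when it is repeated over many random samples and averaged. The following is a minimal sketch of such a simulation. The population size, sample size, number of trials, and seed are all assumptions chosen for illustration, and samples are drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed only so the run is repeatable

# Assumed population: 100,000 integers from 1 to 20 inclusive.
population = rng.integers(1, 21, size=100_000)
pop_var = np.var(population)   # ddof=0, the STDEV.P divisor
pop_std = np.sqrt(pop_var)

sample_size = 10   # small samples make the bias easy to see
n_trials = 20_000

# Each row is one random sample drawn (with replacement) from the population.
samples = rng.choice(population, size=(n_trials, sample_size))

var_n = samples.var(axis=1, ddof=0)          # divide by n
var_n_minus_1 = samples.var(axis=1, ddof=1)  # divide by n - 1

# Average each estimate over all trials.
mean_var_n = var_n.mean()
mean_var_n_minus_1 = var_n_minus_1.mean()
mean_std_n = np.sqrt(var_n).mean()
mean_std_n_minus_1 = np.sqrt(var_n_minus_1).mean()

print("Population variance:         %.4f" % pop_var)
print("Mean variance, n divisor:    %.4f" % mean_var_n)
print("Mean variance, n-1 divisor:  %.4f" % mean_var_n_minus_1)
print("Population std deviation:    %.4f" % pop_std)
print("Mean std dev, n divisor:     %.4f" % mean_std_n)
print("Mean std dev, n-1 divisor:   %.4f" % mean_std_n_minus_1)
```

Under these assumptions the averaged n-1 variance should land close to the population variance, while the averaged n variance should fall short by roughly a factor of (n-1)/n, here 9/10. The averaged n-1 standard deviation is typically still a little below the population value, but closer to it than the n version.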

### References:
* Population VS Sample Data; MathBitsNotebook.com; http://mathbitsnotebook.com/Algebra1/StatisticsData/STPopSample.html
* Measures of Spread; MathBitsNotebook.com; http://mathbitsnotebook.com/Algebra1/StatisticsData/STSpread.html
* Why we divide by n-1 for unbiased sample variance; Sal Khan; https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/more-standard-deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance