# Why is there an n-1 in the Sample Standard Deviation formula?

Often we can't calculate summaries for an entire population. Even when we have a large dataset, we are usually using it to theorise about a wider or longer-term population. That means all our summaries (averages, spread, skew etc.) are only estimates of how the wider population behaves.

If we work out the mean of a sample, it's usually pretty close to the mean of the population (which is handy: it means we can use the same formula for the sample mean and the population mean), but spread isn't so simple. A sample tends to understate the spread of the population it came from, because the differences are measured from the sample's own mean rather than from the true population mean. There are lots of ways we could estimate the standard deviation of the population; the most commonly used is the formula with n-1 instead of n. Dividing by n-1 makes the result you calculate for the sd ever so slightly larger, which corrects it so that it better represents what the real standard deviation for the whole population would be.
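
For reference, the two formulae can be written side by side. The symbols are assumptions introduced here rather than notation from the text above: $N$ and $\mu$ are the population size and mean, $n$ and $\bar{x}$ are the sample size and sample mean.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\qquad\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

The only differences are the divisor and which mean the differences are measured from.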

If you've come across the phrase 'degrees of freedom' (perhaps if you've had to do chi-squared or t-tests by hand) then you might already be familiar with this concept. Part of the calculation for standard deviation has us working out the difference between each value and the mean (or the estimated mean, in the case of a sample). Because these differences always sum to zero, we wouldn't really need to work out the last one: it's determined by the others. So instead of there being n differences to find there are only n-1, and one value isn't really 'free'.
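
A minimal sketch of that idea, using a handful of made-up values (the numbers and variable names are purely illustrative): the differences from the sample mean sum to zero, so the last one can be recovered from the others.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration
values = np.array([2.0, 4.0, 4.0, 5.0, 10.0])

# Differences between each value and the sample mean
deviations = values - values.mean()

# These differences sum to zero (up to floating-point rounding)
total = deviations.sum()

# So the last difference is fixed by the first n-1 of them
last_from_others = -deviations[:-1].sum()

print("Differences from the mean:", deviations)
print("Sum of the differences:", total)
print("Last difference:", deviations[-1], "recovered from the others:", last_from_others)
```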

The simulation below compares the n and n-1 formulae so you can see the difference for yourself.

In [32]:
#Let's first make up a population to work with
#we'll start with a Normal distribution - you choose the mean and standard deviation. 
#The rest of the simulation won't know what values you've chosen and we'll see if it gets close!
import numpy as np

MeanToGenerate = 5
SDToGenerate = 10

np.random.seed(1)
populationData = np.random.normal(loc = MeanToGenerate, scale = SDToGenerate, size = 500)

print('The actual population mean is', np.mean(populationData))
print('The actual population standard deviation is', np.std(populationData))


The actual population mean is 5.5343689425630656
The actual population standard deviation is 9.8840825928265


In [33]:
#Let's save these for comparison later:
populationMean = np.mean(populationData)
populationSD = np.std(populationData)

In [34]:
#Let's take a random sample of, say, 50 values and see what the mean and standard deviation are
#(note that np.random.choice samples with replacement by default):
sample1 = np.random.choice(populationData, size = 50)

mean1 = np.mean(sample1)
sd1 = np.std(sample1)
#np.std() uses the population (n) formula by default (ddof = 0); we need to set ddof = 1 to use the n-1 formula:
sd1sample = np.std(sample1, ddof = 1)

print("The first sample's mean is", mean1, 'compared to the population:', populationMean)
print("The first sample's standard deviation using the n formula is", sd1, 'or using the n-1 formula is:',sd1sample,'Compared to the population:', populationSD)

The first sample's mean is 5.099248791970257 compared to the population: 5.5343689425630656
The first sample's standard deviation using the n formula is 9.243868610241334 or using the n-1 formula is: 9.337717398141592 Compared to the population: 9.8840825928265


In [35]:
#Clearly there are some differences. Let's take another sample and average the sample means and sd's to see if that gets closer
sample2 = np.random.choice(populationData, size = 50)

mean2 = np.mean(sample2)
sd2 = np.std(sample2)
sd2sample = np.std(sample2, ddof = 1)

print("The two samples' means average out as", np.mean([mean1,mean2]), 'compared to the population:', populationMean)
print("The two samples' standard deviations (n formula) average out as", np.mean([sd1,sd2]),'and for the n-1 formula:',np.mean([sd1sample,sd2sample]), 'compared to the population:', populationSD)

The two samples' means average out as 5.866723267067535 compared to the population: 5.5343689425630656
The two samples' standard deviations (n formula) average out as 9.41859759777682 and for the n-1 formula: 9.514220329507594 compared to the population: 9.8840825928265


# Exercise
Create a loop that will repeat the above cell of code, but for many more samples.

Which of the two standard deviation formulae, applied to the samples, gets closer to the real standard deviation of the population?

In [57]:
#Solution
numberOfSamples = 100000
sampleSize = 50

sampleMeans = []
sampleSDn = []
sampleSDs = []

for count in range(numberOfSamples):
    sample = np.random.choice(populationData, size = sampleSize)
    sampleMeans.append(np.mean(sample))
    sampleSDn.append(np.std(sample))
    sampleSDs.append(np.std(sample, ddof = 1))
    
print("The sample means average out as", np.mean(sampleMeans), 'compared to the population:', populationMean)
print("The samples' standard deviations (n formula) average out as", np.mean(sampleSDn),'and for the n-1 formula:',np.mean(sampleSDs), 'compared to the population:', populationSD)

The sample means average out as 5.537498542152148 compared to the population: 5.5343689425630656
The samples' standard deviations (n formula) average out as 9.73823697255743 and for the n-1 formula: 9.837104857281311 compared to the population: 9.8840825928265


In [58]:
#Percentage errors

print('The percentage difference between the mean from the samples and the population mean is: ', round(100*(populationMean - np.mean(sampleMeans))/populationMean,2), '%')
print('The percentage difference between the SD from the samples (using the n formula) and the population SD is: ', round(100*(populationSD - np.mean(sampleSDn))/populationSD,2), '%')
print('The percentage difference between the SD from the samples (using the n-1 formula) and the population SD is: ', round(100*(populationSD - np.mean(sampleSDs))/populationSD,2), '%')


The percentage difference between the mean from the samples and the population mean is:  -0.06 %
The percentage difference between the SD from the samples (using the n formula) and the population SD is:  1.48 %
The percentage difference between the SD from the samples (using the n-1 formula) and the population SD is:  0.48 %
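
The n-1 formula gets closer but still comes out a little low. One likely explanation is that the n-1 correction makes the sample variance unbiased, while its square root (the standard deviation) still tends to sit slightly below the population value. The sketch below checks this on a stand-in population. The seed, the sizes and all the names are assumptions made for illustration, and it uses NumPy's `default_rng` generator rather than the `np.random.seed` approach used above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Stand-in population, generated the same way as populationData above
population = rng.normal(loc=5, scale=10, size=500)
population_var = np.var(population)
population_sd = np.std(population)

# Draw 20,000 samples of 50 values (with replacement): one sample per row
samples = rng.choice(population, size=(20000, 50))

# Average each estimator over all of the samples
mean_var_n = np.var(samples, axis=1).mean()           # variance, n formula
mean_var_n1 = np.var(samples, axis=1, ddof=1).mean()  # variance, n-1 formula
mean_sd_n1 = np.std(samples, axis=1, ddof=1).mean()   # sd, n-1 formula

print("Population variance:", population_var)
print("Average sample variance (n formula):", mean_var_n)
print("Average sample variance (n-1 formula):", mean_var_n1)
print("Population sd:", population_sd)
print("Average sample sd (n-1 formula):", mean_sd_n1)
```

In runs like this the averaged n-1 variance should land very close to the population variance, while the averaged n-1 standard deviation stays slightly under the population standard deviation.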


In [61]:
pip install plotly



In [62]:
import plotly.graph_objects as go

In [68]:
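#Average the statistics from all of the samples
samplesMean = np.mean(sampleMeans)
samplesSDp = np.mean(sampleSDn)  #n formula
samplesSDs = np.mean(sampleSDs)  #n-1 formula
#Generate a 'typical' sample of 50 from a Normal distribution with those averaged statistics
typicalSampleP = np.random.normal(loc = samplesMean, scale = samplesSDp, size = 50)
typicalSampleS = np.random.normal(loc = samplesMean, scale = samplesSDs, size = 50)

fig = go.Figure()

#Overlay the population with the two 'typical' samples
fig.add_trace(go.Histogram(x = populationData, nbinsx = 30, name = 'Population'))
fig.add_trace(go.Histogram(x = typicalSampleP, nbinsx = 25, name = 'Typical sample (n formula)'))
fig.add_trace(go.Histogram(x = typicalSampleS, nbinsx = 25, name = 'Typical sample (n-1 formula)'))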

fig.update_layout(barmode = 'overlay')

fig.update_traces(opacity = 0.6)

fig.show()

In [73]:
sample3 = np.random.choice(populationData, size = 50)
sample4 = np.random.choice(populationData, size = 50)
sample5 = np.random.choice(populationData, size = 50)

In [91]:
import plotly.figure_factory as ff

hist_data = [sample1, sample4, sample3]

In [92]:
#Labels are listed in the same order as hist_data = [sample1, sample4, sample3]
group_labels = ['Sample 1', 'Sample 4', 'Sample 3']
colors = ['#835AF1', '#7FA6EE', '#B8F7D4']

fig = ff.create_distplot(hist_data, group_labels, colors = colors, show_rug = False)
fig.show()

In [93]:
hist_dataP = [populationData]
group_labelsP = ['Population']

fig2 = ff.create_distplot(hist_dataP, group_labelsP, show_rug = False)
fig2.show()

In [88]:
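import numpy as np

#A small example to compare the two formulae directly
#(separate names so we don't overwrite populationSD from earlier)
data = [1,2,3,4,5]
exampleSampleSD = np.std(data, ddof = 1)  #n-1 formula
examplePopulationSD = np.std(data)        #n formula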
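
The same small example can be worked through by hand to see where the two results differ. This is a sketch using only the standard library; the variable names are illustrative.

```python
import math

data = [1, 2, 3, 4, 5]
n = len(data)

mean = sum(data) / n                              # 3.0
squared_devs = [(x - mean) ** 2 for x in data]    # [4.0, 1.0, 0.0, 1.0, 4.0]
sum_sq = sum(squared_devs)                        # 10.0

sd_n = math.sqrt(sum_sq / n)                      # divide by n:   sqrt(10 / 5)
sd_n_minus_1 = math.sqrt(sum_sq / (n - 1))        # divide by n-1: sqrt(10 / 4)

print(round(sd_n, 4), round(sd_n_minus_1, 4))     # → 1.4142 1.5811
```

The sum of squared differences is the same in both cases; only the divisor changes, so the n-1 result is always the larger of the two.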

In [89]:
!pip install chart-studio



In [90]:
import chart_studio as cs
username = 'VickyCrockett'
apiKey = 'YOUR_API_KEY'  #placeholder: use your own Chart Studio API key and never publish it

cs.tools.set_credentials_file(username = username, api_key = apiKey)

In [95]:
import chart_studio.plotly as py
import chart_studio.tools as tls

py.plot(fig, filename = 'plotly_samples', auto_open = True)

'https://plotly.com/~VickyCrockett/3/'

In [96]:
py.plot(fig2, filename = 'plotly_population', auto_open = True)

'https://plotly.com/~VickyCrockett/5/'