
# Demonstration of the Central Limit Theorem

In [None]:
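import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

def get_data(low, high, mean, std, size):
    """Draw `size` rounded scores from a normal(mean, std) truncated to [low, high]."""
    # truncnorm expects its bounds in standard-deviation units relative to the mean
    a, b = (low - mean) / std, (high - mean) / std
    return np.round(truncnorm(a, b, loc=mean, scale=std).rvs(size))

np.random.seed(seed=0)  # for reproducibility
lazy = get_data(0, 100, 30, 15, 400000)         # 400 000 scores centered near 30
hardworking = get_data(0, 100, 75, 10, 600000)  # 600 000 scores centered near 75
all_data = np.concatenate([lazy, hardworking])  # the whole population: 1 000 000 scores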

## Part 1: Description of the data

- Note: this dataset is randomly generated and does not reflect reality.

- Assume that we can obtain the math exam scores of all junior high school students in a country.

- This is called a **population**.

- Let's look at the distribution of math grades for this population.

In [None]:
plt.figure(figsize=(15,3))
plt.hist(all_data, bins=50)
plt.xticks(np.arange(0,101,5))
plt.yticks([])
plt.grid()
plt.title('Distribution of grades')
plt.show()

- This distribution actually makes a lot of sense, because:

    1. The population contains some lazy students (400 000 of them) whose scores are centered around 30
    2. The population contains many hard-working students (600 000 of them) whose scores are centered around 75
    
- We can also compute the mean and standard deviation of our population:

In [None]:
print('The mean of the data is', np.round(np.mean(all_data),2))
print('The standard deviation of the data is', np.round(np.std(all_data),2))

Let's call these two numbers $\mu$ and $\sigma$. ($\mu = 57.24, \ \sigma=24.55$)
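
As a rough sanity check on $\mu$ (assuming that truncation and rounding shift each group's mean only slightly), the population mean should be close to the weighted average of the two group centers, with weights $400000/1000000 = 0.4$ and $600000/1000000 = 0.6$:

$$\mu \approx 0.4 \times 30 + 0.6 \times 75 = 12 + 45 = 57$$

This is consistent with the computed value of 57.24.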

## Part 2: Select a sample

Now let's select a sample of 100 students and find the sample mean.

In [None]:
sample = np.random.choice(all_data, size=100)  # drawn with replacement (the default)
print('The mean of this sample is', np.round(np.mean(sample),2))

## Part 3: Select 5 samples

Now let's select 5 samples (again of size 100) and look at their means:

In [None]:
for i in range(1,6):
    sample = np.random.choice(all_data, size=100)
    print('The mean of sample', i, 'is', np.round(np.mean(sample),2))

## Part 4: Understanding the Central Limit Theorem

- Even if the underlying distribution is not normal, the means of samples are approximately normally distributed, provided the sample size is large enough (and the population has a finite variance).

- Furthermore, the mean of the sample means equals the population mean $\mu$.

- And the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size, that is, $\sigma \big/ \sqrt{\text{sample size}}$.
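
As a worked example, plug in the population values from Part 1 ($\mu = 57.24$, $\sigma = 24.55$) and write $n$ for the sample size. For samples of $n = 100$ students, the theorem predicts that the sample means are centered at $57.24$ with a standard deviation of

$$\frac{\sigma}{\sqrt{n}} = \frac{24.55}{\sqrt{100}} = \frac{24.55}{10} = 2.455$$

If the normal approximation holds, roughly 95% of sample means should then land within about two of these standard deviations of $\mu$, that is, between about 52.3 and 62.2.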

## Part 5: Select 10 000 samples

We draw 10 000 samples of size 100 and store their means in an array.

In [None]:
# Draw 10 000 x 100 values at once, then reshape into 10 000 samples (rows) of size 100
samples = np.random.choice(all_data, 10000*100).reshape(10000,100)
sample_means = samples.mean(axis=1)  # one mean per sample

## Part 6: Visualize sample means

In [None]:
plt.figure(figsize=(15,3))
plt.hist(sample_means, bins=50)
plt.yticks([])
plt.grid()
plt.title('Distribution of sample means')
plt.show()

Hey! It looks very normal to me!
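
Eyeballing a histogram can be backed up with a couple of numbers. The sketch below is one possible check, not part of the original notebook: it computes skewness and excess kurtosis, both of which are 0 for a normal distribution. To stay self-contained it builds its own stand-in bimodal population (`population` is an assumption that only mimics `all_data` on a smaller scale) and uses a separate seeded generator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded generator, independent of the notebook's global seed

# Stand-in bimodal population (an assumption: NOT the notebook's all_data):
# 40% of scores centered near 30, 60% centered near 75, clipped to [0, 100].
population = np.concatenate([
    rng.normal(30, 15, 40_000),
    rng.normal(75, 10, 60_000),
]).clip(0, 100)

# 10 000 samples of size 100, drawn with replacement, one mean per sample
sample_means = rng.choice(population, size=(10_000, 100)).mean(axis=1)

# For a normal distribution, skewness and excess kurtosis are both 0
pop_skew = stats.skew(population)
means_skew = stats.skew(sample_means)
means_kurt = stats.kurtosis(sample_means)  # Fisher definition: normal -> 0

print('Skewness of the population:', np.round(pop_skew, 3))
print('Skewness of the sample means:', np.round(means_skew, 3))
print('Excess kurtosis of the sample means:', np.round(means_kurt, 3))
```

The population itself is clearly skewed, while both statistics for the sample means should come out close to 0, which is what "looks very normal" means in numbers.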

## Part 7: Mean and standard deviation of sample means

In [None]:
print('The expected mean of sample means is', np.round(np.mean(all_data),2))
print('The expected standard deviation of sample means is', np.round(np.std(all_data)/np.sqrt(100), 3))

In [None]:
print('The mean of sample means is', np.round(np.mean(sample_means),2))
print('The standard deviation of sample means is', np.round(np.std(sample_means),3))

Very close!
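
The $\sqrt{\text{sample size}}$ in the denominator can also be checked by repeating the experiment with other sample sizes. The sketch below is a hedged illustration rather than part of the original analysis: it uses a smaller stand-in population (`population` is an assumption, not `all_data`) and the sample sizes 25, 100 and 400 are arbitrary choices. Quadrupling the sample size should roughly halve the spread of the sample means.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, independent of the notebook's global seed

# Stand-in bimodal population (an assumption, not the notebook's all_data)
population = np.concatenate([
    rng.normal(30, 15, 40_000),
    rng.normal(75, 10, 60_000),
]).clip(0, 100)
sigma = population.std()  # population standard deviation (ddof=0)

results = {}
for n in (25, 100, 400):
    # 5 000 samples of size n, drawn with replacement, one mean per sample
    means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    observed = means.std()          # spread of the simulated sample means
    predicted = sigma / np.sqrt(n)  # spread predicted by the theorem
    results[n] = (observed, predicted)
    print('n =', n, '| observed:', np.round(observed, 3), '| predicted:', np.round(predicted, 3))
```

For each sample size, the observed and predicted values should agree to within a few percent.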