# CLT Illustrated

Let's first load our required packages. 

## Import packages

In [69]:
from numpy.random import normal
import numpy as np
import random 
import pandas as pd
import altair as alt

In [70]:
alt.renderers.enable('default')  # the 'html', 'notebook', and 'mimetype' renderers didn't work; with no renderer set, charts won't render on GitHub

RendererRegistry.enable('default')

In [71]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

Let's also define a helpful function that we'll need for later. 

In [72]:
def get_sample_means(pop, n, size):
    """
    Draw n samples of `size` observations each from pop and
    return a list of the sample means.
    
    Arguments
    ---------
    pop (Series): A pandas series with values to draw samples from. 
    n   (int)   : The number of samples to draw from the population.
    size (int)  : The number of observations in a single sample. 
    
    Returns 
    -------
    list
        A list of sample means. 
    """
    sample_means = []
    for i in range(n):
        # pull a sample from the population 
        sample = pop.sample(n=size)
        
        # compute the sample mean 
        sample_mean = np.mean(sample)
        
        # append the sample mean to the list of means 
        sample_means.append(sample_mean)
        
    return sample_means
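To make the function's inputs and output concrete, here is a minimal usage sketch. It restates the function in compact form so the snippet runs on its own, and it uses a hypothetical toy population (`toy_pop`) and a seeded generator that are assumptions for illustration, not part of the notebook's data.

```python
import numpy as np
import pandas as pd


def get_sample_means(pop, n, size):
    # Compact restatement of the function above so this snippet is self-contained.
    return [pop.sample(n=size).mean() for _ in range(n)]


# Hypothetical toy population: 1,000 normally distributed values.
rng = np.random.default_rng(0)
toy_pop = pd.Series(rng.normal(loc=1.6, scale=0.6, size=1_000))

# Draw 200 samples of 30 observations each and collect their means.
means = get_sample_means(toy_pop, n=200, size=30)

print(len(means))        # one mean per sample
print(np.mean(means))    # should sit close to toy_pop.mean()
print(np.std(means))     # spread of the sample means
```

The returned list has one entry per sample, so `n` controls how many means you get and `size` controls how many observations go into each one.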

Going back to our idea of measuring human height, let's create our own population of humans and define what the mean and standard deviation of height will be. Let's define four things: 

**1)** That human height is normally distributed in the population 

**2)** The population has 1 million people. 

**3)** The average human height will be 1.6 meters 

**4)** The standard deviation of the height will be 0.6 meters 

With these assumptions in place, let's build our population of humans using code. 


In [73]:
np.random.seed(1234)  # seed NumPy's global generator, which numpy.random.normal draws from

# define the population parameters 
pop_mean = 1.6
pop_sd = 0.6
#pop_num = 1_000_000
pop_num = 100

# generate the finite population 
pop = normal(loc=pop_mean, scale=pop_sd, size=pop_num)

pop = pd.DataFrame(data=pop, columns=['heights'])
pop.head()

|   | heights  |
|---|----------|
| 0 | 1.760583 |
| 1 | 2.751395 |
| 2 | 1.721516 |
| 3 | 0.755301 |
| 4 | 1.816925 |


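Okay, now we have a dataframe holding our population of heights. Note that the code above sets `pop_num = 100`, with the 1 million value commented out, so this run contains 100 heights. Let's plot what this distribution looks like. 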

In [74]:
pop_vline = (alt.Chart(pop)
             .mark_rule(color='yellow')
             .encode(alt.X('mean(heights):Q', 
                           title='Heights')))

plot = (alt.Chart(pop)
        .transform_density(
            'heights',
            as_=['Heights', 'density'])
        .mark_area()
        .encode(
            alt.X('Heights:Q'),
            alt.Y('density:Q'))
        .properties(title='Distribution of Height in the Population'))

plot + pop_vline

Okay, looks good. Here's what our population looks like. The yellow line marks the mean height in the population, which falls very close to 1.6, the value we set for `pop_mean` in the code. Because the population is a finite set of random draws, its mean will be near 1.6 but not exactly equal to it. 
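A quick numerical check can complement the plot. The sketch below regenerates a population with the same parameters and compares its realized mean with the target of 1.6. The seeded `default_rng` generator is an assumption made for illustration and is not the generator used in the cell above, so the exact numbers will differ from the notebook's. For a finite number of independent draws, the realized mean typically differs from the target by roughly the standard error, `pop_sd / sqrt(pop_num)`.

```python
import numpy as np
import pandas as pd

# Parameters copied from the population cell above.
pop_mean, pop_sd, pop_num = 1.6, 0.6, 100

# Seeded generator: an illustrative choice, not the notebook's own.
rng = np.random.default_rng(1234)
pop = pd.DataFrame({"heights": rng.normal(loc=pop_mean, scale=pop_sd, size=pop_num)})

# Mean of the finite population we actually generated.
realized_mean = pop["heights"].mean()

# Standard error of the mean of pop_num independent draws.
standard_error = pop_sd / np.sqrt(pop_num)

print(f"target mean:    {pop_mean:.4f}")
print(f"realized mean:  {realized_mean:.4f}")
print(f"standard error: {standard_error:.4f}")
```

With `pop_num = 100` the standard error is 0.06, so a visible gap between the yellow line and 1.6 is expected; raising `pop_num` shrinks that gap.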