# The Central Limit Theorem

In [34]:
import numpy as np
import pandas as pd
import scipy
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
%matplotlib inline

## Create populations

Two populations are generated from two different binomial distributions. 

In [35]:
pop1 = np.random.binomial(10, 0.2, 10000)
pop2 = np.random.binomial(10, 0.5, 10000)

Let's take one sample ($N = 100$) from each population.

In [36]:
# Sample with replacement 
sample1 = np.random.choice(pop1, 100, replace=True)
sample2 = np.random.choice(pop2, 100, replace=True)

# Print out the mean and the standard deviation of each sample
print('{} is the mean of the sample from population 1.'.format(sample1.mean()))
print('{} is the standard deviation of the sample from population 1.'.format(sample1.std()))
print('{} is the mean of the sample from population 2.'.format(sample2.mean()))
print('{} is the standard deviation of the sample from population 2.'.format(sample2.std()))

1.9 is the mean of the sample from population 1.
1.1357816691600549 is the standard deviation of the sample from population 1.
5.22 is the mean of the sample from population 2.
1.5784802817900512 is the standard deviation of the sample from population 2.
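
For reference, the sample statistics above can be compared with the theoretical values of the two binomial distributions. A Binomial($n$, $p$) variable has mean $np$ and standard deviation $\sqrt{np(1-p)}$. The helper name `binomial_mean_sd` below is hypothetical and is introduced only for this sketch.

```python
import numpy as np


def binomial_mean_sd(n, p):
    """Return the theoretical mean and standard deviation of Binomial(n, p)."""
    return n * p, np.sqrt(n * p * (1 - p))


# Same parameters as pop1 and pop2 above (n = 10, p = 0.2 and p = 0.5)
for label, p in [("population 1", 0.2), ("population 2", 0.5)]:
    mean, sd = binomial_mean_sd(10, p)
    print("{}: theoretical mean = {:.3f}, theoretical SD = {:.3f}".format(label, mean, sd))
```

The theoretical values are a mean of 2 and a standard deviation of about 1.265 for population 1, and a mean of 5 and a standard deviation of about 1.581 for population 2, so both samples land close to them.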


## Tasks

 1. Increase the size of your samples from 100 to 1000, then calculate the means and standard deviations for your new samples and create histograms for each.  Repeat this again, decreasing the size of your samples to 20.  What values change, and what remain the same?
 
    I expect to see smaller standard deviations when the size of samples increases and larger standard deviations when it decreases. In both cases, sample means would be roughly the same.

In [37]:
# Increase N to 1000
sample_large1 = np.random.choice(pop1, 1000, replace=True)
sample_large2 = np.random.choice(pop2, 1000, replace=True)
# Print out the mean and the standard deviation of each sample
print('{} is the mean of the large sample from population 1.'.format(sample_large1.mean()))
print('{} is the standard deviation of the large sample from population 1.'.format(sample_large1.std()))
print('{} is the mean of the large sample from population 2.'.format(sample_large2.mean()))
print('{} is the standard deviation of the large sample from population 2.'.format(sample_large2.std()))

# Add line break 
print('\n')

# Decrease N to 20
sample_small1 = np.random.choice(pop1, 20, replace=True)
sample_small2 = np.random.choice(pop2, 20, replace=True)
# Print out the mean and the standard deviation of each sample
print('{} is the mean of the small sample from population 1.'.format(sample_small1.mean()))
print('{} is the standard deviation of the small sample from population 1.'.format(sample_small1.std()))
print('{} is the mean of the small sample from population 2.'.format(sample_small2.mean()))
print('{} is the standard deviation of the small sample from population 2.'.format(sample_small2.std()))

1.913 is the mean of the large sample from population 1.
1.3059215137212496 is the standard deviation of the large sample from population 1.
4.964 is the mean of the large sample from population 2.
1.5743900406189058 is the standard deviation of the large sample from population 2.


2.2 is the mean of the small sample from population 1.
1.16619037896906 is the standard deviation of the small sample from population 1.
5.1 is the mean of the small sample from population 2.
1.7 is the standard deviation of the small sample from population 2.
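
The task also asks for histograms of each sample. The following is a minimal sketch of one way to draw them, not the notebook's original code. It assumes freshly generated populations (`pop_a`, `pop_b`) so the notebook's own variables are left untouched, uses NumPy's `default_rng` with an arbitrary seed for reproducibility, and introduces a hypothetical helper `plot_sample_histograms`.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
pop_a = rng.binomial(10, 0.2, 10000)  # same parameters as pop1
pop_b = rng.binomial(10, 0.5, 10000)  # same parameters as pop2


def plot_sample_histograms(pop_first, pop_second, sizes=(20, 100, 1000)):
    """Draw one sample of each size from both populations and plot histograms."""
    fig, axes = plt.subplots(1, len(sizes), figsize=(4 * len(sizes), 3),
                             sharex=True, squeeze=False)
    bins = np.arange(-0.5, 11.5, 1)  # one bin per integer outcome 0..10
    for ax, n in zip(axes[0], sizes):
        first = rng.choice(pop_first, n, replace=True)
        second = rng.choice(pop_second, n, replace=True)
        ax.hist(first, bins=bins, alpha=0.5, label="sample 1")
        ax.hist(second, bins=bins, alpha=0.5, label="sample 2")
        # Dashed vertical lines mark the two sample means
        ax.axvline(first.mean(), color="C0", linestyle="--")
        ax.axvline(second.mean(), color="C1", linestyle="--")
        ax.set_title("N = {}".format(n))
        ax.legend()
    return fig


fig = plot_sample_histograms(pop_a, pop_b)
```

Each panel uses counts on the vertical axis, so the bars grow taller with $N$ while the centers and widths of the two histograms stay in roughly the same place.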


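Neither the means nor the standard deviations differ much from those of the original samples ($N = 100$). This is expected: the sample standard deviation estimates the population standard deviation, which does not depend on $N$. What shrinks as $N$ grows is the spread of the sample mean across repeated samples (the standard error).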
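
The spread of the sample mean across repeated samples, roughly $\sigma / \sqrt{N}$, can be checked by simulation. The sketch below is an illustration under stated assumptions: it regenerates a population with the same parameters as `pop1`, uses an arbitrary seed, and defines a hypothetical helper `spread_of_means` that draws many samples of a given size.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
population = rng.binomial(10, 0.2, 10000)  # same parameters as pop1


def spread_of_means(pop, n, repeats=2000):
    """Draw `repeats` samples of size n (with replacement).

    Returns the average within-sample SD and the SD of the sample means.
    """
    samples = rng.choice(pop, size=(repeats, n), replace=True)
    return samples.std(axis=1).mean(), samples.mean(axis=1).std()


for n in (20, 100, 1000):
    avg_sd, se = spread_of_means(population, n)
    predicted = population.std() / np.sqrt(n)
    print("N = {:4d}: typical sample SD = {:.3f}, SD of sample means = {:.3f}, "
          "sigma/sqrt(N) = {:.3f}".format(n, avg_sd, se, predicted))
```

The within-sample standard deviation should stay near the population value for every $N$, while the standard deviation of the sample means should track $\sigma / \sqrt{N}$ and shrink as $N$ grows.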

2. Change the probability value (`p` in the [NumPy documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.binomial.html)) for `pop1` to 0.3, then take new samples and compute the t-statistic and p-value. Then change the probability value p for group 1 to 0.4, and do it again.  What changes, and why?

   When we change $p$ from 0.2 to 0.3, the difference between the two populations becomes smaller. As a result, the $t$-value should be smaller and the $p$-value larger. When $p$ changes further from 0.3 to 0.4, the two populations become even closer, so the $t$-value should shrink again and the $p$-value should grow again.

In [38]:
# Change p to 0.3 in population 1
pop3 = np.random.binomial(10, 0.3, 10000)
# Sample from the new population
sample3 = np.random.choice(pop3, 100, replace=True)
# Calculate t- and p-values
test1 = ttest_ind(sample2, sample3, equal_var=False)
# Show the results
print('The t-value is {} and the p-value is {}'.format(test1[0], test1[1]))

# Change p to 0.4 in population 1
pop4 = np.random.binomial(10, 0.4, 10000)
# Sample from the new population
sample4 = np.random.choice(pop4, 100, replace=True)
# Calculate t- and p-values
test2 = ttest_ind(sample2, sample4, equal_var=False)
# Show the results
print('The t-value is {} and the p-value is {}'.format(test2[0], test2[1]))

The t-value is 11.21849634526823 and the p-value is 6.409144504987356e-23
The t-value is 6.237419860875322 and the p-value is 2.6448116447464545e-09


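With $p = 0.3$ the $t$-value is about 11.2 and the $p$-value about $6.4 \times 10^{-23}$; with $p = 0.4$ the $t$-value drops to about 6.2 and the $p$-value rises to about $2.6 \times 10^{-9}$. As the populations move closer together, the evidence for a difference weakens, although both differences remain highly significant.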
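
To make the link between the population difference and the $t$-value explicit, the statistic reported by `ttest_ind(..., equal_var=False)` (Welch's $t$-test) can be reproduced by hand as $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$, where $s^2$ is the sample variance with $N - 1$ in the denominator. The sketch below assumes samples drawn directly from the binomial distributions with an arbitrary seed, and `welch_t` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import ttest_ind


def welch_t(a, b):
    """Welch's t-statistic: difference in means over its estimated standard error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # ddof=1 gives the unbiased sample variance, matching ttest_ind
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
x = rng.binomial(10, 0.5, 100)
y = rng.binomial(10, 0.3, 100)

manual_t = welch_t(x, y)
scipy_t = ttest_ind(x, y, equal_var=False)[0]
print("manual t = {:.4f}, scipy t = {:.4f}".format(manual_t, scipy_t))
```

Because the numerator is the difference in sample means, moving the two populations closer together shrinks $t$ when the variances and sample sizes stay roughly the same.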

3. Change the distribution of your populations from binomial to a distribution of your choice. Do the sample mean values still accurately represent the population values?

   If we change both populations to Beta distributions, sample means should still represent their population means. Moreover, according to the central limit theorem, if we take a large number of samples of sufficient size from a population, the sample means will be approximately normally distributed around the true population mean. This holds for any distribution with finite variance, including Beta.

First, create two populations from two different Beta distributions.

In [40]:
pop_beta1 = np.random.beta(8, 2, 10000)
pop_beta2 = np.random.beta(5, 5, 10000)

In [42]:
# Generate one sample from each population
sample_beta1 = np.random.choice(pop_beta1, 1000, replace=True)
sample_beta2 = np.random.choice(pop_beta2, 1000, replace=True)

# Calculate sample and population means 
print(pop_beta1.mean())
print(sample_beta1.mean())
print(pop_beta2.mean())
print(sample_beta2.mean())

0.8002939665930171
0.8045765180396223
0.4977934056836968
0.5028989898354519


As can be seen, the sample means still accurately represent the population means: each is within about 0.005 of its population value.
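
The central limit theorem itself concerns the distribution of many sample means, not a single sample. A minimal sketch of that check follows, assuming a regenerated Beta(8, 2) population, an arbitrary seed, and a hypothetical helper `sample_means`. Skewness is used here as a rough indicator of how far a distribution is from the symmetric normal shape.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
pop_beta = rng.beta(8, 2, 10000)  # left-skewed population, same parameters as pop_beta1


def sample_means(pop, n, repeats=5000):
    """Return the means of `repeats` samples of size n drawn with replacement."""
    return rng.choice(pop, size=(repeats, n), replace=True).mean(axis=1)


means = sample_means(pop_beta, 100)

print("population mean = {:.4f}, mean of sample means = {:.4f}".format(
    pop_beta.mean(), means.mean()))
print("sigma/sqrt(N) = {:.4f}, SD of sample means = {:.4f}".format(
    pop_beta.std() / np.sqrt(100), means.std()))
print("population skewness = {:.3f}, skewness of sample means = {:.3f}".format(
    skew(pop_beta), skew(means)))
```

The sample means should center on the population mean with a spread close to $\sigma / \sqrt{N}$, and their skewness should be much closer to zero than that of the population itself.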