# Math 032 - Summer 2024
## Authors: ChatGPT and Adrien Peltzer

### Introduction
In this notebook, we illustrate the Central Limit Theorem (CLT) by sampling from a chosen distribution and plotting the normalized sample means. The CLT states that, for independent draws from a distribution with finite variance, the distribution of the normalized sample mean approaches the standard normal distribution as the sample size grows, regardless of the shape of the original distribution.

### Parameters
- `n_samples`: Number of independent samples to draw (one sample mean is computed per sample)
- `sample_size`: Number of observations in each sample
- `distribution`: Distribution to sample from (`'uniform'` or `'normal'`)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Parameters
n_samples = 1000  # Number of samples
sample_size = 30  # Size of each sample
distribution = 'uniform'  # Can be 'uniform' or 'normal'

### Sampling from the specified distribution
We draw `n_samples` samples of `sample_size` observations each from the chosen distribution: Uniform(0, 1) for `'uniform'`, or the standard normal for `'normal'`. To switch distributions, change the `distribution` variable.

In [None]:
# Sampling from the specified distribution
if distribution == 'uniform':
    samples = np.random.uniform(0, 1, (n_samples, sample_size))
elif distribution == 'normal':
    samples = np.random.normal(0, 1, (n_samples, sample_size))
else:
    raise ValueError("Unsupported distribution type. Choose 'uniform' or 'normal'.")
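
The draws above change on every run. If reproducible output is wanted, one option is NumPy's `Generator` API with a fixed seed. This is a minimal sketch, not the notebook's own method; the seed value and the name `rng` are assumptions introduced for illustration.

```python
import numpy as np

# Assumption: the seed 32 is arbitrary and not part of the original notebook.
rng = np.random.default_rng(seed=32)

n_samples = 1000   # number of samples (rows)
sample_size = 30   # observations per sample (columns)

# Each row is one sample of `sample_size` draws from Uniform(0, 1).
samples = rng.uniform(0, 1, size=(n_samples, sample_size))

print(samples.shape)  # (1000, 30)
```

Re-running this cell with the same seed gives the same `samples` array, which makes the later plots repeatable.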

### Calculating sample means
Next, we calculate the mean of each sample, giving one mean per row of `samples`.

In [None]:
# Calculating sample means
sample_means = np.mean(samples, axis=1)
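
As a sanity check on this step, the sample means can be compared with what theory predicts for Uniform(0, 1): a mean of 1/2 and a standard error of sigma / sqrt(n), with sigma = sqrt(1/12). The sketch below assumes the uniform case and uses a seeded generator (`rng`, an assumed name) so that it is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for repeatability only
n_samples, sample_size = 1000, 30

samples = rng.uniform(0, 1, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)  # one mean per row

# Theoretical values for Uniform(0, 1)
mu = 0.5
sigma = np.sqrt(1 / 12)
standard_error = sigma / np.sqrt(sample_size)

print(f"mean of sample means: {sample_means.mean():.4f} (theory: {mu})")
print(f"std of sample means:  {sample_means.std(ddof=1):.4f} (theory: {standard_error:.4f})")
```

The two printed pairs should agree closely, though not exactly, since the sample means are themselves random.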

### Normalizing the sample means
We normalize the sample means by subtracting their empirical mean and dividing by their empirical standard deviation, so the result has a mean of 0 and a standard deviation of 1.

In [None]:
# Normalizing the sample means
normalized_sample_means = (sample_means - np.mean(sample_means)) / np.std(sample_means)
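
The normalization above uses the empirical mean and standard deviation of the sample means. The CLT itself is usually stated with the population values, Z = (sample mean - mu) / (sigma / sqrt(n)). The following sketch compares the two for the uniform case; the names `z_theoretical` and `z_empirical` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, for repeatability only
sample_size = 30
samples = rng.uniform(0, 1, size=(1000, sample_size))
sample_means = samples.mean(axis=1)

# Population parameters of Uniform(0, 1)
mu, sigma = 0.5, np.sqrt(1 / 12)

# Normalization with the known population mean and standard error
z_theoretical = (sample_means - mu) / (sigma / np.sqrt(sample_size))

# Normalization with the empirical mean and standard deviation
z_empirical = (sample_means - sample_means.mean()) / sample_means.std()

print(f"theoretical: mean={z_theoretical.mean():.3f}, std={z_theoretical.std():.3f}")
print(f"empirical:   mean={z_empirical.mean():.3f}, std={z_empirical.std():.3f}")
```

The empirical version has mean 0 and standard deviation 1 by construction, while the theoretical version is only approximately standardized; for plotting purposes the two look nearly identical.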

### Plotting the results
Finally, we plot a histogram of the normalized sample means and overlay the standard normal distribution (bell curve).

In [None]:
# Plotting
plt.figure(figsize=(12, 6))

# Histogram of normalized sample means
plt.hist(normalized_sample_means, bins=30, density=True, alpha=0.6, color='g')

# Plotting the bell curve (Standard Normal Distribution)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, 0, 1)
plt.plot(x, p, 'k', linewidth=2)

title = (
    'Central Limit Theorem Illustration\n'
    f'(Sampling from a {distribution} distribution)'
)
plt.title(title)
plt.xlabel('Normalized Sample Mean')
plt.ylabel('Density')  # density=True scales the histogram to integrate to 1
plt.grid(True)
plt.show()
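
The plot shows a single sample size. To see the "as the sample size grows" part of the theorem, one can track how the skewness of the sample means shrinks toward 0 (the skewness of a normal distribution) as `sample_size` increases. The sketch below uses an exponential distribution, which is not one of the notebook's two options; it is chosen here only because it is strongly skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # assumed seed, for repeatability only
n_samples = 5000

skewness_by_size = {}
for sample_size in (1, 5, 30, 200):
    # Assumption: Exponential(1) stands in for "a skewed distribution".
    samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
    sample_means = samples.mean(axis=1)
    skewness_by_size[sample_size] = stats.skew(sample_means)
    print(f"sample_size={sample_size:>3}: skewness of sample means = "
          f"{skewness_by_size[sample_size]:.3f}")
```

For the exponential distribution the skewness of the sample mean is 2 / sqrt(n) in theory, so the printed values should fall steadily as the sample size increases.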

### Childhood Obesity Data

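We would now like to check whether some *real-world data* is normally distributed. We use body-measurement data from NHANES (the National Health and Nutrition Examination Survey) and test two columns: `BMXWAIST` (waist circumference) and `BMXWT` (body weight).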

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro, probplot

In [None]:
# Load the data
data = pd.read_csv('NHANESChildObesity.csv')
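
Survey data often has missing measurements, so it can be worth counting them before running any test. Whether this particular CSV has gaps is not stated in the notebook, so the sketch below uses a small made-up DataFrame (`toy`, hypothetical) in place of the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: 50 made-up weights with 2 gaps.
rng = np.random.default_rng(5)
toy = pd.DataFrame({"BMXWT": rng.normal(40, 10, size=50)})
toy.loc[[3, 17], "BMXWT"] = np.nan

# Count missing values, then keep only the observed ones.
n_missing = int(toy["BMXWT"].isna().sum())
clean = toy["BMXWT"].dropna()

print(f"missing: {n_missing}, remaining: {len(clean)}")
```

The same two calls, `isna().sum()` and `dropna()`, apply unchanged to the columns of `data`.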

In [None]:
# Extract the relevant columns
columns_to_test = ['BMXWAIST', 'BMXWT']
data_to_test = data[columns_to_test]

# Perform the Shapiro-Wilk test for normality
for column in columns_to_test:
    # Drop missing values, if any, so they do not affect the test
    values = data_to_test[column].dropna()
    stat, p = shapiro(values)
    print(f'{column} - Statistic={stat:.4f}, p={p:.4g}')
    if p > 0.05:
        print(f'{column} looks Gaussian (fail to reject H0)')
    else:
        print(f'{column} does not look Gaussian (reject H0)')
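
To get a feel for how the Shapiro-Wilk test behaves, it can help to run it on synthetic data where the answer is known. This sketch compares a normal sample with a skewed (exponential) one; both data sets are made up for illustration and are unrelated to the NHANES file.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)  # assumed seed, for repeatability only

# Synthetic data: one truly normal sample, one clearly non-normal sample.
normal_data = rng.normal(loc=0, scale=1, size=500)
skewed_data = rng.exponential(scale=1.0, size=500)

for name, values in [("normal", normal_data), ("exponential", skewed_data)]:
    stat, p = shapiro(values)
    verdict = "fail to reject H0" if p > 0.05 else "reject H0"
    print(f"{name}: W={stat:.4f}, p={p:.4g} ({verdict})")
```

The exponential sample should be rejected with a very small p-value. The normal sample will usually, but not always, pass: at the 0.05 level about 1 in 20 truly normal samples is rejected by chance.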


In [None]:

# Visualizations
for column in columns_to_test:
    plt.figure(figsize=(12, 6))

    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data_to_test[column], kde=True)
    plt.title(f'Histogram of {column}')

    # Q-Q plot
    plt.subplot(1, 2, 2)
    probplot(data_to_test[column].dropna(), dist="norm", plot=plt)
    plt.title(f'Q-Q Plot of {column}')

    plt.tight_layout()
    plt.show()
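
A Q-Q plot is read by eye, but `probplot` also returns the correlation coefficient `r` of the least-squares line through the plotted points, which gives a rough numeric summary of how straight the plot is. The sketch below computes it for synthetic normal and exponential samples; the data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(4)  # assumed seed, for repeatability only
normal_data = rng.normal(size=500)
skewed_data = rng.exponential(size=500)

# Without plot=..., probplot only computes the quantile pairs and the line fit:
# it returns (theoretical quantiles, ordered data), (slope, intercept, r).
(_, _), (_, _, r_normal) = probplot(normal_data, dist="norm")
(_, _), (_, _, r_skewed) = probplot(skewed_data, dist="norm")

print(f"normal sample:      r = {r_normal:.4f}")
print(f"exponential sample: r = {r_skewed:.4f}")
```

Values of `r` very close to 1 indicate points lying near a straight line, as expected for normal data; the skewed sample gives a visibly lower value.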
