# Central Limit Theorem (CLT) - Revisited

## 📚 Introduction to the CLT

ℹ️ You can skip this section of the `Recap - Central Limit Theorem - Revisited` if you had time to read it during the `Challenge - The Central Limit Theorem - A first approach`

**Two convergence theorems revolutionized the disciplines of probability and statistics:**
- **`LLN`: the Law of Large Numbers**
- **`CLT`: the Central Limit Theorem**

🧑🏻‍🏫 What is the CLT ? According to [Wikipedia](https://en.wikipedia.org/wiki/Central_limit_theorem)

> In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends towards a `Gaussian/normal distribution` (informally a `bell curve`) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

## 🎯  Interpreting and experimenting with the CLT

🎯  Let's illustrate how to use the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) in a dataset:

* Given a population, let's consider a feature (example: size, weight, salary, etc...) for each individual.


🚀  The important takeaway of these two theorems is that **whatever the shape of the distribution** of a given feature over the population **is**, **the distribution of the (sampled) mean<u>S</u> tends to be Gaussian**:
* `the mean of the means` = $ \mu$ (Law of Large Numbers)
* `the standard deviation of the means` = $ \frac{\sigma}{\sqrt{n}} $  (Central Limit Theorem)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/IllustrationCentralTheorem.png/400px-IllustrationCentralTheorem.png)

💡  We can wrap it up the following way:

$$ \large \bar{X} \approx_{n \rightarrow \infty} \mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}}) $$
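
🧮 Where the $ \frac{\sigma}{\sqrt{n}} $ comes from can be sketched in a few steps. This is only a sketch, assuming the $n$ draws $X_1, \dots, X_n$ (notation introduced here for illustration) are independent and identically distributed, with mean $\mu$ and finite variance $\sigma^2$:

$$ \mathbb{E}[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \mu $$

$$ \mathrm{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n} \quad \Rightarrow \quad \mathrm{std}(\bar{X}) = \frac{\sigma}{\sqrt{n}} $$

Note that in the notation $ \mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}}) $ used here, the second parameter is the standard deviation, not the variance.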

👩🏻‍🔬  Let's verify this experimentally!

##  🚀 Let's get started !

👉 In this challenge, we will use the `tips` dataset from the `seaborn` library to illustrate the Central Limit Theorem.

In [None]:
# Data Manipulation
import numpy as np
import pandas as pd
# Scientific libraries
import scipy.stats as stats
# Data Visualisation
import seaborn as sns
import matplotlib.pyplot as plt

❓ Load the `tips` dataset from `seaborn` into a `df` variable and display the first rows ❓

<details>
    <summary>Hint</summary>
    You can use <a href="https://seaborn.pydata.org/generated/seaborn.load_dataset.html"><code>seaborn.load_dataset</code></a>
</details>

In [None]:
# YOUR CODE HERE

## 🧐 Exploratory Data Analysis (EDA)

❓ How many rows are available in the dataset ❓

In [None]:
# YOUR CODE HERE

📊 Plot the distribution of the `total_bill` column.

In [None]:
# YOUR CODE HERE

❓ What is the [**skewness**](https://whatis.techtarget.com/definition/skewness) value of this distribution ❓

In [None]:
# YOUR CODE HERE

❓ Create two variables `mu` and `sigma` to store the mean and standard deviation of the `total_bill` column ❓

In [None]:
# YOUR CODE HERE

## 🎲 Sampling the mean

👉 Pick randomly - and with replacement - 10 rows of the dataset, and compute the mean $ \bar{X} $ of this sample.

Run this cell a few times.
* Do you get the same result each time?
* Did you expect it ?

In [None]:
# Draw 10 rows at random, with replacement, and compute the sample mean
df["total_bill"].sample(10, replace=True).mean()

👉 Create a `means` list storing the means of $N$ samples (each of size $n$).

Start with $n = 5$ and $N = 10$

📊 In the same cell, **plot** the distribution of `means`. 

🕵️‍♀️ Let's play with the *sample size n* and the *number of samples N*:
- Keep *N* constant, increase *n*. What do you observe ?
    - Plot a grid of 6 distributions playing with  $ n \in \{ 1, 5, 50, 100, 500, 1000 \}$
- Then, fix *n* and test a range of values for *N*. What do you observe ?
    - Plot a grid of 6 distributions playing with  $ N \in \{ 10, 20, 30, 50, 100, 500 \}$

<b>Playing with the `sample size`</b>:

In [None]:
# YOUR CODE HERE

💡 As *n* increases:
* the distribution of the means concentrates around the theoretical mean $ \mu $ (LLN)
* the spread around $ \mu $ tends towards 0 (indeed, the standard deviation of the means is $ \large \frac{\sigma}{\sqrt{n}} \rightarrow_{n \rightarrow \infty} 0 $)
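
🧪 The shrinking spread can be checked numerically. The sketch below does not use the `tips` data: it assumes a hypothetical right-skewed population drawn from an exponential distribution, and the names `population`, `spread` and `rng` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical right-skewed population (assumption: exponential, not the tips data)
population = rng.exponential(scale=20.0, size=100_000)
mu, sigma = population.mean(), population.std()

N = 2_000    # number of samples, kept fixed
spread = {}  # observed standard deviation of the means, per sample size n

for n in [1, 5, 50, 500]:
    # Draw N samples of size n (with replacement) and average each sample
    samples = rng.choice(population, size=(N, n), replace=True)
    means = samples.mean(axis=1)
    spread[n] = means.std()
    print(
        f"n={n:>3} | mean of means={means.mean():.2f} (mu={mu:.2f}) | "
        f"std of means={spread[n]:.2f} (sigma/sqrt(n)={sigma / np.sqrt(n):.2f})"
    )
```

For each `n`, the observed standard deviation of the means should sit close to `sigma / np.sqrt(n)`, up to sampling noise.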

<b>Playing with the `number_of_samples`</b>:

In [None]:
# Number of customers in one sample
n = 200

# Number of samples
list_of_N = [10, 20, 30, 50, 100, 500]

# Plot 6 graphs: 2 rows by 3 columns for the 6 values of N
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
for N, ax in zip(list_of_N, axes.flat):
    # Draw N samples of size n (with replacement) and store each sample mean
    means = [df["total_bill"].sample(n, replace=True).mean() for _ in range(N)]
    ax.set_title(f"N={N}")
    ax.set_xlim(0, 40)
    sns.histplot(means, bins=10, ax=ax)


💡 As *N* increases:
* the distribution of the means is less noisy

## 👩🏻‍💻 Verifying the CLT with simulations:

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/IllustrationCentralTheorem.png/400px-IllustrationCentralTheorem.png)

$$ \large \bar{X} \approx_{n \rightarrow \infty} \mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}}) $$

🔥 Let's verify the Central Limit Theorem computationally 

For each value of `n`:
- Compare `mu` with the mean of means
- Compare `sigma` with the standard deviation of the means (don't forget the $\sqrt{n}$ adjustment)
- Compute the `skewness` of the sampling distribution
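
🧪 One possible shape for these three checks is sketched below. It assumes a hypothetical exponential population instead of the `tips` data, and `sampling_summary` is an illustrative helper, not part of any library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical skewed population (assumption: exponential, not the tips data)
population = rng.exponential(scale=20.0, size=100_000)
mu, sigma = population.mean(), population.std()


def sampling_summary(n, N=5_000):
    """Summarise the distribution of N sample means, each computed on n draws."""
    means = rng.choice(population, size=(N, n), replace=True).mean(axis=1)
    return {
        "mean_of_means": means.mean(),             # to compare with mu
        "rescaled_std": means.std() * np.sqrt(n),  # to compare with sigma
        "skewness": stats.skew(means),             # near 0 for a Gaussian shape
    }


results = {n: sampling_summary(n) for n in [2, 30, 300]}

print(f"population: mu={mu:.2f}, sigma={sigma:.2f}, skew={stats.skew(population):.2f}")
for n, summary in results.items():
    print(n, {name: round(float(value), 2) for name, value in summary.items()})
```

As `n` grows, the rescaled standard deviation should stay close to `sigma` while the skewness of the means moves towards 0.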

In [None]:
# YOUR CODE HERE

In [None]:
print(f"mu: {round(mu,2)}")
print(f"sigma: {round(sigma,2)}")
print(f"skew: {round(skew,2)}")
print(f"kurtosis: {round(kurtosis,2)}")

In [None]:
# YOUR CODE HERE

## ⭐️ Real-life application of the CLT

👉 Let's consider `n` =  100 rows **sampled from the dataset**. What is the probability that the cumulative total bill is **lower than 1800€**? 

🚀 As a rule of thumb, `n > 30` is enough to apply the Central Limit Theorem. The distribution of the sampled means approximately follows a **`Gaussian Distribution`** (also referred to as a **`Normal Distribution`**)
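
🧪 A cumulative bill below 1800€ over `n = 100` rows is the same event as a sample mean below 1800 / 100 = 18€, which is what lets us use the distribution of the sampled means. A minimal sketch of that reasoning follows; `mu_example` and `sigma_example` are made-up placeholder values, not the ones computed from the dataset.

```python
import numpy as np
from scipy import stats

# Placeholder values for illustration only (assumption, not computed from the data)
mu_example, sigma_example = 20.0, 9.0
n = 100
total_threshold = 1800

# "Sum of n bills < 1800" is equivalent to "mean of n bills < 1800 / n"
mean_threshold = total_threshold / n

# Approximate distribution of the sample mean according to the CLT
sampling_dist = stats.norm(loc=mu_example, scale=sigma_example / np.sqrt(n))

probability = sampling_dist.cdf(mean_threshold)
print(f"P(mean < {mean_threshold}) = {probability:.4f}")
```

With the real data, the same pattern applies with `mu` and `sigma` in place of the placeholders.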

❓ Plot the  **`pdf`** (a.k.a. `probability density function`) of the sampled means of the total bills. You can use ***[`scipy.stats.norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)***

$$ \large \bar{X} \approx_{n \rightarrow \infty} \mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}}) $$



In [None]:
# YOUR CODE HERE

🧑🏻‍🏫 We'll revisit the concept of Gaussian Variable during the lecture `Statistical Inference`. For this kind of random

❓ What is the probability we are looking for? Use the `cdf` method to find it.

In [None]:
# YOUR CODE HERE

❓ Compute the z-score for the value `18€`
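
🧪 As a reminder, the z-score of a sample mean $x$ is $ z = \frac{x - \mu}{\sigma / \sqrt{n}} $, and the probability read on the standard normal at $z$ matches the one read on $ \mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}}) $ at $x$. The sketch below illustrates this with made-up placeholder values (`mu_example`, `sigma_example`), not the dataset's.

```python
import numpy as np
from scipy import stats

# Placeholder values for illustration only (assumption, not computed from the data)
mu_example, sigma_example, n = 20.0, 9.0, 100
target = 18

# Standardise the target with the standard deviation of the sample mean
standard_error = sigma_example / np.sqrt(n)
z = (target - mu_example) / standard_error

# Same probability, computed on N(0, 1) and on N(mu, sigma / sqrt(n))
p_from_z = stats.norm.cdf(z)
p_direct = stats.norm.cdf(target, loc=mu_example, scale=standard_error)
print(f"z = {z:.2f}, P(Z < z) = {p_from_z:.4f}, direct = {p_direct:.4f}")
```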

In [None]:
# YOUR CODE HERE

❓ Plot the standard normal distribution $ \mathcal{N}(0, 1) $ and a red dot for the target (use the `pdf`)

In [None]:
# YOUR CODE HERE

🏁 And... that's the end of this module for today !

💾 Again, you know the routine: `git add/commit/push` !

---


🎉 Massive congratulations for making it to the end of the two Mathematics modules! 


🤩 If you fell in love with math, look at the following video: [**`The Map of Mathematics`**](https://www.youtube.com/watch?v=OmJ-4B-mS-Y) (11 min - 9M views on Youtube)



<img src="https://live.staticflickr.com/272/32264483720_c51bdde679_n.jpg">



---

📆 So, now **how do you prepare yourself to enter the world of `Decision Science`** ? 

- 🐍 **`Python For Data Science`** :
    - the further we progress in the bootcamp, the more important it is to be proficient in Python, so that you can focus on the new Data Science concepts and not on programming questions !
    - _Example_: think about a professional tennis player: before a game, he/she works out a strategy with the coach to beat the opponent, not how to hit a good serve or a topspin !
    
- 🔢 **`SQL`**:
    - Mastering database queries is at the heart of any _Data Science and Analytics job_, even before Python for Data Analysts
    - It is fundamental that you master how to _join tables_ with SQL: we will re-use the concept of _merging tables_ extensively with _Pandas_, and you will have to do it not only during your projects but also afterwards in your next job
    
- 🐼 **`Pandas/Numpy`**
    - The more expert you are at manipulating data with these two libraries, the more you can focus on adding value to your analysis 
    - The same way Excel masters outshine their colleagues in Finance, Pandas/Numpy wizards are much faster at handling new data.
    
- 🧮 **`Maths`**:
    - `Algebra`: You must be comfortable dealing with Matrices, DataFrames, Numpy Arrays! We will use them everywhere, even for Computer Vision and Image Preprocessing. We do not ask you to become an expert by taking a full university Linear Algebra course like [MIT - Gilbert Strang - Linear Algebra](https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/) (Open Source course on MIT OpenCourseWare), but at least be proficient with Matrix Shapes, Transpose, Inverse and Matrix Multiplication
    - `Probability and Statistics`: We will re-use Gaussian Distributions and the Central Limit Theorem during the next module, so make sure you understand these concepts and list all the remaining questions you would like to ask your teachers and TAs !

👉 These topics should be your priorities! You can review the challenges related to Data Visualisation and Data Sourcing a bit later, for example if you still have time after reviewing the priorities or before starting your Data Science Projects.


👋 See you soon ! 👋

---