# Lecture 3: Central Limit Theorem
---

## Probability and Statistics - Recap

* Random experiments are experiments whose outcome cannot be predicted with certainty.
    * Tossing a coin
* Sample space is the set of all possible outcomes of a random experiment.
    * S = {H, T}
* An outcome is just an element of the sample space.
    * H is the outcome corresponding to getting a heads on a coin toss
* An event is a subset of the sample space.
    * A = {H}: event A occurs when you get heads on a single coin toss

## Random Variables
---

* Random variable is a function that maps the sample space to R (real numbers).
    * X : S --> R (w is used to indicate an outcome, i.e. an element of S)
    
$$
X(w) =  \left\{ \begin{array}{lr} 1, & \text{if } w = H\\ 0, & \text{if } w = T\\ \end{array}\right.
$$

* The cumulative distribution function (c.d.f) is a function F: R --> [0,1] defined by $F(x) = P(X \le x)$, where X is the random variable and x is an arbitrary real number.
    * $X$ ~ $F$ indicates F is the distribution of r.v. X

$$
P(a < X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a)
$$

$$
P(X < a) = F(a^-)
$$
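
The two c.d.f. identities above can be checked numerically. The sketch below assumes a fair six-sided die modelled with `scipy.stats.randint`; the die and the values of `a` and `b` are illustrative choices, not part of the lecture.

```python
from scipy.stats import randint

# Assumed example: a fair six-sided die, uniform on {1, ..., 6}.
die = randint(1, 7)  # the upper bound is exclusive

a, b = 2, 5

# P(a < X <= b) = F(b) - F(a)
p_interval = die.cdf(b) - die.cdf(a)

# The same probability obtained by adding up the p.m.f. over {3, 4, 5}.
p_interval_from_pmf = die.pmf([3, 4, 5]).sum()

# P(X < 3) = F(3^-); for an integer-valued X this is F(2).
p_less_than_3 = die.cdf(2)

print(p_interval, p_interval_from_pmf, p_less_than_3)
```

For a discrete random variable, $F(a^-)$ differs from $F(a)$ whenever $P(X = a) > 0$, which is why the strict inequality needs the left limit.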

#### 1. Discrete r.v.

* Discrete random variable - A random variable X is said to be discrete if the set of all possible values of X is a sequence $x_1, x_2, \dots$ (i.e. a finite or countably infinite sequence)
    
* Probability mass function (p.m.f) p: R --> [0,1] such that $p(x) = P(X = x)$.

$$
\sum_{i} p(x_i) = 1
$$

* Relation between c.d.f and p.m.f

$$
F(a) = \sum_{x_i \le a} p(x_i)
$$
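
As a small illustration of the relation between the p.m.f. and the c.d.f., the sketch below uses a hypothetical loaded die; the probabilities are made up for the example.

```python
import numpy as np

# Hypothetical loaded die: possible values and their probabilities.
values = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])


def cdf(a):
    """F(a): sum of p(x_i) over all x_i <= a."""
    return pmf[values <= a].sum()


print("sum of p.m.f.:", pmf.sum())
print("F(3)   =", cdf(3))
print("F(3.5) =", cdf(3.5))  # same as F(3): no mass between 3 and 4
print("F(0)   =", cdf(0))    # no mass at or below 0
```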

#### 2. Continuous r.v.

* Continuous random variable - A random variable X is said to be continuous if there exists a non-negative function f(x) defined for all $x \in R$ such that for any set B of real numbers, we have $P(X \in B) = \int_B f(x) dx$.
    
* Probability density function (p.d.f) is the above defined f(x).

$$
\int_R f(x) dx = 1
$$

* Relation between c.d.f and p.d.f

$$
F(a) = \int_{-\infty}^{a} f(x) dx
$$
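
The same relation can be checked for a continuous random variable by integrating the p.d.f. numerically. This is a minimal sketch that assumes a standard normal distribution and uses `scipy.integrate.quad`; the choice of distribution and of `a` is only for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Assumed example: X ~ N(0, 1).
# The p.d.f. integrates to 1 over the whole real line.
total_area, _ = quad(norm.pdf, -np.inf, np.inf)

# F(a) is the integral of f(x) from -infinity to a.
a = 1.0
cdf_by_integration, _ = quad(norm.pdf, -np.inf, a)
cdf_direct = norm.cdf(a)

print(total_area, cdf_by_integration, cdf_direct)
```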

#### 3. Expectation

* Expectation is the average or mean value of a random variable.
* For discrete random variable,

$$
E[X] = \sum_{x} x \, p(x)
$$

* For continuous random variable,

$$
E[X] = \int_{R} x \, f(x) \, dx
$$

* The expectation of a sum of random variables is equal to the sum of their expectations, $\textbf {irrespective of mutual independence}$, i.e.

$$
E[\sum_{i=1}^{n}X_i] = \sum_{i=1}^{n}E[X_i]
$$

* $E[c] = c$ where c is a constant
* $E[cX] = cE[X]$ where c is a constant
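
Linearity of expectation can be illustrated with a quick simulation in which the two variables are deliberately dependent. The sketch below assumes $X \sim U(0, 1)$ and $Y = X^2$; these choices, the seed and the sample size are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ Uniform(0, 1); Y = X^2 is completely determined by X (not independent).
x = rng.uniform(0, 1, size=100_000)
y = x**2

mean_of_sum = np.mean(x + y)
sum_of_means = np.mean(x) + np.mean(y)

# Theory: E[X] = 1/2 and E[X^2] = 1/3, so E[X + Y] = 5/6.
print(mean_of_sum, sum_of_means, 5 / 6)
```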

#### 4. Variance

* Variance of X is defined by $Var(X) = E[(X-\mu)^2]$, where $\mu = E[X]$
* The variance of a sum of random variables is equal to the sum of their variances $\textbf {if all the random variables are mutually independent}$ (pairwise uncorrelated variables are already enough), i.e.

$$
Var(\sum_{i=1}^{n}X_i) = \sum_{i=1}^{n}Var(X_i)
$$

* $Var(c) = 0$
* $Var(X + c) = Var(X)$ where c is a constant
* $Var(cX) = c^{2}Var(X)$ where c is a constant
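
The variance properties can be checked on simulated data. The sketch below assumes two independent normal samples; the distributions, the seed and the constant `c` are illustrative choices. It also shows that additivity fails for a dependent pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent samples (assumed): X ~ N(0, 1) and Y ~ N(0, 2^2).
x = rng.normal(0.0, 1.0, size=n)
y = rng.normal(0.0, 2.0, size=n)
c = 3.0

var_shift = np.var(x + c)        # Var(X + c) = Var(X)
var_scale = np.var(c * x)        # Var(cX) = c^2 Var(X)
var_sum_indep = np.var(x + y)    # close to Var(X) + Var(Y) for independent X, Y
var_sum_dep = np.var(x + x)      # X + X = 2X, so this is 4 Var(X), not 2 Var(X)

print(var_shift, np.var(x))
print(var_scale, c**2 * np.var(x))
print(var_sum_indep, np.var(x) + np.var(y))
print(var_sum_dep, 2 * np.var(x))
```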

## Some common distributions
---
Consider X as the random variable.

````{tabbed} Bernoulli
* In the Bernoulli distribution, the r.v. can take only two possible values, 0 or 1, with X=1 occurring with probability p.

|    Distribution   |   Type     |   p.m.f      | E[X] | Var(X) |
| :------------     |  -------------        |  -------------        |----:| ----:  |
| Bernoulli         |Discrete |    $p(x) =  \left\{ \begin{array}{lr} 1-p, & \text{if } x = 0\\ p, & \text{if } x = 1\\ \end{array}\right. $  |  $p$ | $p(1-p)$ |

```{image} ../assets/2022_01_10_central_limit_theorem_notebook/bernoulli-pmf.png
:name: Bernoulli p.m.f.
```

````

````{tabbed} Binomial

* The Binomial distribution gives the probability of getting r successes $(r\le n)$ in n independent Bernoulli trials, where each trial succeeds with probability p and fails with probability q $(q = 1-p)$.

|    Distribution   |   Type     |   p.m.f    | E[X] | Var(X) |
| :------------     |  -------------        |  -------------        |----:| ----:  |
| Binomial          |Discrete |$ p(r) = {n \choose r }p^{r}q^{n-r}$|$np$|$np(1-p)$|

```{image} ../assets/2022_01_10_central_limit_theorem_notebook/binomial-pmf.png
:name: Binomial p.m.f.
```
````

````{tabbed} Gaussian

* The normal/Gaussian distribution is completely specified by its expectation and variance.

|    Distribution   |   Type     |   p.d.f      | E[X] | Var(X) |
| :------------     |  -------------        |  -------------        |----:| ----:  |
| Normal/Gaussian   |Continuous |  $ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp{(-\frac{1}{2}{(\frac{x-\mu}{\sigma})}^2)}$|$\mu$|$\sigma^2$|

```{image} ../assets/2022_01_10_central_limit_theorem_notebook/gaussian-pdf.png
:name: Gaussian p.d.f.
```
````
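
The means and variances of these three distributions can be cross-checked with `scipy.stats`. The parameter values below (`p`, `n`, `mu`, `sigma`) are arbitrary assumptions chosen for illustration.

```python
from scipy.stats import bernoulli, binom, norm

# Assumed parameters, chosen only for illustration.
p, n = 0.3, 10
mu, sigma = 2.0, 1.5

summary = {
    "Bernoulli": (bernoulli(p).mean(), bernoulli(p).var()),   # p, p(1-p)
    "Binomial": (binom(n, p).mean(), binom(n, p).var()),      # np, np(1-p)
    "Gaussian": (norm(loc=mu, scale=sigma).mean(),
                 norm(loc=mu, scale=sigma).var()),            # mu, sigma**2
}

for name, (mean, var) in summary.items():
    print(f"{name}: E[X] = {mean:.4f}, Var(X) = {var:.4f}")
```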

## Intuition for CLT
---

#### 1. Probability distributions

Let's say we want to explore the statistical properties of the heights of human adults. Let us divide the range of possible adult human heights, say from 4 feet to 7 feet, into bins of size 1 foot. We measure the height of a random person and put it in the corresponding bin; for example, a measurement of 5.6 feet goes into the bin [5 feet, 6 feet). When we take multiple such measurements and stack them on top of each other in their respective bins, we get a histogram. This histogram tells us about the distribution of height among the people we measured. If we decrease the bin size and increase the number of measurements, we can make more precise statements and estimates about the distribution.

However, a bin may remain empty simply because no measurement that we sampled fell in its range. That does not mean that no one in the entire population falls in that range. So, we use a curve to approximate the histogram and provide us with the same information. The advantage of a curve is that it accounts for the empty bins and we do not need to take care of the bin size while sampling.

We don't have enough money and time to measure the entire population. The curve fitted using the mean and standard deviation of the data we collect is usually a good enough estimate for the entire population.

```{admonition} Takeaway
The probability distribution (as the name suggests) shows us how the probability of measurements is distributed.
```

We plot a histogram of the height data obtained from the [Kaggle weight-height dataset](https://www.kaggle.com/mustafaali96/weight-height) and try to estimate a curve that fits the data.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("../../weight-height.csv")
arr = list(df['Height'])

# Fit a normal distribution to the data:
mu, std = norm.fit(arr)

# Plot the histogram.
plt.hist(arr, bins=25, density=True)

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = '''Heights histogram
''' + "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)
plt.xlabel("Height x (inches)")
plt.ylabel("p(x)")
plt.show()
```

```{image} ../assets/2022_01_10_central_limit_theorem_notebook/prob-dist.png
:name: Probability distribution
```

#### 2. Sampling a distribution

If we ask a computer to pick a measurement at random based on the probability described by the histogram/curve, it is likely to fall in the taller region of the histogram/curve. Every once in a while the computer will also pick some measurement from the shorter region of the histogram/curve.

```{admonition} Takeaway
Sampling a distribution (again as the name suggests) is getting samples from a distribution based on the probabilities described by the distribution.
```

Why do we have to sample from a distribution? We can use the computer to generate a lot of samples, plug them into various statistical tests and make inferences about the real-world data. Since we know the distribution the samples were generated from, we can compare our expectations of what will happen with what actually happens. So, sampling allows us to determine what statistical tests are capable of without doing much work in data collection.
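
A minimal sketch of sampling from a known distribution with NumPy follows. The mean and standard deviation (`mu`, `sigma`) are hypothetical height parameters in inches, not values fitted to the dataset above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" distribution of heights in inches (assumed values).
mu, sigma = 66.0, 4.0

# Draw 10,000 samples from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=10_000)

# Because the generating distribution is known, the sample statistics
# can be compared directly with the parameters used to generate them.
print("sample mean:", samples.mean(), "expected:", mu)
print("sample std: ", samples.std(), "expected:", sigma)
```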

#### 3. Normal or Gaussian distribution

The normal or Gaussian distribution is a bell-shaped, symmetric curve. In the context of our adult human heights example, the x-axis represents the heights and the y-axis represents the probability density, i.e. how likely we are to observe someone with a height close to that value.

Properties
- A normal distribution is always centred at its average value.
- A tall, narrow curve means the quantity on the x-axis is spread over a smaller range.
- The width of the curve is captured by the standard deviation $\sigma$
- About 68% of the measurements fall within $\pm 1 \sigma$ of the mean
- About 95% of the measurements fall within $\pm 2 \sigma$ of the mean
- About 99.7% of the measurements fall within $\pm 3 \sigma$ of the mean
- Given an average value and the standard deviation, we can draw a unique normal distribution.
- We observe this curve a lot in nature, such as for distributions of heights, weights, commute times, etc.
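
The 68%, 95% and 99.7% figures follow from the normal c.d.f. As a quick check, the sketch below uses `scipy.stats.norm` on the standard normal; the percentages are the same for any mean and standard deviation.

```python
from scipy.stats import norm

# Probability that a normal measurement falls within k standard deviations
# of the mean: P(-k < Z <= k) = F(k) - F(-k) for the standard normal Z.
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within +/- {k} sigma: {100 * prob:.2f}%")
```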

## Central Limit Theorem
---

#### Even if you are not normal, the average is normal.

Let us take a uniform distribution in the range $x \in [0, 1]$. Every number in this range is equally likely to be picked. We take 30 points at random from this range and call it sample 1. We calculate the mean of this sample and add it to a histogram. We collect multiple such samples and add their means to the same histogram. When we take many such means, we see that the histogram approaches the shape of the normal/Gaussian curve. Even though we have sampled the data from a uniform distribution, the sample means are approximately normally distributed. Even when we begin with an exponential distribution, the sample means still end up being approximately normally distributed.
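
The experiment described above can be sketched in a few lines of NumPy. The seed, the number of samples (`num_samples`) and the exponential scale are assumptions made for this illustration; the sample size of 30 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

sample_size = 30      # points per sample, as in the text
num_samples = 10_000  # number of samples, i.e. number of means (assumed)

# Each row is one sample; taking the mean along axis=1 gives one mean per sample.
uniform_means = rng.uniform(0, 1, size=(num_samples, sample_size)).mean(axis=1)
expo_means = rng.exponential(scale=1.0, size=(num_samples, sample_size)).mean(axis=1)

# For a parent distribution with mean m and standard deviation s, the sample
# means should have mean m and standard deviation s / sqrt(sample_size).
uniform_sd_theory = np.sqrt(1 / 12) / np.sqrt(sample_size)
expo_sd_theory = 1.0 / np.sqrt(sample_size)

# Fraction of uniform sample means within one standard deviation of 0.5;
# for a normal distribution this is about 0.68.
frac_within_1sd = np.mean(np.abs(uniform_means - 0.5) <= uniform_sd_theory)

print("uniform:    ", uniform_means.mean(), uniform_means.std(), uniform_sd_theory)
print("exponential:", expo_means.mean(), expo_means.std(), expo_sd_theory)
print("fraction within 1 sd (uniform):", frac_within_1sd)
```

Passing `uniform_means` or `expo_means` to `plt.hist(..., bins=40, density=True)` should show the bell shape directly.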

Practical implications of the means being normally distributed
- In reality, we do not know what distribution our data belongs to. However, the CLT does not care about the original distribution of our data, as long as it has a finite variance. So we do not have to worry about the distribution that the samples come from.

CLT is the basis for a lot of statistics. The normally distributed means are used to make confidence intervals and in various other statistical tests that help differentiate between two or more samples.

```{caution}
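A common rule of thumb is that the CLT approximation is adequate when we take a sample size $\ge$ 30. This is a guideline rather than a strict requirement; heavily skewed distributions may need larger samples.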
```

## Formal Statement of CLT
---