## 1 Central Limit Theorem

### 1.1 Sampling Distribution

#### Definition

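The sampling distribution of the mean is the probability distribution of the sample mean $\bar{x}$ over repeated random samples of the same size drawn from one population.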

#### How to form a Sampling Distribution?

STEP #1: Draw a random sample of size $n$ from the population.  
STEP #2: Calculate the mean of that sample, a.k.a. the sample mean $\bar{x}$.  
STEP #3: Repeat STEP #1 and STEP #2 many times and collect all the sample means.  
STEP #4: Plot the collected sample means; their distribution is the sampling distribution of the mean.

The sampling distribution has its own mean and its own standard deviation.
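
The four steps can be sketched as a small simulation. This is only an illustration: the exponential population, the sample size `n = 40`, and the number of repetitions are assumed values, not part of the notes above.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population (exponential with mean 2)
population = rng.exponential(scale=2.0, size=100_000)

n = 40               # size of each sample (assumed)
num_samples = 5_000  # how many samples to draw (assumed)

# Steps 1-3: draw the samples (one per row) and record the mean of each one
samples = rng.choice(population, size=(num_samples, n), replace=True)
sample_means = samples.mean(axis=1)

# Step 4 would plot `sample_means` (e.g. as a histogram); here it is only summarized
print("population mean:     ", population.mean().round(3))
print("mean of sample means:", sample_means.mean().round(3))
```

The two printed values should land very close to each other, which is the "own mean" of the sampling distribution lining up with the population mean.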

### 1.2 Central Limit Theorem

#### Theorem

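For random samples of size $n$ drawn from a population with mean $\mu$ and finite standard deviation $\sigma$, the distribution of the sample mean $\bar{X}$ approaches a normal distribution as $n$ grows. That distribution is centred on $\mu$ and has standard deviation $\sigma / \sqrt{n}$. In the notation below, the second parameter is the standard deviation, not the variance.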

$
\begin{align}
\large
\bar{X} \sim N\biggl(\mu\;, \frac{\sigma}{\sqrt{n}} \biggr)
\end{align}
$

#### Properties of CLT

1. The population distribution may or may not be normal, but the sampling distribution of the mean is approximately normal when the sample size is large enough.
2. The mean of the sampling distribution equals the population mean $\mu$.
3. The closer the population is to a normal distribution, the smaller the sample size needed for the approximation to hold.
4. Any single sample mean tends to fall closer to the population mean as the sample size grows.
5. As the sample size increases, the standard deviation of the sampling distribution decreases.
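
One way to see properties 1 and 5 at work is to draw sample means from a clearly non-normal population and watch their skewness and spread shrink as $n$ grows. The sketch below assumes an exponential population (skewness 2) and arbitrary sample sizes; it is an illustration, not a proof.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
num_samples = 20_000            # number of samples drawn per sample size (assumed)

skews, spreads = {}, {}
for n in (2, 10, 50):
    # Each row is one sample of size n from an exponential population (skewness 2)
    means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    skews[n] = float(stats.skew(means))  # 0 for a perfectly symmetric distribution
    spreads[n] = float(means.std())      # standard deviation of the sample means
    print(f"n={n:>2}  skewness={skews[n]:.3f}  std of means={spreads[n]:.3f}")
```

Both columns should decrease from one row to the next: the sample means become more symmetric (closer to normal) and more tightly clustered.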

#### SE in CLT

1. The Standard Error (SE) is the standard deviation of the sampling distribution: $SE = \sigma / \sqrt{n}$.
2. Since the Standard Error is inversely proportional to the square root of the sample size, a larger sample gives a smaller Standard Error.

> **Note**:
>
> As a rule of thumb, the sample size should be at least 30.
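
As a rough worked illustration (the value $\sigma = 10$ is assumed here purely for the arithmetic), quadrupling the sample size halves the Standard Error:

$$
SE = \frac{\sigma}{\sqrt{n}}: \qquad n = 16 \Rightarrow SE = \frac{10}{\sqrt{16}} = 2.5, \qquad n = 64 \Rightarrow SE = \frac{10}{\sqrt{64}} = 1.25
$$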

### 1.3 Conditions of CLT

#### Condition #1: Randomization

Data should be randomly sampled, ensuring every population member has an equal chance of being included.

#### Condition #2: Independence

1. Each sample value should be independent, with one event's occurrence not affecting another.
2. This condition is commonly met by probability sampling methods, which select observations independently.

#### Condition #3: Large Sample Condition

1. A sample size of 30 or more is generally considered "sufficiently large."
2. This threshold depends on the shape of the population distribution: a strongly skewed population may need a larger sample.

### 1.4 Examples

In [1]:
import numpy as np
from scipy import stats

#### Quiz #1

Systolic blood pressure of a group of people is known to have an average of 122 mmHg and a standard deviation of 10 mmHg.  
Calculate the probability that the average blood pressure of 16 people will be greater than 125 mmHg.  

##### Solution

In [2]:
mu = 122
std = 10
n = 16
# Find P(x > 125)
# P(x > 125) = 1 - P(x < 125)
x = 125

In [3]:
s_std = std / np.sqrt(n)  # Standard Error
z_score = (x - mu) / s_std  # Z-Score
s_std.round(4).item(), z_score.round(4).item()

(2.5, 1.2)

In [4]:
p_x_gt_125 = 1 - stats.norm.cdf(z_score)  # Probability of X > 125
p_x_gt_125.round(4).item()

0.1151

#### Quiz #2

On an e-commerce website, the average purchase amount per customer is \$80 with a standard deviation of \$15.  
If we randomly select a sample of 50 customers,  
what is the probability that the average purchase amount in the sample will be less than \$75?

##### Solution

In [5]:
mu = 80
std = 15
n = 50
# p(x < 75)
x = 75

In [6]:
s_std = std / np.sqrt(n)  # Standard Error
z_score = (x - mu) / s_std  # Z-Score
s_std.round(4).item(), z_score.round(4).item()

(2.1213, -2.357)

In [7]:
p_x_lt_75 = stats.norm.cdf(z_score)
p_x_lt_75.round(4).item()

0.0092

## 2 Confidence Interval

### 2.1 What is Confidence Interval?

#### Definition

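A confidence interval is a range of values, built from a sample estimate plus and minus a margin of error, that is expected to contain the true population parameter at a stated confidence level (for example 95%).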

#### Explanation

1. The mean of a sample drawn from the population is called the sample mean.
2. A point estimate (a single value) says nothing about its own uncertainty, whereas an interval estimate (a range of values) comes with a stated confidence level.
3. The sample mean is a point estimate of the population mean, so on its own it is not enough to estimate the population mean; it should be reported together with an interval.

Two ways to calculate CI:

1. CLT
2. Bootstrapping

#### Formula

$\text{Confidence Interval} = \text{Point Estimate} \pm \text{Margin of Error}$

### 2.2 CI Using CLT

#### Definition

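When the parameter of interest is the mean, the CLT gives a confidence interval for the population mean around the sample mean $\overline{X}$. In the formula, $Z$ is the critical value for the chosen confidence level and $\sigma$ is the population standard deviation (for large $n$, the sample standard deviation is commonly substituted).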

#### Formula

$
\begin{align}
\text{Confidence Interval} = \biggl(\overline{X} \pm Z \cdot \frac{\sigma}{\sqrt{n}} \biggr)
\end{align}
$
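
The formula can be wrapped in a small helper. The function name `clt_confidence_interval` and its signature are hypothetical, chosen for this sketch; the numbers in the usage line are the ones from Quiz #1 in section 2.4.

```python
import numpy as np
from scipy import stats


def clt_confidence_interval(sample_mean, std, n, confidence=0.95):
    """Return (lower, upper) for the mean using the normal approximation."""
    std_err = std / np.sqrt(n)
    # Two-sided critical value, e.g. confidence=0.95 -> ppf(0.975), about 1.96
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * std_err  # margin of error
    return sample_mean - margin, sample_mean + margin


lower, upper = clt_confidence_interval(65, 2.5, 100, confidence=0.95)
print(round(float(lower), 2), round(float(upper), 2))  # 64.51 65.49
```

`stats.norm.interval(confidence, loc=sample_mean, scale=std_err)` returns the same pair in one call; the helper only spells out the intermediate steps.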

### 2.3 CI Using Bootstrapping

#### Definition

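* When the statistic of interest is the median (or another statistic with no simple standard-error formula), we use Bootstrapping to estimate the corresponding population value with a confidence interval.
* Bootstrapping repeatedly resamples the observed data **with replacement**, each resample having the same size as the original sample, and recomputes the statistic on every resample.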

#### Properties

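1. The median of each resample is used, so the method estimates the population median.
2. Bootstrapping can be used to estimate any percentile.
3. This method is useful when the dataset is small or when the normal approximation is doubtful, although a very small sample still limits how reliable the interval can be.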
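
A minimal sketch of a percentile bootstrap interval for the median follows. The log-normal data, the sample size of 60, and the 10,000 resamples are assumptions made for the illustration, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed so the run is repeatable

# Hypothetical skewed sample (e.g. purchase amounts); not data from this document
data = rng.lognormal(mean=3.0, sigma=0.5, size=60)

num_resamples = 10_000
# Resample with replacement; each row has the same size as the original sample
resamples = rng.choice(data, size=(num_resamples, data.size), replace=True)
boot_medians = np.median(resamples, axis=1)

# Percentile bootstrap: the middle 95% of the bootstrap medians
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print("sample median:        ", np.median(data).round(2))
print("95% CI for the median:", lower.round(2), upper.round(2))
```

Swapping `np.median` for another statistic, such as `np.percentile(resamples, 90, axis=1)`, gives an interval for that statistic instead.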

### 2.4 Examples

#### Quiz #1

The mean height of a sample of 100 adults was found to be 65 inches, with a standard deviation of 2.5 inches.  
Compute 95% confidence interval.

##### Solution

In [8]:
s_mean = 65
s_std = 2.5
n = 100
confidence = 0.95  # 95%
# Find x1 and x2

In [9]:
std_err = s_std / np.sqrt(n)
std_err.round(4).item()

0.25

In [10]:
z1 = stats.norm.ppf(0.025)
z2 = stats.norm.ppf(0.975)

z1.round(4).item(), z2.round(4).item()

(-1.96, 1.96)

In [11]:
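# Lower bound: the value below which 2.5% of the sampling distribution lies
x1 = s_mean + (z1 * std_err)

# Upper bound: the value below which 97.5% of the sampling distribution lies
x2 = s_mean + (z2 * std_err)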

x1.round(4).item(), x2.round(4).item()

(64.51, 65.49)

or

In [12]:
x1, x2 = stats.norm.interval(confidence, loc=s_mean, scale=std_err)
x1.round(4).item(), x2.round(4).item()

(64.51, 65.49)

#### Quiz #2

The sample mean recovery time of 100 patients after taking a drug was seen to be 10.5 days with a standard deviation of 2 days.  
Find the 95% confidence interval of the true mean.

##### Solution

In [13]:
s_mean = 10.5
s_std = 2
n = 100
confidence = 0.95  # 95%

In [14]:
std_err = s_std / np.sqrt(n)
std_err.round(4).item()

0.2

In [15]:
z1 = stats.norm.ppf(0.025)
z2 = stats.norm.ppf(0.975)

z1.round(4).item(), z2.round(4).item()

(-1.96, 1.96)

In [16]:
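# Lower bound: the value below which 2.5% of the sampling distribution lies
x1 = s_mean + (z1 * std_err)

# Upper bound: the value below which 97.5% of the sampling distribution lies
x2 = s_mean + (z2 * std_err)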

x1.round(4).item(), x2.round(4).item()

(10.108, 10.892)

or

In [17]:
x1, x2 = stats.norm.interval(confidence, loc=s_mean, scale=std_err)
x1.round(4).item(), x2.round(4).item()

(10.108, 10.892)

#### Quiz #3

From a sample of 80 endangered birds, the average wingspan was found to be 45 cm, with a population standard deviation of 10 cm.  
What is the 90% confidence interval for the mean wingspan of the entire population?

##### Solution

In [18]:
s_mean = 45
std = 10
n = 80
confidence = 0.9  # 90%

In [19]:
std_err = std / np.sqrt(n)
std_err.round(4).item()

1.118

In [20]:
z1 = stats.norm.ppf(0.05)
z2 = stats.norm.ppf(0.95)

z1.round(4).item(), z2.round(4).item()

(-1.6449, 1.6449)

In [21]:
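# Lower bound: the value below which 5% of the sampling distribution lies
x1 = s_mean + (z1 * std_err)

# Upper bound: the value below which 95% of the sampling distribution lies
x2 = s_mean + (z2 * std_err)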

x1.round(4).item(), x2.round(4).item()

(43.161, 46.839)

or

In [23]:
x1, x2 = stats.norm.interval(confidence, loc=s_mean, scale=std_err)
x1.round(4).item(), x2.round(4).item()

(43.161, 46.839)

#### Quiz #4

In a software project, the team estimates bug resolution time at an average of 6 hours with a standard deviation of 2 hours.  
To estimate the mean resolution time with 99% confidence, the project manager samples 25 resolved bugs.  
What is the correct confidence interval?

##### Solution

In [24]:
s_mean = 6
std = 2
n = 25
confidence = 0.99

In [25]:
std_err = std / np.sqrt(n)
std_err.round(4).item()

0.4

In [26]:
x1, x2 = stats.norm.interval(confidence, loc=s_mean, scale=std_err)
x1.round(4).item(), x2.round(4).item()

(4.9697, 7.0303)