## 1 Datasets

### 1.1 Population

#### Definition

The population is the complete set of items under study.

### 1.2 Sample

#### Definition

While the population is the entire data set under study, a sample is a subset of it. Ideally a sample is an **unbiased subset of the population**, so that it represents the population fairly.

### 1.3 Sampling

#### Definition

* The process of picking data-points from the population is called sampling. Sampling follows a chosen strategy.
* If a data-point can be picked more than once, the sample can contain duplicates. This is called sampling **with replacement**.
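
The difference between sampling with and without replacement can be sketched with NumPy. The population of integers 1 to 10 and the sample sizes below are illustrative assumptions, not values from these notes. Drawing 15 items from a population of 10 is only possible with replacement, so duplicates are guaranteed in that case.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
population = np.arange(1, 11)        # hypothetical population: integers 1..10

# Without replacement: each item can be picked at most once.
sample_without = rng.choice(population, size=5, replace=False)

# With replacement: the same item may be picked several times.
sample_with = rng.choice(population, size=15, replace=True)

print("Without replacement:", sample_without)
print("With replacement:   ", sample_with)
```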

#### Why Sampling?

1. It is usually not practical to compute statistics on the entire population, so we work with a sample instead.
2. Any statistic computed on the sample is used to infer the corresponding metric of the population.
3. A **well designed sample** can provide an accurate estimate of the population characteristics with significantly less time and effort.

## 2 Sample Statistics

### 2.1 Sample Mean

#### Definition

The arithmetic mean of the observations in a sample.

#### Formula

$
\begin{align}
\large
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\end{align}
$

> **Note**:
> 
> The larger the sample size, the closer the sample mean tends to be to the population mean.

### 2.2 Sample Variance

#### Definition

The variance of the observations in a sample, measured around the sample mean $\bar{x}$.

#### Formula

$
\begin{align}
\large
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\end{align}
$

#### Why $n - 1$?

1. Dividing by $n - 1$ instead of $n$ makes the sample variance slightly larger. This is called "Bessel's correction".
2. It ensures that the sample variance is an unbiased estimate of the population variance. In other words, it removes the bias (a systematic underestimate) that dividing by $n$ would introduce.
3. The value $n - 1$ is the number of degrees of freedom of the sample.

#### Degree of Freedom

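Once the sample mean $\bar{x}$ is fixed, only $n - 1$ data-points can vary freely; the last value is determined by the others, because the deviations $x_i - \bar{x}$ must sum to zero. The sample therefore has $n - 1$ degrees of freedom.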
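
The constraint behind degrees of freedom can be checked numerically. This is a minimal sketch using five illustrative values: the deviations from the sample mean sum to zero, so the last deviation can be recovered from the other $n - 1$.

```python
import numpy as np

data = np.array([131, 150, 140, 142, 152])  # illustrative sample values

deviations = data - data.mean()             # x_i - x_bar for each data-point

# The deviations always sum to zero (up to floating-point rounding).
print("Sum of deviations:", deviations.sum())

# So the last deviation is fixed once the first n - 1 are known.
last_from_others = -deviations[:-1].sum()
print("Last deviation:", deviations[-1], "recovered:", last_from_others)
```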

> **Note**:
>
> Dividing by $n - 1$ instead of $n$ slightly increases the computed sample variance.  
> This removes the bias that would otherwise make the sample variance underestimate the population variance.  
> It is called **Bessel's correction**.  
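
A small simulation can make the bias visible. This is a sketch under stated assumptions: a normal population with variance 4, samples of size 5, and 20,000 repetitions are all illustrative choices. Averaged over many samples, dividing by $n$ should land near $\frac{n-1}{n}\sigma^2 = 3.2$, while dividing by $n - 1$ should land near the true variance of 4.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
true_variance = 4.0   # assumed population variance (sigma = 2)
n = 5                 # small sample size, where the bias is most visible
n_samples = 20_000    # number of repeated samples

# Each row is one sample of size n drawn from the population.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance), size=(n_samples, n))

biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print("Average variance dividing by n:    ", round(biased, 3))
print("Average variance dividing by n - 1:", round(unbiased, 3))
```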

### 2.3 Sample Standard Deviation

#### Definition

The standard deviation of the observations in a sample, i.e. the square root of the sample variance.

#### Formula

$\text{Standard Deviation} = \sqrt{\text{Variance}}$

or

$
\begin{align}
\large
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align}
$
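
As a quick check of the formula, the sample standard deviation can be computed with NumPy. The data values are illustrative. Note that `np.std` divides by $n$ by default, so `ddof=1` is needed to get the sample version.

```python
import numpy as np

data = [131, 150, 140, 142, 152]  # illustrative sample values

# ddof=1 divides by n - 1, matching the formula for s.
s = np.std(data, ddof=1)

# Same result obtained as the square root of the sample variance.
s_from_variance = np.sqrt(np.var(data, ddof=1))

print("Sample standard deviation:", round(s, 4))
```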

### 2.4 Why Sample Statistics?

#### Reasons

1. Since it is usually not feasible to compute statistical metrics from the entire population, the goal of sample statistics is to estimate the population statistics.
2. Estimates of population statistics derived from a sample can contain errors.

### 2.5 Point Estimates

#### Definition

1. A single value computed from a sample to estimate a population parameter (mean, variance, standard deviation) is called a **point estimate**.
2. The point estimate of the population mean is the sample mean.
3. The point estimate of the population variance is the sample variance.
4. The point estimate of the population standard deviation is the sample standard deviation.

> **Note**:
>
> Just as describing a population with statistical metrics (mean, median, etc.) is part of **Descriptive Statistics**,  
> inferring point estimates from a sample is part of **Inferential Statistics**.

### 2.6 Standard Error

#### Definition

1. The standard error (SE) indicates how much the sample mean is expected to deviate from the population mean when different samples are drawn.
2. Standard Error helps us in assessing the reliability of estimates derived from samples.
3. A smaller standard error suggests that the sample statistic is likely to be close to the population parameter.
4. A larger standard error indicates more variability and less precision in the estimate.

> **Note**:
>
> The lower the standard error, the higher the reliability of the estimate.

#### Formula

##### Case #1: Population standard deviation is known

$
\begin{align}
\large
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\end{align}
$

##### Case #2: Population standard deviation is **NOT** known

$
\begin{align}
\large
SE_{\bar{x}} = \frac{s}{\sqrt{n}}
\end{align}
$

Where $s$ is the sample standard deviation.

The standard error is inversely proportional to the square root of the sample size $n$, so quadrupling $n$ halves the standard error.

> **Note**:
> 
> The higher the value of $n$, the lower the standard error.
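
The formula $\sigma / \sqrt{n}$ can be compared against a simulation. This is a sketch only: the normal population with mean 50 and $\sigma = 10$, the sample sizes, and the number of repetitions are assumptions chosen for illustration. For each sample size, the standard deviation of many sample means should be close to the formula value.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
sigma = 10.0        # assumed population standard deviation
n_repeats = 5_000   # number of samples drawn for each sample size

results = {}
for n in (25, 100, 400):
    # Draw n_repeats samples of size n and take the mean of each one.
    sample_means = rng.normal(loc=50.0, scale=sigma, size=(n_repeats, n)).mean(axis=1)
    observed_se = sample_means.std(ddof=1)  # spread of the sample means
    formula_se = sigma / np.sqrt(n)         # sigma / sqrt(n)
    results[n] = (observed_se, formula_se)
    print(f"n={n:>3}  observed SE={observed_se:.3f}  formula SE={formula_se:.3f}")
```

Each fourfold increase in $n$ halves the formula value, which is the square-root relationship in action.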

#### Law of Large Numbers

The law of large numbers states that, as the sample size increases, the sample mean tends to get closer to the population mean.
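
A simulated fair die gives a simple illustration, since its population mean is known to be 3.5. The die, the seed, and the sample sizes are illustrative assumptions. The mean of the first $n$ rolls should drift toward 3.5 as $n$ grows, although any single run can wobble.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
population_mean = 3.5                     # mean of a fair six-sided die
rolls = rng.integers(1, 7, size=100_000)  # upper bound 7 is exclusive

means = {}
for n in (10, 1_000, 100_000):
    means[n] = rolls[:n].mean()           # sample mean of the first n rolls
    print(f"n={n:>7}  sample mean={means[n]:.4f}  "
          f"gap={abs(means[n] - population_mean):.4f}")
```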

#### Standard Error vs Standard Deviation

What is the difference between Standard Error and Standard Deviation?

##### Standard Deviation

1. Standard deviation represents the **variability of individual data-points** within a population or sample.
2. It measures the spread of data-points around the mean in a single distribution (population or sample).

##### Standard Error

1. Standard error represents the **variability across multiple sample means** (derived from different samples).
2. It measures how much the sample means, taken over many samples, differ from the population mean.

### 2.7 Examples

#### Quiz #1

There are 45 students in a class. 5 students were randomly selected from this class  
and their heights (in cm) were recorded as follows: [131, 150, 140, 142, 152]   

Calculate Sample mean and sample variance.

In [1]:
import numpy as np


data = [131, 150, 140, 142, 152]

# Sample mean
x_bar = np.mean(data)
print("Sample Mean:", x_bar)

# Sample variance; ddof=1 divides by n - 1 (Bessel's correction)
s_var = np.var(data, ddof=1)
print("Sample Variance:", s_var)

Sample Mean: 143.0
Sample Variance: 71.0


#### Quiz #2

A sample of the 30 latest returns on XYZ stock has a mean return of 4 and a sample standard deviation of 0.13.  
Estimate the standard error (SE) of the sample mean.

In [2]:
import math


# SE = s / sqrt(n), with s = 0.13 and n = 30
se_mean = 0.13 / math.sqrt(30)
print("Standard error of mean:", round(se_mean, 4))

Standard error of mean: 0.0237


## 3 Sampling techniques

There are two types of Sampling techniques.

1. Probability Sampling
2. Non-Probability Sampling

### 3.1 Probability Sampling

#### 1 Simple Random Sampling

Steps in Simple Random Sampling:

1. Define the population.
2. Determine the sample size.
3. Randomly select your sample.
4. Collect data from sample.
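
The four steps above can be sketched in code. Everything here is hypothetical: the class of 45 student IDs, the synthetic heights generated for them, and the sample size of 5 are stand-ins chosen only to show the flow.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# 1. Define the population: hypothetical IDs 1..45 with synthetic heights (cm).
student_ids = np.arange(1, 46)
heights = rng.normal(loc=145, scale=8, size=student_ids.size)

# 2. Determine the sample size.
sample_size = 5

# 3. Randomly select the sample; every student is equally likely to be picked.
chosen_ids = rng.choice(student_ids, size=sample_size, replace=False)

# 4. Collect data from the sample (IDs start at 1, array indices at 0).
sample_heights = heights[chosen_ids - 1]

print("Chosen IDs:", chosen_ids)
print("Sample mean height:", round(sample_heights.mean(), 2))
```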

#### 2 Stratified Sampling

#### 3 Systematic Sampling

#### 4 Cluster Sampling

### 3.2 Non-Probability Sampling

#### 1 Snowball Sampling

#### 2 Judgemental Sampling

## 4 Uniform Distributions

### 4.1 What is Uniform Distribution?

#### Definition

* Every outcome in the range is equally likely.
* The uniform distribution has both a discrete form and a continuous form.

### 4.2 Discrete Uniform Distribution

#### Formula

$
\begin{align}
\large
P(X=x) = \frac{1}{b - a + 1}
\end{align}
$

Where $x$ is an integer with $a \le x \le b$.

#### Expected Value

$
\begin{align}
\large
E[X] = \frac{a + b}{2}
\end{align}
$
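
A fair six-sided die is a convenient instance of the discrete uniform distribution, with $a = 1$ and $b = 6$ (an illustrative choice). The sketch below uses exact fractions to confirm that each outcome has probability $\frac{1}{b - a + 1}$ and that the expected value equals $\frac{a + b}{2}$.

```python
from fractions import Fraction

a, b = 1, 6                    # a fair die: outcomes 1..6
outcomes = range(a, b + 1)

pmf = Fraction(1, b - a + 1)   # P(X = x), identical for every outcome
expected = sum(x * pmf for x in outcomes)

print("P(X = x) =", pmf)       # P(X = x) = 1/6
print("E[X] =", expected)      # E[X] = 7/2
```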

### 4.3 Continuous Uniform Distribution

#### Formula

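##### Probability density

For a continuous variable, the probability of any single exact value is zero, so the constant below is a probability **density** $f(x)$, not $P(X=x)$:

$
\begin{align}
\large
f(x) = \frac{1}{b - a}
\end{align}
$

Where $a \le x \le b$, and $f(x) = 0$ outside that range.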

##### Probability of a range

Probability that $X$ falls in the range $[x_1, x_2]$, with $a \le x_1 \le x_2 \le b$:

$
\begin{align}
\large
P(x_1 \le X \le x_2) = \frac{x_2 - x_1}{b - a}
\end{align}
$

#### Expected Value

$
\begin{align}
\large
E[X] = \frac{a + b}{2}
\end{align}
$

### 4.4 Mean

$
\large
\begin{align}
\mu = \frac{1}{2} \bigl(a + b \bigr)
\end{align}
$

Where $a$ and $b$ are the lower and upper limits of the distribution.

### 4.5 Variance
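
For reference, the standard textbook results (not stated elsewhere in these notes) are $\sigma^2 = \frac{(b - a)^2}{12}$ for the continuous uniform distribution and $\sigma^2 = \frac{(b - a + 1)^2 - 1}{12}$ for the discrete uniform distribution on the integers $a, \dots, b$. The sketch below checks both numerically; the limits 100 and 150 and the simulation size are illustrative assumptions.

```python
import numpy as np

a, b = 100, 150  # illustrative limits

# Continuous case: closed form versus a simulation.
continuous_formula = (b - a) ** 2 / 12
rng = np.random.default_rng(seed=5)
draws = rng.uniform(a, b, size=200_000)
continuous_simulated = draws.var(ddof=1)

# Discrete case on the integers a..b: closed form versus direct computation.
values = np.arange(a, b + 1)
discrete_formula = ((b - a + 1) ** 2 - 1) / 12
discrete_direct = values.var()  # ddof=0: every outcome is listed exactly once

print("Continuous:", round(continuous_formula, 3), "vs", round(continuous_simulated, 3))
print("Discrete:  ", round(discrete_formula, 3), "vs", round(discrete_direct, 3))
```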

### 4.6 Examples

#### Quiz #1

Suppose the weight of dolphins is uniformly distributed between 100 pounds and 150 pounds.  
If we select a dolphin at random, determine the probability that the chosen  
dolphin weighs between 120 and 130 pounds.

In [2]:
from scipy import stats

In [3]:
a = 100
b = 150

x_1 = 120
x_2 = 130

# P(x_1 <= X <= x_2) for a continuous uniform distribution on [a, b]
(x_2 - x_1) / (b - a)

0.2
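
The same answer can be cross-checked with `scipy.stats.uniform`, as a sketch. SciPy parameterises this distribution by `loc` (the lower limit $a$) and `scale` (the width $b - a$), and the probability of a range is the difference of two CDF values.

```python
from scipy import stats

a, b = 100, 150
x_1, x_2 = 120, 130

# loc is the lower limit, scale is the width of the interval.
weight = stats.uniform(loc=a, scale=b - a)

# P(x_1 <= X <= x_2) = F(x_2) - F(x_1)
prob = weight.cdf(x_2) - weight.cdf(x_1)

print("Probability:", round(prob, 4))
print("Mean weight:", weight.mean())
```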