## 1 Statistics

### 1.1 Overview

There are two types of Statistics:

1. Descriptive Statistics
2. Inferential Statistics

![types of statistics](images/types_of_stats.png)

### 1.2 Descriptive Statistics

#### Definition

1. Describes existing or historical data.
2. A common example is exploratory data analysis (EDA).

As part of descriptive statistics, we measure the following properties of the data:

1. Measure of central tendency
2. Measure of Variability

### 1.3 Inferential Statistics

1. A sample is drawn from a population and used to make inferences or draw conclusions about the entire population.

## 2 Measure of central tendency

Measuring the centrality of data.  

Why do we have to measure the centrality of data?  

When we measure the centrality of data, we are essentially trying to summarize a dataset with a single representative value. Real-world datasets can have millions of points. Instead of listing all of them, we want a compact representation.

#### Types of Central tendencies

1. Mean
2. Median
3. Mode

### 2.1 Mean

#### Definition

1. Sum of all values in the dataset divided by the number of values.
2. The mean is sensitive to outliers: a single extreme value can shift it noticeably.

#### Formula

$
\large
\begin{align}
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\end{align}
$
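
The formula can be checked with a few lines of Python. This is a minimal sketch: the helper name `mean` and the sample values are illustrative assumptions, and the extra value `104` is only there to show how one outlier pulls the mean upward.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


ages = [20, 22, 26, 28]      # illustrative data
with_outlier = ages + [104]  # same data plus one extreme value

print(mean(ages))          # 24.0
print(mean(with_outlier))  # 40.0
```

The standard library offers the same calculation as `statistics.mean`.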

### 2.2 Median

#### Definition

1. Midpoint of **sorted** data.
2. Median is **robust** to outliers (a few extreme values barely change it).

#### Formula

##### Case #1: Odd number of elements

$
\huge
\begin{align}
x_{\frac{n + 1}{2}}
\end{align}
$

##### Case #2: Even number of elements

$
\huge
\begin{align}
\frac{1}{2}(x_{\frac{n}{2}} + x_{\frac{n}{2} + 1})
\end{align}
$

or

$
\huge
\begin{align}
\frac{x_{(\frac{n}{2})} + x_{(\frac{n}{2} + 1)}}{2}
\end{align}
$
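
The two cases translate directly into code. The sketch below assumes a hypothetical helper named `median`. Note that the formulas use 1-based positions, so $x_k$ corresponds to `ordered[k - 1]` in Python.

```python
def median(values):
    """Median of a dataset, following the odd and even cases above."""
    ordered = sorted(values)  # the median is defined on sorted data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: x_{(n+1)/2} is at 0-based index n // 2
        return ordered[mid]
    # Even n: average of x_{n/2} and x_{n/2 + 1} (0-based mid - 1 and mid)
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([28, 20, 26, 22]))       # 24.0
print(median([20, 22, 26, 28, 104]))  # 26
```

Adding the extreme value `104` moves the median only from 24 to 26, which illustrates its robustness to outliers. The standard library provides `statistics.median` for the same purpose.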

### 2.3 Mode

#### Definition

Mode is the most frequent value in the data.  

A dataset can have more than one mode. Based on the number of modes, data is described as:

1. Unimodal (one mode)
2. Bimodal (two modes)
3. Multimodal (more than two modes)

> **Note**:
> 
> Mean and Median are suitable for continuous (numerical) data. Mode is suitable for discrete or categorical data, where repeated values are common.
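
As a small illustration, `statistics.multimode` from the Python standard library (Python 3.8+) returns every value that ties for the highest frequency. The sample lists are made up for this sketch.

```python
from statistics import multimode

unimodal = [1, 2, 2, 3]             # one most frequent value
bimodal = [1, 1, 2, 3, 3]           # two values tie for most frequent
categories = ["cat", "dog", "cat"]  # mode also works on categorical data

print(multimode(unimodal))    # [2]
print(multimode(bimodal))     # [1, 3]
print(multimode(categories))  # ['cat']
```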

### 2.4 Weighted mean

#### Definition

A weighted mean (weighted average) is the average of a set of numbers in which some numbers are more important than others, so they are given more "weight" in the final calculation.  
It is calculated by multiplying each data point by its assigned weight, summing these products, and then dividing by the sum of all the weights.

#### Formula

$
\huge
\begin{align}
\frac{\sum_{i = 1}^{n} w_i * x_i}{\sum_{i = 1}^{n} w_i}
\end{align}
$

> **Note**:
>
> Weighted mean is used, for example, when calculating a class-weighted loss function for an imbalanced dataset.
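
The formula can be sketched as a short function. The name `weighted_mean` and the exam scores and weights below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


scores = [80, 90, 70]  # hypothetical exam scores
weights = [2, 3, 5]    # hypothetical importance of each exam

# (2*80 + 3*90 + 5*70) / (2 + 3 + 5) = 780 / 10
print(weighted_mean(scores, weights))  # 78.0
```

With equal weights the result reduces to the ordinary mean, here 80.0.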

### 2.5 Examples

#### Quiz #1

There are 4 people whose average age is 24.  
We know the age of three people: 20, 22, and 28.  
What is the median age of these 4 people?

1. 22
2. 24
3. 25
4. 26

In [1]:

    # (20 + 22 + 28 + x) / 4 = 24
    x = (4 * 24) - (20 + 22 + 28)
    x  # Missing age

26

In [2]:

    ages = [20, 22, 26, 28]
    median = (22 + 26) / 2
    median

24.0

#### Quiz #2

A survey of the number of pets owned by people in a town found that:

1. 30% of people had 0 pets
2. 40% of people had 1 pet
3. 10% of people had 2 pets
4. 20% of people had 3 pets

What is the average number of pets? 

In [3]:

    # Weighted mean: the percentages act as weights that sum to 100
    ((30 * 0) + (40 * 1) + (10 * 2) + (20 * 3)) / 100

1.2

#### Quiz #3

The mean weight of 2 children in a family is 40 kg.  
If the weight of the mother is included, the mean becomes 45 kg.  
What is the weight of the mother?

1. 45
2. 50
3. 55
4. 60
5. 65

In [4]:

    # (k1 + k2) / 2 = 40
    # (80 + m) / 3 = 45
    m = (45 * 3) - 80
    m

55

## 3 Measure of Variability

Measures of variability describe how spread out the data is:

1. Range
2. Variance
3. Standard Deviation
4. Inter-quartile range (IQR)

### 3.1 Range

#### Definition

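Range is the difference between the largest and smallest values in the data. Because it depends only on the two extremes, it is sensitive to outliers. For this reason a variation on range, the Inter-Quartile Range (IQR), is often used instead.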

#### Formula

$\text{Range} = \text{Max element} - \text{Min element}$
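
A minimal sketch of the formula, assuming a hypothetical helper named `value_range` and made-up data. The second call adds a single extreme value to show how strongly the range reacts to it.

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    if not values:
        raise ValueError("range of an empty dataset is undefined")
    return max(values) - min(values)


data = [20, 22, 26, 28]  # illustrative data

print(value_range(data))          # 8
print(value_range(data + [104]))  # 84
```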

### 3.2 Quantiles

#### Definition

A quantile is a value that divides a probability distribution (or dataset) into intervals with equal probabilities.  
In other words, if you sort your data in increasing order, quantiles are the "cut points" that split the data into equally sized subsets.

1. Quantiles break data into equal parts. Some popular Quantiles are Percentiles, Quartiles and Deciles.
2. **Quartiles** are a special type of Quantiles that break data into **4 equal parts**.
3. **Percentiles** are a type of Quantiles that break data into **100 equal parts**.
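
As an illustration, `statistics.quantiles` from the Python standard library (Python 3.8+) returns the `n - 1` cut points that split the data into `n` equal parts. The data is made up, and the results assume the function's default `method="exclusive"`. Other tools, and other interpolation methods, can give slightly different cut points for the same data.

```python
from statistics import quantiles

data = list(range(1, 12))  # 1, 2, ..., 11 (illustrative data)

quartiles = quantiles(data, n=4)  # 3 cut points -> 4 equal parts
deciles = quantiles(data, n=10)   # 9 cut points -> 10 equal parts

print(quartiles)     # [3.0, 6.0, 9.0]
print(len(deciles))  # 9
```

The middle quartile, 6.0, is also the median of the data.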

### 3.2 Inter-Quartile Range (IQR)

#### Definition

1. IQR is the difference between the 75th and 25th percentiles.
2. IQR measures statistical dispersion, i.e., it describes the spread in the data.
3. IQR is a median-based measure of spread: it is the width of the interval around the median that contains the middle 50% of the data.

* Q1: 25th percentile
* Q2: 50th percentile (aka median)
* Q3: 75th percentile

#### Formula

$\text{IQR} = \text{Q3} - \text{Q1}$

> **Note**:
> 
> 1. The larger the IQR, the greater the spread in the data.
> 2. IQR uses **Median as central tendency** (center of data).
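
The sketch below computes the IQR with `statistics.quantiles`, using its default `method="exclusive"`. The helper name `iqr` and the data are assumptions for illustration. Replacing the largest value with an extreme one changes the range drastically but leaves the IQR untouched.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range: Q3 - Q1."""
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1


data = list(range(1, 12))          # 1, 2, ..., 11 (illustrative data)
with_outlier = data[:-1] + [1000]  # largest value replaced by an outlier

print(iqr(data))                              # 6.0
print(iqr(with_outlier))                      # 6.0
print(max(with_outlier) - min(with_outlier))  # 999 (range, for comparison)
```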

### 3.3 Variance

#### Definition

1. Average of the squared distances of the data points from the mean.
2. Variance measures statistical dispersion, i.e., it describes the spread in the data.
3. Variance is a mean-based measure of spread, i.e., it tells us how far the data points are from the mean on average (in squared units).

#### Formula

$
\large
\begin{align}
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
\end{align}
$

> **Note**:
>
> 1. The lower the variance, the more tightly the data points cluster around the mean. Whether low variance is desirable depends on the context.
> 2. Variance uses **Mean as central tendency** (center of data).
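
The formula above divides by $n$, which is the population variance. A minimal sketch follows, with a hypothetical helper `population_variance` and made-up data. It is compared against `statistics.pvariance`, which uses the same definition, whereas `statistics.variance` divides by $n - 1$ (sample variance).

```python
from statistics import pvariance


def population_variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty dataset is undefined")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [20, 22, 26, 28]  # mean is 24, deviations are -4, -2, 2, 4

print(population_variance(data))                     # 10.0
print(population_variance(data) == pvariance(data))  # True
```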

### 3.4 Standard Deviation

#### Definition

Standard Deviation is the square root of the Variance. Unlike variance, it is expressed in the same units as the data.

#### Formula

$\text{Standard Deviation} = \sqrt{\text{Variance}}$

or

$
\large
\begin{align}
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}
\end{align}
$

> **Note**:
>
> The lower the Standard Deviation, the more tightly the data points cluster around the mean. Whether that is desirable depends on the context.
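
As a quick check of the formula, the sketch below computes the population standard deviation by hand for made-up data and compares it with `statistics.pstdev` from the standard library, which uses the same divide-by-$n$ definition.

```python
from math import isclose, sqrt
from statistics import pstdev

data = [20, 22, 26, 28]  # illustrative data

mu = sum(data) / len(data)                               # mean: 24.0
variance = sum((x - mu) ** 2 for x in data) / len(data)  # 10.0
sigma = sqrt(variance)                                   # square root of variance

print(round(sigma, 3))                 # 3.162
print(isclose(sigma, pstdev(data)))    # True
```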