# Measures of Dispersion
## What is variability?
Variability refers to **how spread out** (or **how dispersed**) the data points are around the **center** of a dataset.

Two datasets can have the same mean, *but completely different* variability.

**Example:**
- [4, 5, 6]: mean = 5
- [1, 5, 9]: mean = 5

The second dataset is much more spread out.
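
As a quick check of the example above, the following minimal sketch computes the mean and the sample standard deviation of both lists using Python's built-in `statistics` module. The variable names are illustrative only.

```python
import statistics

a = [4, 5, 6]
b = [1, 5, 9]

# Both datasets share the same center
mean_a = statistics.mean(a)
mean_b = statistics.mean(b)

# ...but their spread (sample standard deviation) differs
sd_a = statistics.stdev(a)
sd_b = statistics.stdev(b)

print(mean_a, mean_b)  # 5 5
print(sd_a, sd_b)      # 1.0 4.0
```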

## Why does it matter?
In data science, understanding **variability** is fundamental to making reliable *inferences* and building *robust models*.

It is not just about numbers! Variability tells you **how trustworthy, consistent, and predictable** your data (and **conclusions**) really are.

Understanding it helps you:
- detect **consistency**: Are the data points close together or scattered?
- assess **reliability**: A process with low variability is more predictable.
- compare different groups: Two products *might have the same average performance*, but the one that varies more is less predictable and less reliable.

## Main Measures of Variability

### Range
The range is the simplest measure of variability.
It tells us the difference between the largest and smallest values in a dataset.

$$ R = \max(x) - \min(x) $$

#### Advantages
- easy to compute
- gives a quick sense of spread
#### Disadvantages
- very sensitive to outliers
- tells us nothing about how the values between the minimum and maximum are spread
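
To illustrate the sensitivity to outliers, here is a small sketch on made-up numbers (the arrays are hypothetical, not taken from any dataset in this notebook). Adding a single extreme value changes the range drastically.

```python
import numpy as np

values = np.array([10, 12, 13, 15, 16])   # hypothetical data
with_outlier = np.append(values, 100)      # same data plus one extreme value

# R = max(x) - min(x)
range_clean = values.max() - values.min()
range_outlier = with_outlier.max() - with_outlier.min()

print(range_clean)    # 6
print(range_outlier)  # 90
```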
  
----------------------------------------
### Interquartile Range
The interquartile range (IQR) measures the spread of the middle 50% of the data.

It is calculated as:
$$ IQR = Q_3 - Q_1 $$

Where:
- Q1 = 25th percentile (first quartile)
- Q3 = 75th percentile (third quartile)

#### Advantages
- robust to outliers, since it ignores the extreme values in both tails.
- useful for detecting skewness and outliers.
#### Disadvantages
- ignores 50% of the data (the lowest 25% and the highest 25%) and only looks at the middle part.
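
The robustness claim can be sketched on the same kind of made-up data: one extreme value barely moves the IQR. The helper `iqr` below is a hypothetical function written for this illustration; it relies on NumPy's default linear interpolation for percentiles, so other quartile conventions may give slightly different numbers.

```python
import numpy as np

def iqr(values):
    """Return Q3 - Q1 using NumPy's default (linear) percentile interpolation."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

clean = np.array([10, 12, 13, 15, 16])   # hypothetical data
with_outlier = np.append(clean, 100)      # same data plus one extreme value

print(iqr(clean))         # 3.0
print(iqr(with_outlier))  # 3.5
```
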
---------------------------------------
### Variance
The variance measures the average squared deviation of each value from the mean.

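**sample:** $$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$

**population:** $$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$

The sample formula divides by $n - 1$ instead of $n$ (Bessel's correction) because the deviations are measured from the sample mean $\bar{x}$ rather than the true mean $\mu$, which would otherwise bias the estimate downward.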
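
A minimal sketch of both formulas, computed step by step on a tiny hypothetical array and compared with NumPy's `var`, where `ddof=1` gives the sample variance and `ddof=0` the population variance:

```python
import numpy as np

x = np.array([1.0, 5.0, 9.0])   # hypothetical data
n = x.size

# Squared deviations from the mean: 16, 0, 16
squared_dev = (x - x.mean()) ** 2

sample_var = squared_dev.sum() / (n - 1)   # divide by n - 1
pop_var = squared_dev.sum() / n            # divide by N

print(sample_var, np.var(x, ddof=1))
print(pop_var, np.var(x, ddof=0))
```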

---------------------------------------
### Standard Deviation
The standard deviation is simply the square root of the variance:
$$ s = \sqrt{s^2}, \quad \sigma = \sqrt{\sigma^2} $$
It brings the measure back to the original units of the data.
#### Advantages
- very commonly used.
- expressed in the same units as the data.
- forms the basis for many statistical methods (e.g., z-scores, confidence intervals).
#### Disadvantages
- sensitive to outliers (because it's based on squared deviations).
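
As a small illustration (on hypothetical data), the sketch below takes the square root of the sample variance and then uses the result to compute z-scores, defined here in the usual way as each value's distance from the mean in units of standard deviation.

```python
import numpy as np

x = np.array([1.0, 5.0, 9.0])   # hypothetical data

# Standard deviation as the square root of the sample variance
s = np.sqrt(x.var(ddof=1))

# z-score: (value - mean) / standard deviation
z = (x - x.mean()) / s

print(s)  # 4.0
print(z)  # [-1.  0.  1.]
```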

---------------------------------------
### Coefficient of Variation
The coefficient of variation expresses the **standard deviation as a percentage of the mean**.
$$ CV = \frac{s}{\bar{x}} \times 100\% $$

**It's a relative measure of variability, meaning it lets us compare the spread of two datasets that have different units or scales.**

Example:

**Dataset A:** mean = 100, s = 10 → CV = 10/100 × 100% = 10%

**Dataset B:** mean = 200, s = 30 → CV = 30/200 × 100% = 15%
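
The two results above can be reproduced with a short sketch. The function name `coefficient_of_variation` is a hypothetical helper introduced for this illustration.

```python
def coefficient_of_variation(mean, std):
    """Return the standard deviation as a percentage of the mean."""
    return 100 * std / mean

cv_a = coefficient_of_variation(mean=100, std=10)
cv_b = coefficient_of_variation(mean=200, std=30)

print(cv_a)  # 10.0
print(cv_b)  # 15.0
```

Relative to its mean, Dataset B is the more variable of the two, even before comparing the raw standard deviations.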

---------------------------------------


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
df = sns.load_dataset('tips')
df.head()

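| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |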


In [3]:
range_total_bill = df['total_bill'].max() - df['total_bill'].min()
range_total_bill

47.74

In [4]:
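# interpolation='midpoint' averages the two nearest data points
# when a quartile falls between them
IQR_bill = stats.iqr(df['total_bill'], interpolation='midpoint')
IQR_bill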

10.849999999999998

In [5]:
sample_variance = df['total_bill'].var(ddof=1)
pop_variance = df['total_bill'].var(ddof=0)
print('Sample', sample_variance)
print('Population', pop_variance)

Sample 79.25293861397826
Population 78.92813148851113


In [6]:
sample_std = df['total_bill'].std(ddof=1)
pop_std = df['total_bill'].std(ddof=0)
print('Sample', sample_std)
print('Population', pop_std)

Sample 8.902411954856856
Population 8.88415057777113


In [7]:
mean = df['total_bill'].mean()
std = df['total_bill'].std()  # pandas defaults to ddof=1 (sample standard deviation)
CV = (std / mean) * 100
print(str(round(CV, 2)) + '%')

44.99%
