# Estimates of Variability

Location is just one dimension in summarizing a feature. A second dimension, **variability**, also referred to as **dispersion**, measures whether the data values are tightly clustered or spread out. **At the heart of statistics lies variability**: measuring it, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it.


## KEY TERMS FOR VARIABILITY METRICS
### Deviations
- The difference between the observed values and the estimate of location.
- Deviations tell us how dispersed the data are around the central value.
    - Formula: `x_i - mean` or `x_i - median`, where `x_i` is the i-th observed value.
    - Synonyms
        - errors, residuals
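
As a minimal sketch of the definition above, the following computes deviations from the mean and from the median. The three-value sample `x` is a made-up illustration, not data from the state table.

```python
import numpy as np

# Hypothetical toy sample (an assumption for illustration only)
x = np.array([1.0, 4.0, 4.0])

dev_from_mean = x - x.mean()        # deviations from the mean (3.0)
dev_from_median = x - np.median(x)  # deviations from the median (4.0)

print(dev_from_mean)        # [-2.  1.  1.]
print(dev_from_median)      # [-3.  0.  0.]
print(dev_from_mean.sum())  # 0.0
```

Deviations from the mean always sum to zero, which is why the metrics below square them or take absolute values before averaging.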

### Variance
- The sum of squared deviations from the mean divided by n – 1, where n is the number of data values (the sample variance).
    - Synonym
        - mean-squared-error
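
The definition can be checked by hand on a small sample. This sketch assumes a made-up three-value array `x` and compares a manual calculation with `np.var`, where `ddof=1` selects the n – 1 denominator.

```python
import numpy as np

# Hypothetical toy sample (an assumption for illustration only)
x = np.array([1.0, 4.0, 4.0])

n = len(x)
squared_dev = (x - x.mean()) ** 2         # squared deviations: 4, 1, 1
var_manual = squared_dev.sum() / (n - 1)  # divide by n - 1

var_numpy = np.var(x, ddof=1)             # same calculation via NumPy

print(var_manual, var_numpy)  # 3.0 3.0
```

Note that `np.var` divides by n unless `ddof=1` is passed, so the default NumPy result for this sample would be 2.0 rather than 3.0.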

### Standard deviation
- The square root of the variance.
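
Because it is the square root of the variance, the standard deviation is expressed in the same units as the data. A short sketch on a made-up sample follows; it also shows that the choice of denominator matters, since pandas' `Series.std()` defaults to `ddof=1` while `np.std` defaults to `ddof=0`.

```python
import math
import numpy as np

# Hypothetical toy sample (an assumption for illustration only)
x = np.array([1.0, 4.0, 4.0])

sd_sample = np.std(x, ddof=1)  # square root of the n - 1 variance (3.0)
sd_population = np.std(x)      # NumPy default: square root of the n variance (2.0)

print(math.isclose(sd_sample, math.sqrt(3.0)))      # True
print(math.isclose(sd_population, math.sqrt(2.0)))  # True
```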

### Mean absolute deviation
- The mean of the absolute values of the deviations from the mean.
    - Synonyms
        - l1-norm, Manhattan norm
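
One way to compute the mean absolute deviation directly from its definition is sketched below, again on a made-up sample rather than the state data.

```python
import numpy as np

# Hypothetical toy sample (an assumption for illustration only)
x = np.array([1.0, 4.0, 4.0])

abs_dev = np.abs(x - x.mean())  # absolute deviations from the mean: 2, 1, 1
mean_abs_dev = abs_dev.mean()   # their average, 4/3

print(round(mean_abs_dev, 4))   # 1.3333
```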

### Median absolute deviation from the median
- The median of the absolute values of the deviations from the median.
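
The sketch below computes the median absolute deviation from its definition on a made-up sample containing one extreme value. It also shows the scaled variant: by default, statsmodels' `robust.scale.mad` divides the raw value by the 0.75 quantile of the standard normal distribution (about 0.6745), so that the result is comparable to a standard deviation for normally distributed data. The constant is hard-coded here as `NORMAL_Q75`, a name chosen for this example.

```python
import numpy as np

# Hypothetical toy sample with one extreme value (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

med = np.median(x)                    # 3.0
raw_mad = np.median(np.abs(x - med))  # median of 2, 1, 0, 1, 97 -> 1.0

# 0.75 quantile of the standard normal distribution
NORMAL_Q75 = 0.6744897501960817
scaled_mad = raw_mad / NORMAL_Q75     # about 1.4826

print(raw_mad)               # 1.0
print(round(scaled_mad, 4))  # 1.4826
```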

### Range
- The difference between the largest and the smallest value in a data set.

### Order statistics
- Metrics based on the data values sorted from smallest to largest.
    - Synonym
        - ranks
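
The range and the order statistics defined above can both be read off the sorted data. A minimal sketch, assuming a made-up five-value sample:

```python
import numpy as np

# Hypothetical toy sample (an assumption for illustration only)
x = np.array([7, 1, 9, 3, 5])

sorted_x = np.sort(x)                    # order statistics: [1 3 5 7 9]
data_range = sorted_x[-1] - sorted_x[0]  # largest minus smallest: 8

# Rank of each original value (1 = smallest); assumes no ties
ranks = np.argsort(np.argsort(x)) + 1    # [4 1 5 2 3]

print(sorted_x, data_range, ranks)
```

`np.ptp(x)` returns the same range without sorting explicitly.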

### Percentile
- The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.
    - Synonym
        - quantile

### Interquartile range
- The difference between the 75th percentile and the 25th percentile.
    - Synonym
        - IQR
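
Percentiles and the IQR can be sketched with `np.percentile` on a made-up sample of the integers 1 to 9. When the requested percentile falls between two data values, NumPy interpolates linearly by default, which is also the default of the pandas `quantile` method used later in this section.

```python
import numpy as np

# Hypothetical toy sample: the integers 1 through 9 (illustration only)
x = np.arange(1, 10)

q25, q75 = np.percentile(x, [25, 75])  # 25th and 75th percentiles
iqr = q75 - q25                        # interquartile range

print(q25, q75, iqr)  # 3.0 7.0 4.0
```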

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
state = pd.read_csv(r'G:\data_science\data\state.csv')
print(state.head(8))

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE


In [6]:
# standard deviation (pandas divides by n - 1 by default)
state['Population'].std()

6848235.347401142

In [7]:
# interquartile range (IQR): 75th percentile minus 25th percentile
state['Population'].quantile(0.75)-state['Population'].quantile(0.25)

4847308.0

In [9]:
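# median absolute deviation from the median (MAD), not the mean absolute deviation;
# statsmodels scales the result to be comparable to a standard deviation for normal data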
robust.scale.mad(state['Population'])


3849876.1459979336

## Note
- Variance and standard deviation are the most widespread and routinely reported statistics of variability.

- Both are sensitive to outliers.

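- More robust metrics include the mean absolute deviation (less sensitive than the standard deviation because deviations are not squared, though still affected by extreme values), the median absolute deviation from the median, and percentiles (quantiles).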
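
To illustrate the sensitivity to outliers, the sketch below compares several metrics on two made-up samples that differ only in their largest value. The helper `summarize` and both samples are assumptions introduced for this example.

```python
import numpy as np

def summarize(values):
    """Return several variability metrics for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    return {
        "std": np.std(values, ddof=1),
        "mean_abs_dev": np.mean(np.abs(values - values.mean())),
        "median_abs_dev": np.median(np.abs(values - np.median(values))),
        "iqr": q75 - q25,
    }

# Hypothetical samples: identical except for one extreme value
clean = summarize([1, 2, 3, 4, 5])
with_outlier = summarize([1, 2, 3, 4, 500])

for name in clean:
    print(f"{name:>15}: {clean[name]:10.3f} -> {with_outlier[name]:10.3f}")
```

In this sketch the median absolute deviation and the IQR are identical for both samples, whereas the standard deviation and the mean absolute deviation both increase sharply once the outlier is added.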