## Calculating mean, trimmed mean, median, weighted mean and weighted median


• The basic metric for location is the mean, but it can be sensitive to extreme values (outliers).  
• Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions, and hence are more robust.  
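
To make the sensitivity concrete, here is a small sketch using only the standard library. The sample values are made up for illustration and are not taken from `state.csv`.

```python
import statistics

# Hypothetical sample: five typical values plus one extreme value.
values = [10, 12, 11, 13, 12, 200]

mean_all = statistics.mean(values)      # pulled upward by the outlier
median_all = statistics.median(values)  # stays with the bulk of the data

# The same sample without the outlier, for comparison.
mean_clean = statistics.mean(values[:-1])
median_clean = statistics.median(values[:-1])

print("with outlier:   ", mean_all, median_all)
print("without outlier:", mean_clean, median_clean)
```

A single extreme value moves the mean from 11.6 to 43, while the median stays at 12.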

In [21]:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
import wquantiles

state = pd.read_csv('data/state.csv')
state.head()

Unnamed: 0,State,Population,Murder.Rate,Abbreviation
0,Alabama,4779736,5.7,AL
1,Alaska,710231,5.6,AK
2,Arizona,6392017,4.7,AZ
3,Arkansas,2915918,5.6,AR
4,California,37253956,4.4,CA


#### Mean

In [22]:
state['Population'].mean()

np.float64(6162876.3)

#### Trimmed mean

In [23]:
trim_mean(state['Population'], 0.1) # drop 10% from each end

np.float64(4783697.125)
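
As a rough illustration of what a trimmed mean does, the sketch below sorts the values, drops a fixed proportion from each end, and averages the rest. The helper `trimmed_mean` and the sample are hypothetical; this is a simplified stand-in for the idea, not the `scipy` implementation.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # values cut per end, rounded down
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion too large: nothing left to average")
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

print(trimmed_mean(sample, 0.0))  # plain mean, dominated by 1000
print(trimmed_mean(sample, 0.1))  # drops 1 and 1000, averages 2..9
```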

#### Median

In [24]:
state['Population'].median()

np.float64(4436369.5)

#### Weighted mean

In [25]:
np.average(state['Murder.Rate'], weights=state['Population'])

np.float64(4.445833981123393)
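
The weighted mean is $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$. A minimal sketch of that formula follows; the helper name `weighted_mean` and the numbers are invented for illustration, with the weights playing the role that `Population` plays above.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical rates for three regions and their relative sizes.
rates = [2.0, 4.0, 10.0]
sizes = [1, 1, 8]

unweighted = sum(rates) / len(rates)
weighted = weighted_mean(rates, sizes)

print(unweighted)  # each region counts equally
print(weighted)    # the largest region dominates
```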

#### Weighted median

In [26]:
wquantiles.median(state['Murder.Rate'], weights=state['Population'])

np.float64(4.4)
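
One simple way to think about a weighted median: sort the values, accumulate the weights, and return the first value at which the running weight reaches half of the total. The sketch below implements that definition with a hypothetical helper, `weighted_median`. It does not interpolate, so it is an assumption-laden illustration and may differ slightly from what `wquantiles.median` returns.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("values and weights must be non-empty")

# Hypothetical data: the value 4 carries most of the weight.
values = [1, 2, 3, 4]
weights = [1, 1, 1, 5]

print(weighted_median(values, weights))
```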

## Variability (dispersion)

Variability measures whether the data values are tightly clustered or spread out, that is, how close the values are to each other.  

- **Deviation (error, residual)**: the difference between an observed value and the estimated value (for example, the mean).  
- **Variance (Hungarian: szórásnégyzet)**: a measure of how spread out the data points are around the mean, based on the squared differences between each observed data point and the mean. Step by step: subtract the mean from each data point, square each difference, sum the squares, then divide the sum by $n - 1$, where $n$ is the number of data points:  
$s^2 = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}{n-1}$  
- **Standard deviation (Hungarian: szórás)**: the square root of the variance, $s = \sqrt{s^2}$.  
- **Mean absolute deviation**: the mean of the absolute values of the deviations from the mean: $\frac{\sum_{i=1}^{n}|x_{i}-\bar{x}|}{n}$  
- **Median absolute deviation from the median**: The median of the absolute values of the deviations from the median.  
- **Range**: The difference between the largest and the smallest value in a data set.  
- **Pth percentile**: the value such that at least P% of the data are less than or equal to it, and at least (100 - P)% are greater than or equal to it.  
For example, for the 10th percentile, about 10% of the data lie at or below the value and about 90% lie at or above it.  
The median is the 50th percentile.  
- **Interquartile range (IQR)**: the difference between the 75th percentile and the 25th percentile.  
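
The definitions above can be checked by hand on a tiny sample. The sketch below uses the made-up sample `[1, 4, 4]` and only the standard library. Note that the variance divides by $n - 1$, while the mean absolute deviation divides by $n$.

```python
import math
import statistics

data = [1, 4, 4]  # hypothetical sample, mean is 3
n = len(data)

mean = sum(data) / n
deviations = [x - mean for x in data]  # -2, 1, 1

variance = sum(d ** 2 for d in deviations) / (n - 1)  # sample variance
std_dev = math.sqrt(variance)                         # square root of variance
mean_abs_dev = sum(abs(d) for d in deviations) / n    # mean absolute deviation
value_range = max(data) - min(data)                   # largest minus smallest

print(variance, std_dev, mean_abs_dev, value_range)

# Cross-check against the standard library's sample statistics.
print(statistics.variance(data), statistics.stdev(data))
```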

The variance and standard deviation are especially sensitive to outliers.  
A robust estimate of variability is the median absolute deviation from the median (MAD):  
**Median absolute deviation** = $\text{Median}(|x_{1} - m|, |x_{2} - m|, \ldots, |x_{n} - m|)$, where $m$ is the median.  
Like the median, the MAD is not influenced by extreme values.
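
A short sketch of this robustness, using a hypothetical helper `median_abs_deviation` and made-up samples. It computes the raw (unscaled) MAD, so the numbers are not directly comparable to library functions that rescale the result.

```python
import statistics

def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    m = statistics.median(values)
    return statistics.median([abs(x - m) for x in values])

clean = [10, 11, 12, 12, 13]
with_outlier = clean + [200]  # same sample plus one extreme value

# The standard deviation changes dramatically when the outlier is added...
print(statistics.stdev(clean), statistics.stdev(with_outlier))

# ...while the MAD does not move for this sample.
print(median_abs_deviation(clean), median_abs_deviation(with_outlier))
```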

#### Pth Percentile

To find the 60th percentile of a data set with, say, 100 values: sort the values from lowest to highest, then pick the 60th value in that order. That value is the 60th percentile of the data set.  
Note: the 60th percentile is the same as the 0.6 *quantile*.  
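
The procedure described here corresponds to the "nearest rank" way of computing a percentile. Below is a minimal sketch with a hypothetical helper, `percentile_nearest_rank`. Be aware that library functions such as pandas' `quantile` interpolate between neighbouring values by default, so they can return slightly different numbers.

```python
import math

def percentile_nearest_rank(values, p):
    """Pth percentile: the value at 1-based rank ceil(p * n / 100) after sorting."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = list(range(1, 101))  # hypothetical data set: 1, 2, ..., 100

print(percentile_nearest_rank(data, 60))  # the 60th value in sorted order
print(percentile_nearest_rank(data, 50))  # the 50th value in sorted order
```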

#### Standard deviation

In [27]:
import statsmodels

state['Population'].std()


np.float64(6848235.347401142)

#### IQR

In [28]:
state['Population'].quantile(0.75) - state['Population'].quantile(0.25)


np.float64(4847308.0)

#### MAD

In [29]:
from statsmodels.robust.scale import mad

# By default statsmodels divides the raw MAD by about 0.6745, which makes it
# comparable to the standard deviation for normally distributed data.
mad(state['Population'])

np.float64(3849876.1459979336)