## Calculating mean, trimmed mean, median, weighted mean and weighted median


• The basic metric for location is the mean, but it can be sensitive to extreme values (outliers).  
• Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions and hence are more robust.  
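
As a small illustration of the sensitivity described above, the sketch below compares the mean and the median on a toy sample before and after one value is replaced by an outlier. The data are made up for the example and are not taken from `state.csv`.

```python
import numpy as np

# Hypothetical toy data: ten ordinary values.
values = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=float)

# Same data, but the largest value is replaced by an extreme outlier.
with_outlier = values.copy()
with_outlier[-1] = 1000.0

mean_clean = values.mean()                 # 14.5
mean_outlier = with_outlier.mean()         # 112.6, pulled far upward
median_clean = np.median(values)           # 14.5
median_outlier = np.median(with_outlier)   # 14.5, unchanged

print(mean_clean, mean_outlier)
print(median_clean, median_outlier)
```

A single extreme value moves the mean a long way, while the median stays where it was.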

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
import wquantiles

state = pd.read_csv('data/state.csv')
state.head()

| Unnamed: 0 | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


#### Mean

In [3]:
state['Population'].mean()

np.float64(6162876.3)

#### Trimmed mean

In [4]:
trim_mean(state['Population'], 0.1) # drop 10% from each end

np.float64(4783697.125)
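
To make the trimming step explicit, here is a minimal sketch of what a trimmed mean does: sort the values, drop a fixed share from each end, and average what remains. The helper `manual_trim_mean` and the sample values are hypothetical and exist only for this illustration; the comparison with `trim_mean` assumes SciPy truncates the cut count to an integer, which is how its documentation describes it.

```python
import numpy as np
from scipy.stats import trim_mean

def manual_trim_mean(x, proportion):
    """Sort, drop int(proportion * n) values from each end, average the rest."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(proportion * len(x))  # number of values cut from each end
    return x[k:len(x) - k].mean()

# Hypothetical sample with one large value at the top.
sample = np.array([2, 4, 4, 5, 6, 7, 8, 9, 12, 90])

manual = manual_trim_mean(sample, 0.1)   # drops 2 and 90, averages the other eight
library = trim_mean(sample, 0.1)

print(manual, library)  # both 6.875
```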

#### Median

In [5]:
state['Population'].median()

np.float64(4436369.5)

#### Weighted mean

In [6]:
np.average(state['Murder.Rate'], weights=state['Population'])

np.float64(4.445833981123393)
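
The weighted mean is the sum of each value multiplied by its weight, divided by the sum of the weights. The sketch below checks that formula against `np.average` on made-up numbers; `rates` and `weights` are hypothetical stand-ins for the murder rate and population columns.

```python
import numpy as np

rates = np.array([2.0, 4.0, 6.0])     # hypothetical rates
weights = np.array([1.0, 1.0, 2.0])   # hypothetical population weights

# Weighted mean by hand: sum(w_i * x_i) / sum(w_i)
manual = (rates * weights).sum() / weights.sum()
library = np.average(rates, weights=weights)

print(manual, library)   # both 4.5
print(rates.mean())      # 4.0, the unweighted mean for comparison
```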

#### Weighted median

In [7]:
wquantiles.median(state['Murder.Rate'], weights=state['Population'])

np.float64(4.4)
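
One simple way to think about a weighted median: sort the values, accumulate their weights, and take the first value at which the running weight reaches half of the total. The sketch below implements that simplified definition. The function `weighted_median` is hypothetical and does no interpolation, so it may differ slightly from what `wquantiles.median` returns.

```python
import numpy as np

def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)                 # sort values, carry weights along
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[idx]

# Hypothetical data: the largest value carries most of the weight.
heavy_top = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])   # 4.0
equal = weighted_median([3, 1, 2], [1, 1, 1])             # 2.0

print(heavy_top, equal)
```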

## Variability (dispersion)

Variability measures how tightly the data are clustered, that is, how close the values are to each other.  

**Deviation (error, residual)**: the difference between an observed value and the estimate of location.  
**Variance**: a measure of how spread out the data points are around the mean. It is roughly the average of the squared differences between each observed data point and the mean.  
To compute it, take the difference between each data point and the mean, square each difference, sum the squares, and divide the sum by the number of data points minus one:  
$\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}{n-1}$  
**Standard deviation**: The square root of the variance.  
**Mean absolute deviation**: The mean of the absolute values of the deviations from the mean.  
**Median absolute deviation from the median**: The median of the absolute values of the deviations from the median.  
**Range**: The difference between the largest and the smallest value in a data set.  
**X percentile**: The value such that X% of the data is less than it and (100 - X)% is greater than or equal to it.  
For example, at the 10th percentile, 10% of the data is less than the value and 90% is greater than or equal to it.  
The median is the 50th percentile.  
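
The definitions above can be sketched in a few lines of pandas and NumPy. The sample `x` is made up for the illustration and is not taken from `state.csv`; the variance and standard deviation are computed by hand with the $n-1$ denominator so they can be compared with `Series.var()` and `Series.std()`, which use the same denominator by default.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

deviations = x - x.mean()                            # deviations from the mean
variance = (deviations ** 2).sum() / (len(x) - 1)    # sample variance (n - 1)
std_dev = np.sqrt(variance)                          # standard deviation
mean_abs_dev = deviations.abs().mean()               # mean absolute deviation
median_abs_dev = (x - x.median()).abs().median()     # median absolute deviation from the median
value_range = x.max() - x.min()                      # range
p10 = x.quantile(0.10)                               # 10th percentile
p50 = x.quantile(0.50)                               # 50th percentile, i.e. the median

print(variance, x.var())
print(std_dev, x.std())
print(mean_abs_dev, median_abs_dev, value_range)
print(p10, p50)
```

Note that `Series.quantile` interpolates between neighbouring values by default, so percentiles on small samples need not coincide with an observed data point.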