# Estimates of Location

Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

## KEY TERMS FOR ESTIMATES OF LOCATION
### Mean
- The sum of all values divided by the number of values.
    - Synonym
        -  average

### Weighted mean
- The sum of each value multiplied by its weight, divided by the sum of the weights.
    - Synonym
        - weighted average

### Median
- The value such that one-half of the data lies above it and one-half lies below it.
    - Synonym
        - 50th percentile

### Percentile
- The value such that P percent of the data lies below it.
    - Synonym
        - quantile

### Weighted median
- The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it.

### Trimmed mean
- The average of all values after dropping a fixed number of extreme values.
    - Synonym
        - truncated mean

### Robust
- Not sensitive to extreme values.
    - Synonym
        - resistant

### Outlier
- A data value that is very different from most of the data.
    - Synonym
        - extreme value
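
As a quick illustration of the definitions above, here is a minimal sketch using only the Python standard library. The `values` and `weights` lists are hypothetical toy data, not taken from the state data set used later.

```python
import statistics

# Hypothetical toy data, chosen only to illustrate the definitions.
values = [2, 3, 5, 7, 100]      # 100 plays the role of an outlier
weights = [10, 10, 10, 10, 1]   # the outlier gets a small weight

# Mean: sum of all values divided by the number of values.
mean = sum(values) / len(values)

# Weighted mean: sum of value * weight divided by the sum of the weights.
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Median: the middle value of the sorted data.
median = statistics.median(values)

print(mean)                     # 23.4
print(round(weighted_mean, 3))  # 6.585
print(median)                   # 5
```

The single large value pulls the mean far above the other four values, while the median and the down-weighted mean stay close to them.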

## Example: Location Estimates of Population and Murder Rates

In [1]:
%matplotlib inline

from pathlib import Path

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
state = pd.read_csv(r'G:\data_science\data\state.csv')
print(state.head(8))

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE


In [6]:
state.shape

(50, 4)

Compute the mean, trimmed mean, and median for Population. For mean and median we can use the pandas methods of the data frame. The trimmed mean requires the trim_mean function in scipy.stats.

In [9]:
#mean
state.Population.mean()

6162876.3

In [13]:
# trimmed mean: drop 10% of the values from each end before averaging
trim_mean(state['Population'], 0.1)

4783697.125
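
To make the trimming step concrete, the following is a minimal sketch of a trimmed mean written from the definition, assuming the same convention as the call above: a proportion of the sorted values is dropped from each end. The helper name `trimmed_mean` and the `sample` list are hypothetical, and this is not the scipy implementation.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))   # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

print(sum(sample) / len(sample))   # plain mean: 104.5
print(trimmed_mean(sample, 0.1))   # drops 1 and 1000: 5.5
```

With 50 states and a proportion of 0.1, this convention drops the five smallest and the five largest populations before averaging.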

In [14]:
#median
state['Population'].median()

4436369.5

### The mean is bigger than the trimmed mean, which is bigger than the median.

In [15]:
#weighted mean 
np.average(state['Murder.Rate'], weights=state['Population'])

4.445833981123393
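
The weighted mean can also be written out directly from its definition. The sketch below uses only the eight rows printed by state.head(8), so its result is illustrative and will not match the 50-state figure; the variable names are assumptions made for this example.

```python
# Rows copied from the state.head(8) output; NOT the full 50-state table.
populations = [4779736, 710231, 6392017, 2915918,
               37253956, 5029196, 3574097, 897934]
murder_rates = [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8]

# Weighted mean: sum(rate * population) / sum(population).
weighted_sum = sum(r * p for r, p in zip(murder_rates, populations))
weighted_mean = weighted_sum / sum(populations)

# Unweighted mean for comparison: every state counts equally.
plain_mean = sum(murder_rates) / len(murder_rates)

print(round(plain_mean, 3))
print(round(weighted_mean, 3))
```

Because California has by far the largest population among these eight rows, its rate of 4.4 dominates the weighted result.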

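The murder rate is weighted by population so that each state contributes in proportion to the number of people living there. The weighted mean is available with numpy (np.average). For the weighted median, we can use the specialised package wquantiles.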

In [17]:
wquantiles.median(state['Murder.Rate'], weights=state['Population'])

4.4
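
For intuition, a weighted median can be sketched from the definition: sort the values, accumulate the weights, and return the first value at which the running total reaches half of the total weight. The helper `weighted_median` and the toy data are hypothetical; wquantiles may use a different convention (for example, interpolation), so its results can differ from this sketch.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))   # sort by value, keep weights aligned
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value


# Hypothetical toy data: the last value carries most of the weight.
values = [1, 2, 3, 4]
weights = [1, 1, 1, 10]

print(weighted_median(values, weights))        # 4
print(weighted_median(values, [1, 1, 1, 1]))   # 2 (lower of the two middle values)
```

With equal weights and an even number of values, this sketch returns the lower middle value rather than averaging the two, which is one of the places where other implementations may disagree.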

### The weighted mean and the weighted median are about the same.

### KEY IDEAS
- The basic metric for location is the mean, but it can be sensitive to extreme values (outliers).

- Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions and hence are more robust.
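
The sensitivity described in these key ideas can be seen in a small experiment. This is a minimal sketch with hypothetical data: a single extreme value is appended to a short list, and the mean and median are compared before and after.

```python
import statistics

# Hypothetical data without and with one extreme value.
data = [3, 4, 5, 6, 7]
with_outlier = data + [500]

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # 5 87.5
print(median_before, median_after)  # 5 5.5
```

One outlier moves the mean from 5 to 87.5, while the median only shifts from 5 to 5.5.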