# Exploratory Data Analysis

Classical statistics focuses on **inference**, a set of procedures for drawing conclusions about large populations from small samples. **Data analysis** includes statistical inference as just one component.


## Estimates of Location
A basic step in exploring data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

Note on metrics and estimates: statisticians estimate and data scientists measure.

In [9]:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean


In [7]:
df = pd.read_csv('data/state.csv')
df.head()

| Unnamed: 0 | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


### Compute mean, trimmed mean, median
Idea: the mean is sensitive to extreme values (outliers), whereas the median and the trimmed mean are less sensitive to outliers and unusual distributions and are therefore more robust.

In [6]:
# The colon belongs after the label, not after the value
print('population mean: {}'.format(df['Population'].mean()))
print('population median: {}'.format(df['Population'].median()))
# trim_mean drops the lowest and highest 10% of values before averaging
print('population trimmed mean: {}'.format(trim_mean(df['Population'], .1)))

population mean: 6162876.3
population median: 4436369.5
population trimmed mean: 4783697.125


Note: mean > trimmed mean > median
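
To see why the three estimates differ, here is a minimal sketch on made-up data. The `values` array is hypothetical and is not taken from `state.csv`. It contains one extreme value, which pulls the mean far from the bulk of the data, while the median and the 10% trimmed mean barely move.

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical toy data: mostly small values plus one extreme outlier.
values = np.array([10, 11, 12, 13, 14, 15, 16, 17, 40, 1000])

mean_value = values.mean()
median_value = np.median(values)
# With 10 values, cutting 10% from each end removes the single smallest
# and the single largest value before averaging.
trimmed_value = trim_mean(values, 0.1)

print('mean: {}'.format(mean_value))
print('median: {}'.format(median_value))
print('trimmed mean: {}'.format(trimmed_value))
```

On this toy array the ordering is again mean > trimmed mean > median, because the outlier sits on the high side.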

### Compute average murder rate for the country
Note: we need a weighted mean here, weighting each state's murder rate by its population, because the states differ widely in population.

In [12]:
print('weighted mean {}'.format(np.average(df['Murder.Rate'], weights=df['Population'])))

weighted mean 4.445833981123393
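
As a rough illustration of what `np.average` does with `weights`, the sketch below computes a weighted mean by hand as sum(w * x) / sum(w) and compares it with the library call. The `rates` and `weights` arrays are invented for the example and do not come from the state data.

```python
import numpy as np

# Hypothetical toy data: one rate per group and the size of each group.
rates = np.array([2.0, 4.0, 10.0])
weights = np.array([1_000_000, 3_000_000, 500_000])

# Weighted mean by hand: sum of (weight * value) divided by sum of weights.
manual = (rates * weights).sum() / weights.sum()
# np.average performs the same calculation when weights are supplied.
library = np.average(rates, weights=weights)
# The plain mean treats every group as equally important.
unweighted = rates.mean()

print('manual weighted mean: {}'.format(manual))
print('np.average weighted mean: {}'.format(library))
print('unweighted mean: {}'.format(unweighted))
```

Here the smallest group has the highest rate, so the unweighted mean overstates the overall rate compared with the weighted mean.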


## Estimates of Variability

Variability refers to dispersion: whether the data values are tightly clustered or spread out. At the heart of statistics lies variability: measuring it, reducing it, distinguishing random from real, identifying the sources of real variability, and making decisions in the presence of it.

## Estimates Based on Percentiles

Statistics based on sorted (ranked) data are referred to as **order statistics**. The most basic is the range: the largest value in the data minus the smallest. The range is extremely sensitive to outliers and is not very useful as a general measure of dispersion. Percentile-based measures are less affected by outliers.
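
The sensitivity of the range can be seen in a short sketch. Both arrays are hypothetical: the second is the first with a single extreme value appended.

```python
import numpy as np

# Hypothetical toy data, with and without one extreme value.
values = np.array([3, 5, 7, 8, 9, 11])
with_outlier = np.append(values, 500)

# Range = largest value minus smallest value.
data_range = values.max() - values.min()
outlier_range = with_outlier.max() - with_outlier.min()

print('range without outlier: {}'.format(data_range))
print('range with outlier: {}'.format(outlier_range))
```

One added observation changes the range from 8 to 497, even though the other six values are unchanged.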

### Percentile:
The Pth percentile is a value such that at least P percent of the values take on this value or less and at least (100-P) percent of the values take on this value or more.
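
The definition can be checked directly in code. In the sketch below, `satisfies_percentile_definition` is a hypothetical helper written for this example, not a library function, and `values` is made-up data. It shows that more than one value can satisfy the definition, and that `np.percentile`, which by default interpolates linearly between neighbouring order statistics, returns a value lying between them here.

```python
import numpy as np

def satisfies_percentile_definition(values, candidate, p):
    """Check the definition: at least p% of values <= candidate
    and at least (100 - p)% of values >= candidate."""
    n = len(values)
    at_or_below = int(np.sum(values <= candidate))
    at_or_above = int(np.sum(values >= candidate))
    # Compare counts with integer arithmetic to avoid rounding issues.
    return at_or_below * 100 >= p * n and at_or_above * 100 >= (100 - p) * n

# Hypothetical toy data: the integers 1 through 10.
values = np.arange(1, 11)

# Both 3 and 4 qualify as a 30th percentile under the definition.
print(satisfies_percentile_definition(values, 3, 30))
print(satisfies_percentile_definition(values, 4, 30))
# 5 does not: only 60% of the values are 5 or more.
print(satisfies_percentile_definition(values, 5, 30))

# NumPy's default interpolation picks a value between 3 and 4.
interpolated = np.percentile(values, 30)
print('np.percentile result: {}'.format(interpolated))
```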

A common measurement of variability is the difference between the 25th percentile and the 75th percentile, called the **interquartile range** (IQR).
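
A minimal sketch of the IQR on made-up data, using `np.percentile` with its default linear interpolation. The `values` array is hypothetical and includes one outlier so that the IQR can be compared with the range.

```python
import numpy as np

# Hypothetical toy data with one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# 25th and 75th percentiles in one call.
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
data_range = values.max() - values.min()

print('25th percentile: {}'.format(q25))
print('75th percentile: {}'.format(q75))
print('IQR: {}'.format(iqr))
print('range: {}'.format(data_range))
```

The outlier dominates the range but has no effect on the IQR, which depends only on the middle half of the sorted data. On the state data, the same calculation could presumably be applied to `df['Population']`.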