# Chapter 1: Exploratory Data Analysis

    Overview: This chapter focuses on the first step of any data science project: exploring the data.

## Data Structures
Two types of data structures:
* Unstructured data: Images, videos, Emails, Audio files
* Structured data: Relational databases, Spreadsheets, data tables (CSV, TSV), XML & JSON files

***Data Types***

There are two basic types of structured data: numeric and categorical.

* Numeric: Data expressed on a numeric scale
    * Continuous: Data that can take on any value in an interval 
        * Synonyms: interval, float, numeric
        * Examples: Time, meters, miles
    * Discrete: Data that can take only integer values
        * Synonyms: integer, count
        * Examples: Age, Number of students in a class, Number of rooms in a house

* Categorical: Data that can take on only a specific set of values representing a set of possible categories (Synonyms: enums, enumerated, factors, nominal)
    * Examples: (Red, Green, or Blue), (Cat, Dog, Mouse)
    * Binary: A special case of categorical data with just two categories of values
        * Synonyms: dichotomous, logical, indicator, boolean
        * Examples: 0 or 1, True or False
    * Ordinal: Categorical data that has an explicit ordering.
        * Synonyms: ordered factor
        * Examples: (Large, Medium, Small), movie ratings, levels of pain

***Data Typing in Software***

Different software handles data types differently. It's important to know how your software stores and treats each type, because the type determines which computations, summaries, and plots are valid for a given column.
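
As one illustration of how software encodes these types, the sketch below builds a small pandas data frame and inspects its dtypes. The `homes` table, its column names, and its values are hypothetical and chosen only to cover each data type described above.

```python
import pandas as pd

# Hypothetical toy data: one column per data type discussed above.
homes = pd.DataFrame({
    "price_per_sqft": [310.5, 275.0, 402.25],   # continuous
    "rooms": [3, 2, 5],                         # discrete
    "has_garage": [True, False, True],          # binary
    "style": ["ranch", "condo", "ranch"],       # categorical, no ordering
    "lot_size": ["Small", "Large", "Medium"],   # ordinal, plain strings for now
})

# Tell pandas that these columns are categorical rather than free text.
homes["style"] = homes["style"].astype("category")
homes["lot_size"] = pd.Categorical(
    homes["lot_size"],
    categories=["Small", "Medium", "Large"],  # explicit order, smallest first
    ordered=True,
)

print(homes.dtypes)

# Order-aware operations only make sense for the ordinal column.
largest_lot = homes["lot_size"].max()
print(largest_lot)
```

Without the explicit `ordered=True`, pandas would treat `lot_size` like any other unordered category and refuse order-based operations such as `max()`.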

***Key Terms for Rectangular Data***

Data Frame
* Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.

Feature
* A column within a table is commonly referred to as a *feature*.

Outcome
* Many data science projects involve predicting an *outcome* - often a yes/no outcome. The features are sometimes used to predict the *outcome* in an experiment or a study.
* Synonyms: dependent variable, response, target, output

Records
* A row within a table is commonly referred to as a *record*.
* Synonyms: case, example, instance, observation, pattern, sample
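
The terms above map directly onto a pandas data frame. The following is a minimal sketch with a hypothetical `loans` table; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data frame: each row is a record, each column is a feature.
loans = pd.DataFrame({
    "income": [52000, 87000, 43000, 120000],
    "loan_amount": [10000, 25000, 8000, 40000],
    "defaulted": [False, False, True, False],  # yes/no outcome
})

# Separate the features from the outcome they might be used to predict.
features = loans.drop(columns="defaulted")
outcome = loans["defaulted"]

n_records, n_columns = loans.shape
print(f"{n_records} records, {n_columns} columns")
print(f"Features: {list(features.columns)}")
print(f"Outcome: {outcome.name}")
```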

***Nonrectangular Data Structures***
* Time-series data
* Spatial data structures (mapping and location analytics)
* Graph data structures (physical, social, and abstract relationships)

## Estimates of Location
Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

**Key Terms for Estimates of Location**

Mean
>The sum of all values divided by the number of values.
>>Synonym: average

Weighted mean
>The sum of all values times a weight divided by the sum of the weights.
>>Synonym: weighted average

Median
>The value such that one-half of the data lies above and below.
>>Synonym: 50th percentile

Percentile
>The value such that *P* percent of the data lies below.
>>Synonym: quantile

Weighted median
>The value such that one-half of the sum of the weights lies above and below the sorted data.

Trimmed mean
>The average of all values after dropping a fixed number of extreme values.
>>Synonym: truncated mean

Robust
>Not sensitive to extreme values.
>>Synonym: resistant

Outlier
>A data value that is very different from most of the data.
>>Synonym: extreme value

**Mean**

$ \bar{x} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} x_i $

The most basic estimate of location is the mean, or *average* value. The mean is the sum of all values divided by the number of values. Consider the following set of numbers: {3, 5, 1, 2}. 

The mean is

$ \bar{x} = \dfrac{3 + 5 + 1 + 2}{4} = \dfrac{11}{4} = 2.75 $
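
The same arithmetic can be checked in plain Python. This is only a restatement of the worked example, not a recommended implementation for large data.

```python
values = [3, 5, 1, 2]

# Sum of all values divided by the number of values.
mean = sum(values) / len(values)

print(mean)  # 2.75
```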

**Trimmed mean**

$ \bar{x} = \dfrac{\displaystyle\sum_{i=p+1}^{\,n-p} x_{(i)}}{n - 2p} $

A variation of the mean is a *trimmed mean*, which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values. The formula above represents the sorted values by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where $x_{(1)}$ is the smallest value and $x_{(n)}$ is the largest, and omits the $p$ smallest and $p$ largest values.

A trimmed mean eliminates the influence of extreme values. For example, in international diving the top and bottom scores from five judges are dropped, and the final score is the average of the scores from the three remaining judges. This makes it difficult for a single judge to manipulate the score, perhaps to favor their country's contestant. Trimmed means are widely used, and in many cases are preferable to using the ordinary mean. 
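
A minimal sketch of the trimmed mean formula, applied to the diving example. The helper name `trimmed_mean` and the five judge scores are hypothetical.

```python
def trimmed_mean(values, p):
    """Average of the values left after dropping the p smallest and p largest."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    ordered = sorted(values)
    kept = ordered[p:n - p]  # x_(p+1) ... x_(n-p) in the notation above
    return sum(kept) / len(kept)


# Hypothetical scores from five judges; one judge scores far lower than the rest.
judge_scores = [9.5, 8.5, 8.0, 8.5, 6.0]

plain_score = sum(judge_scores) / len(judge_scores)
final_score = trimmed_mean(judge_scores, p=1)  # drop the top and bottom score

print(f"Plain mean:   {plain_score:.2f}")
print(f"Trimmed mean: {final_score:.2f}")
```

With `p=0` nothing is dropped and the function reduces to the ordinary mean.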

**Weighted Mean**

$ \bar{x}_w = \dfrac{\displaystyle\sum_{i=1}^{n} w_i x_i}{\displaystyle\sum_{i=1}^{n} w_i} $

Another type of mean is a *weighted mean*, which you calculate by multiplying each data value $x_i$ by a user-specified weight $w_i$ and dividing their sum by the sum of the weights.

There are two main motivations for using a weighted mean:
1. Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might downweight the data from that sensor.
2. The data collected does not equally represent the different groups that we are interested in measuring. For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented. 
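
The first motivation can be sketched with NumPy's `np.average`. The sensor readings and weights below are hypothetical; in practice the weights would come from what is known about each sensor's accuracy.

```python
import numpy as np

# Hypothetical readings from three sensors; the third is known to be less accurate.
readings = np.array([10.0, 10.2, 13.0])
weights = np.array([1.0, 1.0, 0.25])  # downweight the unreliable sensor

plain_mean = readings.mean()
weighted_mean = np.average(readings, weights=weights)

# The same result written out from the formula: sum(w_i * x_i) / sum(w_i).
manual = (weights * readings).sum() / weights.sum()

print(f"Plain mean:    {plain_mean:.3f}")
print(f"Weighted mean: {weighted_mean:.3f}")
```

The weighted mean sits closer to the two reliable sensors because the third reading contributes only a quarter as much.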

**Median and Robust Estimates**

The *median* is the middle number on a sorted list of the data. If there is an even number of data values, the middle value is one that is not actually in the data set, but rather the average of the two values that divide the sorted data into upper and lower halves. Compared to the mean, which uses all observations, the median depends only on the values in the center of the sorted data. While this might seem to be a disadvantage, since the mean is much more sensitive to the data, there are many instances in which the median is a better metric for estimating location. Let's say we want to look at typical household incomes in neighborhoods around Lake Washington in Seattle. In comparing the Medina neighborhood to the Windermere neighborhood, using the mean would produce very different results because Bill Gates lives in Medina. If we use the median, it won't matter how rich Bill Gates is - the position of the middle observation will remain the same.

For the same reasons one uses a weighted mean, it's also possible to compute a *weighted median*. As with the median, we first sort the data, although each data value has an associated weight. Instead of the middle number, the weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list. Like the median, the weighted median is robust to outliers.
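
One simple way to compute a weighted median is sketched below. The function name and the example data are hypothetical, and this variant returns an actual data value without interpolating, so it may differ slightly from library implementations that interpolate between neighbors.

```python
import numpy as np


def weighted_median(values, weights):
    """Smallest sorted value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort the values and carry each weight along with its value.
    order = np.argsort(values)
    values, weights = values[order], weights[order]

    cumulative = np.cumsum(weights)
    half = cumulative[-1] / 2.0
    return values[np.searchsorted(cumulative, half)]


# Hypothetical rates for four regions, weighted by population.
rates = [2.0, 4.0, 6.0, 8.0]
populations = [1, 1, 1, 10]

print(weighted_median(rates, populations))   # dominated by the largest region
print(weighted_median(rates, [1, 1, 1, 1]))  # equal weights
```

With equal weights and an even number of values, this sketch returns the lower of the two middle values rather than their average.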

**Outliers**

The median is referred to as a *robust* estimate of location since it's not influenced by *outliers* (extreme cases) that could skew the results. An outlier is any value that is very distant from the other values in a data set. 

An outlier itself doesn't make a data value invalid or erroneous (like the Bill Gates example). Still, outliers are often the result of data errors such as mixing data of different units (kilometers vs. meters) or bad readings from a sensor. When outliers are the result of bad data, the mean will result in a poor location estimate, while the median will still be valid. In any case, outliers should be identified and are usually worthy of further investigation.
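
To make the unit-mixing case concrete, the sketch below compares the mean and median of a hypothetical list of trip distances before and after one entry is mistakenly recorded in meters instead of kilometers.

```python
from statistics import mean, median

# Hypothetical trip distances in kilometers.
clean_km = [4.2, 5.1, 3.8, 4.9, 4.5]

# Same trips, but the fourth entry was recorded in meters by mistake.
with_error = [4.2, 5.1, 3.8, 4900.0, 4.5]

print(f"Clean:      mean={mean(clean_km):.2f}, median={median(clean_km):.2f}")
print(f"With error: mean={mean(with_error):.2f}, median={median(with_error):.2f}")
```

The single bad entry drags the mean far from any real trip distance, while the median is unchanged.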

> Anomaly Detection
>>In contrast to typical data analysis, where outliers are sometimes informative and sometimes a nuisance, in *anomaly detection* the points of interest are the outliers, and the greater mass of data serves primarily to define the "normal" against which anomalies are measured.

In [19]:
import pandas as pd
from scipy import stats

In [8]:
state_df = pd.read_csv("/Users/brian.v.nguyen/projects/practical_statistics_for_data_scientists/data/state.csv")
state_df.head(5)

| Unnamed: 0 | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


Compute the mean, trimmed mean, and median for the population

In [None]:
avg = state_df['Population'].mean()
med = state_df['Population'].median()
trimmed_avg = stats.trim_mean(state_df['Population'], 0.1)

print(f"Average: {avg}")
print(f"Median: {med}")
print(f"Trimmed Average: {trimmed_avg}")

Average: 6162876.3
Median: 4436369.5
Trimmed Average: 4783697.125


Compute the weighted mean and weighted median of the murder rate, using each state's population as the weight

In [23]:
import numpy as np
import wquantiles # pip3 install wquantiles

In [27]:
np_weighted_avg_murder_rate = np.average(
    state_df['Murder.Rate'],
    weights=state_df['Population']
    )

# wquantiles.median returns the weighted median, not a weighted average
weighted_median_murder_rate = wquantiles.median(
    state_df['Murder.Rate'],
    weights=state_df['Population']
    )

print(f"Weighted Avg. Murder Rate using NumPy: {np_weighted_avg_murder_rate}")
print(f"Weighted Median Murder Rate using wquantiles: {weighted_median_murder_rate}")

Weighted Avg. Murder Rate using NumPy: 4.445833981123393
Weighted Median Murder Rate using wquantiles: 4.4


## Estimates of Variability

Location is just one dimension of summarizing a feature. A second dimension, variability, also referred to as *dispersion*, measures whether the data values are tightly clustered or spread out. At the heart 