# __Chapter 01__

# Exploratory Data Analysis

## Elements of Structured Data

There are two basic types of structured data: __numeric__ and __categorical__.

_Numeric_ data comes in two forms:
1. continuous, such as wind speed or time duration, and
2. discrete, such as the count of the occurrence of an event.

_Categorical_ data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.).
- Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false.
- Another useful type of categorical data is ordinal data, in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).

Why do we bother with a taxonomy of data types? It turns out that for the purposes
of data analysis and predictive modeling, the data type is important to help determine
the type of visual display, data analysis, or statistical model.

Software engineers and database programmers may wonder why we even need the
notion of categorical and ordinal data for analytics. After all, categories are merely a
collection of text (or numeric) values, and the underlying database automatically handles
the internal representation. However, explicit identification of data as categorical,
as distinct from text, does offer some advantages:

* Knowing that data is categorical can act as a signal telling software how statistical
procedures, such as producing a chart or fitting a model, should behave. In particular,
ordinal data can be represented as an `ordered.factor` in R, preserving a
user-specified ordering in charts, tables, and models. In Python, scikit-learn
supports ordinal data with the `sklearn.preprocessing.OrdinalEncoder`.
* Storage and indexing can be optimized (as in a relational database).
* The possible values a given categorical variable can take are enforced in the software (like an enum).
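
As a rough illustration of the first and third points, the sketch below declares an ordered categorical column. It uses pandas and its `CategoricalDtype`, which is an assumption on our part: the text above names only R's `ordered.factor` and scikit-learn's `OrdinalEncoder`. The rating labels are invented for the example.

```python
import pandas as pd

# Hypothetical survey ratings; the labels and their order are made up for illustration.
levels = ["low", "medium", "high"]
ratings = pd.Series(["medium", "low", "high", "medium"])

# Declaring the dtype tells pandas which values are allowed and how they are ordered.
rating_type = pd.CategoricalDtype(categories=levels, ordered=True)
ordinal = ratings.astype(rating_type)

# Sorting and comparisons follow the declared order, not alphabetical order.
sorted_labels = ordinal.sort_values().tolist()
above_low = (ordinal > "low").tolist()

# A value outside the declared set is not kept as a new category; astype maps it to NaN.
unknown = pd.Series(["medium", "extreme"]).astype(rating_type)
n_missing = int(unknown.isna().sum())

print(sorted_labels)
print(above_low)
print(n_missing)
```

Alphabetical sorting would place "high" first; the declared order is what keeps "low" ahead of "medium" and "high".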

## Rectangular Data

The typical frame of reference for an analysis in data science is a rectangular data
object, like a spreadsheet or database table.
Rectangular data is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).

__Data frame__
* Rectangular data (like a spreadsheet) is the basic data structure for statistical and
machine learning models.

__Feature__
* A column within a table is commonly referred to as a feature.
Synonyms: attribute, input, predictor, variable

__Outcome__
* Many data science projects involve predicting an outcome, often a yes/no outcome
(in Table 1-1, it is “auction was competitive or not”). The features are sometimes
used to predict the outcome in an experiment or a study.
Synonyms: dependent variable, response, target, output

__Records__
* A row within a table is commonly referred to as a record.
Synonyms: case, example, instance, observation, pattern, sample

## Data Frames and Indexes

Traditional database tables have one or more columns designated as an index, essentially a row number. This can vastly improve the efficiency of certain database queries.
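
A minimal sketch of the same idea in pandas, which is our choice of library here rather than something this section prescribes. The three rows are an excerpt of the state data used later in this chapter; the variable names are hypothetical.

```python
import pandas as pd

# Each row is a record, each column a feature.
df = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "abbreviation": ["AL", "AK", "AZ"],
    "population": [4779736, 710231, 6392017],
})

# Without further instruction, pandas assigns an integer row index (0, 1, 2, ...).
default_index = df.index.tolist()

# A column can be promoted to the index, which allows label-based lookups.
by_abbr = df.set_index("abbreviation")
alaska_population = int(by_abbr.loc["AK", "population"])

print(default_index)
print(alaska_population)
```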

## Estimates of Location

Variables with measured or count data might have thousands of distinct values. A
basic step in exploring your data is getting a “typical value” for each feature (variable):
an estimate of where most of the data is located (i.e., its central tendency).

### Key Terms for Estimates of Location

__Mean__:
* The sum of all values divided by the number of values.
    Synonym: average

__Weighted mean__:
* The sum of each value multiplied by its weight, divided by the sum of the weights.
    Synonym: weighted average.
    
__Median__:
* The value such that one-half of the data lies above and below.
    Synonym: 50th Percentile

__Percentile__:
* The value such that $P$ percent of the data lies below.
    Synonym: quantile.
    
__Weighted median__:
* The value such that one-half of the sum of the weights lies above and below the
    sorted data.

__Trimmed mean__:
* The average of all values after dropping a fixed number of extreme values.
    Synonym: truncated mean.
    
__Robust__:
* Not sensitive to extreme values. Synonym: resistant.

__Outlier__:
* A data value that is very different from most of the data.
    Synonym: extreme value.

Statisticians often use the term estimate for a value calculated from
the data at hand, to draw a distinction between what we see from
the data and the theoretical true or exact state of affairs. Data scientists and business analysts are more likely to refer to such a value as
a metric. The difference reflects the approach of statistics versus
that of data science: accounting for uncertainty lies at the heart of
the discipline of statistics, whereas concrete business or organizational objectives are the focus of data science. Hence, statisticians
estimate, and data scientists measure.

### Mean
The most basic estimate of location is the __mean__, or average value. The mean is the
sum of all values divided by the number of values.
The formula to compute the mean for a set of $n$ values $x_1 , x_2 , ..., x_n$ is:

$$ \bar{x} = \frac{\sum_{i = 1}^{n} x_i}{n}$$

N (or n) refers to the total number of records or observations. In
statistics it is capitalized if it is referring to a population, and lowercase if it refers to a sample from a population. In data science, that
distinction is not vital, so you may see it both ways.

A variation of the mean is a __trimmed mean__, which you calculate by dropping a fixed
number of sorted values at each end and then taking an average of the remaining values.
Representing the sorted values by $x_{(1)}, x_{(2)}, ..., x_{(n)}$ where $x_{(1)}$ is the smallest value
and $x_{(n)}$ the largest, the formula to compute the trimmed mean with $p$ smallest and
largest values omitted is:

$$ \bar{x} = \frac{\sum_{i = p + 1}^{n - p} x_{(i)}}{n - 2p}$$

A trimmed mean eliminates the influence of extreme values. For example, in international diving the top score and bottom score from five judges are dropped, and the
final score is the average of the scores from the three remaining judges. This makes it
difficult for a single judge to manipulate the score, perhaps to favor their country’s
contestant. Trimmed means are widely used, and in many cases are preferable to
using the ordinary mean.
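
The trimmed mean formula can be sketched in a few lines of plain Python. The function name `trimmed_mean` and the judges' scores are hypothetical; `p` is the number of values dropped at each end, as in the formula above.

```python
def trimmed_mean(values, p):
    """Mean of the values after dropping the p smallest and p largest."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    ordered = sorted(values)
    kept = ordered[p:n - p]  # positions p + 1 through n - p of the sorted values
    return sum(kept) / (n - 2 * p)

# Hypothetical scores from five diving judges.
scores = [7.5, 8.0, 8.0, 8.5, 9.5]
final_score = trimmed_mean(scores, p=1)  # drops 7.5 and 9.5
print(final_score)
```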

Another type of mean is a __weighted mean__, which you calculate by multiplying each
data value $x_i$ by a user-specified weight $w_i$ and dividing their sum by the sum of the
weights. The formula for a weighted mean is:

$$ \bar{x}_w = \frac{\sum_{i = 1}^{n} w_i x_i}{\sum_{i = 1}^{n} w_i}$$

There are two main motivations for using a weighted mean:
* Some values are intrinsically more variable than others, and highly variable
observations are given a lower weight. For example, if we are taking the average
from multiple sensors and one of the sensors is less accurate, then we might
downweight the data from that sensor.
* The data collected does not equally represent the different groups that we are
interested in measuring. For example, because of the way an online experiment
was conducted, we may not have a set of data that accurately reflects all groups in
the user base. To correct that, we can give a higher weight to the values from the
groups that were underrepresented.
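
The first motivation can be illustrated with a short sketch. The readings and weights are invented, and the choice of 0.25 as the weight for the less accurate sensor is arbitrary.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical temperature readings; the third sensor is assumed to be less accurate.
readings = [20.0, 20.4, 23.0]
weights = [1.0, 1.0, 0.25]

plain = sum(readings) / len(readings)
weighted = weighted_mean(readings, weights)
print(plain, weighted)
```

Downweighting the third sensor pulls the estimate toward the two readings that agree with each other.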

### Median and Robust Estimates

The median is the middle number on a sorted list of the data. If there is an even number of data values, the middle value is one that is not actually in the data set, but
rather the average of the two values that divide the sorted data into upper and lower
halves. Compared to the mean, which uses all observations, the median depends only
on the values in the center of the sorted data.

While this might seem to be a disadvantage,
since the mean is much more sensitive to the data, there are many instances in
which the median is a better metric for location. Let’s say we want to look at typical
household incomes in neighborhoods around Lake Washington in Seattle. In comparing the Medina neighborhood to the Windermere neighborhood, using the mean
would produce very different results because Bill Gates lives in Medina. If we use the
median, it won’t matter how rich Bill Gates is—the position of the middle observation
will remain the same.
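
To make the contrast concrete, here is a small sketch using Python's standard `statistics` module. The income figures (in thousands of dollars) are invented and are not real data for Medina or Windermere.

```python
from statistics import mean, median

# Hypothetical household incomes, in thousands of dollars.
incomes = [60, 75, 80, 95, 110]

# The same neighborhood with one extremely wealthy household added.
with_outlier = incomes + [1_000_000]

median_before = median(incomes)
median_after = median(with_outlier)  # even count: average of the two middle values
mean_before = mean(incomes)
mean_after = mean(with_outlier)

print(median_before, median_after)
print(mean_before, mean_after)
```

The median moves only from one middle household to the average of two neighboring ones, while the mean is dominated by the single extreme value.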

For the same reasons that one uses a weighted mean, it is also possible to compute a
weighted median. As with the median, we first sort the data, although each data value
has an associated weight. Instead of the middle number, the weighted median is a
value such that the sum of the weights is equal for the lower and upper halves of the
sorted list. Like the median, the weighted median is robust to outliers.
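
One simple way to compute a weighted median is sketched below. It returns the smallest value at which the cumulative weight reaches half of the total; this is only one of several conventions, and the function name and data are hypothetical.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))  # sort by value, keeping each weight attached
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must be non-empty")

# With equal weights the result is the ordinary middle value.
equal = weighted_median([5, 1, 3], [1, 1, 1])

# A heavy weight on the largest value pulls the weighted median toward it.
heavy = weighted_median([1, 2, 3, 4], [1, 1, 1, 10])
print(equal, heavy)
```

Packages may interpolate between neighboring values instead of returning an observed value, so their results can differ slightly from this sketch.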

### Outliers
The median is referred to as a robust estimate of location since it is not influenced by
outliers (extreme cases) that could skew the results. An outlier is any value that is very
distant from the other values in a data set.

Being an outlier in itself does
not make a data value invalid or erroneous (as in the previous example with Bill
Gates). Still, outliers are often the result of data errors such as mixing data of different
units (kilometers versus meters) or bad readings from a sensor. When outliers are the
result of bad data, the mean will result in a poor estimate of location, while the
median will still be valid. In any case, outliers should be identified and are usually
worthy of further investigation.

#### Anomaly Detection
In contrast to typical data analysis, where outliers are sometimes
informative and sometimes a nuisance, in anomaly detection the
points of interest are the outliers, and the greater mass of data
serves primarily to define the “normal” against which anomalies
are measured.

The median is not the only robust estimate of location. In fact, a trimmed mean is
widely used to avoid the influence of outliers. For example, trimming the bottom and
top 10% (a common choice) of the data will provide protection against outliers in all
but the smallest data sets. The trimmed mean can be thought of as a compromise
between the median and the mean: it is robust to extreme values in the data, but uses
more data to calculate the estimate for location.
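
The sketch below compares the three estimates on data containing a unit error of the kind described earlier. The trip lengths are invented, and `trimmed_mean_fraction` is a hypothetical helper that trims a proportion of the data rather than a fixed count.

```python
from statistics import mean, median

def trimmed_mean_fraction(values, fraction):
    """Mean after dropping `fraction` of the sorted values from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(fraction * len(ordered))  # number dropped from each end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical trip lengths in kilometers; the last one was mistakenly recorded in meters.
trips_km = [4.2, 5.0, 5.1, 5.3, 5.6, 5.8, 6.0, 6.1, 6.4, 5500.0]

plain = mean(trips_km)
trimmed = trimmed_mean_fraction(trips_km, 0.1)
middle = median(trips_km)
print(plain, trimmed, middle)
```

In this sketch, a 10% trim on fewer than ten values rounds down to dropping nothing, which is one way to read the caveat about the smallest data sets.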

#### Other Robust Metrics for Location
Statisticians have developed a plethora of other estimators for location, primarily with the goal of developing an estimator more
robust than the mean and also more efficient (i.e., better able to
discern small location differences between data sets). While these
methods are potentially useful for small data sets, they are not
likely to provide added benefit for large or even moderately sized
data sets.

### Example: Location Estimates of Population and Murder Rates

The following example shows the first few rows of the data set containing population and murder
rates (in units of murders per 100,000 people per year) for each US state (2010
Census).

In [1]:
import numpy as np
from scipy import stats
import pandas as pd

In [2]:
state = pd.read_csv('../data/state.csv')

In [3]:
state.head()

| Unnamed: 0 | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


In [4]:
state['Population'].mean()

6162876.3

In [5]:
stats.trim_mean(state['Population'], 0.1)

4783697.125

In [6]:
state['Population'].median()

4436369.5

The mean is bigger than the trimmed mean, which is bigger than the median.

This is because the trimmed mean excludes the largest and smallest five states
(the second argument of `stats.trim_mean`, `proportiontocut = 0.1`, drops 10% from each end). If we want to compute the average murder rate
for the country, we need to use a weighted mean or median to account for different
populations in the states.

The weighted mean is available with NumPy:

In [7]:
np.average(state['Murder.Rate'], weights = state['Population'])

4.445833981123393

For the weighted median, we can use the specialized package `wquantiles`:

In [9]:
import wquantiles
wquantiles.median(state['Murder.Rate'], weights = state['Population'])

4.4

In statistics, a __central tendency__ (or measure of central tendency) is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages. The term central tendency dates from the late 1920s.

The most common measures of __central tendency__ are the arithmetic mean, the median, and the mode. A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution. Occasionally authors use central tendency to denote "the tendency of quantitative data to cluster around some central value."

The central tendency of a distribution is typically contrasted with its __dispersion__ or __variability__; dispersion and central tendency are the most often characterized properties of distributions. Analysis may judge whether data has a strong or a weak central tendency based on its dispersion.

The following measures may be applied to one-dimensional data. Depending on the circumstances, it may be appropriate to transform the data before calculating a central tendency. Examples are squaring the values or taking logarithms. Whether a transformation is appropriate, and what it should be, depends heavily on the data being analyzed.

__Arithmetic mean__ or simply, __mean__
*    the sum of all measurements divided by the number of observations in the data set.

__Median__
*    the middle value that separates the higher half from the lower half of the data set. __The median and the mode are the only measures of central tendency that can be used for ordinal data, in which values are ranked relative to each other but are not measured absolutely__.

__Mode__
*    the most frequent value in the data set. __This is the only central tendency measure that can be used with nominal data, which have purely qualitative category assignments__.

__Geometric mean__
*    the nth root of the product of the data values, where there are n of these. This measure is valid only for data that are measured absolutely on a strictly positive scale.

$$ G = \Big( \prod_{i = 1}^{n} x_i \Big) ^{\frac{1}{n}} = \sqrt[n]{x_1 x_2 ... x_n}$$

__Harmonic mean__
*    the reciprocal of the arithmetic mean of the reciprocals of the data values. This measure too is valid only for data that are measured absolutely on a strictly positive scale.

$$ H = \frac{n}{\sum_{i = 1}^{n} \frac{1}{x_i}}$$

__Weighted arithmetic mean__
*    an arithmetic mean that incorporates weighting to certain data elements.

__Truncated mean or trimmed mean__
*    the arithmetic mean of data values after a certain number or proportion of the highest and lowest data values have been discarded.

__Interquartile mean__
*    a truncated mean based on data within the interquartile range.

__Midrange__
*    the arithmetic mean of the maximum and minimum values of a data set.

__Midhinge__
*    the arithmetic mean of the first and third quartiles.

__Trimean__
*    the weighted arithmetic mean of the median and two quartiles.

__Winsorized mean__
*    an arithmetic mean in which extreme values are replaced by values closer to the median.
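
Several of the measures above can be tried with Python's standard `statistics` module, as in this sketch. The data values are invented, `winsorized_mean` is a hypothetical helper, and the quartiles follow the module's default convention, which is only one of several in use.

```python
from statistics import geometric_mean, harmonic_mean, mean, median, quantiles

# Strictly positive values, as the geometric and harmonic means require.
data = [1, 2, 4, 8, 16]

arithmetic = mean(data)
geometric = geometric_mean(data)
harmonic = harmonic_mean(data)
midrange = (min(data) + max(data)) / 2

# Quartile conventions differ between tools; this is the statistics module default.
q1, _, q3 = quantiles(data, n=4)
midhinge = (q1 + q3) / 2
trimean = (q1 + 2 * median(data) + q3) / 4

def winsorized_mean(values, k):
    """Mean after replacing the k lowest and k highest values with their nearest kept neighbors."""
    ordered = sorted(values)
    n = len(ordered)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n / 2")
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(x, low), high) for x in ordered]
    return sum(clipped) / n

winsorized = winsorized_mean(data, k=1)

print(arithmetic, geometric, harmonic)
print(midrange, midhinge, trimean, winsorized)
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.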



Any of the above may be applied to each dimension of multi-dimensional data, but the results may not be invariant to rotations of the multi-dimensional space. In addition, there are the

__Geometric median__
*    which minimizes the sum of distances to the data points. This is the same as the median when applied to one-dimensional data, but it is not the same as taking the median of each dimension independently. It is not invariant to different rescaling of the different dimensions.

__Quadratic mean__ (often known as the root mean square)
*    useful in engineering, but not often used in statistics. This is because it is not a good indicator of the center of the distribution when the distribution includes negative values.

__Simplicial depth__
*    the probability that a randomly chosen simplex with vertices from the given distribution will contain the given center

__Tukey median__
*    a point with the property that every halfspace containing it also contains many sample points
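
The remark about the quadratic mean and negative values can be seen in a tiny sketch. The data are invented and chosen to be symmetric around zero.

```python
from math import sqrt
from statistics import mean

def quadratic_mean(values):
    """Root mean square: square root of the mean of the squared values."""
    return sqrt(sum(x * x for x in values) / len(values))

# Values symmetric around zero, so the center is clearly 0.
symmetric = [-3, -1, 1, 3]

center = mean(symmetric)
rms = quadratic_mean(symmetric)
print(center, rms)
```

Because squaring discards the sign, the root mean square reflects the typical magnitude of the values rather than where they are centered.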

For unimodal distributions (distributions possessing a unique mode) the following bounds are known and are sharp:

$$ \frac{|\theta - \mu|}{\sigma} \le \sqrt{3} $$

$$ \frac{|\nu - \mu|}{\sigma} \le \sqrt{0.6} $$

$$ \frac{|\theta - \nu|}{\sigma} \le \sqrt{3}$$

where $\mu$ is the mean, $\nu$ is the median, $\theta$ is the mode, and $\sigma$ is the standard deviation. 

For every distribution

$$ \frac{|\nu - \mu|}{\sigma} \le 1 $$
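
As a numerical sanity check rather than a proof, the sketch below evaluates this ratio on a few samples, treating each sample as its own distribution and therefore using the population standard deviation. The helper name, the data, and the random seed are all arbitrary choices.

```python
import random
from statistics import mean, median, pstdev

def mean_median_gap(values):
    """Return |median - mean| / sigma, using the population standard deviation."""
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0  # all values identical, so mean and median coincide
    return abs(median(values) - mean(values)) / sigma

# A strongly skewed hand-made sample.
skewed = [1, 1, 1, 1, 100]
gap_skewed = mean_median_gap(skewed)

# Twenty right-skewed random samples drawn from an exponential distribution.
rng = random.Random(0)
gaps = [
    mean_median_gap([rng.expovariate(1.0) for _ in range(200)])
    for _ in range(20)
]
largest_gap = max(gaps)

print(gap_skewed, largest_gap)
```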

## Estimates of Variability
Location is just one dimension in summarizing a feature. A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out. At the heart of statistics lies variability: measuring it, reducing it,
distinguishing random from real variability, identifying the various sources of real
variability, and making decisions in the presence of it.