# Chapter 1: Exploratory Data Analysis

    Overview: This chapter focuses on the first step of any data science project: exploring the data.

## Data Structures
Two types of data structures:
* Unstructured data: Images, videos, Emails, Audio files
* Structured data: Relational databases, Spreadsheets, data tables (CSV, TSV), XML & JSON files

***Data Types***

There are two basic types of structured data: numeric and categorical.

* Numeric: Data expressed on a numeric scale
    * Continuous: Data that can take on any value in an interval 
        * Synonyms: interval, float, numeric
        * Examples: Time, meters, miles
    * Discrete: Data that can take only integer values
        * Synonyms: integer, count
        * Examples: Age, Number of students in a class, Number of rooms in a house

* Categorical: Data that can take on only a specific set of values representing a set of possible categories
    * Synonyms: enums, enumerated, factors, nominal
    * Examples: (Red, Green, Blue), (Cat, Dog, Mouse)
    * Binary: A special case of categorical data with just two categories of values
        * Synonyms: dichotomous, logical, indicator, boolean
        * Examples: 0 or 1, True or False
    * Ordinal: Categorical data that has an explicit ordering
        * Synonyms: ordered factor
        * Examples: (Large, Medium, Small), movie ratings, levels of pain

***Data Typing in Software***

Different software handles data types differently. As a result, it's important to know how your software handles data types so that you can conduct analyses appropriately.
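As one illustration of how a specific tool represents these types, here is a minimal pandas sketch. The column names and values are hypothetical, invented only for this example; the point is that an ordinal variable stays a plain string column until its ordering is declared.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "height_m": [1.62, 1.80, 1.75],         # continuous
    "rooms": [3, 5, 4],                     # discrete
    "owns_home": [True, False, True],       # binary
    "size": ["Small", "Large", "Medium"],   # ordinal, but plain strings so far
})

# Declare the ordering explicitly so pandas treats "size" as ordinal.
size_type = pd.CategoricalDtype(
    categories=["Small", "Medium", "Large"], ordered=True
)
df["size"] = df["size"].astype(size_type)

print(df.dtypes)

# With an ordered categorical, min/max follow the declared order
# rather than alphabetical order.
smallest_size = df["size"].min()
print(smallest_size)
```

Without the explicit `CategoricalDtype`, sorting or comparing the `size` column would fall back to alphabetical order, which is rarely what an ordinal variable means.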

***Key Terms for Rectangular Data***

Data Frame
* Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.

Feature
* A column within a table is commonly referred to as a *feature*.

Outcome
* Many data science projects involve predicting an *outcome* - often a yes/no outcome. The features are sometimes used to predict the *outcome* in an experiment or a study.
* Synonyms: dependent variable, response, target, output

Records
* A row within a table is commonly referred to as a *record*.
* Synonyms: case, example, instance, observation, pattern, sample
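The terms above can be mapped onto a pandas `DataFrame` as in the following sketch. The loan data, column names, and the choice of `defaulted` as the outcome are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical loan data; all names and values are made up for illustration.
loans = pd.DataFrame({
    "income": [42000, 58000, 31000, 77000],
    "loan_amount": [5000, 12000, 8000, 20000],
    "defaulted": [False, False, True, False],   # a yes/no outcome
})

features = loans[["income", "loan_amount"]]   # columns used as predictors
outcome = loans["defaulted"]                  # column to be predicted
first_record = loans.iloc[0]                  # one row is one record

n_records, n_columns = loans.shape
print(n_records, n_columns)
```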

***Nonrectangular Data Structures***
* Time-series data
* Spatial data structures (mapping and location analytics)
* Graph data structures (physical, social, and abstract relationships)

## Estimates of Location
Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

**Key Terms for Estimates of Location**

Mean
>The sum of all values divided by the number of values.
>>Synonym: average

Weighted mean
>The sum of each value multiplied by its weight, divided by the sum of the weights.
>>Synonym: weighted average

Median
>The value such that one-half of the data lies above it and one-half below.
>>Synonym: 50th percentile

Percentile
>The value such that *P* percent of the data lies below.
>>Synonym: quantile

Weighted median
>The value in the sorted data such that one-half of the sum of the weights lies above it and one-half below.

Trimmed mean
>The average of all values after dropping a fixed number of extreme values.
>>Synonym: truncated mean

Robust
>Not sensitive to extreme values.
>>Synonym: resistant

Outlier
>A data value that is very different from most of the data.
>>Synonym: extreme value
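Several of these terms can be computed directly with NumPy, as in the sketch below. The values and weights are hypothetical. NumPy has no built-in weighted median, so `weighted_median` is an illustrative helper that uses one simple convention (the first sorted value at which the cumulative weight reaches half the total); other conventions exist.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])
weights = np.array([1.0, 1.0, 1.0, 5.0])   # hypothetical weights

weighted_mean = np.average(values, weights=weights)
median = np.median(values)
p75 = np.percentile(values, 75)            # 75th percentile


def weighted_median(values, weights):
    """Illustrative helper: first sorted value whose cumulative
    weight reaches half of the total weight."""
    order = np.argsort(values)
    sorted_values = values[order]
    cumulative = np.cumsum(weights[order])
    idx = np.searchsorted(cumulative, cumulative[-1] / 2.0)
    return sorted_values[idx]


w_median = weighted_median(values, weights)
print(weighted_mean, median, p75, w_median)
```

Because the largest value carries most of the weight here, both the weighted mean and the weighted median are pulled above their unweighted counterparts.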

**Mean**

$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i $

The most basic estimate of location is the mean, or *average* value. The mean is the sum of all values divided by the number of values. Consider the following set of numbers: {3, 5, 1, 2}. 

The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75.
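The same calculation can be checked in code. This short sketch computes the mean of the four numbers both by hand and with pandas; the variable names are arbitrary.

```python
import pandas as pd

values = [3, 5, 1, 2]

# Direct translation of the formula: sum of the values divided by their count.
manual_mean = sum(values) / len(values)

# The same estimate using pandas.
pandas_mean = pd.Series(values).mean()

print(manual_mean, pandas_mean)
```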

**Trimmed mean**

$ \bar{x} = \dfrac{\displaystyle\sum_{i=p+1}^{\,n-p} x_{(i)}}{n - 2p} $

A variation of the mean is a *trimmed mean*, which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values. In the formula above, $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ represent the sorted values, where $x_{(1)}$ is the smallest value and $x_{(n)}$ is the largest, and $p$ is the number of values dropped from each end.

A trimmed mean eliminates the influence of extreme values. For example, in international diving the top and bottom scores from five judges are dropped, and the final score is the average of the scores from the three remaining judges. This makes it difficult for a single judge to manipulate the score, perhaps to favor their country's contestant. Trimmed means are widely used, and in many cases are preferable to using the ordinary mean. 
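The diving example can be sketched in plain Python. The five scores below are invented for illustration, and `trimmed_mean` is a hypothetical helper written to follow the formula above (drop the `p` smallest and `p` largest values, then average the rest).

```python
def trimmed_mean(values, p):
    """Average of the values after dropping the p smallest and p largest."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    kept = sorted(values)[p:n - p]
    return sum(kept) / len(kept)


# Hypothetical scores from five judges; 6.0 is an unusually low score.
scores = [9.5, 8.0, 8.5, 9.0, 6.0]

ordinary_mean = sum(scores) / len(scores)
diving_score = trimmed_mean(scores, p=1)   # drop the top and bottom score

print(ordinary_mean, diving_score)
```

Here the single low score pulls the ordinary mean down, while the trimmed mean ignores it. SciPy's `scipy.stats.trim_mean` offers a similar calculation, parameterized by the proportion to cut from each end rather than by a count.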

In [1]:
import pandas as pd