# Chapter 1: Exploratory data analysis

- Classical statistics
    - Focused almost exclusively on inference;
    - Complex set of procedures for drawing conclusions about large populations based on small samples.
- Data analysis
    - Tukey's 1977 book:
        - Exploratory data analysis.
    - Statistical inference is just one component;
    - Links with engineering and computer science;
    - The availability of computing power expanded data analysis beyond its original scope;
    - David Donoho's 2015 article.

## Elements of structured data

### Two types of data:
- Numeric data:
    - Data expressed on a numeric scale;
    - Continuous:
        - Can take any value (float).
    - Discrete:
        - Only integer values (int).
- Categorical:
    - Can take only a specific set of values;
    - Represent a set of possible categories:
        - Binary:
            - Just two categories (bool).
        - Ordinal:
            - Data has an ordering factor.

### Key ideas:
- Data is typically classified in software by type;
- Data types include numeric and categorical;
- Data typing in software acts as a signal to the software on how to process the data.
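
As a rough illustration of the data types above, the sketch below builds a small pandas table and declares each column's type explicitly. The table `toy` and its column names and values are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame(
    {
        "price": [0.01, 12.5, 3.99],           # numeric, continuous
        "duration_days": [5, 7, 10],           # numeric, discrete
        "competitive": [False, True, True],    # categorical, binary
        "end_day": ["Mon", "Tue", "Mon"],      # categorical (plain text for now)
        "size": ["small", "large", "medium"],  # ordinal (plain text for now)
    }
)

# Declare the categorical columns explicitly so pandas treats them as such.
toy["end_day"] = toy["end_day"].astype("category")
toy["size"] = pd.Categorical(
    toy["size"], categories=["small", "medium", "large"], ordered=True
)

print(toy.dtypes)

# An ordered categorical knows its ordering, so min/max are meaningful.
largest = toy["size"].max()
print(largest)
```

Declaring the type is what lets the software choose sensible behavior: the ordered column can be compared and sorted by category order rather than alphabetically.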

### Rectangular data

Key terms:
- Dataframe:
    - Rectangular data (spreadsheet) is the basic data structure for statistical and machine learning models.
- Feature:
    - A column within a table is commonly referred to as a feature:
        - Synonyms: attribute, input, predictor, variable.
- Outcome:
    - Data science projects involve predicting an outcome;
    - Features are used to predict the outcome in an experiment;
    - Synonyms: dependent variable, response, target, output.
- Records:
    - A row within the table;
    - Synonyms: case, example, instance, observation, pattern, sample.

In [2]:
import pandas as pd
import numpy as np

In [3]:
pd.DataFrame(
    {
        1: {
            'Category' : 'Music/Movie/Game',
            'currency' : 'US',
            'sellerRating' : 3249,
            'Duration' : 5,
            'endDay' : 'Mon',
            'ClosePrice' : 0.01,
            'OpenPrice' : 0.01,
            'Competitive?' : 0,
        },
        2: {
            'Category' : 'Automotive',
            'currency' : 'US',
            'sellerRating' : 3115,
            'Duration' : 7,
            'endDay' : 'Tue',
            'ClosePrice' : 0.01,
            'OpenPrice' : 0.01,
            'Competitive?' : 0,
        }
    }
).T

Unnamed: 0,Category,currency,sellerRating,Duration,endDay,ClosePrice,OpenPrice,Competitive?
1,Music/Movie/Game,US,3249,5,Mon,0.01,0.01,0
2,Automotive,US,3115,7,Tue,0.01,0.01,0


## Data frames and indexes

- Databases have one or more columns designated as indexes;
- An index is essentially a row number.
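
A minimal sketch of how indexes look in pandas, assuming a hypothetical `auctions` table with made-up values: by default the index is an integer row number, and one or more columns can be promoted to serve as the index instead.

```python
import pandas as pd

# Hypothetical two-row table; values are made up for illustration.
auctions = pd.DataFrame(
    {
        "auction_id": ["a1", "a2"],
        "category": ["Automotive", "Books"],
        "duration": [7, 5],
    }
)

# Default index: integer row numbers starting at 0.
default_index = list(auctions.index)

# Promote a column to be the index, then look up a row by its label.
by_id = auctions.set_index("auction_id")
row = by_id.loc["a2"]

# A multilevel index uses more than one column.
multi = auctions.set_index(["category", "auction_id"])

print(default_index)
print(row["duration"])
print(multi.index.nlevels)
```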

### Nonrectangular data structures

- Time series:
    - Successive measurements of the same variable;
    - Indexed by a date/time value;
    - Key component for the Internet of Things.
- Spatial data structures:
    - Used for mapping and location analytics;
    - More complex than rectangular data.
- Graph or network:
    - Used to represent physical, social and abstract relationships.
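
Of the structures listed above, the time series is the easiest to sketch in pandas. The example below assumes hypothetical sensor readings (the timestamps and values are invented) and shows how a date/time index enables slicing by time and resampling.

```python
import pandas as pd

# Hypothetical sensor readings taken every 10 minutes.
timestamps = pd.date_range("2024-01-01 00:00", periods=6, freq="10min")
temperature = pd.Series([20.1, 20.3, 20.2, 20.8, 21.0, 21.1], index=timestamps)

# Slice by time label (both endpoints are included).
first_readings = temperature.loc["2024-01-01 00:00":"2024-01-01 00:20"]

# Aggregate successive measurements into 30-minute averages.
half_hourly_mean = temperature.resample("30min").mean()

print(first_readings)
print(half_hourly_mean)
```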


### Estimates of location

- Variables with measured or count data might have thousands of distinct values;
- A basic step is getting a typical value for each feature.

Key terms for estimates of location:
- Mean:
    - The sum of all values divided by the number of values;
    - Synonym: average.
- Weighted mean:
    - The sum of all values, each multiplied by its weight, divided by the sum of the weights;
    - Synonym: weighted average.
- Median:
    - The value such that one-half of the data lies above and one-half below;
    - Synonym: 50th percentile.
- Percentile:
    - The value such that $P$ percent of the data lies below;
    - Synonym: quantile.
- Weighted median:
    - The value such that one-half of the sum of the weights lies above and one-half below in the sorted data.
- Trimmed mean:
    - The average of all values after dropping a fixed number of extreme values;
    - Synonym: truncated mean.
- Robust:
    - Not sensitive to extreme values;
    - Synonym: resistant.
- Outlier:
    - A data value that is very different from most of the data;
    - Synonym: extreme value.

#### Mean

$$
    \bar{x} = \frac{
            \sum_{i = 1}^{n}{
                x_i
            }
        }
        {
            n
        }
$$

#### Trimmed mean

- A variation of the mean;
- Calculated by dropping a fixed number $p$ of sorted values at each end and then taking an average of the remaining values;
- Eliminates the influence of extreme values;
- $x_{(i)}$ denotes the $i$-th smallest value in the sorted data.


$$
    \bar{x}_t = \frac{
            \sum_{i = p + 1}^{n - p}
            {
                x_{(i)}
            }
        }
        {
            n - 2p
        }
$$
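
The two formulas above can be checked by hand on a small sample. This is a sketch on hypothetical data (the array `x` is invented and contains one extreme value). It also compares the manual trimmed mean with `scipy.stats.trim_mean`, which takes the trim as a proportion cut from each end rather than a count.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with one extreme value (95).
x = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 95], dtype=float)
n = len(x)

# Mean: sum of all values divided by the number of values.
mean = x.sum() / n

# Trimmed mean: sort, drop p values from each end, average the rest.
p = 1
x_sorted = np.sort(x)
trimmed = x_sorted[p : n - p].sum() / (n - 2 * p)

# scipy expresses the trim as a proportion cut from each end (p / n here).
trimmed_scipy = stats.trim_mean(x, proportiontocut=p / n)

print(mean, trimmed, trimmed_scipy)
```

Dropping a single value from each end is enough here to remove the pull of the extreme value on the average.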

#### Weighted mean

- Calculated by multiplying each value $x_i$ by a user-specified weight $w_i$, summing the products, and dividing by the sum of the weights;
- Motivations:
    - Some values are intrinsically more variable than others and must receive a lower weight;
    - The data collected do not equally represent the different groups that we are interested in measuring.

$$
    \bar{x}_w = \frac{
            \sum_{i = 1}^{n}
            {
                w_i x_i
            }
        }
        {
            \sum_{i = 1}^{n}
            {
                w_i
            }
        }
$$
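
As a quick check of the formula above, the sketch below computes a weighted mean by hand and with `np.average`. The `rates` and `weights` arrays are hypothetical values, not taken from `state.csv`.

```python
import numpy as np

# Hypothetical rates and group sizes used as weights.
rates = np.array([5.0, 2.0, 4.0])
weights = np.array([1_000_000, 3_000_000, 6_000_000])

# Direct translation of the formula: sum(w_i * x_i) / sum(w_i).
weighted_manual = (weights * rates).sum() / weights.sum()

# The same computation with NumPy's built-in helper.
weighted_numpy = np.average(rates, weights=weights)

# For comparison: the plain mean treats every group equally.
unweighted = rates.mean()

print(weighted_manual, weighted_numpy, unweighted)
```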

#### Median and robust estimates

- The median is the middle number in a sorted list of data;
- If the list has an even number of data values, the median is the average of the two values that divide the sorted data into upper and lower halves;
- The median depends only on the values at the center of the sorted data, so when extreme values distort the mean, the median is a better metric;
- It is also possible to compute a weighted median.
#### Outliers

- A value very distant from the others in the dataset;
- Usually identified as values that fall outside chosen percentile cutoffs.
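
One common convention for flagging outliers from percentiles is the 1.5 × IQR rule used in boxplots. It is an assumption for this sketch, not a definition given in these notes, and the sample `x` is hypothetical.

```python
import numpy as np

# Hypothetical sample with one value far from the rest.
x = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 95], dtype=float)

# 25th and 75th percentiles and the interquartile range (IQR).
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]

print(lower, upper, outliers)
```

The cutoff multiplier is a judgment call; whether a flagged value is an error or a genuine extreme still has to be decided from context.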

#### Example: Location estimates of population and murder rates

In [5]:
df = pd.read_csv('../data/state.csv')
df.head(8)

Unnamed: 0,State,Population,Murder.Rate,Abbreviation
0,Alabama,4779736,5.7,AL
1,Alaska,710231,5.6,AK
2,Arizona,6392017,4.7,AZ
3,Arkansas,2915918,5.6,AR
4,California,37253956,4.4,CA
5,Colorado,5029196,2.8,CO
6,Connecticut,3574097,2.4,CT
7,Delaware,897934,5.8,DE


Compute mean

In [6]:
df['Population'].mean()

6162876.3

Compute median

In [13]:
df['Population'].median()

4436369.5

Compute trimmed mean

In [16]:
from scipy import stats

stats.trim_mean(df['Population'], 0.1)

4783697.125

The mean is bigger than the trimmed mean, which is bigger than the median.

To compute the average murder rate for the country, use a weighted mean or weighted median to account for the different state populations.

In [19]:
np.average(df['Murder.Rate'], weights=df['Population'])

4.445833981123393

In [26]:
import weightedstats as ws

ws.weighted_median(df['Murder.Rate'], weights=df['Population'])

4.4

#### Key ideas

- The mean is the basic metric for location, but it is sensitive to extreme values (outliers);
- Other metrics, such as the median and the trimmed mean, are less sensitive to outliers and unusual distributions and hence more robust.
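
To make the robustness point concrete, the sketch below adds a single extreme value to a hypothetical sample and measures how far the mean and the median move. The arrays are invented for illustration.

```python
import numpy as np

# Hypothetical sample, then the same sample with one extreme value added.
clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.append(clean, 1000.0)

# How much does each location estimate move when the outlier is added?
mean_shift = with_outlier.mean() - clean.mean()
median_shift = np.median(with_outlier) - np.median(clean)

print(mean_shift, median_shift)
```

The mean moves by more than a hundred units, while the median moves by half a unit.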