# Exploratory Data Analysis

### Objectives


1. Interview datasets and tell their story
1. Know the difference between exploratory data analysis (EDA) and statistical modeling
1. Know the difference between a categorical and a continuous variable
1. Know the difference between an ordinal and nominal categorical variable
1. Know the difference between univariate and bivariate data
1. Know graphical and non-graphical EDA techniques to apply to univariate and bivariate data

### Resources

1. Read [chapter 4 of this book](http://www.stat.cmu.edu/~hseltman/309/Book/) by Howard Seltman
1. Read about the [categorical data type](http://pandas.pydata.org/pandas-docs/stable/categorical.html) in the pandas documentation.

### More Resources

1. [Udacity class on EDA in R](https://classroom.udacity.com/courses/ud651)
1. [Stanford Visualization Class](http://web.stanford.edu/class/cs448b/cgi-bin/wiki-fa16/index.php?title=Main_Page)
1. [Great blog post on diamonds EDA](https://solomonmessing.wordpress.com/2014/01/19/visualization-series-the-scatterplot-or-how-to-use-data-so-you-dont-get-ripped-off/)
1. [Kaggle Winner Interviews](http://blog.kaggle.com/category/winners-interviews/)

![](images/ds_life.png)

[From Microsoft](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview)

# Introduction
The data science life cycle diagram above is a representation of what an end-to-end data analysis workflow would look like. 


# Developing a Data Analysis Routine
Do you have a plan when the data gets in your hands or do you just randomly explore data until you reach a conclusion? Developing a routine can help you ensure that you follow a common set of procedures during each analysis. This is no different than an airline pilot going through routine safety checks or a professional golfer approaching each golf shot the same way. The notebook **EDA Checklist** lists all of the ideas mentioned in this notebook and can be used as a template for developing your own routine.

### Visualization is the primary tool of EDA
The primary investigative results that your EDA should produce are visualizations. Seaborn and pandas automatically take care of much of this for us.

### Descriptive statistics are a close second
Along with visualizations come descriptive statistics. A good data visualization should contain most of what can be calculated and outputted into a table. Nevertheless, summary statistics give precise information. 

### No formal hypothesis testing 
EDA does not usually concern itself with formal statistical hypothesis testing. Statistical analysis is still done by calculating descriptive statistics and correlations. 

# EDA with Diamonds
One of the most popular datasets for beginning exploration is the [diamonds dataset](http://ggplot2.tidyverse.org/reference/diamonds.html), made famous by the ggplot2 R visualization library.

# The Data Dictionary
The data dictionary is a file that contains information about your dataset. If there is no data dictionary, you need to create it as you complete your EDA. At a minimum, a data dictionary needs to have the column name, description and data type.

Let's look at the data dictionary for the diamonds dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_colwidth = 120
diamonds_dictionary = pd.read_csv('../data/diamonds_dictionary.csv', index_col='Column Name')
diamonds_dictionary
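
When a dataset arrives without a data dictionary, a skeleton can be generated from the DataFrame itself and filled in as the analysis proceeds. The following is a minimal sketch using a small made-up DataFrame (`df` and its values are assumptions for illustration, not the real diamonds file).

```python
import pandas as pd

# Toy stand-in for a dataset that arrives without a data dictionary
df = pd.DataFrame({'carat': [0.23, 0.21, 0.31],
                   'cut': ['Ideal', 'Premium', 'Good'],
                   'price': [326, 326, 335]})

# One row per column; descriptions start empty and are filled in by hand
data_dictionary = pd.DataFrame({'Description': '',
                                'Data Type': df.dtypes.astype(str)})
data_dictionary.index.name = 'Column Name'
print(data_dictionary)
```

The result has the same three minimum fields described above: column name (the index), description and data type.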

# 1. Tidy data and data types

## Inspect the first few rows
Let's look at the head of the DataFrame and inspect the first few rows.

In [None]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head()

## Is the data tidy?
Once you take a first look at your data, you need to determine whether it is tidy. Review the tidy notebook or, as a summary, answer the following questions.

* Is every column a variable?
* Is every row an observation?
* Is every table a single observational unit?

To answer these questions you need to identify your variables first. Much of the data that ends up in your hands will come from a formal relational database, and much of that data is already tidy. Data from Excel spreadsheets, government data, summarized data from web scraping, etc. will usually not be tidy, and you will need to tidy it before doing any analysis.

Our diamonds dataset is tidy. All column names represent variables and each row is a single observation. From the little we know about diamonds, it appears that the whole table is one observational unit. We could possibly think about putting x, y, z, table and depth in a separate table, as they all relate to measurements, but having all the columns together makes for easier analysis.
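
For contrast, here is a sketch of what untidy data can look like and one way to tidy it. The table below is hypothetical (the `store`, `year` and `sales` names are made up): its year columns hold values of a variable rather than being variables themselves.

```python
import pandas as pd

# Hypothetical untidy table: the year columns are values of a variable, not variables
untidy = pd.DataFrame({'store': ['A', 'B'],
                       '2016': [10, 7],
                       '2017': [12, 9]})

# melt turns the year columns into a single 'year' variable, one observation per row
tidy = untidy.melt(id_vars='store', var_name='year', value_name='sales')
print(tidy)
```

After melting, every column is a variable and every row is one store-year observation.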

# Data Types

Once we determine that the data set is tidy, we can find the data types of each column.

In [None]:
diamonds.dtypes

### Update the data dictionary
The index auto-aligns here. 

In [None]:
diamonds_dictionary['Data Type'] = diamonds.dtypes
diamonds_dictionary

## Types of variables: Categorical or Continuous
The two broad classes of variables in a dataset are categorical and continuous. Categorical data is limited to finite, discrete values and can be either strings or numbers. Continuous variables can take on an infinite set of values and are always numeric. 

### Types of categorical data: ordinal or nominal
Categorical data can be further subdivided into two different types - ordinal and nominal. Ordinal data has a natural ordering but the difference between the orders is not measurable. Cancer is usually categorized into 4 stages with 4 being the worst. It is not clear how much worse stage 4 is than stage 3. 

Nominal data is any other type of categorical data that has no natural ordering like type of coffee or hair color or TV show. Let's add a further classification to our data dictionary.

In [None]:
c, o, n = 'continuous', 'ordinal', 'nominal'

In [None]:
d = {'carat':c, 'clarity':o, 'color':o, 'cut':o, 'depth':c, 
     'price':c, 'table':c, 'x':c, 'y':c, 'z':c}

diamonds_dictionary['Data Type Info'] = pd.Series(d)
diamonds_dictionary

## Rearranging the column order
You should not accept the default column ordering of your dataset. It might be sufficient but once the data is in your hands, you have control to change it. Even though the diamonds dataset only has 10 columns, we can still rearrange it such that it is more meaningful. 

In [None]:
# old order
diamonds.columns

In [None]:
new_order = ['cut', 'color', 'clarity','carat', 'price', 'x', 'y','z','depth', 'table']
diamonds = diamonds[new_order]
diamonds.head()

## A bit more metadata
Let's get the number of observations and the number of missing values for each column.

In [None]:
diamonds.shape

In [None]:
diamonds.isna().sum()

Append the number of missing values to the data dictionary.

In [None]:
diamonds_dictionary['Missing Values'] = diamonds.isna().sum()
diamonds_dictionary

## Your Turn
Perform the same steps with your dataset in your notebook.

# 2. Univariate Analysis

### Univariate vs Bivariate (and multivariate) Analyses
Univariate analysis is done on one variable at a time. Bivariate analysis is done on two variables at a time, and multivariate analysis on more than two.

### Graphical vs Non-graphical
Each exploratory analysis will result in either a graph or some numbers representing the data.

## Summary Table
To help guide you on your exploratory data analysis, a suggested plot/table is given in the 10 table cells below.

| Univariate             | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical | Bar chart of frequencies (count/percent) | Contingency table (count/percent) |
| Continuous  | Histogram/KDE, box/violin, qqplot, fat tails  | central tendency - mean/median/mode, spread - variance, std, skew, kurt, IQR  |

| Bivariate/multivariate            | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical vs Categorical | heat map, mosaic plot | Two-way Contingency table (count/percent) |
| Continuous vs Continuous  | all pairwise scatterplots, kde, heatmaps |  all pairwise correlation/regression   |
| Categorical vs Continuous  | All seaborn "categorical" plots | Summary statistics for each level |

## Begin with Univariate Analysis
After you have tidied the data and begun the data dictionary, a reasonable place to start is with univariate analysis. 

### Categorical or Continuous
The **`dtypes`** attribute outputs the pandas data types but does not directly tell us whether the variable is continuous or categorical.

Numeric variables are not necessarily continuous. Columns with limited discrete numeric values are candidates for being categorical data.

In [None]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head()

### Binning continuous variables
It is also possible to bin continuous variables into categories. We are all familiar with this from receiving grades: 90 - 100 is mapped to an **A**, 80 - 89 is mapped to a **B**, and so on.
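
The grade mapping can be sketched with `pd.cut`. The scores, bin edges and labels below are illustrative assumptions rather than part of the diamonds analysis.

```python
import pandas as pd

scores = pd.Series([95, 83, 71, 64, 88, 100])

# Illustrative edges and labels; right=False makes each bin include its left edge,
# so 90 falls in [90, 101) and maps to 'A'
bins = [0, 60, 70, 80, 90, 101]
labels = ['F', 'D', 'C', 'B', 'A']
grades = pd.cut(scores, bins=bins, labels=labels, right=False)
print(grades.tolist())
```

The result is an ordered categorical, so the letter grades sort from **F** up to **A** rather than alphabetically.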

### Get count of unique values for each
The **`nunique`** DataFrame method returns the count of unique values for each column. This can help determine if a continuous variable might be served best as categorical.

In [None]:
diamonds.nunique()

### Univariate analysis: Interview each column
Univariate analysis is simply an analysis done on one variable. For smaller datasets, I like to manually examine each variable. This way, I can learn the distribution of each variable, discover potential outliers and missing values, and simplify matters by concentrating on only one variable at a time.

Non-graphical univariate analysis for categorical data is pretty bland as there is not much to do except report the count or relative frequency. 

### Quickly done with `describe`

In [None]:
diamonds.describe()

### Univariate analysis on the categorical variables
The frequency of occurrence of each value by raw count and percentage is usually the first (and many times the only) exploratory step taken when doing univariate categorical analysis. The **`value_counts`** Series method will be useful here.

In [None]:
diamonds['cut'].value_counts()

In [None]:
diamonds['color'].value_counts()

In [None]:
diamonds['clarity'].value_counts()

In [None]:
# use normalize=True to get relative frequencies (proportions)
diamonds['cut'].value_counts(normalize=True).round(2)

### Outliers for categorical variables
Values in a categorical column that occur very rarely may be considered outliers. 

### Change low values to "other"
We can set a threshold of a minimum number of counts and change these values to "other".
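
A minimal sketch of this idea on a made-up column follows. The threshold of 2 and the names `s`, `min_count` and `s_clean` are arbitrary choices for illustration.

```python
import pandas as pd

# Toy nominal column; the threshold of 2 is an arbitrary choice for illustration
s = pd.Series(['red', 'red', 'blue', 'blue', 'blue', 'teal', 'plum'])
min_count = 2

counts = s.value_counts()
rare = counts[counts < min_count].index

# where keeps values that satisfy the condition and replaces the rest
s_clean = s.where(~s.isin(rare), 'other')
print(s_clean.value_counts())
```

If the column has already been converted to the `category` type, the new label would first need to be added with `.cat.add_categories`.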

### Changing `object` to `category`
Let's change actual categorical values to the pandas `category` data type. Changing the column to type **`category`** does several things. 
* It saves memory by encoding each category as a numerical value. 
* Sorting is possible by the category order (if given). 
* The **`.cat`** accessor makes many more methods available.
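
The memory saving and the `.cat` accessor can be seen on a small example. This sketch uses a made-up column of repeated strings, not the diamonds data.

```python
import pandas as pd

# Toy column with many repeats of a few strings
s_obj = pd.Series(['Ideal', 'Premium', 'Good'] * 10_000)
s_cat = s_obj.astype('category')

# deep=True counts the memory used by the strings themselves
print(s_obj.memory_usage(deep=True), s_cat.memory_usage(deep=True))

# The integer codes behind the values, and the category labels they point to
print(s_cat.cat.codes.head(3).tolist())
print(s_cat.cat.categories.tolist())
```

Without an explicit ordering, the categories are stored in sorted order, which is why the codes follow alphabetical position.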

### Use `pd.Categorical`

Ordinal variables can be given their ordering through the **`categories`** parameter with **`ordered`** set equal to **`True`**.

In [None]:
order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], ordered=True, categories=order)

In [None]:
# notice that the data type is now a category and the categories are ordered
diamonds['cut'].head()

### Convert color and clarity to category

In [None]:
order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
diamonds['color'] = pd.Categorical(diamonds['color'], ordered=True, categories=order)

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
diamonds['clarity'] = pd.Categorical(diamonds['clarity'], ordered=True, categories=order)

###  For nominal categories
There is no need to specify `ordered` or `categories` for nominal variables.
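
For a nominal column, `astype('category')` is usually enough. The `coffee` Series below is a hypothetical example, since the diamonds dataset has no nominal columns.

```python
import pandas as pd

# Hypothetical nominal column (no natural ordering)
coffee = pd.Series(['latte', 'espresso', 'latte', 'mocha'])
coffee = coffee.astype('category')

print(coffee.cat.ordered)              # no order is implied
print(coffee.cat.categories.tolist())
```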

### Sort by frequency and then by category
The **`value_counts`** method works as before by showing the frequencies in descending order. Chaining the **`sort_index`** method displays the power of pandas categorical variables by sorting by the given categorical order.

In [None]:
diamonds['color'].value_counts()

In [None]:
diamonds['color'].value_counts().sort_index()

In [None]:
# percentages
diamonds['color'].value_counts(normalize=True).round(3).sort_index()

In [None]:
diamonds['color'].value_counts().sort_index().plot(kind='bar')

### Seaborn sorts axis automatically
Conveniently, seaborn sorts the categorical variable axis.

In [None]:
sns.countplot(x='color', data=diamonds)

In [None]:
sns.countplot(x='cut', data=diamonds)

In [None]:
sns.countplot(x='clarity', data=diamonds)

### Pie charts are evil
Lots of data visualization experts say to [avoid pie charts](https://www.quora.com/How-and-why-are-pie-charts-considered-evil-by-data-visualization-experts).

In [None]:
diamonds.color.value_counts().plot(kind='pie')

# Your turn
Do univariate analysis on the categorical columns. Convert them to the `category` type.

# 3. More Univariate Analysis

### Feature Engineering Columns of Strings
Categorical columns don't lend themselves to much exploratory data analysis. Features (new variables) may be created from strings. For instance, the first or last letter can be pulled out into its own column for further analysis, as can the second word of a sentence, the count of vowels, and so forth.

Just because a column is a string does not mean a single bar plot of frequencies ends the analysis.
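
The features mentioned above can be sketched with the pandas `.str` accessor. The `names` Series is made up for illustration.

```python
import pandas as pd

# Toy string column
names = pd.Series(['Round Brilliant', 'Princess Cut', 'Emerald'])

features = pd.DataFrame({
    'first_letter': names.str[0],
    'last_letter': names.str[-1],
    # str.split returns lists; .str[1] picks the second word (missing when there is none)
    'second_word': names.str.split().str[1],
    'vowel_count': names.str.lower().str.count('[aeiou]'),
})
print(features)
```

Each new column can then go through the same univariate analysis as any other categorical or numeric variable.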

### Univariate analysis on carat
Carat is numerical and therefore a much larger array of statistics may be generated to describe the variable. A boxplot is great to see some measure of spread and have some cut-off for outliers, defaulting to 1.5 times the IQR. There appear to be quite a few outliers, seen by the dots beyond the whisker of the plot below.

The density of the distribution is not visible with a box plot. For all I know, 99.9% of the data could be less than 2. Use a KDE or histogram for this.

In [None]:
sns.boxplot(y='carat', data=diamonds)

In [None]:
# histogram with a KDE curve overlaid
sns.histplot(diamonds['carat'], kde=True)

### More precision about distribution
96% of the diamonds are 2 carats or less and 99.7% are 2.5 carats or less. 99% of the diamonds are between .23 and 2.31 carats.

In [None]:
diamonds['carat'].quantile([.005, .995])

## Your Turn
Complete some univariate analysis on some continuous variables.

# 4. Outliers (in one dimension)
There is no formal statistical definition of an outlier but generally speaking, we think of outliers as being an abnormal observation distant from other points. There has been lots of research [dedicated to outlier detection](https://en.wikipedia.org/wiki/Outlier#Detection) but for our purposes we will concentrate on allowing our natural human ability to notice slight imperfections from a standard.

Box plots are great tools for visually detecting outliers. Seaborn (and most other plotting tools) defaults to labeling outliers as any observation more than 1.5 times the IQR beyond either the first or third quartiles.
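
The 1.5 times IQR rule can also be computed directly. This is a minimal sketch on made-up values; the quartiles here come from the pandas default (linear) interpolation, which may differ slightly from what a particular plotting library uses.

```python
import pandas as pd

# Toy values; 30 is far from the rest
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

q1, q3 = s.quantile([.25, .75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Observations outside the fences are what a box plot draws as individual points
flagged = s[(s < lower) | (s > upper)]
print(lower, upper, flagged.tolist())
```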

In [None]:
filt = diamonds.dtypes != 'category'
diamonds.columns[filt]

In [None]:
dia_num = diamonds.select_dtypes(exclude='category')
dia_num.head()

In [None]:
dia_num_melt = dia_num.melt()
dia_num_melt.head()

In [None]:
sns.catplot(x='value', data=dia_num_melt, kind='box', col='variable', col_wrap=3, sharex=False)

### Handling outliers
During EDA, we are not necessarily interested in taking an action on the outlier. Instead we can label it, investigate it further and then make a decision on it.

### Labeling the outliers
A column will now be created to label the outliers.

In [None]:
outliers = ((diamonds['x'] < 3) | (diamonds['y'] > 30) | (diamonds['z'] > 20) | 
            (diamonds['carat'] > 4) | (diamonds['depth'] < 45) | (diamonds['depth'] > 75) |
            (diamonds['table'] < 40) | (diamonds['table'] > 90)).astype(int)

In [None]:
diamonds['outliers'] = outliers
filt = diamonds['outliers'] == 1
diamonds[filt]

### Comments on outliers
* There are 7 rows with x,y,z all equal to 0. These variables must be positive, so they can't possibly be correct. 
* The two y values over 30mm can't possibly be right as one of them would be wider than the largest diamond ever found and the price is much too low.

### Calculated Depth
The data dictionary tells us that the **`depth`** is equal to **`z / mean(x,y)`**. Let's calculate the depth using this formula and compare to the depth from the data.

In [None]:
diamonds['calculated_depth'] = diamonds['z'] / ((diamonds['x'] + diamonds['y']) / 2) * 100

In [None]:
diamonds.head()

In [None]:
diamonds['depth_diff'] = (diamonds['depth'] - diamonds['calculated_depth']).abs()

In [None]:
diamonds.sort_values('depth_diff', ascending=False).head(25)

In [None]:
(diamonds['depth_diff'] < 5).mean(), (diamonds['depth_diff'] > 5).sum()

### depth vs calculated depth
If this were a pristine dataset, then the calculated depth would equal the depth for each observation. About .1% (or 40) of the observations have an absolute depth difference greater than 5. What does this mean for those observations? There must be a measurement/input error in x, y or z. The table above sorts by largest absolute depth difference. A **`z`** of 0 is responsible for many of the large depth differences.

More investigation into these wrong calculated depth observations might need to happen.

### Duplicated rows
Looking back up at the outliers table, it appears that several pairs of observations are identical or very similar (see 49556 and 49557). All the duplicated rows are saved to the **`dupes`** DataFrame. Perhaps the duplicates should be dropped. More information is needed. There are 289 duplicated rows.

In [None]:
dupes = diamonds[diamonds.duplicated(keep=False)]
dupes.head(20)

In [None]:
dupes.shape
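
The `keep` parameter changes which rows `duplicated` marks. A small sketch on a made-up frame shows the difference between the default and `keep=False`.

```python
import pandas as pd

# Toy frame where rows 0 and 2 are identical
df = pd.DataFrame({'carat': [0.3, 0.5, 0.3], 'price': [400, 900, 400]})

first_kept = df.duplicated()            # default keep='first' marks only later copies
all_marked = df.duplicated(keep=False)  # marks every member of a duplicate group
deduped = df.drop_duplicates()          # keeps the first copy of each group

print(first_kept.tolist(), all_marked.tolist(), len(deduped))
```

Because `keep=False` marks every copy, the count it gives is larger than the number of rows that `drop_duplicates` would actually remove.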

## Your Turn
Try and discover outliers and find duplicated rows.

# 5. Bivariate and Multivariate EDA
All the above EDA focused on a single column at a time (univariate). Of course it is possible to extend a data analysis to multiple columns, but the number of possible plots and tables grows quickly: if there are n columns, then there are **n choose 2** bivariate combinations. With the 10 original variables, this would make 45 bivariate combinations and 120 involving three variables at a time.
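
How quickly the number of combinations grows can be checked with `math.comb` from the standard library. The column counts below are arbitrary examples.

```python
from math import comb

# Number of ways to pick 2 or 3 columns out of n, for a few example values of n
counts = {n: (comb(n, 2), comb(n, 3)) for n in (5, 10, 20)}

for n, (pairs, triples) in counts.items():
    print(n, pairs, triples)
```

Doubling the number of columns roughly quadruples the number of pairs, which is why narrowing down the candidates first matters.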

Look way back at [the summary table](#Summary-Table) of the graphical and non-graphical tools for the different combinations of variables.

* categorical vs categorical
* categorical vs continuous
* continuous vs continuous

## Categorical vs Categorical
Let's create two-way contingency tables and heat maps to help show the distribution.

In [None]:
col_clar_ct = diamonds.pivot_table(index='clarity', columns='color', aggfunc='size')
col_clar_ct

A heat map makes it easier to see the areas where the data is denser.

In [None]:
# bulk of the data is in the middle
sns.heatmap(col_clar_ct)

In [None]:
cut_color_ct = diamonds.pivot_table(index='cut', columns='color', aggfunc='size')
cut_color_ct

In [None]:
sns.heatmap(cut_color_ct)

In [None]:
cut_color_pct = pd.crosstab(diamonds['cut'], diamonds['color'], normalize='all')
cut_color_pct

In [None]:
sns.heatmap(cut_color_pct, annot=True, fmt='.2f')

## Your Turn
Do some analysis on categorical vs categorical data

# 6. Categorical vs Continuous
All the plots in the categorical section in the [seaborn tutorial](http://seaborn.pydata.org/tutorial/categorical.html) will be of major help here. 

### A loose problem statement
The rest of the notebook will work on discovering how price per carat changes with respect to the variables. This variable does not exist yet so we will need to create it first.

In [None]:
diamonds['price_per_carat'] = diamonds['price'] / diamonds['carat']

### Comparing all categories vs all continuous variables
The figure below plots the mean at every level of each category for all the continuous variables. All three categorical variables are ordered and displayed in the given order.

Very interestingly, all the continuous variables decline as the categorical variables increase. carat, x, y, z, and table seem to be closely related to the size of the diamond and so it appears that it is harder and harder to find high-quality large diamonds.

The only continuous variable that increased is price per carat. I would have expected it to increase more for the higher quality diamonds, but unexpectedly it only increases a small amount and the highest quality diamonds do not always average the highest prices. Why is this so?

In [None]:
g = sns.PairGrid(diamonds,
                 x_vars=["color", "cut", "clarity"],
                 y_vars=["carat", "price", "price_per_carat", "table", "x", "y", "z"], height=3, aspect=1.5)
g.map(sns.pointplot, ci=0)

### Price per carat vs clarity and color
It does not seem like there is much of a relationship between price per carat and clarity and color unless you are looking at the first and last plots below.

The middle 6 plots all look virtually identical. The **I1** clarity graph is significantly lower than the rest. Colors **E** and **D** in the **IF** clarity graph are also clearly above the rest.

### Multiplicative effect
Since the clarity and color do not seem to have an effect until you have either awful or amazing diamonds, the effects of having both very good or both very poor might be like multiplying their values together for an even much larger gain.

In [None]:
sns.catplot(x='color', y='price_per_carat', data=diamonds, kind='bar', col='clarity', col_wrap=4)

### cut might not have an effect
Plotting cut and color vs price per carat shows nearly identical graphs for all the cut types. Even **ideal** cut diamonds are no better than the others. The worst **color** **fair** cut diamonds tend to be a bit worse than average. But overall, cut does not look like it has much of an effect.

In [None]:
sns.catplot(x='color', y='price_per_carat', data=diamonds, kind='bar', col='cut')

### Heat map to identify high and low prices

In [None]:
color_clarity_price_mean = diamonds.pivot_table(index='color', columns='clarity', values='price_per_carat')

In [None]:
sns.heatmap(color_clarity_price_mean)
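
The non-graphical counterpart for categorical vs continuous data is a table of summary statistics for each level. A minimal sketch with `groupby` on made-up values (not the real diamonds data):

```python
import pandas as pd

# Toy data standing in for a categorical and a continuous column
df = pd.DataFrame({'cut': ['Fair', 'Fair', 'Ideal', 'Ideal', 'Ideal'],
                   'price': [300, 500, 400, 600, 800]})

# One row of summary statistics per level of the categorical variable
summary = df.groupby('cut')['price'].agg(['count', 'mean', 'median', 'std'])
print(summary)
```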

## Your Turn
Make some plots on categorical vs continuous variables

# 7. Continuous vs Continuous
A pairwise scatter plot is a fantastic first assessment of the relationships between continuous variables. Examining every combination of continuous variables might make for a terribly large plot so using a heat/cluster map like the one below to first find the highest correlated variables can help narrow down the choices.

Coloring the points by a third variable aids understanding quite a bit. It is clear that carat is highly correlated with price and price per carat. The variables x, y, and z were not used in this plot because they are highly correlated with one another and highly correlated with carat. The variable carat essentially takes the place of the other variables.

Since the diamond dataset is fairly large, plotting 50,000 points will take some time. Many points will overlap. To help alleviate this computational load, use the **`sample`** method to select a random sample of the data. The marker size of each point has also been set.

In [None]:
sns.pairplot(diamonds.sample(frac=.3), 
             diag_kind='kde',
             vars=['carat', 'price', 'price_per_carat', 'depth'],
             hue='color', 
             plot_kws={"s": 3}, 
             height=3)

### Selecting the **hue** by a different categorical variable
The above shows a clear relationship between carat and price. It also shows that the **color** of each stone is important. If we examine a vertical strip of data for carat vs price we notice that color **D** is always the highest.

But very interestingly, the frequency of color D stones decreases as carat increases. In fact, it appears there are almost no color D stones larger than 2 carats. Perhaps larger diamonds are of poorer quality.

In [None]:
sns.pairplot(diamonds.sample(frac=.3), 
             diag_kind='kde',
             vars=['carat', 'price', 'price_per_carat', 'depth'],
             hue='clarity', 
             plot_kws={"s": 3}, 
             height=3)

### Clustering
The variables price, price per carat, carat, x, y and z are all very tightly clustered together and all highly correlated with one another. The hierarchical cluster map makes this easy to spot.

In [None]:
# correlations are only computed for the numeric columns
sns.clustermap(diamonds.corr(numeric_only=True))

## Your Turn
Do some continuous vs continuous analysis

# 8. Binning a Continuous variable

## Uneven distribution of high quality diamonds
The scatter plots above indicate that higher quality diamonds tend to be smaller in size. Take a look at the box plots below. It is clear that the highest quality diamonds, D color and IF clarity, have a much lower average carat size. The largest diamonds are the worst quality.

In [None]:
sns.boxplot(x='color', y='carat', data=diamonds)

In [None]:
sns.boxplot(x='clarity', y='carat', data=diamonds)

In [None]:
sns.boxplot(x='cut', y='carat', data=diamonds)

### Making a categorical variable out of a continuous variable
Occasionally, you will want to code different ranges of a continuous variable as a categorical variable as was talked about with numerical grades converting to letter grades.

Our point plots from way above indicated that price decreased as the quality of diamonds increased. How was that possible? Our above box plots indicate that the highest quality diamonds tend to be much smaller on average but larger diamonds are also more expensive. This explains the paradox.

To help visualize diamonds of about the same size we can turn our continuous variable **`carat`** into a categorical one with the **`pd.qcut`** function. This will cut the data into bins that each hold roughly the same number of observations.

In [None]:
# this creates 5 bins with roughly equal numbers of observations
diamonds['carat_category'] = pd.qcut(diamonds['carat'], 5)

In [None]:
diamonds['carat_category'].value_counts()
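
The difference between `pd.qcut` (equal counts) and `pd.cut` (equal widths) is easiest to see on skewed data. The following sketch uses made-up values chosen for illustration.

```python
import pandas as pd

# Skewed toy values: most are small, a few are large
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 20, 40])

by_quantile = pd.qcut(s, 2)   # edges chosen so each bin holds about the same number of rows
by_width = pd.cut(s, 2)       # edges chosen so each bin spans the same range of values

quantile_counts = by_quantile.value_counts().sort_index().tolist()
width_counts = by_width.value_counts().sort_index().tolist()
print(quantile_counts, width_counts)
```

With equal-width bins nearly all the rows land in the first bin, which is why quantile-based bins are the more useful choice for comparing diamonds of similar size.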

In [None]:
# original
sns.pointplot(x='clarity', y='price', data=diamonds)

In [None]:
# small diamonds
diamonds_small = diamonds[diamonds['carat_category'].cat.codes == 0]
sns.pointplot(x='clarity', y='price', data=diamonds_small, ci=0)

# The diamond story in one plot
This one plot tells the diamond story. As carat increases, price goes up but it also increases as color and clarity improve.

In [None]:
sns.catplot(x='clarity', y='price', data=diamonds, hue='carat_category', col='color', col_wrap=4, kind='point')

## Your Turn
Make a categorical variable out of a continuous variable and use it to make a plot.