# 2. Data Dictionary

### Objectives

* Understand how a data analysis routine can be helpful
* Understand what exploratory data analysis (EDA) is
* Know the difference between EDA and statistical modeling
* Know the difference between univariate and multivariate data
* Know graphical and non-graphical EDA techniques to apply to univariate and multivariate data

### Resources

* Read [chapter 4 of this book](http://www.stat.cmu.edu/~hseltman/309/Book/) by Howard Seltman
* Data-driven articles from [FiveThirtyEight](https://fivethirtyeight.com/)
* [Udacity class on EDA in R](https://classroom.udacity.com/courses/ud651)
* [Stanford Visualization Class](http://web.stanford.edu/class/cs448b/cgi-bin/wiki-fa16/index.php?title=Main_Page)
* [Good blog post on diamonds EDA](https://solomonmessing.wordpress.com/2014/01/19/visualization-series-the-scatterplot-or-how-to-use-data-so-you-dont-get-ripped-off/)
* [Kaggle Winner Interviews](http://blog.kaggle.com/category/winners-interviews/)

## What is EDA?
Exploratory data analysis (EDA) is the approach you take when first examining a dataset to gain a fundamental understanding of it without formal statistical hypothesis testing. EDA builds that understanding through both visualization and descriptive statistics. Usually, no formal conclusions are drawn.

### EDA helps you discover a project/hypothesis
While completing EDA, you might want to investigate a particular path in more detail, which can lead to building an entire project or forming a hypothesis that will be rigorously tested. You don't need to know what you want to discover before beginning EDA.

## Developing a Data Analysis Routine
Do you have a plan when the data gets in your hands or do you just randomly explore data until you find something interesting? Developing a routine can help you ensure that you follow a common set of procedures during each analysis. This is no different than an airline pilot going through routine safety checks or a professional golfer approaching each golf shot the same way. The notebook **EDA Checklist** contains all of the ideas mentioned in this chapter and can be used as a template for developing your own routine.

### Visualization and descriptive statistics are the primary tools of EDA
I am naturally drawn to reading articles that have good data visualizations embedded within them. There's nothing like obtaining information from an interesting data visualization. Most EDAs make heavy use of visualizations.

### No formal hypothesis testing 
EDA does not usually concern itself with formal statistical hypothesis testing. Statistical analysis is still done by calculating descriptive statistics and correlations. 

## Final Product of EDA

## EDA with Diamonds
A popular dataset for beginning exploration is the [diamonds dataset][1], which is used extensively in examples by the ggplot2 R visualization library.

[1]: http://ggplot2.tidyverse.org/reference/diamonds.html

## The Data Dictionary
The data dictionary is a file that contains information about every column of your dataset. If there is no data dictionary, you need to create one when beginning EDA. At a minimum, a data dictionary needs to have the column name, description, and data type of each column.
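
If no data dictionary exists, a starting point can be built by hand. The sketch below is only an illustration: the `toy` DataFrame, its values, and the description wording are made-up stand-ins, not the contents of the real file. It records the three minimum fields for each column.

```python
import pandas as pd

# Hypothetical three-column sample standing in for a real dataset
toy = pd.DataFrame({'carat': [0.23, 0.21, 0.29],
                    'cut': ['Ideal', 'Premium', 'Good'],
                    'price': [326, 326, 334]})

# Descriptions are written by hand; the wording here is only illustrative
descriptions = {'carat': 'Weight of the diamond',
                'cut': 'Quality of the cut',
                'price': 'Price in US dollars'}

# Both Series are indexed by column name, so they line up row by row
toy_dictionary = pd.DataFrame({'Description': pd.Series(descriptions),
                               'Data Type': toy.dtypes})
toy_dictionary.index.name = 'Column Name'
toy_dictionary
```

Keeping the column name as the index makes it easy to attach more per-column information later.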

Let's look at the data dictionary for the diamonds dataset. Notice that the `Column Name` column is set as the index. This will be important soon when appending new columns to the data dictionary.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 120
diamonds_dictionary = pd.read_csv('../data/diamonds_dictionary.csv', index_col='Column Name')
diamonds_dictionary

## Our first look at the data
Read in the dataset and inspect the first few rows.

In [None]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head()

### How many rows and columns are there?
Let's get the dimensions of our dataset.

In [None]:
diamonds.shape

## Is the data tidy?
After your first look at the data, determine whether it is tidy.

Our diamonds dataset is tidy. All column names represent variables and each row is a single observation. From the little I know about diamonds, it appears that the whole table is one observational unit. We could put `x`, `y`, `z`, `table`, and `depth` in a separate table since they all relate to measurements, but keeping all the columns together makes for easier analysis.
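
For contrast, here is a minimal sketch of what untidy data can look like and one way to tidy it. The `wide` table and all of its names and values are made up for illustration: the years are stored as column names rather than as values of a variable, and `melt` reshapes them into rows.

```python
import pandas as pd

# Hypothetical untidy table: '2016' and '2017' are values of a 'year' variable,
# but they appear as column names
wide = pd.DataFrame({'store': ['A', 'B'],
                     '2016': [10, 20],
                     '2017': [15, 25]})

# melt turns each (store, year) pair into its own row
tidy = wide.melt(id_vars='store', var_name='year', value_name='diamonds_sold')
tidy
```

After reshaping, every column is a variable and every row is one observation, which is the form the diamonds data already has.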

## Data Types
Once we determine that the dataset is tidy, we can find the data type of each column.

In [None]:
diamonds.dtypes

## Convert Columns to appropriate data types
Ensure that the data types match the type that you expect. Be mindful of the common scenario of a column of numeric data being read in as a column of strings.

The diamonds dataset seems to have been read with the appropriate data types.
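
Had a numeric column been read in as strings, one possible fix is `pd.to_numeric`. The sketch below uses a made-up `raw` DataFrame in which a single non-numeric entry forces the whole column to be read as strings; it is an illustration, not a step the diamonds data needs.

```python
import pandas as pd

# Hypothetical column where one bad entry means everything is stored as strings
raw = pd.DataFrame({'price': ['326', '334', 'unknown']})

# errors='coerce' converts unparseable entries to NaN instead of raising an error
raw['price'] = pd.to_numeric(raw['price'], errors='coerce')
raw.dtypes
```

Coercing silently turns bad entries into missing values, so it is worth counting the missing values before and after the conversion.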

### Add the data types to the data dictionary
Notice that the `price` column is the 6th index value in the data dictionary but the 7th in the `diamonds.dtypes` Series. When we run the new column assignment below, the indexes align first and then the `Data Type` column gets created.

In [None]:
diamonds_dictionary['Data Type'] = diamonds.dtypes
diamonds_dictionary
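
To see index alignment in isolation, consider this small sketch. The `toy_dictionary` and `toy_types` objects are hypothetical and deliberately use different index orders.

```python
import pandas as pd

# Hypothetical dictionary indexed alphabetically by column name
toy_dictionary = pd.DataFrame({'Description': ['Weight', 'Quality of the cut', 'Price']},
                              index=['carat', 'cut', 'price'])

# A Series holding the same labels in a different order
toy_types = pd.Series(['int64', 'float64', 'object'],
                      index=['price', 'carat', 'cut'])

# Assignment matches on index labels, not position,
# so each value lands in the row with the same label
toy_dictionary['Data Type'] = toy_types
toy_dictionary
```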

### Get count of unique values for each column
The **`nunique`** DataFrame method returns the count of unique values for each column. This can help determine whether a numeric variable might be better treated as categorical. Again, automatic alignment of the index ensures that each value ends up in the correct row.

In [None]:
du = diamonds.nunique()
du

In [None]:
diamonds_dictionary['Num Unique'] = du
diamonds_dictionary
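
One way to use these counts is to flag numeric columns with very few distinct values. This is a rough sketch on made-up data; the `toy` DataFrame, the `rating` column, and the `threshold` cutoff are all assumptions chosen for illustration, and a sensible cutoff depends on the dataset.

```python
import pandas as pd

# Hypothetical data: 'rating' is numeric but takes only a few distinct values
toy = pd.DataFrame({'carat': [0.23, 0.21, 0.29, 0.31, 0.24, 0.26],
                    'rating': [1, 2, 1, 3, 2, 1]})

threshold = 3  # arbitrary cutoff chosen for this illustration

# Count unique values in the numeric columns only
num_unique = toy.select_dtypes('number').nunique()

# Numeric columns at or below the cutoff are candidates for a categorical label
candidates = num_unique[num_unique <= threshold].index.tolist()
candidates
```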

## Label type of data
Provide a generic labeling of each column of data as either continuous, ordinal, or nominal. Use the data dictionary to help determine this label. In this example, all of the string columns are ordinal and there are no nominal columns.

We create a Python dictionary mapping the column name to the label and then create a Series from it.

In [None]:
c, o, n = 'continuous', 'ordinal', 'nominal'

d = {'carat':    c, 
     'clarity':  o, 
     'color':    o, 
     'cut':      o, 
     'depth':    c, 
     'price':    c, 
     'table':    c, 
     'x':        c, 
     'y':        c, 
     'z':        c}

type_label = pd.Series(d)
type_label

### Don't forget columns
Manually typing out column names is a recipe for typos and for leaving a column out entirely. Use the `columns` attribute to output all of them.

In [None]:
diamonds.columns
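
A quick check can catch a forgotten or misspelled column. The sketch below uses a hypothetical `toy` DataFrame and a hand-typed `labels` mapping that misses one column on purpose.

```python
import pandas as pd

# Hypothetical columns and a hand-typed label mapping that omits 'price'
toy = pd.DataFrame(columns=['carat', 'cut', 'price'])
labels = {'carat': 'continuous', 'cut': 'ordinal'}

# Columns that have no label yet
missing = sorted(set(toy.columns) - set(labels))

# Labels that match no column (usually a typo)
extra = sorted(set(labels) - set(toy.columns))

missing, extra
```

Both lists should be empty before the labels are added to the data dictionary.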

### Add to data dictionary
We can add this new column directly to our data dictionary because of automatic index alignment.

In [None]:
diamonds_dictionary['Data Type Info'] = type_label
diamonds_dictionary

## Rearranging the column order
Once the data is in your hands, you have control to change it. It isn't necessary to keep the original column order. Even though the diamonds dataset has only 10 columns, we can still rearrange them into a more meaningful order.

### Strategies for ordering columns
* Place more important columns to the left and less important ones to the right.
* Place columns of strings that help identify the row first.
* Group similar columns together (like the start and end time of a bike ride).

In [None]:
# old order
diamonds.columns

In [None]:
new_order = ['cut', 'color', 'clarity','carat', 'price', 'x', 'y','z','depth', 'table']
diamonds = diamonds[new_order]
diamonds.head()

## Ensure that the number of columns is the same
Make sure you haven't accidentally dropped a column.

In [None]:
diamonds.shape
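
Comparing shapes confirms the column count but not the column names. As a stricter check, the sketch below (with a hypothetical `toy` DataFrame and `new_order` list) verifies that the new order contains exactly the existing labels before reordering.

```python
import pandas as pd

# Hypothetical one-row dataset and a hand-typed column order
toy = pd.DataFrame({'carat': [0.23], 'cut': ['Ideal'], 'price': [326]})
new_order = ['cut', 'carat', 'price']

# Sorting both sides catches dropped, misspelled, and duplicated names
assert sorted(new_order) == sorted(toy.columns), 'new_order does not match the existing columns'

reordered = toy[new_order]
reordered.shape == toy.shape
```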

### Get the number of missing values in each column
We sum up the number of missing values in each column and append it to the data dictionary.

In [None]:
num_missing = diamonds.isna().sum()
num_missing

Append the number of missing values to the data dictionary.

In [None]:
diamonds_dictionary['Missing Values'] = num_missing
diamonds_dictionary

### Analyzing missing data
Understanding missing data is essential to completing an informative EDA. At this step, it is enough to note how many missing values there are; we can analyze them later.
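
Raw counts are easier to compare across columns when expressed as a share of all rows. A minimal sketch, using a made-up `toy` DataFrame with missing values inserted by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both a numeric and a string column
toy = pd.DataFrame({'carat': [0.23, np.nan, 0.29, 0.31],
                    'cut': ['Ideal', 'Premium', None, None]})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of missing values per column
pct_missing = toy.isna().mean() * 100
pct_missing
```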

# Exercise
Complete these steps on your own dataset.