<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Walk Through of Standard EDA Procedure

_Authors: Kiefer Katovich (SF)_

---

This lesson uses Boston housing data to walk through a basic exploratory data analysis procedure, starting at the beginning with loading the data. 

Although in many cases, if not most, the EDA procedure will be considerably more involved, this should give you an idea of the basic workflow a data scientist goes through when taking a look at a new data set.

**Note:** This lesson is strictly exploratory. We will not be formulating any hypotheses about the data or testing them. In many cases, you may have already formulated a hypothesis before even looking at your data, which could considerably affect your focus and choices in what to investigate.


### Lesson Guide

- [Description of the Boston Housing Data](#data_description)
- [Loading the Data](#load_data)
- [Describing the Basic Format of the Data and the Columns](#header)
- [Dropping Unwanted Columns](#drop)
- [Cleaning Corrupted Data](#clean)
- [Counting Null Values and Dropping Rows](#drop_nulls)
- [Renaming Columns](#rename)
- [Describing Summary Statistics for Columns](#describe)
- [Investigating Potential Outliers With Box Plots](#boxplots)
- [Plotting All Variables Together](#plot_all)
- [Standardizing Variables](#standardization)
- [Plotting the Standardized Variables Together](#plot_all_rescaled)
- [Looking at the Covariance or Correlation Between Variables](#cov_cor)


<a id='data_description'></a>

### Description of the Boston Housing Data

---

The columns of the data set are coded. The corresponding descriptions are:

    CRIM: Per capita crime rate by town.
    ZN: Proportion of residential land zoned for lots larger than 25,000 sq. ft. 
    INDUS: Proportion of non-retail business acres per town.
    CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). 
    NOX: Nitric oxides concentration (parts per 10 million).
    RM: Average number of rooms per dwelling.
    AGE: Proportion of owner-occupied units built prior to 1940. 
    DIS: Weighted distances to five Boston employment centers.
    RAD: Index of accessibility to radial highways.
    TAX: Full-value property tax rate per 10,000 dollars.
    PTRATIO: Pupil-teacher ratio by town.
    B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. 
    LSTAT: Percentage of lower-status population. 
    MEDV: Median value of owner-occupied homes in 1000s of dollars.
    
Each row in the data set represents a different suburb of Boston.

These descriptions of shortened or coded variables are often called "codebooks," or data dictionaries. They are typically included along with data sets you might find online in a separate file.


**Load the packages.**

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<a id='load_data'></a>

### 1. Loading the Data

---

Import the `.csv` into a `pandas` DataFrame.
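
As a rough illustration of what loading looks like, the sketch below reads a tiny made-up CSV from memory instead of the real file. The values and the `toy` name are assumptions for demonstration only; with the real data you would pass the file path to `pd.read_csv()` in the same way.

```python
import io
import pandas as pd

# Toy CSV text standing in for the real file (all values are made up).
# The header starts with an empty field, as happens when a DataFrame
# is saved to disk together with its index.
toy_csv = io.StringIO(
    ",CRIM,RAD\n"
    "0,0.10,1\n"
    "1,0.25,?\n"
    "2,0.40,3\n"
)

toy = pd.read_csv(toy_csv)

print(toy.shape)             # (rows, columns)
print(toy.columns.tolist())  # pandas labels the blank header 'Unnamed: 0'
```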

In [2]:
boston_file = './datasets/housing.csv'

In [3]:
# A:

<a id='header'></a>

### 2. Describing the Basic Format of the Data and the Columns

---

Use the `.head()` function to examine what the loaded data looks like (as an option, you can also pass in an integer for the number of rows you want to see). This is a good initial step to get a feel for what is contained in the `.csv` and what problems may be present.

The `.dtypes` attribute tells you the data type for each of your columns.
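
Here is a minimal sketch of both calls on a small made-up DataFrame (the `toy` frame and its values are assumptions, not the Boston data). It also shows how a single non-numeric entry is enough to stop a column from being read as numeric.

```python
import pandas as pd

# Made-up data: one corrupted entry ('?') forces RAD to be stored as strings.
toy = pd.DataFrame({
    "CRIM": [0.10, 0.25, 0.40, 0.55],
    "RAD": ["1", "?", "3", "4"],
})

print(toy.head(2))  # first two rows only
print(toy.dtypes)   # CRIM is a float column; RAD is not numeric

# A programmatic check of which columns are numeric:
numeric_flags = {col: pd.api.types.is_numeric_dtype(toy[col]) for col in toy.columns}
print(numeric_flags)
```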

In [4]:
# Print out the first eight rows:


In [5]:
# Look at the dtypes of the columns:


<a id='drop'></a>

### 2. Dropping Unwanted Columns

---

There is a column labeled `Unnamed: 0`, which appears to simply number the rows. We already have the rows' number IDs in the DataFrame's index, so we don't need this column.

We can use the built-in `.drop()` function to get rid of this column. When removing a column rather than a row, we need to pass `axis=1` to the function.

For the record, the `.index` attribute holds the row indices. This is the sister attribute to the `.columns` attribute that we work with more often.
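
The following sketch shows the idea on a made-up two-column frame (the values are assumptions; only the `Unnamed: 0` column name comes from the lesson).

```python
import pandas as pd

# Made-up frame with a redundant row-number column.
toy = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "CRIM": [0.10, 0.25, 0.40]})

print(toy.index)  # the index already numbers the rows

# .drop() returns a new DataFrame, so assign the result back.
toy = toy.drop("Unnamed: 0", axis=1)
print(toy.columns.tolist())
```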



In [6]:
# Print out the `index` object and the first 20 items in the DataFrame's index 
# to see that we already have these row numbers:


In [7]:
# Remove the unnecessary column:


<a id='clean'></a>

### 3. Cleaning the Corrupted Columns

---

You may have noticed that, when we examined the `dtypes` attribute, two of the columns had an "object" type, indicating that they were strings. However, we know from the data description above (and we can infer from the data's header) that `DIS` and `RAD` should in fact be numeric.

It is pretty common to have numeric columns represented as strings in your data if some of the observations are corrupted. That's why it's important to always check your columns' data types.

**3.A What's causing the `DIS` column to be encoded as string? Figure out a way to make sure the column is numeric while preserving its information.**

**Tip**: The built-in `.map()` function on a column will apply a function to each of its elements.
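
To illustrate how `.map()` works, here is a sketch with a purely hypothetical kind of corruption (a stray unit suffix). This is not necessarily what is wrong with `DIS`; the `dist` series and the `strip_unit` helper are made up for the example.

```python
import pandas as pd

# Hypothetical corruption: numbers stored as strings with a unit attached.
dist = pd.Series(["4.09 mi", "4.97 mi", "6.06 mi"])

def strip_unit(value):
    """Remove the ' mi' suffix and convert what is left to a float."""
    return float(value.replace(" mi", ""))

# .map() calls strip_unit once per element and returns a new Series.
dist_clean = dist.map(strip_unit)
print(dist_clean)
```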

In [8]:
# A:

**3.B What is causing the `RAD` column to be encoded as string? Figure out a way to make sure the column is numeric while preserving its information.**

**Tip**: You can put `np.nan` values, which are numeric null values, in place of corrupted observations.
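
One possible approach is sketched below, assuming the corrupted entries are placeholder strings that cannot be parsed as numbers. The `rad` series and the `to_number` helper are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up column where one observation is a non-numeric placeholder.
rad = pd.Series(["1", "?", "3"])

def to_number(value):
    """Convert to float, falling back to np.nan when parsing fails."""
    try:
        return float(value)
    except ValueError:
        return np.nan

rad_clean = rad.map(to_number)
print(rad_clean)
```

`pd.to_numeric(rad, errors='coerce')` is a built-in alternative that behaves similarly for this kind of input.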

In [9]:
# A:

<a id='drop_nulls'></a>

### 4. Counting Null Values and Dropping Rows

---

Having replaced the question marks with `np.nan` values, we know that there are some missing observations for the `RAD` column. 

When we start to build models with data, null values in observations are (almost) never allowed. It's important to always check how many observations are missing — and in which columns.

A handy way to find out how many null values there are per column is with `pandas`.

```python
boston.isnull().sum()
```

The built-in `.isnull()` function will convert the columns to Boolean `True` and `False` values (returning a new DataFrame); null values are indicated by `True`.

Tacked on the end, the `.sum()` function will then sum these Boolean columns, and the total number of null values per column will be returned.
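
The two steps can be seen separately on a small made-up frame (values are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up data with two missing RAD values.
toy = pd.DataFrame({
    "RAD": [1.0, np.nan, 3.0, np.nan],
    "TAX": [296.0, 242.0, 222.0, 311.0],
})

null_mask = toy.isnull()       # DataFrame of True/False
null_counts = null_mask.sum()  # True counts as 1, False as 0

print(null_mask)
print(null_counts)
```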

In [10]:
# A:

**Drop the null values.** 

In this case, let's keep it simple and drop the rows from the data set that contain null values. If a column has a lot of null values, it often makes more sense to drop the column entirely instead of the individual rows, but here we will drop the rows.

The `.dropna()` function will drop rows that have _**ANY**_ null values. Use this carefully, as you could drop more rows than you expect.
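
The made-up example below shows why caution is warranted: each column has only one missing value, yet two of the three rows are removed.

```python
import numpy as np
import pandas as pd

# Made-up data: the nulls sit in different rows of different columns.
toy = pd.DataFrame({
    "RAD": [1.0, np.nan, 3.0],
    "TAX": [296.0, 242.0, np.nan],
})

dropped = toy.dropna()  # returns a new DataFrame; toy itself is unchanged

print(len(toy), "rows before,", len(dropped), "rows after")
```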

In [11]:
# A:

<a id='rename'></a>

### 5. Renaming Columns

---

Oftentimes, it's annoying to have to memorize what codes mean for columns, or reference the codebook whenever we want to know the meaning of a variable. So, it makes sense to rename columns more descriptively.

There is more than one method for accomplishing this, but one easy way is to use the `.rename()` function.

For reference, here are the column names and their descriptions again:

    CRIM: Per capita crime rate by town.
    ZN: Proportion of residential land zoned for lots larger than 25,000 sq. ft. 
    INDUS: Proportion of non-retail business acres per town.
    CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). 
    NOX: Nitric oxides concentration (parts per 10 million).
    RM: Average number of rooms per dwelling.
    AGE: Proportion of owner-occupied units built prior to 1940. 
    DIS: Weighted distances to five Boston employment centers.
    RAD: Index of accessibility to radial highways.
    TAX: Full-value property tax rate per 10,000 dollars.
    PTRATIO: Pupil-teacher ratio by town.
    B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. 
    LSTAT: Percentage of lower-status population. 
    MEDV: Median value of owner-occupied homes in 1000s of dollars.

In [12]:
# A:

There are two popular methods for renaming DataFrame columns:
1. Using a _dictionary substitution_, which is useful if you only want to rename a few of the columns. This method uses the `.rename()` function.
2. Using a _list replacement_, which is quicker than writing out a dictionary but requires a full list of names.
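
Both methods are sketched here on a made-up three-column frame. The new name `avg_rooms` is a hypothetical choice for illustration; pick whatever descriptive names suit you.

```python
import pandas as pd

toy = pd.DataFrame({"CRIM": [0.1], "RM": [6.5], "MEDV": [24.0]})

# 1. Dictionary substitution: only the listed columns are renamed.
renamed = toy.rename(columns={"CRIM": "rate_of_crime"})
print(renamed.columns.tolist())

# 2. List replacement: one name per column, in the existing column order.
relabeled = toy.copy()
relabeled.columns = ["rate_of_crime", "avg_rooms", "home_median_value"]
print(relabeled.columns.tolist())
```

Note that `.rename()` returns a new DataFrame, while assigning to `.columns` modifies the DataFrame in place.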

In [None]:
# Dictionary method

In [None]:
# List replacement method

<a id='describe'></a>

### 6. Describing Summary Statistics for Columns

---

The `.describe()` function gives summary statistics for each of your variables. What are some, if any, oddities you notice about the variables, based on this output?
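
As a hint of what to look for, the sketch below uses a made-up column with one extreme value. A mean far above the median (the `50%` row) and a maximum far above the 75th percentile are typical signs of skew or outliers.

```python
import numpy as np
import pandas as pd

# Made-up column: four small values and one extreme value.
toy = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3, 0.4, 9.0]})

summary = toy.describe()
print(summary)

# Compare the mean with the median to see the pull of the extreme value.
mean_value = summary.loc["mean", "CRIM"]
median_value = summary.loc["50%", "CRIM"]
print(mean_value, median_value)
```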

In [13]:
# A:

<a id='boxplots'></a>

### 7. Investigating Potential Outliers With Box Plots

---

Here, we will use the `seaborn` package to create box plots of the variables we've identified as potentially containing outliers.

First, some notes on `seaborn`'s box plot keyword argument options:

    `orient`: Can be 'v' or 'h' for vertical and horizontal, respectively.
    `fliersize`: The size of the markers used for outlier points.
    `linewidth`: The width of the line surrounding the box plot.
    `notch`: Shows the confidence interval for the median (calculated by `seaborn/plt.boxplot`).
    `saturation`: The proportion of the original color saturation to draw with (1 leaves the colors unchanged).

There are more keyword arguments available, but these are the most relevant for now.   

_If you want to check out more, place your cursor in the `boxplot` argument bracket and press `shift+tab` (Press four times repeatedly to bring up detailed documentation)._
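
A minimal sketch of these keyword arguments in use follows. The data is randomly generated rather than taken from the Boston set, and the specific argument values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Randomly generated, right-skewed toy data (not the real crime rates).
rng = np.random.default_rng(0)
toy = pd.Series(rng.exponential(scale=2.0, size=200), name="rate_of_crime")

fig, ax = plt.subplots(figsize=(8, 2))
sns.boxplot(
    x=toy,
    orient="h",      # horizontal box
    fliersize=8,     # larger outlier markers
    linewidth=3,     # thicker box outline
    notch=True,      # notch around the median
    saturation=0.5,  # muted fill color
    ax=ax,
)
ax.set_title("Toy skewed variable")
```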
    

In [14]:
# Rate of crime


In [15]:
# Percent owner occupied


In [16]:
# Percent business zone


In [17]:
# Black population statistic


<a id='plot_all'></a>

### 8. Plotting All Variables Together

---

Plot all of the variables in a horizontal box plot with `seaborn`. What, if anything, is wrong with this plot?

In [18]:
# A:


<a id='standardization'></a>

### 9. Standardizing Variables

---

Rescaling variables is common and sometimes essential. For example, when we get to regularizing models, the rescaling procedure becomes a requirement before fitting the model.

Here, we'll rescale the variables using a procedure called "standardization," which forces the distribution of each variable to have a mean of 0 and a standard deviation of 1.

Standardization is not complicated:

    standardized_variable = (variable - mean_of_variable) / std_dev_of_variable
    
**Note**: Nothing else has changed about the distribution of the variable. It doesn't become normally distributed.
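
The sketch below applies the formula to a made-up skewed series and checks both claims: the result has mean 0 and standard deviation 1, and its skewness is the same as before. Note that the `pandas` `.std()` function divides by N - 1 by default.

```python
import numpy as np
import pandas as pd

# Made-up skewed values standing in for a real variable.
crime = pd.Series([0.1, 0.2, 0.3, 0.4, 9.0], name="rate_of_crime")

crime_stand = (crime - crime.mean()) / crime.std()

print(crime_stand.mean(), crime_stand.std())  # approximately 0 and 1
print(crime.skew(), crime_stand.skew())       # same skewness before and after
```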

**9.A Pull out the rate of crime and plot the distribution.**

Also, print out the mean and standard deviation of the original variable.

In [19]:
# A:

**9.B Standardize the `rate_of_crime` variable. Notice that the new mean is centered at 0.**

In [20]:
# A:

**9.C Plot the original and standardized rate of crime. Notice that nothing changes about the distribution, except for the location and scale.**

In [21]:
# A:


<a id='plot_all_rescaled'></a>

### 10. Plotting the Standardized Variables Together

---

`pandas` DataFrames make it easy to standardize columns simultaneously. You can standardize data like so:

```python
boston_stand = (boston - boston.mean()) / boston.std()
```

Create a standardized version of the data and recreate the box plot. Now you can better examine the differences in the shape of distributions across our variables.

In [22]:
# A:


<a id='cov_cor'></a>

### 11. Looking at the Covariance or Correlation Between Variables

---

An easy way to get a feel for linear relationships between variables is with a correlation matrix.

Below, we have the formula for the covariance between two variables, $X$ and $Y$.

#### 11.A Covariance

Given the sample size $N$, variables $X$ and $Y$, and means $\bar{X}$ and $\bar{Y}$:

### $$ \text{covariance}(X, Y) = \sum_{i=1}^N \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{N}$$

The covariance is a measure of "relatedness" between variables. For each observation, multiply its deviation from the mean of $X$ by its deviation from the mean of $Y$; the covariance is the sum of these products divided by the sample size ($N$).

Code the covariance between `pct_underclass` and `home_median_value` by hand below. Verify that you've gotten the correct result using `np.cov()`. Set the keyword argument `bias=True` in `np.cov()` to have it use the same covariance calculation.

**Note**: `np.cov` returns a covariance _matrix_: the diagonal holds each variable's covariance with itself (its variance), and the off-diagonal entries hold the covariance between the two variables.
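
Here is a sketch of the by-hand calculation on two short made-up arrays (`x` and `y` are placeholders, not the Boston columns), compared against `np.cov()` with `bias=True`.

```python
import numpy as np

# Made-up values for two variables.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])
n = len(x)

# Sum of the products of paired deviations, divided by N.
cov_by_hand = np.sum((x - x.mean()) * (y - y.mean())) / n

# bias=True makes np.cov divide by N instead of N - 1.
cov_matrix = np.cov(x, y, bias=True)

print(cov_by_hand)
print(cov_matrix)  # off-diagonal entries should match cov_by_hand
```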

In [23]:
# A:

#### 11.B Correlation

Covariance is not easy to interpret. Its magnitude depends on the units and spread of both variables, so a large value does not necessarily indicate a strong relationship.

A much more common metric — and one that is directly calculable from the covariance — is the correlation.

Again, let $X$ and $Y$ be our two variables and use the covariance $cov(X, Y)$ that we calculated above:

### $$ \text{pearson correlation}\;r = cor(X, Y) =\frac{cov(X, Y)}{std(X)std(Y)}$$

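Calculate the correlation between `pct_underclass` and `home_median_value` by hand below. Check that it matches the off-diagonal value returned by `np.corrcoef()`. No `bias` argument is needed: in `np.corrcoef()` that argument is deprecated and has no effect, because the normalization cancels out of the ratio as long as the covariance and both standard deviations divide by the same quantity (`np.std()` divides by $N$ by default).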
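
The calculation can be sketched on two short made-up arrays (`x` and `y` are placeholders, not the Boston columns). Here the covariance and `np.std()` both divide by N, so the two are consistent.

```python
import numpy as np

# Made-up values for two variables.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# Covariance dividing by N, then scaled by both standard deviations.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_by_hand = cov_xy / (x.std() * y.std())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_by_hand, r_numpy)
```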


In [24]:
# A:

#### 11.C The correlation matrix

We can see the correlation between all numeric variables in our data set by using `pandas` DataFrames' built-in `.corr()` function. Use it below on the Boston data set.

It's useful for getting a feel for what is related and what is not, which can help you decide what to investigate further. (Although with a lot of variables, the matrix can be a bit overwhelming).
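
A small made-up frame shows how to read the result: `b` rises in step with `a`, and `c` falls in step with `a`.

```python
import numpy as np
import pandas as pd

# Made-up columns with perfectly linear relationships.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # exactly 2 * a
    "c": [4, 3, 2, 1],  # decreases as a increases
})

corr = toy.corr()
print(corr)  # a-b is 1, a-c is -1, and the diagonal is all 1s
```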

In [25]:
# A:

**`seaborn` also has an effective way of showing this visually, if colors stick out to you more than decimal values.**
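
One option is `sns.heatmap()`, sketched below on a made-up frame. The color map and the other argument values are arbitrary choices for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up columns, not the Boston data.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 5],
    "c": [5, 4, 3, 2, 1],
})
corr = toy.corr()

fig, ax = plt.subplots(figsize=(4, 3))
# Fix the color scale to [-1, 1] so that 0 sits at the middle of the map.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Toy correlation matrix")
```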

In [None]:
# A: