In [None]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<a id='load_data'></a>

### 1. Load the data

---

Import the CSV file into a pandas DataFrame.

In [None]:
boston_file = './datasets/housing.csv'
bf = pd.read_csv(boston_file)
bf.head()

In [None]:
# A:
bf.columns

<a id='header'></a>

### 2. Describe the basic format of the data and the columns

---

Use the `.head()` method (optionally passing an integer for the number of rows you want to see) to examine what the loaded data looks like. This is a good first step for getting a feel for what is in the CSV and what problems may be present.

The `.dtypes` attribute tells you the data type for each of your columns.

In [None]:
# Print out the first 8 rows:
bf.head(8)

In [None]:
# Look at the dtypes of the columns:
bf.dtypes

In [None]:
bf.info()

In [None]:
bf.shape

In [None]:
bf.describe()

In [None]:
bf.describe().T

<a id='drop'></a>

### 2. Drop unwanted columns

---

There is a column labeled `Unnamed: 0` which appears to simply number the rows. The DataFrame's index already holds these row numbers, so we don't need this column.

The `.drop()` method can be used to get rid of a column. When removing a column, we need to pass `axis=1` so that pandas looks for the label among the columns rather than the rows.

For the record, the `.index` attribute holds the row labels. It is the sister attribute to the `.columns` attribute that we work with more often.
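
As a minimal sketch of dropping a redundant column, the toy DataFrame below stands in for the housing data (its values are made up for illustration). It shows that `.drop()` returns a new DataFrame, and that `columns=[...]` is an equivalent way to spell `axis=1`.

```python
import pandas as pd

# Hypothetical toy frame with a redundant row-number column,
# similar to what a CSV saved together with its index produces.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'CRIM': [0.1, 0.2, 0.3],
    'AGE': [65.2, 78.9, 61.1],
})

# .drop() returns a new DataFrame; `toy` itself is left untouched.
dropped = toy.drop('Unnamed: 0', axis=1)

# Equivalent spelling that names the axis explicitly.
dropped_alt = toy.drop(columns=['Unnamed: 0'])

print(list(dropped.columns))
print(list(dropped.index))
```

Because the result is a new object, it has to be assigned back to a name for the change to persist.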



In [None]:
# Print the type of the index object and the first 20 items in the DataFrame's index
# to see that we already have these row numbers:
print(type(bf.index))
bf.index.values[:20]

In [None]:
# Remove the unnecessary column (assign the result back so the change persists):
bf = bf.drop('Unnamed: 0', axis=1)


In [None]:
bf.columns

<a id='clean'></a>

### 3. Clean corrupted columns

---

You may have noticed when we examined the `dtypes` attribute that two of the columns were of type "object", indicating that they hold strings. However, we know from the data description above (and we can infer from the header of the data) that `DIS` and `RAD` should in fact be numeric.

It is pretty common to have numeric columns represented as strings in your data if some of the observations are corrupted. It is important to always check the data types of your columns.
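
One way to locate corrupted entries, sketched here on a hypothetical column rather than the real `DIS` values, is to attempt a numeric conversion with `pd.to_numeric(..., errors='coerce')` and then look at the entries that failed to parse.

```python
import pandas as pd

# Hypothetical column: mostly numeric strings, plus one corrupted entry.
col = pd.Series(['4.09', '4.97', 'bad', '6.06'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
as_numbers = pd.to_numeric(col, errors='coerce')

# Entries that were present originally but could not be parsed.
offenders = col[as_numbers.isna() & col.notna()]
print(offenders.tolist())
```

Inspecting the offending strings first helps you decide how to repair them without discarding information.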

**3.A What is causing the `DIS` column to be encoded as a string? Figure out a way to make sure the column is numeric while preserving information.**

*Tip: Either use a for loop or use the `.map()` method. Calling `.map()` on a column applies a function to each element of that column.*
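
The sketch below illustrates both approaches from the tip on made-up data. The corruption pattern (a comma in place of a decimal point) and the helper name `to_float` are assumptions for illustration only, not necessarily what you will find in `DIS`.

```python
import pandas as pd

# Hypothetical corruption: a comma used where a decimal point belongs.
raw = pd.Series(['4.09', '4,97', '6.06'])

def to_float(value):
    """Repair the (assumed) stray comma, then convert to float."""
    return float(str(value).replace(',', '.'))

# .map() applies to_float to every element and returns a new Series.
clean = raw.map(to_float)

# The same result with an explicit for loop (here a list comprehension).
clean_loop = pd.Series([to_float(v) for v in raw])

print(clean.dtype)
```

Repairing the value, rather than replacing it with a missing value, preserves the information in the corrupted row.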

In [None]:
# Dictionary Method

In [None]:
# List Replacement Method

<a id='describe'></a>

### 6. Describe the summary statistics for the columns

---

The `.describe()` method gives summary statistics for each of your variables. What oddities, if any, do you notice about the variables based on this output?

In [None]:
# A:

<a id='boxplots'></a>

### 7. Plot variables with potential outliers using boxplots.

---

Here we will plot boxplots of the variables we have identified as potentially having outliers.

_To see the available options, place your cursor inside the parentheses of the `boxplot` call and press `Shift+Tab` (press it four times in a row to bring up the detailed documentation)._
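
To make concrete what a boxplot marks as an outlier, here is a small sketch of the 1.5 × IQR rule that matplotlib and seaborn use by default (`whis=1.5`). The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical skewed sample with one extreme value.
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

# The box spans the first to third quartile; its height is the IQR.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

The whiskers in the plot extend to the most extreme data points that still fall inside the fences, not to the fences themselves.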
    

In [None]:
# rate of crime


In [None]:
# percent owner occupied


In [None]:
# business zone percent


In [None]:
# black population statistic


<a id='plot_all'></a>

### 8. Plot all the variables on boxplots together.

---

Plot boxplots of all the variables together on a single set of axes.


<a id='standardization'></a>

### 9. Standardizing variables

---

Rescaling variables is very common and sometimes essential. For example, regularized models penalize the size of the coefficients, so the variables must be on a common scale before the model is fit.

Here we'll rescale the variables using a procedure called "standardization", which forces the distribution of each variable to have a mean of 0 and a standard deviation of 1.

Standardization is not complicated:

    standardized_variable = (variable - mean_of_variable) / std_dev_of_variable
    
Note: Standardization only shifts and rescales the variable. The shape of its distribution is unchanged, so it does not become normally distributed.
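
A minimal sketch of the formula on a hypothetical right-skewed variable follows. It checks that the standardized values have mean 0 and standard deviation 1, and that the skewness is the same before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable (not taken from the housing data).
x = pd.Series([0.1, 0.2, 0.2, 0.5, 1.0, 9.0])

# pandas .std() uses the sample standard deviation (ddof=1) by default.
z = (x - x.mean()) / x.std()

# Mean and standard deviation after standardizing (0 and 1 up to rounding).
print(round(z.mean(), 10), round(z.std(), 10))

# Skewness is unchanged: the values are shifted and rescaled, not reshaped.
print(np.isclose(x.skew(), z.skew()))
```

Note that `np.std` defaults to `ddof=0` while the pandas `.std()` method defaults to `ddof=1`, so the two give slightly different results unless `ddof` is set explicitly.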

**9.A Pull out rate of crime and plot the distribution.**

Also print out the mean and standard deviation of the original variable.

In [None]:
# A:
bf.describe()

In [None]:
bf['CRIM'].std()

**9.B Standardize the rate_of_crime variable. Notice that the new mean is 0.**