# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning, analyzing (using descriptive statistics), and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Boston Housing dataset, which contains housing values in the suburbs of Boston. The Boston Housing dataset is commonly used by aspiring data scientists.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Load the data (which is stored in the file `'train.csv'`)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ for each subset of the data. Examples of potential splits:
    - Create two new DataFrames based on your existing data, where one contains all the properties next to the Charles river, and the other one contains properties that aren't 
    - Create two new DataFrames based on a certain split for crime rate 
* Next, use histograms and scatter plots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
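
As a starting point, the splits described above might be sketched as follows. The rows are made-up stand-in values (not taken from `train.csv`), and the names `by_river`, `not_by_river`, `crim_cut`, and `summarize` are illustrative. Using the median crime rate as the cut point is an assumption, not a requirement of the project.

```python
import pandas as pd

# Made-up stand-in rows; in the project, load the real data with pd.read_csv('train.csv').
df = pd.DataFrame({
    "chas": [0, 0, 1, 1, 0, 1],
    "crim": [0.02, 0.10, 0.05, 3.50, 8.20, 0.30],
    "medv": [24.0, 21.6, 34.7, 15.2, 10.4, 28.1],
})

# 2-way split on the Charles River dummy variable.
by_river = df.loc[df["chas"] == 1]
not_by_river = df.loc[df["chas"] == 0]

# 2-way split on crime rate, using the median as an (assumed) cut point.
crim_cut = df["crim"].median()
high_crime = df.loc[df["crim"] > crim_cut]
low_crime = df.loc[df["crim"] <= crim_cut]


def summarize(subset, column):
    """Return the mean, median, and standard deviation of one column."""
    return subset[column].agg(["mean", "median", "std"])


# Side-by-side comparison of one variable across the two river subsets.
river_summary = pd.DataFrame({
    "by_river": summarize(by_river, "medv"),
    "not_by_river": summarize(not_by_river, "medv"),
})
print(river_summary)
```

The same `summarize` helper can be reused for the crime-rate split, which keeps the comparisons consistent across all three splits.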

## Variable Descriptions

This DataFrame contains the following columns:

- `crim`: per capita crime rate by town  
- `zn`: proportion of residential land zoned for lots over 25,000 sq.ft  
- `indus`: proportion of non-retail business acres per town   
- `chas`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)  
- `nox`: nitrogen oxide concentration (parts per 10 million)   
- `rm`: average number of rooms per dwelling   
- `age`: proportion of owner-occupied units built prior to 1940  
- `dis`: weighted mean of distances to five Boston employment centers   
- `rad`: index of accessibility to radial highways   
- `tax`: full-value property-tax rate per \$10,000   
- `ptratio`: pupil-teacher ratio by town    
- `black`: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town   
- `lstat`: lower status of the population (percent)   
- `medv`: median value of owner-occupied homes in \$1000s
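
Before analyzing, it can help to confirm that the loaded columns match these descriptions. The sketch below checks a few of them on made-up stand-in rows; the `checks` dictionary and its keys are illustrative names, and the specific checks are a suggestion rather than part of the assignment.

```python
import pandas as pd

# Made-up stand-in rows; in the project, load the real data with pd.read_csv('train.csv').
df = pd.DataFrame({
    "chas": [0, 1, 0, 0],
    "age": [65.2, 78.9, 45.8, 100.0],
    "crim": [0.00632, 0.02731, 0.03237, 0.06905],
})

# Each check evaluates to True when the column behaves as described.
checks = {
    "chas_is_binary": bool(df["chas"].isin([0, 1]).all()),
    "age_within_0_100": bool(df["age"].between(0, 100).all()),
    "crim_non_negative": bool((df["crim"] >= 0).all()),
    "no_missing_values": not bool(df.isna().any().any()),
}

for name, passed in checks.items():
    print(f"{name}: {passed}")
```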
  
    
## Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.


## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!

In [21]:
import pandas as pd
import numpy as np

%matplotlib notebook
import matplotlib.pyplot as plt
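# Matplotlib 3.6+ renamed the seaborn styles; fall back to the old name on older versions
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)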

In [22]:
data = pd.read_csv('train.csv')
data = data[['crim','age','tax']]
data.head()

      crim   age  tax
0  0.00632  65.2  296
1  0.02731  78.9  242
2  0.03237  45.8  222
3  0.06905  54.2  222
4  0.08829  66.6  311


In [23]:
print(data.describe())

             crim         age         tax
count  333.000000  333.000000  333.000000
mean     3.360341   68.226426  409.279279
std      7.352272   28.133344  170.841988
min      0.006320    6.000000  188.000000
25%      0.078960   45.400000  279.000000
50%      0.261690   76.700000  330.000000
75%      3.678220   93.800000  666.000000
max     73.534100  100.000000  711.000000
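
The `describe()` output shows a mean crime rate (about 3.36) far above the median (about 0.26), which suggests a right-skewed distribution. In that situation, robust measures such as the median and interquartile range can be more informative than the mean and standard deviation. A minimal sketch follows, using only the five rows shown by `data.head()` as a tiny stand-in, so the numbers it prints are not representative of the full dataset.

```python
import pandas as pd

# The five rows shown by data.head() above, used only as a tiny stand-in.
sample = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.03237, 0.06905, 0.08829],
    "age": [65.2, 78.9, 45.8, 54.2, 66.6],
    "tax": [296, 242, 222, 222, 311],
})

# Median, interquartile range, and skewness for every column.
summary = pd.DataFrame({
    "median": sample.median(),
    "iqr": sample.quantile(0.75) - sample.quantile(0.25),
    "skew": sample.skew(),
})
print(summary)
```

On the real data, replacing `sample` with `data` gives the same table for all 333 rows.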


In [24]:
# Box plots of all three columns on a shared axis
data.boxplot()
plt.show()

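
Because `tax` ranges into the hundreds while `crim` has a median below 1, a shared y-axis compresses the smaller-valued columns. One possible alternative is to give each column its own subplot. The sketch below again uses the five `data.head()` rows as a stand-in, and the figure size is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The five rows shown by data.head(), used only as a tiny stand-in.
sample = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.03237, 0.06905, 0.08829],
    "age": [65.2, 78.9, 45.8, 54.2, 66.6],
    "tax": [296, 242, 222, 222, 311],
})

# One subplot per column, so each box plot gets its own y-axis scale.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(9, 3))
for ax, column in zip(axes, sample.columns):
    ax.boxplot(sample[column].to_numpy())
    ax.set_title(column)
fig.tight_layout()
```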

In [13]:
data.hist()


In [14]:
# Scatter matrix with kernel density estimates on the diagonal
from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6), diagonal='kde')


In [26]:
import seaborn as sns
from scipy.stats import pearsonr

# Default figure size for subsequent Matplotlib figures
plt.rcParams['figure.figsize'] = (4, 2)

# data.corr(method='pearson')

# Pairwise scatter plots, with histograms on the diagonal
sns.pairplot(data)


In [27]:
# Pair plot suggests possible positive correlation between 
# crime rate and building age

In [38]:
pearsonr_coefficient, p_value = pearsonr(data.age, data.crim)
print(f"PearsonR Correlation Coefficient {pearsonr_coefficient}")

PearsonR Correlation Coefficient 0.3790338208768802


In [None]:
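# An r of about 0.38 indicates a weak-to-moderate positive linear correlation
# between building age and crime rate. Pearson's r only measures linear
# association, so the strongly skewed crime-rate distribution may affect it.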
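
Since the crime rate is strongly skewed, a rank-based coefficient can serve as a cross-check on Pearson's r. The sketch below swaps in Spearman's rank correlation via `DataFrame.corr`; this is an alternative technique, not the one used in the cells above. It runs on the five `data.head()` rows only, so its values will not match the 0.379 computed on the full dataset.

```python
import pandas as pd

# The five rows shown by data.head(), used only as a tiny stand-in.
sample = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.03237, 0.06905, 0.08829],
    "age": [65.2, 78.9, 45.8, 54.2, 66.6],
    "tax": [296, 242, 222, 222, 311],
})

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson = sample.corr(method="pearson")
spearman = sample.corr(method="spearman")

print("Pearson age vs. crim: ", pearson.loc["age", "crim"])
print("Spearman age vs. crim:", spearman.loc["age", "crim"])
```

A large gap between the two coefficients on the real data would hint that outliers or a nonlinear relationship are influencing Pearson's r.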