# Introduction.

This notebook contains my analysis of the Boston Housing dataset. 

The Boston Housing dataset contains information on house prices in various parts of Boston, Massachusetts. It was originally published by Harrison, D. and Rubinfeld, D.L. in their paper 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978 (Delve 1996). The dataset was derived from information collected by the U.S. Census Service concerning house prices in the Boston area (Delve 1996).

The dataset is used extensively in data science and machine learning and was originally part of the UCI Machine Learning Repository (Medium 2019). It is typically used as a training set for predictive algorithms, where the goal is to predict house prices from the other variables (Medium 2019).

The sequence of this notebook is as follows:

1. A description of the dataset using descriptive statistics.
2. The application of inferential statistics to analyse the relationships between the variables. In this instance I am particularly interested in the relationship between house prices (MEDV) and proximity to the Charles River (CHAS).
3. Prediction.



I'm starting off by importing the necessary libraries.

In [1]:
# Importing the libraries I need to perform the analysis on the dataset.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Next, I will import the dataset from the CSV file. The source of the dataset is listed in the References.

In [2]:
# Reading the dataset into the notebook as 'data'.
data = pd.read_csv('boston.csv')

To check that the dataset was read in correctly, I display its first 5 rows using the 'head' command.

In [3]:
# Using the 'head' command to display the first 5 rows of the dataset.
data.head()

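| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |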


## Describe (20%)
Use descriptive statistics and plots to describe the Boston House Prices dataset.

The dataset contains 506 rows with 14 variables.

### The Variables.
CRIM - per capita crime rate by town / area.

ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS - proportion of non-retail business acres per town.

CHAS - Charles River dummy variable (1 if tract bounds river, 0 otherwise).

NOX - nitric oxides concentration (parts per 10 million).

RM - average number of rooms per dwelling.

AGE - proportion of owner-occupied units built prior to 1940.

DIS - weighted distance to five Boston employment centres.

RAD - index of accessibility to radial highways.

TAX - full-value property tax rate per $10,000.

PTRATIO - pupil-teacher ratio per town.

B - 1000 (Bk - 0.63)^2 where Bk is the proportion of blacks per town.

LSTAT - % lower status of the population.

MEDV - median value of owner-occupied homes in $1,000's.
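
As a quick check on how the B variable is constructed, the formula in the list above can be evaluated directly. This is only a minimal sketch, and the function name `b_value` is my own rather than anything defined by the dataset. A town with Bk = 0 gives 1000 × 0.63² = 396.9, which is the value that appears several times in the `b` column of the rows displayed earlier, and a town with Bk = 0.63 gives 0.

```python
def b_value(bk):
    """Evaluate B = 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("Bk is a proportion and must lie between 0 and 1")
    return 1000 * (bk - 0.63) ** 2


# Bk = 0 gives the largest possible B on the interval [0, 1].
print(round(b_value(0.0), 2))
# Bk = 0.63 is the point where B is zero.
print(b_value(0.63))
```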



### Summary Statistics.



In [4]:
# Using the 'describe' command to display the summary statistics for the dataset.
data.describe()

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677082 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |
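
Because CHAS is a 0/1 dummy variable, its mean in the summary table is the share of tracts that bound the river, and multiplying by the row count recovers how many tracts that is. A small sketch using only the figures from the summary output (the variable names are my own):

```python
# Figures taken from the describe() output: count and mean of 'chas'.
n_tracts = 506
chas_mean = 0.06917

# The mean of a 0/1 variable is the proportion of ones.
river_tracts = round(n_tracts * chas_mean)
other_tracts = n_tracts - river_tracts

print("Tracts bounding the river:", river_tracts)
print("Tracts not bounding the river:", other_tracts)
```

This gives 35 tracts on the river against 471 that are not. In the notebook the same counts can be read directly with `data['chas'].value_counts()`. The two groups are very unequal in size, which is worth keeping in mind for the inferential section.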


Because we are primarily interested in house prices (known as the target variable), I will run descriptive statistics on this variable. I will also plot the distribution of these values.

In [7]:
price = data['medv']

In [6]:
# Minimum price in the data (MEDV is recorded in $1,000s).
minimum_price = np.min(price)
print("The minimum price: ${:}".format(minimum_price))
# Alternative using pandas:
# minimum_price = price.min()

The minimum price: $5.0
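
The same pattern extends to the other descriptive statistics for the target. The sketch below is illustrative only: to stay self-contained it uses the five MEDV values shown in the `head` output rather than the full column, and the helper names `summarise_prices` and `plot_price_distribution` are my own. In the notebook, `price = data['medv']` would be passed in instead.

```python
import numpy as np

# First five MEDV values from the head() output (in $1,000s), not the full column.
price_sample = np.array([24.0, 21.6, 34.7, 33.4, 36.2])


def summarise_prices(values):
    """Return basic descriptive statistics for an array of prices."""
    values = np.asarray(values, dtype=float)
    return {
        "min": float(np.min(values)),
        "max": float(np.max(values)),
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values, ddof=1)),  # sample std, as describe() reports
    }


def plot_price_distribution(values, bins=20):
    """Draw a histogram of the prices (matplotlib is imported only when called)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, edgecolor="black")
    ax.set_xlabel("MEDV ($1,000s)")
    ax.set_ylabel("Number of tracts")
    ax.set_title("Distribution of median house prices")
    return ax


summary = summarise_prices(price_sample)
for name, value in summary.items():
    print("{}: {:.2f}".format(name, value))
```

Calling `plot_price_distribution(price)` in the notebook would then draw the histogram of the full MEDV column.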


### Notes / Thoughts.
Is it possible that the dataset was gerrymandered in any way so as to produce a particular outcome?
Ian has asked that we specifically look at the relationship between house prices and closeness to the Charles River. Is there a correlation? Or is this correlation explained in other ways?

E.g. could closeness to the Charles River be related to lower air pollution, making the area more desirable to live in and so pushing up house prices? If this was the case, could it be that being close to the Charles River meant that you were further away from large population centres and naturally had a lower crime rate as a result? In that case it is not closeness to the river that matters but rather distance from population centres.

Who decided on the measurements to use? Are these reasonable to use in such a dataset?

Ian has asked us to analyse the dataset from the perspective of the relationship between median house prices and being close to the Charles River. Is it fair to assume that house prices are the important metric for consideration?

Maybe there is a relationship between other variables that indirectly impacts the price of houses, or perhaps there is another factor that influences house prices and is not measured at all in the dataset.
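
One way to start probing these questions is to look at how the candidate explanatory variables correlate with MEDV. The sketch below is an assumption about how that could be done, not an analysis result: it rebuilds four columns from the five rows in the `head` output so that it runs on its own, and the helper `correlations_with_target` is a hypothetical name. CHAS is left out of the sample because it is 0 in all five rows, so its correlation would be undefined there; on the full `data` frame it would be included.

```python
import pandas as pd

# Five rows copied from the head() output; not the full dataset.
sample = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
    "nox": [0.538, 0.469, 0.469, 0.458, 0.458],
    "dis": [4.09, 4.9671, 4.9671, 6.0622, 6.0622],
    "medv": [24.0, 21.6, 34.7, 33.4, 36.2],
})


def correlations_with_target(frame, target="medv"):
    """Return the Pearson correlation of every other column with the target."""
    corr = frame.corr()  # pairwise Pearson correlation matrix
    return corr[target].drop(target).sort_values()


print(correlations_with_target(sample))
```

On the full dataset, `correlations_with_target(data[['chas', 'nox', 'crim', 'dis', 'medv']])` would show how strongly each of these variables moves with price. Correlation alone cannot separate the effect of the river from that of pollution, crime or distance, so this is a starting point rather than an answer.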

Another question is what the descriptive statistics actually tell us about a dataset. How does a description of the dataset inform your thinking and decision making, or influence further analyses? How do you decide which statistics will give you a meaningful insight and which will not? Does the nature of the decision you need to make influence your choice of statistics used to describe the dataset?

## Infer (20%)
Use inferential statistics to analyse whether there is a significant difference in median house prices between houses along the Charles River and those that aren't.

Explain and discuss my findings.
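
A common choice for comparing the mean of MEDV between two groups of unequal size is Welch's t-test, which does not assume equal variances. Whether it is the right test here depends on the distribution of MEDV, so treat this as a sketch under that assumption. The helper `welch_t` is a hypothetical name, and the toy numbers are made up purely to show the calculation; they are not taken from the Boston data.

```python
import numpy as np
import pandas as pd


def welch_t(frame, group_col="chas", value_col="medv"):
    """Return Welch's t statistic and degrees of freedom for group 1 vs group 0."""
    a = frame.loc[frame[group_col] == 1, value_col].to_numpy(dtype=float)
    b = frame.loc[frame[group_col] == 0, value_col].to_numpy(dtype=float)
    # Squared standard error of each group mean, using the sample variance.
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, dof


# Made-up toy values (NOT from the dataset), only to exercise the function.
toy = pd.DataFrame({
    "chas": [1, 1, 1, 0, 0, 0],
    "medv": [30.0, 32.0, 34.0, 20.0, 22.0, 24.0],
})
t_stat, dof = welch_t(toy)
print("t = {:.3f}, degrees of freedom = {:.1f}".format(t_stat, dof))
```

In the notebook, `welch_t(data)` would give the statistic for the real groups. `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns a p-value.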

## Predict (60%)
Use Keras to create a neural network that can predict the median house price based on the other variables in the dataset.
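
A possible starting point for this section is sketched below. Everything in it is an assumption rather than a finished design: the layer sizes, the optimiser and the helper names `standardise` and `build_model` are illustrative choices of mine. Inputs are standardised using statistics from the training rows only, since the variables are on very different scales (compare TAX and NOX in the summary statistics).

```python
import numpy as np


def standardise(train, test):
    """Scale both arrays using the mean and std of the training rows only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std


def build_model(n_features):
    """Build a small regression network (Keras is imported only when called)."""
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),  # single linear output: predicted MEDV
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


# Tiny check of the scaling step on two illustrative rows.
train_scaled, test_scaled = standardise([[1.0, 10.0], [3.0, 30.0]], [[2.0, 20.0]])
print(train_scaled)
print(test_scaled)
```

With the real data, the features would be `data.drop(columns='medv')` and the target `data['medv']`, split into training and test rows before calling `standardise`. `build_model(13)` would then match the 13 remaining input variables.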



Notes.
Possible additional analyses.
Is this unique to Boston and the Charles river? Does this occur in other cities? 
What about London, Paris or Dublin? Or Galway and the Corrib?
Is it only rivers that cause this or do other bodies of water?

## References.

Delve 1996, 'The Boston Housing Dataset', CS Toronto webpage, https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html Accessed on November 6th 2019.

Harrison, D. and Rubinfeld, D.L. 1978, 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, pp. 81-102.

Medium 2019, 'EDA on Boston Housing Dataset', https://medium.com/@utkarshgpt47/eda-on-boston-housing-dataset-8745644ab368 Accessed on November 6th 2019.

Source for Dataset
https://github.com/selva86/datasets/blob/master/BostonHousing.csv Accessed on 7th November 2019.