<img src="https://learnonline.gmit.ie/pluginfile.php/1/theme_adaptable/logo/1538655948/Transparent%20new.png" align="left">


<h1>Higher Diploma in Science in Computing (Data Analytics) - Machine Learning and Statistics</h1>
<h2>Boston House Prices Dataset Project 2019</h2>
<h3>Cóbhan Phillipson - G00174503</h3>
<hr>

<h3><u>Introduction</u></h3>

<p>This project is conducted as part of the Machine Learning and Statistics module on the Higher Diploma in Science in Computing (Data Analytics) at GMIT. The project is based on the Boston House Prices dataset and follows these guidelines:</p>
<ol><li><strong>Describe:</strong> Create a jupyter notebook that uses descriptive statistics and plots to describe the Boston House Prices dataset.</li> 
<li><strong>Infer:</strong> Add a section where you use inferential statistics to analyse whether there is a significant difference in median house prices between houses that are along the Charles river and those that aren’t.</li>
<li><strong>Predict:</strong> Use keras to create a neural network that can predict the median house price based on the other variables in the dataset. </li></ol>
<hr>

<h4>Import Libraries</h4>
<p>In this project, I will use a number of Python packages. The first port of call is to import these packages into this notebook.</p>

In [10]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

%matplotlib inline

<h3><u>Section 1 - Describe</u></h3>
<h4>About the Boston House Prices Dataset</h4>
<p>This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was first published in 1978 in the Journal of Environmental Economics and Management, volume 5, as part of an article investigating the willingness of people in the Boston metropolitan area to pay for clean air.</p>

<p>The article, written by David Harrison, Jr. and Daniel L. Rubinfeld, found that marginal air pollution damages (as revealed in the housing market) increase with the level of air pollution and with household income.[1] The dataset is small, with only 506 rows and 13 feature columns plus the target (median value).</p>

<p>To get a full description of the dataset, we can print the DESCR attribute of the object returned by sklearn's load_boston, which provides a free-text description of the data.</p>

In [12]:
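# Import the Boston House Prices dataset, one of the datasets bundled with sklearn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston_dataset = load_boston()
# Print the free-text description of the dataset
print(boston_dataset.DESCR)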

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
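<p>Because load_boston is no longer available in scikit-learn 1.2 and later, the loading step above can be sketched in a version-tolerant way. This is a minimal sketch, not the method used in this notebook: the helper name load_boston_frame is hypothetical, the fallback follows the recipe suggested in scikit-learn's own deprecation notice (reading the original file from the CMU StatLib archive), and that fallback path needs network access.</p>

```python
# Hypothetical helper: the name and structure are assumptions for illustration
DATA_URL = "http://lib.stat.cmu.edu/datasets/boston"
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def load_boston_frame():
    """Return the 13 feature columns plus MEDV as a pandas DataFrame."""
    import pandas as pd

    try:
        # Available in older scikit-learn releases only
        from sklearn.datasets import load_boston
    except ImportError:
        import numpy as np

        # Each record in the raw file spans two lines after a 22-line header
        raw = pd.read_csv(DATA_URL, sep=r"\s+", skiprows=22, header=None)
        data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
        target = raw.values[1::2, 2]
    else:
        bunch = load_boston()
        data, target = bunch.data, bunch.target

    frame = pd.DataFrame(data, columns=FEATURE_NAMES)
    frame["MEDV"] = target
    return frame
```

<p>Calling load_boston_frame() would return a single dataframe that already contains MEDV, which removes the need to add the target column by hand.</p>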
      

<p>One thing to note in the dataset description is that the median value (MEDV) is the target variable, and the other variables are features that can be used to predict house prices.</p>

In [4]:
# create a pandas dataframe with the House Price Dataset
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names) 
# return the first 5 rows of our dataset to ensure everything has imported correctly
df.head()


,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


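<p>The first thing we notice here is that our target value MEDV is missing from our dataframe. This is because boston_dataset.data holds only the feature variables; the target is stored separately in boston_dataset.target and must be added as a new column.</p>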

In [17]:
# add the target (median value) as a new column
df['MEDV'] = boston_dataset.target
# return the first 5 rows again to check that MEDV is included
df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2
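<p>With MEDV stored alongside the features, the two can be separated again whenever a model needs them apart. The sketch below illustrates the idea on the first three rows shown above, restricted to a few columns for brevity; the names sample, X and y are assumptions made for this example only.</p>

```python
import pandas as pd

# First three rows of the output above, with a subset of the columns
sample = pd.DataFrame(
    {
        "CHAS": [0.0, 0.0, 0.0],
        "RM": [6.575, 6.421, 7.185],
        "LSTAT": [4.98, 9.14, 4.03],
        "MEDV": [24.0, 21.6, 34.7],
    }
)

# Features: every column except the target
X = sample.drop(columns="MEDV")
# Target: the median value column on its own
y = sample["MEDV"]

print(X.shape)     # (3, 3)
print(y.tolist())  # [24.0, 21.6, 34.7]
```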


<hr>
<h4>Key Statistics</h4>
<p>To begin our exploration of the data, it is worth identifying some key statistics. We can use the pandas describe() method to summarise each numeric column.</p>


In [14]:
df.describe()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.593761,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.596783,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.647423,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97
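<p>A few of the figures in the table above can be read more directly with some simple arithmetic. The sketch below copies values from the describe() output by hand (the variable names are assumptions for illustration) and shows three readings: the mean of the 0/1 CHAS dummy gives the share of tracts bounding the river, the gap between the mean and median of CRIM hints at a long right tail, and the quartiles of RM give its interquartile range.</p>

```python
# Values copied from the df.describe() output above
n_rows = 506
chas_mean = 0.06917
crim_mean, crim_median = 3.593761, 0.25651
rm_q1, rm_q3 = 5.8855, 6.6235

# CHAS is a 0/1 dummy, so mean * count recovers the number of riverside tracts
riverside_tracts = round(chas_mean * n_rows)

# A mean far above the median points to a long right tail in CRIM
crim_mean_to_median = crim_mean / crim_median

# Interquartile range of RM: the spread of the middle 50% of tracts
rm_iqr = rm_q3 - rm_q1

print(riverside_tracts)               # 35
print(round(crim_mean_to_median, 1))  # 14.0
print(round(rm_iqr, 3))               # 0.738
```

<p>Only about 35 of the 506 tracts bound the Charles River, so the two groups compared in the inferential section are very unequal in size.</p>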
