# Predict Boston Housing Prices

This Python program predicts house prices in Boston using linear regression, a machine learning algorithm.

# Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
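
To make the definition concrete: the model assumes the response is a weighted sum of the explanatory variables plus an intercept, and fitting chooses the weights that minimize the squared error. The sketch below uses synthetic data; the arrays and the "true" weights are made up for illustration and are not taken from the Boston data.

```python
import numpy as np

# Synthetic data (illustrative assumption): y = 2*x1 - 3*x2 + 5, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

# Append a column of ones so the intercept is learned as an extra weight
X_design = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: find w that minimizes ||X_design @ w - y||^2
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(w)  # approximately [2, -3, 5]: the two weights, then the intercept
```

Because the synthetic data has no noise, the recovered weights match the ones used to generate it up to floating-point error.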

## Pros:
1. Simple to implement.
2. Predicts continuous numeric values.

## Cons:
1. Prone to overfitting, especially with many features and few samples.
2. Fits poorly when the relationship between the independent and dependent variables is nonlinear.
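
The second limitation can be seen on a toy example. The sketch below fits a straight line to a purely quadratic relationship (made-up data, chosen only for illustration) and computes the coefficient of determination, R², by hand.

```python
import numpy as np

# Illustrative data (assumption): a purely quadratic relationship
x = np.linspace(-3, 3, 61)
y = x ** 2

# Fit a straight line y ~ slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2 = 1 - SS_res / SS_tot; a value near 0 means the line explains almost nothing
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(slope, r2)
```

Both values come out at essentially zero: the best straight line is flat and explains none of the variation. One common workaround is to add a transformed feature such as `x ** 2`, which keeps the model linear in its weights.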

In [1]:
# importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [2]:
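# Load the Boston Housing data set from sklearn.datasets
# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
#       so this cell requires an older scikit-learn version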
from sklearn.datasets import load_boston
boston = load_boston()

In [3]:
print("Features present in Dataset: ", *boston.feature_names)

Features present in Dataset:  CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT


In [4]:
# Transform the data set into data frames
# NOTE: boston.data = the feature values,
#       boston.feature_names = the column names of the features
#       boston.target = the target variable, i.e. the house prices
df_x = pd.DataFrame(boston.data, columns = boston.feature_names)
df_y = pd.DataFrame(boston.target)

In [5]:
# Get summary statistics for the data set: count, mean, standard deviation, etc.
df_x.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


In [6]:
#Initialize the linear regression model
reg = linear_model.LinearRegression()

In [7]:
# Split the data into 67% training and 33% testing data
# NOTE: We have to split both the independent variables (x) and the target, or dependent variable (y)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.33, random_state=42, shuffle=True)

In [8]:
print("X-Train Shape->", x_train.shape)
print("X-Test Shape->", x_test.shape)

X-Train Shape-> (339, 13)
X-Test Shape-> (167, 13)
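
The 339/167 row counts follow from how the test fraction is rounded: the test count is rounded up and the remaining rows go to training. The sketch below reproduces the arithmetic and shows a hand-rolled shuffle-and-slice split. It illustrates the idea only; it does not reproduce the exact permutation that `train_test_split` generates for `random_state=42`.

```python
import math
import numpy as np

n_samples = 506      # number of rows in the data set
test_size = 0.33

# The test count is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 339 167

# Hand-rolled split: shuffle the row indices with a fixed seed, then slice
rng = np.random.default_rng(42)
indices = rng.permutation(n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Every row index lands in exactly one of the two sets, which is what keeps the test data unseen during training.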


In [9]:
#Train our model with the training data
reg.fit(x_train, y_train)

LinearRegression()

In [10]:
# Print the coefficients/weights for each feature/column of our model
print(reg.coef_)

[[-1.28749718e-01  3.78232228e-02  5.82109233e-02  3.23866812e+00
  -1.61698120e+01  3.90205116e+00 -1.28507825e-02 -1.42222430e+00
   2.34853915e-01 -8.21331947e-03 -9.28722459e-01  1.17695921e-02
  -5.47566338e-01]]
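
Each coefficient is the change in the predicted price per unit change in that feature, with the others held fixed. A prediction is the dot product of a row of feature values with these weights, plus the intercept (stored in `intercept_` on a fitted scikit-learn model). The sketch below shows that arithmetic with hypothetical weights and rows; none of the numbers come from the fitted model above.

```python
import numpy as np

# Hypothetical weights with the same (1, n_features) layout that coef_ has
# when the target is passed as a single-column DataFrame
coef = np.array([[0.5, -2.0, 1.5]])
intercept = np.array([10.0])

# Two made-up rows of feature values
X_new = np.array([[1.0, 0.0, 2.0],
                  [4.0, 1.0, 0.0]])

# Each prediction is the dot product of a row with the weights, plus the intercept
y_manual = X_new @ coef.T + intercept
print(y_manual)  # shape (2, 1): 13.5 for the first row, 10.0 for the second
```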


In [11]:
# Predict prices for the test data
y_pred = reg.predict(x_test)

In [12]:
# Print the prediction for the third row of our test data (actual price = 13.6)
print(y_pred[2])

[15.63751079]


In [13]:
# Show the actual house prices from the test data set (column 0 of df_y)
y_test[0]

173    23.6
274    32.4
491    13.6
72     22.8
452    16.1
       ... 
110    21.7
321    23.1
265    22.8
29     21.0
262    48.8
Name: 0, Length: 167, dtype: float64

In [14]:
# Two ways to compute the mean squared error (MSE), which measures the average
# squared difference between the predictions and the actual values (lower is better).

# 1. Mean squared error with NumPy (.values avoids getting a pandas Series back)
#print("Mean Squared Error :", np.mean((y_pred - y_test.values)**2))

# 2. Mean squared error with sklearn
print("Mean Square Error :", mean_squared_error(y_test, y_pred))

# Mean absolute error (MAE): the average absolute difference, in the same units as the price
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_pred))

Mean Square Error : 20.724023437339753
Mean Absolute Error:  3.1482557548168235
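
Both metrics are simple averages over the prediction errors. The sketch below computes them by hand on toy values (not the notebook's data). The helper names `mse` and `mae` are introduced here for illustration. Flattening both inputs guards against a common pitfall: subtracting an `(n, 1)` array from an `(n,)` array silently broadcasts to an `(n, n)` matrix.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values: predictions given as a column, like the (n, 1) output of predict()
actual = [1.0, 2.0, 3.0]
predicted = [[1.0], [2.0], [5.0]]

print(mse(actual, predicted))           # 4/3: squared errors 0, 0, 4
print(mae(actual, predicted))           # 2/3: absolute errors 0, 0, 2
print(np.sqrt(mse(actual, predicted)))  # RMSE, in the same units as the target
```

Taking the square root of the MSE gives the root mean squared error (RMSE). For the MSE of about 20.72 reported above, that is roughly 4.55, in the same units as the target.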
