## Section Three - Boston Housing Data  
**Author: Zak Hussain**  
**Date: 10/18/2019 - 11/01/2019**  
**Course: ML 6140** 

**Purpose**:   
Using the Boston data set, build a linear regressor that predicts NOX and another one that predicts median home values. 

In [1]:
import pandas as pd

**Preprocessing** 

In [2]:
# column headers, identified by inspecting the data file.
col_headers = ['crim', 'zn', 'indus',
           'chas', 'nox', 'rm',
           'age', 'dis', 'rad',
           'tax', 'ptratio', 'b', 
           'lstat', 'medv']

# I reformatted the boston.txt file and saved the result to a new txt file, 'processed_boston.txt'.
# create a dataframe from the new file.
# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True, which newer pandas versions deprecate.
df = pd.read_csv('../Data/processed_boston.txt', sep=r'\s+', names=col_headers)

In [3]:
# check that the first rows of the dataframe look as expected.
df.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [4]:
# check the characteristics of the columns.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null float64
ptratio    506 non-null float64
b          506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.4 KB


In [5]:
# check the characteristics of the data 
df.describe()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677082,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [6]:
# standardize the data, as some columns have substantially different scales. 
from sklearn.preprocessing import StandardScaler

In [7]:
# create a StandardScaler object 
scaler = StandardScaler()

# fit and transform the original df, saving it to a standardized dataframe.
standardized_df = pd.DataFrame(scaler.fit_transform(df), columns=col_headers)

In [8]:
# display the characteristics of the standardized data. 
standardized_df.describe()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,-8.337643000000001e-17,3.306534e-16,2.804081e-16,-3.100287e-16,-8.071058e-16,-5.189086e-17,-2.650493e-16,8.293761000000001e-17,1.514379e-15,-9.93496e-16,4.493551e-16,-1.451408e-16,-1.595123e-16,-4.24781e-16
std,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099
min,-0.4197819,-0.4877224,-1.557842,-0.2725986,-1.465882,-3.880249,-2.335437,-1.267069,-0.9828429,-1.31399,-2.707379,-3.907193,-1.531127,-1.908226
25%,-0.4109696,-0.4877224,-0.8676906,-0.2725986,-0.9130288,-0.5686303,-0.837448,-0.8056878,-0.6379618,-0.767576,-0.4880391,0.2050715,-0.79942,-0.5994557
50%,-0.3906665,-0.4877224,-0.2110985,-0.2725986,-0.1442174,-0.1084655,0.3173816,-0.2793234,-0.5230014,-0.4646726,0.274859,0.3811865,-0.1812536,-0.1450593
75%,0.00739656,0.04877224,1.015999,-0.2725986,0.598679,0.4827678,0.9067981,0.6623709,1.661245,1.530926,0.8065758,0.433651,0.6030188,0.2685231
max,9.933931,3.804234,2.422565,3.668398,2.732346,3.555044,1.117494,3.960518,1.661245,1.798194,1.638828,0.4410519,3.548771,2.98946
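
The `std` row above reads 1.00099 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (ddof = 0), while `describe()` reports the sample standard deviation (ddof = 1), so the two differ by a factor of sqrt(n / (n - 1)). The short check below uses only numbers copied from the outputs above; the variable names are illustrative and not part of the notebook.

```python
import math

# Values copied from the df.info() and df.describe() output above.
n = 506                 # number of rows
crim_mean = 3.613524    # mean of 'crim'
crim_std = 8.601545     # sample std of 'crim' (pandas describe uses ddof=1)
crim_min = 0.00632      # minimum of 'crim'

# Ratio between the sample (ddof=1) and population (ddof=0) standard deviation.
ratio = math.sqrt(n / (n - 1))

# StandardScaler divides by the population standard deviation.
crim_pop_std = crim_std / ratio
crim_min_z = (crim_min - crim_mean) / crim_pop_std

print(round(ratio, 5))       # compare with the 'std' row of the standardized data
print(round(crim_min_z, 4))  # compare with the 'min' of 'crim' in the standardized data
```

The second printed value can be compared with the standardized minimum of `crim` in the table above, which confirms that each column was transformed as z = (x - mean) / std.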


**NOX Prediction model**  
For predicting the NOX levels, X contains the feature space and y contains the ground truth. The data is then split into X_train, X_test, y_train, and y_test. I then use LassoCV as my linear model. This model fits a Lasso regression along a regularization path, that is, over a sequence of candidate penalty strengths. Since it is a cross-validation estimator, it uses cross-validation to select the best penalty strength.  
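
To make the idea of a regularization path concrete, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` and the seed are assumptions for illustration, not the Boston data). After fitting, `alphas_` holds the candidate penalties that were tried, `mse_path_` holds the cross-validated error for each penalty and fold, and `alpha_` is the penalty that was selected.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: only the first two of five features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation over an automatically generated grid of penalties.
model = LassoCV(cv=5).fit(X_demo, y_demo)

print(model.alphas_.shape)    # candidate penalties along the path
print(model.mse_path_.shape)  # (number of penalties, number of folds)
print(model.alpha_)           # penalty chosen by cross-validation
print(model.coef_)            # coefficients refitted with the chosen penalty
```

Coefficients of features that do not help the prediction tend to be shrunk toward zero, which is why Lasso is often used for feature selection as well as regression.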

In [9]:
from sklearn.model_selection import train_test_split

# identify the feature space and ground truth. 
X = standardized_df.loc[:, standardized_df.columns != 'nox']
y = standardized_df['nox'] 

# split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
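
Because `train_test_split` shuffles the rows randomly, the scores reported in this notebook will vary from run to run. One common way to make a split repeatable is to pass a fixed `random_state`; the sketch below uses an arbitrary seed of 0 and a stand-in index array rather than the actual dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(506)  # stand-in for the 506 rows of the dataframe

# random_state=0 is an arbitrary seed chosen for this sketch.
train_a, test_a = train_test_split(rows, test_size=0.3, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.3, random_state=0)

print(len(train_a), len(test_a))       # sizes of the two partitions
print(np.array_equal(test_a, test_b))  # the same seed gives the same split
```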

In [10]:
# build cross-validation estimator.  
from sklearn.linear_model import LassoCV

# configure the estimator to use 5-fold cross-validation.
lasso_clf = LassoCV(cv=5)
lasso_clf.fit(X_train, y_train) 

# because this is a regression model, score() returns the coefficient of determination (R^2), not an accuracy.
# save the test-set predictions to y_pred for later inspection.
y_pred = lasso_clf.predict(X_test)
print('Here is the R^2 score of the model on predicting NOX levels.')
lasso_clf.score(X_test, y_test)

Here is the R^2 score of the model on predicting NOX levels.


0.8174685292020016
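
The number returned by `score` for a regressor such as `LassoCV` is the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal pure-Python sketch of that formula follows; the helper name `r_squared` and the four sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up example values, not taken from the Boston data.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

print(r_squared(y_true_demo, y_pred_demo))
```

A value of 1 means perfect predictions, 0 means the model does no better than predicting the mean of `y_test`, and negative values are possible for models that do worse than that.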

**Median Home Value Prediction Model**  
For predicting the median home value, I also implemented a LassoCV model. 

In [15]:
# adjust the feature space and the labels. 
X = standardized_df.loc[:, standardized_df.columns != 'medv'] 
y = standardized_df['medv'] 

# split the data. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4) 

In [19]:
# build another cross-validation estimator. 
lasso_clf2 = LassoCV(cv=3) 
lasso_clf2.fit(X_train, y_train) 

# because this is a regression model, score() returns the coefficient of determination (R^2), not an accuracy.
# save the test-set predictions to y_pred for later inspection.
y_pred = lasso_clf2.predict(X_test)

# show the R^2 score.
print("R^2 score for predicting the median home value: ")
lasso_clf2.score(X_test, y_test) 

R^2 score for predicting the median home value: 


0.719903598725705
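
In the cells above, the scaler is fitted on the full data set, targets included, before the train/test split, so statistics from the test rows influence the training features. One possible alternative, sketched here on synthetic data rather than the Boston file, is to put the scaler and the model in a single pipeline so that scaling is learned from the training rows only. The data, the seed, and all names ending in `_demo` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); not the Boston file.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler inside the pipeline is fitted on the training rows only;
# the target is left in its original units.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipe.fit(X_tr, y_tr)

print(pipe.score(X_te, y_te))  # R^2 on the held-out rows
```

With this layout, `pipe.predict` returns values in the target's original units, which may be easier to interpret than predictions of a standardized `medv`.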