In this project, I fit regression models to data from the Boston Housing Study. The analysis uses all 506 census tract observations and every explanatory variable except neighborhood. The response variable is the median value of homes in thousands of 1970 dollars. The two regression modeling methods used here are:

1. Linear regression
2. Ridge regression

These methods are evaluated within a cross-validation design, using root-mean-squared error (RMSE) as an index of prediction error.
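
As a minimal sketch of this error measure, RMSE is the square root of the mean squared difference between observed and predicted values. The helper `rmse` and the toy numbers below are illustrative assumptions and not part of the study code, which computes the same quantity with `sqrt(mean_squared_error(...))`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values made up for illustration (not taken from the Boston data)
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the mean squared error is 5/3
toy_rmse = rmse(observed, predicted)
print('Toy RMSE:', toy_rmse)
```

A lower RMSE on held-out folds means the predictions are, on average, closer to the observed values.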

After analyzing the results from the two methods, I suggest using Ridge regression rather than Linear regression for assessing the market value of residential real estate. In the cross-validation design, Ridge regression achieves a slightly lower RMSE and a slightly higher R-squared than Linear regression, which indicates that it predicts held-out data slightly better.

### Data Preparation

In [1]:
# Setting the seed to obtain reproducible results

RANDOM_SEED = 1

# Fitting the intercept term in the models

SET_FIT_INTERCEPT = True

# Importing base packages

import numpy as np
import pandas as pd

# Importing Modeling routines from Scikit Learn packages

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

In [2]:
# Reading data for the Boston Housing Study

boston_input = pd.read_csv('boston.csv')

# Displaying the attribute list of the boston_input DataFrame
# to check the attribute types and whether any field
# contains null data

print('\nGeneral description of the boston_input DataFrame:')
print(boston_input.info())

# Dropping neighborhood attribute from the data being considered

boston = boston_input.drop(columns='neighborhood')

# Displaying descriptive stats of Boston DataFrame

print('\nDescriptive statistics of the boston DataFrame:')
print(boston.describe())


General description of the boston_input DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
neighborhood    506 non-null object
crim            506 non-null float64
zn              506 non-null float64
indus           506 non-null float64
chas            506 non-null int64
nox             506 non-null float64
rooms           506 non-null float64
age             506 non-null float64
dis             506 non-null float64
rad             506 non-null int64
tax             506 non-null int64
ptratio         506 non-null float64
lstat           506 non-null float64
mv              506 non-null float64
dtypes: float64(10), int64(3), object(1)
memory usage: 55.4+ KB
None

Descriptive statistics of the boston DataFrame:
             crim          zn       indus        chas         nox       rooms  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.613524   11.363636   11.136779    0.06917

In [3]:
# Setting up the preliminary data model 
# the first column is the median housing value response
# the remaining columns are the explanatory variables

prelim_model_data = np.array([boston.mv,
    boston.crim,
    boston.zn,
    boston.indus,
    boston.chas,
    boston.nox,
    boston.rooms,
    boston.age,
    boston.dis,
    boston.rad,
    boston.tax,
    boston.ptratio,
    boston.lstat]).T

# Looking at the data before standardization

print('\nData dimensions:', prelim_model_data.shape)


Data dimensions: (506, 13)


In [4]:
# Importing StandardScaler to standardize the model data

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# The model data will be the standardized form of the preliminary model data

model_data = scaler.fit_transform(prelim_model_data)

# Looking at the data after standardization

print('\nDimensions for model_data:', model_data.shape)


Dimensions for model_data: (506, 13)
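
Because the response column is standardized together with the explanatory variables, any RMSE computed on `model_data` is expressed in standard deviation units rather than in thousands of dollars. The following is a hedged sketch of how such a value could be converted back. It uses synthetic data, and the names `toy_data`, `toy_scaler` and `standardized_rmse` are hypothetical stand-ins, not values from this study.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for prelim_model_data: response in column 0
rng = np.random.default_rng(1)
toy_data = np.column_stack([
    rng.normal(22.5, 9.0, size=200),  # pretend response, thousands of dollars
    rng.normal(6.3, 0.7, size=200),   # pretend explanatory variable
])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_data)

# After scaling, each column has mean ~0 and standard deviation ~1
print('Column means:', toy_scaled.mean(axis=0).round(6))
print('Column standard deviations:', toy_scaled.std(axis=0).round(6))

# scale_[0] holds the standard deviation of the response column, so a
# standardized RMSE is converted to original units by multiplication
standardized_rmse = 0.3  # illustrative value, not a result from this study
rmse_in_original_units = standardized_rmse * toy_scaler.scale_[0]
print('RMSE in original units:', rmse_in_original_units)
```

Under these assumptions, the same multiplication with `scaler.scale_[0]` would express the cross-validated RMSE in thousands of 1970 dollars.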


### Linear Regression and Ridge Regression in a K-Fold Cross-Validation Design, Evaluated using RMSE

In [19]:
# Setting names list containing names of the regression models
# Setting the regressors list specifying the regression models
# In Ridge regression, we compare several alpha values: 0.1, 1, 2, 5 and 500

names = ['Linear_Regression', 'Ridge_Regression alpha=0.1',
         'Ridge_Regression alpha=1',
         'Ridge_Regression alpha=2', 'Ridge_Regression alpha=5' , 
         'Ridge_Regression alpha=500']

regressors = [LinearRegression(fit_intercept = SET_FIT_INTERCEPT), 
              Ridge(alpha = 0.1, solver = 'cholesky', 
                    fit_intercept = SET_FIT_INTERCEPT, 
                    random_state = RANDOM_SEED), 
              Ridge(alpha = 1, solver = 'cholesky', 
                    fit_intercept = SET_FIT_INTERCEPT, 
                    random_state = RANDOM_SEED),
              Ridge(alpha = 2, solver = 'cholesky', 
                    fit_intercept = SET_FIT_INTERCEPT, 
                    random_state = RANDOM_SEED), 
              Ridge(alpha = 5, solver = 'cholesky', 
                    fit_intercept = SET_FIT_INTERCEPT, 
                    random_state = RANDOM_SEED), 
              Ridge(alpha = 500, solver = 'cholesky', 
                    fit_intercept = SET_FIT_INTERCEPT, 
                    random_state = RANDOM_SEED)]

In [20]:
# Importing KFold to split the data into cross-validation folds

from sklearn.model_selection import KFold

# Specifying the folds for validation - 10-fold cross-validation

N_FOLDS = 10

# Setting up numpy array for storing results

cv_results = np.zeros((N_FOLDS, len(names)))

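# Setting the KFold object using the number of folds
# With shuffle=False the folds are contiguous blocks of rows, so the
# split is deterministic and no random seed is passed (scikit-learn
# raises an error if random_state is set while shuffle is False)

kf = KFold(n_splits = N_FOLDS, shuffle = False)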

# Checking the splitting process by looking at fold observation counts
# Initializing the fold count and looping through the splits of model data
# The variable index_for_fold will be used as the results array index

index_for_fold = 0 

# This loop will run 10 times, once for each fold

for train_index, test_index in kf.split(model_data):
    print('\nFold index:', index_for_fold,
          '------------------------------------------')
        
    # 1:model_data.shape[1] slices the explanatory variables
    # and 0 is the index for the response variable
    # Storing the column count in a variable to avoid recalculation

    model_data_exp = model_data.shape[1]

    # Generating Train and Test datasets for the current fold

    X_train = model_data[train_index, 1:model_data_exp]
    X_test = model_data[test_index, 1:model_data_exp]
    y_train = model_data[train_index, 0]
    y_test = model_data[test_index, 0]

    # Initializing the method count and looping through the methods
    # (Linear Regression and Ridge Regression at each alpha value)
    # This loop will run once per model in the names list, per fold
    # The variable index_for_method will be used as results array index
    
    index_for_method = 0
    for name, reg_model in zip(names, regressors):
        print('\nRegression model evaluation for:', name)
        reg_model.fit(X_train, y_train)
        
        # evaluate on the test set for this fold
        y_test_predict = reg_model.predict(X_test)
        print('Coefficient of determination (R-squared):',
              r2_score(y_test, y_test_predict))
        fold_method_result = sqrt(mean_squared_error(y_test, y_test_predict))
        
        print('Root mean-squared error:', fold_method_result)
        cv_results[index_for_fold, index_for_method] = fold_method_result
        index_for_method += 1
  
    index_for_fold += 1

cv_results_df = pd.DataFrame(cv_results)
cv_results_df.columns = names

print('\n----------------------------------------------')
print('Average results from ', N_FOLDS, '-fold cross-validation\n',
      'in standardized units (mean 0, standard deviation 1)\n',
      '\nMethod               Root mean-squared error', sep = '')     
print(cv_results_df.mean())  


Fold index: 0 ------------------------------------------

Regression model evaluation for: Linear_Regression
Coefficient of determination (R-squared): 0.765075234937901
Root mean-squared error: 0.3080338355432655

Regression model evaluation for: Ridge_Regression alpha=0.1
Coefficient of determination (R-squared): 0.7652758652644297
Root mean-squared error: 0.3079022739996639

Regression model evaluation for: Ridge_Regression alpha=1
Coefficient of determination (R-squared): 0.7670378077121718
Root mean-squared error: 0.30674447231421376

Regression model evaluation for: Ridge_Regression alpha=2
Coefficient of determination (R-squared): 0.7689061806957396
Root mean-squared error: 0.3055119400905162

Regression model evaluation for: Ridge_Regression alpha=5
Coefficient of determination (R-squared): 0.7739891277941462
Root mean-squared error: 0.30213336567902116

Regression model evaluation for: Ridge_Regression alpha=500
Coefficient of determination (R-squared): 0.5543979021412023
Root

The mean RMSE of Ridge Regression is slightly lower than that of Linear Regression, and its R-squared values are slightly higher. This indicates that Ridge Regression does a slightly better job of predicting the response variable (the median value of homes) from the explanatory variables on held-out data. Within Ridge Regression, raising alpha from 0.1 to 5 increases R-squared and decreases RMSE, which suggests that the added regularization reduces overfitting. At a very high alpha (500), however, RMSE increases and R-squared drops, which indicates underfitting: the penalty shrinks the coefficients so strongly that the model can no longer capture the structure in the data.
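
One way to see why a very large alpha can underfit is to look at how the ridge penalty shrinks the fitted coefficients. The sketch below uses synthetic data rather than the Boston data. The names `X_toy`, `y_toy` and `true_coef` are hypothetical, and only the alpha grid is taken from the analysis above.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (an assumption for illustration only)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=100)

# Same alpha grid as the Ridge models compared in this analysis
alphas = [0.1, 1, 2, 5, 500]
coef_norms = []

for alpha in alphas:
    model = Ridge(alpha=alpha, solver='cholesky').fit(X_toy, y_toy)
    # The L2 norm of the coefficient vector measures overall shrinkage
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print('alpha =', alpha, ' coefficient norm =', coef_norms[-1])
```

The coefficient norm falls as alpha grows. A moderate penalty can stabilize the estimates, while a very large one pulls every coefficient toward zero regardless of how informative the variable is.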