[View in Colaboratory](https://colab.research.google.com/github/bluebottle66/Practical-Machine-Learning-Northwestern-/blob/master/Predict422Week4_KunYang.ipynb)

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn.linear_model
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt # for root mean-squared error calculation

from sklearn.tree import DecisionTreeRegressor  # machine learning tree
from sklearn.ensemble import RandomForestRegressor # ensemble method

RANDOM_SEED = 1
SET_FIT_INTERCEPT = True

In [0]:
url="https://raw.githubusercontent.com/bluebottle66/Practical-Machine-Learning-Northwestern-/master/boston.csv"
boston_input = pd.read_csv(url)

**Read and Process data**

- Load the CSV file from a GitHub raw link
- Look at the data statistics, drop missing values, and drop the `neighborhood` column
- Create a new response variable: the log of the median home value (in thousands of 1970 dollars)
- Assemble the preliminary dataset
- Standardize the data with `StandardScaler`

In [3]:
boston = boston_input.drop('neighborhood', axis=1)

# dropna returns a new DataFrame, so assign the result
boston = boston.dropna()

#create log transformation for response variable
boston['logMv']=np.log(boston['mv'])

prelim_model_data = np.array([boston.mv,\
boston.logMv,\
boston.crim,\
boston.zn,\
boston.indus,\
boston.chas,\
boston.nox,\
boston.rooms,\
boston.age,\
boston.dis,\
boston.rad,\
boston.tax,\
boston.ptratio,\
boston.lstat]).T

# dimensions of the preliminary data: two response columns (mv, logMv)
# plus 12 predictors, before standardization
print('\nData dimensions:', prelim_model_data.shape)


Data dimensions: (506, 14)


In [4]:
#Data transformation
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
print(scaler.fit(prelim_model_data))

# show standardization constants being employed
print(scaler.mean_)
print(scaler.scale_)

# the model data will be standardized form of preliminary model data
model_data = scaler.fit_transform(prelim_model_data)

# dimensions of the standardized model data
# (all columns, including both responses, are in standardized units)
print('\nDimensions for model_data:', model_data.shape)

StandardScaler(copy=True, with_mean=True, with_std=True)
[2.25288538e+01 3.03455800e+00 3.61352356e+00 1.13636364e+01
 1.11367787e+01 6.91699605e-02 5.54695059e-01 6.28463439e+00
 6.85749012e+01 3.79504269e+00 9.54940711e+00 4.08237154e+02
 1.84555336e+01 1.26530632e+01]
[9.17309810e+00 4.07871084e-01 8.59304135e+00 2.32993957e+01
 6.85357058e+00 2.53742935e-01 1.15763115e-01 7.01922514e-01
 2.81210326e+01 2.10362836e+00 8.69865112e+00 1.68370495e+02
 2.16280519e+00 7.13400164e+00]

Dimensions for model_data: (506, 14)
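
The constants printed above define the transform z = (x - mean) / scale for each column. As a minimal sketch in plain Python, the snippet below applies that formula and its inverse to a single value, using the first two constants from the output (column 0 is `mv`, column 1 is `logMv`). The helper names and the example home value of 24.0 are my own assumptions for illustration, not part of the original notebook.

```python
# Standardization constants copied from the scaler output above
# (column 0 is mv, column 1 is logMv).
MV_MEAN, MV_SCALE = 22.5288538, 9.17309810
LOGMV_MEAN, LOGMV_SCALE = 3.03455800, 0.407871084


def standardize(x, mean, scale):
    """Map a raw value to standardized units: z = (x - mean) / scale."""
    return (x - mean) / scale


def unstandardize(z, mean, scale):
    """Invert the transform: x = z * scale + mean."""
    return z * scale + mean


mv_example = 24.0  # hypothetical median home value, in thousands of dollars
z = standardize(mv_example, MV_MEAN, MV_SCALE)
back = unstandardize(z, MV_MEAN, MV_SCALE)
print('standardized:', z)
print('recovered:', back)
```

The same inverse applies to predictions of the standardized `logMv` response, which is useful when results need to be reported in log units again.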


**Model Build up and Data Analysis:**

- Look at the correlation matrix to analyze the factors that affect home value
- Build four regression models (linear, ridge, lasso, and elastic net) on the log-transformed home values (from the previous week)
- Build a Decision Tree Regressor model
- Build a Random Forest model
- Evaluate all models with cross-validation

In [5]:
corr=boston.corr()
print(corr[(corr>0.5)|(corr<-0.5)])

             crim        zn     indus  chas       nox     rooms       age  \
crim     1.000000       NaN       NaN   NaN       NaN       NaN       NaN   
zn            NaN  1.000000 -0.533828   NaN -0.516604       NaN -0.569537   
indus         NaN -0.533828  1.000000   NaN  0.763651       NaN  0.644779   
chas          NaN       NaN       NaN   1.0       NaN       NaN       NaN   
nox           NaN -0.516604  0.763651   NaN  1.000000       NaN  0.731470   
rooms         NaN       NaN       NaN   NaN       NaN  1.000000       NaN   
age           NaN -0.569537  0.644779   NaN  0.731470       NaN  1.000000   
dis           NaN  0.664408 -0.708027   NaN -0.769230       NaN -0.747881   
rad      0.625505       NaN  0.595129   NaN  0.611441       NaN       NaN   
tax      0.582764       NaN  0.720760   NaN  0.668023       NaN  0.506456   
ptratio       NaN       NaN       NaN   NaN       NaN       NaN       NaN   
lstat         NaN       NaN  0.603800   NaN  0.590879 -0.613808  0.602339   

In [0]:
#Create list of regression models and all factors in our regression model

names = ['Linear_Regression', 'Ridge_Regression','Lasso_Regression','Elastic_Net_Regression','Decision_Tree','Random_Forest']
variables=['crim','zn','indus','chas','nox','rooms','age','dis','rad','tax','ptratio','lstat']

In [0]:
regressors = [LinearRegression(fit_intercept = SET_FIT_INTERCEPT),
              Ridge(alpha = 1, solver = 'cholesky',
                fit_intercept = SET_FIT_INTERCEPT,
                random_state = RANDOM_SEED),
              Lasso(alpha = 0.1, max_iter=10000, tol=0.01,
                fit_intercept = SET_FIT_INTERCEPT,
                random_state = RANDOM_SEED),
              ElasticNet(alpha = 0.1, l1_ratio = 0.5,
                max_iter=10000, tol=0.01,
                fit_intercept = SET_FIT_INTERCEPT,
                random_state = RANDOM_SEED),
              DecisionTreeRegressor(random_state = RANDOM_SEED, 
                max_features="log2"),
              RandomForestRegressor(random_state = RANDOM_SEED, 
                max_features="log2",
                n_estimators=100,
                bootstrap=True)
             ]

In [0]:
N_FOLDS = 10
N_Variables=12

# set up numpy array for storing results

cv_results = np.zeros((N_FOLDS, len(names)))
rsquare_results = np.zeros((N_FOLDS, len(names)))
importance_results=np.zeros((N_FOLDS,N_Variables,2))

In [0]:
from sklearn.model_selection import KFold
# random_state only applies when shuffle=True, so it is omitted here
kf = KFold(n_splits = N_FOLDS, shuffle=False)
# check the splitting process by looking at fold observation counts
index_for_fold = 0 # fold count initialized
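
The fold observation counts can be worked out ahead of time. The sketch below is plain Python, not a call into scikit-learn: it follows the sizing rule documented for `KFold` without shuffling, where the first `n_samples % n_splits` folds receive one extra observation. The helper name `kfold_sizes` is hypothetical.

```python
def kfold_sizes(n_samples, n_splits):
    """Test-fold sizes for an unshuffled K-fold split.

    The first n_samples % n_splits folds get one extra observation.
    """
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 506  # rows in model_data
sizes = kfold_sizes(n_samples, 10)
for fold, test_size in enumerate(sizes):
    print('fold', fold, 'train:', n_samples - test_size, 'test:', test_size)
```

For 506 rows and 10 folds this gives six folds with 51 test observations and four with 50, so the per-fold error estimates rest on roughly equal sample sizes.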

In [14]:
for train_index, test_index in kf.split(model_data):
  
  print('\nFold index:', index_for_fold,
  '------------------------------------------')

  X_train = model_data[train_index, 2:model_data.shape[1]]
  X_test = model_data[test_index, 2:model_data.shape[1]]
  y_train = model_data[train_index, 1]
  y_test = model_data[test_index, 1]
  # note: to use the raw response variable instead, use model_data[train_index, 0] here

  print('\nShape of input data for this fold:',
        '\nData Set: (Observations, Variables)')
  print('X_train:', X_train.shape)
  print('X_test:',X_test.shape)
  print('y_train:', y_train.shape)
  print('y_test:',y_test.shape)

  index_for_method = 0 # initialize

  for name, reg_model in zip(names, regressors):
    
      print('\nRegression model evaluation for:', name)
      print(' Scikit Learn method:', reg_model)
      reg_model.fit(X_train, y_train) # fit on the train set for this fold

      # evaluate on the test set for this fold
      y_test_predict = reg_model.predict(X_test)
      
      r2_result = r2_score(y_test, y_test_predict)

      fold_method_result = sqrt(mean_squared_error(y_test, y_test_predict))

      print(reg_model.get_params(deep=True))
      print('Root mean-squared error:', fold_method_result)
      
      rsquare_results[index_for_fold, index_for_method] =r2_result

      cv_results[index_for_fold, index_for_method] = fold_method_result
      
      if name == 'Decision_Tree':
        importance_results[index_for_fold,:,0] = reg_model.feature_importances_[:]
        
      if name == 'Random_Forest':
        importance_results[index_for_fold,:,1] = reg_model.feature_importances_[:]
      
      index_for_method += 1

  index_for_fold += 1


Fold index: 0 ------------------------------------------

Shape of input data for this fold: 
Data Set: (Observations, Variables)
X_train: (455, 12)
X_test: (51, 12)
y_train: (455,)
y_test: (51,)

Regression model evaluation for: Linear_Regression
 Scikit Learn method: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
{'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}
Root mean-squared error: 0.2957195904826947

Regression model evaluation for: Ridge_Regression
 Scikit Learn method: Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=1, solver='cholesky', tol=0.001)
{'alpha': 1, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': False, 'random_state': 1, 'solver': 'cholesky', 'tol': 0.001}
Root mean-squared error: 0.2951112840301624

Regression model evaluation for: Lasso_Regression
 Scikit Learn method: Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=10000,
   no

In [15]:
cv_results_df = pd.DataFrame(cv_results)
cv_results_df.columns = names
print(cv_results_df.mean())

r2_results_df = pd.DataFrame(rsquare_results)
r2_results_df.columns = names
print(r2_results_df.mean())

importance_results_df = pd.DataFrame(np.mean(importance_results, axis=0))
importance_results_df.index=variables
print(importance_results_df)

Linear_Regression         0.496452
Ridge_Regression          0.495693
Lasso_Regression          0.544405
Elastic_Net_Regression    0.518855
Decision_Tree             0.626142
Random_Forest             0.444510
dtype: float64
Linear_Regression         0.374328
Ridge_Regression          0.378451
Lasso_Regression          0.339908
Elastic_Net_Regression    0.404225
Decision_Tree             0.032930
Random_Forest             0.565667
dtype: float64
                0         1
crim     0.063189  0.121718
zn       0.018727  0.006724
indus    0.037553  0.043005
chas     0.003342  0.005803
nox      0.061188  0.128171
rooms    0.099195  0.177683
age      0.025785  0.037930
dis      0.065897  0.068985
rad      0.004943  0.022942
tax      0.034062  0.048215
ptratio  0.026371  0.057643
lstat    0.559748  0.281183


**Result Analysis**

- The Random Forest model has the lowest mean cross-validated RMSE (0.4445, in standardized logMv units) and the highest mean R-squared (0.566), so it is the best of the 6 models
- The Decision Tree model performs poorly here: it has the highest RMSE (0.626) and a mean R-squared close to zero (0.033)
- Look at the feature importances of the 12 variables: the higher the value, the more important the feature. Note that I built the tree models with `max_features="log2"`, and log2(12) ≈ 3.58, so about 3 to 4 candidate features are considered at each split. This setting limits the features tried per split, not the number of features in the model, so treating the top 3 to 4 features by importance as the main drivers of home price in the Random Forest model is a rule of thumb
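
The arithmetic and the ranking behind the last bullet can be checked with a short sketch. The importances are the mean Random Forest values copied from the output above (column `1`). As far as I understand scikit-learn, `max_features="log2"` truncates log2 of the feature count to an integer, which is an assumption worth verifying against the installed version.

```python
import math

# log2 of the number of predictors, and its integer truncation
n_features = 12
log2_features = math.log2(n_features)
features_per_split = max(1, int(log2_features))
print('log2(12):', round(log2_features, 2))
print('candidate features per split:', features_per_split)

# Mean Random Forest importances across the 10 folds, copied from the output
rf_importance = {
    'crim': 0.121718, 'zn': 0.006724, 'indus': 0.043005, 'chas': 0.005803,
    'nox': 0.128171, 'rooms': 0.177683, 'age': 0.037930, 'dis': 0.068985,
    'rad': 0.022942, 'tax': 0.048215, 'ptratio': 0.057643, 'lstat': 0.281183,
}

# Rank features from most to least important
ranked = sorted(rf_importance, key=rf_importance.get, reverse=True)
top4 = ranked[:4]
top4_share = sum(rf_importance[name] for name in top4)
print('top 4 features:', top4)
print('share of total importance:', round(top4_share, 3))
```

The four leading features together account for about 71% of the total importance, which is why the analysis concentrates on them.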


> The top 4 features by Random Forest importance are:
   lstat, rooms, nox, and crim.
   That is: percentage of lower-income population, average number of rooms per home, air pollution, and crime rate.
   Not surprisingly, the correlation table shows that these 4 factors also have relatively high absolute correlations with home value.
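
One further note on the error figures: because the response was standardized along with the predictors, the RMSE values in the results table are in standardized logMv units. A rough conversion back to log units multiplies each value by the logMv scale constant printed by the scaler (0.407871084). The sketch below does this for the mean RMSEs copied from the output. Reading `exp(rmse)` as a typical multiplicative error is my own loose interpretation rather than something the notebook computes.

```python
import math

LOGMV_SCALE = 0.407871084  # standard deviation of logMv, from the scaler output

# Mean cross-validated RMSE per model, in standardized logMv units
rmse_standardized = {
    'Linear_Regression': 0.496452,
    'Ridge_Regression': 0.495693,
    'Lasso_Regression': 0.544405,
    'Elastic_Net_Regression': 0.518855,
    'Decision_Tree': 0.626142,
    'Random_Forest': 0.444510,
}

# Errors scale linearly with the response, so RMSE_log = RMSE_std * scale
rmse_log = {name: value * LOGMV_SCALE
            for name, value in rmse_standardized.items()}

best_model = min(rmse_log, key=rmse_log.get)
for name, value in rmse_log.items():
    print(name, 'RMSE in log units:', round(value, 4),
          'approx. error factor:', round(math.exp(value), 3))
print('best model:', best_model)
```

R-squared needs no such conversion, since it is unchanged when the response and the predictions are rescaled by the same constants.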
    

