# Regression

This notebook uses data on house sales in King County, WA, between May 2014 and May 2015. Each row in the data set pertains to one house, and there are 21,613 houses in total. The task is to predict the sale price of a house (i.e., the `price` column) from the characteristics of the house. Such predictions can be helpful for buyers, sellers, realtors, and lenders.

## Goal

Use the **kc_house_data.csv** data set and build a model to predict **price**. <br>

# Read and Prepare the Data

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

# Get the data

In [2]:
#We will predict the "price" value in the data set:
kcHouse = pd.read_csv("kc_house_data.csv")
kcHouse.head()

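| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |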


# Split data (train/test)

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(kcHouse, test_size=0.3)

# Data Prep

Perform your data prep here. You can use pipelines like we do in the tutorials. Otherwise, feel free to use your own data prep steps. Eventually, you should do the following at a minimum:<br>
- Separate inputs from target<br>
- Impute/remove missing values<br>
- Standardize the continuous variables<br>
- One-hot encode categorical variables<br>

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

### Check for missing values

In [5]:
train.isna().sum()

price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [6]:
test.isna().sum()

price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

Neither the train set nor the test set has missing values, so no imputation is needed.
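Had either check reported missing values, an imputation step could be placed ahead of the scaler and the encoder. The following is a minimal sketch under that assumption. The tiny `demo` frame, its values, and the names `numeric_with_impute`, `categorical_with_impute` and `demo_preprocessor` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with gaps (not taken from kc_house_data.csv)
demo = pd.DataFrame({
    "sqft_living": [1180.0, np.nan, 770.0, 1960.0],
    "bedrooms": [3.0, 3.0, np.nan, 4.0],
    "zipcode": [98178.0, 98125.0, np.nan, 98178.0],
})

numeric_with_impute = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),         # fill gaps with the column median
    ("scaler", StandardScaler()),
])

categorical_with_impute = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill gaps with the most common value
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

demo_preprocessor = ColumnTransformer([
    ("num", numeric_with_impute, ["sqft_living", "bedrooms"]),
    ("cat", categorical_with_impute, ["zipcode"]),
])

demo_x = demo_preprocessor.fit_transform(demo)
# The result may be sparse or dense depending on its density
demo_x = demo_x.toarray() if hasattr(demo_x, "toarray") else np.asarray(demo_x)
print(demo_x.shape)
```

The imputers are fitted on the training data only and then reused by `transform` on the test data, in the same way as the scaler and encoder in the pipelines below.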

### Separate the target variable

In [7]:
train_y = train[['price']]
test_y = test[['price']]

train_inputs = train.drop(['price'], axis=1)
test_inputs = test.drop(['price'], axis=1)

### Identify numerical/categorical columns

In [8]:
train_inputs.dtypes

bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

In [9]:
# Identify the numerical columns
numeric_columns = train_inputs.drop(['zipcode'], axis=1).columns
numeric_columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [10]:
# Identify the categorical columns
categorical_columns = train_inputs[['zipcode']].columns
categorical_columns

Index(['zipcode'], dtype='object')

### Pipelines

In [11]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

In [12]:
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [13]:
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_columns),
    ('cat', categorical_transformer, categorical_columns)])

# Transform Train and Test

In [14]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)
train_x

<15129x87 sparse matrix of type '<class 'numpy.float64'>'
	with 272322 stored elements in Compressed Sparse Row format>

In [15]:
train_x.shape

(15129, 87)

In [16]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)
test_x

<6484x87 sparse matrix of type '<class 'numpy.float64'>'
	with 116712 stored elements in Compressed Sparse Row format>

In [17]:
test_x.shape

(6484, 87)

# Determine Baseline 

In [18]:
#Find the average price
mean_value = np.mean(train_y['price'])
mean_value

537729.2636658074

In [19]:
# Predict all values as the mean
baseline_pred = np.repeat(mean_value, len(test_y))
baseline_pred

array([537729.26366581, 537729.26366581, 537729.26366581, ...,
       537729.26366581, 537729.26366581, 537729.26366581])

In [20]:
from sklearn.metrics import mean_squared_error

baseline_mse = mean_squared_error(test_y, baseline_pred)
baseline_rmse = np.sqrt(baseline_mse)
print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 380289.487466917


The baseline prediction has an RMSE of about $380k.
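The same mean-prediction baseline can also be produced with scikit-learn's `DummyRegressor`. This is a minimal sketch on made-up prices. The `demo_*` arrays are illustrative stand-ins for `train_y` and `test_y`, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical prices standing in for train_y / test_y
demo_train_y = np.array([221900.0, 538000.0, 180000.0, 604000.0])
demo_test_y = np.array([510000.0, 300000.0])

# DummyRegressor ignores the inputs, so a placeholder feature column is enough
demo_train_x = np.zeros((len(demo_train_y), 1))
demo_test_x = np.zeros((len(demo_test_y), 1))

# strategy="mean" always predicts the mean of the training target
dummy = DummyRegressor(strategy="mean").fit(demo_train_x, demo_train_y)
dummy_pred = dummy.predict(demo_test_x)

dummy_rmse = np.sqrt(mean_squared_error(demo_test_y, dummy_pred))
print("Dummy baseline RMSE: {}".format(dummy_rmse))
```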

# Train a SGD model (with no regularization)

In [33]:
from sklearn.linear_model import SGDRegressor 

# tol = stopping criterion (minimum improvement in loss between epochs)
# eta0 = initial learning rate
# penalty = regularization term
# max_iter = maximum number of passes over the training data (i.e., epochs)

sgd_reg = SGDRegressor(max_iter=100, penalty=None, eta0=0.1, tol=0.0001)

# Pass the target as a 1-D array to avoid a DataConversionWarning
sgd_reg.fit(train_x, train_y.values.ravel())

sgd_reg.n_iter_


26

### Generate the error metrics

In [34]:
#Train RMSE
reg_train_pred = sgd_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 183742.10797671514


In [35]:
#Test RMSE
reg_test_pred = sgd_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 192628.45935835893
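
The train/test RMSE cells are repeated for every model in this notebook. One way to cut the repetition is a small helper. `report_rmse` is a hypothetical name introduced here, not part of the original notebook, and the synthetic data from `make_regression` only stands in for the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def report_rmse(model, train_x, train_y, test_x, test_y):
    """Return (train_rmse, test_rmse) for an already fitted model."""
    train_rmse = np.sqrt(mean_squared_error(train_y, model.predict(train_x)))
    test_rmse = np.sqrt(mean_squared_error(test_y, model.predict(test_x)))
    return train_rmse, test_rmse


# Synthetic stand-in data (assumption: not the King County data)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

demo_model = SGDRegressor(max_iter=1000, penalty=None, eta0=0.01, tol=1e-4, random_state=42)
demo_model.fit(X_train, y_train)

demo_train_rmse, demo_test_rmse = report_rmse(demo_model, X_train, y_train, X_test, y_test)
print("Train RMSE: {}".format(demo_train_rmse))
print("Test RMSE: {}".format(demo_test_rmse))
```

Setting `random_state` on the model, as in this sketch, also makes the reported numbers repeatable from run to run.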


# Try L1 Regularization in SGD

In [54]:
sgd_reg_L1 = SGDRegressor(max_iter=100, penalty='l1', alpha=0.1, eta0=0.01, tol=0.0001)

# Pass the target as a 1-D array to avoid a DataConversionWarning
sgd_reg_L1.fit(train_x, train_y.values.ravel())


SGDRegressor(alpha=0.1, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=100,
             n_iter_no_change=5, penalty='l1', power_t=0.25, random_state=None,
             shuffle=True, tol=0.0001, validation_fraction=0.1, verbose=0,
             warm_start=False)

### Generate the error metrics

In [55]:
#Train RMSE
reg_train_pred = sgd_reg_L1.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 161212.18364478918


In [56]:
#Test RMSE
reg_test_pred = sgd_reg_L1.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 170681.9500492998


# Try L2 Regularization in SGD

In [177]:
sgd_reg_L2 = SGDRegressor(max_iter=100, penalty='l2', alpha=0.1, eta0=0.01, tol=0.0001)

# Pass the target as a 1-D array to avoid a DataConversionWarning
sgd_reg_L2.fit(train_x, train_y.values.ravel())


SGDRegressor(alpha=0.1, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=100,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.0001, validation_fraction=0.1, verbose=0,
             warm_start=False)

### Generate the error metrics

In [178]:
#Train RMSE
reg_train_pred = sgd_reg_L2.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 194568.71855342688


In [179]:
#Test RMSE
reg_test_pred = sgd_reg_L2.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 205261.76972485872


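With L2 regularization, both the train and test RMSE are worse than for either of the first two models (a test RMSE of about 205,000, versus about 193,000 without regularization and about 171,000 with L1). The gap between the train and test RMSE is also similar (about 11,000), so the model has a similar overfitting problem.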

# Try ElasticNet in SGD

In [180]:
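# Elastic net is selected through the penalty argument of SGDRegressor,
# so no separate ElasticNet import is needed
sgd_reg_elastic = SGDRegressor(max_iter=100, penalty='elasticnet', l1_ratio=0.5, alpha=0.1,
                               eta0=0.01, tol=0.0001)

# Pass the target as a 1-D array to avoid a DataConversionWarning
sgd_reg_elastic.fit(train_x, train_y.values.ravel())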


SGDRegressor(alpha=0.1, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.5,
             learning_rate='invscaling', loss='squared_loss', max_iter=100,
             n_iter_no_change=5, penalty='elasticnet', power_t=0.25,
             random_state=None, shuffle=True, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

### Generate the error metrics

In [181]:
#Train RMSE
reg_train_pred = sgd_reg_elastic.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 188974.33322669254


In [182]:
#Test RMSE
reg_test_pred = sgd_reg_elastic.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 199777.6446780081


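The RMSE of the Elastic Net model (a test RMSE of about 200,000) falls between those of the L1 and L2 models, which is expected because the elastic net penalty is a mix of the two. It performs worse than the non-regularized and L1 models, and its gap between the train and test RMSE (about 11,000) is slightly larger, which indicates somewhat more overfitting.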
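
The non-regularized model above uses `eta0=0.1` while the penalized models use `eta0=0.01`, and none of them sets `random_state`, so part of the differences may come from the optimizer settings rather than from the penalty. A possible way to compare the penalties on equal footing is sketched below. It runs on synthetic data from `make_regression` (an assumption, not the housing data), and the `results` dictionary is a name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the King County data)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5,
                       noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

results = {}
for penalty in [None, "l1", "l2", "elasticnet"]:
    # Same learning rate, tolerance and seed for every penalty,
    # so only the regularization term differs between runs
    model = SGDRegressor(max_iter=1000, penalty=penalty, alpha=0.1, l1_ratio=0.5,
                         eta0=0.01, tol=1e-4, random_state=42)
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    results[str(penalty)] = (train_rmse, test_rmse)
    print("{:>10}  train RMSE: {:.2f}  test RMSE: {:.2f}".format(str(penalty), train_rmse, test_rmse))
```

The ranking of the penalties on this synthetic data need not match the ranking on the housing data.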

# Try Polynomial Features

Try fitting a model with degree = 3 or 4 (or higher if your computer can handle it). Note that this might overfit the model.

In [69]:
from sklearn.preprocessing import PolynomialFeatures

# Create third degree terms and interaction terms
poly_features = PolynomialFeatures(degree=3).fit(train_x)

train_x_poly = poly_features.transform(train_x)

test_x_poly = poly_features.transform(test_x)

In [70]:
#Create poly model
pol_sgd_reg = SGDRegressor(max_iter=100, penalty=None, eta0=0.001, tol=0.0001)

# Pass the target as a 1-D array to avoid a DataConversionWarning
pol_sgd_reg.fit(train_x_poly, train_y.values.ravel())


SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.001, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=100,
             n_iter_no_change=5, penalty=None, power_t=0.25, random_state=None,
             shuffle=True, tol=0.0001, validation_fraction=0.1, verbose=0,
             warm_start=False)

### Generate the error metrics

In [71]:
#Train RMSE
reg_train_pred = pol_sgd_reg.predict(train_x_poly)

train_mse = mean_squared_error(train_y, reg_train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 180771071326576.0


In [72]:
#Test RMSE
reg_test_pred = pol_sgd_reg.predict(test_x_poly)

test_mse = mean_squared_error(test_y, reg_test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 164402669807286.72


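The extremely high RMSE (on the order of 10^14, far above the baseline RMSE of about 380,000) shows that this third-degree polynomial model is not appropriate for predicting price. Because the train RMSE is just as large as the test RMSE, the result suggests that the SGD updates diverged on the expanded features rather than that the model simply overfit.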
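
One common reason for an SGD run to blow up like this is that the expanded polynomial terms end up on very different scales from the standardized inputs. A hedged sketch of one possible remedy, rescaling after the polynomial expansion, is shown below on synthetic data. Whether it rescues the degree-3 model on the housing data is not verified here, and the data-generating formula and the name `poly_sgd` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a nonlinear target (assumption: not the housing data)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 - X[:, 0] * X[:, 2] + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

poly_sgd = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),  # expand to cubic and interaction terms
    StandardScaler(),                                  # put the expanded terms back on one scale
    SGDRegressor(max_iter=1000, penalty=None, eta0=0.001, tol=1e-4, random_state=42),
)
poly_sgd.fit(X_train, y_train)

poly_train_rmse = np.sqrt(mean_squared_error(y_train, poly_sgd.predict(X_train)))
poly_test_rmse = np.sqrt(mean_squared_error(y_test, poly_sgd.predict(X_test)))
print("Train RMSE: {}".format(poly_train_rmse))
print("Test RMSE: {}".format(poly_test_rmse))
```

For a sparse matrix such as `train_x_poly`, `StandardScaler(with_mean=False)` would be needed instead, because centering a sparse matrix is not supported.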

# Try Regularization again

Use one of the regularization techniques on the polynomial model.

In [200]:
#Create poly model with L1 regularization
pol_sgd_reg_L1 = SGDRegressor(max_iter=100, penalty='l1', alpha=0.1, eta0=0.001, tol=0.0001)

# Pass the target as a 1-D array to avoid a DataConversionWarning
pol_sgd_reg_L1.fit(train_x_poly, train_y.values.ravel())


SGDRegressor(alpha=0.1, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.001, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=100,
             n_iter_no_change=5, penalty='l1', power_t=0.25, random_state=None,
             shuffle=True, tol=0.0001, validation_fraction=0.1, verbose=0,
             warm_start=False)

### Generate the error metrics

In [201]:
#Train RMSE
reg_train_pred = pol_sgd_reg_L1.predict(train_x_poly)

train_mse = mean_squared_error(train_y, reg_train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 35721600636255.67


In [202]:
#Test RMSE
reg_test_pred = pol_sgd_reg_L1.predict(test_x_poly)

test_mse = mean_squared_error(test_y, reg_test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 34725587123283.133


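While L1 regularization noticeably reduced both the RMSE (from about 1.8 × 10^14 to about 3.6 × 10^13 on the train set) and the difference between the train and test RMSE, the errors are still many orders of magnitude above the baseline, so it is still obviously not a model design worth considering.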