# Assignment 2 - Regression - McCartney


In this assignment, we will focus on housing. The data set for this exercise contains house sales in King County, WA, between May 2014 and May 2015. Each row pertains to one house, and there are 21,613 houses in total. You will use this data set to predict the sale price of a house (i.e., the `price` column) based on the characteristics of the house. Such predictions can be helpful for buyers, sellers, realtors, and lenders.

## Description of Variables

The description and type of each variable is provided in "KC house data - Data Dictionary.docx". Make sure to read this document to learn about the variables.

## Goal

Use the **kc_house_data.csv** data set and build a model to predict **price**. <br>

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data

In [3]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(31484443)

# Get the data

In [4]:
# Import the data set:
#We will predict the "price" value in the data set:

kchouse = pd.read_csv("kc_house_data.csv")
kchouse.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,432000.0,5.0,2.75,2060.0,329903.0,1.5,0,3,5,7.0,2060,0,1989.0,0,98022.0,47.1776,-121.944,2240,220232.0
1,170000.0,2.0,1.0,810.0,8424.0,1.0,0,0,4,6.0,810,0,1959.0,0,98023.0,47.3286,-122.346,820,8424.0
2,235000.0,3.0,1.0,960.0,5030.0,1.0,0,0,3,7.0,960,0,1955.0,0,98118.0,47.5611,-122.28,1460,5400.0
3,350000.0,2.0,1.0,830.0,5100.0,1.0,0,0,4,7.0,830,0,1942.0,0,98126.0,47.5259,-122.379,1220,5100.0
4,397380.0,2.0,1.0,1030.0,5072.0,1.0,0,0,3,6.0,1030,0,1924.0,1958,98115.0,47.6962,-122.294,1220,6781.0


In [5]:
kchouse.shape

(21613, 19)

In [6]:
kchouse.isna().sum()

price            0
bedrooms         1
bathrooms        0
sqft_living      1
sqft_lot         1
floors           1
waterfront       0
view             0
condition        0
grade            1
sqft_above       0
sqft_basement    0
yr_built         1
yr_renovated     0
zipcode          2
lat              0
long             0
sqft_living15    0
sqft_lot15       1
dtype: int64

In [7]:
# Drop rows with missing values, since only a minimal number of rows are affected

kchouse = kchouse.dropna(axis=0, inplace=False)

kchouse

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,432000.0,5.0,2.75,2060.0,329903.0,1.5,0,3,5,7.0,2060,0,1989.0,0,98022.0,47.1776,-121.944,2240,220232.0
1,170000.0,2.0,1.00,810.0,8424.0,1.0,0,0,4,6.0,810,0,1959.0,0,98023.0,47.3286,-122.346,820,8424.0
2,235000.0,3.0,1.00,960.0,5030.0,1.0,0,0,3,7.0,960,0,1955.0,0,98118.0,47.5611,-122.280,1460,5400.0
3,350000.0,2.0,1.00,830.0,5100.0,1.0,0,0,4,7.0,830,0,1942.0,0,98126.0,47.5259,-122.379,1220,5100.0
4,397380.0,2.0,1.00,1030.0,5072.0,1.0,0,0,3,6.0,1030,0,1924.0,1958,98115.0,47.6962,-122.294,1220,6781.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,925500.0,3.0,2.75,1970.0,5200.0,1.5,0,3,3,8.0,1970,0,1915.0,2002,98136.0,47.5374,-122.388,2140,5200.0
21609,419900.0,5.0,3.50,2880.0,5000.0,2.0,0,0,3,8.0,2260,620,2012.0,0,98038.0,47.3455,-122.023,2590,4800.0
21610,340000.0,3.0,1.75,1730.0,11986.0,1.0,0,3,5,6.0,1730,0,1918.0,0,98198.0,47.3595,-122.323,2490,9264.0
21611,740000.0,4.0,2.50,3360.0,15091.0,2.0,0,0,3,9.0,3360,0,1997.0,0,98052.0,47.6649,-122.135,1930,9936.0


# Split data (train/test)

In [8]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(kchouse, test_size=0.3)

In [9]:
train.shape

(15123, 19)

In [10]:
train.isna().sum()

price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

# Data Prep

Perform your data prep here. You can use pipelines like we do in the tutorials. Otherwise, feel free to use your own data prep steps. Eventually, you should do the following at a minimum:<br>
- Separate inputs from target<br>
- Impute/remove missing values<br>
- Standardize the continuous variables<br>
- One-hot encode categorical variables<br>

In [11]:
# Imports:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [12]:
# Separate the target variable and input variables

train_y = train[['price']]
test_y = test[['price']]

train_inputs = train.drop(['price'], axis=1)
test_inputs = test.drop(['price'], axis=1)

In [13]:
# Identify the numerical and categorical columns

train_inputs.dtypes

bedrooms         float64
bathrooms        float64
sqft_living      float64
sqft_lot         float64
floors           float64
waterfront         int64
view               int64
condition          int64
grade            float64
sqft_above         int64
sqft_basement      int64
yr_built         float64
yr_renovated       int64
zipcode          float64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15       float64
dtype: object

In [14]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [15]:
categorical_columns = ['zipcode']

In [16]:
# Be careful: numeric_columns already includes the categorical columns,
# so we need to remove the categorical columns from numeric_columns.

for col in categorical_columns:
    numeric_columns.remove(col)

In [17]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['waterfront']

In [18]:
# Be careful: numeric_columns already includes the binary columns,
# so we need to remove the binary columns from numeric_columns.

for col in binary_columns:
    numeric_columns.remove(col)

In [19]:
numeric_columns

['bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15']

In [20]:
categorical_columns

['zipcode']

In [21]:
binary_columns

['waterfront']

In [22]:
# Numeric transformer:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [23]:
# Categorical transformer:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=99999)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
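The behavior of the categorical transformer can be seen on a toy example. This is a minimal sketch, assuming made-up zip codes (they are not taken from the data set) and a hypothetical name `toy_transformer`: a missing value is imputed as 99999 and gets its own one-hot column, while a zip code that never appeared in training is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy zip codes (illustrative values only, not taken from the data set)
train_zip = np.array([[98022.0], [98023.0], [np.nan]])
test_zip = np.array([[98023.0], [98199.0]])  # 98199 never appears in training

# Same steps as the categorical transformer, under a hypothetical name
toy_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=99999)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

toy_transformer.fit(train_zip)

# Learned categories are 98022, 98023 and the imputed 99999 (three columns)
encoded = toy_transformer.transform(test_zip).toarray()
print(encoded)
```

The second row of `encoded` contains only zeros, which is how the encoder represents a category it did not see during fitting.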

In [24]:
# Binary transformer:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [25]:
# Column transformer:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional. You don't have to use it.

In [26]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

<15123x87 sparse matrix of type '<class 'numpy.float64'>'
	with 257198 stored elements in Compressed Sparse Row format>

In [27]:
train_x.shape

(15123, 87)

In [28]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

<6482x87 sparse matrix of type '<class 'numpy.float64'>'
	with 110250 stored elements in Compressed Sparse Row format>

In [29]:
test_x.shape

(6482, 87)

# Train a Linear Regression Model

In [30]:
#Closed form solution

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

lin_reg.fit(train_x, train_y)

LinearRegression()

In [31]:
from sklearn.metrics import mean_squared_error

In [32]:
#Train RMSE
reg_train_pred = lin_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 161289.27598277864


In [33]:
#Test RMSE
reg_test_pred = lin_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 160639.64399069356
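Because the same RMSE calculation recurs for every model in this notebook, it could be wrapped in a small helper. The following is a minimal sketch, not part of the original assignment; the function name `report_rmse` and the toy data are hypothetical, and it assumes any fitted scikit-learn regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def report_rmse(model, x, y, label):
    """Print and return the RMSE of a fitted model on (x, y)."""
    rmse = np.sqrt(mean_squared_error(y, model.predict(x)))
    print('{} RMSE: {}'.format(label, rmse))
    return rmse


# Toy data where y = 2x exactly, so a linear model fits it perfectly
toy_x = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_y = np.array([2.0, 4.0, 6.0, 8.0])

toy_model = LinearRegression().fit(toy_x, toy_y)
toy_rmse = report_rmse(toy_model, toy_x, toy_y, 'Toy train')
```

In this notebook, a call such as `report_rmse(lin_reg, test_x, test_y, 'Test')` would replace the four-line pattern used in the cells above.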


# Calculate the Baseline

In [34]:
#First find the average value of the target

mean_value = np.mean(train_y['price'])

mean_value

541770.9706407458

In [35]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_y))

baseline_pred

array([541770.97064075, 541770.97064075, 541770.97064075, ...,
       541770.97064075, 541770.97064075, 541770.97064075])

In [36]:
baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 363073.22435019154
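The same mean-prediction baseline can also be expressed with scikit-learn's `DummyRegressor`. This is a minimal sketch on made-up target values (the `toy_` arrays are hypothetical); the inputs are ignored by the dummy model, so a placeholder array of zeros is passed.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up targets for illustration only
toy_train_y = np.array([100.0, 200.0, 300.0])
toy_test_y = np.array([150.0, 250.0])

# strategy='mean' always predicts the mean of the training target
dummy = DummyRegressor(strategy='mean')
dummy.fit(np.zeros((len(toy_train_y), 1)), toy_train_y)

dummy_pred = dummy.predict(np.zeros((len(toy_test_y), 1)))
dummy_rmse = np.sqrt(mean_squared_error(toy_test_y, dummy_pred))

print('Baseline prediction: {}'.format(dummy_pred[0]))
print('Baseline RMSE: {}'.format(dummy_rmse))
```

As in the cells above, the mean is learned from the training target only and then evaluated against the test target.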


In [37]:
train_y['price']

17470     652500.0
12842     315000.0
16282    1030000.0
13911     700000.0
7499      690000.0
           ...    
6907      380000.0
10503     731781.0
18284     575000.0
5203      370000.0
644       235000.0
Name: price, Length: 15123, dtype: float64

# Train a SGD model (with no regularization)

In [38]:
from sklearn.linear_model import SGDRegressor 

# tol = stopping criterion
# eta0 = learning rate
# penalty = regularization term
# max_iter = number of passes over training data (i.e., epochs)

sgd_reg = SGDRegressor(max_iter=100, penalty=None, eta0=0.1, tol=0.0001) 

sgd_reg.fit(train_x, train_y)

SGDRegressor(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

In [39]:
sgd_reg.n_iter_

15

### Generate the error metrics

In [40]:
#Train RMSE
reg_train_pred = sgd_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 163558.78884699786


In [41]:
#Test RMSE
reg_test_pred = sgd_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 162822.77887352704


# Try L1 Regularization in SGD

In [42]:
from sklearn.linear_model import SGDRegressor 

# tol = stopping criterion
# eta0 = learning rate
# penalty = regularization term
# max_iter = number of passes over training data (i.e., epochs)

sgd_reg = SGDRegressor(max_iter=100, penalty= 'l1', eta0=0.1, tol=0.0001) 

sgd_reg.fit(train_x, train_y)

SGDRegressor(eta0=0.1, max_iter=100, penalty='l1', tol=0.0001)

In [43]:
sgd_reg.n_iter_

20

### Generate the error metrics

In [44]:
#Train RMSE
reg_train_pred = sgd_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 166788.8893461099


In [45]:
#Test RMSE
reg_test_pred = sgd_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 165729.78114241158


# Try L2 Regularization in SGD

In [46]:
from sklearn.linear_model import SGDRegressor 

# tol = stopping criterion
# eta0 = learning rate
# penalty = regularization term, default is 'l2'
# max_iter = number of passes over training data (i.e., epochs)

sgd_reg = SGDRegressor(max_iter=100, eta0=0.1, tol=0.0001) 

sgd_reg.fit(train_x, train_y)

SGDRegressor(eta0=0.1, max_iter=100, tol=0.0001)

In [47]:
sgd_reg.n_iter_

14

### Generate the error metrics

In [48]:
#Train RMSE
reg_train_pred = sgd_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 169256.26466514188


In [49]:
#Test RMSE
reg_test_pred = sgd_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 168676.08501081952


# Try ElasticNet in SGD

In [50]:
from sklearn.linear_model import SGDRegressor 

# tol = stopping criterion
# eta0 = learning rate
# penalty = regularization term
# max_iter = number of passes over training data (i.e., epochs)

sgd_reg = SGDRegressor(max_iter=100, penalty= 'elasticnet', eta0=0.1, tol=0.0001) 

sgd_reg.fit(train_x, train_y)

SGDRegressor(eta0=0.1, max_iter=100, penalty='elasticnet', tol=0.0001)

In [51]:
sgd_reg.n_iter_

19

### Generate the error metrics

In [52]:
#Train RMSE
reg_train_pred = sgd_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 183230.91471767923


In [53]:
#Test RMSE
reg_test_pred = sgd_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 183728.7185970101


# Create Polynomial Features

Create polynomial features with degree = 2. 

In [54]:
from sklearn.preprocessing import PolynomialFeatures

# Create second degree terms and interaction terms
poly_features = PolynomialFeatures(degree=2).fit(train_x)

train_x_poly = poly_features.transform(train_x)

test_x_poly = poly_features.transform(test_x)

# Mind you, this will create the polynomial terms of the categorical variables too

# If degree=3, it creates every term of total degree <= 3, e.g. for two features a and b:
# 1, a, b, a^2, a.b, b^2, a^3, a^2.b, a.b^2, b^3 (a^2.b^2 is degree 4, so it is not included)
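To check which terms a given degree produces, the fitted transformer can be inspected on a toy input. This is a minimal sketch with two made-up features, a = 2 and b = 3; the names `toy` and `poly3` are hypothetical. The `powers_` attribute lists the exponent of each input feature in each output term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features: a = 2, b = 3
toy = np.array([[2.0, 3.0]])

poly3 = PolynomialFeatures(degree=3).fit(toy)
terms = poly3.transform(toy)

# Each row of powers_ holds the exponents of (a, b) for one output term
print(poly3.powers_)
print(terms)

# The total degree of any term never exceeds the requested degree
max_total_degree = poly3.powers_.sum(axis=1).max()
print('Number of terms: {}, max total degree: {}'.format(
    terms.shape[1], max_total_degree))
```

With two input features and degree 3 there are 10 output terms, including the constant (bias) column.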

In [55]:
#We still fit a linear regression model

pol_lin_reg = LinearRegression()

pol_lin_reg.fit(train_x_poly, train_y)

LinearRegression()

In [56]:
#Train RMSE
reg_train_pred = pol_lin_reg.predict(train_x_poly)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 94461.6292176902


In [57]:
#Test RMSE
reg_test_pred = pol_lin_reg.predict(test_x_poly)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 119907.10575167852


# Try L2 Regularization in SGD (with polynomial features)

In [2]:
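# Stochastic gradient descent with L2 regularization (the default penalty)
from sklearn.linear_model import SGDRegressor

# Note: the model is fit on train_x, matching the predictions in the cells below.
# Pass train_x_poly / test_x_poly instead to use the polynomial features.
sgd_reg_L2 = SGDRegressor(max_iter=50, alpha=0.1, eta0=0.1, tol=0.0001)

sgd_reg_L2.fit(train_x, train_y)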

### Generate the error metrics

In [92]:
#Train RMSE
reg_train_pred = sgd_reg_L2.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 212061.9520285425


In [93]:
#Test RMSE
reg_test_pred = sgd_reg_L2.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 204846.9980708624


# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) Does the best model perform better than the baseline (and why)?<br>
3) Does the best model exhibit any overfitting; what did you do about it?
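
One way to approach these questions is to collect the RMSE values printed above into a single table. This is a minimal sketch: the numbers are copied (rounded to two decimals) from the notebook outputs, while the model labels and the two ratio columns are illustrative choices rather than part of the assignment.

```python
import pandas as pd

# RMSE values copied from the printed outputs above (rounded to 2 decimals)
results = pd.DataFrame(
    [
        ('Linear regression', 161289.28, 160639.64),
        ('SGD, no penalty', 163558.79, 162822.78),
        ('SGD, L1', 166788.89, 165729.78),
        ('SGD, L2', 169256.26, 168676.09),
        ('SGD, ElasticNet', 183230.91, 183728.72),
        ('Linear regression, degree-2 polynomial', 94461.63, 119907.11),
        ('SGD, L2 (alpha=0.1)', 212061.95, 204847.00),
    ],
    columns=['model', 'train_rmse', 'test_rmse'])

baseline_rmse = 363073.22  # mean-prediction baseline on the test set

# Below 1 means the model beats the baseline on the test set
results['test_vs_baseline'] = results['test_rmse'] / baseline_rmse

# Well above 1 means test error is much larger than train error (possible overfitting)
results['test_to_train'] = results['test_rmse'] / results['train_rmse']

print(results.sort_values('test_rmse'))
```

The table only organizes the evidence; the "why" in each question still needs a written answer.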