# Modeling

This notebook covers every step of the modeling phase of the data science pipeline for the Zillow clustering project. It relies on helper files, so to run the code cells, make sure all the helper files are in the same directory as this notebook.

---

## The Required Imports

As noted above, this notebook relies on helper files, which are imported below. It also relies on numpy, pandas, matplotlib, seaborn, and sklearn.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline

from acquire import AcquireZillow
from prepare import prepare_and_split
from preprocessing import *
from model import establish_baseline
from _model import Model
from evaluate import *

---

## Acquire and Prepare the Data

Let's acquire, prepare, and split the data before we begin.

In [2]:
# Let's acquire and prepare the data.
train, validate, test = prepare_and_split(AcquireZillow().get_data())
train.shape, validate.shape, test.shape

  df = self._load_data(use_cache, cache_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['cluster'] = kmeans.predict(test[columns])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


((34867, 12), (14943, 11), (12453, 12))

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34867 entries, 68462 to 5796
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   square_feet           34867 non-null  float64 
 1   lot_size              34867 non-null  float64 
 2   property_age          34867 non-null  float64 
 3   non_average_zip_code  34867 non-null  bool    
 4   logerror              34867 non-null  float64 
 5   bathroomcnt           34867 non-null  float64 
 6   bedroomcnt            34867 non-null  float64 
 7   tax_assessed_value    34867 non-null  float64 
 8   cluster               34867 non-null  category
 9   cluster_1             34867 non-null  uint8   
 10  cluster_2             34867 non-null  uint8   
 11  cluster_3             34867 non-null  uint8   
dtypes: bool(1), category(1), float64(7), uint8(3)
memory usage: 2.3 MB


In [4]:
validate.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14943 entries, 55436 to 38305
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   square_feet           14943 non-null  float64 
 1   lot_size              14943 non-null  float64 
 2   property_age          14943 non-null  float64 
 3   non_average_zip_code  14943 non-null  bool    
 4   logerror              14943 non-null  float64 
 5   bathroomcnt           14943 non-null  float64 
 6   bedroomcnt            14943 non-null  float64 
 7   tax_assessed_value    14943 non-null  float64 
 8   cluster               14943 non-null  category
 9   cluster_2             14943 non-null  uint8   
 10  cluster_3             14943 non-null  uint8   
dtypes: bool(1), category(1), float64(7), uint8(2)
memory usage: 992.4 KB


In [5]:
# The validate split has no rows in cluster 1, so its cluster_1 dummy column
# was never created. Add it by hand, filled with zeros.
validate['cluster_1'] = 0
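
Adding the column by hand works for one missing dummy. A more general pattern, sketched below on toy data, is to reindex each split's dummy columns against the training columns. The frames `train_demo` and `validate_demo` are hypothetical stand-ins, and the sketch assumes the dummies come from `pd.get_dummies`, which the helper files may or may not use.

```python
import pandas as pd

# Toy stand-ins for the real splits (hypothetical data).
train_demo = pd.DataFrame({"cluster": pd.Categorical([0, 1, 2, 3])})
validate_demo = pd.DataFrame({"cluster": pd.Categorical([0, 2, 3, 2])})

# drop_first=True drops the first category, leaving cluster_1..cluster_3 in train.
train_dummies = pd.get_dummies(
    train_demo["cluster"], prefix="cluster", drop_first=True, dtype=int
)
# validate never saw cluster 1, so it only gets cluster_2 and cluster_3.
validate_dummies = pd.get_dummies(
    validate_demo["cluster"], prefix="cluster", drop_first=True, dtype=int
)

# Align validate to the training columns; missing dummies are filled with 0.
validate_aligned = validate_dummies.reindex(
    columns=train_dummies.columns, fill_value=0
)
print(validate_aligned.columns.tolist())
```

Reindexing also fixes the column order, which matters for estimators that are sensitive to feature order.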

In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12453 entries, 17178 to 20860
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   square_feet           12453 non-null  float64 
 1   lot_size              12453 non-null  float64 
 2   property_age          12453 non-null  float64 
 3   non_average_zip_code  12453 non-null  bool    
 4   logerror              12453 non-null  float64 
 5   bathroomcnt           12453 non-null  float64 
 6   bedroomcnt            12453 non-null  float64 
 7   tax_assessed_value    12453 non-null  float64 
 8   cluster               12453 non-null  category
 9   cluster_1             12453 non-null  uint8   
 10  cluster_2             12453 non-null  uint8   
 11  cluster_3             12453 non-null  uint8   
dtypes: bool(1), category(1), float64(7), uint8(3)
memory usage: 839.3 KB


## Modeling

We identified in exploration that the features square_feet, property_age, zip code (encoded here as non_average_zip_code), and tax_assessed_value may have relationships with logerror. We'll use these as well as the clusters in our models. We will build linear regression, polynomial regression, and Tweedie regressor models, each with and without the clusters.

In [7]:
# First we must remove outliers and scale the data.
train_no_outliers = remove_outliers(train, 1.5, ['square_feet', 'tax_assessed_value'])
train_scaled, validate_scaled, test_scaled = scale_data(
    train,
    validate,
    test,
    train.drop(columns = 'logerror').columns
)

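# NOTE: this filters train_no_outliers, which was built from the unscaled train,
# so train_scaled_no_outliers appears to hold unscaled features even though the
# models below predict on train_scaled and validate_scaled. Filtering train_scaled
# instead would keep both on the same scale, but would change the recorded outputs.
train_scaled_no_outliers = remove_outliers(train_no_outliers, 1.5, ['square_feet', 'tax_assessed_value'])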
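
The helpers `remove_outliers` and `scale_data` live in `preprocessing.py` and are not shown here. As a rough sketch of what such helpers commonly do, the block below assumes an IQR-based filter (with `1.5` as the IQR multiplier) and min-max scaling fit on the training split only. The function names `remove_outliers_iqr` and `min_max_scale` are hypothetical and may differ from the project's actual implementation.

```python
import pandas as pd

def remove_outliers_iqr(df, k, columns):
    """Keep rows within k * IQR of the quartiles for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].copy()

def min_max_scale(train, others, columns):
    """Scale columns to [0, 1] using the minimum and range learned from train."""
    low = train[columns].min()
    span = (train[columns].max() - low).replace(0, 1)  # avoid division by zero

    def scale(df):
        return df.assign(**{c: (df[c] - low[c]) / span[c] for c in columns})

    return scale(train), [scale(df) for df in others]

# Toy data (hypothetical): one extreme square_feet value.
demo = pd.DataFrame({
    "square_feet": [1000.0, 1200.0, 1100.0, 1300.0, 50000.0],
    "logerror": [0.01, -0.02, 0.03, 0.00, 0.50],
})

# Scale first, then filter the scaled frame, so the training rows stay in the
# same units as the frames later passed to the model for prediction.
demo_scaled, _ = min_max_scale(demo, [], ["square_feet"])
demo_scaled_no_outliers = remove_outliers_iqr(demo_scaled, 1.5, ["square_feet"])
print(len(demo), len(demo_scaled_no_outliers))
```

Min-max scaling is monotonic, so the IQR filter keeps the same rows whether it runs before or after scaling; what matters is that the frame used for fitting ends up scaled.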

In [8]:
train_scaled.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| square_feet | 34867.0 | 0.10681 | 0.060679 | 0.0 | 0.067983 | 0.090862 | 0.128448 | 1.0 |
| lot_size | 34867.0 | 0.004292 | 0.018695 | 0.0 | 0.000774 | 0.000983 | 0.001682 | 1.0 |
| property_age | 34867.0 | 0.364767 | 0.169167 | 0.0 | 0.233083 | 0.37594 | 0.473684 | 1.0 |
| non_average_zip_code | 34867.0 | 0.129865 | 0.33616 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| logerror | 34867.0 | 0.016868 | 0.164812 | -4.65542 | -0.024592 | 0.006469 | 0.038968 | 5.262999 |
| bathroomcnt | 34867.0 | 0.141272 | 0.105962 | 0.0 | 0.111111 | 0.111111 | 0.222222 | 1.0 |
| bedroomcnt | 34867.0 | 0.281024 | 0.090989 | 0.0 | 0.181818 | 0.272727 | 0.363636 | 1.0 |
| tax_assessed_value | 34867.0 | 0.019166 | 0.02578 | 0.0 | 0.00766 | 0.013748 | 0.022475 | 1.0 |
| cluster | 34867.0 | 0.017543 | 0.107621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| cluster_1 | 34867.0 | 0.000143 | 0.011974 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [9]:
models = []
results = {}

### Establish Baseline

Let's establish a baseline model to compare our models to.

In [10]:
baseline = establish_baseline(train[['logerror']])
mean_squared_error(train[['logerror']], baseline, squared = False)

0.16481013354498159

In [11]:
results['baseline'] = {'RMSE' : mean_squared_error(train[['logerror']], baseline, squared = False)}
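
The helper `establish_baseline` is not shown in this notebook. A common choice, and an assumption here, is to predict the training mean of the target for every row. Under that assumption the baseline RMSE equals the population standard deviation of the target, which is consistent with the 0.16481 above sitting next to the logerror standard deviation of 0.164812 in the describe output. The sketch below uses toy numbers and a hypothetical `rmse` helper written in plain numpy, which also avoids the `squared` argument that newer scikit-learn releases deprecate in favor of `root_mean_squared_error`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error computed directly with numpy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy target values (hypothetical, not the Zillow data).
y = np.array([0.02, -0.01, 0.05, 0.00, -0.03])

# Mean baseline: the same prediction (the training mean) for every row.
baseline_pred = np.full_like(y, y.mean())
baseline_rmse = rmse(y, baseline_pred)

# For a mean baseline, RMSE equals the population standard deviation (ddof=0).
print(baseline_rmse, y.std())
```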

### Linear Regression Without Clusters

In [12]:
features = [
    'square_feet',
    'property_age',
    'non_average_zip_code',
    'tax_assessed_value'
]

lr = Model(LinearRegression(), train_scaled_no_outliers, features, 'logerror')

In [13]:
mean_squared_error(train[['logerror']], lr.make_predictions(train_scaled), squared = False)

0.16476388979922255

In [14]:
results['linear_regression_no_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], lr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], lr.make_predictions(validate_scaled), squared = False)
}

models.append(lr)
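
The `Model` class comes from `_model.py`, which is not shown. Judging only from how it is called in this notebook, it appears to fit an estimator on the listed feature columns and expose `make_predictions` for new frames. The `ModelSketch` class below is a hypothetical stand-in written under that assumption, not the project's actual code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

class ModelSketch:
    """Hypothetical wrapper: fit on selected columns, predict on the same columns."""

    def __init__(self, estimator, train, features, target):
        self.features = list(features)
        self.target = target
        # scikit-learn estimators and pipelines return themselves from fit().
        self.estimator = estimator.fit(train[self.features], train[target])

    def make_predictions(self, df):
        # Select the same feature columns, in the same order, used for fitting.
        return self.estimator.predict(df[self.features])

# Toy data (hypothetical) following y = 2x + 1 exactly.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "unused": [9, 9, 9, 9],
    "y": [1.0, 3.0, 5.0, 7.0],
})
sketch = ModelSketch(LinearRegression(), toy, ["x"], "y")
preds = sketch.make_predictions(pd.DataFrame({"x": [4.0], "unused": [0]}))
print(preds)
```

Storing the feature list on the wrapper means extra columns in a frame, such as `cluster` or `lot_size` here, are ignored at prediction time.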

### Tweedie Regressor Without Clusters

In [15]:
tr = Model(TweedieRegressor(), train_scaled_no_outliers, features, 'logerror')

In [16]:
mean_squared_error(train[['logerror']], tr.make_predictions(train_scaled), squared = False)

0.16478990500501559

In [17]:
results['tweedie_regressor_no_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], tr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], tr.make_predictions(validate_scaled), squared = False)
}

models.append(tr)
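
With default arguments, `TweedieRegressor` uses `power=0` (a normal distribution with an identity link) and `alpha=1` (an L2 penalty), so it behaves like a penalized linear model. That may be why its scores land so close to the linear regression scores. The sketch below illustrates this on synthetic data, which is an assumption for demonstration and not the Zillow data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TweedieRegressor

# Synthetic linear data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.05 + rng.normal(0, 0.01, size=500)

ols = LinearRegression().fit(X, y)

# power=0 with no penalty minimizes squared error, just like ordinary least squares.
glm_unpenalized = TweedieRegressor(power=0, alpha=0, max_iter=1000, tol=1e-8).fit(X, y)

# The defaults (power=0, alpha=1) add an L2 penalty that shrinks the coefficients.
glm_default = TweedieRegressor().fit(X, y)

print(ols.coef_)
print(glm_unpenalized.coef_)
print(glm_default.coef_)
```

Other `power` values (for example 1 for Poisson or 2 for Gamma) require a non-negative or strictly positive target, which logerror is not.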

### Polynomial Regression Without Clusters

In [18]:
poly_reg = make_pipeline(
    PolynomialFeatures(include_bias = False),
    LinearRegression()
)
pr = Model(poly_reg, train_scaled_no_outliers, features, 'logerror')

In [19]:
mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False)

0.1649828469247726

In [20]:
results['polynomial_regression_no_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], pr.make_predictions(validate_scaled), squared = False)
}

models.append(pr)

### Polynomial Regression Interactions Only Without Clusters

In [21]:
poly_reg = make_pipeline(
    PolynomialFeatures(include_bias = False, interaction_only = True),
    LinearRegression()
)
pr = Model(poly_reg, train_scaled_no_outliers, features, 'logerror')

In [22]:
mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False)

0.16546501480244352

In [23]:
results['polynomial_regression_interactions_only_no_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], pr.make_predictions(validate_scaled), squared = False)
}

models.append(pr)

### Linear Regression With Clusters

In [24]:
features = [
    'square_feet',
    'non_average_zip_code',
    'tax_assessed_value',
    'cluster_1',
    'cluster_2',
    'cluster_3'
]

lr = Model(LinearRegression(), train_scaled_no_outliers, features, 'logerror')

In [25]:
mean_squared_error(train[['logerror']], lr.make_predictions(train_scaled), squared = False)

0.16465408125693634

In [26]:
results['linear_regression_with_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], lr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], lr.make_predictions(validate_scaled), squared = False)
}

models.append(lr)

### Tweedie Regressor With Clusters

In [27]:
tr = Model(TweedieRegressor(), train_scaled_no_outliers, features, 'logerror')

In [28]:
mean_squared_error(train[['logerror']], tr.make_predictions(train_scaled), squared = False)

0.16481587014464205

In [29]:
results['tweedie_regressor_with_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], tr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], tr.make_predictions(validate_scaled), squared = False)
}

models.append(tr)

### Polynomial Regression With Clusters

In [30]:
poly_reg = make_pipeline(
    PolynomialFeatures(include_bias = False),
    LinearRegression()
)
pr = Model(poly_reg, train_scaled_no_outliers, features, 'logerror')

In [31]:
mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False)

0.1711358227596668

In [32]:
results['polynomial_regression_with_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], pr.make_predictions(validate_scaled), squared = False)
}

models.append(pr)

### Polynomial Regression Interactions Only With Clusters

In [33]:
poly_reg = make_pipeline(
    PolynomialFeatures(include_bias = False, interaction_only = True),
    LinearRegression()
)
pr = Model(poly_reg, train_scaled_no_outliers, features, 'logerror')

In [34]:
mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False)

0.1708039807187697

In [35]:
results['polynomial_regression_interactions_only_with_clusters'] = {
    'RMSE' : mean_squared_error(train[['logerror']], pr.make_predictions(train_scaled), squared = False),
    'RMSE_val' : mean_squared_error(validate[['logerror']], pr.make_predictions(validate_scaled), squared = False)
}

models.append(pr)

---

## Results

In [36]:
pd.DataFrame(results).T

| | RMSE | RMSE_val |
|---|---|---|
| baseline | 0.16481 | NaN |
| linear_regression_no_clusters | 0.164764 | 0.167628 |
| tweedie_regressor_no_clusters | 0.16479 | 0.167462 |
| polynomial_regression_no_clusters | 0.164983 | 0.167543 |
| polynomial_regression_interactions_only_no_clusters | 0.165465 | 0.167988 |
| linear_regression_with_clusters | 0.164654 | 0.1675 |
| tweedie_regressor_with_clusters | 0.164816 | 0.167487 |
| polynomial_regression_with_clusters | 0.171136 | 0.169414 |
| polynomial_regression_interactions_only_with_clusters | 0.170804 | 0.169074 |
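
To make the comparison easier to read, the results can be ranked by validation RMSE with each model's training RMSE expressed relative to the baseline. The sketch below re-enters the values from the results table into a hypothetical `results_demo` dictionary so that it runs on its own; in the notebook the existing `results` dictionary could be used instead.

```python
import pandas as pd

# Values copied from the results table (rounded as displayed).
results_demo = {
    "baseline": {"RMSE": 0.164810},
    "linear_regression_no_clusters": {"RMSE": 0.164764, "RMSE_val": 0.167628},
    "tweedie_regressor_no_clusters": {"RMSE": 0.164790, "RMSE_val": 0.167462},
    "polynomial_regression_no_clusters": {"RMSE": 0.164983, "RMSE_val": 0.167543},
    "polynomial_regression_interactions_only_no_clusters": {"RMSE": 0.165465, "RMSE_val": 0.167988},
    "linear_regression_with_clusters": {"RMSE": 0.164654, "RMSE_val": 0.167500},
    "tweedie_regressor_with_clusters": {"RMSE": 0.164816, "RMSE_val": 0.167487},
    "polynomial_regression_with_clusters": {"RMSE": 0.171136, "RMSE_val": 0.169414},
    "polynomial_regression_interactions_only_with_clusters": {"RMSE": 0.170804, "RMSE_val": 0.169074},
}

summary = pd.DataFrame(results_demo).T

# Negative values mean the model beats the baseline on the training split.
summary["RMSE_vs_baseline"] = summary["RMSE"] - summary.loc["baseline", "RMSE"]

# Sort by validation RMSE; rows without a validation score (NaN) go last.
summary = summary.sort_values("RMSE_val")
print(summary)
```

The baseline has no validation RMSE recorded, so only the training column supports a direct comparison with it.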


---

## Conclusion

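Adding clusters slightly improved the linear regression model (validation RMSE 0.167500 versus 0.167628) but did not help the other models; both polynomial variants got noticeably worse. Note that the cluster feature set also omits property_age, so the two groups of models differ by more than the clusters alone. Additionally, none of these models performs much better than the baseline: the best training RMSE is 0.164654 against a baseline of 0.164810, and several models do worse than the baseline.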