<div style = 'text-align: center;'>
    <img src = '../images/ga_logo_large.png'>
</div>

# **Project 2: Ames Price Prediction Model**

---
### **Model Preprocessing and Fitting**

In [7]:
# import needed libraries for this notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

In [678]:
# read in clean file

file_path = '../datasets/clean_data/ames_clean.csv'
ames = pd.read_csv(file_path)

#check size
ames.shape

(2051, 81)

In [680]:
# make sure it's indeed the clean file
ames.head()

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,69.06,13517,Pave,No_Alley,IR1,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,3,2010,WD,130500.0
1,544,531379050,60,RL,43.0,11492,Pave,No_Alley,IR1,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,4,2009,WD,220000.0
2,153,535304180,20,RL,68.0,7922,Pave,No_Alley,Reg,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,1,2010,WD,109000.0
3,318,916386060,60,RL,73.0,9802,Pave,No_Alley,Reg,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,4,2010,WD,174000.0
4,255,906425045,50,RL,82.0,14235,Pave,No_Alley,IR1,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,3,2010,WD,138500.0


---
### **Model 1**
### **Create Features Matrix and Target Vector**

**Predictive Matrix**
____________

In [684]:
# build X matrix, this is model 1

features = ['1st_flr_sf', 'lot_area', '2nd_flr_sf']
X = ames[features]

# check dimensions
X.shape

(2051, 3)

In [686]:
# check first rows
X.head()

Unnamed: 0,1st_flr_sf,lot_area,2nd_flr_sf
0,725,13517,754
1,913,11492,1209
2,1057,7922,0
3,744,9802,700
4,831,14235,614


**Target Vector**
________________

In [689]:
# build y vector, model 1

y = ames['saleprice']

# check dimensions
y.shape

(2051,)

In [691]:
# check first rows
y.head()

0    130500.0
1    220000.0
2    109000.0
3    174000.0
4    138500.0
Name: saleprice, dtype: float64

The number of rows for X and y match.

---
### **Build Model**

**Split Train and Test Data**
________________

In [777]:
# execute first split using default ratio of 75/25 (default)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [779]:
# confirm splits and shapes
X_train.shape, X_test.shape

((1538, 3), (513, 3))

**Instantiate and Fit Linear Regresssion Model**
____________

In [782]:
# instantiate model 1
model1 = LinearRegression()

In [784]:
# fit model
model1.fit(X_train, y_train)

**Evaluate Model 1**
__________________

In [787]:
# train set model 1 R2 score
model1.score(X_train, y_train)

0.5627272882398949

In [789]:
# test set model 1 R2 score
model1.score(X_test, y_test)

0.5887024620547454

**Cross Validation**
_______________________

In [792]:
# get R2 score from cross validating, keep 5-fold default
cross_val_score(model1, X_train, y_train)

array([0.6201977 , 0.63807278, 0.49665357, 0.57348989, 0.4190321 ])

Lots of R2 score variation

In [795]:
# cross validation mean
cross_val_score(model1, X_train, y_train).mean()

0.5494892097531604

The average R2 score on the train data is 0.58, while the R2 score on the test data is 0.49.  This means there is room for improvement for this model. Recall that one of the features (`2nd_flr_sf`) contains non-zero data for just 860 observations, that's 42% of the entire population.  Further, the correlation coefficent of that feature, relative to the target (`saleprice`) is only 0.25.  Given this scenario, let's try dropping that feature altogether and evaluate the model again.

### **Model Tuning**
_________________

**New Predicitve Matrix**
________________

In [800]:
# build new X matrix, this is model 1, iteration 2
# drop '2nd_flr_sf'

features = ['1st_flr_sf', 'lot_area']
X = ames[features]

# check dimensions
X.shape

(2051, 2)

**Target Vector**
____________

In [803]:
# remains the same, just verify

y.shape, y.head()

((2051,),
 0    130500.0
 1    220000.0
 2    109000.0
 3    174000.0
 4    138500.0
 Name: saleprice, dtype: float64)

**Split Train and Test Data**
__________

In [806]:
# keep default ratio of 75/25

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [808]:
# confirm splits and shapes
X_train.shape, X_test.shape

((1538, 2), (513, 2))

**Instantiate and Fit Linear Regression Model**
______________

In [811]:
# instantiate
model1 = LinearRegression()

In [813]:
# fit model
model1.fit(X_train, y_train)

**Re-evaluate Model 1**

In [816]:
# train set model 1 R2 score
model1.score(X_train, y_train)

0.3802943709884766

In [818]:
# test set model 1 R2 score
model1.score(X_test, y_test)

0.4062392041527192

**Cross Validation**
________________

In [821]:
# keep 5-fold default
cross_val_score(model1, X_train, y_train)

array([0.33587949, 0.42656002, 0.37391136, 0.41002464, 0.33114506])

In [823]:
# average
cross_val_score(model1, X_train, y_train).mean()

0.3755041144417797

Model performance worsened after dropping `2nd_flr_sf` feature. This feature will need to be added back in, plus consider another feature take takes into account square footage.  This would keep the model consistent with basic problem statement premise, namely that home buyers value space.<br>
Add `total_bsmt_sf` and go for a third model iteration.

**New Predictive Matrix**
___________

In [826]:
# build new X matrix, iteration 3
# add '2nd_flr_sf' back in and add 'total_bsmt_sf'

features = ['1st_flr_sf', 'lot_area', '2nd_flr_sf', 'total_bsmt_sf']
X = ames[features]

# check dimensions
X.shape

(2051, 4)

**Target Vector**
____________

In [829]:
# should still be the same, confirm

y.shape, y.head()

((2051,),
 0    130500.0
 1    220000.0
 2    109000.0
 3    174000.0
 4    138500.0
 Name: saleprice, dtype: float64)

**Split Train and Test Data**
______________

In [832]:
# keep default ratio of 75/25

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [834]:
# confirm splits / dimensions
X_train.shape, X_test.shape

((1538, 4), (513, 4))

**Instantiate and Fit Linear Regression Model**
__________________

In [837]:
# instantiate
model1 = LinearRegression()

In [839]:
# fit model
model1.fit(X_train, y_train)

**Re-evaluate Model 1**
_____________

In [842]:
# train set R2 score
model1.score(X_train, y_train)

0.6316493607090364

In [844]:
# test set R2 score
model1.score(X_test, y_test)

0.5693305871738592

**Cross Validation**
____________

In [847]:
# keep 5-fold default
cross_val_score(model1, X_train, y_train)

array([0.30826166, 0.69529585, 0.65253914, 0.69973265, 0.6813917 ])

In [849]:
# average
cross_val_score(model1, X_train, y_train).mean()

0.6074441987071285

There is a slight improvement in model performance. Time to look into other features.

________

---
### **Model 2**
Use `ames_clean2.csv` file for model 2 setup.<br>

In [855]:
# read in dataset
file_path = '../datasets/clean_data/ames_clean2.csv'
ames2 = pd.read_csv(file_path)

# confirm shape
ames2.shape

(2051, 83)

In [857]:
# first two rows
ames2.head(2)

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice,indoor_area,outdoor_area
0,109,533352170,60,RL,69.06,13517,Pave,No_Alley,IR1,Lvl,...,Np,NoFe,NoFea,0.0,3,2010,WD,130500.0,2204.0,13561.0
1,544,531379050,60,RL,43.0,11492,Pave,No_Alley,IR1,Lvl,...,Np,NoFe,NoFea,0.0,4,2009,WD,220000.0,3035.0,11566.0


### **Create Features Matrix and Target Vector**

**Predictive Matrix**
________

In [861]:
# build X matrix, model 2, iteration 1
features = ['indoor_area', 'garage_area', 'outdoor_area']
X = ames2[features]

# check dimensions
X.shape

(2051, 3)

In [863]:
# see first rows
X.head()

Unnamed: 0,indoor_area,garage_area,outdoor_area
0,2204.0,475.0,13561.0
1,3035.0,559.0,11566.0
2,2114.0,246.0,7974.0
3,1828.0,400.0,9902.0
4,2121.0,484.0,14294.0


**Target Vector**
________

In [866]:
# same y, confirm 
y.shape, y.head()

((2051,),
 0    130500.0
 1    220000.0
 2    109000.0
 3    174000.0
 4    138500.0
 Name: saleprice, dtype: float64)

**Split Train and Test Data**
______

In [869]:
# keep default ratio of 75/25

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [871]:
# confirm splits and dimensions

X_train.shape, X_test.shape

((1538, 3), (513, 3))

**Instantiate and Fit Model**

In [874]:
# instantiate
model2 = LinearRegression()

In [876]:
# fit model
model2.fit(X_train, y_train)

**Evaluate Model**
__________

In [879]:
# train set R2 score
model2.score(X_train, y_train)

0.6408191189022633

In [881]:
# test set R2 score
model2.score(X_test, y_test)

0.7176553763156469

**Cross Validation**
___________

In [884]:
# keep 5-fold default
cross_val_score(model2, X_train, y_train)

array([0.6854459 , 0.46873634, 0.70867125, 0.55353726, 0.69409785])

In [886]:
# average
cross_val_score(model2, X_train, y_train).mean()

0.6220977195194302

-----
**Automate this Process**

Build a simple function to pass in data, execute workflow of building and fitting model, then provide scores.

In [889]:
def build_model(df,
                features,
                pol = False,
                scale = False,
                dummy = False,
                dum_cols = None, 
                target = 'saleprice',
                ts = 0.75, 
                cv = 5):
    

    # X matrix ------------------------------------------------------------------
    X = df[features]

    if dummy == True:
        # create dummy vars and add to X
        X = pd.get_dummies(X, columns = [dum_cols], drop_first = True, dtype = int)
    
    
    # target y vector -----------------------------------------------------------
    y = df[target]
    
    
    # poly features and scaling -------------------------------------------------
    if pol == True:
        # instantiate poly features
        poly = PolynomialFeatures(include_bias = False)
        X = poly.fit_transform(X)
    
    if scale == True:
        # instantiate standar scaler
        sc = StandardScaler()
        X = sc.fit_transform(X)
        

    # rest of workflow ----------------------------------------------------------
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = ts)
    
    # instantiate model
    lr = LinearRegression()
    
    # fit model
    lr.fit(X_train, y_train)
    
    # evaluate model
    # train set
    r2_train = round(lr.score(X_train, y_train), 2)
    # test set
    r2_test = round(lr.score(X_test, y_test), 2)
    
    # cross val
    x_val = round(cross_val_score(lr, X_train, y_train, cv = cv).mean(), 2)
    
    return [r2_train, r2_test, x_val]

In [910]:
# test function
features = ['indoor_area', 'overall_qual', 'overall_cond', 'neighborhood']
build_model(ames2, features = features, pol = True, scale = True, dummy = True, dum_cols = 'neighborhood')

[0.93, -2.720455835968742e+26, -7.924954285723748e+26]

The function seems to work as intended.  The features were scaled, polynomial features were generated, and dummy variables were created as well.  Based on the scores that were ouput, this specific model does well with train data, but fails miserably with test data, it's overfit.

----
**Iterative Modeling**<br>
From this section on down, the model will undergo different tweaks as features are transformed, scaled, combined or dummified.  A list of features will be passed into the function and the scores will be returned for quick model evaluation.

---
**Iteration Number 3**

In [931]:
# iteration 3, add 'year_built`
features = ['garage_area', 'indoor_area', 'outdoor_area', 'year_built']

# build/evaluate model
build_model(ames2, features = features)

[0.79, 0.68, 0.77]

On iteration number 3, the model shows a slight improvement with R2 scores above 0.6.  The model's performance is responsible for about 60% of the variability on the target variable.

----
**Iteration Number 4**

In [950]:
# iteration 4, drop 'outdoor_area'
features = ['garage_area', 'indoor_area', 'year_built']

# build/evaluate model
build_model(ames2, features = features)

[0.78, 0.67, 0.77]

Dropping the `outdoor_area` feature is not making too much of a difference.  The R2 scores are still, on average, above 0.6.  The model has not significantly improved yet.

----
**Iteration Number 5**<br>
From the EDA findings, this iteration will consist of creating an interactive feature by combining `overall_qual` and `overall_cond`.  It will also take into account a new column that adds the `garage_area` to `indoor_area`, which will be called `total_house_area`.  Finally, the `year_built` feature will be part of the model.

In [973]:
# iteration 5

# Interaction terms
ames2['overall qual * cond'] = ames2['overall_qual'] * ames2['overall_cond']

# Combine garage and indoor for total house area
ames2['total_house_area'] = ames2['garage_area'] + ames2['indoor_area']

# build/eval model
features = ['total_house_area', 'year_built', 'overall qual * cond']
build_model(ames2, features = features)

[0.76, 0.77, 0.7]

A noticeably better model performance is observed.  The R2 scores are now averaging slightly and consistenly above 0.7.  The way the features were combined improved the model.  Since the scores for the train set and test set are tracking closely, this means that the model is performing well on unseen data.

---
**Iteration Number 6**<br>
As the model has been steadily improved, and based on additional EDA, this iteration will seek further improvements by creating polynomial features and scaling them.  In addition, other features will be combined, namely all full bathrooms will be added up and appended as a new colum.  The same approach will be taken with the half bathrooms.

In [989]:
# iteration 6
# use poly features and scale vars

# combine full baths into one col
ames2['full_baths'] = ames2['bsmt_full_bath'] + ames2['full_bath']

# combine 1/2 baths into one col
ames2['half_baths'] = ames2['bsmt_half_bath'] + ames2['half_bath']

# drop 'overall qual * cond' col, poly feat taking care of it
ames2.drop(columns = 'overall qual * cond', inplace = True)

Test model, below.

In [1020]:
# build/eval model
features = ['year_built', 'total_house_area', 'overall_qual', 'overall_cond', 'full_baths', 'half_baths']
build_model(ames2, features = features, pol = True, scale = True)

[0.91, 0.86, 0.71]

Observed significantly improved performance on both the training set and the test set.  The R2 scores are fluctuting between 0.85 and 0.9.  At this stage, the model is accounting for most of the variability in the target variable.

In [1028]:
# tweak and evaluate model
# remove overall_cond

features = ['year_built', 'total_house_area', 'overall_qual', 'full_baths', 'half_baths']
build_model(ames2, features = features, pol = True, scale = True)

[0.9, 0.62, 0.88]

There seems to be some inconsistency with the test set under this scenario.

In [1043]:
# tweak and evaluate again
# remove half baths

features = ['year_built', 'total_house_area', 'overall_qual', 'full_baths']
build_model(ames2, features = features, pol = True, scale = True)

[0.89, 0.74, 0.87]

Based on R2 scores, the performance is slightly better.  However, not as good as when all the features were incorporated.  Run the first version of this iteration on the Kaggle side.

---
**Iteration Number 7**<br>
Besides current features, this iteration will also take into account the month when the home was sold, the total number of rooms, fireplaces and whether centrail AC is available or not.  This last variable will be dummified. Polynonial features and scaling will also be used.

In [1072]:
# iteration 7
# use poly features, scaling, dummify 'central_air'
# add month sold and number of total rooms

features = ['total_house_area', 'overall_qual', 'overall_cond', 'mo_sold',
            'full_baths', 'half_baths', 'totrms_abvgrd', 'fireplaces', 'central_air']

# build / eval model
build_model(ames2, features = features, pol = True, scale = True, dummy = True, dum_cols = 'central_air')

[0.92, 0.73, 0.87]

The R2 score on the the training set is quite strong, the test set fluctuates somewhat.  Again, signs that overfitting needs to be addressed in order to get the right bias/variance balance.