<div style = 'text-align: center;'>
    <img src = '../images/ga_logo_large.png'>
</div>

# **Project 2: Ames Price Prediction Model**

---
### **Model Preprocessing and Fitting**

In [8]:
# import needed libraries for this notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

In [12]:
# read in clean file

file_path = '../datasets/clean_data/ames_clean.csv'
ames = pd.read_csv(file_path)

# check size
ames.shape

(2051, 81)

In [22]:
# make sure it's indeed the clean file
ames.head()

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,69.06,13517,Pave,No_Alley,IR1,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,3,2010,WD,130500.0
1,544,531379050,60,RL,43.0,11492,Pave,No_Alley,IR1,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,4,2009,WD,220000.0
2,153,535304180,20,RL,68.0,7922,Pave,No_Alley,Reg,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,1,2010,WD,109000.0
3,318,916386060,60,RL,73.0,9802,Pave,No_Alley,Reg,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,4,2010,WD,174000.0
4,255,906425045,50,RL,82.0,14235,Pave,No_Alley,IR1,Lvl,...,0.0,0.0,Np,NoFe,NoFea,0.0,3,2010,WD,138500.0


---
### **Create Features Matrix and Target Vector**

**Predictive Matrix**
____________

In [74]:
# build X matrix, this is model 1

features = ['1st_flr_sf', 'lot_area', '2nd_flr_sf']
X = ames[features]

# check dimensions
X.shape

(2051, 3)

In [76]:
# check first rows
X.head()

Unnamed: 0,1st_flr_sf,lot_area,2nd_flr_sf
0,725,13517,754
1,913,11492,1209
2,1057,7922,0
3,744,9802,700
4,831,14235,614


**Target Vector**
________________

In [78]:
# build y vector, model 1

y = ames['saleprice']

# check dimensions
y.shape

(2051,)

In [80]:
# check first rows
y.head()

0    130500.0
1    220000.0
2    109000.0
3    174000.0
4    138500.0
Name: saleprice, dtype: float64

The number of rows for X and y match.
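The row-count check can also be made explicit rather than read off the printed shapes. The sketch below is illustrative only: it rebuilds the five rows shown in the `head()` outputs above, and the names `ames_sample`, `X_sample` and `y_sample` are hypothetical stand-ins for the full `ames`, `X` and `y`.

```python
import pandas as pd

# First five rows as displayed above; the notebook itself uses all 2051 rows.
ames_sample = pd.DataFrame({
    '1st_flr_sf': [725, 913, 1057, 744, 831],
    'lot_area': [13517, 11492, 7922, 9802, 14235],
    '2nd_flr_sf': [754, 1209, 0, 700, 614],
    'saleprice': [130500.0, 220000.0, 109000.0, 174000.0, 138500.0],
})

X_sample = ames_sample[['1st_flr_sf', 'lot_area', '2nd_flr_sf']]
y_sample = ames_sample['saleprice']

# same number of rows in the features matrix and the target vector
rows_match = X_sample.shape[0] == y_sample.shape[0]

# same index labels in the same order, so each row of X lines up with its y
index_match = X_sample.index.equals(y_sample.index)

print(rows_match, index_match)
```

Comparing the index as well as the length guards against rows that were filtered or re-sorted in one object but not the other.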

---
### **Build Model**

**Split Train and Test Data**
________________

In [116]:
# execute first split using the default 75/25 train/test ratio
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [118]:
# confirm splits and shapes
X_train.shape, X_test.shape

((1538, 3), (513, 3))
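The split above is random, so re-running the cell gives different rows (and different scores) each time. A minimal sketch of a reproducible split follows, assuming a fixed seed is acceptable for this project. The arrays are synthetic placeholders with the same row count as the Ames data, and `random_state=42` is an arbitrary choice, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the Ames data (2051).
# The values are synthetic; only the shapes matter here.
n_rows = 2051
X_demo = np.arange(n_rows * 3).reshape(n_rows, 3)
y_demo = np.arange(n_rows)

# default test_size is 0.25; random_state fixes the shuffle (42 is arbitrary)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# repeating the call with the same seed returns the identical split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=42)
same_split = np.array_equal(X_tr, X_tr2) and np.array_equal(X_te, X_te2)
print(same_split)
```

With 2051 rows, the default ratio puts 513 rows in the test set and the remaining 1538 in the train set, which matches the shapes printed above.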

**Instantiate and Fit Linear Regression Model**
____________

In [121]:
# instantiate model 1
model1 = LinearRegression()

In [123]:
# fit model
model1.fit(X_train, y_train)

**Evaluate Model 1**
__________________

In [126]:
# train set model 1 R2 score
model1.score(X_train, y_train)

0.592006523369035

In [128]:
# test set model 1 R2 score
model1.score(X_test, y_test)

0.4885437814026715
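For reference, `LinearRegression.score` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below checks that by hand on synthetic data. The arrays, coefficients and noise level are made up for illustration and are not taken from the Ames set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: two made-up "square footage" features and a noisy price
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 2000, size=(200, 2))
y_demo = (50_000 + 80 * X_demo[:, 0] + 30 * X_demo[:, 1]
          + rng.normal(0, 20_000, size=200))

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# residual sum of squares: what the model fails to explain
ss_res = np.sum((y_demo - pred) ** 2)
# total sum of squares: what a mean-only baseline fails to explain
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

Read this way, a test R2 of 0.49 means the model explains roughly half of the variance in `saleprice` on unseen rows.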

**Cross Validation**
_______________________

In [130]:
# get R2 score from cross validating, keep 5-fold default
cross_val_score(model1, X_train, y_train)

array([0.57995353, 0.61794777, 0.64988112, 0.56823026, 0.49846517])

The R2 scores vary noticeably across folds, from about 0.50 to 0.65.

In [133]:
# cross validation mean
cross_val_score(model1, X_train, y_train).mean()

0.5828955695053409
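The fold-to-fold variation can be summarized alongside the mean. This sketch reuses the five fold scores printed above; the variable names are illustrative.

```python
import numpy as np

# fold scores copied from the cross_val_score output above
fold_scores = np.array([0.57995353, 0.61794777, 0.64988112,
                        0.56823026, 0.49846517])

mean_r2 = fold_scores.mean()                      # same value as the cell above
std_r2 = fold_scores.std()                        # spread around the mean
spread = fold_scores.max() - fold_scores.min()    # best fold minus worst fold

print(f"mean={mean_r2:.3f}, std={std_r2:.3f}, range={spread:.3f}")
```

Reporting the standard deviation or range next to the mean makes it easier to judge whether a change between iterations is larger than the noise between folds.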

The mean cross-validated R2 score on the train data is 0.58, while the R2 score on the test data is 0.49, so there is room for improvement in this model. Recall that one of the features (`2nd_flr_sf`) contains non-zero data for just 860 observations, or 42% of all observations. Further, the correlation coefficient of that feature with the target (`saleprice`) is only 0.25. Given this, let's try dropping that feature altogether and evaluate the model again.
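The two figures quoted for `2nd_flr_sf` (share of non-zero rows and correlation with the target) can be computed with a small helper. This is a sketch only: `feature_summary` is a hypothetical function, and the demo frame is made up rather than drawn from the Ames data.

```python
import pandas as pd


def feature_summary(df, feature, target):
    """Return the share of non-zero rows for `feature` and its
    Pearson correlation with `target`."""
    nonzero_share = (df[feature] != 0).mean()
    corr = df[feature].corr(df[target])
    return nonzero_share, corr


# tiny made-up frame, not Ames data
demo = pd.DataFrame({
    '2nd_flr_sf': [0, 0, 0, 600, 800],
    'saleprice': [100000, 120000, 150000, 160000, 200000],
})
share, corr = feature_summary(demo, '2nd_flr_sf', 'saleprice')
print(share, corr)

# the share quoted in the text: 860 non-zero rows out of 2051
quoted_share = 860 / 2051
print(round(quoted_share, 2))
```

On the real data the call would be `feature_summary(ames, '2nd_flr_sf', 'saleprice')`.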

### **Model Tuning**
_________________

**New Predictive Matrix**
________________

In [146]:
# build new X matrix, this is model 1, iteration 2
# drop '2nd_flr_sf'

features = ['1st_flr_sf', 'lot_area']
X = ames[features]

# check dimensions
X.shape

(2051, 2)

**Target Vector**
____________

In [149]:
# remains the same, just verify

y.shape, y.head()

((2051,),
 0    130500.0
 1    220000.0
 2    109000.0
 3    174000.0
 4    138500.0
 Name: saleprice, dtype: float64)

**Split Train and Test Data**
__________

In [152]:
# keep default ratio of 75/25

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [156]:
# confirm splits and shapes
X_train.shape, X_test.shape

((1538, 2), (513, 2))

**Instantiate and Fit Linear Regression Model**
______________

In [160]:
# instantiate
model1 = LinearRegression()

In [162]:
# fit model
model1.fit(X_train, y_train)

**Re-evaluate Model 1**
_____________

In [166]:
# train set model 1 R2 score
model1.score(X_train, y_train)

0.368345138991085

In [168]:
# test set model 1 R2 score
model1.score(X_test, y_test)

0.4248421406834433

**Cross Validation**
________________

In [171]:
# keep 5-fold default
cross_val_score(model1, X_train, y_train)

array([0.31500169, 0.44462551, 0.18300903, 0.37271373, 0.44040992])

In [173]:
# average
cross_val_score(model1, X_train, y_train).mean()

0.3511519768564207

Model performance worsened after dropping the `2nd_flr_sf` feature (the cross-validation mean R2 fell from 0.58 to 0.35). This feature will need to be added back in, along with another feature that takes square footage into account. This would keep the model consistent with the basic premise of the problem statement, namely that home buyers value space.<br>
Add `total_bsmt_sf` and go for a third model iteration.
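Each iteration so far calls `train_test_split` without a seed, so every iteration is scored on a different random split, and part of the difference between scores may come from the split rather than from the features. One possible way to compare feature sets on the same rows is sketched below. `evaluate_features`, the seed, and the synthetic frame are all assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_features(df, features, target='saleprice', seed=42):
    """Fit a linear regression on `features` and return train, test and
    5-fold cross-validated R2, always using the same seeded split."""
    X = df[features]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    cv_mean = cross_val_score(LinearRegression(), X_train, y_train).mean()
    return {
        'features': ', '.join(features),
        'train_r2': model.score(X_train, y_train),
        'test_r2': model.score(X_test, y_test),
        'cv_mean_r2': cv_mean,
    }


# synthetic stand-in for the Ames frame (made-up values and coefficients)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    '1st_flr_sf': rng.uniform(500, 2000, n),
    'lot_area': rng.uniform(3000, 20000, n),
    '2nd_flr_sf': rng.choice([0, 600, 900], n),
    'total_bsmt_sf': rng.uniform(0, 1500, n),
})
demo['saleprice'] = (40_000 + 70 * demo['1st_flr_sf'] + demo['lot_area']
                     + 60 * demo['2nd_flr_sf'] + 40 * demo['total_bsmt_sf']
                     + rng.normal(0, 15_000, n))

# the three feature sets tried in this notebook
feature_sets = [
    ['1st_flr_sf', 'lot_area', '2nd_flr_sf'],
    ['1st_flr_sf', 'lot_area'],
    ['1st_flr_sf', 'lot_area', '2nd_flr_sf', 'total_bsmt_sf'],
]
results = pd.DataFrame([evaluate_features(demo, f) for f in feature_sets])
print(results)
```

Because the seed and row count are the same for every call, all three feature sets are trained and tested on identical rows, which makes the R2 columns directly comparable.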

**New Predictive Matrix**
___________

In [214]:
# build new X matrix, iteration 3
# add '2nd_flr_sf' back in and add 'total_bsmt_sf'

features = ['1st_flr_sf', 'lot_area', '2nd_flr_sf', 'total_bsmt_sf']
X = ames[features]

# check dimensions
X.shape

(2051, 4)

**Target Vector**
____________

In [217]:
# should still be the same, confirm

y.shape, y.head()

((2051,),
 0    130500.0
 1    220000.0
 2    109000.0
 3    174000.0
 4    138500.0
 Name: saleprice, dtype: float64)

**Split Train and Test Data**
______________

In [220]:
# keep default ratio of 75/25

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [222]:
# confirm splits / dimensions
X_train.shape, X_test.shape

((1538, 4), (513, 4))

**Instantiate and Fit Linear Regression Model**
__________________

In [225]:
# instantiate
model1 = LinearRegression()

In [227]:
# fit model
model1.fit(X_train, y_train)

**Re-evaluate Model 1**
_____________

In [230]:
# train set R2 score
model1.score(X_train, y_train)

0.5947757783389949

In [232]:
# test set R2 score
model1.score(X_test, y_test)

0.682176677909685

**Cross Validation**
____________

In [235]:
# keep 5-fold default
cross_val_score(model1, X_train, y_train)

array([0.63856212, 0.64673407, 0.68397325, 0.51375278, 0.31995678])

In [237]:
# average
cross_val_score(model1, X_train, y_train).mean()

0.5605958005633526

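There is a slight improvement in model performance: compared with the first iteration, the test R2 score rose from 0.49 to 0.68, while the train R2 score (0.59) and the cross-validation mean (0.56 versus 0.58) are essentially unchanged. Check other metrics.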
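As a starting point for other metrics, error measures in dollars such as MAE and RMSE are common companions to R2. The sketch below uses the five `saleprice` values from `y.head()` as the true values; the predictions are hypothetical numbers made up for illustration, not output from `model1`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# true prices from y.head(); predictions are hypothetical, for illustration
y_true = np.array([130500.0, 220000.0, 109000.0, 174000.0, 138500.0])
y_pred = np.array([140000.0, 205000.0, 120000.0, 170000.0, 150000.0])

# mean absolute error: average size of the miss, in dollars
mae = mean_absolute_error(y_true, y_pred)

# root mean squared error: also in dollars, penalizes large misses more
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

In the notebook, `y_true` would be `y_test` and `y_pred` would be `model1.predict(X_test)`.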