# Ames Housing Project


## Project Challenge Statement

#### Goal: Predict the sale price of homes in the Ames, Iowa housing dataset. 

Two files are used to build the model:

- train_data_cleanna.csv -- contains all of the training data, with no missing values or outliers
- test_data_cleanna.csv -- contains all of the testing data, with no missing values or outliers


## Table of Contents 

This notebook is divided into sections for analysis. The links below jump to the corresponding sections for easy navigation. 

### Contents:
- [Best Features Extraction](#Best-Features-Extraction)

## Feature Engineering Preprocessing 

#### Steps 
1. Build OLS model  
    - one-hot encode all categorical variables  


2. Understand p-values 
    - find variables with a p-value smaller than 0.05
    
    
3. Build OLS model with extracted features


In [1]:
#Eliminate warnings 
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [2]:
# Library imports
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression, RFECV
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


np.random.seed(42)
%matplotlib inline

In [3]:
from functions import *

In [4]:
#import Data 
train = pd.read_csv('../datasets/train.csv')

clean_train_data = pd.read_csv('../datasets/train_data_clean.csv')
clean_test_data = pd.read_csv('../datasets/test_data_clean.csv')

base_train_data = pd.read_csv('../datasets/train_data_cleanna.csv')
base_test_data = pd.read_csv('../datasets/test_data_cleanna.csv')


## Best Features Extraction

Build a linear regression model as a baseline for reference, then extract the key features with low p-values.

### 1. Build OLS Model with statsmodels to Extract Columns with Low p-values

In [5]:
base_train_data.head()

,Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Fireplace Qu,Garage Type,Garage Finish,Garage Qual,Garage Cond,Paved Drive,Pool QC,Fence,Misc Feature,Sale Type
0,0,109,533352170,60,68.878999,13517,6,8,1976,2005,...,No_Fireplace Qu,Attchd,RFn,TA,TA,Y,No_Pool QC,No_Fence,No_Misc Feature,WD
1,1,544,531379050,60,43.0,11492,7,5,1996,1997,...,TA,Attchd,RFn,TA,TA,Y,No_Pool QC,No_Fence,No_Misc Feature,WD
2,2,153,535304180,20,68.0,7922,5,7,1953,2007,...,No_Fireplace Qu,Detchd,Unf,TA,TA,Y,No_Pool QC,No_Fence,No_Misc Feature,WD
3,3,318,916386060,60,73.0,9802,5,5,2006,2007,...,No_Fireplace Qu,BuiltIn,Fin,TA,TA,Y,No_Pool QC,No_Fence,No_Misc Feature,WD
4,4,255,906425045,50,82.0,14235,6,8,1900,1993,...,No_Fireplace Qu,Detchd,Unf,TA,TA,N,No_Pool QC,No_Fence,No_Misc Feature,WD


In [6]:
#Numerical Data only 
X_base = base_train_data[ext_num_features(base_train_data)].drop(columns = ['Unnamed: 0', 'Id','PID', 'SalePrice'])
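
The helpers `ext_num_features` and `ext_cat_features` come from the local `functions` module, which is not shown here. As a rough guess at their behavior, the sketch below uses hypothetical stand-ins (`numeric_columns`, `categorical_columns`) built on `DataFrame.select_dtypes`; the real helpers may differ, and the toy values are made up.

```python
import pandas as pd


def numeric_columns(df: pd.DataFrame) -> list:
    """Hypothetical stand-in for ext_num_features: names of numeric columns."""
    return df.select_dtypes(include="number").columns.tolist()


def categorical_columns(df: pd.DataFrame) -> list:
    """Hypothetical stand-in for ext_cat_features: names of non-numeric columns."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Toy frame with made-up values, only to show the split by dtype.
toy = pd.DataFrame({
    "Lot Area": [13517, 11492],
    "Overall Qual": [6, 7],
    "Garage Type": ["Attchd", "Detchd"],
    "SalePrice": [130500, 220000],
})

print(numeric_columns(toy))      # ['Lot Area', 'Overall Qual', 'SalePrice']
print(categorical_columns(toy))  # ['Garage Type']
```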

##### Get Dummy Variables for Categorical Variables 

In [7]:
X_dummy = ext_cat_features(base_train_data)
pd.get_dummies(base_train_data, columns = X_dummy).head()

,Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Misc Feature_TenC,Sale Type_COD,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,0,109,533352170,60,68.878999,13517,6,8,1976,2005,...,0,0,0,0,0,0,0,0,0,1
1,1,544,531379050,60,43.0,11492,7,5,1996,1997,...,0,0,0,0,0,0,0,0,0,1
2,2,153,535304180,20,68.0,7922,5,7,1953,2007,...,0,0,0,0,0,0,0,0,0,1
3,3,318,916386060,60,73.0,9802,5,5,2006,2007,...,0,0,0,0,0,0,0,0,0,1
4,4,255,906425045,50,82.0,14235,6,8,1900,1993,...,0,0,0,0,0,0,0,0,0,1
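
`pd.get_dummies` as called above creates one indicator column per category level, so the indicators for each categorical variable sum to 1 in every row. When several variables are encoded this way, the resulting columns are linearly dependent. The sketch below, on a made-up toy frame, compares the full encoding with `drop_first=True`, one common way to remove that redundancy; the notebook itself keeps the full encoding.

```python
import pandas as pd

# Toy frame with made-up rows; column names borrowed from the dataset.
toy = pd.DataFrame({
    "Paved Drive": ["Y", "N", "Y", "P"],
    "Sale Type": ["WD", "New", "WD", "COD"],
    "Lot Area": [13517, 11492, 7922, 9802],
})
cat_cols = ["Paved Drive", "Sale Type"]

# Full encoding: one indicator per level (3 + 3 indicators + Lot Area).
full = pd.get_dummies(toy, columns=cat_cols)

# Reduced encoding: the first level of each variable is dropped.
reduced = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

# Each variable's indicators sum to 1 per row in the full encoding.
paved_sum = full.filter(like="Paved Drive_").sum(axis=1)

print(full.shape[1], reduced.shape[1])  # 7 5
```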


### 2. Understand p-values

##### Fit model with dummy variables and numerical variables

In [8]:
X = pd.get_dummies(base_train_data, columns = X_dummy).drop(columns = ['Unnamed: 0', 'Id','PID', 'SalePrice'])
y = base_train_data['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [9]:
OLS = sm.OLS(y, X)
results = OLS.fit()
results.summary()

Dep. Variable:,SalePrice,R-squared:,0.944
Model:,OLS,Adj. R-squared:,0.936
Method:,Least Squares,F-statistic:,119.2
Date:,"Fri, 22 Mar 2019",Prob (F-statistic):,0.0
Time:,09:30:39,Log-Likelihood:,-23072.0
No. Observations:,2049,AIC:,46650.0
Df Residuals:,1795,BIC:,48080.0
Df Model:,253,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
MS SubClass,-41.8434,57.146,-0.732,0.464,-153.923,70.236
Lot Frontage,98.6007,34.922,2.823,0.005,30.109,167.092
Lot Area,0.8211,0.111,7.380,0.000,0.603,1.039
Overall Qual,6102.2659,737.879,8.270,0.000,4655.075,7549.457
Overall Cond,5565.1169,643.845,8.644,0.000,4302.352,6827.882
Year Built,332.5518,55.680,5.973,0.000,223.348,441.756
Year Remod/Add,73.5050,41.567,1.768,0.077,-8.019,155.029
Mas Vnr Area,24.9182,4.461,5.586,0.000,16.169,33.667
BsmtFin SF 1,17.6472,1.979,8.919,0.000,13.767,21.528

Omnibus:,407.387,Durbin-Watson:,2.031
Prob(Omnibus):,0.0,Jarque-Bera (JB):,6329.789
Skew:,0.479,Prob(JB):,0.0
Kurtosis:,11.557,Cond. No.,7.82e+16


In [10]:
p_cols = results.pvalues[results.pvalues < 0.05].index
p_cols

Index(['Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond',
       'Year Built', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
       'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Gr Liv Area',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'Fireplaces', 'Garage Area',
       'Wood Deck SF', 'Screen Porch', 'Neighborhood_ClearCr',
       'Neighborhood_CollgCr', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_GrnHill', 'Neighborhood_Mitchel', 'Neighborhood_NAmes',
       'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
       'Neighborhood_SawyerW', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
       'Exterior 1st_BrkFace', 'Exterior 1st_CBlock', 'Exterior 2nd_CBlock',
       'Garage Qual_Ex', 'Garage Qual_Po', 'Garage Cond_Ex',
       'Misc Feature_TenC'],
      dtype='object')

In [11]:
# Keep the low p-value columns that are common to X_train and X_test
p_cols = ['Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond',
       'Year Built', 'Mas Vnr Area', 'BsmtFin SF 1', 'Total Bsmt SF',
       '1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'Kitchen AbvGr',
       'Garage Area', 'Wood Deck SF', 'Screen Porch', 'Neighborhood_Edwards',
       'Neighborhood_Gilbert', 'Neighborhood_NAmes',
       'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
       'Neighborhood_StoneBr', 'Exterior 1st_BrkFace',
       'Garage Cond_Ex']
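
The trimmed list above was typed by hand, and the exact rule used to shorten it is not stated. As a sketch of how a "columns common to both datasets" filter could be automated, the example below one-hot encodes two made-up toy frames and keeps only the candidate columns present in both; `candidate_cols`, `train_X`, and `test_X` are illustrative names. It also shows `reindex` as an alternative that keeps every training column and fills missing indicators with 0.

```python
import pandas as pd

# Made-up toy frames: 'GrnHill' appears only in the training data.
train_df = pd.DataFrame({
    "Neighborhood": ["NAmes", "GrnHill", "Edwards"],
    "Lot Area": [9000, 12000, 7000],
})
test_df = pd.DataFrame({
    "Neighborhood": ["NAmes", "Edwards", "Edwards"],
    "Lot Area": [8000, 7500, 6000],
})

train_X = pd.get_dummies(train_df, columns=["Neighborhood"])
test_X = pd.get_dummies(test_df, columns=["Neighborhood"])

# Candidate columns, e.g. those with low p-values in the training fit.
candidate_cols = ["Lot Area", "Neighborhood_GrnHill", "Neighborhood_NAmes"]

# Option 1: keep only candidates that exist in both encoded frames.
common_cols = [
    c for c in candidate_cols
    if c in train_X.columns and c in test_X.columns
]
print(common_cols)  # ['Lot Area', 'Neighborhood_NAmes']

# Option 2: give the test frame the training columns, filling gaps with 0.
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
```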

### 3. Build OLS Model with Extracted Features

In [12]:
#Fit linear model again to check selected columns 
OLS = sm.OLS(y, X[p_cols])
results = OLS.fit()
results.summary()

Dep. Variable:,SalePrice,R-squared:,0.981
Model:,OLS,Adj. R-squared:,0.981
Method:,Least Squares,F-statistic:,3823.0
Date:,"Fri, 22 Mar 2019",Prob (F-statistic):,0.0
Time:,09:30:46,Log-Likelihood:,-23848.0
No. Observations:,2049,AIC:,47750.0
Df Residuals:,2022,BIC:,47900.0
Df Model:,27,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Lot Frontage,193.3958,35.470,5.452,0.000,123.833,262.958
Lot Area,0.6595,0.106,6.221,0.000,0.452,0.867
Overall Qual,1.658e+04,728.375,22.767,0.000,1.52e+04,1.8e+04
Overall Cond,3523.1513,587.178,6.000,0.000,2371.614,4674.689
Year Built,-33.7100,3.252,-10.365,0.000,-40.088,-27.332
Mas Vnr Area,29.8610,4.377,6.822,0.000,21.277,38.445
BsmtFin SF 1,22.5562,1.687,13.371,0.000,19.248,25.864
Total Bsmt SF,21.8009,2.699,8.078,0.000,16.508,27.094
1st Flr SF,60.7095,3.279,18.513,0.000,54.278,67.141

Omnibus:,344.298,Durbin-Watson:,2.02
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3344.592
Skew:,0.484,Prob(JB):,0.0
Kurtosis:,9.184,Cond. No.,393000.0
