<center><h2>Sale Price Prediction for Iowa Residential Houses</h2></center>
<center>(ongoing)</center>

<center>Version 2: Advanced modeling with Decision Tree, Random Forest, and Gradient Boosting</center>

#### Changelog:
- 09/20 -- Pipeline
- 09/21 -- Decision Tree, Random Forest
- [ ] Gradient Boosting
- [ ] Cross-Validation
- [ ] Data Leakage

#### 1. Import modules and datasets

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error

In [8]:
train_file_path = "~/PROJECTS/05_iowa_house_sale_price_prediction/train.csv"
train_full = pd.read_csv(train_file_path, index_col = "Id")

In [9]:
train_full.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [10]:
test_file_path = "~/PROJECTS/05_iowa_house_sale_price_prediction/test.csv"
test_full = pd.read_csv(test_file_path, index_col = "Id")

In [11]:
test_full.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,Gar2,12500,6,2010,WD,Normal
1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,3,2010,WD,Normal
1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,6,2010,WD,Normal
1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,...,144,0,,,,0,1,2010,WD,Normal


#### 2. Initial modeling

In [12]:
# remove rows with missing target

X = train_full.copy()
X.dropna(axis=0, subset=["SalePrice"], inplace=True)

In [13]:
# specify target and features

y = X['SalePrice']
X.drop(['SalePrice'], axis=1, inplace=True)

In [14]:
# split into training data and validation data

X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                      train_size=0.8, 
                                                      test_size=0.2, 
                                                      random_state=0)

In [15]:
# select numerical columns

num_cols = [col for col in X_train.columns if X_train[col].dtype in ('int64', 'float64')]

In [17]:
# select categorical columns with relatively low cardinality

cat_cols = [col for col in X_train.columns if X_train[col].nunique() < 10 and X_train[col].dtype == 'object']

In [31]:
# keep selected columns only

my_cols = num_cols + cat_cols

X_train = X_train[my_cols].copy()
X_valid = X_valid[my_cols].copy()

X_test = test_full[my_cols].copy()

#### 3. Pipeline -- preprocessing missing data

In [32]:
# preprocessing for numerical data: fill missing values with a constant (0 by default)

num_transformer = SimpleImputer(strategy = 'constant')

In [33]:
# preprocessing for categorical data: fill missing values with the most frequent category, then one-hot encode

cat_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), 
                                  ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [34]:
# bundle preprocessing for both numerical and categorical data

my_preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_cols), 
                                                  ('cat', cat_transformer, cat_cols)])
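
To make the effect of this preprocessor concrete, here is a minimal sketch on a tiny hand-made frame. The `toy` data and the `toy_*` names are illustrative assumptions, not part of the Iowa dataset; the transformers mirror the ones defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9600.0],
    "Street": pd.Series(["Pave", np.nan, "Pave", "Grvl"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    # missing numbers become 0 (the default fill value for strategy='constant')
    ("num", SimpleImputer(strategy="constant"), ["LotArea"]),
    # missing categories become the most frequent one ("Pave"), then get one-hot encoded
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Street"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalise to a dense array
toy_dense = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

# Columns: LotArea, Street_Grvl, Street_Pave
print(toy_dense)
```

The second row shows both imputations at once: `LotArea` is filled with 0 and the missing `Street` is treated as `Pave`.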

#### 4. Decision Tree model

In [49]:
# define model

dtree_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)

In [50]:
# bundle preprocessing and modeling code in a pipeline

dtree_pipeline = Pipeline(steps=[('preprocessor', my_preprocessor), 
                                 ('model', dtree_model)])

In [51]:
# fit model

dtree_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='constant'),
                                                  ['MSSubClass', 'LotFrontage',
                                                   'LotArea', 'OverallQual',
                                                   'OverallCond', 'YearBuilt',
                                                   'YearRemodAdd', 'MasVnrArea',
                                                   'BsmtFinSF1', 'BsmtFinSF2',
                                                   'BsmtUnfSF', 'TotalBsmtSF',
                                                   '1stFlrSF', '2ndFlrSF',
                                                   'LowQualFinSF', 'GrLivArea',
                                                   'BsmtFullBath',
                                                   'BsmtHalfBath', 'FullBa...
                                                 

In [52]:
# prediction with validation data

dtree_predictions = dtree_pipeline.predict(X_valid)

In [53]:
# evaluate model

dtree_mae = mean_absolute_error(y_valid, dtree_predictions)
print("Decision Tree model MAE:", "{:,.2f}".format(dtree_mae))

Decision Tree model MAE: 26,005.27
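
The value `max_leaf_nodes=100` above is a single choice. One way to check it would be to compare validation MAE across several tree sizes. The sketch below does this on synthetic data from `make_regression` (an assumption, used so the cell runs on its own); the candidate values and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the Iowa dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0)

# Fit one tree per candidate size and record its validation MAE
leaf_mae = {}
for n_leaves in [5, 25, 100, 250]:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=1)
    tree.fit(Xd_train, yd_train)
    leaf_mae[n_leaves] = mean_absolute_error(yd_valid, tree.predict(Xd_valid))

# Candidate with the lowest validation MAE
best_leaves = min(leaf_mae, key=leaf_mae.get)

for n_leaves, mae in leaf_mae.items():
    print("max_leaf_nodes:", n_leaves, "MAE:", "{:,.2f}".format(mae))
print("Best max_leaf_nodes:", best_leaves)
```

With the real data, the same loop could wrap `dtree_pipeline` instead of a bare tree, so that preprocessing is refit for each candidate.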


#### 5. Random Forest model

In [36]:
# define model

rf_model = RandomForestRegressor(n_estimators=100, random_state=0)

In [37]:
# bundle preprocessing and modeling code in a pipeline

rf_pipeline = Pipeline(steps=[('preprocessor', my_preprocessor), 
                              ('model', rf_model)])

In [38]:
# fit model

rf_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='constant'),
                                                  ['MSSubClass', 'LotFrontage',
                                                   'LotArea', 'OverallQual',
                                                   'OverallCond', 'YearBuilt',
                                                   'YearRemodAdd', 'MasVnrArea',
                                                   'BsmtFinSF1', 'BsmtFinSF2',
                                                   'BsmtUnfSF', 'TotalBsmtSF',
                                                   '1stFlrSF', '2ndFlrSF',
                                                   'LowQualFinSF', 'GrLivArea',
                                                   'BsmtFullBath',
                                                   'BsmtHalfBath', 'FullBa...
                                                 

In [39]:
# prediction with validation data

rf_predictions = rf_pipeline.predict(X_valid)

In [42]:
# evaluate model

rf_mae = mean_absolute_error(y_valid, rf_predictions)
print("Random Forest model MAE:", "{:,.2f}".format(rf_mae))

Random Forest model MAE: 17,861.78
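
Both MAE values above come from a single 80/20 split, so they depend on which rows happened to land in the validation set. A minimal sketch of the cross-validation step listed in the changelog, assuming scikit-learn's `cross_val_score` and synthetic data in place of the Iowa features (the `*_cv` names are hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the Iowa dataset)
X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Same pipeline shape as above: preprocessing first, then the model
cv_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negative MAE, so multiply by -1 to get MAE per fold
cv_scores = -1 * cross_val_score(cv_pipeline, X_cv, y_cv,
                                 cv=5, scoring="neg_mean_absolute_error")

print("MAE per fold:", cv_scores)
print("Average MAE:", "{:,.2f}".format(cv_scores.mean()))
```

Because the whole pipeline is passed to `cross_val_score`, the imputer is refit on the training part of each fold, which keeps validation rows out of the preprocessing step.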


#### 6. Gradient Boosting model
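
A minimal sketch of how a gradient boosting model could be fit and scored in the same way as the two models above. It assumes scikit-learn's `GradientBoostingRegressor` (the notebook has not settled on a library) and uses synthetic data so the cell runs on its own. The hyperparameter values and the `*_gb` / `Xg_*` names are illustrative, not tuned.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Iowa dataset)
X_gb, y_gb = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
Xg_train, Xg_valid, yg_train, yg_valid = train_test_split(
    X_gb, y_gb, train_size=0.8, test_size=0.2, random_state=0)

# define model (illustrative, untuned hyperparameters)
gb_model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=0)

# fit model
gb_model.fit(Xg_train, yg_train)

# prediction with validation data
gb_predictions = gb_model.predict(Xg_valid)

# evaluate model
gb_mae = mean_absolute_error(yg_valid, gb_predictions)
print("Gradient Boosting model MAE:", "{:,.2f}".format(gb_mae))
```

On the real data, `gb_model` would presumably go into a `Pipeline` after `my_preprocessor`, as done for the decision tree and random forest.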