<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 10 |  Nonparametric Ensemble Models</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Preparation and Exploration</h2>
<br><h4>a) Imports and Initial Setup</h4>
Run the following code to import packages and load the dataset into Python.

In [None]:
# importing critical libraries
import pandas            as pd                 # data science essentials
import matplotlib.pyplot as plt                # data visualization
import seaborn           as sns                # enhanced data viz


# importing machine learning models
from sklearn.tree     import DecisionTreeRegressor     # regression trees
from sklearn.ensemble import RandomForestRegressor     # random forest
from sklearn.ensemble import GradientBoostingRegressor # gbm


# importing machine learning tools
from sklearn.model_selection import train_test_split # train-test split
from sklearn.tree import plot_tree                   # tree plots


# loading data
housing = pd.read_excel('./datasets/housing_feature_rich.xlsx')


# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


# displaying the head of the dataset
housing.head(n = 5)

<br>

In [None]:
# declaring sets of x-variables
x_variables = ['Garage_Cars', 'Overall_Qual', 'Total_Bsmt_SF',
               'NridgHt', 'Kitchen_AbvGr', 'has_Second_Flr',
               'Mas_Vnr_Area', 'has_Garage', 'Porch_Area',
               'NWAmes', 'OldTown', 'Overall_Cond',
               'Edwards', 'Somerst', 'Fireplaces',
               'Second_Flr_SF', 'First_Flr_SF', 'has_Mas_Vnr',
               'CulDSac', 'Total_Bath', 'Crawfor', 'Garage_Area',
               'has_Porch']


full_x = ['Overall_Qual', 'Overall_Cond', 'Mas_Vnr_Area', 'Total_Bsmt_SF',
          'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area', 'Full_Bath',
          'Half_Bath', 'Kitchen_AbvGr', 'TotRms_AbvGr', 'Fireplaces',
          'Garage_Cars', 'Garage_Area', 'Porch_Area', 'log_Lot_Area',
          'has_Second_Flr', 'has_Garage', 'has_Mas_Vnr', 'has_Porch',
          'Total_Bath', 'CulDSac', 'BrkSide', 'CollgCr', 'Crawfor',
          'Edwards', 'Gilbert', 'Mitchel', 'NWAmes', 'NridgHt', 'OldTown',
          'Sawyer', 'SawyerW', 'Somerst', 'Other_NH']


reduced_x = ['Overall_Qual', 'Gr_Liv_Area', 'Full_Bath',
             'Kitchen_AbvGr', 'TotRms_AbvGr', 'Fireplaces',
             'Garage_Cars', 'Garage_Area', 'Porch_Area', 
             'log_Lot_Area', 'has_Second_Flr', 'has_Garage',
             'has_Mas_Vnr', 'has_Porch', 'Total_Bath', 'CulDSac']

<br>

In [None]:
# preparing x-features
x_data = housing[reduced_x]

# preparing y-feature
y_data = housing.loc[ : , 'Sale_Price']

<br>

In [None]:
help(train_test_split)

<br>

In [None]:
# train-test split
x_train, x_test, y_train, y_test = train_test_split(
            x_data,
            y_data,
            test_size    = 0.25,
            random_state = 219 )

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Random Forest</h2><br>
A random forest can be thought of as a group of decision trees that are all slightly different from each other. Each tree is built on a random sample of the training observations (a bootstrap sample when bootstrap = True), and the model can also restrict each split to a random subset of explanatory variables. After building several trees, each observation has several different results for its predicted value. This can be thought of as giving each tree a voice as to what the final prediction should be for each observation.

For example, in a classification problem, one observation may have been voted positive 80% of the time (the event in question occurred) and negative 20% of the time (the event in question did not occur). After all votes have been cast, whichever class has the most votes wins, and prediction on the observation is complete. In a regression problem such as this one, there are no classes to vote on, so the final prediction is the average of the predictions from all of the trees.<br><br>
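To make the regression case concrete, here is a minimal sketch that fits a small forest and checks that its prediction matches the average of its individual trees. The data and the names x_demo, y_demo, and forest are made up for illustration and are not part of the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data (hypothetical, for illustration only)
rng    = np.random.default_rng(seed = 219)
x_demo = rng.uniform(low = 0, high = 10, size = (200, 3))
y_demo = 3 * x_demo[:, 0] - 2 * x_demo[:, 1] + rng.normal(size = 200)

# a small forest so the individual trees are easy to inspect
forest = RandomForestRegressor(n_estimators = 10,
                               random_state = 219)
forest.fit(x_demo, y_demo)

# each fitted tree gives its own predicted value (its "voice")
tree_preds = np.array([tree.predict(x_demo[:5])
                       for tree in forest.estimators_])

# the forest's prediction is the average across its trees
manual_avg  = tree_preds.mean(axis = 0)
forest_pred = forest.predict(x_demo[:5])

print('Average of trees  :', manual_avg.round(2))
print('Forest prediction :', forest_pred.round(2))
```

The two printed rows should agree, which is the regression counterpart of the majority vote described above.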
<h4>a) Build a random forest model.</h4>
Instantiate a random forest model using the default values for its hyperparameters. You know how to do this. Here is a help file to "help" you out. :)

In [None]:
help(RandomForestRegressor)

<br>

In [None]:
# specifying a model name
model_name = 'Unpruned Random Forest'


# INSTANTIATING a random forest model with default values
model = RandomForestRegressor(n_estimators     = 100,
                              criterion        = 'squared_error', # named 'mse' before scikit-learn 1.0
                              max_depth        = None,
                              min_samples_leaf = 1,
                              bootstrap        = True,
                              warm_start       = False,
                              random_state     = 219)

<br>

In [None]:
# FITTING the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING based on the testing set
model_pred = model.predict(x_test)


# SCORING the results
model_train_score = round(model.score(x_train, y_train), 4) # using R-square
model_test_score  = round(model.score(x_test, y_test), 4)   # using R-square
model_gap         = round(abs(model_train_score - model_test_score), 4)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h3>Tuned Random Forest</h3>

In [None]:
# specifying a model name
model_name = 'Pruned Random Forest'


# INSTANTIATING a random forest model with tuned hyperparameters
model = RandomForestRegressor(n_estimators     = 650,
                              criterion        = 'squared_error', # named 'mse' before scikit-learn 1.0
                              max_depth        = 8,
                              min_samples_leaf = 25,
                              bootstrap        = True,
                              warm_start       = False,
                              random_state     = 219)


# FITTING the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING based on the testing set
model_pred = model.predict(x_test)


# SCORING the results
model_train_score = round(model.score(x_train, y_train), 4) # using R-square
model_test_score  = round(model.score(x_test, y_test), 4)   # using R-square
model_gap         = round(abs(model_train_score - model_test_score), 4)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)

<br>
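The feature importance cells in this script call plot_feature_importances, which is not defined anywhere in this script and is presumably a user-defined helper from earlier material. The following is a minimal sketch of what such a helper could look like, assuming it takes a fitted tree-based model, the training DataFrame, and an export flag. The plot styling, the returned Series, and the export file name are assumptions, not the original implementation.

```python
import pandas            as pd
import matplotlib.pyplot as plt


def plot_feature_importances(model, train, export = False):
    """
    Plots the feature importances of a fitted tree-based model.

    model  : fitted model with a feature_importances_ attribute
    train  : DataFrame of training features (column names label the bars)
    export : whether to save the plot (the file name below is an assumption)
    """
    # pairing each importance with its column name, smallest first
    importances = pd.Series(data  = model.feature_importances_,
                            index = train.columns).sort_values()

    # horizontal bar chart, most important feature on top
    fig, ax = plt.subplots(figsize = (12, 9))
    ax.barh(y = importances.index, width = importances.values)
    ax.set_xlabel('Feature importance')
    ax.set_ylabel('Feature')
    fig.tight_layout()

    # optionally saving the plot
    if export:
        fig.savefig('./feature_importance.png')  # hypothetical file name

    plt.show()
    return importances
```

Run this definition before the plotting cells so that the calls with model, train = x_train, and export = False resolve.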

In [None]:
# plotting feature importance
plot_feature_importances(model          ,
                         train = x_train,
                         export = False )

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Gradient Boosted Machines</h2><br>
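Gradient boosted machines (GBMs) are also built from decision trees, but instead of starting fresh with each iteration, each new tree learns from the errors of the trees before it. Unlike random forest, where trees are built independently and then averaged, a GBM builds its trees one after another: each new tree is fit to the residuals (more generally, the negative gradient of the loss function) left by the current ensemble, so the rows that are predicted poorly have the most influence on the next tree. The learning rate shrinks the contribution of each tree, and there is a trade-off between the learning rate and the number of estimators: a smaller learning rate generally needs more trees to reach the same fit.<br><br>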
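The idea of learning from previous iterations can be sketched by hand for squared error loss. This is a simplified illustration on made-up data, not the internals of GradientBoostingRegressor; the function boost and the names x_demo and y_demo are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data (hypothetical, for illustration only)
rng    = np.random.default_rng(seed = 219)
x_demo = rng.uniform(low = 0, high = 10, size = (300, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(scale = 0.2, size = 300)


def boost(x, y, learning_rate, n_estimators):
    """Hand-rolled boosting for squared error; returns training MSE per round."""
    pred     = np.full(shape = len(y), fill_value = y.mean())  # initial guess
    mse_path = [np.mean((y - pred) ** 2)]

    for _ in range(n_estimators):
        residuals = y - pred                     # what is still unexplained
        tree = DecisionTreeRegressor(max_depth = 2, random_state = 219)
        tree.fit(x, residuals)                   # learning from previous errors
        pred = pred + learning_rate * tree.predict(x)  # shrunken update
        mse_path.append(np.mean((y - pred) ** 2))

    return mse_path


# same number of trees, different learning rates
fast = boost(x_demo, y_demo, learning_rate = 0.50, n_estimators = 20)
slow = boost(x_demo, y_demo, learning_rate = 0.05, n_estimators = 20)

print('Training MSE before boosting   :', round(fast[0], 4))
print('Training MSE, rate 0.50        :', round(fast[-1], 4))
print('Training MSE, rate 0.05        :', round(slow[-1], 4))
```

Comparing the last two printed values for the same number of trees gives a feel for the trade-off: the smaller the learning rate, the smaller each tree's step, so more trees are typically needed to reach a comparable training fit.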

In [None]:
help(GradientBoostingRegressor)

<br>

In [None]:
# specifying a model name
model_name = 'Unpruned GBM'


# INSTANTIATING the model object
model = GradientBoostingRegressor(loss          = 'squared_error', # named 'ls' before scikit-learn 1.0
                                  learning_rate = 0.5,
                                  n_estimators  = 200,
                                  criterion     = 'friedman_mse',
                                  max_depth     = 5,
                                  warm_start    = False,
                                  random_state  = 219)


# FITTING the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING based on the testing set
model_pred = model.predict(x_test)


# SCORING the results
model_train_score = round(model.score(x_train, y_train), 4) # using R-square
model_test_score  = round(model.score(x_test, y_test), 4)   # using R-square
model_gap         = round(abs(model_train_score - model_test_score), 4)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)

<br>

In [None]:
# plotting feature importance
plot_feature_importances(model,
                         train = x_train,
                         export = False)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~
      ___  ___  __                 
|__/ |__  |__  |__)                
|  \ |___ |___ |                   
                                   
 __   __   __               __    /
/ _` |__) /  \ |  | | |\ | / _`  / 
\__> |  \ \__/ |/\| | | \| \__> .  



~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>