<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 08 | Nonparametric Modeling with Regression Trees</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Preparation and Exploration</h2>
<br><h4>a) Imports and Initial Setup</h4>
Run the following code to import packages and load the dataset into Python.

In [None]:
# importing libraries
import matplotlib.pyplot as plt                        # data visualization
import numpy as np                                     # mathematical essentials
import pandas as pd                                    # data science essentials
from sklearn.model_selection import train_test_split   # train-test split
from sklearn.tree import DecisionTreeRegressor         # regression trees
from sklearn.tree import plot_tree                     # tree plots
import warnings                                        # warnings from code



# new libraries
from sklearn.preprocessing import StandardScaler       # standard scaler

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


# suppressing warnings
warnings.filterwarnings(action = 'ignore')


# specifying the path and file name
file = './datasets/housing_feature_rich.xlsx'


# reading the file into Python
housing = pd.read_excel(io     = file,
                        header = 0   )


housing.drop(labels  = ['property_id'],
             axis    = 1,
             inplace = True)


#####################################
# importing model coefficients file #
#####################################
coef_file = "./model_results/Model_Coefficients.xlsx"

coef_df   = pd.read_excel(io     = coef_file,
                          header = 0        )



# checking housing dataset
housing.head(n = 5)

<br>

In [None]:
#################
## full models ##
#################

# all x-data
x_all = list(housing.drop(labels  = ['Sale_Price', 'log_Sale_Price'],
                          axis    = 1))

# original x-data
x_original = list(housing.loc[ : , 'Lot_Area' : 'Porch_Area' ])



################
## original y ##
################

# best base model 
x_base = ['Mas_Vnr_Area',  'Total_Bsmt_SF', 'First_Flr_SF',
          'Second_Flr_SF', 'Garage_Area']


# best model after feature engineering
x_rich = ['Lot_Area', 'Garage_Cars', 'Overall_Qual', 'Total_Bsmt_SF',
          'NridgHt', 'Kitchen_AbvGr', 'has_Second_Flr',
          'Mas_Vnr_Area', 'has_Garage', 'Porch_Area',
          'NWAmes', 'OldTown', 'Overall_Cond', 'NAmes',
          'Edwards', 'Somerst', 'Fireplaces', 'Second_Flr_SF',
          'First_Flr_SF', 'has_Mas_Vnr', 'CulDSac', 'Total_Bath',
          'Crawfor', 'Garage_Area', 'has_Porch']



###################
## logarithmic y ##
###################

# best model after feature engineering (log y)
x_rich_log_y = ['Lot_Area', 'First_Flr_SF', 'Second_Flr_SF', 'Garage_Cars' ,
                'Overall_Qual', 'Overall_Cond', 'Total_Bsmt_SF', 'OldTown',
                'Kitchen_AbvGr', 'Total_Bath', 'has_Second_Flr', 'NridgHt',
                'Fireplaces', 'Porch_Area', 'Somerst', 'CollgCr', 'Crawfor',
                'CulDSac', 'NWAmes', 'Edwards', 'Gilbert']



########################
## response variables ##
########################
original_y = 'Sale_Price'
log_y      = 'log_Sale_Price'

<br>

In [None]:
# preparing x-data
x_data = housing[ x_original ]

# preparing y-data
y_data = housing[ original_y ]

<br>
<h4>Should we standardize the data? Will it make a difference in a nonparametric model? Explore the results to find out!</h4>

In [None]:
# INSTANTIATING a StandardScaler() object
scaler = StandardScaler()


# FITTING and TRANSFORMING
x_scaled = scaler.fit_transform(x_data)


# converting scaled data into a DataFrame
x_scaled_df = pd.DataFrame(x_scaled)


# labeling columns
x_scaled_df.columns = x_data.columns


x_data = x_scaled_df.copy()

# checking the results
x_data.describe(include = 'number').round(decimals = 2)

<br>
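
The question above can also be checked in isolation. The following is a minimal sketch, assuming small synthetic data (not the housing dataset) with made-up names such as `x_demo` and `tree_raw`. It fits the same regression tree on raw and standardized features and compares the predictions. Because tree splits depend only on the ordering of values within each feature, standardizing would not be expected to change the result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (an assumption for illustration):
# two features on very different scales
rng = np.random.default_rng(seed=702)
big_feature   = rng.uniform(0, 1000, size=400)
small_feature = rng.uniform(0, 1, size=400)
x_demo = np.column_stack([big_feature, small_feature])

# nonlinear response with a little noise
y_demo = (np.where(big_feature > 500, 50.0, 10.0)
          + 20 * small_feature
          + rng.normal(0, 1, size=400))

# standardizing the features
x_demo_scaled = StandardScaler().fit_transform(x_demo)

# fitting identical trees on raw and standardized data
tree_raw    = DecisionTreeRegressor(max_depth=3, random_state=702)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=702)
tree_raw.fit(x_demo, y_demo)
tree_scaled.fit(x_demo_scaled, y_demo)

# comparing predictions from both trees
pred_raw    = tree_raw.predict(x_demo)
pred_scaled = tree_scaled.predict(x_demo_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)

print('Predictions match:', same_predictions)
```

Try the same comparison on the housing data by running the notebook once with the scaling cell and once without it.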

In [None]:
# train-test split
x_train, x_test, y_train, y_test = train_test_split(
            x_data,
            y_data,
            test_size    = 0.25,
            random_state = 702)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Regression Trees (CART Models)</h2><br>
CART (Classification and Regression Tree) models are very useful in regression problems because they produce interpretable outputs such as <strong>tree plots</strong> and <strong>feature importance</strong>. As a nonparametric model type, they have no coefficients. <strong>They also assume no model form, meaning that we do not need to transform any features or engineer new ones.</strong> CART models are meant to work out of the box.<br><br>

<strong>CART Model Highlights</strong><br>

* tend to overfit unless pruned
* tend to be worse at prediction than other model types (after pruning)
* can generate very useful outputs for developing hypotheses and data-driven findings
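
To make "no coefficients" and "no model form" concrete, here is a small sketch on synthetic data (the names `x_demo`, `y_demo`, and `tree` are placeholders, not part of this script). It checks that a fitted regression tree has no `coef_` attribute and that its predictions take only as many distinct values as the tree has leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic one-feature data with a curved relationship (illustrative only)
rng = np.random.default_rng(seed=702)
x_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(0, 0.1, size=300)

# fitting a shallow regression tree
tree = DecisionTreeRegressor(max_depth=3, random_state=702)
tree.fit(x_demo, y_demo)

# a tree stores splits and leaf values rather than coefficients
has_coefficients = hasattr(tree, 'coef_')

# each leaf predicts one constant value
n_leaves       = tree.get_n_leaves()
n_unique_preds = len(np.unique(tree.predict(x_demo)))

print('Has coefficients   :', has_coefficients)
print('Number of leaves   :', n_leaves)
print('Unique predictions :', n_unique_preds)
```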


Run the following code to load a user-defined function for CART model output.

In [None]:
########################################
# plot_feature_importances
########################################
def plot_feature_importances(model, train, export = False):
    """
    Plots the importance of features from a CART model.
    
    PARAMETERS
    ----------
    model  : CART model
    train  : explanatory variable training data
    export : whether or not to export as a .png image, default False
    """
    
    # declaring the number of features
    n_features = train.shape[1]
    
    # setting plot window
    fig, ax = plt.subplots(figsize=(12,9))
    
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), train.columns)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    
    # exporting the plot if requested
    if export:
        plt.savefig('Tree_Leaf_50_Feature_Importance.png')

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Run the following code to develop a regression tree model. Notice that it overfits the training data because it was allowed to keep growing until no further splits were possible.

In [None]:
model_name = 'Unpruned Regression Tree'

# INSTANTIATING a model object
model = DecisionTreeRegressor(random_state = 702)


# FITTING to the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING on new data
model_pred = model.predict(x_test)


# SCORING the results
model_train_score = model.score(x_train, y_train) # using R-square
model_test_score  = model.score(x_test, y_test)   # using R-square
model_gap         = abs(model_train_score - model_test_score)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)
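
One way to see why an unpruned tree overfits is to inspect its size. The sketch below uses synthetic data and placeholder names (`x_tr`, `unpruned`, and so on are assumptions for illustration). It relies on the `get_depth()` and `get_n_leaves()` methods of a fitted tree to compare the number of leaves with the number of training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic noisy data (illustrative only, not the housing dataset)
rng = np.random.default_rng(seed=702)
x_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 5 * x_demo[:, 0] + rng.normal(0, 5, size=400)

x_tr, x_te, y_tr, y_te = train_test_split(
            x_demo,
            y_demo,
            test_size    = 0.25,
            random_state = 702)

# fitting a tree with no growth restrictions
unpruned = DecisionTreeRegressor(random_state = 702)
unpruned.fit(x_tr, y_tr)

# scoring with R-square
train_r2 = unpruned.score(x_tr, y_tr)
test_r2  = unpruned.score(x_te, y_te)

# a tree with roughly one leaf per training row has memorized the noise
print('Training rows    :', len(x_tr))
print('Tree depth       :', unpruned.get_depth())
print('Number of leaves :', unpruned.get_n_leaves())
print('Training Score   :', round(train_r2, 4))
print('Testing Score    :', round(test_r2, 4))
```

The same two methods can be called on the `model` object above to see how large the housing tree grew.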

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h4>a) Develop a new regression tree.</h4>
Develop a regression tree with a maximum depth of 4 and a minimum number of samples per leaf of 25. The following help file may be useful in approaching this challenge.

In [None]:
help(DecisionTreeRegressor)

<br>

In [None]:
model_name = 'Pruned Regression Tree'

# INSTANTIATING a model object - CHANGE THIS AS NEEDED
model = DecisionTreeRegressor(_____,
                              _____,
                              _____)


# FITTING to the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING on new data
model_pred = model.predict(x_test)


# SCORING the results
model_train_score = round(model.score(x_train, y_train), 4) # using R-square
model_test_score  = round(model.score(x_test, y_test), 4)   # using R-square
model_gap         = round(abs(model_train_score - model_test_score), 4)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)

In [None]:
model_name = 'Pruned Regression Tree'

# INSTANTIATING a model object - CHANGE THIS AS NEEDED
model = DecisionTreeRegressor(max_depth        = 4,
                              min_samples_leaf = 25,
                              random_state     = 702)


# FITTING to the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING on new data
model_pred = model.predict(x_test)


# SCORING the results
model_train_score = model.score(x_train, y_train) # using R-square
model_test_score  = model.score(x_test, y_test)   # using R-square
model_gap         = abs(model_train_score - model_test_score)


# displaying results
print('Training Score :', model_train_score)
print('Testing Score  :', model_test_score)
print('Train-Test Gap :', model_gap)
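
The effect of pruning can also be explored by looping over several depth limits. This is a rough sketch on synthetic data, not a tuning recipe for the housing dataset. The depth values and the `gaps` dictionary are choices made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic noisy data (illustrative only)
rng = np.random.default_rng(seed=702)
x_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 5 * x_demo[:, 0] + rng.normal(0, 5, size=400)

x_tr, x_te, y_tr, y_te = train_test_split(
            x_demo,
            y_demo,
            test_size    = 0.25,
            random_state = 702)

# storing results for each depth limit (None means unrestricted growth)
train_scores = {}
gaps         = {}

for depth in [2, 4, 8, None]:

    # INSTANTIATING and FITTING
    tree = DecisionTreeRegressor(max_depth    = depth,
                                 random_state = 702)
    tree.fit(x_tr, y_tr)

    # SCORING with R-square
    train_r2 = tree.score(x_tr, y_tr)
    test_r2  = tree.score(x_te, y_te)

    train_scores[depth] = train_r2
    gaps[depth]         = abs(train_r2 - test_r2)

    print(f"max_depth = {str(depth):<4} | "
          f"train: {train_r2:.4f} | test: {test_r2:.4f} | "
          f"gap: {gaps[depth]:.4f}")
```

On data like this, the train-test gap would be expected to widen as the depth limit is relaxed, which is the pattern pruning is meant to control.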

<br>

In [None]:
# setting figure size
plt.figure(figsize=(50, 10)) # adjusting to better fit the visual


# developing a plotted tree
plot_tree(decision_tree = model, # pruned tree fitted above
          feature_names = x_train.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 14)


# rendering the plot
plt.show()
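
When a plotted tree is too wide to read comfortably, a text version of the same rules can help. The sketch below uses `export_text` from `sklearn.tree` on a small synthetic tree. The feature names `feat_a` and `feat_b` are invented for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# synthetic data where the response depends mostly on the first feature
rng = np.random.default_rng(seed=702)
x_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 10 * x_demo[:, 0] + rng.normal(0, 1, size=300)

# fitting a small tree
tree = DecisionTreeRegressor(max_depth        = 2,
                             min_samples_leaf = 25,
                             random_state     = 702)
tree.fit(x_demo, y_demo)

# converting the fitted tree into indented if/else style rules
tree_rules = export_text(tree,
                         feature_names = ['feat_a', 'feat_b'],
                         decimals      = 2)

print(tree_rules)
```

Each indented line is a split condition, and each `value` line is the prediction for a leaf.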

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

CART models include a useful tool for evaluating feature performance. This tool, known as <strong>feature importance</strong>, indicates how "important" each feature is in terms of splitting the data into nodes. Run the user-defined function below to see the results of this tool.

In [None]:
# plotting feature importance
plot_feature_importances(model,
                         train = x_train,
                         export = False)
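
The values behind the bar chart can also be read directly from the fitted model's `feature_importances_` attribute. The following sketch, built on synthetic data with made-up column names (`strong`, `weak`, `noise`), places them in a sorted pandas Series. In scikit-learn these importances are normalized, so they sum to one.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# synthetic data: one strong driver, one weak driver, one irrelevant feature
rng = np.random.default_rng(seed=702)
x_demo = pd.DataFrame(rng.uniform(0, 10, size=(400, 3)),
                      columns = ['strong', 'weak', 'noise'])
y_demo = (10 * x_demo['strong']
          + 1 * x_demo['weak']
          + rng.normal(0, 1, size=400))

# fitting a pruned tree
tree = DecisionTreeRegressor(max_depth        = 4,
                             min_samples_leaf = 25,
                             random_state     = 702)
tree.fit(x_demo, y_demo)

# pairing importances with feature names and sorting
importances = pd.Series(tree.feature_importances_,
                        index = x_demo.columns)
importances = importances.sort_values(ascending = False)

print(importances.round(decimals = 4))
print('Sum of importances:', round(importances.sum(), 4))
```

A feature with an importance of zero was never used for a split, which is different from saying it has no relationship with the response.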

<br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~
      ___  ___  __                 
|__/ |__  |__  |__)                
|  \ |___ |___ |                   
                                   
 __   __   __               __    /
/ _` |__) /  \ |  | | |\ | / _`  / 
\__> |  \ \__/ |/\| | | \| \__> .  



~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>