<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 04 | Classification Trees</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Preparing to Model</h2>
<h4>a) Imports and Loading the Dataset</h4>
Complete the code to import packages and load the 'titanic_feature_rich.xlsx' dataset into Python as <strong>titanic</strong>.

In [None]:
## Package Imports ##
# fundamentals
import numpy             as np                       # mathematical essentials
import pandas            as pd                       # data science essentials
import matplotlib.pyplot as plt                      # data visualization
import seaborn           as sns                      # enhanced data viz
import warnings

# scaling and scoring
from sklearn.preprocessing import StandardScaler     # standard scaler
from sklearn.metrics import make_scorer               # customizable scorer


# machine learning
from sklearn.model_selection import train_test_split # train-test split
from sklearn.metrics import confusion_matrix         # confusion matrix
from sklearn.metrics import roc_auc_score            # auc score
from sklearn.tree import DecisionTreeClassifier      # classification trees
from sklearn.tree import plot_tree                   # tree plots


## Data Import ##
titanic = _____


## Options ##
# setting pandas print options and suppressing warnings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)
warnings.simplefilter(action = 'ignore', category = UserWarning)


## Results ##
# displaying the head of the dataset
titanic.head(n = 5)

In [None]:
## Package Imports ##
# fundamentals
import numpy             as np                       # mathematical essentials
import pandas            as pd                       # data science essentials
import matplotlib.pyplot as plt                      # data visualization
import seaborn           as sns                      # enhanced data viz
import warnings

# scaling and scoring
from sklearn.preprocessing import StandardScaler     # standard scaler
from sklearn.metrics import make_scorer               # customizable scorer


# machine learning
from sklearn.model_selection import train_test_split # train-test split
from sklearn.metrics import confusion_matrix         # confusion matrix
from sklearn.metrics import roc_auc_score            # auc score
from sklearn.tree import DecisionTreeClassifier      # classification trees
from sklearn.tree import plot_tree                   # tree plots


## Data Import ##
titanic = pd.read_excel('./datasets/titanic_feature_rich.xlsx')


## Options ##
# setting pandas print options and suppressing warnings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)
warnings.simplefilter(action = 'ignore', category = UserWarning)


## Results ##
# displaying the head of the dataset
titanic.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Make the lifeboat feature more user friendly.</h4>
Write code to reverse <em>m_boat</em>. In other words, the original zeros should become ones and the original ones should become zeros. Note that <strong>this is a classic technical interview question</strong>.

In [None]:
# reversing m_boat
titanic['lifeboat'] = _____

In [None]:
# reversing m_boat
titanic['lifeboat'] = abs(titanic['m_boat'] - 1)

<br>

In [None]:
# checking results
titanic[  ['m_boat', 'lifeboat']  ].value_counts()
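
There is more than one way to flip a 0/1 column, which is part of why this question shows up in interviews. The sketch below compares three common approaches on a small made-up Series (a hypothetical stand-in for <em>m_boat</em>, not the real data) and confirms that they agree.

```python
import pandas as pd

# hypothetical stand-in for titanic['m_boat'] (values invented for illustration)
m_boat = pd.Series([0, 1, 1, 0, 1], name = 'm_boat')

# three equivalent ways to flip a 0/1 column
flip_abs      = abs(m_boat - 1)   # approach used in the solution above
flip_subtract = 1 - m_boat        # subtracting from one
flip_xor      = m_boat ^ 1        # bitwise XOR with one (integer columns only)

# checking that all three approaches agree
print(flip_abs.tolist())
print(flip_subtract.tolist())
print(flip_xor.tolist())
```

All three rely on the column containing only zeros and ones; any other value would need to be handled first.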

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>User-Defined Functions</strong><br>
Run the following code to load the user-defined functions used throughout this notebook.

In [None]:
############################
## classification_summary ##
############################
def classification_summary(x,
                           y,
                           model,
                           model_name   = "",
                           results_df   = None,
                           tt_split     = True,
                           test_size    = 0.25,
                           scale        = False,
                           full_tree    = False,
                           random_state = 702):
    """  
    This function is designed to generate summary statistics for the following
    classification models from scikit-learn:
    * LogisticRegression         - Logistic Regression
    * DecisionTreeClassifier     - Classification Tree
    * RandomForestClassifier     - Random Forest
    * GradientBoostingClassifier - Gradient Boosted Machine


    Additional Functionality
    ------------------------
    This function can standardize the data using StandardScaler() (when
    scale = True) and creates training and testing sets using train-test
    split, stratifying the y-variable.
    
    It will also output a tabular confusion matrix, calculate area under the
    ROC curve (AUC) for the training and testing sets, as well as the train-
    test gap.
    

    PARAMETERS
    ----------
    x            | array     | X-data before train-test split | No default.
    y            | array     | y-data before train-test split | No default.
    model        | model     | model object to instantiate    | No default.
    model_name   | str       | option to name the model       | Default = ""
    results_df   | DataFrame | place to store model results   | Default = None
    test_size    | float     | test set proportion            | Default = 0.25
    scale        | bool      | whether to scale the data      | Default = False
    random_state | int       | seed for train-test split      | Default = 702
    """
    
    ###########
    # scaling #
    ###########
    
    if scale == True:
        # instantiating a StandardScaler() object
        scaler = StandardScaler(copy = True)


        # FITTING the scaler with the data
        scaler.fit(x)

        # TRANSFORMING our data after fit
        x_scaled = scaler.transform(x)

        # converting scaled data into a DataFrame
        x_scaled_df = pd.DataFrame(x_scaled)

        # reattaching column names
        x_scaled_df.columns = list(x.columns)

        # reverting back to x as the DataFrame's name
        x = x_scaled_df
    
    
    ####################
    # train-test split #
    ####################
    # standard train-test split
    x_train, x_test, y_train, y_test = train_test_split(x, # x
                                                        y, # y
                                                        test_size    = test_size,
                                                        random_state = random_state,
                                                        stratify     = y)
    
    
    #########################
    # fit - predict - score #
    #########################
    # fitting to training data
    model_fit = model.fit(x_train, y_train)


    # predicting on new data
    model_pred = model.predict(x_test)


    # scoring results
    model_train_auc   = round(roc_auc_score(y_true  = y_train,
                              y_score = model.predict(x_train)), ndigits = 4) # auc
    
    model_test_auc    = round(roc_auc_score(y_true  = y_test,
                              y_score = model.predict(x_test)),  ndigits = 4) # auc

    model_gap         = round(abs(model_train_auc - model_test_auc), ndigits = 4)

    
    ####################
    # confusion matrix #
    ####################
    full_tree_tn, \
    full_tree_fp, \
    full_tree_fn, \
    full_tree_tp = confusion_matrix(y_true = y_test, y_pred = model_pred).ravel()

    
    ###########################
    # storing/showing results #
    ###########################
    # instantiating a list to store model results
    results_lst = [ model_name, model_train_auc, model_test_auc, model_gap ]

    # converting to DataFrame
    results_lst = pd.DataFrame(data = results_lst)

    # transposing (rotating) DataFrame
    results_lst = np.transpose(a = results_lst)
    
    # adding column names
    results_columns = ['Model Name', 'train_auc', 'test_auc', 'tt_gap']
    
    # renaming columns so the new row lines up with any existing results
    results_lst.columns = results_columns
    
    # if no results DataFrame provided
    if results_df is None:

        # starting a new results DataFrame
        results_df = results_lst
    
    # if results DataFrame provided
    else:
        
        # appending the new row to the existing results
        results_df = pd.concat(objs = [results_df, results_lst],
                               axis         = 0,
                               ignore_index = True)
    
    
    print(f"""
    Results for {model_name}
    {'=' * 20}
    Model Type: {model}
    Training Samples: {len(x_train)} 
    Testing  Samples: {len(x_test)}
    
    
    Summary Statistics
    ------------------
    AUC (Train): {model_train_auc}
    AUC (Test) : {model_test_auc}
    TT Gap     : {model_gap}
    
    
    Confusion Matrix (test set)
    ---------------------------
    True Negatives : {full_tree_tn}
    False Positives: {full_tree_fp}
    False Negatives: {full_tree_fn}
    True Positives : {full_tree_tp}
    """)
    

########################################
# plot_feature_importances
########################################
def plot_feature_importances(model, train, export = False):
    """
    Plots the importance of features from a CART model.
    
    PARAMETERS
    ----------
    model  : CART model
    train  : DataFrame of x-features used to fit the model (i.e., x_data)
    export : whether or not to export as a .png image, default False
    """
    
    # declaring the number of features
    n_features = train.shape[1]
    
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), train.columns)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    
    if export == True:
        plt.savefig('Feature_Importance_Plot.png')
        
        
########################################
# visual_cm
########################################
def visual_cm(true_y, pred_y, labels = None):
    """
    Creates a visualization of a confusion matrix.

    PARAMETERS
    ----------
    true_y : true values for the response variable
    pred_y : predicted values for the response variable
    labels : tick labels for the classes, default None
    """
    # visualizing the confusion matrix

    # setting labels
    lbls = labels
    

    # declaring a confusion matrix object
    cm = confusion_matrix(y_true = true_y,
                          y_pred = pred_y)


    # heatmap
    sns.heatmap(cm,
                annot       = True,
                xticklabels = lbls,
                yticklabels = lbls,
                cmap        = 'Blues',
                fmt         = 'g')


    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix of the Classifier')
    plt.show()
    

<br>
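
The <em>classification_summary</em> function unpacks the confusion matrix with <em>.ravel()</em>, so the order of the four values matters. The small example below uses made-up labels (not the Titanic data) to show that, for 0/1 labels, scikit-learn puts actual classes on the rows and predicted classes on the columns, which makes the flattened order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# small made-up example (not from the Titanic data)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# rows are actual classes and columns are predicted classes
cm = confusion_matrix(y_true = y_true, y_pred = y_pred)

# ravel() flattens row by row: tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN: {tn} | FP: {fp} | FN: {fn} | TP: {tp}")
```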

In [None]:
help(classification_summary)
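
One detail worth knowing about the function above: it passes hard class predictions from <em>model.predict()</em> into <em>roc_auc_score</em>. AUC can also be computed from predicted probabilities, which ranks observations more finely. The sketch below compares the two on synthetic data from <em>make_classification</em> (an assumption for illustration, since the numbers will differ on the Titanic data).

```python
from sklearn.datasets        import make_classification
from sklearn.metrics         import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree            import DecisionTreeClassifier

# synthetic stand-in data (illustration only)
x, y = make_classification(n_samples    = 500,
                           n_features   = 5,
                           random_state = 702)

# stratified train-test split, mirroring the function above
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size    = 0.25,
                                                    random_state = 702,
                                                    stratify     = y)

# fitting a pruned tree
model = DecisionTreeClassifier(max_depth = 4, random_state = 708)
model.fit(x_train, y_train)

# AUC from hard 0/1 predictions (what classification_summary does)
auc_labels = roc_auc_score(y_true  = y_test,
                           y_score = model.predict(x_test))

# AUC from the predicted probability of class 1 (a common alternative)
auc_probs = roc_auc_score(y_true  = y_test,
                          y_score = model.predict_proba(x_test)[:, 1])

print(f"AUC from labels       : {round(auc_labels, 4)}")
print(f"AUC from probabilities: {round(auc_probs, 4)}")
```

Either approach is fine for comparing models, as long as the same one is used for every model being compared.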

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II: Classification Trees (CART Models)</h2><br>
CART models are very useful in classification problems as they output interesting tools such as <strong>tree plots</strong> and <strong>feature importance</strong>. As they are a nonparametric model type, they have no coefficients. <font color="red"><strong>They also assume no model form, meaning that we do not need to transform any features or engineer new ones.</strong></font> CART models are meant to work out of the box.<br><br>

<strong>CART Model Highlights</strong><br>

* tend to overfit unless pruned
* tend to be worse at prediction than other model types (after pruning)
* can generate very useful outputs for developing hypotheses and data-driven findings


In [None]:
# preparing to partition data
x_data   =  titanic.drop(['survived', 'm_boat', 'lifeboat',
                          'male', 'pclass_3'],
                               axis = 1)


y_data =  titanic['lifeboat']

<br>

In [None]:
# instantiating a classification tree
tree_model = DecisionTreeClassifier()

<br>

In [None]:
# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Full Tree")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

You may be wondering what just happened. CART models are supposed to work out of the box, and the one we just built is severely overfit. Let me make a correction to what was stated above: CART models are supposed to work out of the box <strong>if they are tuned properly</strong>. Just like gardening in real life, our decision tree needs some love. We've let it grow out of control and now it's so big that it's destroying our predictions and covering up our insights. Let's take a closer look at what we've just created.
<br><br>
Run the following code to generate a visual tree output.

In [None]:
# setting figure size
plt.figure(figsize=(150,50))


# developing a plotted tree
plot_tree(decision_tree = tree_model, 
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 14)


# rendering the plot
plt.show()
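
When a plot is too large to read, the size of a tree can also be summarized with numbers. The sketch below uses <em>get_depth()</em> and <em>get_n_leaves()</em> on an unpruned tree fit to synthetic data (a stand-in, assumed only for illustration). On the notebook's own model, the same two methods could be called on <em>tree_model</em>.

```python
from sklearn.datasets import make_classification
from sklearn.tree     import DecisionTreeClassifier

# synthetic stand-in data with some label noise (illustration only)
x, y = make_classification(n_samples    = 1000,
                           n_features   = 8,
                           flip_y       = 0.10,
                           random_state = 702)

# fitting a tree with no limits on its growth
full_tree = DecisionTreeClassifier(random_state = 708)
full_tree.fit(x, y)

# summarizing the size of the tree
print(f"Depth             : {full_tree.get_depth()}")
print(f"Terminal nodes    : {full_tree.get_n_leaves()}")
print(f"Training accuracy : {full_tree.score(x, y)}")
```

A training accuracy at or near 1.0 alongside a large number of terminal nodes is a typical sign that the tree has memorized the training data.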

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

The visual above may remind you of a world map. Unfortunately, this is not the goal of classification tree models. We need to <strong>prune the tree</strong> (limit its layers of growth) in order to better analyze our visual output. This will also help prevent the model from overfitting, as it will be unable to continually split the training data into nodes until each terminal node is as pure as it can be (which often results in each observation being in its own terminal node).<br><br>
<h4>b) Develop a new classification tree model.</h4>
Develop a classification tree with a maximum depth of 4. The cell below displays the model type's documentation for your reference.

In [None]:
help(DecisionTreeClassifier)

<br>

In [None]:
# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 4,
                                    random_state     = 708)

<br>

In [None]:
# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Pruned Tree")

<br>

In [None]:
# setting figure size
plt.figure(figsize=(22, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()
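
To see how depth affects overfitting, it can help to fit the same model at several depths and compare the train-test gap. The sketch below does this on synthetic data (an assumption for illustration; the depths listed are arbitrary choices, and the results on the Titanic data will differ). It scores with hard predictions to mirror <em>classification_summary</em>.

```python
import pandas as pd

from sklearn.datasets        import make_classification
from sklearn.metrics         import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree            import DecisionTreeClassifier

# synthetic stand-in data with some label noise (illustration only)
x, y = make_classification(n_samples    = 1000,
                           n_features   = 8,
                           flip_y       = 0.10,
                           random_state = 702)

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size    = 0.25,
                                                    random_state = 702,
                                                    stratify     = y)

# placeholder for one row of results per depth
rows = []

# None means the tree is allowed to grow without a depth limit
for depth in [2, 4, 8, None]:

    tree = DecisionTreeClassifier(max_depth = depth, random_state = 708)
    tree.fit(x_train, y_train)

    train_auc = roc_auc_score(y_train, tree.predict(x_train))
    test_auc  = roc_auc_score(y_test,  tree.predict(x_test))

    rows.append({'max_depth'    : str(depth),
                 'actual_depth' : tree.get_depth(),
                 'train_auc'    : round(train_auc, 4),
                 'test_auc'     : round(test_auc, 4),
                 'tt_gap'       : round(abs(train_auc - test_auc), 4)})

# collecting results into a DataFrame
depth_results = pd.DataFrame(rows)
print(depth_results)
```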

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

CART models have an amazing tool to help evaluate a model's features. This tool, known as <strong>feature importance</strong>, indicates how "important" each feature is in terms of splitting the data into nodes. Run the user-defined function below to see the results of this tool.

In [None]:
help(plot_feature_importances)

<br>

In [None]:
# plotting feature importance
plot_feature_importances(model  = tree_model,
                         train  = x_data,
                         export = False)
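
The values behind the plot live in the model's <em>feature_importances_</em> attribute, and they can be easier to compare as a sorted table. The sketch below is a minimal example on synthetic data with hypothetical feature names (<em>feat_a</em> through <em>feat_d</em> are invented for illustration). In scikit-learn, these values are normalized, so they sum to one for any tree that makes at least one split.

```python
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.tree     import DecisionTreeClassifier

# synthetic stand-in data with hypothetical feature names
x, y = make_classification(n_samples     = 500,
                           n_features    = 4,
                           n_informative = 2,
                           n_redundant   = 0,
                           random_state  = 702)

x = pd.DataFrame(x, columns = ['feat_a', 'feat_b', 'feat_c', 'feat_d'])

# fitting a pruned tree
tree = DecisionTreeClassifier(max_depth = 4, random_state = 708)
tree.fit(x, y)

# pairing each importance with its feature name and sorting
importances = pd.Series(data  = tree.feature_importances_,
                        index = x.columns).sort_values(ascending = False)

print(importances.round(4))
print(f"Total: {round(importances.sum(), 4)}")
```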

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Using Classification Trees For Analysis</h2><br>
Tree plots can be very useful in data exploration. Let's practice by analyzing how age plays a factor in terms of getting into a lifeboat.

In [None]:
## consolidating code into one cell ##

# preparing to partition data
x_data =  titanic[ ['age'] ]


y_data =  titanic['lifeboat']


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 4,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(16, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

As indicated above, it appears that there are no age groups with a definite chance of getting into a lifeboat. Let's dig deeper by looking into additional factors, such as gender and passenger class.

<strong>a) Complete the code to instantiate a tree model using <em>age</em>, <em>gender</em>, and <em>passenger class</em>.</strong><br>
Additionally, adjust the <em>max_depth</em> and <em>min_samples_leaf</em> hyperparameters to stabilize your model (if needed).

In [None]:
# checking feature names
titanic.columns

<br>

In [None]:
#############################################
# All features of interest in the same tree #
#############################################

# preparing to partition data
x_data =  _____


y_data =  titanic['lifeboat']


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = _____,
                                    min_samples_leaf = _____,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age, Gender, and Class Tree")


# setting figure size
plt.figure(figsize=(20, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

In [None]:
#############################################
# All features of interest in the same tree #
#############################################


# preparing to partition data
x_data =  titanic[ ['age', 'female', 'pclass_1', 'pclass_2'] ]


y_data =  titanic['lifeboat']


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 4,
                                    min_samples_leaf = 30,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age, Gender, and Class Tree")


# setting figure size
plt.figure(figsize=(18, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 14) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()
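
The <em>min_samples_leaf</em> hyperparameter stabilizes a tree by refusing any split that would leave a terminal node with fewer training observations than the value given. One way to check this is to read the leaf sizes from the fitted tree's <em>tree_</em> attribute, as sketched below on synthetic data (an assumption for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.tree     import DecisionTreeClassifier

# synthetic stand-in data (illustration only)
x, y = make_classification(n_samples    = 600,
                           n_features   = 4,
                           random_state = 702)

# same hyperparameters as the solution above
tree = DecisionTreeClassifier(max_depth        = 4,
                              min_samples_leaf = 30,
                              random_state     = 708)
tree.fit(x, y)

# terminal nodes have no children (child index of -1)
is_leaf = tree.tree_.children_left == -1

# number of training observations in each terminal node
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print(f"Terminal nodes     : {len(leaf_sizes)}")
print(f"Smallest leaf size : {leaf_sizes.min()}")
```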

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>b) Interpret your model's tree plot in the space below.</strong>
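
When interpreting a tree, a plain-text view of the split rules can be easier to quote than the plot. The sketch below uses <em>export_text</em> from scikit-learn on a small made-up dataset (the columns and the rule that generates the outcome are invented for illustration and are not findings from the Titanic data).

```python
import numpy  as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, export_text

# made-up data (illustration only)
rng = np.random.default_rng(seed = 702)

demo = pd.DataFrame({'age'    : rng.integers(low = 1, high = 80, size = 300),
                     'female' : rng.integers(low = 0, high = 2,  size = 300)})

# invented rule: women and children under 10 are coded as 1
demo['lifeboat'] = ((demo['female'] == 1) | (demo['age'] < 10)).astype(int)

# fitting a shallow tree
tree = DecisionTreeClassifier(max_depth = 2, random_state = 708)
tree.fit(demo[ ['age', 'female'] ], demo['lifeboat'])

# printing the split rules as indented text
rules = export_text(tree, feature_names = ['age', 'female'])
print(rules)
```

Each indented line is one split, and each <em>class</em> line is the prediction for a terminal node, so a branch can be read from top to bottom as a sentence.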

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part IV: Further Analysis</h2><br>
The trees in Part III pool every passenger together. To see how age relates to getting into a lifeboat within each group, the cells below fit an age-only tree on each combination of gender and passenger class.

The <em>max_depth</em> and <em>min_samples_leaf</em> hyperparameters are adjusted for each subset to keep the models stable.

In [None]:
##################################
# Subset: Female and First Class #
##################################

# preparing to partition data
x_data =  titanic.loc[ (titanic['female'] == 1) & (titanic['pclass_1'] == 1), ['age'] ]


y_data =  titanic.loc[ (titanic['female'] == 1) & (titanic['pclass_1'] == 1), 'lifeboat' ]


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 4,
                                    min_samples_leaf = 5,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(12, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<br>

In [None]:
###################################
# Subset: Female and Second Class #
###################################

# preparing to partition data
x_data =  titanic.loc[ (titanic['female'] == 1) & (titanic['pclass_2'] == 1), ['age'] ]


y_data =  titanic.loc[ (titanic['female'] == 1) & (titanic['pclass_2'] == 1), 'lifeboat' ]


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 3,
                                    min_samples_leaf = 5,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(12, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<br>

In [None]:
##################################
# Subset: Female and Third Class #
##################################

# preparing to partition data
x_data =  titanic.loc[ (titanic['female'] == 1) & (titanic['pclass_3'] == 1), ['age'] ]


y_data =  titanic.loc[ (titanic['female'] == 1) & (titanic['pclass_3'] == 1), 'lifeboat' ]


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 2,
                                    min_samples_leaf = 30,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(12, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<br>

In [None]:
################################
# Subset: Male and First Class #
################################

# preparing to partition data
x_data =  titanic.loc[ (titanic['female'] == 0) & (titanic['pclass_1'] == 1), ['age'] ]


y_data =  titanic.loc[ (titanic['female'] == 0) & (titanic['pclass_1'] == 1), 'lifeboat' ]


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 2,
                                    min_samples_leaf = 30,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(12, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<br>

In [None]:
#################################
# Subset: Male and Second Class #
#################################

# preparing to partition data
x_data =  titanic.loc[ (titanic['female'] == 0) & (titanic['pclass_2'] == 1), ['age'] ]


y_data =  titanic.loc[ (titanic['female'] == 0) & (titanic['pclass_2'] == 1), 'lifeboat' ]


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 3,
                                    min_samples_leaf = 30,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(12, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<br>

In [None]:
################################
# Subset: Male and Third Class #
################################

# preparing to partition data
x_data =  titanic.loc[ (titanic['female'] == 0) & (titanic['pclass_3'] == 1), ['age'] ]


y_data =  titanic.loc[ (titanic['female'] == 0) & (titanic['pclass_3'] == 1), 'lifeboat' ]


# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 4,
                                    min_samples_leaf = 5,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")


# setting figure size
plt.figure(figsize=(12, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()
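
The six subset cells above repeat the same steps with a different filter each time. As a sketch of how that repetition could be consolidated, the loop below fits one age-only tree per gender and class combination. It runs on made-up data with the same column names (values are random, so the trees carry no meaning), and it uses a single set of hyperparameters for every subset, whereas the cells above tune them per subset.

```python
import numpy  as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier

# made-up data using the notebook's column names (illustration only)
rng    = np.random.default_rng(seed = 702)
n      = 600
pclass = rng.integers(low = 1, high = 4, size = n)

demo = pd.DataFrame({'age'      : rng.integers(low = 1, high = 80, size = n),
                     'female'   : rng.integers(low = 0, high = 2,  size = n),
                     'pclass_1' : (pclass == 1).astype(int),
                     'pclass_2' : (pclass == 2).astype(int),
                     'pclass_3' : (pclass == 3).astype(int),
                     'lifeboat' : rng.integers(low = 0, high = 2,  size = n)})

# placeholder for one fitted tree (and subset size) per group
subset_trees = {}

for female in [1, 0]:
    for pclass_col in ['pclass_1', 'pclass_2', 'pclass_3']:

        # one combined boolean mask per subset
        mask = (demo['female'] == female) & (demo[pclass_col] == 1)

        x_sub = demo.loc[mask, ['age']]
        y_sub = demo.loc[mask, 'lifeboat']

        # same hyperparameters for every subset (a simplification)
        tree = DecisionTreeClassifier(max_depth        = 2,
                                      min_samples_leaf = 5,
                                      random_state     = 708)
        tree.fit(x_sub, y_sub)

        subset_trees[(female, pclass_col)] = (tree, len(x_sub))

        print(f"female = {female} | {pclass_col} | n = {len(x_sub)}")
```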

<br>

In [None]:
# preparing to partition data
x_data =  titanic[ ['age'] ]


y_data =  titanic['lifeboat']



# instantiating a classification tree
tree_model = DecisionTreeClassifier(max_depth        = 4,
                                    random_state     = 708)


# using the classification_summary function
classification_summary(x          = x_data,
                       y          = y_data,
                       model      = tree_model,
                       model_name = "Age Tree")



# setting figure size
plt.figure(figsize=(16, 6)) # adjust if boxes are overlapping


# developing a plotted tree
plot_tree(decision_tree = tree_model,
          feature_names = x_data.columns,
          filled        = True, 
          rounded       = True, 
          fontsize      = 12) # adjust if boxes are overlapping


# rendering the plot
plt.tight_layout()
plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~
      ___  ___  __                 
|__/ |__  |__  |__)                
|  \ |___ |___ |                   
                                   
 __   __   __               __    /
/ _` |__) /  \ |  | | |\ | / _`  / 
\__> |  \ \__/ |/\| | | \| \__> .  



~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>