# Modeling - Decision Tree
In this notebook, I build a Decision Tree to predict customer churn for an Internet Service Provider and tune its hyperparameters with Optuna, a hyperparameter optimization framework. By default, Optuna uses the Tree-structured Parzen Estimator (TPE) sampler to search for well-performing parameters.

## Table of Contents:
1. Data Loading
2. Modeling
    - Finding Best Hyperparameters
    - Building Model with tuned parameters

In [1]:
# Importing required libraries and modules
import optuna
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

In [2]:
# Setting seaborn figure size
sns.set(rc={'figure.figsize':(10,8)})

# Setting the seed
np.random.seed(42)

## Data Loading

In [3]:
train_prepared = pd.read_csv('../data/processed/train-prepared.csv')

In [4]:
print('Shape=>', train_prepared.shape)
train_prepared.head()

Shape=> (62273, 11)


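|   | is_tv_subscriber | is_movie_package_subscriber | subscription_age | bill_avg | remaining_contract | is_contract | service_failure_count | download_avg | upload_avg | download_over_limit | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1.77 | 7 | 0.19 | 1 | 0 | 114.1 | 8.7 | 0 | 0 |
| 1 | 1 | 0 | 0.05 | 6 | 0.59 | 1 | 0 | 12.7 | 1.3 | 0 | 0 |
| 2 | 0 | 0 | 1.42 | 18 | 0.0 | 0 | 0 | 0.4 | 0.0 | 0 | 1 |
| 3 | 1 | 0 | 0.73 | 20 | 0.0 | 1 | 0 | 9.3 | 0.4 | 0 | 1 |
| 4 | 1 | 0 | 0.25 | 17 | 0.0 | 1 | 0 | 6.1 | 0.5 | 0 | 1 |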


In [5]:
# Getting an overview of the dataset
train_prepared.info(show_counts=True,verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62273 entries, 0 to 62272
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   is_tv_subscriber             62273 non-null  int64  
 1   is_movie_package_subscriber  62273 non-null  int64  
 2   subscription_age             62273 non-null  float64
 3   bill_avg                     62273 non-null  int64  
 4   remaining_contract           62273 non-null  float64
 5   is_contract                  62273 non-null  int64  
 6   service_failure_count        62273 non-null  int64  
 7   download_avg                 61948 non-null  float64
 8   upload_avg                   61948 non-null  float64
 9   download_over_limit          62273 non-null  int64  
 10  churn                        62273 non-null  int64  
dtypes: float64(4), int64(7)
memory usage: 5.2 MB
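
The `info()` output shows 61,948 non-null values for `download_avg` and `upload_avg` against 62,273 rows, i.e. 325 missing values in each column. This is why the modeling pipeline starts with a median imputer. A minimal sketch of how median imputation fills such gaps, using a made-up toy frame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps in the same two columns
toy = pd.DataFrame({
    "download_avg": [10.0, np.nan, 30.0, 50.0],
    "upload_avg": [1.0, 2.0, np.nan, 9.0],
})

imputer = SimpleImputer(strategy="median")
toy_imputed = imputer.fit_transform(toy)

# Per-column medians learned from the non-missing values
print(imputer.statistics_)
# Imputed array: each NaN is replaced by its column's median
print(toy_imputed)
```

Placing the imputer inside the pipeline means the medians are learned on each training fold only during cross-validation, which avoids leaking information from the held-out fold.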


## Modeling

In [6]:
# Separating predictors and target
X, y = train_prepared.loc[:, train_prepared.columns != 'churn'], train_prepared.loc[:, 'churn']

### Finding Best Hyperparameters

In [7]:
def create_cart_pipeline(quantile_transform: str, min_samples_split: int,
                         min_samples_leaf: int) -> Pipeline:
    """ Returns a pipeline object created around Decision Tree
        algorithm
    
    Takes data preparation and Decision Tree modeling
        parameters as input, creates a Scikit-learn
        pipeline object and returns it
    
    Parameters
    ----------
    quantile_transform : str
        - "Yes": Quantile Transformation will be performed
        - "No": No Transformation
    
    min_samples_split: int
        min_samples_split argument of DecisionTreeClassifier
    
    min_samples_leaf: int
        min_samples_leaf argument of DecisionTreeClassifier
        
    Returns
    -------
    pipeline : Pipeline
        The pipeline object from Scikit-Learn
    """
    pipeline_steps = []
    
    # Adding SimpleImputer to pipeline
    imputer = SimpleImputer(strategy = 'median')
    pipeline_steps.append(('median_imputer', imputer))

    # Adding QuantileTransformer to pipeline (if required)
    if quantile_transform == "Yes":
        quantile_transformer = QuantileTransformer(n_quantiles=1000,
                                                   output_distribution='normal',
                                                   random_state=42)

        transformer = ColumnTransformer(transformers=[('quantile_transformer',
                                                       quantile_transformer,
                                                       [2, 3, 4, 6, 7, 8, 9])],
                                        n_jobs=-1,
                                        remainder='passthrough')
        pipeline_steps.append(('transformer', transformer))
    
    # Adding CART Model to pipeline
    model = DecisionTreeClassifier(criterion = "gini",
                                   splitter = "best",
                                   min_samples_split = min_samples_split,
                                   min_samples_leaf = min_samples_leaf,
                                   random_state = 42)
    
    pipeline_steps.append(('cart_model', model))
    
    # Building Pipeline Object
    pipeline = Pipeline(steps = pipeline_steps)
    
    return pipeline
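
One detail of this pipeline worth keeping in mind: `ColumnTransformer` places the transformed columns first and appends the `remainder='passthrough'` columns after them, so the column order reaching the tree differs from the order in `X`. A small sketch on hypothetical toy data (three columns, only index 1 transformed; `n_quantiles=100` is chosen here to match the toy sample size):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

# Toy matrix: columns 0 and 2 are constant flags, column 1 is skewed numeric data
rng = np.random.default_rng(42)
X_toy = np.column_stack([
    np.zeros(100),
    rng.exponential(scale=5.0, size=100),
    np.ones(100),
])

transformer = ColumnTransformer(
    transformers=[("quantile_transformer",
                   QuantileTransformer(n_quantiles=100,
                                       output_distribution="normal",
                                       random_state=42),
                   [1])],
    remainder="passthrough",
)
X_out = transformer.fit_transform(X_toy)

# The transformed column now sits at index 0; original columns 0 and 2 follow
print(X_out[:3])
```

This reordering matters mainly when mapping the fitted tree's `feature_importances_` back to column names.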

In [8]:
def objective(trial: optuna.trial.Trial) -> float:
    """ Returns the mean ROC-AUC score of the CART pipeline for one trial
    
    Objective function for tuning the Decision Tree
        modeling pipeline with Optuna. Takes an Optuna
        Trial object as input, samples a set of
        hyperparameters from it, performs 10-fold
        stratified cross-validation repeated 5 times
        and returns the mean ROC-AUC score.
        
    Parameters
    ----------
    trial : optuna.trial.Trial
        A trial is a process of evaluating an objective function.
        This object is passed to an objective function and provides
        interfaces to get parameter suggestion, manage the trial’s
        state, and set/get user-defined attributes of the trial.
    
    Returns
    -------
    roc_auc_score : float
        Mean ROC-AUC score over the 50 folds (10-fold
        cross-validation repeated 5 times) for a Decision
        Tree modeling pipeline with the sampled hyperparameters.
    """
    # Data preparation parameters
    quantile_transform = trial.suggest_categorical("quantile_transform", ["Yes", "No"])
    
    # Modeling parameters
    min_samples_split = trial.suggest_int("min_samples_split", low=2, high=50, step=1)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", low=2, high=50, step=1)
    
    # Building modeling pipeline
    pipeline = create_cart_pipeline(quantile_transform, min_samples_split, min_samples_leaf)
    
    # Defining Cross-Validation
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    
    return np.mean(scores)
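
To make the evaluation scheme concrete: `RepeatedStratifiedKFold(n_splits=10, n_repeats=5)` produces 10 × 5 = 50 train/test splits, so `cross_val_score` returns 50 ROC-AUC values and the objective returns their mean. A sketch on synthetic data (the `make_classification` dataset and the tree settings are stand-ins, not the churn data or the tuned values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

model = DecisionTreeClassifier(min_samples_leaf=20, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
demo_scores = cross_val_score(model, X_demo, y_demo, scoring="roc_auc", cv=cv)

# One score per split: 10 folds x 5 repeats
print(len(demo_scores))
print(float(np.mean(demo_scores)), float(np.std(demo_scores)))
```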

In [9]:
study = optuna.create_study(direction = 'maximize')
study.optimize(objective, show_progress_bar = True, n_trials = 30)

[I 2021-10-13 07:14:32,747] A new study created in memory with name: no-name-4dfa2af9-f21a-468c-86de-b410b3ca544d
[I 2021-10-13 07:14:42,848] Trial 0 finished with value: 0.9708674611738232 and parameters: {'quantile_transform': 'Yes', 'min_samples_split': 25, 'min_samples_leaf': 16}. Best is trial 0 with value: 0.9708674611738232.
[I 2021-10-13 07:14:51,166] Trial 1 finished with value: 0.9728945116205585 and parameters: {'quantile_transform': 'Yes', 'min_samples_split': 50, 'min_samples_leaf': 19}. Best is trial 1 with value: 0.9728945116205585.
[I 2021-10-13 07:14:59,309] Trial 2 finished with value: 0.975060845011023 and parameters: {'quantile_transform': 'Yes', 'min_samples_split': 13, 'min_samples_leaf': 44}. Best is trial 2 with value: 0.975060845011023.
[I 2021-10-13 07:15:06,476] Trial 3 finished with value: 0.9732303801943619 and parameters: {'quantile_transform': 'No', 'min_samples_split': 29, 'min_samples_leaf': 25}. Best is trial 2 with value: 0.975060845011023.
[I 2021-10-13 07:15:16,487] Trial 4 finished with value: 0.9722

In [10]:
print("Highest Score: ", study.best_value)
print("Best Parameters: ", study.best_params)
print("Best Trial: ", study.best_trial)

Highest Score:  0.9751577233859277
Best Parameters:  {'quantile_transform': 'Yes', 'min_samples_split': 16, 'min_samples_leaf': 46}
Best Trial:  FrozenTrial(number=5, values=[0.9751577233859277], datetime_start=datetime.datetime(2021, 10, 13, 7, 15, 16, 497292), datetime_complete=datetime.datetime(2021, 10, 13, 7, 15, 24, 648573), params={'quantile_transform': 'Yes', 'min_samples_split': 16, 'min_samples_leaf': 46}, distributions={'quantile_transform': CategoricalDistribution(choices=('Yes', 'No')), 'min_samples_split': IntUniformDistribution(high=50, low=2, step=1), 'min_samples_leaf': IntUniformDistribution(high=50, low=2, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=5, state=TrialState.COMPLETE, value=None)


### Building Model with tuned parameters

In [11]:
tuned_params = study.best_params

In [12]:
# Building modeling pipeline
pipeline = create_cart_pipeline(quantile_transform = tuned_params["quantile_transform"],
                                min_samples_split = tuned_params["min_samples_split"],
                                min_samples_leaf = tuned_params["min_samples_leaf"])

# Defining model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)

# Evaluating Model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv = cv, n_jobs = -1)

In [13]:
print("Decision Tree Classifier Pipeline: ", pipeline)
print('Mean AUC-ROC Score of Decision Tree Classifier: %.4f \u00B1 %.4f' % (np.mean(scores), np.std(scores)))

Decision Tree Classifier Pipeline:  Pipeline(steps=[('median_imputer', SimpleImputer(strategy='median')),
                ('transformer',
                 ColumnTransformer(n_jobs=-1, remainder='passthrough',
                                   transformers=[('quantile_transformer',
                                                  QuantileTransformer(output_distribution='normal',
                                                                      random_state=42),
                                                  [2, 3, 4, 6, 7, 8, 9])])),
                ('cart_model',
                 DecisionTreeClassifier(min_samples_leaf=46,
                                        min_samples_split=16,
                                        random_state=42))])
Mean AUC-ROC Score of Decision Tree Classifier: 0.9751 ± 0.0021
