# Modeling - Random Forest Classifier
In this notebook, I build a Random Forest Classifier to predict churn for an Internet Service Provider and tune its hyperparameters with Optuna, a hyperparameter optimization framework. By default, Optuna samples candidate parameters with the Tree-structured Parzen Estimator (TPE) algorithm.

## Table of Contents:
1. Data Loading
2. Modeling
    - Finding Best Hyperparameters
    - Building Model with tuned parameters

In [1]:
# Importing required libraries and modules
import os
import sys
import optuna
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import QuantileTransformer

from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

In [2]:
# Setting seaborn figure size
sns.set(rc={'figure.figsize':(10,8)})

# Setting the seed
np.random.seed(42)

## Data Loading

In [3]:
train_prepared = pd.read_csv('../data/processed/train-prepared.csv')

In [4]:
print('Shape=>', train_prepared.shape)
train_prepared.head()

Shape=> (62273, 11)


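|   | is_tv_subscriber | is_movie_package_subscriber | subscription_age | bill_avg | remaining_contract | is_contract | service_failure_count | download_avg | upload_avg | download_over_limit | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1.77 | 7 | 0.19 | 1 | 0 | 114.1 | 8.7 | 0 | 0 |
| 1 | 1 | 0 | 0.05 | 6 | 0.59 | 1 | 0 | 12.7 | 1.3 | 0 | 0 |
| 2 | 0 | 0 | 1.42 | 18 | 0.0 | 0 | 0 | 0.4 | 0.0 | 0 | 1 |
| 3 | 1 | 0 | 0.73 | 20 | 0.0 | 1 | 0 | 9.3 | 0.4 | 0 | 1 |
| 4 | 1 | 0 | 0.25 | 17 | 0.0 | 1 | 0 | 6.1 | 0.5 | 0 | 1 |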


In [5]:
# Getting an overview of the dataset
train_prepared.info(show_counts=True,verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62273 entries, 0 to 62272
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   is_tv_subscriber             62273 non-null  int64  
 1   is_movie_package_subscriber  62273 non-null  int64  
 2   subscription_age             62273 non-null  float64
 3   bill_avg                     62273 non-null  int64  
 4   remaining_contract           62273 non-null  float64
 5   is_contract                  62273 non-null  int64  
 6   service_failure_count        62273 non-null  int64  
 7   download_avg                 61948 non-null  float64
 8   upload_avg                   61948 non-null  float64
 9   download_over_limit          62273 non-null  int64  
 10  churn                        62273 non-null  int64  
dtypes: float64(4), int64(7)
memory usage: 5.2 MB
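
The overview shows 61948 non-null values for `download_avg` and `upload_avg` out of 62273 rows, so each column has 325 missing entries. That is why the modeling pipeline further down begins with a median imputer. A minimal sketch of what that step does, on a made-up toy frame (the values are hypothetical; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "download_avg": [10.0, np.nan, 30.0, 50.0],
    "upload_avg": [1.0, 2.0, np.nan, 6.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Replace each missing value with the median of its own column
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(toy)  # a NumPy array by default, not a DataFrame
```

Because the imputer hands back a plain array by default, later pipeline steps have to refer to columns by position rather than by name.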


## Modeling

In [6]:
# Separating predictors and target
X, y = train_prepared.loc[:, train_prepared.columns != 'churn'], train_prepared.loc[:, 'churn']

### Finding Best Hyperparameters

In [7]:
def create_rfc_pipeline(quantile_transform: str, n_estimators: int,
                        min_samples_split: int, min_samples_leaf: int,
                        max_features: str) -> Pipeline:
    """ Returns a pipeline object created around Random
            Forest algorithm
    
    Takes data preparation and Random Forest Classifier
        parameters as input, creates a Scikit-learn
        pipeline object and returns it
    
    Parameters
    ----------
    quantile_transform : str
        - "Yes": Quantile Transformation will be performed
        - "No": No Transformation
    
    n_estimators: int
        n_estimators argument of RandomForestClassifier
    
    min_samples_split: int
        min_samples_split argument of RandomForestClassifier

    min_samples_leaf: int
        min_samples_leaf argument of RandomForestClassifier
    
    max_features: str
        max_features argument of RandomForestClassifier
    
    Returns
    -------
    pipeline : Pipeline
        The pipeline object from Scikit-Learn
    """
    pipeline_steps = []
    
    # Adding SimpleImputer to pipeline
    imputer = SimpleImputer(strategy = 'median')
    pipeline_steps.append(('median_imputer', imputer))

    # Adding QuantileTransformer to pipeline (if required)
    if quantile_transform == "Yes":
        quantile_transformer = QuantileTransformer(n_quantiles=1000,
                                                   output_distribution='normal',
                                                   random_state=42)

        transformer = ColumnTransformer(transformers=[('quantile_transformer',
                                                       quantile_transformer,
                                                       [2, 3, 4, 6, 7, 8, 9])],
                                        n_jobs=-1,
                                        remainder='passthrough')
        pipeline_steps.append(('transformer', transformer))
    
    # Adding Random Forest model to pipeline (the step name 'cart_model' is kept as-is)
    model = RandomForestClassifier(n_estimators = n_estimators,
                                   criterion = "gini",
                                   max_features = max_features,                           
                                   min_samples_split = min_samples_split,
                                   min_samples_leaf = min_samples_leaf,
                                   n_jobs = -1,
                                   random_state = 42)
    
    pipeline_steps.append(('cart_model', model))
    
    # Building Pipeline Object
    pipeline = Pipeline(steps = pipeline_steps)
    
    return pipeline
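
The optional transformer step above selects columns by integer position, since the imputer before it outputs an array. One detail worth keeping in mind is that `ColumnTransformer` with `remainder='passthrough'` emits the transformed columns first and appends the untouched ones after them, so the column order changes. A small sketch on synthetic data (the two toy columns and the reduced `n_quantiles` are assumptions for illustration, not the notebook's settings):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Hypothetical data: column 0 is a binary flag, column 1 is a skewed numeric feature
flags = rng.integers(0, 2, size=200)
skewed = rng.exponential(scale=5.0, size=200)
X_toy = np.column_stack([flags, skewed])

transformer = ColumnTransformer(
    transformers=[("quantile_transformer",
                   QuantileTransformer(n_quantiles=100,
                                       output_distribution="normal",
                                       random_state=42),
                   [1])],  # positional index, as in the pipeline above
    remainder="passthrough",
)
X_out = transformer.fit_transform(X_toy)

# Transformed columns come first; passthrough columns are appended after them
transformed_col, passthrough_col = X_out[:, 0], X_out[:, 1]
```

For a tree ensemble the reordering is harmless, but it matters if feature importances are later mapped back to column names.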

In [8]:
def objective(trial: optuna.trial.Trial) -> float:
    """ Returns mean ROC-AUC score for Random Forest
        Classification algorithm
    
    Objective function for optimizing Random Forest
        algorithm using Optuna. Takes optuna Trial
        object as input, performs 10-fold cross-validation
        and returns mean ROC-AUC score for a set of
        hyperparameters of Random Forest modeling pipeline.
        
    Parameters
    ----------
    trial : optuna.trial.Trial
        A trial is a process of evaluating an objective function.
        This object is passed to an objective function and provides
        interfaces to get parameter suggestion, manage the trial’s
        state, and set/get user-defined attributes of the trial.
    
    Returns
    -------
    roc_auc_score : float
        Mean ROC-AUC Score of 10-fold cross-validation
        for a Random Forest modeling pipeline with a set
        of hyperparameters.
    """
    # Data preparation parameters
    quantile_transform = trial.suggest_categorical("quantile_transform", ["Yes", "No"])
    
    # Modeling parameters
    n_estimators = trial.suggest_int("n_estimators", low=10, high=1000, step=10)
    min_samples_split = trial.suggest_int("min_samples_split", low=2, high=40, step=1)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", low=2, high=40, step=1)
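    # NOTE: for classifiers "auto" meant the same as "sqrt"; it was removed in
    # scikit-learn 1.3, so use "sqrt" instead when running on newer versions
    max_features = trial.suggest_categorical("max_features", ["auto", None])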
    
    # Building modeling pipeline
    pipeline = create_rfc_pipeline(quantile_transform, n_estimators, min_samples_split,
                                   min_samples_leaf, max_features)
    
    # Defining Cross-Validation
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=42)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    
    return np.mean(scores)
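
To make the quantity being optimized concrete, here is a rough sketch of the scoring step on its own, without Optuna. It runs on synthetic data from `make_classification` (a stand-in, not the churn data) with a deliberately small forest and 5 folds so it finishes quickly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (not the churn data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=20, min_samples_leaf=4, random_state=42)

# One repeat of stratified 5-fold CV yields one ROC-AUC score per fold
cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=42)
fold_scores = cross_val_score(model, X_demo, y_demo, scoring="roc_auc", cv=cv_demo)

# The mean over folds is the single number handed back to the optimizer
mean_score = float(np.mean(fold_scores))
```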

In [9]:
study = optuna.create_study(direction = 'maximize')
study.optimize(objective, show_progress_bar = True, n_trials = 30)

[I 2021-10-13 08:03:27,429] A new study created in memory with name: no-name-8d2c641b-4302-4364-9e7d-50c1c37ca1a5

[I 2021-10-13 08:06:17,366] Trial 0 finished with value: 0.9796091029785851 and parameters: {'quantile_transform': 'No', 'n_estimators': 930, 'min_samples_split': 16, 'min_samples_leaf': 30, 'max_features': 'auto'}. Best is trial 0 with value: 0.9796091029785851.
[I 2021-10-13 08:09:23,673] Trial 1 finished with value: 0.9812105743577086 and parameters: {'quantile_transform': 'Yes', 'n_estimators': 940, 'min_samples_split': 26, 'min_samples_leaf': 10, 'max_features': 'auto'}. Best is trial 1 with value: 0.9812105743577086.
[I 2021-10-13 08:10:04,690] Trial 2 finished with value: 0.9803973890206571 and parameters: {'quantile_transform': 'Yes', 'n_estimators': 210, 'min_samples_split': 13, 'min_samples_leaf': 18, 'max_features': 'auto'}. Best is trial 1 with value: 0.9812105743577086.
[I 2021-10-13 08:11:27,251] Trial 3 finished with value: 0.9812747364807709 and parameters: {'quantile_transform': 'Yes', 'n_estimators': 410, 'min_samples_sp

In [10]:
print("Highest Score: ", study.best_value)
print("Best Parameters: ", study.best_params)
print("Best Trial: ", study.best_trial)

Highest Score:  0.9827541442864935
Best Parameters:  {'quantile_transform': 'No', 'n_estimators': 850, 'min_samples_split': 27, 'min_samples_leaf': 4, 'max_features': None}
Best Trial:  FrozenTrial(number=29, values=[0.9827541442864935], datetime_start=datetime.datetime(2021, 10, 13, 10, 30, 44, 903786), datetime_complete=datetime.datetime(2021, 10, 13, 10, 38, 15, 36450), params={'quantile_transform': 'No', 'n_estimators': 850, 'min_samples_split': 27, 'min_samples_leaf': 4, 'max_features': None}, distributions={'quantile_transform': CategoricalDistribution(choices=('Yes', 'No')), 'n_estimators': IntUniformDistribution(high=1000, low=10, step=10), 'min_samples_split': IntUniformDistribution(high=40, low=2, step=1), 'min_samples_leaf': IntUniformDistribution(high=40, low=2, step=1), 'max_features': CategoricalDistribution(choices=('auto', None))}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=29, state=TrialState.COMPLETE, value=None)


### Building Model with tuned parameters

In [11]:
tuned_params = study.best_params

In [12]:
# Building modeling pipeline
pipeline = create_rfc_pipeline(quantile_transform = tuned_params["quantile_transform"],
                               n_estimators = tuned_params["n_estimators"],
                               min_samples_split = tuned_params["min_samples_split"],
                               min_samples_leaf = tuned_params["min_samples_leaf"],
                               max_features = tuned_params["max_features"])

# Defining model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)

# Evaluating Model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv = cv, n_jobs = -1)

In [13]:
print("Random Forest Classifier Pipeline: ", pipeline)
print('Mean AUC-ROC Score of Random Forest Classifier: %.4f \u00B1 %.4f' % (np.mean(scores), np.std(scores)))

Random Forest Classifier Pipeline:  Pipeline(steps=[('median_imputer', SimpleImputer(strategy='median')),
                ('cart_model',
                 RandomForestClassifier(max_features=None, min_samples_leaf=4,
                                        min_samples_split=27, n_estimators=850,
                                        n_jobs=-1, random_state=42))])
Mean AUC-ROC Score of Random Forest Classifier: 0.9827 ± 0.0015
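
The reported figure summarizes 100 scores, since 10 folds are repeated 10 times. A short sketch of how the split count and the summary line come about (the four `demo_scores` are hypothetical, not results from this notebook):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

cv_final = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
n_scores = cv_final.get_n_splits()  # 10 folds x 10 repeats

# Hypothetical scores, only to show how the summary line is formed
demo_scores = np.array([0.98, 0.99, 0.97, 0.98])
summary = "%.4f \u00B1 %.4f" % (np.mean(demo_scores), np.std(demo_scores))
```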
