# Part 1 - Introduction and Data Description

For this demonstration, we will use a bank marketing data set. A bank ran a marketing campaign and collected data on nearly 45,000 customers, including variables such as age, job, bank balance, education and loan status. The bank wants to use the insights from this campaign to shape its strategy for the next one, so that more customers agree to open term deposits.

Here, `'y'` (whether the customer agreed to open a term deposit) is the target variable. A `'yes'` in the `'y'` column indicates that the campaign succeeded with that customer, who agreed to open a term deposit account with the bank. A `'no'` indicates that the customer could not be convinced to open one.

The purpose of this demonstration is to show the viewer how to build and implement random forest models and gradient boosted tree models for classification. We will also look at how model performance varies for different values of various hyperparameters.

## Data description

### Input features
- ***age*** : Age of the customer (numeric)
- ***job*** : Type of job (categorical)
- ***marital*** : Marital status (categorical)
- ***education***: Level of education (categorical)
- ***default***: Does the customer have a credit default or not? (categorical: 'no', 'yes')
- ***balance***: Bank balance of the customer (numeric)
- ***housing***: Does the customer have a housing loan or not? (categorical: 'no', 'yes')
- ***loan***: Does the customer have a personal loan or not? (categorical: 'no', 'yes')
- ***contact***: Contact communication type (categorical)
- ***day***: Last contact day of the month (numeric)
- ***month***: Last contact month of year (categorical)
- ***duration***: Last contact duration, in seconds (numeric)
- ***campaign***: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
- ***pdays***: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
- ***previous***: Number of contacts performed before this campaign and for this client (numeric)
- ***poutcome***: Outcome of the previous marketing campaign (categorical)

### Output feature
- ***y***: Has the client subscribed to a term deposit? (categorical: 'yes', 'no')
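
The categorical features above hold text labels, while the scikit-learn random forest used later in this notebook expects numeric inputs. Here is a minimal sketch of the idea behind dummy (one-hot) encoding, written in plain Python on a few made-up rows. The sample values and the helper name `one_hot` are illustrative assumptions, not part of the dataset or of pandas.

```python
# Toy rows with one categorical column; the values are invented for illustration
rows = [{'age': 30, 'marital': 'single'},
        {'age': 45, 'marital': 'married'},
        {'age': 52, 'marital': 'divorced'},
        {'age': 38, 'marital': 'married'}]

def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category, named '<column>_<category>'
        for category in categories:
            new_row['{}_{}'.format(column, category)] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

encoded_rows = one_hot(rows, 'marital')
for row in encoded_rows:
    print(row)
```

Exactly one indicator is set per row. `pd.get_dummies` names its columns in the same `feature_category` pattern, which is where names such as `marital_single` and `y_yes` come from later in the notebook.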

In [None]:
# !pip install lightgbm  # Run this cell to install the lightgbm library

In [None]:
# Importing 'numpy' and 'pandas' packages for working with numbers and data frames
import numpy as np
import pandas as pd

# Importing 'matplotlib.pyplot' and 'seaborn' for visualisations
from matplotlib import pyplot as plt
import seaborn as sns

# Importing packages for building ensemble models
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier 

# Importing method for train-test split
from sklearn.model_selection import train_test_split

# Importing suitable evaluation metrics
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, roc_auc_score

# Import 'GridSearchCV' for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Ignore warnings to keep output clean
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the raw data
df = pd.read_csv('bank-full.csv', delimiter = ';')  # This dataset uses semicolons instead of commas to separate values

In [None]:
# Taking a look at a sample of the data
df.head(5)

In [None]:
# Discard features not used in this analysis
df = df[['age', 'duration', 'balance', 'job', 'marital', 'education', 'y']]

In [None]:
# Generating dummy variables for categorical features
df_dummies = pd.get_dummies(df, columns = ['job', 'marital', 'education', 'y'])
df_dummies.head()

In [None]:
# Taking a look at the new dummy variables
for col in df_dummies.columns:
    print(col)

In [None]:
# Keeping only the selected input features and the output column
df_dummies = df_dummies[['age', 'duration', 'balance', 
                         'job_services', 'job_management', 'job_student', 'job_retired', 
                         'marital_divorced', 'marital_married', 'marital_single', 
                         'education_primary', 'education_secondary', 'education_tertiary', 
                         'y_yes']]

In [None]:
# Splitting the data into input features (X) and output (y)
X = df_dummies.drop('y_yes', axis = 1)
y = df_dummies['y_yes']

In [None]:
# Holding out 20% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 123)

# Part 2 - Random Forest
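
The models in this notebook set `class_weight = 'balanced'`. scikit-learn documents this option as weighting each class by `n_samples / (n_classes * count_of_that_class)`, so the rarer class receives the larger weight. Here is a small sketch of that formula in plain Python, using made-up label counts. The 88/12 split is an assumption for illustration, not a figure measured on this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical labels: 88 'no' (0) and 12 'yes' (1)
toy_labels = [0] * 88 + [1] * 12
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights, each class contributes the same total weight to training, which keeps the majority class from dominating the fit.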

In [None]:
# Creating a random forest classifier
# Use 100 estimators, a maximum tree depth of 5, and set the class weight as 'balanced'
# Set the random state parameter to 123
rf = RandomForestClassifier(n_estimators = 100, max_depth = 5, class_weight = 'balanced', random_state = 123)

# Fit the model to the training data
rf.fit(X_train, y_train);

In [None]:
# Obtaining the feature importances from the model
rfimp = rf.feature_importances_
rfimp

In [None]:
# Visualising the feature importances
plt.figure(figsize = (6, 4))
rfimpdf = pd.DataFrame(data = {'Features': X_train.columns, 'Importances': rfimp})
rfimpdf = rfimpdf.sort_values(by = 'Importances', ascending = False)
sns.barplot(data = rfimpdf, x = 'Importances', y = 'Features', orient = 'h');

As the graph above shows, the random forest model ranks ***duration*** as the most important feature.

In [None]:
# Display the confusion matrices for the model on the training and validation data
plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(1, 2, figsize = (12, 4))
ConfusionMatrixDisplay.from_estimator(rf, X_train, y_train, cmap = plt.cm.Blues, ax = ax[0])
ConfusionMatrixDisplay.from_estimator(rf, X_val, y_val, cmap = plt.cm.Blues, ax = ax[1])
ax[0].set_title('Training')
ax[1].set_title('Validation');

In [None]:
# Obtaining predictions on the training and validation sets
y_pred_train = rf.predict(X_train)
y_pred_val = rf.predict(X_val)

# Compute accuracy scores
train_acc = accuracy_score(y_train, y_pred_train)
val_acc = accuracy_score(y_val, y_pred_val)

print('Accuracy on the training data = {}'.format(train_acc))
print('Accuracy on the validation data = {}'.format(val_acc))

In [None]:
# Compute the ROC AUC scores for the training and the validation data
# Obtain predicted probabilities for class '1'
train_probabilities = rf.predict_proba(X_train)[:, 1]
val_probabilities = rf.predict_proba(X_val)[:, 1]

# Compute ROC AUC scores
train_auc = roc_auc_score(y_train, train_probabilities)
val_auc = roc_auc_score(y_val, val_probabilities)

print('ROC AUC score for the training data = {}'.format(train_auc))
print('ROC AUC score for the validation data = {}'.format(val_auc))

# Part 3 - Random Forest: Hyperparameter Tuning

In this section, we will:
- Tune the random forest model for the following hyperparameters:
  - Number of estimators
  - Maximum tree depth
- Tune the random forest model for a combination of number of estimators and maximum tree depth
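
Both accuracy and ROC AUC are tracked during tuning. ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. Here is a minimal sketch of that pairwise definition in plain Python. The function name `pairwise_auc` and the toy scores are assumptions made for illustration; `roc_auc_score` is what the notebook actually uses.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('Both classes must be present to compute ROC AUC')
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because the score depends only on how the examples are ordered, it does not depend on the decision threshold that `predict` applies. This is why the notebook passes probabilities from `predict_proba` rather than hard predictions.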

## Subpart 1 - Hyperparameter Tuning: Number of Estimators
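
One intuition for why adding estimators tends to help up to a point and then flatten out is majority voting. The sketch below makes the simplifying assumption that each tree is correct independently with the same probability, in which case the accuracy of the vote follows from the binomial distribution. Real trees in a forest are correlated, and scikit-learn averages predicted probabilities rather than counting hard votes, so this is an idealised illustration and not a description of the library's behaviour.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
               for k in range(needed, n_voters + 1))

# Odd ensemble sizes avoid ties; 0.6 is a made-up per-tree accuracy
for n_voters in [1, 11, 51, 101, 201]:
    print(n_voters, round(majority_vote_accuracy(n_voters, 0.6), 4))
```

The gain from each additional batch of voters shrinks as the ensemble grows, which is one reason to expect the validation curves in this subpart to level off rather than keep climbing.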

In [None]:
# Define a list of number of estimators to tune over
num_estimators = np.arange(50, 550, 50)

# Create and train a random forest model for each value of number of estimators
performance_df = pd.DataFrame(data = None)

# Use a for loop to loop over the different models and capture their performances
indexcount = -1
for current_num_estimators in num_estimators:
    indexcount = indexcount + 1

    # Create a random forest model with the current specifications
    # Use the current number of estimators, a maximum tree depth of 5, and set the class weight as 'balanced'
    # Set the random state parameter to 123
    current_rf = RandomForestClassifier(n_estimators = current_num_estimators,
                                        max_depth = 5,
                                        class_weight = 'balanced',
                                        random_state = 123)

    # Fit the model on the training data
    current_rf.fit(X_train, y_train)

    print('\n Training for {} estimators is complete'.format(current_num_estimators))

    # Obtain predictions
    current_y_pred_train = current_rf.predict(X_train)
    current_y_pred_val = current_rf.predict(X_val)

    # Compute accuracy scores
    current_train_acc = accuracy_score(y_train, current_y_pred_train)
    current_val_acc = accuracy_score(y_val, current_y_pred_val)

    # Obtain predicted probabilities for class '1'
    current_train_probabilities = current_rf.predict_proba(X_train)[:, 1]
    current_val_probabilities = current_rf.predict_proba(X_val)[:, 1]

    # Compute ROC AUC scores
    current_train_auc = roc_auc_score(y_train, current_train_probabilities)
    current_val_auc = roc_auc_score(y_val, current_val_probabilities)

    tempdf = pd.DataFrame(index = [indexcount],
                          data = {'Number of Estimators': current_num_estimators,
                                  'Training Accuracy': current_train_acc,
                                  'Validation Accuracy': current_val_acc,
                                  'Training ROC AUC': current_train_auc,
                                  'Validation ROC AUC': current_val_auc})

    performance_df = pd.concat([performance_df, tempdf])

performance_df.set_index('Number of Estimators')

In [None]:
# Visualise variation in validation accuracy scores with respect to number of estimators
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Number of Estimators', y = 'Validation Accuracy', marker = 'o', markersize = 12)
plt.title('Validation Accuracy Scores by Number of Estimators')
plt.ylabel('Accuracy')
plt.xlabel('Number of Estimators')
plt.xticks(num_estimators);

In [None]:
# Visualise variation in validation ROC AUC scores with respect to number of estimators
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Number of Estimators', y = 'Validation ROC AUC', marker = 'o', markersize = 12)
plt.title('Validation ROC AUC Scores by Number of Estimators')
plt.ylabel('ROC AUC Score')
plt.xlabel('Number of Estimators')
plt.xticks(num_estimators);

## Subpart 2 - Hyperparameter Tuning: Maximum Tree Depth

In [None]:
# Define a list of maximum tree depths to tune over
max_tree_depths = np.arange(1, 11, 1)

# Create and train a random forest model for each value of maximum tree depth
performance_df = pd.DataFrame(data = None)

# Use a for loop to loop over the different models and capture their performances
indexcount = -1
for current_max_tree_depth in max_tree_depths:
    indexcount = indexcount + 1

    # Create a random forest model with the current specifications
    # Use 200 estimators, the current maximum tree depth, and set the class weight as 'balanced'
    # Set the random state parameter to 123
    current_rf = RandomForestClassifier(n_estimators = 200,
                                        max_depth = current_max_tree_depth,
                                        class_weight = 'balanced',
                                        random_state = 123)

    # Fit the model on the training data
    current_rf.fit(X_train, y_train)

    print('\n Training for tree depth of {} is complete'.format(current_max_tree_depth))

    # Obtain predictions
    current_y_pred_train = current_rf.predict(X_train)
    current_y_pred_val = current_rf.predict(X_val)

    # Compute accuracy scores
    current_train_acc = accuracy_score(y_train, current_y_pred_train)
    current_val_acc = accuracy_score(y_val, current_y_pred_val)

    # Obtain predicted probabilities for class '1'
    current_train_probabilities = current_rf.predict_proba(X_train)[:, 1]
    current_val_probabilities = current_rf.predict_proba(X_val)[:, 1]

    # Compute ROC AUC scores
    current_train_auc = roc_auc_score(y_train, current_train_probabilities)
    current_val_auc = roc_auc_score(y_val, current_val_probabilities)

    tempdf = pd.DataFrame(index = [indexcount],
                          data = {'Maximum Tree Depth': current_max_tree_depth,
                                  'Training Accuracy': current_train_acc,
                                  'Validation Accuracy': current_val_acc,
                                  'Training ROC AUC': current_train_auc,
                                  'Validation ROC AUC': current_val_auc})

    performance_df = pd.concat([performance_df, tempdf])

performance_df.set_index('Maximum Tree Depth')

In [None]:
# Visualise variation in validation accuracy scores with respect to the maximum tree depth
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Maximum Tree Depth', y = 'Validation Accuracy', marker = 'o', markersize = 12)
plt.title('Validation Accuracy Scores by Maximum Tree Depth')
plt.ylabel('Accuracy')
plt.xlabel('Maximum Tree Depth')
plt.xticks(max_tree_depths);

In [None]:
# Visualise variation in validation ROC AUC scores with respect to the maximum tree depth
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Maximum Tree Depth', y = 'Validation ROC AUC', marker = 'o', markersize = 12)
plt.title('Validation ROC AUC Scores by Maximum Tree Depth')
plt.ylabel('ROC AUC Score')
plt.xlabel('Maximum Tree Depth')
plt.xticks(max_tree_depths);

## Subpart 3 - Hyperparameter Tuning: Combination of Hyperparameters
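
A grid search tries every combination of the listed values, and with cross-validation each combination is fitted once per fold. The sketch below enumerates the grid used in this subpart with `itertools.product` and counts the fits, assuming scikit-learn's default of 5 cross-validation folds.

```python
from itertools import product

parameters_grid = {'n_estimators': [100, 200],
                   'max_depth': [5, 6, 7]}

# Build one dictionary per combination of hyperparameter values
names = list(parameters_grid)
combinations = [dict(zip(names, values))
                for values in product(*(parameters_grid[name] for name in names))]

n_folds = 5  # assumed default number of cross-validation folds
for combination in combinations:
    print(combination)
print('Total fits during the search:', len(combinations) * n_folds)
```

Note that `best_score_` reported by `GridSearchCV` is the mean cross-validated score on the training data, not a score on the validation set held out earlier.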

In [None]:
# Initialise a basic random forest classifier model
# Set the class weight as 'balanced'
# Set the random state parameter to 123
base_grid_model = RandomForestClassifier(class_weight = 'balanced', random_state = 123)

# Define a range of hyperparameter values to tune for and store them in a dictionary
parameters_grid = {'n_estimators': [100, 200],
                   'max_depth': [5, 6, 7]}

# Set up a grid search with 'GridSearchCV', which cross-validates every combination in the grid
# Use ROC AUC score as a scoring metric
# Use the default number of cross-validation folds
# Set the 'verbose' parameter to 3 or more to display useful results during the process
grid = GridSearchCV(estimator = base_grid_model,
                    param_grid = parameters_grid,
                    scoring = 'roc_auc',
                    verbose = 4)

# Fit the model on the training data
grid_model = grid.fit(X_train, y_train)

# Print the optimal values of 'n_estimators' and 'max_depth'
best_n_estimators = grid_model.best_params_['n_estimators']
best_max_depth = grid_model.best_params_['max_depth']
best_roc_auc_score = grid_model.best_score_

print('\n The optimal model has {} estimators, each of maximum tree depth {}, and it has an ROC AUC score of {}.'.format(best_n_estimators, best_max_depth, best_roc_auc_score))

# Part 4 - Gradient Boosted Tree

We will now fit a gradient boosted tree model to the data and study the performance of the model on the training and validation data.
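
To make the idea concrete before using LightGBM: boosting builds the model in stages, with each new tree fitted to the errors left by the trees so far. The sketch below is a deliberately simplified version in plain Python, assuming squared-error loss, one-split trees (stumps) and a single numeric feature. It is not LightGBM's algorithm, and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that minimises squared error of the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_estimators, learning_rate):
    """Return the training mean squared error after each boosting round."""
    prediction = [sum(y) / len(y)] * len(y)   # start from the mean of the target
    history = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        # Add a shrunken copy of the stump's output to the running prediction
        prediction = [pi + learning_rate * stump(xi) for xi, pi in zip(x, prediction)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    return history

# Toy data: the target tends to step up as x grows
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
history = boost(x, y, n_estimators=20, learning_rate=0.1)
print(round(history[0], 4), round(history[-1], 4))
```

The training error falls with every round in this sketch. On real data the validation error need not follow, which is why the later parts of this notebook tune the number of estimators, the tree depth and the learning rate against validation scores.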

In [None]:
# Create a gradient boosted tree classifier model
# Use 100 estimators, a maximum tree depth of 5, a learning rate of 0.1, and set the class weight as 'balanced'
# Set the random state parameter to 123
gbt = LGBMClassifier(n_estimators = 100, max_depth = 5, learning_rate = 0.1, class_weight = 'balanced', random_state = 123, verbose = -1)

# Fit the model to the training data
gbt.fit(X_train, y_train);

In [None]:
# Obtaining the feature importances from the model
gbtimp = gbt.feature_importances_
gbtimp

In [None]:
# Visualising the feature importances
plt.figure(figsize = (6, 4))
gbtimpdf = pd.DataFrame(data = {'Features': X_train.columns, 'Importances': gbtimp})
gbtimpdf = gbtimpdf.sort_values(by = 'Importances', ascending = False)
sns.barplot(data = gbtimpdf, x = 'Importances', y = 'Features', orient = 'h');

In [None]:
# Display the confusion matrices for the model on the training and validation data
plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(1, 2, figsize = (12, 4))
ConfusionMatrixDisplay.from_estimator(gbt, X_train, y_train, cmap = plt.cm.Blues, ax = ax[0])
ConfusionMatrixDisplay.from_estimator(gbt, X_val, y_val, cmap = plt.cm.Blues, ax = ax[1])
ax[0].set_title('Training')
ax[1].set_title('Validation');

In [None]:
# Compute the accuracy scores on the training and validation data
# Obtaining predictions
y_pred_train = gbt.predict(X_train)
y_pred_val = gbt.predict(X_val)

# Compute accuracy scores
train_acc = accuracy_score(y_train, y_pred_train)
val_acc = accuracy_score(y_val, y_pred_val)

print('Accuracy on the training data = {}'.format(train_acc))
print('Accuracy on the validation data = {}'.format(val_acc))

In [None]:
# Compute the ROC AUC scores for the training and validation data
# Obtaining predicted probabilities for class '1'
train_probabilities = gbt.predict_proba(X_train)[:, 1]
val_probabilities = gbt.predict_proba(X_val)[:, 1]

# Compute ROC AUC scores
train_auc = roc_auc_score(y_train, train_probabilities)
val_auc = roc_auc_score(y_val, val_probabilities)

print('ROC AUC score for the training data = {}'.format(train_auc))
print('ROC AUC score for the validation data = {}'.format(val_auc))

# Part 5 - Gradient Boosted Tree: Hyperparameter Tuning

In this section, we will:
- Tune the gradient boosted tree model for the following hyperparameters:
  - Number of estimators
  - Maximum tree depth
  - Learning rate
- Tune the gradient boosted tree model for a combination of the number of estimators, maximum tree depth, and learning rate

## Subpart 1 - Hyperparameter Tuning: Number of Estimators

In [None]:
# Define a list of number of estimators to tune over
num_estimators = np.arange(50, 550, 50)

# Create and train a gradient boosted tree for each value of number of estimators
performance_df = pd.DataFrame(data = None)

# Use a for loop to loop over the different models and capture their performances
indexcount = -1
for current_num_estimators in num_estimators:
    indexcount = indexcount + 1

    # Create a gradient boosted tree model with the current specifications
    # Use the current number of estimators, a maximum tree depth of 5, a learning rate of 0.1, and set the class weight as 'balanced'
    # Set the random state parameter to 123
    current_gbt = LGBMClassifier(n_estimators = current_num_estimators,
                                 max_depth = 5,
                                 learning_rate = 0.1,
                                 class_weight = 'balanced',
                                 random_state = 123,
                                 verbose = -1)  # Silence LightGBM's training log, as in Part 4

    # Fit the model on the training data
    current_gbt.fit(X_train, y_train)

    print('\n Training for {} estimators is complete'.format(current_num_estimators))

    # Obtain predictions
    current_y_pred_train = current_gbt.predict(X_train)
    current_y_pred_val = current_gbt.predict(X_val)

    # Compute accuracy scores
    current_train_acc = accuracy_score(y_train, current_y_pred_train)
    current_val_acc = accuracy_score(y_val, current_y_pred_val)

    # Obtain predicted probabilities for class '1'
    current_train_probabilities = current_gbt.predict_proba(X_train)[:, 1]
    current_val_probabilities = current_gbt.predict_proba(X_val)[:, 1]

    # Compute ROC AUC scores
    current_train_auc = roc_auc_score(y_train, current_train_probabilities)
    current_val_auc = roc_auc_score(y_val, current_val_probabilities)

    tempdf = pd.DataFrame(index = [indexcount],
                          data = {'Number of Estimators': current_num_estimators,
                                  'Training Accuracy': current_train_acc,
                                  'Validation Accuracy': current_val_acc,
                                  'Training ROC AUC': current_train_auc,
                                  'Validation ROC AUC': current_val_auc})

    performance_df = pd.concat([performance_df, tempdf])

performance_df.set_index('Number of Estimators')

In [None]:
# Visualise variation in validation accuracy scores with respect to number of estimators
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Number of Estimators', y = 'Validation Accuracy', marker = 'o', markersize = 12)
plt.title('Validation Accuracy Scores by Number of Estimators')
plt.ylabel('Accuracy')
plt.xlabel('Number of Estimators')
plt.xticks(num_estimators);

In [None]:
# Visualise variation in validation ROC AUC scores with respect to number of estimators
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Number of Estimators', y = 'Validation ROC AUC', marker = 'o', markersize = 12)
plt.title('Validation ROC AUC Scores by Number of Estimators')
plt.ylabel('ROC AUC Score')
plt.xlabel('Number of Estimators')
plt.xticks(num_estimators);

## Subpart 2 - Hyperparameter Tuning: Maximum Tree Depth

In [None]:
# Define a list of maximum tree depths to tune over
max_tree_depths = np.arange(1, 11, 1)

# Create and train a gradient boosted tree model for each value of maximum tree depth
performance_df = pd.DataFrame(data = None)

# Use a for loop to loop over the different models and capture their performances
indexcount = -1
for current_max_tree_depth in max_tree_depths:
    indexcount = indexcount + 1

    # Create a gradient boosted tree with the current specifications
    # Use 100 estimators, the current maximum tree depth, a learning rate of 0.1, and set the class weight as 'balanced'
    # Set the random state parameter to 123
    current_gbt = LGBMClassifier(n_estimators = 100,
                                 max_depth = current_max_tree_depth,
                                 learning_rate = 0.1,
                                 class_weight = 'balanced',
                                 random_state = 123,
                                 verbose = -1)  # Silence LightGBM's training log, as in Part 4

    # Fit the model on the training data
    current_gbt.fit(X_train, y_train)

    print('\n Training for tree depth of {} is complete'.format(current_max_tree_depth))

    # Obtain predictions
    current_y_pred_train = current_gbt.predict(X_train)
    current_y_pred_val = current_gbt.predict(X_val)

    # Compute accuracy scores
    current_train_acc = accuracy_score(y_train, current_y_pred_train)
    current_val_acc = accuracy_score(y_val, current_y_pred_val)

    # Obtain predicted probabilities for class '1'
    current_train_probabilities = current_gbt.predict_proba(X_train)[:, 1]
    current_val_probabilities = current_gbt.predict_proba(X_val)[:, 1]

    # Compute ROC AUC scores
    current_train_auc = roc_auc_score(y_train, current_train_probabilities)
    current_val_auc = roc_auc_score(y_val, current_val_probabilities)

    tempdf = pd.DataFrame(index = [indexcount],
                          data = {'Maximum Tree Depth': current_max_tree_depth,
                                  'Training Accuracy': current_train_acc,
                                  'Validation Accuracy': current_val_acc,
                                  'Training ROC AUC': current_train_auc,
                                  'Validation ROC AUC': current_val_auc})

    performance_df = pd.concat([performance_df, tempdf])

performance_df.set_index('Maximum Tree Depth')

In [None]:
# Visualise variation in validation accuracy scores with respect to the maximum tree depth
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Maximum Tree Depth', y = 'Validation Accuracy', marker = 'o', markersize = 12)
plt.title('Validation Accuracy Scores by Maximum Tree Depth')
plt.ylabel('Accuracy')
plt.xlabel('Maximum Tree Depth')
plt.xticks(max_tree_depths);

In [None]:
# Visualise variation in validation ROC AUC scores with respect to the maximum tree depth
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Maximum Tree Depth', y = 'Validation ROC AUC', marker = 'o', markersize = 12)
plt.title('Validation ROC AUC Scores by Maximum Tree Depth')
plt.ylabel('ROC AUC Score')
plt.xlabel('Maximum Tree Depth')
plt.xticks(max_tree_depths);

## Subpart 3 - Hyperparameter Tuning: Learning Rate
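
The learning rate scales the contribution of each new tree, so it interacts with the number of estimators: a smaller rate needs more trees to cover the same ground. Here is a back-of-the-envelope sketch under the idealised assumption that every tree perfectly predicts the current residual. This assumption is for illustration only, since real trees do not. Each round then leaves a fraction `1 - learning_rate` of the residual, so after `n` rounds the fraction remaining is `(1 - learning_rate) ** n`.

```python
from math import ceil, log

def remaining_fraction(learning_rate, n_estimators):
    """Fraction of the initial residual left after n idealised boosting rounds."""
    return (1 - learning_rate) ** n_estimators

def rounds_needed(learning_rate, target_fraction):
    """Smallest number of rounds that brings the residual to the target fraction or below."""
    return ceil(log(target_fraction) / log(1 - learning_rate))

for learning_rate in [0.02, 0.1, 0.2]:
    print(learning_rate,
          remaining_fraction(learning_rate, 100),
          rounds_needed(learning_rate, 0.01))
```

In this idealised picture, lowering the learning rate while holding the number of estimators at 100 leaves more of the signal unfitted, so a drop in validation scores at the low end of the range would not be surprising.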

In [None]:
# Define a list of learning rates to tune over
learning_rates = np.linspace(0.02, 0.2, 10)  # 10 evenly spaced values; avoids float rounding surprises with 'np.arange'

# Create and train a gradient boosted tree model for each value of learning rate
performance_df = pd.DataFrame(data = None)

# Use a for loop to loop over the different models and capture their performances
indexcount = -1
for current_learning_rate in learning_rates:
    indexcount = indexcount + 1

    # Create a gradient boosted tree with the current specifications
    # Use 100 estimators, a maximum tree depth of 5, the current learning rate, and set the class weight as 'balanced'
    # Set the random state parameter to 123
    current_gbt = LGBMClassifier(n_estimators = 100,
                                 max_depth = 5,
                                 learning_rate = current_learning_rate,
                                 class_weight = 'balanced',
                                 random_state = 123,
                                 verbose = -1)  # Silence LightGBM's training log, as in Part 4

    # Fit the model on the training data
    current_gbt.fit(X_train, y_train)

    print('\n Training for learning rate of {} is complete'.format(np.round(current_learning_rate, 2)))

    # Obtain predictions
    current_y_pred_train = current_gbt.predict(X_train)
    current_y_pred_val = current_gbt.predict(X_val)

    # Compute accuracy scores
    current_train_acc = accuracy_score(y_train, current_y_pred_train)
    current_val_acc = accuracy_score(y_val, current_y_pred_val)

    # Obtain predicted probabilities for class '1'
    current_train_probabilities = current_gbt.predict_proba(X_train)[:, 1]
    current_val_probabilities = current_gbt.predict_proba(X_val)[:, 1]

    # Compute ROC AUC scores
    current_train_auc = roc_auc_score(y_train, current_train_probabilities)
    current_val_auc = roc_auc_score(y_val, current_val_probabilities)

    tempdf = pd.DataFrame(index = [indexcount],
                          data = {'Learning Rate': current_learning_rate,
                                  'Training Accuracy': current_train_acc,
                                  'Validation Accuracy': current_val_acc,
                                  'Training ROC AUC': current_train_auc,
                                  'Validation ROC AUC': current_val_auc})

    performance_df = pd.concat([performance_df, tempdf])

performance_df.set_index('Learning Rate')

In [None]:
# Visualise variation in validation accuracy scores with respect to the learning rate
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Learning Rate', y = 'Validation Accuracy', marker = 'o', markersize = 12)
plt.title('Validation Accuracy Scores by Learning Rate')
plt.ylabel('Accuracy')
plt.xlabel('Learning Rate')
plt.xticks(learning_rates);

In [None]:
# Visualise variation in validation ROC AUC scores with respect to the learning rate
plt.figure(figsize = (10, 4))

sns.lineplot(data = performance_df, x = 'Learning Rate', y = 'Validation ROC AUC', marker = 'o', markersize = 12)
plt.title('Validation ROC AUC Scores by Learning Rate')
plt.ylabel('ROC AUC Score')
plt.xlabel('Learning Rate')
plt.xticks(learning_rates);

## Subpart 4 - Hyperparameter Tuning: Combinations of Hyperparameters
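
The grid search scores each combination with k-fold cross-validation: the training rows are divided into k folds, and each fold takes one turn as the held-out set while the model is fitted on the rest. Here is a minimal sketch of the index bookkeeping in plain Python. Contiguous, unshuffled folds are an assumption made for simplicity; for classifiers scikit-learn uses a stratified variant by default, which also keeps the class proportions similar across folds. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, held_out_indices) for each of the folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(n_samples=12, n_folds=5))
for train, held_out in folds:
    print('held out:', held_out, '| train size:', len(train))
```

Every row is held out exactly once, so the mean score across folds uses all of the training data for evaluation without ever scoring a model on rows it was fitted on.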

In [None]:
# Initialise a basic gradient boosted classifier model
# Set the class weight as 'balanced'
# Set the random state parameter to 123
base_grid_model = LGBMClassifier(class_weight = 'balanced', random_state = 123, verbose = -1)

# Define a range of hyperparameter values to tune for and store them in a dictionary
parameters_grid = {'n_estimators': [50, 200],
                   'max_depth': [7, 8, 9],
                   'learning_rate': [0.06, 0.08]}

# Set up a grid search with 'GridSearchCV', which cross-validates every combination in the grid
# Use ROC AUC score as a scoring metric
# Use the default number of cross-validation folds
# Set the 'verbose' parameter to 3 or more to display useful results during the process
grid = GridSearchCV(estimator = base_grid_model,
                    param_grid = parameters_grid,
                    scoring = 'roc_auc',
                    verbose = 4)

# Fit the model on the training data
grid_model = grid.fit(X_train, y_train)

# Print the optimal values of 'n_estimators', 'max_depth' and 'learning_rate'
best_n_estimators = grid_model.best_params_['n_estimators']
best_max_depth = grid_model.best_params_['max_depth']
best_learning_rate = grid_model.best_params_['learning_rate']
best_roc_auc_score = grid_model.best_score_

print('\n The optimal model has {} estimators, each of maximum tree depth {}, a learning rate of {}, and it has an ROC AUC score of {}.'.format(best_n_estimators, best_max_depth, best_learning_rate, best_roc_auc_score))