* Importing the required modules

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

np.random.seed(86089106)

* Load the data files

In [2]:
X_train = pd.read_csv('universal-bank-X-train-data.csv') 
y_train = pd.read_csv('universal-bank-y-train-data.csv') 
X_test = pd.read_csv('universal-bank-X-test-data.csv') 
y_test = pd.read_csv('universal-bank-y-test-data.csv')

### Model the data

First, we will create a dataframe to hold all the results of our models.

In [3]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### Fitting a DTree classification model using Grid Search (parameter tuning set 1)

In [4]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': [2,10,50,100,200],  
    'min_samples_leaf': [1,5,10,20,50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10,25,50,100,200], 
    'max_depth': [5,10,20,30], 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 5000 candidates, totalling 25000 fits
The best recall score is 0.7231450719822813
... with parameters: {'criterion': 'gini', 'max_depth': 30, 'max_leaf_nodes': 100, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1, 'min_samples_split': 2}
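The candidate count in the log line above follows from the grid itself: an exhaustive search evaluates every combination, so the count is the product of the number of values listed for each hyperparameter. A minimal sketch in plain Python, with the grid from the cell above copied in so the snippet is self-contained (scikit-learn's `ParameterGrid` should report the same number through `len`):

```python
from math import prod

# Same grid as in the cell above, repeated so this snippet runs on its own.
param_grid = {
    'min_samples_split': [2, 10, 50, 100, 200],
    'min_samples_leaf': [1, 5, 10, 20, 50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10, 25, 50, 100, 200],
    'max_depth': [5, 10, 20, 30],
    'criterion': ['entropy', 'gini'],
}
kfolds = 5

# Exhaustive search: one candidate per combination of values.
n_candidates = prod(len(values) for values in param_grid.values())
# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * kfolds

print(n_candidates, n_fits)  # 5000 25000
```

Counting candidates this way before calling `fit` is a cheap check on how long a grid search is likely to run.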


In [5]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.4f} Precision={TP/(TP+FP):.4f} Recall={TP/(TP+FN):.4f} F1={2*TP/(2*TP+FP+FN):.4f}")

performance = pd.concat([performance, pd.DataFrame({'model':"DTree GridSearch", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Accuracy=0.9653 Precision=0.7079 Recall=0.7079 F1=0.7079


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
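The metric arithmetic in the cell above is repeated for every model in this notebook. One possible way to factor it out is sketched below. The helper name `classification_metrics` and the toy counts are hypothetical and not part of the original notebook; the formulas are the same ones used in the cell above, with a guard against division by zero added.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    def safe_div(numerator, denominator):
        # Avoid ZeroDivisionError when a model predicts no positives at all.
        return numerator / denominator if denominator else 0.0

    return {
        "Accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "Precision": safe_div(tp, tp + fp),
        "Recall": safe_div(tp, tp + fn),
        "F1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Toy counts, made up purely for illustration (not the bank data).
example = classification_metrics(tp=8, tn=85, fp=2, fn=5)
print({name: round(value, 4) for name, value in example.items()})
```

For a binary problem, the four counts can be unpacked from scikit-learn's matrix with `tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()`.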


### Fitting a DTree classification model using Random Search 

In [6]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(4,200),  
    'min_samples_leaf': np.arange(4,200),
    'min_impurity_decrease': np.arange(0.0001, 0.001, 0.00005),
    'max_leaf_nodes': np.arange(10, 200), 
    'max_depth': np.arange(3,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best recall score is 0.6854928017718714
... with parameters: {'min_samples_split': 166, 'min_samples_leaf': 10, 'min_impurity_decrease': 0.00030000000000000003, 'max_leaf_nodes': 153, 'max_depth': 12, 'criterion': 'entropy'}
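The value `0.00030000000000000003` in the reported parameters is a floating-point artifact: a grid built by adding multiples of a step such as `0.00005` cannot represent most of its points exactly. The sketch below, in plain Python, shows one way to tidy such a grid by rounding each point to the precision of the step. The 18-point grid is an assumption that mirrors the intended range of 0.0001 to 0.00095.

```python
start, step, n_points = 0.0001, 0.00005, 18

# Building the grid by multiplication accumulates binary rounding error.
raw_grid = [start + i * step for i in range(n_points)]

# Rounding to five decimals (the precision of the step) gives clean values.
clean_grid = [round(value, 5) for value in raw_grid]

print(raw_grid[4], clean_grid[4])
```

The rounding is cosmetic: the tree is unlikely to behave differently for values that differ only in the 20th decimal place, but the reported parameters become easier to read and reuse.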


In [7]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

performance = pd.concat([performance, pd.DataFrame({'model':"DTree RandomSearch", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Accuracy=0.9633333 Precision=0.6888889 Recall=0.6966292 F1=0.6927374


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
| 0 | DTree RandomSearch | 0.963333 | 0.688889 | 0.696629 | 0.692737 |


### Conducting an exhaustive grid search over a narrow range around the best parameters found by the random search

In [8]:
score_measure = "recall"

kfolds = 5
min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
min_impurity_decrease = rand_search.best_params_['min_impurity_decrease']
max_leaf_nodes = rand_search.best_params_['max_leaf_nodes']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']

param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'min_impurity_decrease': np.arange(min_impurity_decrease-0.0001, min_impurity_decrease+0.0001, 0.00005),
    'max_leaf_nodes': np.arange(max_leaf_nodes-2,max_leaf_nodes+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 1024 candidates, totalling 5120 fits
The best recall score is 0.6854928017718714
... with parameters: {'criterion': 'entropy', 'max_depth': 10, 'max_leaf_nodes': 151, 'min_impurity_decrease': 0.00020000000000000004, 'min_samples_leaf': 8, 'min_samples_split': 164}
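Two details of this output are worth noting. First, `np.arange(c-2, c+2)` excludes its stop value, so each window covers `c-2` to `c+1` rather than being centered on `c`; the log reports 1024 candidates, which is 4 values for each of the five numeric parameters. Second, the best score equals the random-search score, and every reported parameter sits at the lower edge of its window. That pattern is consistent with many candidates tying and the search reporting the first of them, although this is an inference from the output and not something the notebook verifies. A minimal sketch of the window arithmetic, using the random-search values from above and a hypothetical `window` helper:

```python
# Integer-valued best parameters reported by the random search above.
best = {
    "min_samples_split": 166,
    "min_samples_leaf": 10,
    "max_leaf_nodes": 153,
    "max_depth": 12,
}


def window(center, radius=2, symmetric=False):
    """Integer values around `center`; the stop value is exclusive, as in np.arange."""
    stop = center + radius + (1 if symmetric else 0)
    return list(range(center - radius, stop))


# As written in the notebook: four values, from center-2 to center+1.
as_written = {name: window(c) for name, c in best.items()}
# A symmetric alternative: five values, from center-2 to center+2.
symmetric = {name: window(c, symmetric=True) for name, c in best.items()}

print(as_written["min_samples_split"])  # [164, 165, 166, 167]
print(symmetric["min_samples_split"])   # [164, 165, 166, 167, 168]
```

Using the symmetric variant would grow the grid from 4 to 5 values per parameter, so whether it is worth the extra fits depends on how flat the score surface is near the optimum.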


In [9]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.4f} Precision={TP/(TP+FP):.4f} Recall={TP/(TP+FN):.4f} F1={2*TP/(2*TP+FP+FN):.4f}")

performance = pd.concat([performance, pd.DataFrame({'model':"DTree exhaustive Grid Search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Accuracy=0.9633 Precision=0.6889 Recall=0.6966 F1=0.6927


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
| 0 | DTree RandomSearch | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | DTree exhaustive Grid Search | 0.963333 | 0.688889 | 0.696629 | 0.692737 |


## Fitting and testing with Logistic Regression model

In [10]:
log_reg_model = LogisticRegression(penalty=None, max_iter=900)
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [11]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
| 0 | DTree RandomSearch | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | DTree exhaustive Grid Search | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | default logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |


## Fitting and testing with Logistic Regression L2 Regularization

In [12]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [13]:
model_preds = log_reg_L2_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
| 0 | DTree RandomSearch | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | DTree exhaustive Grid Search | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | default logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |
| 0 | L2 logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |


## Fitting and testing with Logistic Regression L1 Regularization

In [14]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [15]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
| 0 | DTree RandomSearch | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | DTree exhaustive Grid Search | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | default logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |
| 0 | L2 logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |
| 0 | L1 logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |


### Results

In [16]:
performance.sort_values(by=['Recall'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |
| 0 | L2 logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |
| 0 | L1 logistic | 0.978667 | 1.0 | 0.640449 | 0.780822 |
| 0 | DTree RandomSearch | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | DTree exhaustive Grid Search | 0.963333 | 0.688889 | 0.696629 | 0.692737 |
| 0 | DTree GridSearch | 0.965333 | 0.707865 | 0.707865 | 0.707865 |
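Every row in the table carries index 0 because each result was appended with `index=[0]`, and `sort_values` sorts in ascending order by default, so the model with the best recall appears last. A hedged sketch of a tidier variant follows, using three rows copied from the table above: collect the results as a list of dicts, build the frame once, then sort in descending order and reset the index. The names `rows`, `performance_demo` and `ranked` are illustrative only.

```python
import pandas as pd

# Three rows copied from the results table above.
rows = [
    {"model": "DTree GridSearch", "Accuracy": 0.965333, "Precision": 0.707865,
     "Recall": 0.707865, "F1": 0.707865},
    {"model": "DTree RandomSearch", "Accuracy": 0.963333, "Precision": 0.688889,
     "Recall": 0.696629, "F1": 0.692737},
    {"model": "default logistic", "Accuracy": 0.978667, "Precision": 1.0,
     "Recall": 0.640449, "F1": 0.780822},
]

# Building the frame once from a list of dicts gives a clean 0..n-1 index.
performance_demo = pd.DataFrame(rows)

# Best recall first; reset_index(drop=True) renumbers the rows after sorting.
ranked = performance_demo.sort_values(by="Recall", ascending=False).reset_index(drop=True)
print(ranked)
```

Building the frame in one step also avoids concatenating onto an empty DataFrame, which newer pandas versions may warn about.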


## Conclusion

In this notebook, I have tried to identify the best predictive model among those relevant to the case. To determine the best model for predicting potential new Certificate of Deposit (CD) account customers for Universal Bank, we need to choose an evaluation metric that aligns with the bank's requirements. I tuned the decision tree models for recall, on the assumption that the bank's goal is to identify as many CD account customers as possible, even if that means including some false positives.

From the resulting performance metrics (accuracy, precision, recall, and F1 score), we can consider the following:

Accuracy: This metric measures the overall correctness of the model's predictions, indicating the proportion of correctly classified instances. However, accuracy alone might not be the best choice when the dataset is imbalanced, with one class (here, non-customers) significantly outnumbering the other (potential CD account customers). In that situation a model can reach high accuracy while still missing most of the minority class.

Precision: Precision represents the proportion of correctly predicted positive instances (potential CD account customers) out of all instances predicted as positive. It focuses on minimizing false positives, which could be important if the bank wants to avoid targeting individuals who are not likely to become CD account customers.

Recall: Recall (sensitivity) measures the proportion of correctly predicted positive instances out of all actual positive instances. It is particularly relevant when the goal is to identify as many potential CD account customers as possible, even if it means including some false positives.

F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is useful when both false positives and false negatives are equally important. It is a good choice when there is a trade-off between precision and recall, and the bank wants to consider both aspects simultaneously.
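For reference, the four definitions above can be written in terms of the confusion-matrix counts used throughout this notebook (TP, TN, FP, FN). The last line also shows that the harmonic-mean form of the F1 score reduces to the expression computed in the code cells:

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```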

Based on the resulting evaluation metrics, the three logistic regression variants produce identical results and have the best accuracy (0.978667), precision (1.0), and F1 score (0.780822), but also the lowest recall (0.640449). The decision tree models perform best on recall, with DTree GridSearch achieving the highest recall of 0.707865.
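To put the accuracy figures in context, it helps to compare them with a trivial baseline. The reported logistic metrics (precision 1.0, recall 0.640449, accuracy 0.978667) are consistent with a test set of 1,500 rows containing 89 positives. These counts are inferred from the table and not stated in the notebook, so treat them as an assumption. Under that assumption, a model that predicts "no CD account" for every customer already scores about 94% accuracy:

```python
# Assumed counts, inferred from the reported metrics (not given in the notebook).
n_total = 1500     # assumed test-set size
n_positive = 89    # assumed number of actual CD account customers

# A model that always predicts the majority class gets every negative right.
baseline_accuracy = (n_total - n_positive) / n_total

# Logistic model under the same assumption: 57 true positives, no false positives.
tp, fp, fn = 57, 0, 32
tn = n_total - tp - fp - fn
logistic_accuracy = (tp + tn) / n_total
logistic_recall = tp / (tp + fn)

print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"logistic accuracy: {logistic_accuracy:.6f}")
print(f"logistic recall:   {logistic_recall:.6f}")
```

Seen this way, the gap in accuracy between the models is small compared with the gap in recall, which is one reason accuracy is a weak basis for choosing between them here.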

Finally, the best evaluation metric to choose depends on the priorities of Universal Bank. If the bank wants to identify as many potential customers as possible, recall is the most important metric to consider. If minimizing false positives (ensuring that the customers identified are truly potential customers) is the priority, precision should be emphasized. If precision and recall are equally important, the F1 score provides a balanced assessment.
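The trade-off between precision and recall can often be steered directly by changing the probability cutoff that turns a model's scores into class labels; for a fitted scikit-learn classifier, such scores would typically come from `predict_proba`. The sketch below uses made-up labels and scores, not the bank data, and a hypothetical helper `precision_recall_at`. It only illustrates that lowering the cutoff tends to raise recall at the expense of precision.

```python
# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.45, 0.3, 0.1, 0.1, 0.05, 0.05]


def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when predicting positive for scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.5, 0.35, 0.15):
    precision, recall = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the default-style cutoff of 0.5 gives perfect precision but finds only half of the positives, a pattern similar to the logistic results above; whether a lower cutoff is worthwhile depends on how the bank weighs a missed customer against a wasted contact.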
