# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

In [3]:
NumberOfMarketingCampaigns = 17

print(f'The dataset represents {NumberOfMarketingCampaigns} distinct marketing campaigns.')

The dataset represents 17 distinct marketing campaigns.


### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [4]:
import pandas as pd

In [7]:
# Alternative when the file is available at a local relative path:
# df = pd.read_csv('data/bank-additional/bank-additional-full.csv', sep=';')


import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/Colab Notebooks/module17_starter/data/bank-additional-full.csv'
df = pd.read_csv(file_path, sep=';')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
df.head()

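|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| - | --- | --- | ------- | --------- | ------- | ------- | ---- | ------- | ----- | ----------- | --- | -------- | ----- | -------- | -------- | ------------ | -------------- | ------------- | --------- | ----------- | - |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |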
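
The `sep=';'` argument matters because the file is semicolon-delimited rather than comma-delimited. The sketch below illustrates the difference on a tiny in-memory string; the three-column sample is an illustrative stand-in, not an excerpt of the real file.

```python
import io
import pandas as pd

# Illustrative semicolon-delimited sample (assumed layout, not the real file)
raw = "age;job;y\n56;housemaid;no\n57;services;no\n"

# With the correct separator, each field becomes its own column
with_sep = pd.read_csv(io.StringIO(raw), sep=";")

# With the default comma separator, every row collapses into one fused column
without_sep = pd.read_csv(io.StringIO(raw))

print(with_sep.shape)     # (2, 3)
print(without_sep.shape)  # (2, 1)
```

Checking `df.shape` or `df.columns` right after loading is a quick way to catch a wrong delimiter.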


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



In [11]:


print("Data Types:")
print(df.dtypes)
print("\nCategorical Features:")
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
  print(f"- {col}: {df[col].unique()}")

Data Types:
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

Categorical Features:
- job: ['housemaid' 'services' 'admin.' 'blue-collar' 'technician' 'retired'
 'management' 'unemployed' 'self-employed' 'unknown' 'entrepreneur'
 'student']
- marital: ['married' 'single' 'divorced' 'unknown']
- education: ['basic.4y' 'high.school' 'basic.6y' 'basic.9y' 'professional.course'
 'unknown' 'university.degree' 'illiterate']
- default: ['no' 'unknown' 'yes']
- housing: ['no' 'yes' 'unknown']
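
The data description uses the label `'unknown'` in several categorical columns and the sentinel `999` in `pdays`, so missing information may be encoded as ordinary values rather than as nulls. A minimal sketch of how such placeholders could be counted, using a small toy frame (the values and the `categorical_cols` list here are illustrative assumptions):

```python
import pandas as pd

# Toy frame mimicking a few columns of the bank dataset (illustrative values only)
toy = pd.DataFrame({
    "job": ["housemaid", "unknown", "services", "admin."],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 999, 6, 999],
})

# Count how often the 'unknown' placeholder appears in each categorical column
categorical_cols = ["job", "default"]
unknown_counts = toy[categorical_cols].eq("unknown").sum()

# Share of rows where pdays holds the "not previously contacted" sentinel
never_contacted_share = toy["pdays"].eq(999).mean()

print(unknown_counts)
print(f"Share with pdays == 999: {never_contacted_share:.2f}")
```

Applied to `df`, the same two expressions would show how much of each column is effectively missing before deciding whether to keep `'unknown'` as its own category.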

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [12]:
# prompt: prepare the features and target column for modeling with appropriate encoding and transformations.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define features to use (bank information features as requested)
bank_info_features = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
target = 'y'

# Separate features and target
X = df[bank_info_features]
y = df[target]

# Identify categorical and numerical features within the selected bank info features
categorical_features = X.select_dtypes(include='object').columns
numerical_features = X.select_dtypes(include='number').columns

# Preprocessing: One-Hot Encode categorical features and leave numerical features as they are
# Use handle_unknown='ignore' to handle potential new categories during testing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough' # Keep numerical features as they are
)

# Apply the preprocessing
X_processed = preprocessor.fit_transform(X)

# Encode the target variable 'y' (binary: 'yes'/'no')
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Features processed successfully:")
print(f"Original features shape: {X.shape}")
print(f"Processed features shape: {X_processed.shape}")
print(f"Target variable shape: {y_encoded.shape}")

# You can optionally split the data into training and testing sets now
# X_train, X_test, y_train, y_test = train_test_split(X_processed, y_encoded, test_size=0.25, random_state=42)


Features processed successfully:
Original features shape: (41188, 7)
Processed features shape: (41188, 34)
Target variable shape: (41188,)


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y_encoded, test_size=0.25, random_state=42)
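
Because the target is imbalanced, a purely random split can leave the train and test sets with slightly different class ratios. One optional variation, not used in the split above, is the `stratify` argument of `train_test_split`. A minimal sketch on toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 20% minority class (illustrative, not the real target)
y_toy = np.array([0] * 16 + [1] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())  # both 0.2
```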


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [14]:
# prompt: What is the baseline performance that our classifier should aim to beat?

# The baseline performance for a binary classification task is typically the accuracy achieved by predicting the majority class for all instances.
# We need to calculate the proportion of the majority class in the target variable `y_encoded`.

import numpy as np

# Calculate the counts of each class in the target variable
class_counts = np.bincount(y_encoded)

# The index of the majority class is the one with the highest count
majority_class_index = np.argmax(class_counts)

# The count of the majority class
majority_class_count = class_counts[majority_class_index]

# The total number of instances
total_instances = len(y_encoded)

# The baseline accuracy is the proportion of the majority class
baseline_accuracy = majority_class_count / total_instances

print(f"Class counts: {class_counts}")
print(f"Majority class index: {majority_class_index}")
print(f"Majority class count: {majority_class_count}")
print(f"Total instances: {total_instances}")
print(f"Baseline accuracy (predicting majority class): {baseline_accuracy:.4f}")

Class counts: [36548  4640]
Majority class index: 0
Majority class count: 36548
Total instances: 41188
Baseline accuracy (predicting majority class): 0.8873
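
The same majority-class baseline can also be expressed as a model object, which makes it easy to score on a held-out set with the same interface as the real classifiers. The sketch below uses scikit-learn's `DummyClassifier` on toy labels; the arrays are illustrative assumptions, not the bank data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 80% majority class in both train and test (illustrative)
y_train_toy = np.array([0] * 8 + [1] * 2)
y_test_toy = np.array([0] * 4 + [1] * 1)

# Features are ignored by the dummy model, so zeros are sufficient here
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_test_accuracy = baseline.score(X_test_toy, y_test_toy)

print(f"{baseline_test_accuracy:.2f}")  # 0.80
```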


### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [16]:
# prompt: Use Logistic Regression to build a basic model on your data.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Logistic Regression model.
# max_iter is raised above the scikit-learn default of 100 to give the solver room to converge
# on the unscaled 'age' column; C is left at its default of 1.0 (regularization strength to tune later).
log_reg_model = LogisticRegression(random_state=42, max_iter=1000)

# Train the model on the training data
log_reg_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = log_reg_model.predict(X_test)



### Problem 9: Score the Model

What is the accuracy of your model?

In [17]:
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_) # Use original class names

print(f"Logistic Regression Model Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)  # print the report so it renders as a table rather than an escaped string

Logistic Regression Model Accuracy: 0.8880

Classification Report:


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


              precision    recall  f1-score   support

          no       0.89      1.00      0.94      9144
         yes       0.00      0.00      0.00      1153

    accuracy                           0.89     10297
   macro avg       0.44      0.50      0.47     10297
weighted avg       0.79      0.89      0.84     10297
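
The report's support counts (9144 `no`, 1153 `yes`) and its zero recall for `yes` suggest the model predicts `no` for every test row. As a check on that reading, the sketch below rebuilds labels with those counts and scores an all-`no` prediction; the arrays are reconstructed from the reported counts, not taken from the actual model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Labels rebuilt from the support counts in the report (0 = 'no', 1 = 'yes')
y_true = np.array([0] * 9144 + [1] * 1153)

# Hypothetical prediction vector that always answers 'no'
y_all_no = np.zeros_like(y_true)

cm = confusion_matrix(y_true, y_all_no)
acc = accuracy_score(y_true, y_all_no)
rec = recall_score(y_true, y_all_no)

print(cm)  # rows: true class, columns: predicted class
print(f"accuracy={acc:.4f}, recall(yes)={rec:.2f}")  # accuracy=0.8880, recall(yes)=0.00
```

The accuracy of an all-`no` predictor matches the 0.8880 reported above, so on this feature set the model does no better than the majority-class baseline.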

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|       |            |                |               |

In [20]:
import pandas as pd
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression # Import Logistic Regression

# Initialize models with their default settings
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'K Nearest Neighbor': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(random_state=42),  # Note: SVM can be computationally expensive on larger datasets
}

# Fit each model, timing only the call to fit, then record train and test accuracy
results = []
for model_name, model in models.items():
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time
    results.append({
        'Model': model_name,
        'Train Time': train_time,
        'Train Accuracy': model.score(X_train, y_train),
        'Test Accuracy': model.score(X_test, y_test),
    })

# Create the results DataFrame
results_df = pd.DataFrame(results)

# Format the DataFrame for better readability
results_df['Train Time'] = results_df['Train Time'].map('{:.4f}s'.format)
results_df['Train Accuracy'] = results_df['Train Accuracy'].map('{:.4f}'.format)
results_df['Test Accuracy'] = results_df['Test Accuracy'].map('{:.4f}'.format)

print(results_df)

                 Model Train Time Train Accuracy Test Accuracy
0  Logistic Regression    0.3707s         0.8871        0.8880
1   K Nearest Neighbor    0.0025s         0.8896        0.8776
2        Decision Tree    0.7941s         0.9170        0.8648
3                  SVM   17.3348s         0.8871        0.8880
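
A single train/test split gives one timing and one accuracy figure per model. As an alternative sketch, `cross_validate` reports fit time and train/test scores per fold in one call. The example below runs on synthetic data from `make_classification` (an assumption for illustration, not the bank data) and includes only two fast models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the encoded features (illustrative)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

rows = []
for name, model in demo_models.items():
    # cross_validate returns per-fold arrays: fit_time, score_time, test_score, train_score
    cv = cross_validate(model, X_demo, y_demo, cv=5, return_train_score=True)
    rows.append({
        "Model": name,
        "Mean Fit Time (s)": cv["fit_time"].mean(),
        "Train Accuracy": cv["train_score"].mean(),
        "CV Accuracy": cv["test_score"].mean(),
    })

cv_results = pd.DataFrame(rows)
print(cv_results)
```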


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# --- More feature engineering and exploration (Gender feature) ---
# Examining the data description, there is no 'gender' feature.
# If there were a gender feature, we would need to explore:
# 1. Its distribution and relationship with the target variable.
# 2. Whether it introduces multicollinearity with other features.
# 3. Potential bias considerations related to the gender feature.
# Based on these explorations, we would decide whether to include it or not.
# Since 'gender' is not in this dataset, this point is addressed by noting its absence.

# --- Hyperparameter tuning and grid search ---
# Let's focus on tuning one model as an example, e.g., Decision Tree Classifier, and explain the process for others.

# Define the parameter grid for Decision Tree
dt_param_grid = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Initialize the GridSearchCV object
# Using a smaller subset of data for quicker experimentation is often useful
# For full tuning, use the entire training set (X_train, y_train)
# cv=5 specifies 5-fold cross-validation
# scoring='accuracy' specifies the metric to optimize
grid_search_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_param_grid, cv=5, scoring='accuracy')

# Perform the grid search on the training data
print("Performing Grid Search for Decision Tree...")
start_time = time.time()
grid_search_dt.fit(X_train, y_train)
grid_search_time = time.time() - start_time
print(f"Grid Search for Decision Tree took: {grid_search_time:.4f}s")

# Get the best parameters and best score
best_dt_params = grid_search_dt.best_params_
best_dt_score = grid_search_dt.best_score_

print("\nBest parameters for Decision Tree:")
print(best_dt_params)
print(f"Best cross-validation accuracy on training set: {best_dt_score:.4f}")

# Evaluate the best model on the test set
best_dt_model = grid_search_dt.best_estimator_
test_accuracy_tuned_dt = best_dt_model.score(X_test, y_test)
print(f"Test Accuracy with best Decision Tree model: {test_accuracy_tuned_dt:.4f}")

# Repeat this process for other models (KNN, Logistic Regression, SVM)
# For KNN: KNeighborsClassifier, parameters like n_neighbors
# For Logistic Regression: LogisticRegression, parameters like C, penalty, solver
# For SVM: SVC, parameters like C, kernel, gamma

# Example for KNN:
# knn_param_grid = {'n_neighbors': [3, 5, 7, 9]}
# grid_search_knn = GridSearchCV(KNeighborsClassifier(), knn_param_grid, cv=5, scoring='accuracy')
# grid_search_knn.fit(X_train, y_train)
# print("\nBest parameters for KNN:")
# print(grid_search_knn.best_params_)
# print(f"Best cross-validation accuracy on training set: {grid_search_knn.best_score_:.4f}")
# print(f"Test Accuracy with best KNN model: {grid_search_knn.score(X_test, y_test):.4f}")


# --- Adjust your performance metric ---
# The initial metric used is 'accuracy'. For imbalanced datasets (which is often the case in marketing campaigns, where 'yes' is the minority class),
# accuracy can be misleading.
# More appropriate metrics for imbalanced datasets include:
# - Precision: Of all the instances predicted as positive, how many were actually positive? Important when the cost of false positives is high.
# - Recall (Sensitivity): Of all the actual positive instances, how many were predicted correctly? Important when the cost of false negatives is high.
# - F1-Score: The harmonic mean of precision and recall, balancing both metrics.
# - ROC AUC: Measures the ability of a classifier to distinguish between classes.

# To use a different metric in GridSearchCV, change the `scoring` parameter.
# Example using 'recall':
# grid_search_dt_recall = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_param_grid, cv=5, scoring='recall')
# grid_search_dt_recall.fit(X_train, y_train)
# print("\nGrid Search optimized for Recall for Decision Tree:")
# print("Best parameters:", grid_search_dt_recall.best_params_)
# print(f"Best cross-validation recall on training set: {grid_search_dt_recall.best_score_:.4f}")
# print(f"Test Recall with best Decision Tree model (optimized for recall): {grid_search_dt_recall.score(X_test, y_test):.4f}")

# To evaluate models using multiple metrics after fitting:
y_pred_tuned_dt = best_dt_model.predict(X_test)
report_tuned_dt = classification_report(y_test, y_pred_tuned_dt, target_names=label_encoder.classes_)

print("\nClassification Report for Tuned Decision Tree:")
print(report_tuned_dt)  # print the report so it renders as a table rather than an escaped string

# The choice of metric depends on the specific business objective. If the goal is to identify as many potential subscribers ('yes') as possible, Recall might be prioritized. If the goal is to avoid wasting resources on contacting non-subscribers, Precision might be more important. F1-score or ROC AUC offer a balanced view.

Performing Grid Search for Decision Tree...
Grid Search for Decision Tree took: 115.3566s

Best parameters for Decision Tree:
{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best cross-validation accuracy on training set: 0.8869
Test Accuracy with best Decision Tree model: 0.8876

Classification Report for Tuned Decision Tree:


              precision    recall  f1-score   support

          no       0.89      1.00      0.94      9144
         yes       0.42      0.01      0.02      1153

    accuracy                           0.89     10297
   macro avg       0.66      0.50      0.48     10297
weighted avg       0.84      0.89      0.84     10297
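
The tuned tree still recovers almost none of the `yes` class, which is what optimizing for accuracy tends to produce on imbalanced data. A minimal sketch of tracking several metrics in one grid search and refitting on recall follows. It runs on synthetic data, and `class_weight="balanced"` is an added assumption that was not part of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the encoded features (illustrative)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Track recall, precision, and F1 together; refit the final model on the best recall
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42, class_weight="balanced"),  # assumption: reweight classes
    param_grid={"max_depth": [3, 5, None]},
    scoring={"recall": "recall", "precision": "precision", "f1": "f1"},
    refit="recall",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best params (by recall):", search.best_params_)
print(f"Mean CV recall: {search.best_score_:.3f}")
print("Mean CV precision per candidate:", search.cv_results_["mean_test_precision"])
```

With multi-metric scoring, `cv_results_` holds one `mean_test_<name>` entry per metric, so the precision cost of a recall-oriented choice can be read from the same search.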

##### Questions