# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning Repository ([link](https://archive.ics.uci.edu/ml/datasets/bank+marketing)).  The data is from a Portuguese banking institution and collects the results of multiple marketing campaigns.  We will use the article accompanying the dataset ([here](CRISP-DM-BANK.pdf)) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

The dataset represents data from 17 marketing campaigns conducted by a Portuguese bank between May 2008 and November 2010. During these campaigns, a total of 79,354 contacts were made. Out of these contacts, there were 6,499 successes, resulting in an 8% success rate.
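
The quoted success rate follows directly from the two counts above. A quick check, using only the figures cited from the paper (the variable names are our own):

```python
# Figures quoted from the paper's Materials and Methods section
total_contacts = 79_354
successes = 6_499

# Success rate = successful contacts / all contacts
success_rate = successes / total_contacts
print(f"Success rate: {success_rate:.2%}")
```

The unrounded rate is slightly above 8%, which matters later because it is also the share of the positive class a classifier has to find.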







### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [39]:
import pandas as pd

In [40]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [41]:
df.head()

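| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |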


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



age: Numeric feature, no data type conversion needed.

job: Categorical feature; includes an 'unknown' category that acts as a missing-value marker.

marital: Categorical feature; includes an 'unknown' category that acts as a missing-value marker.

education: Categorical feature; includes an 'unknown' category that acts as a missing-value marker.

default: Categorical feature; includes an 'unknown' category that acts as a missing-value marker.

housing: Categorical feature; includes an 'unknown' category that acts as a missing-value marker.

loan: Categorical feature; includes an 'unknown' category that acts as a missing-value marker.

contact: Categorical feature, no missing values mentioned.

month: Categorical feature, no missing values mentioned.

day_of_week: Categorical feature, no missing values mentioned.

duration: Numeric feature. The description says this attribute should be discarded for realistic predictive modeling, so it will be dropped.

campaign: Numeric feature, no missing values mentioned.

pdays: Numeric feature; the value 999 is a placeholder meaning the client was not previously contacted, not a real day count.

previous: Numeric feature, no missing values mentioned.

poutcome: Categorical feature; 'nonexistent' marks clients with no previous campaign outcome.

emp.var.rate: Numeric feature, no missing values mentioned.

cons.price.idx: Numeric feature, no missing values mentioned.

cons.conf.idx: Numeric feature, no missing values mentioned.

euribor3m: Numeric feature, no missing values mentioned.

nr.employed: Numeric feature, no missing values mentioned.

y: Target variable, binary categorical feature ('yes' or 'no'), no missing values mentioned. It needs to be encoded as 0/1 for modeling.

Duration needs to be dropped. Missing information in this dataset is encoded as the category 'unknown' (and as 999 in pdays) rather than as empty cells.
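
The data description lists 'unknown' as a category for several features and 999 as a placeholder in pdays, so a null count alone may understate how much information is missing. Below is a minimal sketch of how one might count these placeholders. It uses a small made-up frame (`toy`) in place of the real `df`, and `count_unknown` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy rows standing in for the real `df`; the values are illustrative only
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 3, 999, 999],
})


def count_unknown(frame: pd.DataFrame) -> pd.Series:
    """Count the literal string 'unknown' in each non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number")
    return (text_cols == "unknown").sum()


unknown_counts = count_unknown(toy)

# Share of rows where pdays holds the 999 placeholder
never_contacted_share = (toy["pdays"] == 999).mean()

print(unknown_counts)
print("Share of rows with pdays == 999:", never_contacted_share)
```

Applied to the full dataset, the same two checks would show which columns rely on placeholders heavily enough to deserve special handling.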

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

The business objective of the task is to predict whether a client will subscribe to a term deposit based on various client attributes and marketing campaign data collected by a Portuguese bank. This prediction can help the bank optimize its marketing strategies, allocate resources effectively, and improve its overall success rate in promoting term deposit subscriptions.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.

In [34]:
# Extract bank information features (columns 1 - 7)
bank_info_features = df.iloc[:, :7]

# Encode categorical features using one-hot encoding
bank_info_encoded = pd.get_dummies(bank_info_features)

# Transform the target column into binary numerical values
# (stored in a new variable so re-running the cell does not re-map df['y'])
target = df['y'].map({'yes': 1, 'no': 0})

# Prepare features for modeling
features = bank_info_encoded

# Display the prepared features and target column
print("Encoded Bank Information Features:")
print(features.head())
print("\nTarget Column:")
print(target.head())

Encoded Bank Information Features:
   age  job_admin.  job_blue-collar  job_entrepreneur  job_housemaid  \
0   56       False            False             False           True   
1   57       False            False             False          False   
2   37       False            False             False          False   
3   40        True            False             False          False   
4   56       False            False             False          False   

   job_management  job_retired  job_self-employed  job_services  job_student  \
0           False        False              False         False        False   
1           False        False              False          True        False   
2           False        False              False          True        False   
3           False        False              False         False        False   
4           False        False              False          True        False   

   ...  education_unknown  default_no  default_unkn

Encoded Bank Information Features: This section displays the bank information features after one-hot encoding. Each original categorical feature has been transformed into multiple binary features, where each binary feature indicates the presence or absence of a particular category. For example, the job feature has been transformed into multiple binary features (job_admin., job_blue-collar, etc.), where each feature represents a specific job category, and the value True or False indicates whether the individual belongs to that category or not.

Target Column: This section displays the target column after transformation into binary numerical values. The original target column contained categorical values ('yes' or 'no'), indicating whether a client subscribed to a term deposit. After transformation, 'yes' is represented as 1 and 'no' is represented as 0, allowing it to be used in machine learning models which require numerical inputs.

In summary, the encoded bank information features provide a numerical representation of categorical features, and the target column has been transformed into binary numerical values for modeling purposes.






### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [35]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and target (y)
X = features
y = target

# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the train and test sets
print("Train set shape - Features:", X_train.shape, "Target:", y_train.shape)
print("Test set shape - Features:", X_test.shape, "Target:", y_test.shape)

Train set shape - Features: (32950, 34) Target: (32950,)
Test set shape - Features: (8238, 34) Target: (8238,)
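
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the share of each class the same in the train and test sets. The sketch below shows the effect on synthetic labels (`X_toy` and `y_toy` are made up, with 10% positives); it is not the split used to produce the numbers in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 100 positives, 900 negatives
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test: ", y_te.mean())
```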


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?


In classification tasks, the baseline performance can often be established by considering the distribution of the target classes in the dataset. In this case, since the target variable represents whether a client subscribed to a term deposit or not (binary: 'yes' or 'no'), we can calculate the baseline accuracy by simply considering the majority class.

Here's how to calculate the baseline accuracy:

- Determine the frequency of each class in the target variable.
- The baseline accuracy is the proportion of the majority class in the dataset.

In [36]:
# Calculate the frequency of each class in the target variable
class_frequency = y_train.value_counts()

# Determine the majority class
majority_class = class_frequency.idxmax()

# Calculate the proportion of the majority class
baseline_accuracy = class_frequency[majority_class] / len(y_train)

print("Baseline Accuracy:", baseline_accuracy)

Baseline Accuracy: 0.887556904400607


The baseline accuracy represents the accuracy that a trivial classifier would achieve by simply predicting the majority class for every sample. Any classifier we build should aim to surpass this baseline accuracy to be considered effective.
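
The same baseline can also be expressed as a model object, which makes it easy to score alongside real classifiers. A minimal sketch using scikit-learn's `DummyClassifier` on made-up labels (`X_toy` and `y_toy` are illustrative, with a 90% majority class):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: nine negatives and one positive
y_toy = np.array([0] * 9 + [1])
# The dummy model ignores the features, so placeholder zeros are enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_score = baseline.score(X_toy, y_toy)
print("Baseline accuracy:", baseline_score)
```

Fitting this on `X_train`, `y_train` and scoring on `X_test`, `y_test` would give a baseline measured on the same data as the other models.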

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the Logistic Regression model
logistic_model = LogisticRegression(random_state=42, max_iter=1000)

# Train the model on the training data
logistic_model.fit(X_train, y_train)

# Predict the target values for the test data
y_pred = logistic_model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy of the Logistic Regression model:", accuracy)

Accuracy of the Logistic Regression model: 0.8865015780529255


### Problem 9: Score the Model

What is the accuracy of your model?

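The accuracy of the Logistic Regression model on the test set is 0.8865, essentially the same as the majority-class baseline of 0.8876 computed on the training set.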
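
Accuracy alone does not show which class the errors fall on. As a hedged illustration on made-up labels (not the bank data), a confusion matrix and recall reveal what a 0.9 accuracy can hide when a model only ever predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth: nine negatives and one positive
y_true = np.array([0] * 9 + [1])
# A model that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority, zero_division=0)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_majority)

print("Accuracy:", acc)
print("Recall for the positive class:", rec)
print(cm)
```

Running the same three calls on `y_test` and `y_pred` would show whether the logistic model is finding any subscribers at all.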







### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|       |            |                |               |

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import time
import warnings

# Ignore warnings about feature names
warnings.filterwarnings("ignore", message="X does not have valid feature names, but LogisticRegression")

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC()
}

# Initialize results dictionary
results = {'Model': [], 'Train Time': [], 'Train Accuracy': [], 'Test Accuracy': []}

# Fit and score each model
for name, model in models.items():
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time
    
    # Score on the same DataFrames used for fitting so feature names match
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    
    results['Model'].append(name)
    results['Train Time'].append(round(train_time, 3))
    results['Train Accuracy'].append(round(train_accuracy, 3))
    results['Test Accuracy'].append(round(test_accuracy, 3))

# Create DataFrame
results_df = pd.DataFrame(results)

# Display the results DataFrame
print(results_df)


                 Model  Train Time  Train Accuracy  Test Accuracy
0  Logistic Regression       2.909           0.888          0.887
1                  KNN       0.011           0.891          0.875
2        Decision Tree       0.081           0.917          0.861
3                  SVM       6.480           0.888          0.887
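
One way to read the table is to look at the gap between train and test accuracy for each model. The sketch below copies the accuracies reported above into a new frame (named `reported` here to avoid clashing with `results_df`) and adds a `Gap` column; the column name is our own.

```python
import pandas as pd

# Accuracies copied from the results table above
reported = pd.DataFrame({
    "Model": ["Logistic Regression", "KNN", "Decision Tree", "SVM"],
    "Train Accuracy": [0.888, 0.891, 0.917, 0.888],
    "Test Accuracy": [0.887, 0.875, 0.861, 0.887],
})

# A large positive gap means the model fits the training data
# noticeably better than it generalizes
reported["Gap"] = (reported["Train Accuracy"] - reported["Test Accuracy"]).round(3)

largest_gap_model = reported.loc[reported["Gap"].idxmax(), "Model"]

print(reported)
print("Largest train/test gap:", largest_gap_model)
```

The Decision Tree has the largest gap, which is consistent with an unconstrained tree overfitting, while Logistic Regression and SVM sit at roughly the majority-class baseline on both splits.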


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric


To improve our models, we can explore the following steps:

### 1. Feature Engineering and Exploration:

We can perform more in-depth analysis of the existing features and possibly engineer new features that might better capture the underlying patterns in the data.
The data description above lists no gender feature, so that question is hypothetical for this dataset. In general, a feature such as gender that adds little predictive power and raises concerns about bias is a candidate for removal; if it is relevant to the business objective and contributes meaningfully to the model's performance, it can be retained.
Conducting correlation analysis, feature importance analysis, and visualizations can help in understanding the relationships between features and the target variable.
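
As one concrete example of engineering a feature, the 999 placeholder in pdays can be turned into an explicit flag so that models do not treat it as a real day count. This is a sketch on a made-up column; the new column names `previously_contacted` and `pdays_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['pdays']; 999 means "not previously contacted"
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag whether the client was contacted in an earlier campaign
toy["previously_contacted"] = toy["pdays"] != 999

# Keep real day counts and replace the placeholder with NaN
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999, np.nan)

print(toy)
```

Whether the flag, the cleaned count, or both help a given model is something to verify with cross-validation rather than assume.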

### 2. Hyperparameter Tuning and Grid Search:

For each model, we can perform hyperparameter tuning using techniques like grid search or random search to find the optimal set of hyperparameters that maximize the model's performance.
For example, in KNN, we can tune the number of neighbors, in Decision Trees, we can tune the maximum depth or minimum samples split, and in SVM, we can tune the kernel type and regularization parameters.
Cross-validation can be used to ensure that the model performance estimates are robust and not overfitting to the training data.
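
The tuning loop described above can be sketched with `GridSearchCV`. The example runs on synthetic data from `make_classification` (`X_syn`, `y_syn`) rather than the bank features, and the grid values and the choice of F1 as the scoring metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X_train / y_train
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, weights=[0.85, 0.15], random_state=42
)

# Scaling lives inside the pipeline so it is refit on each CV fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]}

# 5-fold cross-validated search over the grid
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Best CV F1:", round(search.best_score_, 3))
```

The same pattern applies to the other models by swapping the estimator and the grid, for example `max_depth` for a Decision Tree or `C` and `kernel` for an SVM.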

### 3. Adjust Performance Metric:

Depending on the business objective and the characteristics of the dataset, we may need to adjust the performance metric used to evaluate the models.
While accuracy is a common metric, it may not be suitable for imbalanced datasets where one class dominates the other.
Alternative metrics such as precision, recall, F1-score, ROC-AUC, or precision-recall curves might be more appropriate, especially for binary classification tasks with imbalanced classes.
The choice of the performance metric should align with the specific goals of the project and the relative importance of correctly predicting each class.
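
To make the alternative metrics concrete, the sketch below fits a class-weighted logistic regression on synthetic imbalanced data and reports per-class precision, recall and F1 together with ROC-AUC. The data (`X_syn`, `y_syn`) is generated, not the bank dataset, and `class_weight="balanced"` is one possible choice rather than a recommendation from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# class_weight="balanced" reweights errors so the minority class counts more
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# ROC-AUC is computed from predicted probabilities, not hard labels
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)

print(classification_report(y_te, clf.predict(X_te), zero_division=0))
print("ROC-AUC:", round(auc, 3))
```

For the marketing task, recall on the 'yes' class indicates how many likely subscribers the model finds, while precision indicates how many of the flagged clients are worth calling.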

##### Questions