# KNN Assignment
*Name:* Zach Novak

*PID:* za659148

*Date:* 3/9/2025

The dataset 'bank_marketing.csv' relates to the direct marketing campaigns (phone calls) of a banking institution. The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Important: When you read the CSV file, you might need to adjust the columns for better printing.

For more information about the dataset, please check out the link below:
https://archive.ics.uci.edu/dataset/222/bank+marketing
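
As a minimal sketch of the "better printing" note above: the snippet below reads a tiny `;`-separated sample (a made-up stand-in for the real file, not actual rows from it) and widens the pandas display so columns are not truncated. The display settings chosen here are only one reasonable option.

```python
import io
import pandas as pd

# Toy stand-in for the real file: two made-up rows in the same ';'-separated layout.
raw = "age;job;marital;y\n56;housemaid;married;no\n37;services;married;yes\n"

# sep=';' splits the columns correctly instead of reading one wide column.
sample = pd.read_csv(io.StringIO(raw), sep=";")

# Show every column when printing instead of truncating with '...'.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)

print(sample.head())
```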

## Step 1: import necessary libraries and load the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [None]:
# read the CSV with the ; separator
df = pd.read_csv('data/bank_marketing.csv', sep=';')
print(df.head(1))
print(df.shape)
print(df.info())

# check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# summary statistics for numerical columns
print("\nSummary statistics:")
print(df.describe())

## Step 2: Preprocessing

We can see from the above code that there are no missing values in the dataset.

Let's take an inventory of all the unique values in each feature so we can plan the preprocessing approach. This can also help spot bad data or slight variations within the data.

In [None]:
# list of columns to check for unique values
columns_to_check = df.columns

# print unique values for each column
for column in columns_to_check:
    unique_values = df[column].unique()
    print(f"Unique values in '{column}':\n{unique_values}\n")

Description of features:

- age: 
    - age of the client
- job: 
    - the job of the client
- marital: 
    - marital status of the client
- education: 
    - education level of the client
- default: 
    - whether the client has credit in default
- housing: 
    - whether the client has a housing loan
- loan: 
    - whether the client has a personal loan
- contact: 
    - type of contact method (cellular or telephone)
- month: 
    - month of the last contact
- day_of_week: 
    - day of the week of the last contact
- duration: 
    - duration of the last contact in seconds
- campaign: 
    - number of contacts during the current campaign
- pdays: 
    - number of days since the client was last contacted in a previous campaign (999 indicates no previous contact)
- previous: 
    - number of contacts performed before the current campaign
- poutcome: 
    - outcome of the previous marketing campaign (nonexistent, failure, success)
- emp.var.rate: 
    - employment variation rate (economic indicator)
- cons.price.idx: 
    - consumer price index (economic indicator)
- cons.conf.idx: 
    - consumer confidence index (economic indicator)
- euribor3m: 
    - 3-month Euribor rate (economic indicator)
- nr.employed: 
    - number of employees (economic indicator)
- y: 
    - whether the client subscribed to a term deposit (target variable)

To formulate my preprocessing approach, I first categorize each feature into one of two broad types: numeric and categorical.

Numerical features in the dataset are:
- age
- duration
- campaign
- pdays
- previous
- emp.var.rate
- cons.price.idx
- cons.conf.idx
- euribor3m
- nr.employed

Categorical features are:
- job
- marital
- education
- default
- housing
- loan
- contact
- month
- day_of_week
- poutcome
- y
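
The numeric/categorical split listed above can also be derived from the column dtypes rather than typed by hand. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not the notebook's `df`) and assumes categorical columns were read in as text.

```python
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns.
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "duration": [261, 149, 226],
    "job": ["housemaid", "services", "admin."],
    "y": ["no", "no", "yes"],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split still deserves a manual check, because a column such as pdays is numeric by dtype but carries a placeholder value.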

We can:
- Remove duplicate entries from the dataset.
- Encode variables into a more machine-readable format.
- Handle the '999' placeholder in the pdays feature by creating a new feature called "first_campaign", since the '999' value means the client wasn't contacted before.

We have already validated:
- Data integrity, by checking the features for typos and unreasonable values (for example, age > 100).
- That there are no NULL values.

Later in the notebook, you will see resampling done to the dataset to address a sample imbalance that was identified during the initial training process.


In [None]:
# check for duplicates
print("\nNumber of duplicates in the initial dataset:", df.duplicated().sum())

# drop duplicates
df.drop_duplicates(inplace=True)
print("\nNumber of duplicates after dropping:", df.duplicated().sum())

### Numeric Features

Most of these features will be scaled shortly.

#### Pdays

In [None]:
# create the 'first_campaign' feature
def first_campaign(pdays):
    if pdays == 999:
        return 1
    else:
        return 0

df['first_campaign'] = df['pdays'].apply(first_campaign)

# display the first few rows to verify the new feature
print(df[['pdays', 'first_campaign']].head(10))

# print unique values and statistics for the new feature
print("\nUnique values in 'first_campaign':", df['first_campaign'].unique())
print("\nCount of first campaign vs returning customers:")
print(df['first_campaign'].value_counts())
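
The same flag can be built without a helper function by comparing the whole column at once. This is a sketch on a made-up `toy` frame; it is intended to produce the same 0/1 values as the `first_campaign` function above.

```python
import pandas as pd

# Made-up pdays values: 999 marks a client with no previous contact.
toy = pd.DataFrame({"pdays": [999, 6, 999, 3]})

# The comparison yields True/False per row; astype(int) turns that into 1/0.
toy["first_campaign"] = (toy["pdays"] == 999).astype(int)

print(toy)
```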

### Categorical Features

These can be further divided into binary features (two values) and multi-valued categorical features (more than two values).

#### Encoding

In [None]:
# shape of the dataset before encoding
print("Shape of the dataset before encoding:", df.shape)

# first use Label Encoder for binary categorical variables
binary_features = ['default', 'housing', 'loan', 'y']
label_encoder = LabelEncoder()
for feature in binary_features:
    df[feature] = label_encoder.fit_transform(df[feature])

# use one-hot encoding for non-binary categorical variables
categorical_features = ['job', 'contact', 'marital', 'education', 'month', 'day_of_week', 'poutcome']
df_encoded = pd.get_dummies(df, columns=categorical_features)

# print the shape of the new dataframe to see how many features we now have
print("Shape of the dataset after encoding:", df_encoded.shape)
print("\nNew features:", df_encoded.columns.tolist())
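
One detail worth checking when using `LabelEncoder` on "binary" columns: it assigns codes in sorted order of the values it sees, so a third value changes what each code means. The sketch below assumes, purely for illustration, that a column contains an 'unknown' entry; the unique-value inventory earlier in the notebook shows whether that is the case for the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values; the 'unknown' entry is an assumption for illustration.
values = ["no", "yes", "unknown", "no"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# Classes are sorted alphabetically, so 'yes' no longer maps to 1 here.
print(list(encoder.classes_))
print(codes.tolist())

# An explicit mapping makes the intended coding visible; unmapped values become NaN.
mapped = pd.Series(values).map({"no": 0, "yes": 1})
print(mapped.isna().sum(), "value(s) left unmapped")
```

An explicit mapping forces a decision about what to do with the extra value, for example treating it as its own one-hot category.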

The age, duration, campaign, pdays, cons.price.idx, cons.conf.idx, euribor3m, nr.employed, and emp.var.rate features are all quantitative. However, pdays uses the special indicator 999, which would be incorrectly treated as an outlier. The economic indicator features have very narrow ranges, and although some extreme values may look like outliers, they reflect important periods of economic activity that can impact sales.
Let's review the appropriate features to see if there are any outliers that could skew the training.

In [None]:
#@ Initial check of feature distribution

# list of quantitative features to check for outliers
quantitative_features = ['age','duration', 'campaign']

# create box plots and histograms for each quantitative feature
fig, axes = plt.subplots(len(quantitative_features), 2, figsize=(15, 4*len(quantitative_features)))
fig.suptitle('Distribution and Outliers of Quantitative Features', fontsize=16, y=1.02)

for i, feature in enumerate(quantitative_features):
    # box plot
    sns.boxplot(x=df[feature], ax=axes[i,0])
    axes[i,0].set_title(f'Box plot of {feature}')
    axes[i,0].set_xlabel('Value')
    
    # histogram with KDE
    sns.histplot(df[feature], kde=True, ax=axes[i,1])
    axes[i,1].set_title(f'Distribution of {feature}')
    axes[i,1].set_xlabel('Value')

plt.tight_layout()
plt.show()

The visualizations, which help us understand the data better, show that there are outliers. Filtering them is beneficial because KNN is a distance-based algorithm, so extreme values can distort the distances between points.

In [None]:
#@ Remove outliers using IQR

# print the shape of the dataset before removing outliers
print("Shape of the dataset before removing outliers: ", df.shape)

# function to remove outliers using IQR
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# remove outliers for each quantitative feature
for feature in quantitative_features:
    df = remove_outliers(df, feature)

# Print the shape of the dataset after removing outliers
print("Shape of the dataset after removing outliers: ", df.shape)
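
To make the IQR rule concrete, here is a small worked example on made-up numbers. It also shows one way to carry the row filter over to a frame that was built earlier, by selecting on the surviving index; `encoded_earlier` is a hypothetical stand-in for a frame such as `df_encoded`, and whether that alignment is wanted is a design decision.

```python
import pandas as pd

# Made-up durations with one obvious extreme value.
toy = pd.DataFrame({"duration": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

q1 = toy["duration"].quantile(0.25)   # 3.25
q3 = toy["duration"].quantile(0.75)   # 7.75
iqr = q3 - q1                         # 4.5
lower = q1 - 1.5 * iqr                # -3.5
upper = q3 + 1.5 * iqr                # 14.5

# Keep only rows inside [lower, upper]; the value 100 is dropped.
filtered = toy[toy["duration"].between(lower, upper)]
print(len(toy), "->", len(filtered), "rows")

# A frame derived from the unfiltered data can be aligned by index.
encoded_earlier = toy.assign(flag=1)
encoded_aligned = encoded_earlier.loc[filtered.index]
print(encoded_aligned.shape)
```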

In [None]:
#@ Check for outliers again

# create box plots and histograms for each quantitative feature
fig, axes = plt.subplots(len(quantitative_features), 2, figsize=(15, 4*len(quantitative_features)))
fig.suptitle('Distribution and Outliers of Quantitative Features', fontsize=16, y=1.02)

for i, feature in enumerate(quantitative_features):
    # box plot
    sns.boxplot(x=df[feature], ax=axes[i,0])
    axes[i,0].set_title(f'Box plot of {feature}')
    axes[i,0].set_xlabel('Value')
    
    # histogram with KDE
    sns.histplot(df[feature], kde=True, ax=axes[i,1])
    axes[i,1].set_title(f'Distribution of {feature}')
    axes[i,1].set_xlabel('Value')

plt.tight_layout()
plt.show()

In [None]:
# preview of the now encoded dataset
df_encoded.head()

## Step 3: Scale the features (important for KNN)

Scaling is very important for a KNN model since KNN is a distance-based algorithm.

Let's scale the numerical features.

In [None]:
# create a scaler object
scaler = StandardScaler()

numerical_features = ['age','duration','campaign','previous','emp.var.rate','cons.price.idx']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# verify the scaling by checking mean (should be ~0) and std (should be ~1)
print("Scaled features statistics:")
print(df[numerical_features].describe())
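
A small worked example of what `StandardScaler` computes, on made-up ages. One thing to expect when verifying with `describe()`: the scaler divides by the population standard deviation, while pandas reports the sample standard deviation, so the printed std is slightly above 1 rather than exactly 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up ages: mean 35, population standard deviation sqrt(125).
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})

# z = (x - mean) / std, computed column by column.
scaled = StandardScaler().fit_transform(toy[["age"]]).ravel()

print("scaled values:", scaled.round(4))
print("numpy std (population, ddof=0):", scaled.std())
print("pandas std (sample, ddof=1):", pd.Series(scaled).std())
```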

One-hot encoding increases dimensionality, which can hurt the KNN model's performance. Because of this, a dimensionality reduction technique like PCA can help retain the most relevant information.

In [None]:
# shape of the dataset before PCA
print("Shape of the dataset before PCA:", df_encoded.shape)

# define the number of components to keep
n_components = 10

# initialize PCA
pca = PCA(n_components=n_components)

# separate features and target variable
X = df_encoded.drop(columns=['y'])
y = df_encoded['y']

# fit and transform the data
X_pca = pca.fit_transform(X)

# print the explained variance ratio to see how much variance is captured by each component
print("Explained variance ratio by each component:")
print(pca.explained_variance_ratio_)

# print the shape of the transformed data
print("Shape of the data after PCA:", X_pca.shape)
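
Rather than fixing the number of components up front, the cumulative explained variance can guide the choice. The sketch below uses synthetic data (not the bank dataset) in which three columns are noisy copies of the other three. Passing a fraction as `n_components` asks scikit-learn's `PCA` to keep just enough components to reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 6 columns, where the last 3 are noisy copies of the first 3.
base = rng.normal(size=(200, 3))
X_toy = np.hstack([base, base + 0.05 * rng.normal(size=(200, 3))])

# Fit with all components to inspect how variance accumulates.
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative.round(3))

# Keep the smallest number of components that explains at least 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_toy)
print("Components kept:", pca_95.n_components_)
```

Because PCA is variance-based, it is usually applied to features that are already on comparable scales.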

## Step 4: Define features (X) and target (y)

In [None]:
# assuming df_encoded is your fully preprocessed dataframe
y = df_encoded['y']
X = df_encoded.drop('y', axis=1)

## Step 5: Split into training and test sets (80 & 20)

In [None]:
# split the data into 80/20 train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
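
With an imbalanced target, a plain random split can leave the test set with a different share of positives than the training set. As a sketch on made-up labels (90 negatives, 10 positives), the `stratify` argument of `train_test_split` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 10% positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy preserves the 90/10 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train positives:", y_tr.sum(), "of", len(y_tr))
print("Test positives:", y_te.sum(), "of", len(y_te))
```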

## Step 6, 7 & 8: Find the optimal K using 5-fold cross-validation, plot the error rates & print the optimal K-value.

A small K can be noisy and sensitive to outliers, while a large K may smooth out patterns too much.

In [None]:
# define a range of k values to test
k_values = range(1, 41)  # range to test for k
error_rates = []

# use 5-fold cross-validation to compute the error rate for each k
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # compute cross-validated accuracy on the training data
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    error_rate = 1 - scores.mean()  # error rate = 1 - average accuracy
    error_rates.append(error_rate)
    print(f"k = {k}, Error Rate = {error_rate:.4f}")

# identify the optimal k
optimal_k = k_values[np.argmin(error_rates)]
print("\nOptimal K (with lowest error rate):", optimal_k)

# visualization of error rates vs. k values
plt.figure(figsize=(10, 6))
plt.plot(k_values, error_rates, marker='o', linestyle='--', color='blue')
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Error Rate')
plt.axvline(x=optimal_k, linestyle='--', color='red', label=f'Optimal K = {optimal_k}')
plt.legend()
plt.show()
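
As written, the scaling cell updates `df`, while the model inputs come from `df_encoded`, which was created before that cell ran, so the scaled values do not appear to reach the classifier. One way to guarantee scaling is applied, and refit inside each cross-validation fold, is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below runs on synthetic data from `make_classification`, not the bank dataset, and scores on recall, a choice made here because the positive class is rare.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (about 10% positives) standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)

results = {}
for k in (1, 5, 15):
    # The scaler is fit on the training part of each fold only, then applied to the held-out part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="recall")
    results[k] = scores.mean()
    print(f"k = {k}, mean recall = {results[k]:.3f}")
```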


## Step 9 & 10: Train the final KNN model with the optimal K and make predictions

In [None]:
# train the final KNN model with the optimal K
knn_final = KNeighborsClassifier(n_neighbors=optimal_k)
knn_final.fit(X_train, y_train)

# make predictions on the test set
y_pred = knn_final.predict(X_test)

# evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


## Step 11: Model evaluation

In [None]:
# Evaluate overall accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

# Compute the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualize the confusion matrix
plt.figure(figsize=(6, 5))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()

# Set the tick marks and labels; adjust class labels if necessary
classes = ['No', 'Yes']  # Change these if your labels differ
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()
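
The precision, recall, and F1 values in the classification report follow directly from the four cells of the confusion matrix. The sketch below works this out on made-up labels (not the notebook's predictions) and checks the hand calculation against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 8 negatives and 4 positives.
y_true = np.array([0] * 8 + [1] * 4)
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```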


#### Interpretation:

This model predicts the majority class (class 0) very well, as the model's high accuracy demonstrates.

For the minority class (class 1), the recall of 47% demonstrates how ineffective the model is at identifying term deposit subscribers. This recall means the model captures only 47% of the actual positive cases, so more than half of the term deposit subscribers (i.e., opportunities) are being missed. This is further reflected in the low F1 score (55%).

In additional experimentation, the data sampling will be reviewed for better handling of the minority class, or in other words, for better serving the business need of finding sales opportunities.

## Step 12: Additional experimentation through resampling

* Requires an additional package dependency: `pip install imbalanced-learn`

Oversampling with SMOTE:

In [None]:
from imblearn.over_sampling import SMOTE

# apply SMOTE only on the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# check the new class distribution
print("Before resampling:\n", y_train.value_counts())
print("\nAfter resampling:\n", pd.Series(y_train_resampled).value_counts())
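
For intuition about what SMOTE adds to the training set, the sketch below illustrates its core idea with plain NumPy and scikit-learn: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration on made-up points, not imbalanced-learn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])

# Nearest minority neighbours of each point; column 0 is the point itself.
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

synthetic = []
for i, point in enumerate(minority):
    j = rng.choice(neighbor_idx[i, 1:])   # pick one true neighbour at random
    lam = rng.random()                    # interpolation weight in [0, 1)
    synthetic.append(point + lam * (minority[j] - point))
synthetic = np.array(synthetic)

print(synthetic.round(3))
```

Because the new points are interpolations, they stay inside the region already covered by the minority class instead of duplicating existing rows.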


Recalculate the optimal K-value using the same methodology as previously in this notebook.

In [None]:
# define a range of k values to test on the resampled data
k_values_resampled = range(1, 41) # range to test for k
error_rates_resampled = []

# use 5-fold cross-validation on the resampled training data to compute error rates
for k in k_values_resampled:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_resampled, y_train_resampled, cv=5, scoring='accuracy')
    error_rate = 1 - scores.mean()
    error_rates_resampled.append(error_rate)
    print(f"k = {k}, Error Rate = {error_rate:.4f}")

# identify the optimal k value based on the resampled data
optimal_k_resampled = k_values_resampled[np.argmin(error_rates_resampled)]
print("\nOptimal K on resampled data:", optimal_k_resampled)

# visualize the error rates vs. k values for the resampled data
plt.figure(figsize=(10, 6))
plt.plot(k_values_resampled, error_rates_resampled, marker='o', linestyle='--', color='blue')
plt.title('Error Rate vs. K Value (Resampled Data)')
plt.xlabel('K Value')
plt.ylabel('Error Rate')
plt.axvline(x=optimal_k_resampled, linestyle='--', color='red', label=f'Optimal K = {optimal_k_resampled}')
plt.legend()
plt.show()


Train the KNN model on Resampled Data

In [None]:
# train a new KNN model using the optimal_k determined previously, but with resampled data
knn_final_resampled = KNeighborsClassifier(n_neighbors=optimal_k)
knn_final_resampled.fit(X_train_resampled, y_train_resampled)

# make predictions on the original test set
y_pred_resampled = knn_final_resampled.predict(X_test)

print(y_pred_resampled)

Evaluate the new model.

In [None]:
# evaluate the new model's performance
accuracy_resampled = accuracy_score(y_test, y_pred_resampled)
print("Test Accuracy after resampling:", accuracy_resampled)

# show the confusion matrix and classification report
cm_resampled = confusion_matrix(y_test, y_pred_resampled)
print("\nConfusion Matrix:")
print(cm_resampled)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_resampled))

# visualize the confusion matrix
plt.figure(figsize=(6, 5))
plt.imshow(cm_resampled, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix (After Resampling)')
plt.colorbar()

classes = ['No', 'Yes']
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()


#### Interpretation:

The recall for class 1 (potential term deposit subscribers) increased dramatically to 88%, meaning the model now identifies 88% of the actual positives.
The resampled data enhances the model's robustness when dealing with the minority class.

The model's accuracy dropped to 84%, which can happen when addressing class imbalances because the model now focuses more on detecting positives, even if that means sacrificing some accuracy on the majority class.

The reduced precision for class 1 (41%) means that while the model is catching more positive cases, more of its positive predictions are false positives. This is acceptable for the business use case because the false positives can be screened by a person, and it gives more confidence that the process is not missing opportunities. However, it is important to keep the cost of these screenings in mind, as the approach may not be worthwhile if there are too many false positives.

Overall, these results suggest the model is now more effective at identifying the minority class.