__Assignment 2__  
__Name: Bryan Ang Wei Ze__  
__Student ID: 2301397__  

# CSD3185/CSD3186: Assignment 2

## Topics Covered:  
- Data Preprocessing
- Logistic Regression Classifier
- K-Nearest Neighbors (KNN) Classifier
- Hyperparameter Tuning


## Python and Dependency Versions  
To ensure reproducibility and compatibility, please check that you have the following Python and dependency versions installed:

- Python >= 3.12
- pandas == 2.3.3
- numpy == 2.4.1
- scikit-learn == 1.8.0

## Dataset
In this assignment, we'll use a bank marketing dataset, which focuses on direct marketing campaigns conducted via phone calls by a Portuguese banking institution. The objective is to predict whether a client will subscribe to a term deposit (coded as 1 for yes or 0 for no).

**IMPORTANT**: Download and use the dataset provided on Moodle for this assignment. Do NOT use any other version of the dataset available online, as it may differ in structure and content.

#### **Attribute Overview**   
|**Input Feature**     |**Description**                         |
|:-----------------------|:-------------------------------------------------------------------------------------------------------------------|
| `age`                | Age                                                                                                                     |
| `job`                | Type of job                                                                                                             |
| `marital`            | Marital status                                                                                                          |
| `education`          | Education level                                                                                                         |
| `default`            | Has credit in default?                                                                                                  |
| `housing`            | Has housing loan?                                                                                                       |
| `loan`               | Has personal loan?                                                                                                      |
| `contact`            | Contact communication type                                                                                              |
| `month`              | Last contact month of year                                                                                              |
| `day_of_week`        | Last contact day of the week                                                                                            |
| `duration`           | Last contact duration in seconds                                                                                        |
| `campaign`           | Number of contacts performed during this campaign for this client                                                       |
| `pdays`              | Number of days since last contact from a previous campaign (999 = not previously contacted)                             |
| `previous`           | Number of contacts performed before this campaign for this client                                                       |
| `poutcome`           | Outcome of the previous marketing campaign                                                                              |
| `emp_var_rate`       | Employment variation rate                                                                                               |
| `cons_price_idx`     | Consumer price index                                                                                                    |
| `cons_conf_idx`      | Consumer confidence index                                                                                               |
| `euribor3m`          | Euribor 3-month rate                                                                                                    |
| `nr_employed`        | Number of employees                                                                                                     |
| **Target variable**  | **Description**                                                                                                         |
| `y`                  | Has the client subscribed to a term deposit?                                                                            |


## Deliverables

Your submission for this assignment should be only __ONE__ file - this particular completed notebook file. 

Also, *RENAME* your file like this: __\<coursecode\>\_<assignment#>\_<your_full_name>.ipynb__  
E.g. CSD3185_A2_John_Doe.ipynb  

To complete this assignment, follow the instructions for each task in the sections below.

## IMPORTANT! READ THIS BEFORE STARTING...
- DO NOT delete existing cells, but you can add more cells anywhere in the notebook as necessary.
- DO NOT modify or comment out the content of the existing cells unless otherwise stated (e.g., for code implementation). However, DO NOT change the variable names that are already defined in the existing cells.
- Follow the file naming convention for the notebook file as spelled out above strictly.

Please adhere strictly to the instructions as stated above as failure to do so might result in deduction of marks by the autograder.

Your assignment begins after the line below!! Complete all the tasks as specified.

---

## 1. Data Loading and Preprocessing

Load some basic libraries upfront. You may add any other libraries you deem necessary below or later on where appropriate.

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import MinMaxScaler

__Task 1.1__

Load the banking dataset from the provided CSV file into a pandas DataFrame

In [2]:
# read the dataset as a pandas DataFrame into `data`
# place banking.csv in the same folder as this notebook; load it by filename only (no paths).
data = pd.read_csv('banking.csv')

__Task 1.2__

Perform the basic data exploration steps listed below to understand the structure and content of the dataset (use pandas DataFrame methods):

- Understand the dataset structure (number of rows and columns).
- Calculate and display the number of missing values in each column. 

In [3]:
data_shape = data.shape
data_shape

(41188, 21)

In [5]:
missing_values = data.isnull().sum()
missing_values

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64

__Task 1.3__

The dataset contains categorical features. Display all unique values for each categorical feature in the dataset to understand the possible categories.

In [7]:
# Select the columns stored as strings (object dtype)
categorical_columns = data.select_dtypes(include=['object']).columns

for col in categorical_columns:
    print(f"{col}: {data[col].unique()}")
    print()

job: ['blue-collar' 'technician' 'management' 'services' 'retired' 'admin.'
 'housemaid' 'unemployed' 'entrepreneur' 'self-employed' 'unknown'
 'student']

marital: ['married' 'single' 'divorced' 'unknown']

education: ['basic.4y' 'unknown' 'university.degree' 'high.school' 'basic.9y'
 'professional.course' 'basic.6y' 'illiterate']

default: ['unknown' 'no' 'yes']

housing: ['yes' 'no' 'unknown']

loan: ['no' 'yes' 'unknown']

contact: ['cellular' 'telephone']

month: ['aug' 'nov' 'jun' 'apr' 'jul' 'may' 'oct' 'mar' 'sep' 'dec']

day_of_week: ['thu' 'fri' 'tue' 'mon' 'wed']

poutcome: ['nonexistent' 'success' 'failure']



__Task 1.4__

The education column contains various levels of education. We can simplify this by grouping similar education levels together.

Group "basic.4y", "basic.6y", and "basic.9y" into a single category called "basic" in the `data` DataFrame.

In [8]:
data['education'] = data['education'].replace(['basic.4y', 'basic.6y', 'basic.9y'], 'basic')
data['education'].unique()

array(['basic', 'unknown', 'university.degree', 'high.school',
       'professional.course', 'illiterate'], dtype=object)

__Task 1.5__

Analyze the target variable y to understand its distribution and derive some basic insights. This analysis helps assess class imbalance and potential relationships between the target and other variables.

- Display the value counts of the target variable y using the `value_counts()` method
- Calculate the number and percentage of rows where y is 0 (no subscription) and 1 (subscription).
- Display the mean of numeric columns grouped by the target variable y using the `groupby()` method.

In [9]:
# Display value counts for the target variable
y_value_counts = data['y'].value_counts()
print("Value counts for the target variable 'y':")
print(y_value_counts)

Value counts for the target variable 'y':
y
0    36548
1     4640
Name: count, dtype: int64


In [11]:
# Calculate subscription statistics
count_no_sub = data['y'].value_counts()[0]
count_sub = data['y'].value_counts()[1]

pct_of_no_sub = count_no_sub / len(data)
pct_of_sub = count_sub / len(data)

print(f"Percentage of no subscription: {pct_of_no_sub * 100:.2f}%")
print(f"Percentage of subscription: {pct_of_sub * 100:.2f}%")

Percentage of no subscription: 88.73%
Percentage of subscription: 11.27%
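As a cross-check on the percentages above, the same shares can be obtained in one call with `value_counts(normalize=True)`. The sketch below does not load `banking.csv`; it rebuilds a stand-in target column from the class counts reported earlier (36,548 and 4,640), so the Series itself is an assumption made for illustration.

```python
import pandas as pd

# Stand-in target column built from the class counts reported above
# (assumption: this replaces loading banking.csv for the illustration)
y = pd.Series([0] * 36548 + [1] * 4640, name="y")

# normalize=True returns proportions instead of raw counts
class_share = y.value_counts(normalize=True).mul(100).round(2)
print(class_share)
```

On the real data, `data['y'].value_counts(normalize=True)` should give the same proportions without the intermediate count variables.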


In [12]:
# Display the mean of numeric columns grouped by the target variable
grouped_means = data.groupby('y').mean(numeric_only=True)

print("\nMean of numeric columns grouped by 'y':")
print(grouped_means)


Mean of numeric columns grouped by 'y':
         age    duration  campaign       pdays  previous  emp_var_rate  \
y                                                                        
0  39.911185  220.844807  2.633085  984.113878  0.132374      0.248875   
1  40.913147  553.191164  2.051724  792.035560  0.492672     -1.233448   

   cons_price_idx  cons_conf_idx  euribor3m  nr_employed  
y                                                         
0       93.603757     -40.593097   3.811491  5176.166600  
1       93.354386     -39.789784   2.123135  5095.115991  


__Task 1.6__

Based on the outputs observed in the previous steps, write a brief summary of your findings regarding the target variable distribution and any notable patterns in the data.

Class Imbalance:
The target variable `y` shows significant class imbalance with approximately 88.73% (36,548) non-subscribers and only 11.27% (4,640)
subscribers. This imbalance requires addressing through techniques like oversampling to prevent model bias toward the majority class.

Notable Patterns from Grouped Means:
- Duration: Subscribers have much longer average calls (about 553 s vs 221 s), suggesting longer conversations correlate with successful
conversions.
- Previous contacts: Subscribers average more contacts before this campaign (0.49 vs 0.13).
- Pdays: Subscribers have a lower average pdays (792 vs 984). Because 999 codes "not previously contacted", this mainly indicates
  that a larger share of subscribers had been contacted in a previous campaign.
- Economic indicators: Subscribers are associated with a lower employment variation rate (-1.23 vs 0.25) and a lower euribor3m
  (2.12 vs 3.81), suggesting economic conditions influence subscription decisions.

Data Quality:
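No missing values were found across all 21 columns, so no imputation is required. However, several categorical features (e.g. `job`, `education`, `default`) contain an `unknown` category, which acts as an implicit missing value.  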
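Since the attribute table defines `pdays = 999` as "not previously contacted", a plain group mean of `pdays` mixes real day counts with the sentinel. The following is a minimal sketch on hypothetical toy data (the values are invented for illustration, not taken from the dataset) of how the two pieces of information could be separated.

```python
import pandas as pd

# Hypothetical toy data: 999 marks clients with no previous contact
toy = pd.DataFrame({
    "pdays": [999, 999, 999, 3, 999, 6, 999, 999],
    "y":     [0,   0,   0,   0, 1,   1, 1,   0],
})

# True where the client was contacted in a previous campaign
contacted = toy["pdays"] != 999

# Share of previously contacted clients per class
share_contacted = contacted.groupby(toy["y"]).mean()

# Mean pdays computed only over previously contacted clients
mean_pdays_contacted = toy.loc[contacted].groupby("y")["pdays"].mean()

print(share_contacted)
print(mean_pdays_contacted)
```

Applied to the real `data` DataFrame, the same two quantities would show whether the gap in mean `pdays` comes from the share of previously contacted clients or from how recently they were contacted.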

__Task 1.7__

Categorical features often need to be encoded into numerical format for machine learning algorithms. Use one-hot encoding to convert all categorical features in the dataset into numerical format. 

Use the `pd.get_dummies()` function from pandas to achieve this. This approach is widely used to convert categorical variables into a format suitable for machine learning models. To learn more about pd.get_dummies() and its usage, refer to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

Create a function named `encode_categorical_features` that takes a pandas DataFrame as input and returns a new DataFrame with all categorical features one-hot encoded.

For the new column names generated by one-hot encoding, use the format `<column name>_<category value>`. For example, if the original column is `job` and one of its categories is `admin.`, the new column should be named `job_admin.`.

In [13]:
# Complete the function to one-hot encode categorical features
def encode_categorical_features(df):
    """One-hot encode all categorical features in the DataFrame.
    
    Args:
        df (pd.DataFrame): Input DataFrame with categorical features.
        
    Returns:
        pd.DataFrame: New DataFrame with one-hot encoded categorical features.
    """
    encoded_df = pd.get_dummies(df, dtype=int)
    return encoded_df

In [14]:
# Apply the encoding function to the dataset
data_encoded = encode_categorical_features(data)
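To illustrate the `<column name>_<category value>` naming, here is a small sketch on a hypothetical three-row DataFrame (the rows are made up; only the column names and category labels mirror the dataset). It shows that numeric columns pass through unchanged and that each category becomes its own 0/1 column.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["admin.", "technician", "admin."],
    "contact": ["cellular", "telephone", "cellular"],
})

# Same call as in encode_categorical_features
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))
print(encoded)
```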

__Task 1.8__

The dataset has class imbalance, as seen in the target variable distribution. Class imbalance can affect the performance of machine learning models. To address this, there are several techniques that can be employed, but for this assignment, we will focus on one specific technique: __over-sampling__.

Over-sampling is the process of randomly duplicating observations from the minority class to achieve a balanced dataset. The most common approach to over-sampling is to resample with replacement.

Perform over-sampling on the training data only. This is important to avoid data leakage and ensure that the model is evaluated on unseen data.

To help with this task, create a function named `split_data` that splits the dataset into training and testing sets. Use `train_test_split` inside it and make sure both sets keep the same class distribution as the full dataset (a stratified split).

Subsequently, implement a function named `oversample_minority_class` that takes the training set and the name of the target column as input. The function should return a new DataFrame with balanced classes in the target variable. Use the following steps in your implementation:

- Separate the majority and minority classes in the training data.
- Upsample the minority class by randomly duplicating its samples.
- Combine the upsampled minority class with the majority class to create a balanced dataset.

Note: Use the `resample` function from `sklearn.utils` to perform the upsampling. NO other libraries for handling imbalanced datasets should be used.

In [18]:
from sklearn.utils import resample

In [19]:
# Complete the function
def split_data(df, target_column, test_size, random_state):
    """Split the DataFrame into training and testing sets.
    
    Args:
        df (pd.DataFrame): Input DataFrame.
        target_column (str): Name of the target column.
        test_size (float): Proportion of the dataset to include in the test split.
        random_state (int): Random seed for reproducibility.
    
    Returns:
        pd.DataFrame, pd.DataFrame: train_data, test_data (including target)
    """
    X = df.drop(columns=[target_column])
    y = df[target_column]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    train_data = pd.concat([X_train, y_train], axis=1)
    test_data = pd.concat([X_test, y_test], axis=1)
    return train_data, test_data

In [20]:
# Split the encoded data into training and testing sets 80-20 split
train_data, test_data = split_data(data_encoded, 'y', test_size=0.2, random_state=42)

In [21]:
# Check the shapes of the resulting datasets
train_data.shape, test_data.shape

((32950, 62), (8238, 62))
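The effect of `stratify` can be checked on a small example. This sketch uses a hypothetical toy DataFrame with a 10% minority class (not the banking data) and confirms that both splits keep that share.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 900 negatives and 100 positives (10% minority)
toy = pd.DataFrame({"x": range(1000), "y": [0] * 900 + [1] * 100})

train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["y"]
)

# Minority share in each split
train_share = train["y"].mean()
test_share = test["y"].mean()
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

The same check on `train_data['y'].mean()` and `test_data['y'].mean()` should give values close to the overall subscription share.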

In [22]:
# Complete the function
def oversample_minority_class(df, target_col, random_state):
    """Over-sample the minority class in the DataFrame.
    
    Args:
        df (pd.DataFrame): Input DataFrame with imbalanced classes.
        target_col (str): Name of the target column.
        random_state (int): Random state for reproducibility.
    Returns:
        pd.DataFrame: New DataFrame with balanced classes.
    """
    # Separate majority and minority classes
    majority = df[df[target_col] == 0]
    minority = df[df[target_col] == 1]

    # Upsample minority class
    minority_upsampled = resample(
        minority,
        replace=True,
        n_samples=len(majority),
        random_state=random_state
    )

    # Combine majority with upsampled minority
    balanced_df = pd.concat([majority, minority_upsampled])
    return balanced_df

In [23]:
# Over-sample the minority class in the training data
train_data_balanced = oversample_minority_class(train_data, 'y', random_state=42)

In [24]:
# check train_data_balanced class distribution
balanced_class_distribution = train_data_balanced['y'].value_counts()
balanced_class_distribution

y
0    29238
1    29238
Name: count, dtype: int64
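To see why over-sampling is applied to the training data only, the sketch below deliberately does it in the wrong order on hypothetical toy data: it over-samples first and splits afterwards. Duplicated minority rows keep their original index label, so counting test rows whose label also appears in the training split gives a rough measure of leakage.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 90 majority rows and 10 minority rows
toy = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

# Wrong order, for illustration only: over-sample BEFORE splitting
majority = toy[toy["y"] == 0]
minority = toy[toy["y"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_up])

train, test = train_test_split(
    balanced, test_size=0.2, random_state=0, stratify=balanced["y"]
)

# Minority test rows whose original row (same index label) is also in training
test_minority = test[test["y"] == 1]
leaked = int(test_minority.index.isin(train.index).sum())
print(f"{leaked} of {len(test_minority)} minority test rows also appear in training")
```

With the order used in this notebook (split first, then over-sample `train_data`), the test set contains no copies of training rows.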

__Task 1.9__

Scale the features in the training and testing sets using `MinMaxScaler` from sklearn. 

Create a function named `scale_features` that takes the training and testing DataFrames as input and returns the scaled versions of both DataFrames. Ensure that the scaler is fitted only on the training data to prevent data leakage.

In [26]:
# Complete the function
def scale_features(train_df, test_df):
    """Scale features in training and testing DataFrames using MinMaxScaler.
    
    Args:
        train_df (pd.DataFrame): Training DataFrame.
        test_df (pd.DataFrame): Testing DataFrame.
    Returns:
        pd.DataFrame, pd.DataFrame: Scaled training and testing DataFrames.
    """
    scaler = MinMaxScaler()

    # Fit on training data only, transform both
    train_scaled = pd.DataFrame(
        scaler.fit_transform(train_df),
        columns=train_df.columns,
        index=train_df.index
    )

    test_scaled = pd.DataFrame(
        scaler.transform(test_df),
        columns=test_df.columns,
        index=test_df.index
    )

    return train_scaled, test_scaled

In [27]:
# Separate features and target variable in training and testing sets
X_train = train_data_balanced.drop(columns=['y'])
y_train = train_data_balanced['y']

X_test = test_data.drop(columns=['y'])
y_test = test_data['y']

In [29]:
# Scale features to the [0, 1] range (min-max scaling, fitted on training data only)
X_train_scaled, X_test_scaled = scale_features(X_train, X_test)

print(f"X_train_scaled range: [{X_train_scaled.min().min():.2f}, {X_train_scaled.max().max():.2f}]")
print(f"X_train_scaled shape: {X_train_scaled.shape}")

X_train_scaled range: [0.00, 1.00]
X_train_scaled shape: (58476, 61)
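Note that the [0, 1] range is only guaranteed for the data the scaler was fitted on. A minimal sketch with hypothetical values (not from the dataset) shows that test values outside the training range are mapped outside [0, 1], which is expected behaviour when the scaler is fitted on the training data only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature example
train = pd.DataFrame({"duration": [100.0, 200.0, 300.0]})
test = pd.DataFrame({"duration": [50.0, 400.0]})

# Fit on training data only: min = 100, max = 300
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)

# (50 - 100) / 200 = -0.25 and (400 - 100) / 200 = 1.5
print(test_scaled.ravel())
```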


---

## 2. Model Training and Evaluation

__Task 2.1__

Build a baseline logistic regression model and a K-Nearest Neighbors (KNN) model using the scaled training data.

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [31]:
# Instantiate logistic regression model with all default params
base_logreg = LogisticRegression()

# Fit the model on the training data
base_logreg.fit(X_train_scaled, y_train)

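| Parameter | Value |
|:--------------------|:-----------|
| `penalty`           | `'l2'`     |
| `dual`              | `False`    |
| `tol`               | `0.0001`   |
| `C`                 | `1.0`      |
| `fit_intercept`     | `True`     |
| `intercept_scaling` | `1`        |
| `class_weight`      | `None`     |
| `random_state`      | `None`     |
| `solver`            | `'lbfgs'`  |
| `max_iter`          | `100`      |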


In [32]:
# Instantiate KNN model with all default params
base_knn = KNeighborsClassifier()

# Fit the model on the training data
base_knn.fit(X_train_scaled, y_train)

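| Parameter | Value |
|:----------------|:---------------|
| `n_neighbors`   | `5`            |
| `weights`       | `'uniform'`    |
| `algorithm`     | `'auto'`       |
| `leaf_size`     | `30`           |
| `p`             | `2`            |
| `metric`        | `'minkowski'`  |
| `metric_params` | `None`         |
| `n_jobs`        | `None`         |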


__Task 2.2__

Evaluate both models on the scaled testing data using accuracy, precision, recall, and F1-score as metrics. Use the appropriate functions from `sklearn.metrics` to compute these metrics.

In [33]:
# Predictions for Logistic Regression
y_pred_logreg = base_logreg.predict(X_test_scaled)

print("=== Logistic Regression ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_logreg):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_logreg))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logreg))

=== Logistic Regression ===
Accuracy: 0.8619

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.86      0.92      7310
           1       0.44      0.89      0.59       928

    accuracy                           0.86      8238
   macro avg       0.71      0.87      0.75      8238
weighted avg       0.92      0.86      0.88      8238

Confusion Matrix:
[[6274 1036]
 [ 102  826]]


In [34]:
# Predictions for KNN
y_pred_knn = base_knn.predict(X_test_scaled)

print("=== K-Nearest Neighbors ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_knn))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

=== K-Nearest Neighbors ===
Accuracy: 0.8248

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.85      0.90      7310
           1       0.34      0.60      0.44       928

    accuracy                           0.82      8238
   macro avg       0.64      0.73      0.67      8238
weighted avg       0.88      0.82      0.84      8238

Confusion Matrix:
[[6238 1072]
 [ 371  557]]


In [37]:
# Side-by-side comparison
models = {'Logistic Regression': y_pred_logreg, 'KNN': y_pred_knn}

print("=== Model Comparison ===")
print(f"{'Metric':<15} {'LogReg':<12} {'KNN':<12}")
print("-" * 39)
for metric_name, metric_fn in [('Accuracy', accuracy_score),
                                ('Precision', precision_score),
                                ('Recall', recall_score),
                                ('F1-Score', f1_score)]:
    logreg_score = metric_fn(y_test, y_pred_logreg)
    knn_score = metric_fn(y_test, y_pred_knn)
    print(f"{metric_name:<15} {logreg_score:<12.4f} {knn_score:<12.4f}")

=== Model Comparison ===
Metric          LogReg       KNN         
---------------------------------------
Accuracy        0.8619       0.8248      
Precision       0.4436       0.3419      
Recall          0.8901       0.6002      
F1-Score        0.5921       0.4357      
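The logistic regression metrics in the comparison can be reproduced by hand from its confusion matrix reported earlier. The sketch below hard-codes that matrix (rows are actual classes, columns are predicted classes) and also computes the accuracy of always predicting class 0 as a reference point; the variable names are chosen for illustration.

```python
import numpy as np

# Logistic regression confusion matrix reported above
# rows = actual class, columns = predicted class
cm = np.array([[6274, 1036],
               [102,   826]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)           # of predicted subscribers, how many are real
recall = tp / (tp + fn)              # of real subscribers, how many are found
f1 = 2 * precision * recall / (precision + recall)

# Reference point: accuracy of always predicting "no subscription"
majority_baseline = (tn + fp) / cm.sum()

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}")
```

The majority-class reference is useful context for the accuracy figures: a model can score close to 0.89 on this test set without identifying a single subscriber.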


__Task 2.3__

Discuss your observations based on the evaluation results using Markdown cells below. (less than 100 words)

Accuracy: Logistic Regression (86.19%) outperforms KNN (82.48%), but both fall below the ~88.7% obtained by always predicting the
majority class, so accuracy alone is misleading on the imbalanced test set.

Precision vs Recall Trade-off:
- Logistic Regression has high recall (0.89) but low precision (0.44) for the minority class (subscribers).
- KNN (default k=5) is weaker on both: recall 0.60, precision 0.34.

Key Insights:
- Both models produce many false positives (1,036 and 1,072), consistent with training on oversampled data.
- Logistic Regression has the better F1-score (0.59 vs 0.44).

Recommendation: Compare models on F1-score, which balances precision and recall.

---

## 3. Tuning Your Models

Use `GridSearchCV` from sklearn to perform hyperparameter tuning for both the logistic regression and KNN models. Define a grid of hyperparameters to search over for each model.

The goal is to improve the performance of both models through hyperparameter tuning. Hence, try to obtain a better performing model based on accuracy.

However, hyperparameter tuning can be computationally expensive. To manage this, limit the number of hyperparameter combinations by selecting only a few key hyperparameters and a small range of values for each. You can experiment with any number of hyperparameter values when tuning your models locally on your machine. For submission, however, keep the grid search reasonable (while including the best hyperparameters found during your experiments) so that it completes in a timely manner.

After completing the grid search, save the best models for both logistic regression and KNN based on cross-validation performance to the variables `best_logreg_model` and `best_knn_model` respectively.


__Task 3.1__

Perform hyperparameter tuning for the logistic regression model using `GridSearchCV`.

In [38]:
from sklearn.model_selection import GridSearchCV

In [40]:
# Define hyperparameter grid for Logistic Regression
logreg_param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['lbfgs', 'liblinear'],
    'max_iter': [1000]
}

# Create GridSearchCV for Logistic Regression
logreg_grid_search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=logreg_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the grid search
logreg_grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {logreg_grid_search.best_params_}")
print(f"Best CV Accuracy: {logreg_grid_search.best_score_:.4f}")

Best Parameters: {'C': 10, 'max_iter': 1000, 'solver': 'liblinear'}
Best CV Accuracy: 0.8715


In [41]:
# Save your best model from the grid search here
best_logreg_model = logreg_grid_search.best_estimator_

In [None]:
# Evaluate the best model on the test set (DO NOT modify this cell)
from sklearn.metrics import accuracy_score

y_pred_best_logreg_test = best_logreg_model.predict(X_test_scaled)
best_logreg_test_accuracy = accuracy_score(y_test, y_pred_best_logreg_test)

__Task 3.2__

Perform hyperparameter tuning for the KNN model using `GridSearchCV`.

In [42]:
# Define hyperparameter grid for KNN
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Create GridSearchCV for KNN
knn_grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=knn_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the grid search
knn_grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {knn_grid_search.best_params_}")
print(f"Best CV Accuracy: {knn_grid_search.best_score_:.4f}")

Best Parameters: {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
Best CV Accuracy: 0.9459
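To judge whether a grid is "reasonable", the number of model fits can be estimated before running the search. This sketch copies the two grids used in this notebook and counts combinations with `ParameterGrid`; the fit counts assume `cv=5` and the default `refit=True`, which adds one final fit on the full training data.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the cells above
logreg_param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['lbfgs', 'liblinear'],
    'max_iter': [1000],
}
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}
cv_folds = 5

# Number of hyperparameter combinations in each grid
n_logreg = len(ParameterGrid(logreg_param_grid))   # 4 * 2 * 1
n_knn = len(ParameterGrid(knn_param_grid))         # 5 * 2 * 2

# One fit per combination per fold, plus one refit of the best model
logreg_fits = n_logreg * cv_folds + 1
knn_fits = n_knn * cv_folds + 1
print(f"LogReg: {n_logreg} combinations, {logreg_fits} fits")
print(f"KNN: {n_knn} combinations, {knn_fits} fits")
```

For KNN most of the cost is in prediction rather than fitting, so the fit count is only a rough guide to runtime.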


In [43]:
# Save your best model from the grid search here
best_knn_model = knn_grid_search.best_estimator_

In [44]:
# Evaluate the best model on the test set (DO NOT modify this cell)
from sklearn.metrics import accuracy_score

y_pred_best_knn_test = best_knn_model.predict(X_test_scaled)
best_knn_test_accuracy = accuracy_score(y_test, y_pred_best_knn_test)

<div style="display: flex; align-items: center; gap: 8px; margin: 12px 0; color: #868181;">
  <hr style="flex: 1; border: none; border-top: 1px solid #ccc;">
  <span style="font-size: 1.2em;">END ASSIGNMENT</span>
  <hr style="flex: 1; border: none; border-top: 1px solid #ccc;">
</div>
