### **State University of Campinas - UNICAMP** <br>
**Course**: MC886A <br>
**Professor**: Marcelo da Silva Reis <br>
**TA (PED)**: Marcos Vinicius Souza Freire

---

**Student 1**:

**RA 1**:

**Student 2**:

**RA 2**:

---

### **Assignment 2: MC886A**
#### **Model Selection and Regularization**
##### Notebook: mc886_1s2025-assignment_2.ipynb

Hello students! Our first task was about Linear Regression on the Life Expectancy dataset. Now, in Assignment 2, we will cover Model Selection and Regularization. Let's review what we covered during lectures.

For this task, we'll use a Bank Marketing dataset [1] from UCI's repository. This dataset contains information about marketing campaigns conducted by a bank, with the goal of predicting whether a client will subscribe to a term deposit (our target variable 'y').

The dataset includes various client attributes such as age, job type, marital status, and education level, along with information about previous contacts and campaign outcomes. This rich set of features will allow us to explore different modeling approaches and regularization techniques.

Throughout this assignment, you'll:
1. Perform data exploration and preprocessing
2. Implement various models with different regularization techniques
3. Compare model performance using appropriate metrics
4. Practice model selection based on validation results

Remember that proper handling of categorical features and addressing class imbalance (if present) will be crucial for building effective models. We'll also focus on preventing overfitting through regularization methods we've discussed in our lectures.

#### **Objective:**

To explore **Model Selection** techniques to select the best model and hyperparameters for a classification task using PyTorch.

##### **Dataset: Bank Marketing**

The dataset contains data from a bank's marketing campaigns. Each record represents a client's interaction with a campaign, aimed at predicting whether the client will subscribe to a term deposit.

You can check the information about the dataset here: [https://archive.ics.uci.edu/dataset/222/bank+marketing](https://archive.ics.uci.edu/dataset/222/bank+marketing)

Our classification objective is to determine whether a client will subscribe to a term deposit, indicated by the "y" column (target), where "yes" = 1 and "no" = 0.

Features and their descriptions:

- **age**: Age of the client (numeric).
- **job**: Type of job (categorical: e.g., "admin.", "blue-collar", "student").
- **marital**: Marital status (categorical: "married", "single", "divorced").
- **education**: Education level (categorical: "primary", "secondary", "tertiary", "unknown").
- **default**: Has credit in default? (categorical: "yes", "no").
- **balance**: Average yearly balance in euros (numeric).
- **housing**: Has a housing loan? (categorical: "yes", "no").
- **loan**: Has a personal loan? (categorical: "yes", "no").
- **contact**: Contact communication type (categorical: "cellular", "telephone", "unknown").
- **day**: Last contact day of the month (numeric).
- **month**: Last contact month of the year (categorical: e.g., "jan", "feb").
- **duration**: Last contact duration in seconds (numeric).
- **campaign**: Number of contacts performed during this campaign for this client (numeric).
- **pdays**: Number of days since the client was last contacted from a previous campaign (numeric; -1 means not previously contacted).
- **previous**: Number of contacts performed before this campaign for this client (numeric).
- **poutcome**: Outcome of the previous marketing campaign (categorical: "success", "failure", "other", "unknown").
- **y**: Target value, indicating whether the client subscribed to a term deposit (categorical: "yes" or "no").

**How to load the dataset**

We'll download the dataset directly from the UCI Machine Learning Repository within the notebook. Run the following cell to fetch and load the data.

*If you want to run the notebook locally, just download the dataset and change the path according to the location of the folder in your local environment. You can revisit notebook 00 from our first hands-on and set up the environment that suits you best.*

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import plotly.express as px
from torchmetrics import F1Score

In [None]:
# Download and load dataset (you can change this part, and save the file directly to your Drive or locally)
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
!unzip -o bank.zip

In [None]:
# Remember that you must mount the drive here to read the CSV file. Recap Task 1 (assignment 1) or the notebook from hands-on 1

In [None]:
df = pd.read_csv('bank-full.csv', sep=';')

### **Data Analysis and Preprocessing** $(1.0 \space point)$

In this section, explore the dataset to understand its structure and relationships. Avoid using test data during training.

P.S.: You can use the same approach as in Task 1 (assignment 1).

#### **Exploration**

You can plot graphs of the features you think are important to visualize their relation with the target. You can also use box plots to understand feature distributions. There is no minimum or maximum number of graphs required; explore whatever you think helps in understanding the dataset.

As in the previous task, preprocess the data: transform the categorical features with `OneHotEncoding`, and scale the continuous features so that they are on a similar scale.
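As a rough illustration of the preprocessing described above, the sketch below one-hot encodes a categorical column and standardizes numeric columns on a tiny made-up frame. The toy values, the choice of `pd.get_dummies`, and the "first four rows are the training split" convention are all assumptions for illustration; the real notebook should apply the same idea to `df` with its own split.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few Bank Marketing columns (values are made up).
toy = pd.DataFrame({
    "age": [30, 45, 52, 28, 39, 61],
    "balance": [1200, -50, 300, 0, 8700, 450],
    "marital": ["married", "single", "married", "single", "divorced", "married"],
    "y": ["no", "yes", "no", "no", "yes", "no"],
})

# Map the target to 0/1 and separate it from the features.
y = (toy["y"] == "yes").astype(int).to_numpy()
X = toy.drop(columns="y")

# One-hot encode the categorical column; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X, columns=["marital"]).astype(float)

# Pretend the first four rows are the training split (illustrative only).
train_idx, test_idx = np.arange(4), np.arange(4, 6)

# Standardize numeric columns with statistics from the training rows only,
# so that nothing about the test rows leaks into preprocessing.
numeric_cols = ["age", "balance"]
mean = X_encoded.loc[train_idx, numeric_cols].mean()
std = X_encoded.loc[train_idx, numeric_cols].std(ddof=0).replace(0, 1.0)
X_encoded[numeric_cols] = (X_encoded[numeric_cols] - mean) / std

print(X_encoded.columns.tolist())
```

Computing the mean and standard deviation on the training rows alone is one way to respect the "avoid using test data during training" rule stated at the top of this section.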

In [None]:
# Display first few rows
print(df.head())
print(df.info())
print(df['y'].value_counts())  # Check class distribution

In [None]:
# Example visualization: Duration vs. Subscription
fig = px.box(df, x='y', y='duration', title='Duration by Subscription Outcome')
fig.show()

In [None]:
# Drop 'duration' to prevent data leakage
"""
    You may want to change this part. This is just an example.
    You should judge what column makes sense to drop to avoid data leakage.
    You can recall from the Task 1 (assignment 1), where we had to deal with data leakage.
"""
df = df.drop('duration', axis=1)

In [None]:
# Example: Job type distribution
fig = px.histogram(df, x='job', color='y', title='Job Distribution by Subscription')
fig.show()

#### **Preprocessing**

In [None]:
# Separate features and target

In [None]:
# One-hot encode categorical features

In [None]:
# Scale numerical features

In [None]:
# Manual train-test split
"""
    This is just an example. You can change this implementation.
    You can recall what you have learned from the Task 1.
"""
np.random.seed(42)
indices = np.arange(len(X_encoded))
np.random.shuffle(indices)
split = int(0.8 * len(indices))
train_indices = indices[:split]
test_indices = indices[split:]

### **Metric Selection** $(0.5 \space point)$

Since the dataset is imbalanced (more "no" than "yes" subscriptions), accuracy is not a good performance indicator. Which metric is better suited to an imbalanced classification task like this one, where correctly identifying the minority class ("yes") is important?
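To make the problem with accuracy concrete, here is a small sketch with a hypothetical label vector (the 88/12 ratio is illustrative, not the real class distribution). A classifier that always answers "no" looks good on accuracy while never finding a single subscriber.

```python
import numpy as np

# Hypothetical labels: 88 "no" (0) and 12 "yes" (1); the ratio is illustrative only.
y_true = np.array([0] * 88 + [1] * 12)

# A trivial classifier that always predicts the majority class "no".
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.2f}")       # accuracy = 0.88
print(f"recall on 'yes' = {recall:.2f}")  # recall on 'yes' = 0.00
```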

Discuss the choice of the metric here:

### **Model Selection** $(4.5 \space points)$

This section focuses on selecting the best model and hyperparameters using PyTorch. You must manually implement a logistic regression model with L1 (Lasso) and L2 (Ridge) regularization and perform a grid search with K-fold cross-validation. For this part you may use the Scikit-learn library, which has helper functions to create the [K-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) logic and the model.
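One possible shape for the regularized model is sketched below: a single `nn.Linear` layer trained with `BCEWithLogitsLoss`, with the L1 or L2 penalty added to the loss by hand. The function name `train_logreg`, its defaults, and the synthetic data are assumptions made for this example, not a required design.

```python
import torch
import torch.nn as nn

def train_logreg(X, y, reg_type="l2", lambda_reg=0.01, epochs=200, lr=0.1):
    """Full-batch gradient descent for logistic regression with an optional penalty."""
    model = nn.Linear(X.shape[1], 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(X).squeeze(1)
        loss = loss_fn(logits, y)
        # Penalize the weights only; the bias is conventionally left unregularized.
        if reg_type == "l1":
            loss = loss + lambda_reg * model.weight.abs().sum()
        elif reg_type == "l2":
            loss = loss + lambda_reg * model.weight.pow(2).sum()
        loss.backward()
        optimizer.step()
    return model

# Synthetic data (illustrative): two of the five features carry no signal.
torch.manual_seed(0)
X = torch.randn(200, 5)
true_w = torch.tensor([2.0, -1.0, 0.0, 0.0, 0.5])
y = (X @ true_w > 0).float()

norms = {}
for reg_type in ("none", "l1", "l2"):
    model = train_logreg(X, y, reg_type=reg_type, lambda_reg=0.05)
    norms[reg_type] = model.weight.detach().norm().item()
    print(reg_type, round(norms[reg_type], 3))
```

Any `reg_type` other than `"l1"` or `"l2"` (here `"none"`) trains without a penalty, which gives the unregularized baseline to compare against.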

##### **Utility Functions**

*You must implement the F1 score from scratch (manually) and with Torchmetrics, then compare the two results.*
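A minimal sketch of the manual computation, written with NumPy for brevity (the same arithmetic carries over to PyTorch tensors). The function name `f1_score_manual` and the sample labels are assumptions for illustration.

```python
import numpy as np

def f1_score_manual(y_true, y_pred):
    """Binary F1 for the positive class (label 1), computed from raw counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    return 2 * tp / denom if denom > 0 else 0.0

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(f1_score_manual(y_true, y_pred))  # 0.75
```

For the comparison, Torchmetrics provides `F1Score(task="binary")`, which should agree with the manual value on the same 0/1 predictions.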

##### **Grid Search with K-fold CV**
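The loop structure of a grid search with K-fold cross-validation can be sketched as follows. The helper names `kfold_indices` and `grid_search_cv` are hypothetical, and `dummy_fit_and_score` is only a stand-in for "train a model on the training fold and return its validation F1"; it exists so the sketch runs on its own.

```python
import itertools
import numpy as np

def kfold_indices(n_samples, k, seed=42):
    """Yield (train_idx, val_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def grid_search_cv(X, y, param_grid, fit_and_score, k=5):
    """Return the parameter combination with the highest mean validation score."""
    best_params, best_score = None, -np.inf
    names = list(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = [
            fit_and_score(X[tr], y[tr], X[va], y[va], **params)
            for tr, va in kfold_indices(len(X), k)
        ]
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def dummy_fit_and_score(X_tr, y_tr, X_va, y_va, reg_type, lambda_reg):
    # Stand-in scorer (not a real model): peaks at reg_type="l2", lambda_reg=0.01.
    return -abs(lambda_reg - 0.01) + (0.001 if reg_type == "l2" else 0.0)

X = np.zeros((20, 3))
y = np.zeros(20)
param_grid = {"reg_type": ["l1", "l2"], "lambda_reg": [0.0, 0.01, 0.1]}
best_params, best_score = grid_search_cv(X, y, param_grid, dummy_fit_and_score, k=5)
print(best_params)  # {'reg_type': 'l2', 'lambda_reg': 0.01}
```

Because the classes are imbalanced, stratified folds (for example Scikit-learn's `StratifiedKFold`) may be worth considering so that every fold contains some "yes" examples. Further hyperparameters, such as a polynomial degree, can be added as extra keys in `param_grid`.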

In [None]:
K = None  # TODO: based on your interpretation of the data, what is the best value for K?

print(f"Best hyperparameters: {best_params}")

##### **Discussion of Key Points**
- **What were the best model hyperparameters, according to cross-validation?**
  - Run the code to find out. The best combination of regularization type, strength, and polynomial degree will be printed.
- **Did models with regularization outperform the one without it?**
  - Compare the F1-score when lambda_reg=0 (no regularization) vs. other values. Regularization might improve performance by preventing overfitting, especially with polynomial features.

### **Threshold Testing** $(1.0 \space point)$

Test different thresholds to optimize the F1-score for the best model.
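A threshold sweep might look like the sketch below: convert predicted probabilities into 0/1 labels at each candidate threshold and keep the one with the highest F1. The helper names and the toy validation probabilities are assumptions for illustration; in the notebook the probabilities would come from the best model, evaluated on validation data rather than the test set.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """Binary F1 for label 1 from raw counts: 2TP / (2TP + FP + FN)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold_by_f1(y_true, probs, thresholds):
    """Return (threshold, F1) for the threshold that maximizes F1."""
    scores = [f1_binary(y_true, (probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), scores[best]

# Toy validation labels and predicted probabilities (made up).
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
probs = np.array([0.05, 0.12, 0.22, 0.31, 0.36, 0.15, 0.42, 0.47, 0.55, 0.83])

thresholds = np.linspace(0.1, 0.9, 9)
best_threshold, best_f1 = best_threshold_by_f1(y_val, probs, thresholds)
print(f"Best threshold: {best_threshold:.1f} (F1 = {best_f1:.3f})")
# Best threshold: 0.4 (F1 = 0.857)
```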

In [None]:


print(f"Best threshold: {best_threshold}")

### **Visualizing/Interpreting Weights** $(1.5 \space points)$

In [None]:
# Weight plotting function


In [None]:
# Train final model


In [None]:
# Generate feature names for plotting


##### **Discussion of Key Points**
- **What conclusions can you draw from the weight graphs?**
  - Look at the magnitude of weights. L1 regularization (Lasso) may set some weights to zero, indicating feature sparsity, while L2 (Ridge) shrinks weights but keeps all features. Compare this to lambda_reg=0 if tested.
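Alongside the plots, a small numeric summary can help with the comparison. The sketch below ranks features by absolute weight and lists those that are effectively zero. The weight vectors are made up to stand in for trained L1 and L2 models, and `summarize_weights` and its tolerance `tol` are hypothetical choices.

```python
import numpy as np

def summarize_weights(weights, feature_names, tol=1e-3):
    """Rank features by |weight| and list those that are effectively zero."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-np.abs(weights))
    ranked = [(feature_names[i], float(weights[i])) for i in order]
    near_zero = [name for name, w in zip(feature_names, weights) if abs(w) < tol]
    return ranked, near_zero

# Made-up weights standing in for a trained L1 model and a trained L2 model.
names = ["age", "balance", "campaign", "job_student", "poutcome_success"]
w_l1 = [0.0002, -0.0004, -0.31, 0.0, 1.12]
w_l2 = [0.04, -0.02, -0.27, 0.15, 0.88]

ranked_l1, zero_l1 = summarize_weights(w_l1, names)
ranked_l2, zero_l2 = summarize_weights(w_l2, names)
print("L1 near-zero features:", zero_l1)
print("L2 near-zero features:", zero_l2)
```

A tolerance is used instead of an exact comparison with zero because a model trained with plain gradient descent on an L1 penalty often ends with very small weights rather than exact zeros.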

### **Testing** $(1.5 \space points)$
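For the final evaluation on the held-out test set, the counts behind a confusion matrix can be computed as in the sketch below. The function name `confusion_counts` and the sample labels are assumptions for illustration; the layout `[[TN, FP], [FN, TP]]` is one common convention.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return a 2x2 array [[TN, FP], [FN, TP]] for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    cm = np.zeros((2, 2), dtype=int)
    # Row index = true label, column index = predicted label.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_counts(y_true, y_pred)
print(cm.tolist())  # [[3, 1], [1, 3]]
```

The resulting array can then be drawn as a heatmap, for instance with `px.imshow(cm, text_auto=True)` from the Plotly Express module already imported in this notebook.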

In [None]:
# ...

def plot_confusion_matrix(y_true, y_pred):
    # TODO: build and plot the confusion matrix for the test predictions
    raise NotImplementedError

---

#### **Deadline**

Thursday, May 18, 11:59 pm.

Penalty policy for late submission: You are encouraged not to submit your assignment after the due date. However, if you do, your grade will be penalized as follows:
- May (3) 6 11:59 pm : grade * 0.75
- May (4) 7, 11:59 pm : grade * 0.5
- May (5) 8, 11:59 pm : grade * 0.25

---

#### **Submission**

On Google Classroom, submit your Jupyter Notebook (in Portuguese or English) or Google Colaboratory link (remember to share it!).

**This activity is NOT individual, it must be done in pairs (two-person group).**

Only one member of the pair should submit the notebook.

---

#### **REFERENCE**

[1] Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.