**Group information**

| Family name | First name | Email address |
| ----------- | ---------- | ------------- |
|             |            |               |
|             |            |               |
|             |            |               |

# Network - Practice

This tutorial explores how to implement a simple neural network to predict the likelihood of loan default from borrower loan characteristics. The labelled dataset contains 100,000 observations and 16 predictors (e.g. income, credit score). The response is a binary variable indicating whether the borrower defaulted on the loan.

In [4]:
!pip install captum

Collecting captum
  Downloading captum-0.8.0-py3-none-any.whl.metadata (26 kB)
Downloading captum-0.8.0-py3-none-any.whl (1.4 MB)
Installing collected packages: captum
Successfully installed captum-0.8.0


In [5]:
# Packages
import numpy as np
import pandas as pd
import shutil
import os
import torch
import torchinfo

from captum import attr
from matplotlib import pyplot as plt
from sklearn import metrics, model_selection, preprocessing
from torch import nn, optim, utils
from tqdm import tqdm
from urllib import request

# Device
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device = torch.device(device)

# Utilities
def download_data():
    '''Downloads the data folder'''
    if os.getcwd().endswith('/data'):
        print('Data folder already exists')
    else:
        request.urlretrieve('https://www.dropbox.com/scl/fo/tniycpagp0c3p72uy0ag1/ACdmVyp71Zw_89tPERPN2mI?rlkey=0nxq0gifiqh5fwl9j0dk8lgk9&dl=1', 'data.zip')
        shutil.unpack_archive('data.zip', 'data')
        os.remove('data.zip')
        os.chdir('data')

In [6]:
# Execute on first run
download_data()

**1. Descriptive statistics**

Load the `X.csv` and `y.csv` files using `pd.read_csv`. Display the first few observations with the `head` method and generate descriptive statistics for both continuous (e.g. `describe`) and categorical (e.g. `value_counts`) variables.

In [7]:
print(os.getcwd())

/Users/eduardo/Desktop/Applied-Data-Science-Classes/data


In [8]:
X = pd.read_csv('X.csv')
y = pd.read_csv('y.csv')

In [13]:
print(X.head())
print(X.describe())
print(y.head())
print(y.describe())

   loan_id  age  income  loan_amount  credit_score  months_employed  \
0        0   29   17239       104883           769               93   
1        1   19   72980        51678           752               42   
2        2   59   31220       223085           550               22   
3        3   39   85614       213354           817               97   
4        4   61  149367        94041           535              117   

   num_credit_lines  interest_rate  loan_term  dti_ratio    education  \
0                 4           2.10         24       0.82     bachelor   
1                 3          10.95         24       0.25  high_school   
2                 2          21.02         60       0.66  high_school   
3                 3          13.59         36       0.32  high_school   
4                 4          19.25         36       0.52          phd   

  employment_type marital_status has_mortgage has_dependents loan_purpose  \
0       part_time       divorced          yes            

In [15]:
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

value_counts = {col: X[col].value_counts(dropna=False) for col in cat_cols}

for col, vc in value_counts.items():
    print(f"\n--- {col} ---")
    print(vc)


--- education ---
education
high_school    25698
bachelor       25392
master         24706
phd            24204
Name: count, dtype: int64

--- employment_type ---
employment_type
unemployed       25966
part_time        25269
self_employed    24706
full_time        24059
Name: count, dtype: int64

--- marital_status ---
marital_status
divorced    33966
single      33221
married     32813
Name: count, dtype: int64

--- has_mortgage ---
has_mortgage
no     50775
yes    49225
Name: count, dtype: int64

--- has_dependents ---
has_dependents
no     50710
yes    49290
Name: count, dtype: int64

--- loan_purpose ---
loan_purpose
business     20359
other        20022
auto         19942
education    19919
home         19758
Name: count, dtype: int64

--- has_cosigner ---
has_cosigner
no     51110
yes    48890
Name: count, dtype: int64


**2. Format data**

Pre-process the data by encoding the categorical variables with `pd.get_dummies` and converting the target variable to `float`, so that it can be compared directly with predicted probabilities.

In [16]:
X_dummies = pd.get_dummies(
    X,
    columns=cat_cols,
    drop_first=True,
    dtype=int
)

print(X.shape, "->", X_dummies.shape)

(100000, 17) -> (100000, 25)
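The cell above encodes the predictors only. The target conversion requested in this step could look like the sketch below. It runs on a toy frame because the layout of `y.csv` is not shown here; the column name `default` is an assumption and should be replaced by the actual column name.

```python
import pandas as pd

# Toy stand-in for y.csv; the column name `default` is hypothetical
y_toy = pd.DataFrame({"default": [0, 1, 0, 0, 1]})

# Cast the 0/1 labels to float32, the dtype expected by the loss function later on
y_float = y_toy["default"].astype("float32")

print(y_float.dtype)
print(y_float.mean())  # share of defaults in the toy sample
```

The mean of a 0/1 target is the default rate, which is a quick way to check for class imbalance before training.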


**3. Split samples and scale data**

Split the dataset into a training and a test sample using `model_selection.train_test_split` and allocate 80% of the observations to the training sample. 

Scale the input variables using `preprocessing.MinMaxScaler` by fitting the scaler on the training sample and applying the transformation to both the training and the test sample. Explain why input scaling is required for machine learning models.

In [18]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummies, y,
    test_size=0.2,
    random_state=42
)

scaler = preprocessing.MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
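
As a hint for the explanation: min-max scaling maps each column to `(x - min) / (max - min)`, with `min` and `max` taken from the training sample only. The toy example below (made-up values, not the loan data) shows that columns with very different magnitudes end up on the same scale, and that a test value outside the training range falls outside `[0, 1]` because the scaler never sees the test sample.

```python
import numpy as np
from sklearn import preprocessing

# Toy data: an income-like column and a ratio-like column (values are made up)
X_train_toy = np.array([[20_000.0, 0.2], [60_000.0, 0.5], [100_000.0, 0.8]])
X_test_toy = np.array([[120_000.0, 0.1]])

scaler_toy = preprocessing.MinMaxScaler()
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min/max from the training rows
X_test_toy_scaled = scaler_toy.transform(X_test_toy)        # reuses the training min/max

print(X_train_toy_scaled)  # both columns now span [0, 1]
print(X_test_toy_scaled)   # outside [0, 1]: the test row exceeds the training range
```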

**4. Data loaders**

Convert the training and test sets to `torch.Tensor` objects using `torch.from_numpy` and wrap them in `utils.data.TensorDataset`. Use these datasets to create `utils.data.DataLoader` instances for training and testing. Explain why `shuffle=True` is necessary for the training loader, select an appropriate batch size, and discuss the trade-offs involved in that choice.


In [24]:
# The scaler already returns NumPy arrays; y_train and y_test are still pandas
# objects, which torch.from_numpy rejects, so convert them explicitly
X_train_np = np.asarray(X_train_scaled, dtype=np.float32)
X_test_np = np.asarray(X_test_scaled, dtype=np.float32)
y_train_np = y_train.to_numpy(dtype=np.float32)
y_test_np = y_test.to_numpy(dtype=np.float32)

train_dataset = utils.data.TensorDataset(torch.from_numpy(X_train_np), torch.from_numpy(y_train_np))
test_dataset = utils.data.TensorDataset(torch.from_numpy(X_test_np), torch.from_numpy(y_test_np))

# Shuffle the training batches each epoch; keep the test order fixed
train_loader = utils.data.DataLoader(dataset=train_dataset, batch_size=128, shuffle=True)
test_loader = utils.data.DataLoader(dataset=test_dataset, batch_size=128, shuffle=False)

**5. Model structure**

Define a feedforward neural network class using PyTorch. The model takes a feature vector as input and outputs a single raw score (logit) per observation. It consists of two hidden layers with 16 units each, each followed by a ReLU activation, and ends with a linear output layer. Instantiate the model and print its architecture.

Note: For numerical stability, the model should output raw logits rather than probabilities. `nn.BCEWithLogitsLoss` applies the sigmoid transformation internally, so no sigmoid layer is needed at the end of the network.
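
One possible shape for this class is sketched below. The class name `LoanDefaultNet` and its argument names are illustrative choices, not part of the assignment. The input size of 25 matches the number of columns after one-hot encoding in this notebook.

```python
import torch
from torch import nn


class LoanDefaultNet(nn.Module):
    """Two hidden layers of 16 units with ReLU, then a linear layer producing one logit."""

    def __init__(self, n_features: int, n_hidden: int = 16):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),  # raw logit: no sigmoid here
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


model = LoanDefaultNet(n_features=25)
print(model)

# Sanity check on a random batch of 4 observations
logits = model(torch.randn(4, 25))
print(logits.shape)
```

For a layer-by-layer table with parameter counts, `torchinfo.summary(model, input_size=(1, 25))` can be used in place of `print(model)`.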

**6. Model training** 

Define the loss function (`nn.BCEWithLogitsLoss`) and an optimisation algorithm (e.g. `optim.AdamW`). Write a PyTorch training loop to estimate the model parameters on the training sample, with a maximum of 25 epochs and a learning rate of `1e-3`. Remember to move the model and the batch data to the correct device.
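
A minimal sketch of such a loop follows. So that it runs on its own, it uses synthetic data and an inline `nn.Sequential` model as stand-ins for the notebook's `train_loader` and model class. It also assumes the labels have shape `(N, 1)` to match the logits, and it checks only for CUDA when picking the device.

```python
import torch
from torch import nn, optim, utils

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for the scaled training data: 512 rows, 25 features, labels of shape (N, 1)
X_syn = torch.rand(512, 25)
y_syn = (X_syn[:, :1] > 0.5).float()
train_loader = utils.data.DataLoader(
    utils.data.TensorDataset(X_syn, y_syn), batch_size=128, shuffle=True
)

model = nn.Sequential(
    nn.Linear(25, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1)
).to(device)
criterion = nn.BCEWithLogitsLoss()
optimiser = optim.AdamW(model.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(25):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimiser.zero_grad()              # reset gradients from the previous step
        loss = criterion(model(X_batch), y_batch)
        loss.backward()                    # backpropagate
        optimiser.step()                   # update the parameters
        running_loss += loss.item() * X_batch.size(0)
    epoch_losses.append(running_loss / len(train_loader.dataset))
    print(f"Epoch {epoch + 1:02d} | loss {epoch_losses[-1]:.4f}")
```

Weighting each batch loss by the batch size gives the exact mean loss per observation, even when the last batch is smaller than the others.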

**7. Model performance** 

Write a PyTorch evaluation loop to assess the model's generalisation performance on the test sample. Interpret the results using `metrics.classification_report` and `metrics.confusion_matrix`.
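
The sketch below shows one way to structure the evaluation loop. It uses synthetic data and an untrained inline model as placeholders, so the printed scores carry no meaning. The 0.5 classification threshold is an assumption that may need adjusting if defaults are rare.

```python
import torch
from sklearn import metrics
from torch import nn, utils

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for the test sample and an untrained placeholder model
X_syn = torch.rand(256, 25)
y_syn = (X_syn[:, :1] > 0.5).float()
test_loader = utils.data.DataLoader(
    utils.data.TensorDataset(X_syn, y_syn), batch_size=128, shuffle=False
)
model = nn.Sequential(
    nn.Linear(25, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1)
).to(device)

model.eval()                      # switch off training-only behaviour such as dropout
all_probs, all_labels = [], []
with torch.no_grad():             # no gradients needed for evaluation
    for X_batch, y_batch in test_loader:
        logits = model(X_batch.to(device))
        all_probs.append(torch.sigmoid(logits).cpu())  # logits -> probabilities
        all_labels.append(y_batch)

probs = torch.cat(all_probs).squeeze(1).numpy()
labels = torch.cat(all_labels).squeeze(1).numpy().astype(int)
preds = (probs >= 0.5).astype(int)  # assumed threshold of 0.5

print(metrics.classification_report(labels, preds, zero_division=0))
cm = metrics.confusion_matrix(labels, preds, labels=[0, 1])
print(cm)  # rows: true class, columns: predicted class
```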

**8. Feature importance** 

Using the `attr.IntegratedGradients` class, compute local variable importance on the test sample. Aggregate the results across all observations and display them as a bar plot. Which variables are the most important for the classification?

**9. Regularisation**

Apply regularisation by configuring the optimiser's `weight_decay`, adding a penalty term to the loss function, inserting dropout layers after the activation functions, or implementing early stopping based on performance on a validation sample.
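
The sketch below combines three of these options: dropout after each activation, `weight_decay` in the optimiser, and early stopping on a validation sample. It is a simplified illustration on synthetic data with full-batch updates. The dropout rate, weight decay, and patience values are arbitrary starting points, not tuned recommendations.

```python
import copy

import torch
from torch import nn, optim

torch.manual_seed(0)

# Synthetic data, split into a training and a validation sample
X_syn = torch.rand(600, 25)
y_syn = (X_syn[:, :1] > 0.5).float()
X_tr, y_tr = X_syn[:480], y_syn[:480]
X_val, y_val = X_syn[480:], y_syn[480:]

model = nn.Sequential(
    nn.Linear(25, 16), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(16, 1),
)
criterion = nn.BCEWithLogitsLoss()
optimiser = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

patience, best_val, bad_epochs = 3, float("inf"), 0
best_state = copy.deepcopy(model.state_dict())

for epoch in range(25):
    model.train()                      # dropout active
    optimiser.zero_grad()
    loss = criterion(model(X_tr), y_tr)
    loss.backward()
    optimiser.step()

    model.eval()                       # dropout disabled for validation
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()

    if val_loss < best_val:            # improvement: remember these weights
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:                              # no improvement: count towards patience
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Stopping early at epoch {epoch + 1}")
            break

model.load_state_dict(best_state)      # restore the best weights seen
print(f"Best validation loss: {best_val:.4f}")
```

In the notebook, the validation sample would be carved out of the training sample so that the test sample stays untouched until the final evaluation.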

**10. Model architecture tuning**

Modify the model structure to improve predictive performance.
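
To compare architectures without rewriting the class each time, a small builder function can help. The helper `make_mlp` and the candidate configurations below are hypothetical examples. Each candidate would still need to be trained and compared on a validation sample.

```python
import torch
from torch import nn


def make_mlp(n_features, hidden_sizes, dropout=0.0):
    """Builds an MLP with the given hidden layer sizes and a single-logit output."""
    layers, in_size = [], n_features
    for size in hidden_sizes:
        layers += [nn.Linear(in_size, size), nn.ReLU()]
        if dropout > 0:
            layers.append(nn.Dropout(dropout))
        in_size = size
    layers.append(nn.Linear(in_size, 1))  # raw logit output
    return nn.Sequential(*layers)


# Hypothetical candidates: the baseline from step 5, a wider and a deeper variant
candidates = {"baseline": (16, 16), "wider": (64, 64), "deeper": (32, 32, 32)}
param_counts = {}
for name, sizes in candidates.items():
    net = make_mlp(25, sizes)
    param_counts[name] = sum(p.numel() for p in net.parameters())
    print(f"{name}: {param_counts[name]} parameters")
```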