# Predicting Default Payments with Fully-Connected NNs

The dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

## Inspecting the data

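The training set contains 25,500 clients described by 23 features, and the test set contains 4,500 clients with the same 23 features. None of the training columns has missing values. The features mix a credit limit (`LIMIT_BAL`), demographic attributes (`SEX`, `EDUCATION`, `MARRIAGE`, `AGE`), repayment status codes (`PAY_0`, `PAY_2` to `PAY_6`), and monetary amounts (`BILL_AMT1` to `BILL_AMT6`, `PAY_AMT1` to `PAY_AMT6`), so their scales differ by several orders of magnitude.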

In [7]:
import os

# Reduce TensorFlow C++ logging; this must be set before TensorFlow is imported.
# '1' hides INFO messages only; use '2' to hide warnings as well.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

def load_data(path, train=True):
    """
    Load data from a CSV file.

    Parameters
    ----------
    path: str
        The path to the CSV file.

    train: bool (default True)
        Specifies whether the data are training data.
        If True, random shuffling is applied.

    Returns
    -------
    X: numpy.ndarray
        The features as a multi-dimensional array of floats.
    labels/ids: numpy.ndarray
        The target labels for training data or IDs for test data.
    """
    df = pd.read_csv(path, encoding="ISO-8859-2")
    
    if train:
        # Shuffle training data to prevent ordering bias
        data = df.sample(frac=1, random_state=42).values
        X, labels = data[:, 1:-1].astype(np.float32), data[:, -1]
        return X, labels
    else:
        X, ids = df.iloc[:, 1:].values.astype(np.float32), df.iloc[:, 0].astype(str).values
        return X, ids

def preprocess_data(X, scaler=None):
    """
    Preprocess input data by standardizing features to have zero mean and unit variance.
    
    Parameters
    ----------
    X: numpy.ndarray
        The input features to preprocess.
    
    scaler: StandardScaler, optional
        A pre-fitted scaler for transformation (used in testing).
    
    Returns
    -------
    X_scaled: numpy.ndarray
        The standardized features.
    scaler: StandardScaler
        The fitted scaler.
    """
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(X)
    X_scaled = scaler.transform(X)
    return X_scaled, scaler

def preprocess_labels(labels, encoder=None, categorical=True):
    """
    Encode labels as integers and optionally convert them to categorical one-hot encoding.
    
    Parameters
    ----------
    labels: numpy.ndarray
        The target labels to preprocess.
    
    encoder: LabelEncoder, optional
        A pre-fitted encoder for consistent label encoding.
    
    categorical: bool (default True)
        Whether to convert labels to one-hot encoding.
    
    Returns
    -------
    y: numpy.ndarray
        The encoded labels, either as integers or one-hot vectors.
    encoder: LabelEncoder
        The fitted label encoder.
    """
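    if encoder is None:
        encoder = LabelEncoder()
        encoder.fit(labels)
    y = encoder.transform(labels).astype(np.int32)
    if categorical:
        # Fix the width so every split gets one column per known class
        y = to_categorical(y, num_classes=len(encoder.classes_))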
    return y, encoder

# Load the dataset
url_train = './train.csv'
url_test = './test.csv'

# Load training data and labels
X_train, labels = load_data(url_train, train=True)

# Dimensionality inspection
print("Training set dimensions (rows, columns):", X_train.shape)
print("Number of features:", X_train.shape[1])

# Inspect the first few rows of the training set
print("\nFirst 5 samples of the training data:")
print(pd.DataFrame(X_train).head())

print("\nTraining set labels distribution:")
print(pd.Series(labels).value_counts())

# Additional EDA: Summary statistics for training data
# (same encoding as in load_data, for consistency)
df_train = pd.read_csv(url_train, encoding="ISO-8859-2")
print("\nSummary statistics for training data:")
print(df_train.describe())

# Inspecting test data dimensionality
X_test, ids = load_data(url_test, train=False)
print("\nTest set dimensions (rows, columns):", X_test.shape)

Training set dimensions (rows, columns): (25500, 23)
Number of features: 23

First 5 samples of the training data:
         0    1    2    3     4    5    6    7    8    9   ...       13  \
0   70000.0  2.0  3.0  2.0  26.0  0.0  0.0  0.0  0.0  0.0  ...   8948.0   
1  320000.0  2.0  2.0  2.0  28.0 -1.0 -1.0 -1.0 -1.0 -1.0  ...    944.0   
2   30000.0  2.0  2.0  2.0  36.0  0.0 -1.0 -1.0  0.0  0.0  ...  30452.0   
3   20000.0  2.0  3.0  1.0  35.0  0.0  0.0  2.0  2.0  0.0  ...  18621.0   
4   80000.0  1.0  2.0  2.0  32.0  1.0  2.0  0.0  0.0  0.0  ...  28242.0   

        14       15       16      17       18      19      20      21      22  
0   9006.0  10570.0  11421.0  2000.0   1200.0  1500.0  2000.0  1000.0  2000.0  
1    473.0   1747.0   1193.0   390.0    944.0   473.0  5000.0  1200.0   980.0  
2  29667.0  28596.0  29180.0   490.0  33299.0  1400.0   572.0   584.0   400.0  
3  18024.0  18434.0  19826.0  3000.0   1000.0     0.0   700.0  1700.0     0.0  
4  21400.0      0.0      0.0     7
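
The label counts printed by `value_counts()` are easier to interpret as proportions, because they give the accuracy that a trivial majority-class predictor would reach. A minimal sketch, assuming integer class labels; the `labels_demo` values are made up for illustration and are not taken from the dataset:

```python
import numpy as np

def class_balance(labels):
    """Return ({class: proportion}, majority-class baseline accuracy)."""
    classes, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    balance = {int(c): float(p) for c, p in zip(classes, proportions)}
    return balance, float(proportions.max())

# Hypothetical labels: 8 non-defaults (0) and 2 defaults (1)
labels_demo = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
balance, baseline = class_balance(labels_demo)
print(balance)   # {0: 0.8, 1: 0.2}
print(baseline)  # 0.8
```

Any model worth keeping should beat this baseline accuracy on the validation set.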

## Preparing the data

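The preprocessing choices follow from the inspection above. The training data have no missing values, so no imputation is applied. The shuffled training data are split 80/20 into training and validation sets, stratified by label so that both keep the same class proportions. Because the features span very different scales, they are standardized to zero mean and unit variance with a `StandardScaler` fitted on the training split only and then reused for the validation and test sets. The labels are integer-encoded and then one-hot encoded.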

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.utils import to_categorical

# Step 1: Load the data
url_train = './train.csv'
X_train_raw, labels_raw = load_data(url_train, train=True)

# Step 2: Check for missing values
df_train = pd.read_csv(url_train)

# Count the missing values in each column
print("Missing values in training data:")
print(df_train.isnull().sum())

# No column has missing values (see the output below),
# so no imputation or row dropping is required.

# Step 3: Split the data into training and validation sets
# A stratified 80/20 split keeps the class proportions the same in both sets,
# so the model can be evaluated on unseen validation data
X_train, X_val, y_train, y_val = train_test_split(
    X_train_raw, labels_raw, test_size=0.2, random_state=42, stratify=labels_raw
)

# Step 4: Scale the features using StandardScaler
# Fit the scaler on the training split only, then reuse it on the validation split
# so that no validation statistics leak into training
X_train_scaled, scaler = preprocess_data(X_train)
X_val_scaled, _ = preprocess_data(X_val, scaler)

# Step 5: Encode the labels
# Convert the labels to a numerical format and then to categorical (one-hot encoding)
y_train_encoded, encoder = preprocess_labels(y_train)
y_val_encoded, _ = preprocess_labels(y_val, encoder)

# Step 6: Preprocess the test data
# Apply the scaler fitted on the training split to ensure consistency;
# the test set has no labels, so the encoder is not needed here
url_test = './test.csv'
X_test_raw, ids_test = load_data(url_test, train=False)
X_test_scaled, _ = preprocess_data(X_test_raw, scaler)

# Final datasets ready for model training
print("Training data (X_train_scaled):", X_train_scaled.shape)
print("Validation data (X_val_scaled):", X_val_scaled.shape)
print("Test data (X_test_scaled):", X_test_scaled.shape)

# Check the shapes of the encoded labels
print("Encoded training labels (y_train_encoded):", y_train_encoded.shape)
print("Encoded validation labels (y_val_encoded):", y_val_encoded.shape)

Missing values in training data:
ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64
Training data (X_train_scaled): (20400, 23)
Validation data (X_val_scaled): (5100, 23)
Test data (X_test_scaled): (4500, 23)
Encoded training labels (y_tr
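
The scaling step can be checked on a toy example: the statistics come from the training split only and are then applied unchanged to the other splits. The sketch below reproduces in plain NumPy the arithmetic that `StandardScaler` performs (per-column mean and population standard deviation); it is a stand-in for illustration, and the arrays are made-up values rather than rows from the dataset:

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and population std (ddof=0) on training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return mean, std

def apply_standardizer(X, mean, std):
    """Standardize X with statistics computed elsewhere (no refitting)."""
    return (X - mean) / std

# Hypothetical toy data: a large-scale column and a small-scale column
X_train_demo = np.array([[10000.0, 1.0], [30000.0, 2.0], [50000.0, 3.0]])
X_val_demo = np.array([[30000.0, 4.0]])

mean, std = fit_standardizer(X_train_demo)
X_train_std = apply_standardizer(X_train_demo, mean, std)
X_val_std = apply_standardizer(X_val_demo, mean, std)

print(X_train_std.mean(axis=0))  # approximately 0 in every column
print(X_train_std.std(axis=0))   # approximately 1 in every column
print(X_val_std)                 # validation values need not be centered
```

Only the training split is guaranteed to end up with zero mean and unit variance; the validation and test splits are merely expressed in the same units.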

## Building the network

any description/comment about the procedure you followed in the choice of the network structure and hyperparameters goes here, together with consideration about the training/optimization procedure (e.g. optimizer choice, final activations, loss functions, training metrics)
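
As one possible starting point for this section, the sketch below builds a small fully-connected network for 23 standardized inputs and two one-hot encoded classes. The layer sizes, dropout rate, and learning rate are illustrative assumptions rather than tuned choices; the softmax output with categorical cross-entropy is chosen to match the one-hot labels produced by `preprocess_labels`.

```python
from tensorflow import keras

def build_model(n_features=23, n_classes=2, hidden_units=(64, 32), dropout=0.3):
    """Build and compile a small fully-connected classifier (illustrative sizes)."""
    layers = [keras.Input(shape=(n_features,))]
    for units in hidden_units:
        layers.append(keras.layers.Dense(units, activation="relu"))
        layers.append(keras.layers.Dropout(dropout))  # regularization, assumed rate
    # One output unit per class; softmax yields class probabilities
    layers.append(keras.layers.Dense(n_classes, activation="softmax"))

    model = keras.Sequential(layers)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
model.summary()
```

With these assumed sizes the model stays small relative to the 20,400 training rows, which limits the room for overfitting.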

## Analyze and comment on the training results

Any comment on or visualization of the training history goes here, together with any initial considerations on the training results.
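
Keras stores the per-epoch metrics in `History.history`, a dictionary of lists keyed by metric name. A minimal sketch for locating the epoch with the lowest validation loss and reading the training loss at that point; the `history_demo` numbers are invented for illustration and are not results from this model:

```python
def summarize_history(history):
    """Return (best_epoch, train_loss, val_loss) at the minimum of val_loss.

    `history` is a dict shaped like Keras' History.history, with the keys
    'loss' and 'val_loss'. Epochs are reported 1-based.
    """
    val_loss = history["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best + 1, history["loss"][best], val_loss[best]

# Hypothetical history: training loss keeps falling, validation loss turns up
history_demo = {
    "loss": [0.60, 0.48, 0.44, 0.41, 0.38],
    "val_loss": [0.58, 0.47, 0.45, 0.46, 0.49],
}
epoch, train_loss, val_loss = summarize_history(history_demo)
print(epoch, train_loss, val_loss)  # 3 0.44 0.45
```

A validation loss that rises after the best epoch while the training loss keeps decreasing, as in the toy numbers, is the usual sign of overfitting.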

## Validate the model and comment on the results

Please describe the evaluation procedure on a validation set, commenting on the generalization capability of your model (e.g. under/overfitting). You may also describe the performance metrics that you choose: what is the most suitable performance measure (or set of performance measures) for this dataset, in your view? Why?
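
When the classes are imbalanced, accuracy alone can be misleading, so precision, recall, and F1 on the default class are common complements. A minimal NumPy sketch, assuming class 1 denotes a default; the label vectors are toy values for illustration only:

```python
import numpy as np

def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the `positive` class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 2 defaults out of 10 clients
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print(binary_report(y_true_demo, y_pred_demo))
# {'accuracy': 0.8, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

In the toy example the accuracy equals what a classifier that never predicts a default would score, while the precision and recall reveal that only half of the defaults are caught.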

## Make predictions (on the provided test set)

Based on the results obtained and analyzed during the training and validation phases, what are your (rather _personal_) expectations for the performance of your model on the blind external test set? Briefly motivate your answer.

# OPTIONAL -- Export the predictions in the format indicated on the assignment release page and verify your predictions on the [assessment page](https://aml-assignmentone-2425.streamlit.app/).
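
A possible export step, sketched under assumptions: the required file format is defined on the assignment page and is not reproduced here, so the column names `ID` and `prediction` are placeholders. The sketch turns softmax outputs into class labels with `argmax` and pairs them with the test IDs; `probs_demo`, `ids_demo`, and `classes_demo` are made-up stand-ins for `model.predict(X_test_scaled)`, `ids_test`, and `encoder.classes_`.

```python
import csv
import io

import numpy as np

def decode_predictions(probs, classes):
    """Map each row of class probabilities to its most likely original label."""
    return np.asarray(classes)[np.argmax(probs, axis=1)]

def write_predictions(ids, preds, handle, header=("ID", "prediction")):
    """Write one (id, label) row per sample; the header names are placeholders."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    for id_, pred in zip(ids, preds):
        writer.writerow([id_, int(pred)])

# Hypothetical stand-ins for the model output, test IDs, and encoder classes
probs_demo = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
ids_demo = ["101", "102", "103"]
classes_demo = np.array([0, 1])

preds = decode_predictions(probs_demo, classes_demo)
buffer = io.StringIO()  # replace with open("predictions.csv", "w", newline="")
write_predictions(ids_demo, preds, buffer)
print(buffer.getvalue())
```

Indexing `classes` with the `argmax` result undoes the integer encoding, so the exported labels use the original label values rather than column positions.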