## Data Preprocessing


This section covers the following steps:

* **Importing Libraries:** Load the packages used in this section.
* **Missing Values:** Replace missing entries so that later steps receive complete data.
* **Encoding:** Convert categorical variables to numeric.
* **Feature Engineering:** Add new features or remove redundant ones.
* **Scaling:** Apply normalization/standardization.

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import joblib


### Load Dataset

In [3]:
# Load dataset from local CSV instead of downloading again
csv_path = "data/default_credit_card_clients.csv"
data_credit_card = pd.read_csv(csv_path)

# Display first few rows
data_credit_card.head()

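|  | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |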


### Handle Missing Values

In [4]:
# Replace NaNs with 0. Adjust the strategy if 0 is a meaningful value in a column.
data_credit_card = data_credit_card.fillna(0)
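
Before filling with a constant, it can help to see how many values are actually missing and to consider a statistic-based fill. The following is a minimal sketch on a made-up toy frame (the `toy` data and the choice of median imputation are assumptions for illustration, not part of the notebook's pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the credit card data (values are made up)
toy = pd.DataFrame({
    "LIMIT_BAL": [20000.0, np.nan, 90000.0, 50000.0],
    "AGE": [24, 26, np.nan, 37],
})

# Count missing entries per column before deciding how to fill them
missing_counts = toy.isna().sum()
print(missing_counts)

# Median imputation keeps filled values inside the observed range,
# unlike a constant 0, which may be far from any real value
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

If an imputer is used on the real data, fitting it on the training split only (for example inside the preprocessing pipeline) avoids leaking test-set statistics.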

### Separate Features and Target

In [5]:
# ID is a row identifier, not a predictor, so drop it together with the target
X = data_credit_card.drop(columns=['default.payment.next.month', 'ID'])
y = data_credit_card['default.payment.next.month']


### Encode Categorical Variables and Scale Numerical Features

In [6]:
categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
numerical_features = [col for col in X.columns if col not in categorical_features]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
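        # handle_unknown='ignore' avoids an error if the test set contains a category unseen during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)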
    ]
)

### Split Data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
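
The `stratify=y` argument is meant to keep the class balance the same in both splits. A quick way to check this is sketched below on synthetic labels (`X_toy` and `y_toy` are made up, with 20% positives chosen purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 20 positives and 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, the positive rate should match across splits
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

The same comparison can be run on `y_train` and `y_test` from the cell above.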

### Feature Preprocessing

In [8]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
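
Calling `fit_transform` on the training split and only `transform` on the test split means the scaling statistics come from the training data alone. The sketch below illustrates the effect with a standalone `StandardScaler` on randomly generated data (the arrays `train` and `test` are synthetic assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std learned from train only
test_scaled = scaler.transform(test)        # reuses the train mean and std

# Train is exactly standardized; test is only approximately so
print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The test statistics being close to, but not exactly, 0 and 1 is expected and indicates that no test information was used during fitting.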

In [9]:
# Save preprocessed data and preprocessor
np.save('X_train.npy', X_train)
np.save('X_test.npy', X_test)
np.save('y_train.npy', y_train)
np.save('y_test.npy', y_test)
joblib.dump(preprocessor, 'preprocessor.pkl')

print("Preprocessing complete. Data saved successfully.")

Preprocessing complete. Data saved successfully.
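
A later notebook or script can reload these artifacts with `np.load` and `joblib.load`. The round trip is sketched below using a temporary directory and a small stand-in scaler, so that it runs without the real files (the `features` array and the use of a bare `StandardScaler` in place of the fitted `preprocessor` are assumptions for illustration):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Stand-in for the fitted preprocessor and the transformed training data
    features = np.array([[1.0, 10.0], [3.0, 30.0]])
    scaler = StandardScaler().fit(features)
    np.save(tmp_dir / "X_train.npy", scaler.transform(features))
    joblib.dump(scaler, tmp_dir / "preprocessor.pkl")

    # Reload both artifacts, as a downstream training script might
    X_loaded = np.load(tmp_dir / "X_train.npy")
    scaler_loaded = joblib.load(tmp_dir / "preprocessor.pkl")

    # Apply the reloaded transformer to a new, unseen row
    new_row = scaler_loaded.transform(np.array([[2.0, 20.0]]))

print(X_loaded)
print(new_row)
```

Reusing the saved transformer on new data, rather than fitting a fresh one, keeps inference-time inputs on the same scale as the training data.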
