# 1. Feature Engineering Objective

The objective of feature engineering in this project is to transform raw loan applicant data into a **consistent, numerical, and model-ready format** while preserving the underlying information relevant to **loan default risk**. Since the dataset contains a mix of numerical and categorical variables, appropriate preprocessing is required to ensure that all models can effectively learn from the data.

Feature engineering will be performed with the following principles in mind:

- **Identifier columns** (such as unique loan IDs) will be excluded, as they do not carry predictive information.
- **Numerical and categorical features** will be handled using suitable transformations to ensure compatibility with machine learning algorithms.
- **Data leakage will be strictly avoided** by learning all preprocessing steps exclusively from the training data.
- **A consistent engineered feature set** will be used across all models to enable fair and meaningful comparison between algorithms.
- The **feature engineering pipeline will be reproducible and deployment-ready**, allowing the trained model to be reliably applied to unseen data.

This approach ensures that the engineered features support **robust model training**, **reliable evaluation**, and **future deployment** without introducing bias or inconsistencies.
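As a minimal sketch of the leakage-avoidance principle, preprocessing can be wrapped in a scikit-learn `Pipeline` so that scaling and encoding statistics are learned from training rows only. The toy data frames, their values, and the choice of `StandardScaler` and `OneHotEncoder` are assumptions for illustration, not the project's confirmed pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the real training and test frames (values are made up).
train = pd.DataFrame({
    "Income": [30000, 60000, 90000, 120000],
    "Education": ["High School", "Bachelor's", "Master's", "Bachelor's"],
})
test = pd.DataFrame({
    "Income": [75000, 45000],
    "Education": ["PhD", "High School"],  # "PhD" never appears in train
})

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Income"]),
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Education"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
pipeline = Pipeline([("preprocess", preprocessor)])

# Fit on training data only, then reuse the learned statistics on test data.
train_features = pipeline.fit_transform(train)
test_features = pipeline.transform(test)

print(train_features.shape)
print(test_features.shape)
```

Because the scaler's mean and the encoder's category list come from `train` alone, the test rows cannot influence the transformation. A fitted pipeline object can also be saved and reapplied to unseen data, which is what makes the approach suitable for deployment.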


# 2. Train–Test Split Strategy

The dataset is split into training and test sets **before any preprocessing or feature transformations** (such as encoding or scaling) to prevent data leakage and ensure a fair evaluation of model performance.

- The **target variable (`y`)** is defined as the `Default` column, which indicates whether a loan default occurred.
- The **feature matrix (`X`)** consists of all remaining columns excluding:
  - the target variable (`Default`)
  - the unique identifier column (`LoanID`)

The `LoanID` column is removed because it serves only as a record identifier and does not contain predictive information.

Due to the observed **class imbalance** in the target variable, a **stratified split** is used to preserve the proportion of defaulters and non-defaulters in both the training and test sets.

- The data is divided into **80% training data** and **20% test data**, providing sufficient data for model training while retaining an unseen test set for final evaluation.
- A **fixed random state** is used during the split to ensure reproducibility of results.

Model selection and hyperparameter tuning will be performed using **cross-validation on the training set** only.

The **test set will be used exactly once**, after model selection is complete, to evaluate the final model’s performance and provide an unbiased estimate of generalization ability.
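The evaluation protocol described above can be sketched as follows. This is an illustrative example on synthetic data; the `LogisticRegression` model, the five folds, and the ROC AUC metric are assumptions chosen for the sketch, not choices stated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic imbalanced data standing in for the loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Model selection: stratified cross-validation on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"Mean CV ROC AUC: {cv_scores.mean():.3f}")

# Final evaluation: fit once on all training data, score the test set once.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test ROC AUC: {test_auc:.3f}")
```

The test rows are touched only in the last two lines, after every modelling decision has been made on the training folds.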


In [181]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [182]:
df = pd.read_csv("../Loan_default.csv")

In [183]:
y = df["Default"]
X = df.drop(["Default", "LoanID"], axis=1)

In [184]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
    stratify=y
)

In [185]:
# Verify shapes
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (204277, 16)
X_test shape: (51070, 16)
y_train shape: (204277,)
y_test shape: (51070,)
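
Beyond checking shapes, it is worth confirming that stratification preserved the class balance. The sketch below uses a small synthetic target (10% positives, an assumed figure) so that it runs on its own; in the notebook, the same `value_counts(normalize=True)` call would be applied to `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with 10% positives, standing in for the Default column.
y = pd.Series([1] * 100 + [0] * 900, name="Default")
X = pd.DataFrame({"feature": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# With stratify=y, the positive rate should match across both splits.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```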


1. Features such as `Income`, `LoanAmount`, and `CreditScore` take much larger values than the binary target `Default`, so their scales differ widely.
2. `LoanID` is only an identifier and has no effect on the target, so the column is not used.
3. `Education` is somewhat important: a higher level of education raises the chance of a well-paying job, though not necessarily, since some people run a business and repay the loan from that.
4. `Employment` is a good feature: an unemployed person will find it hard to repay the loan unless someone else pays.
5. `MaritalStatus` may indicate higher expenses for married applicants, but also greater responsibility.
6. `HasMortgage` indicates whether the person has a mortgaged property; a mortgage suggests a heavier burden when repaying the loan.
7. `HasDependents` indicates whether the person has dependents, which increases their expenses.
8. `HasCoSigner` indicates whether there is a co-applicant, which decreases the chance of default.
9. `LoanPurpose` indicates what the loan is for; with an `Auto` loan, the vehicle can be kept as security.
10. `Age` can be engineered by grouping ages, which range from 18 to 69, into age groups.
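
The age grouping in the last observation could be done with `pandas.cut`. This is a minimal sketch; the bin edges and labels are hypothetical and would need to be chosen for the project.

```python
import pandas as pd

# Toy ages within the 18 to 69 range mentioned above.
ages = pd.Series([18, 25, 34, 47, 58, 69], name="Age")

# Hypothetical bin edges and labels; right=False makes each bin [left, right).
bins = [18, 30, 40, 50, 60, 70]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group.tolist())
```

Fixed bin edges like these do not depend on the data, so they carry no leakage risk. Quantile-based bins, by contrast, would have to be computed from the training set only.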