# 1. Feature Engineering Objective

The objective of feature engineering in this project is to transform raw loan applicant data into a **consistent, numerical, and model-ready format** while preserving the underlying information relevant to **loan default risk**. Since the dataset contains a mix of numerical and categorical variables, appropriate preprocessing is required to ensure that all models can effectively learn from the data.

Feature engineering will be performed with the following principles in mind:

- **Identifier columns** (such as unique loan IDs) will be excluded, as they do not carry predictive information.
- **Numerical and categorical features** will be handled using suitable transformations to ensure compatibility with machine learning algorithms.
- **Data leakage will be strictly avoided** by learning all preprocessing steps exclusively from the training data.
- **A consistent engineered feature set** will be used across all models to enable fair and meaningful comparison between algorithms.
- The **feature engineering pipeline will be reproducible and deployment-ready**, allowing the trained model to be reliably applied to unseen data.

This approach ensures that the engineered features support **robust model training**, **reliable evaluation**, and **future deployment** without introducing bias or inconsistencies.


# 2. Train–Test Split Strategy

The dataset is split into training and test sets **before any preprocessing or feature transformations** (such as encoding or scaling) to prevent data leakage and ensure a fair evaluation of model performance.

- The **target variable (`y`)** is defined as the `Default` column, which indicates whether a loan default occurred.
- The **feature matrix (`X`)** consists of all remaining columns excluding:
  - the target variable (`Default`)
  - the unique identifier column (`LoanID`)

The `LoanID` column is removed because it serves only as a record identifier and does not contain predictive information.

Due to the observed **class imbalance** in the target variable, a **stratified split** is used to preserve the proportion of defaulters and non-defaulters in both the training and test sets.

- The data is divided into **80% training data** and **20% test data**, providing sufficient data for model training while retaining an unseen test set for final evaluation.
- A **fixed random state** is used during the split to ensure reproducibility of results.

Model selection and hyperparameter tuning will be performed using **cross-validation on the training set** only.

The **test set will be used exactly once**, after model selection is complete, to evaluate the final model’s performance and provide an unbiased estimate of generalization ability.
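
The protocol above can be sketched as follows. This is a minimal illustration rather than the project's actual modeling code: the data is synthetic (generated with `make_classification`), and the choice of `LogisticRegression`, five folds, and ROC AUC as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the loan dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=9, weights=[0.88, 0.12], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Wrapping the scaler in a Pipeline means it is re-fitted inside each fold,
# so validation folds never influence the preprocessing statistics.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1_000)),
])

# Cross-validation uses the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="roc_auc")
print(f"CV ROC AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Only after model selection is finished: one final fit, one test evaluation.
model.fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"Test ROC AUC: {test_auc:.3f}")
```

Keeping preprocessing inside the pipeline is what makes the cross-validation scores leakage-free; scaling the full training set once before cross-validation would let each validation fold contribute to the scaler's statistics.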


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("../Loan_default.csv")
df.columns = df.columns.str.lower().str.strip()

In [3]:
y = df["default"]
X = df.drop(["default", "loanid"], axis=1)

In [4]:
df_train, df_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
    stratify=y
)

In [5]:
# Verify shapes
print(f"df_train shape: {df_train.shape}")
print(f"df_test shape: {df_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

df_train shape: (204277, 16)
df_test shape: (51070, 16)
y_train shape: (204277,)
y_test shape: (51070,)
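
Shapes alone do not confirm that stratification worked. A quick check is to compare the positive-class rate across the full data and both splits. The sketch below does this on synthetic data; the roughly 12% positive rate and the `toy_*` names are assumptions for illustration, not values taken from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: roughly 12% positives (assumed rate).
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"income": rng.normal(50_000, 15_000, size=1_000)})
toy_y = pd.Series((rng.random(1_000) < 0.12).astype(int), name="default")

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

# With stratify=..., the positive rate should match closely in both splits.
rate_full = toy_y.mean()
rate_train = toy_y_train.mean()
rate_test = toy_y_test.mean()
print(f"full: {rate_full:.4f}  train: {rate_train:.4f}  test: {rate_test:.4f}")
```

The same comparison can be run on the real split with `y.mean()`, `y_train.mean()`, and `y_test.mean()`.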


# 3. Handling Categorical Variables

Several features in the dataset are categorical in nature and must be converted into numerical representations before being used by machine learning models.

The categorical features identified in this dataset are:

- **Education**
- **EmploymentType**
- **MaritalStatus**
- **HasMortgage**
- **HasDependents**
- **LoanPurpose**
- **HasCoSigner**

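These features are treated as **nominal**, meaning no ordering between their categories is assumed (e.g., *Married* is not greater than *Single*). *Education* arguably has a natural order, but it is one-hot encoded here like the other categorical features.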

**One-Hot Encoding** is chosen as the encoding strategy because:

- It avoids introducing artificial ordinal relationships between categories.
- It is compatible with a wide range of models, including:
  - Linear models
  - Tree-based models
  - Gradient boosting models
  - Neural networks

To prevent **data leakage**, the encoding process will be **fitted only on the training data** and then applied to the test data.

The encoder will be configured to **handle unseen categories gracefully**, ensuring robustness during model evaluation and reliable behavior during future inference and deployment.


In [6]:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = [
    'education', 'employmenttype', 'maritalstatus',
    'hasmortgage', 'hasdependents', 'loanpurpose', 'hascosigner',
]

categorical_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

categorical_transformer = ColumnTransformer(
    transformers=[('cat', categorical_encoder, categorical_features)],
    remainder='drop'
)

# Fit on training data and transform both train and test sets
X_train_cat = categorical_transformer.fit_transform(df_train)
X_test_cat = categorical_transformer.transform(df_test)

# Retrieve encoded categorical feature names
cat_feature_names = categorical_transformer.get_feature_names_out()

# Sanity checks
print("X_train_cat shape:", X_train_cat.shape)
print("X_test_cat shape:", X_test_cat.shape)
print("Number of encoded categorical features:", len(cat_feature_names))


X_train_cat shape: (204277, 22)
X_test_cat shape: (51070, 22)
Number of encoded categorical features: 22


# 4. Handling Numerical Variables

The dataset includes several numerical features that capture quantitative information about the applicant’s financial profile and loan characteristics.

The numerical features considered in this project are:

- **Age**
- **Income**
- **LoanAmount**
- **CreditScore**
- **MonthsEmployed**
- **NumCreditLines**
- **InterestRate**
- **LoanTerm**
- **DTIRatio**

These numerical features exist on very different scales (for example, *Income* versus *InterestRate*), which can negatively impact certain machine learning models if left unprocessed.

Models such as **Logistic Regression** and **Neural Networks** are sensitive to feature scale and typically perform better when numerical inputs are standardized.

**Tree-based models** (Decision Trees, Random Forests, XGBoost) are largely insensitive to feature scaling, because their splits depend only on the ordering of values within a feature; however, scaled numerical features are still prepared to maintain a **consistent and reusable preprocessing pipeline** across all models.

The **Age** feature is treated as a continuous numerical variable and is included in numerical scaling. No age binning or discretization is applied at this stage to avoid unnecessary information loss.

**Standardization** is chosen as the numerical preprocessing method, as it centers features around zero and scales them to unit variance.

To prevent **data leakage**, the scaler will be **fitted only on the training data** and then applied to the test data using the learned parameters.

Numerical preprocessing is handled **separately from categorical encoding** to maintain a clear separation of responsibilities and to allow flexibility for future feature experimentation.


In [7]:
from sklearn.preprocessing import StandardScaler

numerical_features = ['age', 'income', 'loanamount', 'creditscore',
       'monthsemployed', 'numcreditlines', 'interestrate', 'loanterm',
       'dtiratio']

numerical_transformer = ColumnTransformer(
    transformers=[('num', StandardScaler(), numerical_features)],
    remainder='drop'
)

X_train_num = numerical_transformer.fit_transform(df_train)
X_test_num = numerical_transformer.transform(df_test)
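
As a small check of what "fitted only on the training data" means in practice, the sketch below reproduces `StandardScaler` by hand on synthetic values. The `toy_*` arrays are made up for illustration; only the scaler behaviour is the point.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic income-like values (assumption, not the real data).
rng = np.random.default_rng(0)
toy_train_income = rng.normal(60_000, 20_000, size=(500, 1))
toy_test_income = rng.normal(60_000, 20_000, size=(100, 1))

toy_scaler = StandardScaler().fit(toy_train_income)

# StandardScaler computes z = (x - mean) / std, where mean and std come from
# the training data (std is the population standard deviation, ddof=0).
manual_test = (
    toy_test_income - toy_train_income.mean(axis=0)
) / toy_train_income.std(axis=0)
scaled_test = toy_scaler.transform(toy_test_income)
scaled_train = toy_scaler.transform(toy_train_income)

print("matches manual formula:", np.allclose(manual_test, scaled_test))
print("train mean/std:", scaled_train.mean(), scaled_train.std())
print("test  mean/std:", scaled_test.mean(), scaled_test.std())
```

The scaled training data has mean 0 and standard deviation 1 up to floating-point error, while the scaled test data is only approximately standardized. That small discrepancy is expected and is the sign that no test-set statistics were used.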



# 5. Feature Assembly

- After preprocessing, the dataset consists of two separate feature representations:
  - One-hot encoded categorical features generated in **Section 3**
  - Scaled numerical features generated in **Section 4**

- These two feature sets are combined **column-wise** to form a single, unified feature matrix for modeling.

- Column-wise concatenation is valid here because both transformers preserve the row order of their input, so that:
  - Each row still represents the same loan applicant
  - Numerical and categorical features for a given applicant stay aligned
  - Every sample has the same columns in the same order

- The same feature assembly procedure is applied **consistently** to both the training and test datasets.

- No additional fitting, scaling, or encoding is performed at this stage; only **previously transformed features** are combined.

- The outputs of this step are:
  - `X_train`: Final feature matrix used for model training and cross-validation
  - `X_test`: Final feature matrix reserved for final model evaluation

- The resulting feature matrices are fully numerical and compatible with all planned models, including:
  - Linear models
  - Tree-based models
  - Neural networks


In [8]:
X_train = np.concatenate((X_train_num, X_train_cat), axis=1)
X_test = np.concatenate((X_test_num, X_test_cat), axis=1)

print(X_train.shape)
print(X_test.shape)


(204277, 31)
(51070, 31)
