# **Feature Engineering for Credit Risk Modeling**  

Feature engineering enhances model performance by **extracting meaningful insights** from raw data, helping the model better identify **risk patterns** and improve prediction accuracy.

In [42]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

## **Encoding Categorical Variables**

In [43]:
# Load the cleaned dataset
dataset_path = r'C:/Users/vagel/Desktop/C.R/credit_risk_analysis.csv'
data = pd.read_csv(dataset_path)

# Identify categorical columns
categorical_cols = ["Gender", "Owns_Car", "Owns_Property", "Income_Type", "Education_Level",
                    "Family_Status", "Housing_Type", "Occupation_Type"]

# Apply Label Encoding for ordinal variables
label_encoder = LabelEncoder()
ordinal_cols = ["Education_Level", "Family_Status"]  # These have a ranked order

for col in ordinal_cols:
    data[col] = label_encoder.fit_transform(data[col])

# Apply One-Hot Encoding for nominal variables
nominal_cols = ["Gender", "Owns_Car", "Owns_Property", "Income_Type", "Housing_Type", "Occupation_Type"]
data = pd.get_dummies(data, columns=nominal_cols, drop_first=True)  # Avoid dummy variable trap

# Display encoded dataset sample
display(data)

Unnamed: 0,Applicant_ID,Number_of_Children,Income_Total,Education_Level,Family_Status,Age_in_Years,Years_Employed,Has_Mobile_Phone,Has_Work_Phone,Has_Phone,...,Occupation_Type_Laborers,Occupation_Type_Low-skill Laborers,Occupation_Type_Managers,Occupation_Type_Medicine staff,Occupation_Type_Private service staff,Occupation_Type_Realty agents,Occupation_Type_Sales staff,Occupation_Type_Secretaries,Occupation_Type_Security staff,Occupation_Type_Waiters/barmen staff
0,5008804,0,427500.0,1,0,32.9,12.4,1,1,0,...,True,False,False,False,False,False,False,False,False,False
1,5008805,0,427500.0,1,0,32.9,12.4,1,1,0,...,True,False,False,False,False,False,False,False,False,False
2,5008806,0,112500.0,4,1,58.8,3.1,1,0,0,...,False,False,False,False,False,False,False,False,True,False
3,5008808,0,270000.0,4,3,52.4,8.4,1,0,1,...,False,False,False,False,False,False,True,False,False,False
4,5008809,0,270000.0,4,3,52.4,8.4,1,0,1,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36452,5149828,0,315000.0,4,1,47.5,6.6,1,0,0,...,False,False,True,False,False,False,False,False,False,False
36453,5149834,0,157500.0,1,1,33.9,3.6,1,0,1,...,False,False,False,True,False,False,False,False,False,False
36454,5149838,0,157500.0,1,1,33.9,3.6,1,0,1,...,False,False,False,True,False,False,False,False,False,False
36455,5150049,0,283500.0,4,1,49.2,1.8,1,0,0,...,False,False,False,False,False,False,True,False,False,False


Many machine learning models cannot handle categorical variables in their raw text form.
 
We convert them into numerical values using:
- **Label Encoding** for ordinal variables (Education Level, Family Status).
- **One-Hot Encoding** for nominal variables (Occupation Type, Housing Type, Income Type).

**Why do we need encoding?**
- Models work better with numerical values.
- Helps capture relationships between categories and credit risk.

### **Interpretation**
- **Education Level & Family Status** are label-encoded since they have a clear ranking.
- **Gender, Car Ownership, Property Ownership, Income Type, Housing Type, Occupation Type** are one-hot encoded.
- We dropped the **first category** for each one-hot encoded variable to avoid redundancy.


## **Normalization (0-1 Scaling)**

In [44]:
# Identify numerical columns to scale
numerical_cols = ["Income_Total", "Age_in_Years", "Years_Employed", "Family_Members_Count", "worst_status"]

# Apply MinMax Scaling
scaler = MinMaxScaler()
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

# Display scaled dataset sample
display(data)


Unnamed: 0,Applicant_ID,Number_of_Children,Income_Total,Education_Level,Family_Status,Age_in_Years,Years_Employed,Has_Mobile_Phone,Has_Work_Phone,Has_Phone,...,Occupation_Type_Laborers,Occupation_Type_Low-skill Laborers,Occupation_Type_Managers,Occupation_Type_Medicine staff,Occupation_Type_Private service staff,Occupation_Type_Realty agents,Occupation_Type_Sales staff,Occupation_Type_Secretaries,Occupation_Type_Security staff,Occupation_Type_Waiters/barmen staff
0,5008804,0,0.258721,1,0,0.256198,0.970681,1,1,0,...,True,False,False,False,False,False,False,False,False,False
1,5008805,0,0.258721,1,0,0.256198,0.970681,1,1,0,...,True,False,False,False,False,False,False,False,False,False
2,5008806,0,0.055233,4,1,0.791322,0.961771,1,0,0,...,False,False,False,False,False,False,False,False,True,False
3,5008808,0,0.156977,4,3,0.659091,0.966849,1,0,1,...,False,False,False,False,False,False,True,False,False,False
4,5008809,0,0.156977,4,3,0.659091,0.966849,1,0,1,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36452,5149828,0,0.186047,4,1,0.557851,0.965124,1,0,0,...,False,False,True,False,False,False,False,False,False,False
36453,5149834,0,0.084302,1,1,0.276860,0.962250,1,0,1,...,False,False,False,True,False,False,False,False,False,False
36454,5149838,0,0.084302,1,1,0.276860,0.962250,1,0,1,...,False,False,False,True,False,False,False,False,False,False
36455,5150049,0,0.165698,4,1,0.592975,0.960525,1,0,0,...,False,False,False,False,False,False,True,False,False,False


Feature scaling ensures that all numerical variables have the **same range (0 to 1)**.
This prevents features with larger values (e.g., income) from dominating the model.

**Why do we need normalization?**
- Improves model convergence for algorithms like **Logistic Regression, Neural Networks**.
- Ensures fair comparison between different features.

## **Interpretation**
- All numerical features are now scaled between **0 and 1**.
- This prevents **large numerical values** (like income) from dominating smaller ones (like family members count).



## **Creating New Features**

In [45]:
# Creating Age Groups
def age_group(age):
    if age < 30:
        return "Young"
    elif 30 <= age <= 50:
        return "Mid-Age"
    else:
        return "Senior"

data["Age_Group"] = data["Age_in_Years"].apply(age_group)

# Creating Employment Stability Categories
def employment_stability(years):
    if years < 2:
        return "Short-term"
    elif 2 <= years < 10:
        return "Medium-term"
    else:
        return "Long-term"

data["Employment_Stability"] = data["Years_Employed"].apply(employment_stability)

# Creating Credit History Length Categories
def credit_history_category(months):
    if months < 12:
        return "New Borrower"
    elif 12 <= months < 60:
        return "Experienced Borrower"
    else:
        return "Long-Term Borrower"

data["Credit_History_Length"] = data["worst_status"].apply(credit_history_category)

# Apply One-Hot Encoding to newly created categorical variables
new_features = ["Age_Group", "Employment_Stability", "Credit_History_Length"]
data = pd.get_dummies(data, columns=new_features, drop_first=True)

# Display dataset with new features
display(data)


Unnamed: 0,Applicant_ID,Number_of_Children,Income_Total,Education_Level,Family_Status,Age_in_Years,Years_Employed,Has_Mobile_Phone,Has_Work_Phone,Has_Phone,...,Occupation_Type_Laborers,Occupation_Type_Low-skill Laborers,Occupation_Type_Managers,Occupation_Type_Medicine staff,Occupation_Type_Private service staff,Occupation_Type_Realty agents,Occupation_Type_Sales staff,Occupation_Type_Secretaries,Occupation_Type_Security staff,Occupation_Type_Waiters/barmen staff
0,5008804,0,0.258721,1,0,0.256198,0.970681,1,1,0,...,True,False,False,False,False,False,False,False,False,False
1,5008805,0,0.258721,1,0,0.256198,0.970681,1,1,0,...,True,False,False,False,False,False,False,False,False,False
2,5008806,0,0.055233,4,1,0.791322,0.961771,1,0,0,...,False,False,False,False,False,False,False,False,True,False
3,5008808,0,0.156977,4,3,0.659091,0.966849,1,0,1,...,False,False,False,False,False,False,True,False,False,False
4,5008809,0,0.156977,4,3,0.659091,0.966849,1,0,1,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36452,5149828,0,0.186047,4,1,0.557851,0.965124,1,0,0,...,False,False,True,False,False,False,False,False,False,False
36453,5149834,0,0.084302,1,1,0.276860,0.962250,1,0,1,...,False,False,False,True,False,False,False,False,False,False
36454,5149838,0,0.084302,1,1,0.276860,0.962250,1,0,1,...,False,False,False,True,False,False,False,False,False,False
36455,5150049,0,0.165698,4,1,0.592975,0.960525,1,0,0,...,False,False,False,False,False,False,True,False,False,False


In [46]:
# Add Debt-to-Income Ratio feature
data["Debt_to_Income"] = data["Income_Total"] / (data["Family_Members_Count"] + 1)

# Add Missed Payment Frequency (Count of worst_status >= 1)
data["Missed_Payment_Count"] = data.groupby("Applicant_ID")["worst_status"].transform(lambda x: (x >= 1).sum())

# Add Maximum Consecutive Missed Payments feature
def max_consecutive_missed(series):
    """Calculate the maximum consecutive months of missed payments (worst_status >= 1)."""
    status_str = ''.join(series.astype(str))
    consecutive_missed = max(map(len, status_str.split("0"))) if "1" in status_str else 0
    return consecutive_missed

data["Max_Consecutive_Missed"] = data.groupby("Applicant_ID")["worst_status"].transform(max_consecutive_missed)



## **Newly Created Features**
1. **Age Groups** – Categorizes applicants into meaningful segments: **Young, Mid-Age, and Senior**.
2. **Employment Stability** – Classifies work experience into **Short-term, Medium-term, and Long-term** employment.
3. **Debt-to-Income Ratio** – Measures financial leverage by comparing income to family size.
4. **Credit History Length** – Groups applicants based on the duration of their credit history.
5. **Missed Payment Frequency** – Counts the number of past due payments.
6. **Maximum Consecutive Missed Payments** – Identifies the longest sequence of missed payments.


## **Feature Interpretations**
### **1. Age Group**
**Definition**: Converts continuous age into categorical labels to differentiate risk profiles.  
- **Young ( < 30 years)** – May have less financial experience.  
- **Mid-Age (30 - 50 years)** – Typically more stable in employment and credit behavior.  
- **Senior ( > 50 years)** – May have long credit histories but could be approaching retirement.  

### **2. Employment Stability**
**Definition**: Classifies job tenure into three categories based on financial stability.  
- **Short-term ( < 2 years)** – Potentially unstable employment.  
- **Medium-term (2 - 10 years)** – More stable but still developing career progression.  
- **Long-term ( > 10 years)** – Financially more secure and established.

### **3. Debt-to-Income Ratio**
**Formula:**  
$Debt\_to\_Income = \frac{Income\_Total}{Family\_Members\_Count + 1}$

**Interpretation**:  
- Measures an applicant’s financial leverage.  
- A **higher ratio** suggests greater financial strain and potential difficulty in repaying debt.  

### **4. Missed Payment Frequency**
**Formula:**  
$Missed\_Payment\_Count = \sum(worst\_status \geq 1)$

**Interpretation**:  
- Represents the total number of months the applicant was late on payments.  
- A **higher count** suggests a pattern of repayment difficulties and a higher default risk.  

### **5. Maximum Consecutive Missed Payments**
**Formula:**  
$Max\_Consecutive\_Missed = \max(\text{length of consecutive 1s in worst\_status history})$

**Interpretation**:  
- Identifies the longest uninterrupted period of missed payments.  
- A **higher value** suggests prolonged financial distress, indicating a higher likelihood of default.


## **Save the Processed Dataset**
After feature engineering, we save the dataset for **model training**.


In [47]:
# Save the processed dataset
processed_data_path = r"c:/Users/vagel/Desktop/C.R/credit_risk_analysis_processed.csv"
data.to_csv(processed_data_path, index=False)

# Confirm file saved
print(f"Processed dataset saved at: {processed_data_path}")
data

Processed dataset saved at: c:/Users/vagel/Desktop/C.R/credit_risk_analysis_processed.csv


Unnamed: 0,Applicant_ID,Number_of_Children,Income_Total,Education_Level,Family_Status,Age_in_Years,Years_Employed,Has_Mobile_Phone,Has_Work_Phone,Has_Phone,...,Occupation_Type_Medicine staff,Occupation_Type_Private service staff,Occupation_Type_Realty agents,Occupation_Type_Sales staff,Occupation_Type_Secretaries,Occupation_Type_Security staff,Occupation_Type_Waiters/barmen staff,Debt_to_Income,Missed_Payment_Count,Max_Consecutive_Missed
0,5008804,0,0.258721,1,0,0.256198,0.970681,1,1,0,...,False,False,False,False,False,False,False,0.245785,0,0
1,5008805,0,0.258721,1,0,0.256198,0.970681,1,1,0,...,False,False,False,False,False,False,False,0.245785,0,0
2,5008806,0,0.055233,4,1,0.791322,0.961771,1,0,0,...,False,False,False,False,False,True,False,0.052471,0,0
3,5008808,0,0.156977,4,3,0.659091,0.966849,1,0,1,...,False,False,False,True,False,False,False,0.156977,0,0
4,5008809,0,0.156977,4,3,0.659091,0.966849,1,0,1,...,False,False,False,True,False,False,False,0.156977,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36452,5149828,0,0.186047,4,1,0.557851,0.965124,1,0,0,...,False,False,False,False,False,False,False,0.176744,1,2
36453,5149834,0,0.084302,1,1,0.276860,0.962250,1,0,1,...,True,False,False,False,False,False,False,0.080087,1,2
36454,5149838,0,0.084302,1,1,0.276860,0.962250,1,0,1,...,True,False,False,False,False,False,False,0.080087,1,2
36455,5150049,0,0.165698,4,1,0.592975,0.960525,1,0,0,...,False,False,False,True,False,False,False,0.157413,0,0
