
# Home Credit Default Risk

## About The Capstone Project

Many people can’t get loans because they don’t have enough credit history, leaving them vulnerable to unfair lenders. Home Credit Group helps these individuals by using alternative data, like phone and transaction records, to predict if they can repay a loan. They’re asking for help to improve their methods, so more people get fair and manageable loans that set them up for success.

## Goal

The primary goal is to build a machine learning model that predicts whether an applicant will repay a loan or default, based on their financial, demographic, and historical data.

# Notebook Preparation

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
from EDA_Utilities import *

ModuleNotFoundError: No module named 'EDA_Utilities'
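
The `EDA_Utilities` module is not part of this notebook, so the two helpers used below (`plot_distribution_by_target` and `plot_feature_distribution_by_target`) are unavailable when the import fails. The following is a minimal stand-in sketch, inferred only from how the helpers are called and described later in the notebook. The bodies, the default arguments, and the `bins` parameter are assumptions, not the original implementation.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_distribution_by_target(df, column, target):
    """Hypothetical stand-in: category counts (left) and per-category target shares (right)."""
    counts = pd.crosstab(df[column], df[target])
    shares = pd.crosstab(df[column], df[target], normalize="index")

    fig, (ax_count, ax_share) = plt.subplots(1, 2, figsize=(14, 5))
    counts.plot(kind="bar", stacked=True, ax=ax_count)
    ax_count.set_title(f"Count of {target} by {column}")
    ax_count.set_ylabel("Count")
    shares.plot(kind="bar", stacked=True, ax=ax_share)
    ax_share.set_title(f"Proportion of {target} by {column}")
    ax_share.set_ylabel("Proportion")
    fig.tight_layout()
    return fig


def plot_feature_distribution_by_target(df, feature, target, remove_outliers=False,
                                        outlier_feature=None, percentile=0.99, bins=50):
    """Hypothetical stand-in: overlaid histograms of a numeric feature per target class."""
    data = df
    if remove_outliers:
        # Assumed behaviour: drop rows above the given quantile of the outlier column.
        outlier_column = outlier_feature or feature
        cutoff = df[outlier_column].quantile(percentile)
        data = df[df[outlier_column] <= cutoff]

    fig, ax = plt.subplots(figsize=(12, 6))
    for value, color in [(0, "blue"), (1, "red")]:
        subset = data.loc[data[target] == value, feature].dropna()
        ax.hist(subset, bins=bins, alpha=0.5, color=color, label=f"{target} = {value}")
    ax.set_xlabel(feature)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Distribution of {feature} by {target}")
    ax.legend()
    return fig
```

Defining these in a cell before the EDA section would let the later calls run without the missing module, although the resulting charts may differ from the ones the original helpers produced.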

In [None]:
description = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/HomeCredit_columns_description.csv', encoding = 'ISO-8859-1')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/application_test.csv', encoding = 'ISO-8859-1')
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/application_train.csv', encoding = 'ISO-8859-1')
bureau = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/bureau.csv', encoding = 'ISO-8859-1')
bureau_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/bureau_balance.csv', encoding = 'ISO-8859-1')
credit_balance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/credit_card_balance.csv', encoding = 'ISO-8859-1')
payments = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning Capstone Project/home-credit-default-risk/installments_payments.csv', encoding = 'ISO-8859-1')

In [None]:
description.head(5)

In [None]:
test.head(5)

In [None]:
train.head(5)

In [None]:
bureau.head(5)

In [None]:
bureau_balance.head(5)

In [None]:
credit_balance.head(5)

In [None]:
payments.head(5)

In [None]:
train.describe()

In [None]:
train.columns.tolist()

In [None]:
missing_values = train.isnull().sum()
missing_values

In [None]:
# Plot how many columns the training set has of each data type

# Count data types
data_types = train.dtypes.value_counts()

# Create bar plot (dtype labels are cast to strings so they can be used as categories)
plt.figure(figsize=(10, 6))
sns.barplot(x=data_types.index.astype(str), y=data_types.values)
plt.title('Distribution of Data Types in the Dataset')
plt.xlabel('Data Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

## EDA

Identification Columns:
* SK_ID_CURR
* TARGET

Loan Information:
* NAME_CONTRACT_TYPE
* AMT_CREDIT
* AMT_ANNUITY
* AMT_GOODS_PRICE

Demographic Information:
* CODE_GENDER
* FLAG_OWN_CAR
* FLAG_OWN_REALTY
* CNT_CHILDREN
* AMT_INCOME_TOTAL

Employment and Education:
* NAME_INCOME_TYPE
* NAME_EDUCATION_TYPE
* DAYS_EMPLOYED
* OCCUPATION_TYPE
* FLAG_EMP_PHONE

Family and Housing:
* NAME_FAMILY_STATUS
* NAME_HOUSING_TYPE
* CNT_FAM_MEMBERS

Regional and Residency Information:
* REGION_POPULATION_RELATIVE
* DAYS_BIRTH
* DAYS_REGISTRATION
* REG_REGION_NOT_LIVE_REGION
* REG_CITY_NOT_LIVE_CITY

Application and Processing Information:
* WEEKDAY_APPR_PROCESS_START
* HOUR_APPR_PROCESS_START
* ORGANIZATION_TYPE
* FLAG_EMAIL

Housing Characteristics (Aggregates and Averages):
* APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG
* ELEVATORS_AVG, ENTRANCES_AVG, FLOORSMAX_AVG

Social Connections:
* OBS_30_CNT_SOCIAL_CIRCLE
* DEF_30_CNT_SOCIAL_CIRCLE
* OBS_60_CNT_SOCIAL_CIRCLE, DEF_60_CNT_SOCIAL_CIRCLE

Document Flags:
* FLAG_DOCUMENT_2 to FLAG_DOCUMENT_21

Credit Bureau Data:
* AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK

External Sources and Scoring:
* EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3



### Frame the Problem and look at the big picture

The primary goal is to build a machine learning model that predicts whether an applicant will repay a loan or default, based on their financial, demographic, and historical data.

In [None]:
target_counts = train['TARGET'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Target Variable')
plt.axis('equal')
plt.show()

The data is highly imbalanced, with 91.9% in class 0 (no payment issues) and only 8.1% in class 1 (payment issues). This can cause models to favor the majority class, so techniques like resampling or class weighting are needed to handle it effectively.
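
The class weighting mentioned above can be sketched as follows. This is an illustration only: the label vector is synthetic and built to mirror the 91.9% / 8.1% split, and the "balanced" heuristic shown (weight = n_samples / (n_classes * class_count)) is one common choice, not necessarily the best one for this problem.

```python
import numpy as np
import pandas as pd

# Synthetic labels mirroring the 91.9% / 8.1% split reported above (not the real data).
y = pd.Series([0] * 919 + [1] * 81, name="TARGET")

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
class_counts = y.value_counts().sort_index()
class_weights = len(y) / (len(class_counts) * class_counts)

# Per-row weights, usable as the sample_weight argument of many estimators' fit().
sample_weights = y.map(class_weights).to_numpy()

print(class_weights.round(3).to_dict())
```

With these weights, both classes contribute the same total weight during training, so the minority class is no longer drowned out by the majority class.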

### Loan Information

* NAME_CONTRACT_TYPE
* AMT_CREDIT
* AMT_ANNUITY
* AMT_GOODS_PRICE

In [None]:
columns_to_describe = ['NAME_CONTRACT_TYPE', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE']
train[columns_to_describe].describe(include='all')

In [None]:
plot_distribution_by_target(train, 'NAME_CONTRACT_TYPE', 'TARGET')


The left chart shows that cash loans are far more common than revolving loans. The right chart shows that both contract types have similar repayment patterns, with most loans being repaid (TARGET 0 = repaid, TARGET 1 = default).
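
Proportions read off a stacked bar chart can be hard to compare precisely, so it may help to compute the default rate per category directly. The sketch below uses a made-up toy frame; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy frame with made-up rows; column names follow application_train.csv.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 6 + ["Revolving loans"] * 4,
    "TARGET": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Because TARGET is 0/1, its mean within a group is that group's default rate.
default_rate = toy.groupby("NAME_CONTRACT_TYPE")["TARGET"].mean()
print(default_rate)
```

Running the same `groupby` on `train` would give the exact rates behind the right-hand chart.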

In [None]:
plot_feature_distribution_by_target(train, 'AMT_CREDIT', 'TARGET')

The chart shows AMT_CREDIT distribution for repaid loans (blue, TARGET = 0) and defaults (red, TARGET = 1). Most loans are under 1 million, with repaid loans far more frequent. Defaults follow a similar pattern but at much lower rates across all amounts.

The AMT_CREDIT distribution in the chart resembles a right-skewed (positive skew) distribution. Most loan amounts are concentrated on the lower end, with a long tail extending towards higher values. This suggests that smaller loans are more common, while larger loans are less frequent but still present.
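
The right skew described above can also be quantified rather than judged by eye. The following sketch uses synthetic log-normal values as a stand-in for AMT_CREDIT (an assumption made purely for illustration) and compares the sample skewness before and after a `log1p` transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed amounts (log-normal), standing in for AMT_CREDIT.
amounts = pd.Series(rng.lognormal(mean=13.0, sigma=0.6, size=10_000))

raw_skew = amounts.skew()            # clearly positive for a right-skewed sample
log_skew = np.log1p(amounts).skew()  # much closer to 0 after the log transform

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A log transform like this is a common option for skewed amount columns, although tree-based models are generally insensitive to monotonic transforms of a single feature.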

In [None]:
plot_feature_distribution_by_target(train, 'AMT_ANNUITY', 'TARGET')

The plot shows the distribution of AMT_ANNUITY (loan annuity amount) for repaid loans (blue, TARGET = 0) and defaulted loans (red, TARGET = 1). Most annuities are concentrated at lower values, with a peak around 25,000, and the distribution is right-skewed. Defaults follow a similar pattern but occur at a much lower frequency across all annuity amounts. Smaller annuities are more common, but defaults still happen across different annuity levels.

In [None]:
plot_feature_distribution_by_target(train, 'AMT_GOODS_PRICE', 'TARGET')

The AMT_GOODS_PRICE distribution is right-skewed, with most values below 1 million. There are multiple peaks, likely reflecting common loan amounts for specific product prices. Repaid loans (blue) are far more frequent, but defaults (red) follow a similar pattern. The sharp spikes suggest standardized pricing, possibly for fixed-price goods or structured loan plans.
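
One way to check whether the spikes come from a handful of repeated values is to look at the most frequent values and the share of rows each one covers. The prices below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up prices; the repeated round values are illustrative, not real figures.
prices = pd.Series([450000, 450000, 450000, 225000, 225000, 312768, 98541, 675000])

# Share of rows taken by each distinct value; a few values with a large share indicate spikes.
top_values = prices.value_counts(normalize=True).head(3)
print(top_values)
```

Applied to `train['AMT_GOODS_PRICE']`, a short list of round numbers with large shares would support the standardized-pricing reading.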

#### Overall Insights
1. The AMT_CREDIT and AMT_GOODS_PRICE distributions show that most loan amounts are below 1 million. The multiple peaks in AMT_GOODS_PRICE suggest that people are borrowing fixed amounts, likely based on the cost of specific products (e.g., appliances, electronics, vehicles). This means loans aren't random amounts but rather follow predefined pricing structures.
2. In all distributions (AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE), repaid loans (TARGET = 0) and defaulted loans (TARGET = 1) have similar shapes. A borrower’s loan amount alone isn’t a strong predictor of default.
3. Both cash loans and revolving loans have similar default proportions (as seen in the contract type plots). This indicates that the type of loan a person takes doesn’t significantly affect their likelihood of repayment.
4. **The sharp spikes in AMT_GOODS_PRICE show that many loans are issued at fixed amounts, likely matching common retail product prices. Instead of borrowers requesting random amounts, they might be financing specific purchases (e.g., home appliances, cars, or mobile phones).**

### Demographic Information:

* CODE_GENDER
* FLAG_OWN_CAR
* FLAG_OWN_REALTY
* CNT_CHILDREN
* AMT_INCOME_TOTAL

In [None]:
columns_to_describe = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL']
train[columns_to_describe].describe(include='all')

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
gender_target_counts = train.groupby('CODE_GENDER')['TARGET'].value_counts().unstack()
gender_target_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Count of Target by Gender')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
gender_target_props = train.groupby('CODE_GENDER')['TARGET'].value_counts(normalize=True).unstack()
gender_target_props.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Proportion of Target by Gender')
plt.ylabel('Proportion')

plt.tight_layout()
plt.show()

In [None]:
plot_distribution_by_target(train, 'CODE_GENDER', 'TARGET')

In [None]:
xna_gender_count = train[train['CODE_GENDER'] == 'XNA'].shape[0]

print(f"Number of 'XNA' genders: {xna_gender_count}")

In [None]:
plot_distribution_by_target(train, 'FLAG_OWN_CAR', 'TARGET')

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='FLAG_OWN_CAR', y='AMT_GOODS_PRICE', hue='TARGET', data=train)
plt.title('AMT_GOODS_PRICE by Car Ownership and Target')
plt.show()

In [None]:
plot_distribution_by_target(train, 'FLAG_OWN_REALTY', 'TARGET')

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='FLAG_OWN_REALTY', y='AMT_GOODS_PRICE', hue='TARGET', data=train)
plt.title('AMT_GOODS_PRICE by Realty Ownership and Target')
plt.show()

In [None]:
plot_distribution_by_target(train, 'CNT_CHILDREN', 'TARGET')

In [None]:
plot_feature_distribution_by_target(train, 'AMT_INCOME_TOTAL', 'TARGET')

In [None]:
plot_feature_distribution_by_target(train, 'AMT_INCOME_TOTAL', 'TARGET', remove_outliers=True, outlier_feature='AMT_INCOME_TOTAL', percentile=0.99)

Repaid loans (blue) dominate across all income levels. Multiple sharp peaks suggest that many applicants report standardized or rounded incomes. Defaults (red) are more common at lower incomes but still occur in higher brackets, so income alone does not determine repayment and other risk factors matter.

#### Overall Insights
1. There are more female applicants than male applicants, but both genders have a similar proportion of defaults.
2. Non-car owners have more loans overall, but default rates are similar for both groups.
3. Real estate owners and non-owners have nearly identical default rates.
4. Higher-income applicants repay loans more reliably, but defaults occur across all income levels.
5. **Applicants with more children tend to default more, possibly due to higher financial burdens.**
6. **AMT_GOODS_PRICE and AMT_CREDIT show standardized loan amounts, meaning loans are likely tied to specific products.**

### Employment and Education:

* NAME_INCOME_TYPE
* NAME_EDUCATION_TYPE
* DAYS_EMPLOYED
* OCCUPATION_TYPE
* FLAG_EMP_PHONE

In [None]:
columns_to_describe = ['NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'DAYS_EMPLOYED', 'OCCUPATION_TYPE', 'FLAG_EMP_PHONE']
train[columns_to_describe].describe(include='all')

In [None]:
plot_distribution_by_target(train, 'NAME_INCOME_TYPE', 'TARGET')

In [None]:
plot_distribution_by_target(train, 'NAME_EDUCATION_TYPE', 'TARGET')

In [None]:
# Plot the distribution of DAYS_EMPLOYED in years (days divided by 365)

plt.figure(figsize=(12, 6))
sns.histplot(train['DAYS_EMPLOYED'] / 365, kde=True)
plt.xlabel('Years Employed')
plt.ylabel('Frequency')
plt.title('Distribution of DAYS_EMPLOYED (Years)')
plt.show()

In [None]:
# Keep only negative DAYS_EMPLOYED values (days are counted backward from the application date)
filtered_days_employed = train[train['DAYS_EMPLOYED'] < 0]['DAYS_EMPLOYED']

# Plot the distribution in years (dividing by -365 turns negative day counts into positive years)
plt.figure(figsize=(12, 6))
sns.histplot(filtered_days_employed / -365, kde=True)
plt.xlabel('Years Employed')
plt.ylabel('Frequency')
plt.title('Distribution of DAYS_EMPLOYED (Years)')
plt.show()


In [None]:
# Select the rows excluded above (DAYS_EMPLOYED of zero or more)
excluded_values = train[train['DAYS_EMPLOYED'] >= 0]

# Display the excluded values
print("Excluded values (DAYS_EMPLOYED >= 0):")
print(excluded_values['DAYS_EMPLOYED'])


In [None]:
# Select the rows excluded above (DAYS_EMPLOYED of zero or more)
excluded_values = train[train['DAYS_EMPLOYED'] >= 0]

# List the unique values
unique_values = excluded_values['DAYS_EMPLOYED'].unique()
print("Unique excluded values:")
print(unique_values)
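
If the positive values listed above turn out to be a placeholder rather than real employment durations, one option is to flag them and then treat them as missing. The sketch below assumes that interpretation. The value 365243 in the toy frame is an assumed placeholder and should be checked against the unique values printed above, and `DAYS_EMPLOYED_ANOM` and `YEARS_EMPLOYED` are hypothetical column names.

```python
import numpy as np
import pandas as pd

# Toy values: negatives are days before the application; 365243 stands in for the placeholder.
toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, -365, 365243, -30, 365243]})

# Flag placeholder rows first so the information survives, then blank them out.
toy["DAYS_EMPLOYED_ANOM"] = toy["DAYS_EMPLOYED"] > 0
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].where(toy["DAYS_EMPLOYED"] <= 0, np.nan)

# Employment length in positive years; placeholder rows stay NaN.
toy["YEARS_EMPLOYED"] = toy["DAYS_EMPLOYED"] / -365
print(toy)
```

Keeping the flag column means a model can still learn from the fact that the value was a placeholder, even though the number itself is discarded.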


In [None]:
plot_distribution_by_target(train, 'OCCUPATION_TYPE', 'TARGET')

In [None]:
plot_distribution_by_target(train, 'FLAG_EMP_PHONE', 'TARGET')

#### Overall Insights

1. Unstable Income Groups Default More – Higher default rates among Unemployed and Maternity Leave individuals, while Working and Pensioners show lower risk.
2. Lower Education Increases Risk – Secondary education and Incomplete higher education groups default more, whereas Academic Degree holders have lower default rates.
3. Low-Skill Jobs Have Higher Defaults – Laborers, Sales Staff, and Drivers show greater default risk, while Accountants, HR, and IT Staff are more financially stable.
4. Short Employment Duration is Risky – Shorter DAYS_EMPLOYED (few years of employment) correlates with a higher likelihood of default.
5. Employment Phone Ownership Shows No Clear Impact – FLAG_EMP_PHONE does not significantly differentiate defaulters from non-defaulters.


### Family and Housing:

* NAME_FAMILY_STATUS
* NAME_HOUSING_TYPE
* CNT_FAM_MEMBERS

In [None]:
columns_to_describe = ['NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'CNT_FAM_MEMBERS']
train[columns_to_describe].describe(include='all')

In [None]:
plot_distribution_by_target(train, 'NAME_FAMILY_STATUS', 'TARGET')

In [None]:
plot_distribution_by_target(train, 'NAME_HOUSING_TYPE', 'TARGET')

In [None]:
plot_distribution_by_target(train, 'CNT_FAM_MEMBERS', 'TARGET')

#### Overall Insights

1. Married applicants form the majority of borrowers but have lower default rates. Single and separated applicants show slightly higher default proportions, indicating potential financial instability.
2. Most applicants live in houses/apartments, but renters and those living with parents show higher default proportions.
3. Most borrowers have 2-3 family members, and larger families show a higher proportion of defaults.
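
Large families are rare, so their default proportions rest on few rows and can look more extreme than they are. A minimal sketch of one way to guard against that follows, using made-up rows. The cap at five members and the `FAM_BUCKET` column name are arbitrary choices for illustration.

```python
import pandas as pd

# Toy rows; values are made up for illustration.
toy = pd.DataFrame({
    "CNT_FAM_MEMBERS": [1, 2, 2, 2, 3, 3, 4, 5, 7, 9],
    "TARGET":          [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
})

# Pool rare large families into a single "5 or more" bucket before comparing rates.
toy["FAM_BUCKET"] = toy["CNT_FAM_MEMBERS"].clip(upper=5)

# Report the group size next to the default rate so thin groups are easy to spot.
summary = toy.groupby("FAM_BUCKET")["TARGET"].agg(default_rate="mean", n="size")
print(summary)
```

Reading the rate together with `n` makes it clearer which differences are backed by enough applicants to be taken seriously.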

### Regional and Residency Information:

* REGION_POPULATION_RELATIVE
* DAYS_BIRTH
* DAYS_REGISTRATION
* REG_REGION_NOT_LIVE_REGION
* REG_CITY_NOT_LIVE_CITY

In [None]:
columns_to_describe = ['REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_REGISTRATION', 'REG_REGION_NOT_LIVE_REGION', 'REG_CITY_NOT_LIVE_CITY']
train[columns_to_describe].describe()

In [None]:
plot_feature_distribution_by_target(train, 'REGION_POPULATION_RELATIVE', 'TARGET')

In [None]:
plt.figure(figsize=(12, 6))

# Convert DAYS_BIRTH from days to years
sns.histplot((train[train['TARGET'] == 0]['DAYS_BIRTH'] / -365), kde=True, label='TARGET = 0', color='blue', alpha=0.5)
sns.histplot((train[train['TARGET'] == 1]['DAYS_BIRTH'] / -365), kde=True, label='TARGET = 1', color='red', alpha=0.5)

plt.xlabel('Age (Years)')
plt.ylabel('Frequency')
plt.title('Distribution of Age by TARGET')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(12, 6))

# Convert DAYS_REGISTRATION from days to years
sns.histplot((train[train['TARGET'] == 0]['DAYS_REGISTRATION'] / -365), kde=True, label='TARGET = 0', color='blue', alpha=0.5)
sns.histplot((train[train['TARGET'] == 1]['DAYS_REGISTRATION'] / -365), kde=True, label='TARGET = 1', color='red', alpha=0.5)

plt.xlabel('Years Since Registration')
plt.ylabel('Frequency')
plt.title('Distribution of Registration by TARGET')
plt.legend()
plt.show()

In [None]:
plot_distribution_by_target(train, 'REG_REGION_NOT_LIVE_REGION', 'TARGET')

In [None]:
plot_distribution_by_target(train, 'REG_CITY_NOT_LIVE_CITY', 'TARGET')

#### Overall Insights

1. Default rates are more evenly spread across lower population density areas, suggesting that applicants from less populated regions might be at slightly higher risk.
2. Younger applicants show higher default rates, while older borrowers tend to repay more reliably.
3. Whether an applicant lives in a different region or city than registered does not significantly change default rates.
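
The age effect in the second insight can be checked numerically by binning age and computing the default rate per band. The sketch below uses made-up rows, and the band edges are arbitrary choices for illustration.

```python
import pandas as pd

# Toy rows; DAYS_BIRTH is negative (days before the application), as in the dataset.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-8000, -9500, -12000, -15000, -18000, -21000, -23000, -24000],
    "TARGET":     [1, 0, 1, 0, 0, 1, 0, 0],
})

toy["AGE_YEARS"] = toy["DAYS_BIRTH"] / -365
toy["AGE_BAND"] = pd.cut(toy["AGE_YEARS"], bins=[20, 35, 50, 70])

# Default rate per age band; observed=True keeps only bands that contain rows.
rate_by_age = toy.groupby("AGE_BAND", observed=True)["TARGET"].mean()
print(rate_by_age)
```

On the real data, a rate that falls steadily from the youngest band to the oldest would back up the claim more directly than overlaid histograms.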

### Application and Processing Information:

* WEEKDAY_APPR_PROCESS_START
* HOUR_APPR_PROCESS_START
* ORGANIZATION_TYPE
* FLAG_EMAIL

In [None]:
plot_distribution_by_target(train, 'WEEKDAY_APPR_PROCESS_START', 'TARGET')

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(train[train['TARGET'] == 0]['HOUR_APPR_PROCESS_START'], kde=True, label='TARGET = 0', color='blue', alpha=0.5)
sns.histplot(train[train['TARGET'] == 1]['HOUR_APPR_PROCESS_START'], kde=True, label='TARGET = 1', color='red', alpha=0.5)

plt.xlabel('Hour of Application Process Start')
plt.ylabel('Frequency')
plt.title('Distribution of HOUR_APPR_PROCESS_START by Target')
plt.legend()
plt.show()

In [None]:
plot_distribution_by_target(train, 'ORGANIZATION_TYPE', 'TARGET')

In [None]:
train['ORGANIZATION_TYPE'].nunique()

In [None]:
plot_distribution_by_target(train, 'FLAG_EMAIL', 'TARGET')

### Housing Characteristics (Aggregates and Averages):

* APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG
* ELEVATORS_AVG, ENTRANCES_AVG, FLOORSMAX_AVG

### Social Connections:

* OBS_30_CNT_SOCIAL_CIRCLE
* DEF_30_CNT_SOCIAL_CIRCLE
* OBS_60_CNT_SOCIAL_CIRCLE
* DEF_60_CNT_SOCIAL_CIRCLE

### Document Flags:

* FLAG_DOCUMENT_2 to FLAG_DOCUMENT_21

### Credit Bureau Data:

* AMT_REQ_CREDIT_BUREAU_HOUR
* AMT_REQ_CREDIT_BUREAU_DAY
* AMT_REQ_CREDIT_BUREAU_WEEK

### External Sources and Scoring:

* EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3

#### **Backward Sequential Feature Selection**
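
This section has no code yet. As a rough sketch of what backward sequential selection could look like, the example below uses scikit-learn's `SequentialFeatureSelector` with `direction="backward"`. The data is synthetic, and the logistic regression estimator, the `roc_auc` scoring, and the target of two features are all assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 400
# Synthetic data: only x0 and x1 drive the label; x2 and x3 are pure noise.
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x0", "x1", "x2", "x3"])
y = (X["x0"] + X["x1"] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Start from all features and repeatedly drop the one whose removal hurts CV AUC least.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
    scoring="roc_auc",
    cv=3,
)
selector.fit(X, y)

selected = list(X.columns[selector.get_support()])
print(selected)
```

Backward selection refits the model once per remaining feature at every step, so on the full application table it may be worth starting from a pre-filtered feature list.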


Working Code for submitting the results

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Feature engineering and preprocessing (example - adapt to your specific needs)
features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE']  # Example features
categorical_features = ['NAME_CONTRACT_TYPE', 'CODE_GENDER'] # Example Categorical Features
for col in categorical_features:
    # Build one shared category list so a value maps to the same code in train and test
    categories = pd.concat([train[col], test[col]]).astype('category').cat.categories
    train[col] = pd.Categorical(train[col], categories=categories).codes
    test[col] = pd.Categorical(test[col], categories=categories).codes

# Handle missing values (example - use more sophisticated methods if needed)
train = train.fillna(0)
test = test.fillna(0)

X = train[features + categorical_features]
y = train['TARGET']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Compute class weights
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

# Create and train the XGBoost model with class weights
model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_train, y_train, sample_weight=sample_weights)

# Make predictions on the test set
X_test = test[features + categorical_features]
predictions = model.predict_proba(X_test)[:, 1]  # Probability of TARGET = 1

# Create submission file
submission = pd.DataFrame({'SK_ID_CURR': test['SK_ID_CURR'], 'TARGET': predictions})
submission.to_csv('submission.csv', index=False)

print("Submission file 'submission.csv' created successfully.")

Submission file 'submission.csv' created successfully.
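
The cell above creates `X_val` and `y_val` but never scores the model on them. A minimal sketch of a held-out check follows, using ROC AUC, which is a common metric for imbalanced problems. The data is synthetic, and `LogisticRegression` is only a stand-in classifier so the example runs without XGBoost.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic, imbalanced data standing in for the real features and TARGET.
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.4).astype(int)

# stratify=y keeps the minority-class share the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stand-in classifier; any model exposing predict_proba is scored the same way.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation ROC AUC: {val_auc:.3f}")
```

Scoring on the validation split before writing the submission file gives an early signal of whether a change to features or weights actually helps.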


In [None]:
# List the files in /datalab

!ls /datalab

run.sh	web

