# Loan Default Risk Analysis

In this project, I will prepare a report for the loan department of a bank. The objective is to determine whether a client's **marital status** and **number of children** influence the likelihood of defaulting on a loan. The bank already has historical data on its clients' creditworthiness.

My analysis will support the development of a **credit scoring system** to evaluate the risk profile of potential borrowers. Credit scoring is an essential tool used to assess an applicant's ability to repay a loan.

Before diving into the data, I’ll clearly define the project’s goals and the hypotheses I intend to test. Throughout this notebook, I will explain my reasoning and decisions to make the report understandable for future collaborators or team members.


1. [Loading Dataset](#loading-dataset)  
2. [Data Exploration](#data-exploration)  
   2.1 [Dataset Description](#dataset-description)  
3. [Data Transformation](#data-transformation)  
   3.1 [Education](#education)  
   3.2 [Children](#children)  
   3.3 [Days employed](#days-employed)   
   3.4 [Client's Age](#clients-age)  
   3.5 [Family Status](#family-status)  
   3.6 [Gender](#gender)  
   3.7 [Income Type](#income-type)  
   3.8 [Duplicates](#duplicates)
4. [Handling missing values](#handling-missing-values)  
   4.1 [Restoring missing values in Total Income](#restoring-missing-values-in-total-income)  
   4.2 [Restoring missing values in Days Employed](#restoring-missing-values-in-dayes-employed)
5. [Data Categorization](#data-categorization)
6. [Hypothesis testing](#hypothesis-testing)
   6.1 [Children impact](#children-impact)
   6.2 [Family status impact](#family-status-impact)
   6.3 [Total income](#total-income)
   6.4 [Purpose](#purpose)
7. [Final conclusion](#final-conclusion)

## Loading dataset


In [1]:
# Loading libraries
import pandas as pd

# Try loading data from GitHub
try:
    url = "https://raw.githubusercontent.com/gabriel-amoroso/bootcamp_dataanalysis/refs/heads/main/loan_default_risk_analysis/credit_scoring_eng.csv"
    credit_score = pd.read_csv(url)
    print("Dataset loaded from GitHub.")
except:
    # Fallback: load from local file (if running offline)
    credit_score = pd.read_csv("credit_scoring_eng.csv")
    print("Dataset loaded from local file.")


Dataset loaded from GitHub.


## Data Exploration

### Dataset Description

Below is a summary of the columns present in the dataset:

- `children` — number of children in the family  
- `days_employed` — work experience in days  
- `dob_years` — client’s age in years  
- `education` — client’s education level  
- `education_id` — education identifier  
- `family_status` — client’s marital status  
- `family_status_id` — marital status identifier  
- `gender` — client’s gender  
- `income_type` — type of employment  
- `debt` — whether the client had any loan payment defaults  
- `total_income` — monthly income  
- `purpose` — reason for applying for the loan

To begin my analysis, I will explore the general structure of the dataset and examine the available columns to understand the nature of the data.

Let’s start by checking how many rows and columns are in the dataset.

In [52]:
credit_score.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21451 entries, 0 to 21524
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   children              21451 non-null  int64  
 1   days_employed         21451 non-null  float64
 2   dob_years             21451 non-null  int64  
 3   education             21451 non-null  object 
 4   education_id          21451 non-null  int64  
 5   family_status         21451 non-null  object 
 6   family_status_id      21451 non-null  int64  
 7   gender                21451 non-null  object 
 8   income_type           21451 non-null  object 
 9   debt                  21451 non-null  int64  
 10  total_income          21451 non-null  float64
 11  purpose               21451 non-null  object 
 12  age_group             21451 non-null  object 
 13  purpose_id            21451 non-null  int32  
 14  income_total_id       21451 non-null  object 
 15  children_debt_status  21

I'll display the first 10 rows of the dataset to get an initial sense of the data structure and contents.

In [53]:
credit_score.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,purpose_id,income_total_id,children_debt_status,family_debt_status,incomerank_debt,purpose_debt_status
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult,2,high income,with children and no debt,married and no debt,high income and no debt,buying a house and no debt
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult,3,medium income,with children and no debt,married and no debt,medium income and no debt,buying a car and no debt
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult,2,medium income,no children and no debt,married and no debt,medium income and no debt,buying a house and no debt
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult,4,high income,with children and no debt,married and no debt,high income and no debt,education and no debt
4,0,340266.072047,53,secondary education,1,married,1,F,retiree,0,25378.572,to have a wedding,adult,1,medium income,no children and no debt,married and no debt,medium income and no debt,wedding and no debt
5,0,926.185831,27,bachelor's degree,0,married,1,M,business,0,40922.17,purchase of the house,adult,2,high income,no children and no debt,married and no debt,high income and no debt,buying a house and no debt
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,adult,2,high income,no children and no debt,married and no debt,high income and no debt,buying a house and no debt
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,adult,4,medium income,no children and no debt,married and no debt,medium income and no debt,education and no debt
8,2,6929.865299,35,bachelor's degree,0,married,1,F,employee,0,15337.093,having a wedding,adult,1,medium income,with children and no debt,married and no debt,medium income and no debt,wedding and no debt
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,adult,2,medium income,no children and no debt,married and no debt,medium income and no debt,buying a house and no debt


First, the `days_employed` column contains negative values for several individuals. Additionally, it's stored as a float, which doesn't make sense for analyzing a time duration measured in days.

Second, the `total_income` column is also stored as a float. Since it represents monthly income, decimal precision is unnecessary for credit analysis — rounding to whole units is more appropriate.

Third, the `purpose` column contains duplicated loan reasons written in slightly different ways. These entries will need to be standardized and grouped accordingly.

Fourth, the `education` column has inconsistent capitalization and will require normalization to lowercase.

There are missing values in the `days_employed` and `total_income` columns.

Let’s filter the dataset to display the rows with missing values in the first of these columns.

The number of missing entries corresponds to a little over 10% of the total dataset.  
If the average number of days worked (comparing rows with and without missing values) does not differ significantly, it may be reasonable to simply remove those rows.

In [4]:
credit_score[(credit_score['days_employed'].isna())|credit_score['total_income'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


We can confirm that the missing values in `days_employed` and `total_income` occur in the same rows.

Since our goal is to standardize the evaluation of whether a client is eligible for credit, and these missing values do not significantly affect the overall averages, we can safely remove them from the dataset without compromising the analysis.

Now let’s apply multiple conditions to filter the data and observe how many rows meet the criteria.


In [5]:
print(credit_score[(credit_score['days_employed'].isna()) & credit_score['total_income'].isna()].shape)

(2174, 12)


The missing values are symmetrical — I confirmed that the rows with missing values are the same for both the `days_employed` and `total_income` columns.

Now let's investigate whether the clients with missing values in these key attributes share any common patterns that might explain the absence of data.

In [6]:
filtered_cred_score = credit_score[(credit_score['days_employed'].isna())|credit_score['total_income'].isna()]
filtered_cred_score.head(50)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
83,0,,52,secondary education,1,married,0,M,employee,0,,housing


The percentage of missing values is slightly above 10%, which suggests that the missing data does not represent a large portion of the dataset. As a result, removing these rows is unlikely to significantly impact the distribution.

Additionally, the missing values appear to be random. I found no indication of any pattern or correlation with other columns that would suggest a specific reason for their absence.

## Data Transformation

### Education

I will now examine all the unique values in the `education` column to identify any inconsistent formatting that needs to be corrected.

In [7]:
print((credit_score['education'].unique()))

["bachelor's degree" 'secondary education' 'Secondary Education'
 'SECONDARY EDUCATION' "BACHELOR'S DEGREE" 'some college'
 'primary education' "Bachelor's Degree" 'SOME COLLEGE' 'Some College'
 'PRIMARY EDUCATION' 'Primary Education' 'Graduate Degree'
 'GRADUATE DEGREE' 'graduate degree']


Standardizing the entries in the `education` column by converting all text to lowercase and consolidating similar categories, and then checking the corrections.

In [8]:
credit_score['education'] = credit_score['education'].str.lower()
credit_score = credit_score.replace("graduate degree", "bachelor's degree")
print(credit_score['education'].unique())

["bachelor's degree" 'secondary education' 'some college'
 'primary education']


Checking `education_id`

In [9]:
credit_score['education_id'].value_counts() 

education_id
1    15233
0     5260
2      744
3      282
4        6
Name: count, dtype: int64

There are 5 different id's for education, and we should have only 4. Let's see a slice of the dataset.

In [10]:
credit_score[credit_score['education_id'] == 4]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2963,0,337584.81556,69,bachelor's degree,4,married,0,M,retiree,0,15800.399,buy residential real estate
4170,0,-409.200149,45,bachelor's degree,4,unmarried,4,M,employee,0,31771.321,transactions with commercial real estate
6551,0,-5352.03818,58,bachelor's degree,4,married,0,M,employee,0,42945.794,going to university
12021,3,-5968.075884,36,bachelor's degree,4,married,0,F,civil servant,0,17822.757,purchase of the house
12786,0,376276.219531,62,bachelor's degree,4,married,0,F,retiree,0,40868.031,buy residential real estate
21519,1,-2351.431934,37,bachelor's degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate


I noticed that there are two different categories assigned to the same education level.  
I’ll standardize the values so that we only have four distinct education IDs.

In [11]:
credit_score['education_id'] = credit_score['education_id'].replace(4, 0)

### Children

Now let’s look at the distribution of values in the `children` column.

In [12]:
print(credit_score['children'].value_counts())

children
 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: count, dtype: int64


There are several entries showing families with 20 or -1 children, which clearly indicates a data entry error.  
It's likely that the intended values were 2 and 1, respectively, and were mistyped.

Correcting these anomalies by replacing them with 2 and 1, as they are reasonable assumptions and unlikely to negatively impact the analysis.

In [13]:
credit_score['children'] = credit_score['children'].replace(20,2)
credit_score['children'] = credit_score['children'].replace(-1,1)
print(credit_score['children'].value_counts())

children
0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: count, dtype: int64


### Days employed

Let’s identify any problematic values in the `days_employed` column:

In [14]:
print(credit_score['days_employed'].unique())

[-8437.67302776 -4024.80375385 -5623.42261023 ... -2113.3468877
 -3112.4817052  -1984.50758853]


The values in the `days_employed` column are negative, which doesn’t make sense in this context.  
To correct this, I converted all values to their absolute equivalents.  
Additionally, the column represents a number of days, the data should be in integers. There’s no such thing as a fractional day of employment, but I'm not going to make that change for now.

In [15]:
credit_score['days_employed'] = abs(credit_score['days_employed'])
credit_score['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

### Client's age

Now I’ll examine the `dob_years` column to check for any suspicious or inconsistent values.


In [16]:
credit_score['dob_years'].value_counts()

dob_years
35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
22    183
66    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: count, dtype: int64

There is at least one entry in the `dob_years` column with a value of 0, which is clearly invalid.  
In this case, the most reasonable solution is to replace the zero values with the average age calculated from the rest of the dataset.

In [17]:
credit_score['dob_years'] = (credit_score['dob_years'].replace(0,credit_score['dob_years'].mean().astype('int')))
print(credit_score['dob_years'].value_counts())

dob_years
35    617
43    614
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: count, dtype: int64


### Family Status

Let’s take a look at the unique values in the `family_status` column.

In [18]:
print(credit_score['family_status'].unique())

['married' 'civil partnership' 'widow / widower' 'divorced' 'unmarried']


In this case, for the purposes of this study, I believe that married and civil partnership can be considered part of the same category. Therefore, I will group them together.


In [19]:
credit_score['family_status'] = credit_score['family_status'].replace('civil partnership', 'married')
print(credit_score['family_status'].unique())

['married' 'widow / widower' 'divorced' 'unmarried']


### Gender

Let’s check the unique values in the `gender` column.

In [20]:
print(credit_score['gender'].unique())

['F' 'M' 'XNA']


The value XNA does not represent a valid gender, but in the context of this analysis, it’s not relevant to the outcome. Therefore, I will ignore it without making any replacements or removals.

### Income type

Examining the unique values in `income_type` column.

In [21]:
print(credit_score['income_type'].value_counts())

income_type
employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: count, dtype: int64


For this column, the types of income are important and should mostly be kept distinct.  
However, it makes sense to group entrepreneur with business, and civil servant with employee, as they represent similar income profiles for the purpose of this analysis.


In [22]:
credit_score['income_type'] = credit_score['income_type'].replace('entrepreneur', 'business')
credit_score['income_type'] = credit_score['income_type'].replace('civil servant', 'employee')
print(credit_score['income_type'].value_counts())

income_type
employee                       12578
business                        5087
retiree                         3856
unemployed                         2
student                            1
paternity / maternity leave        1
Name: count, dtype: int64


### Duplicates

Checking for duplicate rows in the dataset:

In [23]:
credit_score.duplicated().sum()

74

There are 74 duplicate rows, which can be safely removed as they will not impact the final analysis.

In [24]:
credit_score = credit_score.drop_duplicates()
credit_score.duplicated().sum()

0

Let’s check the current size of the dataset after the initial cleaning and transformations.

In [25]:
credit_score.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21451 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21451 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21451 non-null  int64  
 3   education         21451 non-null  object 
 4   education_id      21451 non-null  int64  
 5   family_status     21451 non-null  object 
 6   family_status_id  21451 non-null  int64  
 7   gender            21451 non-null  object 
 8   income_type       21451 non-null  object 
 9   debt              21451 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21451 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


The original dataset contained 21,525 rows.  
After filtering and cleaning the data, the final dataset now contains 21,451 rows.

## Handling Missing Values

### Restoring Missing Values in Total Income

The columns that require correction are `days_employed` and `total_income`.  
To address this, I will replace the missing values using either the mean or the median, depending on which statistic better reflects the data's central tendency in each case.
First, since `age` is a relevant information for both columns, I'll make some categories for later usage.

In [26]:
def age_group(age):
    if age <= 24:
        return 'young adult'
    if age <= 60:
        return 'adult'
    return 'elder'

In [27]:
credit_score['age_group'] = credit_score['dob_years'].apply(age_group)
print(credit_score['age_group'].value_counts())

age_group
adult          18450
elder           2126
young adult      875
Name: count, dtype: int64


Now I’ll create a version of the dataset without missing values, and explore which columns are related to `days_employed` and `total_income`, as well as how these features interact with each other.

In [28]:
credit_score_xna = credit_score.dropna()
credit_score_xna.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult
4,0,340266.072047,53,secondary education,1,married,1,F,retiree,0,25378.572,to have a wedding,adult
5,0,926.185831,27,bachelor's degree,0,married,1,M,business,0,40922.17,purchase of the house,adult
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,adult
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,adult
8,2,6929.865299,35,bachelor's degree,0,married,1,F,employee,0,15337.093,having a wedding,adult
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,adult


Probably, based on the available information, the factors that most strongly influence an individual’s total income are their education level and type of income.

Therefore, I’ll calculate the average income for each education level and for each income type, and see if that's true. If I'm correct, I'll use those values to guide the replacement of missing entries.

In [29]:
for education_id in sorted(credit_score_xna['education_id'].unique()):
    mean_income = credit_score_xna.loc[credit_score_xna['education_id'] == education_id, 'total_income'].mean()
    print(f"Education ID {education_id}: {mean_income}")

Education ID 0: 33136.21694769166
Education ID 1: 24594.50303709925
Education ID 2: 29045.443644444447
Education ID 3: 21144.88221072797


The results aligned with expectations: the group with the highest income consists of those with a college degree, followed by those with some college-level education, and finally, individuals with only school-level education.

Now, checking the type of income.

In [30]:
for income_type in credit_score_xna['income_type'].unique():
    mean_income = credit_score_xna.loc[credit_score_xna['income_type'] == income_type, 'total_income'].mean()
    print(f"{income_type}: {mean_income}")


employee: 25997.252501147803
retiree: 21940.39450304967
business: 32397.165025557017
unemployed: 21014.360500000003
student: 15712.26
paternity / maternity leave: 8612.661


Once again, the results were as expected. The highest income is associated with those who own a business, followed by employees, retirees, unemployed individuals, students, and finally parents receiving child support.

So we can confirm that the characteristics having the greatest impact on income are education level and income type.  
Since the mean is skewed by high-income outliers and is not consistently close across all categories, I chose to use the median as it better represents the typical income within each group.


In [31]:
def fill_nan_total_income(df, column_to_fill):
    df[column_to_fill] = df[column_to_fill].fillna(
        df.groupby(['education', 'income_type'])[column_to_fill].transform('median'))
    df[column_to_fill] = df[column_to_fill].fillna(df[column_to_fill].median())
    
    return df

In [32]:
credit_score = fill_nan_total_income(credit_score, 'total_income')
credit_score.isna().sum()

children               0
days_employed       2100
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
age_group              0
dtype: int64

However, for `days_employed`, I will hold off on filling the missing values until I further investigate potential patterns or biases in that variable.

### Restoring missing values in Days Employed

The parameters that should influence the number of days employed are the type of income and the age group to which the individual belongs.


In [33]:
for income_type in credit_score_xna['income_type'].unique():
    median_days = credit_score_xna.loc[credit_score_xna['income_type'] == income_type, 'days_employed'].median()
    print(f"{income_type}: {median_days}")

employee: 1673.3520511213108
retiree: 365213.3062657312
business: 1546.3332141566746
unemployed: 366413.65274420456
student: 578.7515535382181
paternity / maternity leave: 3296.7599620220594


Individuals who own their own business tend to have a higher number of days employed compared to other groups.

In [34]:
for age_group in credit_score_xna['age_group'].unique():
    median_days = credit_score_xna.loc[credit_score_xna['age_group'] == age_group, 'days_employed'].median()
    print(f"{age_group}: {median_days}")

adult: 2020.3878080900597
young adult: 744.5421298923785
elder: 356191.1376668496


Here I noticed a problem: the median does not accurately reflect the number of days worked for the elder group or the young adult group.  
This suggests that a different strategy may be needed for imputing or interpreting values within those age categories.

In [35]:
for income_type in credit_score_xna['income_type'].unique():
    mean_incometype = credit_score_xna.loc[credit_score_xna['income_type'] == income_type, 'days_employed'].mean()
    print(f"{income_type}: {mean_incometype}")

employee: 2450.8412399581803
retiree: 365003.4912448612
business: 2111.176937329012
unemployed: 366413.65274420456
student: 578.7515535382181
paternity / maternity leave: 3296.7599620220594


Once again, individuals who are employed have the highest average number of days worked.  
The average values for all other income types are relatively close to each other.

In [36]:
for age_group in credit_score_xna['age_group'].unique():
    mean_days = credit_score_xna.loc[credit_score_xna['age_group'] == age_group, 'days_employed'].mean()
    print(f"{age_group}: {mean_days}")

adult: 44225.724093366334
young adult: 1282.6504205713256
elder: 290708.5548373909


Although young adults still show the highest number of days worked, the values for the adult group are much more realistic and consistent with expectations.

I’ve decided to use the mean values, as the number of days worked by adults appears to be more consistent and representative of reality.


In [37]:
def fill_nan_days_employed(df, column_to_fill):
    df[column_to_fill] = df[column_to_fill].fillna(
        df.groupby(['age_group', 'income_type'])[column_to_fill].transform('mean'))
    df[column_to_fill] = df[column_to_fill].fillna(df[column_to_fill].mean())
    
    return df

In [38]:
fill_nan_days_employed(credit_score,'days_employed')
credit_score.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
age_group           0
dtype: int64

No more missing values.


## Data Categorization

It’s now important to categorize the values in the `purpose` column, as many loan purposes are written differently but refer to the same concept.  
Additionally, I’ll create a ranking system for the different types of income, based on their relevance or stability.


In [39]:
print(credit_score['purpose'].value_counts())

purpose
wedding ceremony                            791
having a wedding                            768
to have a wedding                           765
real estate transactions                    675
buy commercial real estate                  661
housing transactions                        652
buying property for renting out             651
transactions with commercial real estate    650
purchase of the house                       646
housing                                     646
purchase of the house for my family         638
construction of own property                634
property                                    633
transactions with my real estate            627
building a real estate                      624
buy real estate                             621
purchase of my own house                    620
building a property                         618
housing renovation                          607
buy residential real estate                 606
buying my own car               

In [40]:
print(credit_score['purpose'].unique())

['purchase of the house' 'car purchase' 'supplementary education'
 'to have a wedding' 'housing transactions' 'education' 'having a wedding'
 'purchase of the house for my family' 'buy real estate'
 'buy commercial real estate' 'buy residential real estate'
 'construction of own property' 'property' 'building a property'
 'buying a second-hand car' 'buying my own car'
 'transactions with commercial real estate' 'building a real estate'
 'housing' 'transactions with my real estate' 'cars' 'to become educated'
 'second-hand car purchase' 'getting an education' 'car'
 'wedding ceremony' 'to get a supplementary education'
 'purchase of my own house' 'real estate transactions'
 'getting higher education' 'to own a car' 'purchase of a car'
 'profile education' 'university education'
 'buying property for renting out' 'to buy a car' 'housing renovation'
 'going to university']


Now let's write a function to categorize the loan purposes based on common topics or keywords.

In [41]:
def categorize_purpose(x):
    x = x.lower()

    if 'wedding' in x:
        return 1
    elif 'real estate' in x or 'property' in x or 'housing' in x or 'house' in x:
        return 2
    elif 'car' in x or 'cars' in x:
        return 3
    elif 'education' in x or 'university' in x or 'school' in x or 'educated' in x:
        return 4
    else:
        return None

credit_score['purpose_id'] = (credit_score['purpose'].apply(categorize_purpose).astype(int))
print(credit_score['purpose_id'].value_counts())

purpose_id
2    10809
3     4306
4     4012
1     2324
Name: count, dtype: int64


Now let’s categorize the different types of income

In [42]:
def income_total_type(income):
    if income <= 10000:
        return 'low income'
    if income <= 30000:
        return 'medium income'
    return 'high income'

credit_score['income_total_id'] = credit_score['total_income'].apply(income_total_type)
print(credit_score['income_total_id'].value_counts())

income_total_id
medium income    14419
high income       6106
low income         926
Name: count, dtype: int64


## Hypothesis Testing

### Children impact

First, let's check whether the `children` column has any impact on loan default.

In [43]:
def children_debt(row):
    children = row['children']
    debt = row['debt']
    
    if children == 0:
        if debt == 0:
            return 'no children and no debt'
        if debt == 1:
            return 'no children and with debt'
    if children >= 1:
        if debt == 0:
            return 'with children and no debt'
        if debt == 1:
            return 'with children and with debt'

credit_score['children_debt_status'] = credit_score.apply(children_debt, axis=1)
print(credit_score['children_debt_status'].value_counts())

children_debt_status
no children and no debt        13026
with children and no debt       6684
no children and with debt       1063
with children and with debt      678
Name: count, dtype: int64


Let's calculate the debt rate between families with and without children

In [44]:
counts = credit_score['children_debt_status'].value_counts()

no_children_debt = counts['no children and with debt']
no_children_total = counts['no children and with debt'] + counts['no children and no debt']
no_children_debt_rate = no_children_debt / no_children_total

with_children_debt = counts['with children and with debt']
with_children_total = counts['with children and with debt'] + counts['with children and no debt']

with_children_debt_rate = with_children_debt / with_children_total

print(f'Debt rate for people with no children: {no_children_debt_rate:.2%}')
print(f'Debt rate for people with children: {with_children_debt_rate:.2%}')

Debt rate for people with no children: 7.54%
Debt rate for people with children: 9.21%


**Conclusion**

After analysis, I found that the default rate is 9.2% among families with at least one child, compared to 7.5% among families without children.

Next, let's see if there is a correlation between family and debt status.

In [45]:
def familystat_debt(row):
    
    familystat = row['family_status']
    debt = row['debt']
    
    if 'married' in familystat:
        if debt == 0:
            return 'married and no debt'
        if debt == 1:
            return 'married and with debt'
    if 'widow / widower' in familystat:
        if debt == 0:
            return 'widowed and no debt'
        if debt == 1:
            return 'widowed and with debt'
    if 'divorced' in familystat:
        if debt == 0:
            return 'divorced and no debt'
        if debt == 1:
            return 'divorced and with debt'
    if 'unmarried' in familystat:
        if debt == 0:
            return 'single and no debt'
        if debt == 1:
            return 'single and with debt'

credit_score['family_debt_status'] = credit_score.apply(familystat_debt, axis=1)
print(credit_score['family_debt_status'].value_counts())

family_debt_status
married and no debt       17704
married and with debt      1593
divorced and no debt       1110
widowed and no debt         896
divorced and with debt       85
widowed and with debt        63
Name: count, dtype: int64


### Family status impact

Let's calculate the debt rate between different family status.

In [46]:
counts_family = credit_score['family_debt_status'].value_counts()

married_debt = counts_family['married and with debt']
married_total = counts_family['married and no debt'] + counts_family['married and with debt']
married_debt_rate = married_debt / married_total

divorced_debt = counts_family['divorced and with debt']
divorced_total = counts_family['divorced and no debt'] + counts_family['divorced and with debt']
divorced_debt_rate = divorced_debt / divorced_total

widowed_debt = counts_family['widowed and with debt']
widowed_total = counts_family['widowed and no debt'] + counts_family['widowed and with debt']
widowed_debt_rate = widowed_debt / widowed_total

print(f'Debt rate for maried people: {married_debt_rate:.2%}')
print(f'Debt rate for divorced: {divorced_debt_rate:.2%}')
print(f'Debt rate for widowed: {widowed_debt_rate:.2%}')

Debt rate for maried people: 8.26%
Debt rate for divorced: 7.11%
Debt rate for widowed: 6.57%


The default rates are as follows:

8.2% for married clients

7.1% for divorced clients

6.5% for widowed clients

These figures suggest that widowed clients are the least likely to default on their loans, followed by divorced clients. Married clients, although still relatively low-risk, show a slightly higher default rate compared to the others.

### Total income

Next comparison: total income.

In [47]:
def incomerank_debt(row):
    
    incomerank = row['income_total_id']
    debt = row['debt']

    if 'low income' in incomerank:
        if debt == 0:
            return 'low income and no debt'
        if debt == 1:
            return 'low income and with debt'
    if 'medium income' in incomerank:
        if debt == 0:
            return 'medium income and no debt'
        if debt == 1:
            return 'medium income and with debt'
    if 'high income' in incomerank:
        if debt == 0:
            return 'high income and no debt'
        if debt == 1:
            return 'high income and with debt'

credit_score['incomerank_debt'] = credit_score.apply(incomerank_debt, axis=1)
print(credit_score['incomerank_debt'].value_counts())

incomerank_debt
medium income and no debt      13180
high income and no debt         5662
medium income and with debt     1239
low income and no debt           868
high income and with debt        444
low income and with debt          58
Name: count, dtype: int64


In [48]:
counts_income = credit_score['incomerank_debt'].value_counts()

low_income_debt = counts_income['low income and with debt']
low_income_total = counts_income['low income and no debt'] + counts_income['low income and with debt']
low_income_debt_rate = low_income_debt / low_income_total

medium_income_debt = counts_income['medium income and with debt']
medium_income_total = counts_income['medium income and no debt'] + counts_income['medium income and with debt']
medium_income_debt_rate = medium_income_debt / medium_income_total

high_income_debt = counts_income['high income and with debt']
high_income_total = counts_income['high income and no debt'] + counts_income['high income and with debt']
high_income_debt_rate = high_income_debt / high_income_total

print(f'Debt rate for low income: {low_income_debt_rate:.2%}')
print(f'Debt rate for medium income: {medium_income_debt_rate:.2%}')
print(f'Debt rate for high income: {high_income_debt_rate:.2%}')

Debt rate for low income: 6.26%
Debt rate for medium income: 8.59%
Debt rate for high income: 7.27%


**Conclusion**

The debt default rates are as follows:

6.2% for low-income clients

8.5% for medium-income clients

7.2% for high-income clients

Although low-income clients could be further divided into more granular categories (which might reveal more nuanced insights into default risk), for the purposes of this analysis, we will maintain the current structure.

### Purpose

And finally, let's see the purpose rate.

In [49]:
def purpose_debt(row):
    purpose_id = row['purpose_id']
    debt = row['debt']

    if purpose_id == 1:
        return 'wedding and no debt' if debt == 0 else 'wedding and with debt'
    elif purpose_id == 2:
        return 'buying a house and no debt' if debt == 0 else 'buying a house and with debt'
    elif purpose_id == 3:
        return 'buying a car and no debt' if debt == 0 else 'buying a car and with debt'
    elif purpose_id == 4:
        return 'education and no debt' if debt == 0 else 'education and with debt'

credit_score['purpose_debt_status'] = credit_score.apply(purpose_debt, axis=1)
print(credit_score['purpose_debt_status'].value_counts())

purpose_debt_status
buying a house and no debt      10027
buying a car and no debt         3903
education and no debt            3642
wedding and no debt              2138
buying a house and with debt      782
buying a car and with debt        403
education and with debt           370
wedding and with debt             186
Name: count, dtype: int64


In [50]:
counts_purpose = credit_score['purpose_debt_status'].value_counts()

wedding_debt = counts_purpose['wedding and with debt']
wedding_total = counts_purpose['wedding and no debt'] + wedding_debt
wedding_debt_rate = wedding_debt / wedding_total

house_debt = counts_purpose['buying a house and with debt']
house_total = counts_purpose['buying a house and no debt'] + house_debt
house_debt_rate = house_debt / house_total

car_debt = counts_purpose['buying a car and with debt']
car_total = counts_purpose['buying a car and no debt'] + car_debt
car_debt_rate = car_debt / car_total

education_debt = counts_purpose['education and with debt']
education_total = counts_purpose['education and no debt'] + education_debt
education_debt_rate = education_debt / education_total

print(f"Debt rate for wedding purposes: {wedding_debt_rate:.2%}")
print(f"Debt rate for house purposes: {house_debt_rate:.2%}")
print(f"Debt rate for car purposes: {car_debt_rate:.2%}")
print(f"Debt rate for education purposes: {education_debt_rate:.2%}")

Debt rate for wedding purposes: 8.00%
Debt rate for house purposes: 7.23%
Debt rate for car purposes: 9.36%
Debt rate for education purposes: 9.22%


**Conclusion**

The debt rates are as follows:

8.0% for clients who use the money for a wedding

7.2% for those who use it to buy a house

9.3% for those who use it to buy a car

9.2% for those who use it to study


## Final Conclusion

In this project, my goal was to assess whether factors such as **number of children**, **marital status**, **income level**, and **loan purpose** had any significant correlation with **loan delinquency**. After performing thorough data cleaning, transformation, and exploratory analysis, several patterns emerged:

- **Children**: Families with at least one child showed a slightly higher delinquency rate (9.2%) compared to those without children (7.5%). This suggests a modest association between having children and financial risk, possibly due to increased household expenses.

- **Marital Status**: Married clients had a delinquency rate of 8.2%, while divorced and widowed clients showed lower rates (7.1% and 6.5%, respectively). This could reflect differences in financial responsibility or available household income after separation or loss of a partner.

- **Income Level**: Interestingly, clients with **medium income** had the highest delinquency rate (8.5%), compared to **low income** (6.2%) and **high income** (7.2%). While counterintuitive at first glance, this may point to medium-income clients being more exposed to credit or loans relative to their financial buffer.

- **Loan Purpose**: Clients who took loans for **education** (9.2%) and **car purchases** (9.3%) had the highest delinquency rates. Those who borrowed for **weddings** had a slightly lower rate (8.0%), and **real estate purchases** showed the lowest (7.2%). These findings suggest that borrowing for appreciating assets (like real estate) may be less risky than for depreciating assets or services.

While none of the analyzed features alone can definitively predict delinquency, together they offer valuable indicators for assessing risk. These insights can help improve the **credit scoring model** by integrating categorical and behavioral patterns into a more nuanced and robust risk assessment tool. 