## Hypothesis

Project Objective: To investigate the influence of a customer's marital status, number of children, income, and differences in loan purposes on the probability of loan default.

The analysis tests whether:
1. An individual fails to repay a loan due to additional expenditures for their spouse and children.
2. An individual fails to repay a loan because their income is insufficient.
3. Differences in loan purposes can be a factor in an individual's failure to repay the loan.

In [1]:
import pandas as pd

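try:
    # Raw string so the backslashes in the Windows path are not read as escape sequences
    df = pd.read_csv(r'D:\datasets\credit_scoring_eng.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/credit_scoring_eng.csv')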

In [2]:
df.shape

(21525, 12)

This data consists of 21,525 rows and 12 columns.

In [3]:
df.head(20)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


**Observations:**

1. The 'days_employed' column contains negative values, which is implausible because the column represents the number of days a customer has been employed.

2. In the 'education' column, there are entries written in all capital letters. This could affect the analysis as the program may treat them as different from entries in lowercase.

3. There are missing values in the 'days_employed' and 'total_income' columns.
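
These observations can also be checked programmatically rather than by eye. The following is a minimal sketch on a small made-up frame (`sample` and its values are illustrative assumptions, not rows taken from the dataset); the same calls should apply to `df`.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
sample = pd.DataFrame({
    'days_employed': [-8437.67, 340266.07, None, -152.78],
    'education': ["bachelor's degree", 'Secondary Education',
                  'SECONDARY EDUCATION', 'secondary education'],
    'total_income': [40620.10, 25378.57, None, 21731.83],
})

# 1. How many employment durations are negative?
negative_days = int((sample['days_employed'] < 0).sum())

# 2. How many distinct spellings exist before and after lowercasing?
raw_levels = sample['education'].nunique()
normalized_levels = sample['education'].str.lower().nunique()

# 3. Missing values per column
missing_per_column = sample.isna().sum()

print(negative_days, raw_levels, normalized_levels)
print(missing_per_column)
```

A drop in the number of distinct values after lowercasing indicates that some categories differ only by letter case.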

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**Observations:**

1. There is missing data in the 'days_employed' and 'total_income' columns.
2. This can potentially impact the analysis process.
3. This could also be due to individuals who are already retired but are applying for loans.
4. It may be necessary to change the data type of the 'days_employed' column to int64 because the number of days employed should not have decimal values.
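
One caveat with the integer conversion: a plain `int64` column cannot hold missing values, so the conversion has to wait until the gaps are filled, or use pandas' nullable `Int64` dtype. A minimal sketch on made-up values (the series `days` is an assumption for illustration):

```python
import pandas as pd

# Toy series with one missing entry; the values are illustrative only.
days = pd.Series([8437.67, None, 152.78])

# Round first, then cast to the nullable integer dtype.
# Missing entries stay missing (shown as <NA>) instead of raising an error.
days_int = days.round().astype('Int64')

print(days_int)
```

Whether rounding or truncating is the better choice for fractional days is a judgment call that the analysis would need to make.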

In [5]:
df[df['days_employed'].isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In every row shown, a missing value in `days_employed` is accompanied by a missing value in `total_income`. This could indicate customers who are retired, or it could be a data-entry error.

Both columns have 2,174 missing values (21,525 rows minus 19,351 non-null entries), which matches the `df.info()` output above.

In [6]:
df.loc[(df['days_employed'].isna()) & (df['total_income'].isna())].shape

(2174, 12)

The filter returns 2,174 rows, the same as the number of missing values in each column, so every row that lacks `days_employed` also lacks `total_income`.
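
The same check can be written as a direct comparison of the two missing-value masks. This is a minimal sketch on a made-up frame (`sample` is an assumption for illustration):

```python
import pandas as pd

# Toy frame; the values are illustrative only.
sample = pd.DataFrame({
    'days_employed': [100.0, None, 250.0, None],
    'total_income': [2000.0, None, 3100.0, None],
})

missing_days = sample['days_employed'].isna()
missing_income = sample['total_income'].isna()

# True only if every row is missing both values or neither
same_rows = bool((missing_days == missing_income).all())

# Rows missing exactly one of the two values (zero if the masks coincide)
only_one_missing = int((missing_days ^ missing_income).sum())

print(same_rows, only_one_missing)
```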

**Preliminary Conclusion**

The total number of rows in the data is 21,525, and there are 2,174 rows with missing values. Some columns describe the borrower's conditions, such as 'children,' 'family_status,' and 'total_income.' However, there are missing values in the 'total_income' column, which is crucial for analysis.

The missing data could be related to a customer's occupation and age. It is possible that some individuals, particularly retirees, no longer need to provide values for 'days_employed' and 'total_income'. If so, the missing values would depend on 'dob_years' and 'income_type'.

In [7]:
percentage_of_days_employed = df['days_employed'].isna().sum() * 100 / len(df)
percentage_of_total_income = df['total_income'].isna().sum() * 100 / len(df)
print(f'Percentage of missing values in the days_employed column: {percentage_of_days_employed}')
print(f'Percentage of missing values in the total_income column: {percentage_of_total_income}')

Percentage of missing values in the days_employed column: 10.099883855981417
Percentage of missing values in the total_income column: 10.099883855981417


The missing data percentage is 10%. While this percentage isn't exceptionally large, it will still have an impact on future analysis, given that the missing data is in the 'total_income' column. Since one of the factors we consider when assessing an individual's ability to repay a loan is their income, addressing these missing values is essential. They can be filled with the mean or median value.
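
To make the mean-versus-median choice concrete, here is a minimal sketch on made-up incomes (the frame `sample` is an assumption for illustration). It also shows that `isna().mean()` gives the share of missing values directly.

```python
import pandas as pd

# Toy incomes with one missing value and one very large value.
sample = pd.DataFrame({
    'total_income': [20000.0, 22000.0, None, 25000.0, 200000.0],
})

# Share of missing values, as a percentage
missing_pct = sample['total_income'].isna().mean() * 100

# The mean is pulled up by the single large income; the median is not
mean_income = sample['total_income'].mean()
median_income = sample['total_income'].median()

print(f'{missing_pct:.1f}% missing, mean={mean_income:.0f}, median={median_income:.0f}')
```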

The next step is to fill in the missing values, but before doing that, data cleaning is necessary, specifically in columns like 'children,' 'dob_years,' and 'education.'

In [8]:
df_null = df[(df['days_employed'].isnull()) & (df['total_income'].isnull())]
df_null

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [9]:
df_null[df_null['income_type']=='civil servant']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
242,0,,58,secondary education,1,married,0,F,civil servant,0,,purchase of my own house
389,1,,31,SECONDARY EDUCATION,1,divorced,3,M,civil servant,0,,supplementary education
...,...,...,...,...,...,...,...,...,...,...,...,...
20469,0,,57,secondary education,1,widow / widower,2,M,civil servant,1,,to own a car
20479,1,,26,bachelor's degree,0,married,0,F,civil servant,0,,education
20914,0,,32,bachelor's degree,0,married,0,F,civil servant,0,,car purchase
21242,1,,33,bachelor's degree,0,unmarried,4,F,civil servant,0,,construction of own property


In [10]:
percentage = df_null['income_type'].value_counts(normalize=True).reset_index().rename(columns={"income_type":"percentage"})
percentage['percentage'] = percentage['percentage'].apply("{:,.2%}".format)
count = df_null['income_type'].value_counts().reset_index().rename(columns={"income_type":"counts"})
distribution = pd.concat([percentage,count[['counts']]], axis=1)
distribution

Unnamed: 0,index,percentage,counts
0,employee,50.83%,1105
1,business,23.37%,508
2,retiree,19.00%,413
3,civil servant,6.76%,147
4,entrepreneur,0.05%,1


The income type with the most missing values is 'employee' (50.83% of the affected rows), followed by 'business' and 'retiree'. Retirement could explain part of the gap, since retirees may no longer report 'total_income', but retirees account for only 19.00% of these rows.

**Possible Causes of Missing Data**

The missing data follows a specific pattern: whenever 'days_employed' is missing, 'total_income' is missing in the same row. This may be because some individuals have retired, are currently unemployed, or have recently started a job and have not yet received income. It is also possible that some of them simply chose not to disclose their income and length of employment.

In [11]:
df['income_type'].value_counts(normalize=True)

employee                       0.516562
business                       0.236237
retiree                        0.179141
civil servant                  0.067782
unemployed                     0.000093
entrepreneur                   0.000093
student                        0.000046
paternity / maternity leave    0.000046
Name: income_type, dtype: float64

**Preliminary Conclusion**

The income-type distribution of the rows with missing values closely matches that of the full dataset (for example, 'employee' makes up 50.83% of the missing rows and 51.66% of all rows; 'retiree' 19.00% versus 17.91%). This suggests that the values are missing more or less at random with respect to income type, which contradicts the initial assumption that missing 'total_income' values are tied to the customer's occupation. The gaps may instead stem from data-entry errors or other unknown issues.
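
The two distributions can also be placed side by side instead of being compared across separate outputs. A minimal sketch on a made-up frame (`sample` and the column names `overall`, `missing_rows`, and `difference` are assumptions for illustration):

```python
import pandas as pd

# Toy frame; the values are illustrative only.
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'business', 'retiree',
                    'employee', 'business', 'retiree', 'employee'],
    'total_income': [100.0, None, 200.0, None, 150.0, None, 90.0, None],
})

# Share of each income type in the full data
overall = sample['income_type'].value_counts(normalize=True)

# Share of each income type among rows with missing income
missing_only = (sample.loc[sample['total_income'].isna(), 'income_type']
                .value_counts(normalize=True))

# Align both on the income type and compute the gap
comparison = pd.DataFrame({'overall': overall, 'missing_rows': missing_only})
comparison['difference'] = comparison['missing_rows'] - comparison['overall']

print(comparison)
```

Differences close to zero for every income type are consistent with values missing at random with respect to that column; they do not rule out a dependence on other columns.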

In [12]:
df_null_pivot = df_null.pivot_table(index='dob_years', columns='income_type', values='debt', aggfunc='count')
df_null_pivot

income_type,business,civil servant,employee,entrepreneur,retiree
dob_years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,2.0,,5.0,,3.0
19,,,1.0,,
20,1.0,,4.0,,
21,7.0,1.0,10.0,,
22,6.0,,11.0,,
23,5.0,1.0,30.0,,
24,9.0,1.0,10.0,,1.0
25,4.0,4.0,15.0,,
26,9.0,2.0,24.0,,
27,6.0,3.0,27.0,,


**Preliminary Conclusion**

The missing values seem to be quite random, and there is no distinct pattern that could explain why this data is missing. As indicated in the table above, the missing data can be observed across all types of occupations and age groups.

In [13]:
df_null_pivot2 = df_null.pivot_table(index='income_type', columns='gender', values='debt', aggfunc='count')
df_null_pivot2

gender,F,M
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1
business,327.0,181.0
civil servant,112.0,35.0
employee,693.0,412.0
entrepreneur,,1.0
retiree,352.0,61.0


**Conclusion**

- No clear patterns are evident.
- Cross-tabulating 'gender' and 'income_type' for the rows with missing values shows that most of these customers are female, and that the most common income types are 'employee,' 'business,' and 'retiree.'
- The missing values will be filled with appropriate values or possibly removed.
- Short-term plan: Examine missing values in the filtered data, remove duplicate values, check for unique values, correct similar but distinct values, and amend values that are likely input errors.

# Transformation

In [14]:
df['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [15]:
df['education'] = df['education'].str.lower()

In [16]:
df['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

In [17]:
df['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

The data contains unusual values such as -1 and 20. The number of children cannot be -1, and having 20 children is extremely rare. These values look like input errors: -1 probably stands for 1 child, and 20 for 2 children.

These values can be dropped or replaced with the appropriate values. However, in this case, I will choose to replace them with the corrected values.

In [18]:
old_values = [-1, 20]
new_values = [1, 2]

df['children'] = df['children'].replace(old_values, new_values)

In [19]:
df['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

In [20]:
df['days_employed'].unique()

array([-8437.67302776, -4024.80375385, -5623.42261023, ...,
       -2113.3468877 , -3112.4817052 , -1984.50758853])

In [21]:
percentage_days_employed = len(df[df['days_employed']<0])/len(df)
print ("{:.2%}".format(percentage_days_employed))

73.90%


In [22]:
df['days_employed'] = df['days_employed'].abs()

In [23]:
df['days_employed'].unique()

array([8437.67302776, 4024.80375385, 5623.42261023, ..., 2113.3468877 ,
       3112.4817052 , 1984.50758853])

In [24]:
df['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51,  0, 59, 29, 60, 55, 58, 71, 22, 73,
       66, 69, 19, 72, 70, 74, 75], dtype=int64)

In [25]:
df['dob_years'].value_counts(ascending=True)

75      1
74      6
73      8
19     14
72     33
20     51
71     58
70     65
69     85
68     99
0     101
21    111
67    167
66    183
22    183
65    194
23    254
24    264
64    265
63    269
62    352
61    355
25    357
60    377
26    408
55    443
59    444
51    448
53    459
57    460
58    461
46    475
54    479
47    480
52    484
56    487
27    493
45    497
28    503
49    508
32    510
43    513
50    514
37    537
48    538
30    540
29    545
44    547
36    555
31    560
39    573
33    581
42    597
38    598
34    603
41    607
40    609
35    617
Name: dob_years, dtype: int64

In [26]:
percentage_of_zero = len(df[df['dob_years'] == 0]) / len(df)
print("{:.2%}".format(percentage_of_zero))

0.47%


This column contains the value 0, which is not a valid customer age. There are 101 such rows. Since they make up only 0.47% of the data, they can be removed without significantly affecting the analysis.

In [27]:
df = df.drop(df[df['dob_years']==0].index)

In [28]:
print(df[df['dob_years']==0])

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []


In [29]:
df['family_status'].value_counts()

married              12331
civil partnership     4156
unmarried             2797
divorced              1185
widow / widower        955
Name: family_status, dtype: int64

There doesn't appear to be anything unusual with the 'family_status' column.

In [30]:
df['gender'].unique()

array(['F', 'M', 'XNA'], dtype=object)

There is an 'XNA' value in the 'gender' column. This may be due to incorrect information being provided during data entry or to a system error.

In [31]:
df[df['gender']=='XNA']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,2358.600502,24,some college,2,civil partnership,1,XNA,business,0,32624.825,buy real estate


Only one row has the value 'XNA' in the 'gender' column. Its meaning is unclear and it probably represents missing or incorrect data, so the row can be safely removed.

In [32]:
df = df.drop(df[df['gender']=='XNA'].index)

In [33]:
df['gender'].unique()

array(['F', 'M'], dtype=object)

The gender type 'XNA' has been successfully removed from the dataset.

In [34]:
df['income_type'].unique()

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

There don't appear to be any values that need to be addressed in this column. The values in this column appear to be correct.

In [35]:
df[df.duplicated()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,,41,secondary education,1,married,0,F,employee,0,,purchase of the house for my family
3290,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
4182,1,,34,bachelor's degree,0,civil partnership,1,F,employee,0,,wedding ceremony
4851,0,,60,secondary education,1,civil partnership,1,F,retiree,0,,wedding ceremony
5557,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0,,64,secondary education,1,married,0,F,retiree,0,,supplementary education
21032,0,,60,secondary education,1,married,0,F,retiree,0,,to become educated
21132,0,,47,secondary education,1,married,0,F,employee,0,,housing renovation
21281,1,,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


In [36]:
df.duplicated().sum()

71

In [37]:
new_df = df.drop_duplicates().reset_index(drop=True)

In [38]:
new_df.duplicated().sum()

0

In [39]:
new_df.shape

(21352, 12)

In [40]:
new_df.tail(20)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
21332,0,338904.866406,53,secondary education,1,civil partnership,1,M,retiree,0,12070.399,to have a wedding
21333,1,1556.249906,33,bachelor's degree,0,civil partnership,1,F,employee,0,23286.719,wedding ceremony
21334,1,79.832064,32,secondary education,1,civil partnership,1,F,civil servant,0,15708.845,second-hand car purchase
21335,0,386497.714078,62,secondary education,1,married,0,M,retiree,0,11622.175,property
21336,0,362161.054124,59,bachelor's degree,0,married,0,M,retiree,0,11684.65,real estate transactions
21337,2,,28,secondary education,1,married,0,F,employee,0,,car purchase
21338,0,612.569129,29,bachelor's degree,0,civil partnership,1,F,employee,1,22410.956,buying property for renting out
21339,0,165.377752,26,bachelor's degree,0,unmarried,4,M,business,0,23568.233,to get a supplementary education
21340,0,1166.216789,35,secondary education,1,married,0,F,employee,0,40157.783,purchase of the house
21341,0,280.469996,27,some college,2,unmarried,4,M,business,0,56958.145,building a property


So far, 173 rows have been removed: 101 with an age of 0, 1 with gender 'XNA', and 71 duplicates. In addition, values in the 'education,' 'children,' and 'days_employed' columns were corrected. The removed rows amount to about **0.8%** of the original data, far less than the roughly **10%** of rows with missing values.

# Working with Missing Values

In [41]:
education = new_df[['education', 'education_id']]
education = education.drop_duplicates()
education

Unnamed: 0,education,education_id
0,bachelor's degree,0
1,secondary education,1
13,some college,2
31,primary education,3
2946,graduate degree,4


In [42]:
education_dict = dict(zip(education['education'], education['education_id']))
education_dict

{"bachelor's degree": 0,
 'secondary education': 1,
 'some college': 2,
 'primary education': 3,
 'graduate degree': 4}

In [43]:
family = new_df[['family_status', 'family_status_id']]
family = family.drop_duplicates()
family

Unnamed: 0,family_status,family_status_id
0,married,0
4,civil partnership,1
18,widow / widower,2
19,divorced,3
24,unmarried,4


In [44]:
family_dict = dict(zip(family['family_status'], family['family_status_id']))
family_dict

{'married': 0,
 'civil partnership': 1,
 'widow / widower': 2,
 'divorced': 3,
 'unmarried': 4}

Because the dataset is fairly large and filtering on text columns such as 'purpose' can be slow, it is more efficient to filter on the numeric 'id' columns instead.

Two categories have an 'id' column, 'family_status' and 'education,' so we create a lookup dictionary for each.
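
As a minimal sketch of how such a dictionary can be used, the example below builds a lookup on a made-up frame and filters by the integer ID (`sample`, `education_lookup`, and `target_id` are hypothetical names for illustration):

```python
import pandas as pd

# Toy frame; the values are illustrative only.
sample = pd.DataFrame({
    'education': ["bachelor's degree", 'secondary education',
                  'some college', 'secondary education'],
    'education_id': [0, 1, 2, 1],
    'debt': [0, 1, 0, 0],
})

# Build the name -> id lookup from the unique pairs
pairs = sample[['education', 'education_id']].drop_duplicates()
education_lookup = dict(zip(pairs['education'], pairs['education_id']))

# Filter on the integer id instead of comparing strings
target_id = education_lookup['secondary education']
secondary_rows = sample[sample['education_id'] == target_id]

print(education_lookup)
print(len(secondary_rows))
```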

## Missing Values in total_income

Two columns need to be addressed: 'total_income' and 'days_employed.' Missing values in 'total_income' will be filled with the median, and missing values in 'days_employed' with the mean.

In [45]:
new_df['dob_years'].sort_values().unique()

array([19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
       53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
       70, 71, 72, 73, 74, 75], dtype=int64)

In [46]:
def age_category(age):
    if age < 30:
        return '19-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    elif age <= 75:
        return '70-75'

In [47]:
age_category(35)

'30-39'
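
As an alternative sketch (not the approach used in this notebook), the same age groups can be produced with `pd.cut`, assuming ages are whole numbers between 19 and 75. The names `ages`, `bins`, and `labels` are illustrative.

```python
import pandas as pd

# Toy ages covering the group boundaries.
ages = pd.Series([19, 29, 30, 42, 59, 60, 75])

# Bin edges are right-inclusive: (18, 29], (29, 39], ..., (69, 75]
bins = [18, 29, 39, 49, 59, 69, 75]
labels = ['19-29', '30-39', '40-49', '50-59', '60-69', '70-75']

age_groups = pd.cut(ages, bins=bins, labels=labels)

print(age_groups.tolist())
```

Note that `pd.cut` returns a categorical column and yields a missing value for any age outside the bin edges.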

In [48]:
new_df['age_category'] = new_df['dob_years'].apply(age_category)

In [49]:
new_df[['age_category']]

Unnamed: 0,age_category
0,40-49
1,30-39
2,30-39
3,30-39
4,50-59
...,...
21347,40-49
21348,60-69
21349,30-39
21350,30-39


Typically, income depends on factors such as age, education, family status, and occupation.

- **Age:** Generally, the longer someone has been working, the higher their income is due to their accumulated experience.
- **Education:** Higher levels of education often lead to higher income or career advancement.
- **Family:** Having a family can lead to additional benefits and allowances, depending on the number of family members.
- **Occupation:** Different jobs come with different income levels.

In [50]:
df_without_null = new_df[new_df['days_employed'].notnull()]
df_without_null.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,19-29
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,30-39
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49


In [51]:
test_age_mean = df_without_null.pivot_table(index='age_category', values='total_income', aggfunc='mean')
test_age_mean.sort_values(by='total_income', ascending=False)

Unnamed: 0_level_0,total_income
age_category,Unnamed: 1_level_1
40-49,28551.375635
30-39,28312.479963
50-59,25811.700327
19-29,25531.501098
60-69,23242.812818
70-75,20125.658331


In [52]:
test_age_median = df_without_null.pivot_table(index='age_category', values='total_income', aggfunc='median')
test_age_median.sort_values(by='total_income', ascending=False)

Unnamed: 0_level_0,total_income
age_category,Unnamed: 1_level_1
40-49,24764.229
30-39,24667.528
19-29,22735.911
50-59,22203.0745
60-69,19817.44
70-75,18751.324


In [53]:
test_education_mean = df_without_null.pivot_table(index='age_category', columns='education', values='total_income', aggfunc='mean')
test_education_mean

education,bachelor's degree,graduate degree,primary education,secondary education,some college
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
19-29,29395.106109,,27695.27152,23379.052855,25265.724779
30-39,34225.243752,18187.3015,21979.584515,25666.783012,31991.246531
40-49,35687.086166,31771.321,23618.267667,26193.975154,33703.486528
50-59,33622.674858,42945.794,17797.322623,24057.27166,27623.492353
60-69,30329.905325,28334.215,18710.592883,21691.515498,30476.594957
70-75,26173.068696,,18892.886,19245.043953,13917.989667


In [54]:
test_education_median = df_without_null.pivot_table(index='age_category', columns='education', values='total_income', aggfunc='median')
test_education_median

education,bachelor's degree,graduate degree,primary education,secondary education,some college
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
19-29,25956.164,,25488.916,21114.762,22655.467
30-39,28794.931,18187.3015,19542.3265,22912.993,28463.439
40-49,30282.333,31771.321,21511.5635,22973.258,29323.673
50-59,28152.1765,42945.794,16922.625,21245.482,21132.309
60-69,25222.3445,28334.215,17657.4995,18873.764,28178.917
70-75,25497.392,,15013.505,18508.577,14479.193


In [55]:
test_family_mean = df_without_null.pivot_table(index='age_category', columns='family_status', values='total_income', aggfunc='mean')
test_family_mean

family_status,civil partnership,divorced,married,unmarried,widow / widower
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
19-29,25567.116944,25073.832144,25536.319673,25553.297719,22915.45
30-39,27714.148395,29115.091847,28573.954796,27739.154427,27781.924289
40-49,28956.763892,28275.677651,28440.743835,29572.001536,24929.826568
50-59,25009.327522,25679.409055,26298.481978,27466.250638,22177.332934
60-69,23044.254153,25911.107331,22982.894501,23566.604574,23004.88178
70-75,20728.343107,22049.8776,20000.835831,21892.799385,19048.42127


In [56]:
test_family_median = df_without_null.pivot_table(index='age_category', columns='family_status',values='total_income', aggfunc='median')
test_family_median

family_status,civil partnership,divorced,married,unmarried,widow / widower
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
19-29,22468.182,22276.462,22799.258,22895.1195,22915.45
30-39,23975.64,24960.638,25027.453,23752.956,26565.3295
40-49,24966.693,25138.149,24668.613,25788.161,22723.861
50-59,21628.212,22212.501,22596.0325,22775.122,20389.921
60-69,19905.649,21057.2675,19877.788,19676.369,19301.928
70-75,19934.7915,22101.506,19163.383,16860.329,17971.729


In [57]:
test_job_mean = df_without_null.pivot_table(index='age_category', columns='income_type', values='total_income', aggfunc='mean')
test_job_mean

income_type,business,civil servant,employee,entrepreneur,paternity / maternity leave,retiree,student,unemployed
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
19-29,28685.453876,25145.840248,24099.859203,79866.103,,14888.651857,15712.26,
30-39,33145.949183,27921.836553,26191.716908,,8612.661,23122.709862,,9593.119
40-49,33989.836505,28568.272491,26193.926281,,,27020.126339,,32435.602
50-59,32385.032725,25838.10573,26073.759931,,,22221.765833,,
60-69,32494.91835,29305.166039,27307.60661,,,21544.426743,,
70-75,27766.3072,32189.795667,26672.382429,,,18994.044264,,


In [58]:
test_job_median = df_without_null.pivot_table(index='age_category', columns='income_type',values='total_income', aggfunc='median')
test_job_median

income_type,business,civil servant,employee,entrepreneur,paternity / maternity leave,retiree,student,unemployed
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
19-29,25542.585,23504.5105,21537.479,79866.103,,12807.071,15712.26,
30-39,28765.5935,24662.744,23218.803,,8612.661,18735.716,,9593.119
40-49,28698.4665,24890.759,23108.15,,,22498.708,,32435.602
50-59,27299.6345,23136.302,22547.831,,,19526.812,,
60-69,29171.989,23390.057,23316.965,,,18446.1435,,
70-75,28138.895,24525.224,24660.901,,,17650.466,,


The most significant determinants are occupation and the worker's age. When someone has a good job combined with their seniority or experience, it becomes the primary factor affecting income. While other factors do have an impact, they are not as influential as age and occupation. 

For example, in the case of 'family_status,' someone might earn a higher income when they are married and have children due to additional allowances, but another individual might earn the same or even more despite being younger and unmarried because of their different occupation.

I will use the median to fill in the missing values: income contains significant outliers, and the median is less sensitive to them than the mean.
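
A compact way to express a group-wise median fill is `groupby(...).transform('median')`. The sketch below is an alternative on made-up data, not the method used in the following cells; `sample` and `group_median` are illustrative names.

```python
import pandas as pd

# Toy frame with one missing income per group.
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee',
                    'retiree', 'retiree', 'retiree'],
    'age_category': ['30-39', '30-39', '30-39',
                     '60-69', '60-69', '60-69'],
    'total_income': [20000.0, 24000.0, None, 15000.0, None, 19000.0],
})

# Median of each (income_type, age_category) group, aligned to every row
group_median = (sample
                .groupby(['income_type', 'age_category'])['total_income']
                .transform('median'))

# Only the missing entries are replaced
sample['total_income'] = sample['total_income'].fillna(group_median)

print(sample['total_income'].tolist())
```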

In [59]:
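def get_median(income_type, age_category):
    """Return the median income for an income type and age category."""
    try:
        return test_job_median[income_type][age_category]
    except KeyError:
        # Unknown combination: return NaN so the column stays numeric
        return float('nan')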

In [60]:
get_median('business', '19-29')

25542.585

In [61]:
new_df['median_income'] = new_df.apply(lambda x: get_median(x['income_type'], x['age_category']), axis=1)

In [62]:
new_df.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,median_income
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49,23108.15
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39,23218.803
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39,23218.803
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39,23218.803
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59,19526.812
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,19-29,25542.585
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49,28698.4665
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59,22547.831
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,30-39,23218.803
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49,23108.15


In [63]:
new_df['total_income'] = new_df['total_income'].fillna(new_df['median_income'])

In [64]:
new_df.isnull().sum()

children               0
days_employed       2093
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           1
purpose                0
age_category           0
median_income          1
dtype: int64

One row is still missing both 'total_income' and 'median_income'. `fillna()` could not fill it because the lookup returned no median for that row's combination of income type and age category. I will fill this row (index 5907) manually, using the average income for the 50-59 age range.

In [65]:
new_df.at[5907, 'total_income'] = 25811.700327
new_df.at[5907, 'median_income'] = 25811.700327

In [66]:
print(new_df.loc[5907, 'total_income'])
print(new_df.loc[5907, 'median_income'])

25811.700327
25811.700327
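
The manual assignment above can also be expressed as a fallback chain. The sketch below uses hypothetical toy values and falls back to the age-category median, which is an assumption for illustration; the notebook itself uses the average income of the age group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the only entrepreneur has no income recorded,
# so that group has no median of its own.
toy = pd.DataFrame({
    'income_type':  ['employee', 'employee', 'entrepreneur'],
    'age_category': ['50-59', '50-59', '50-59'],
    'total_income': [20000.0, 24000.0, np.nan],
})

# First choice: median of the (income_type, age_category) group.
by_group = toy.groupby(['income_type', 'age_category'])['total_income'].transform('median')

# Fallback: median of the age category alone.
by_age = toy.groupby('age_category')['total_income'].transform('median')

# Chain the fills: the fallback is used only where the first choice is missing.
filled = toy['total_income'].fillna(by_group).fillna(by_age)
print(filled)
```

This avoids hard-coding a row index and a value, at the cost of one extra grouping step.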


In [67]:
new_df.isna().sum()

children               0
days_employed       2093
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
age_category           0
median_income          0
dtype: int64

In [68]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21352 entries, 0 to 21351
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21352 non-null  int64  
 1   days_employed     19259 non-null  float64
 2   dob_years         21352 non-null  int64  
 3   education         21352 non-null  object 
 4   education_id      21352 non-null  int64  
 5   family_status     21352 non-null  object 
 6   family_status_id  21352 non-null  int64  
 7   gender            21352 non-null  object 
 8   income_type       21352 non-null  object 
 9   debt              21352 non-null  int64  
 10  total_income      21352 non-null  float64
 11  purpose           21352 non-null  object 
 12  age_category      21352 non-null  object 
 13  median_income     21352 non-null  float64
dtypes: float64(3), int64(5), object(6)
memory usage: 2.3+ MB


## Missing Values in days_employed

In [69]:
days_employed_median = new_df.pivot_table(index='age_category', columns='income_type', values='days_employed', aggfunc='median')
days_employed_median

income_type,business,civil servant,employee,entrepreneur,paternity / maternity leave,retiree,student,unemployed
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
19-29,906.647054,1362.645769,1008.784193,520.848083,,364348.197352,578.751554,
30-39,1526.857305,2591.886944,1527.637665,,3296.759962,365336.560325,,337524.466835
40-49,1896.569279,3551.609375,1904.197739,,,367477.402016,,395302.838654
50-59,2003.075639,3822.891349,2235.163052,,,364343.08814,,
60-69,2358.275817,3339.663548,2669.706829,,,365484.09062,,
70-75,3095.344969,1678.969771,1504.924191,,,366503.463685,,


In [70]:
days_employed_mean = new_df.pivot_table(index='age_category', columns='income_type', values='days_employed', aggfunc='mean')
days_employed_mean

income_type,business,civil servant,employee,entrepreneur,paternity / maternity leave,retiree,student,unemployed
age_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
19-29,1128.501894,1595.684495,1202.858627,520.848083,,360706.968329,578.751554,
30-39,1854.594279,2764.437287,2010.804449,,3296.759962,365461.247788,,337524.466835
40-49,2477.016422,4032.418745,2673.300396,,,366963.446993,,395302.838654
50-59,2795.567404,4825.214157,3254.446979,,,364611.037392,,
60-69,3477.081,4433.159823,3916.746275,,,365158.698699,,
70-75,5292.598849,2811.783573,3310.691322,,,366046.629499,,


For the 'days_employed' column, I will use the mean because the outliers in the data are not too significant. The missing values are filled with the mean for each combination of age category and income type.
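
One quick way to judge how strongly outliers affect a group is to compare its mean with its median. The sketch below does this on hypothetical toy values; the `mean_to_median` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Toy data (hypothetical): the 'employee' group contains one very large value.
toy = pd.DataFrame({
    'income_type':   ['employee'] * 4 + ['business'] * 4,
    'days_employed': [1000.0, 1200.0, 1100.0, 9000.0,
                      900.0, 1000.0, 1100.0, 1000.0],
})

summary = toy.groupby('income_type')['days_employed'].agg(['mean', 'median'])

# A ratio close to 1 means the mean and median agree;
# a ratio well above 1 means large values are pulling the mean upward.
summary['mean_to_median'] = summary['mean'] / summary['median']
print(summary)
```

The same ratio could be computed cell by cell from the two pivot tables above, where for example the 'business' 40-49 group has a mean of about 2477 days against a median of about 1897 days.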

In [71]:
def get_mean_days(income_type, age_category):
    return days_employed_mean[income_type][age_category]

In [72]:
new_df['mean_days'] = new_df.apply(lambda x: get_mean_days(x['income_type'], x['age_category']), axis=1)

In [73]:
get_mean_days('business', '40-49')

2477.0164217983383

In [74]:
new_df.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,median_income,mean_days
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49,23108.15,2673.300396
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39,23218.803,2010.804449
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39,23218.803,2010.804449
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39,23218.803,2010.804449
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59,19526.812,364611.037392
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,19-29,25542.585,1128.501894
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49,28698.4665,2477.016422
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59,22547.831,3254.446979
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,30-39,23218.803,2010.804449
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49,23108.15,2673.300396


In [75]:
new_df['days_employed'] = new_df['days_employed'].fillna(new_df['mean_days'])

In [76]:
new_df.isnull().sum()

children            0
days_employed       1
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
age_category        0
median_income       0
mean_days           1
dtype: int64

In [77]:
new_df[new_df['days_employed'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,median_income,mean_days
5907,0,,58,bachelor's degree,0,married,0,M,entrepreneur,0,25811.700327,buy residential real estate,50-59,25811.700327,


The same issue has occurred here: row 5907 (an entrepreneur aged 50-59) has no group mean to draw from. I will fill it manually with the mean for its age category.

In [78]:
age_mean = new_df.pivot_table(index='age_category', values='days_employed', aggfunc='mean')
age_mean

Unnamed: 0_level_0,days_employed
age_category,Unnamed: 1_level_1
19-29,2114.390998
30-39,4268.104511
40-49,12466.515797
50-59,133488.932188
60-69,282851.863348
70-75,318946.710089


In [79]:
new_df.at[5907, 'days_employed'] = 133488.932188
new_df.at[5907, 'mean_days'] = 133488.932188

In [80]:
new_df.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
age_category        0
median_income       0
mean_days           0
dtype: int64

There are no more missing values in this dataset.

## Data Categorization

In [81]:
new_df[['purpose']]

Unnamed: 0,purpose
0,purchase of the house
1,car purchase
2,purchase of the house
3,supplementary education
4,to have a wedding
...,...
21347,housing transactions
21348,purchase of a car
21349,property
21350,buying my own car


In [82]:
new_df['purpose'].sort_values().unique()

array(['building a property', 'building a real estate',
       'buy commercial real estate', 'buy real estate',
       'buy residential real estate', 'buying a second-hand car',
       'buying my own car', 'buying property for renting out', 'car',
       'car purchase', 'cars', 'construction of own property',
       'education', 'getting an education', 'getting higher education',
       'going to university', 'having a wedding', 'housing',
       'housing renovation', 'housing transactions', 'profile education',
       'property', 'purchase of a car', 'purchase of my own house',
       'purchase of the house', 'purchase of the house for my family',
       'real estate transactions', 'second-hand car purchase',
       'supplementary education', 'to become educated', 'to buy a car',
       'to get a supplementary education', 'to have a wedding',
       'to own a car', 'transactions with commercial real estate',
       'transactions with my real estate', 'university education',
       'we

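From the unique values above, I can derive six general topics: property, real estate, car, education, wedding, and house. Because property, real estate, and house all describe real-estate loans, the function below merges them into a single 'property' category, which leaves four categories in total.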

In [95]:
# sample_dict = {
#     'building a property':'property',
#     'buying property for renting out':'property',
#     'construction of own property':'property',
#     'property':'property',
#     'building a real estate':'real estate',
#     'buy commercial real estate':'real estate',
#     'buy real estate':'real estate',
#     'buy residential real estate':'real estate',
#     'real estate transactions':'real estate',
#     'transactions with commercial real estate':'real estate',
#     'transactions with my real estate':'real estate',
#     'buying a second-hand car':'car',
#     'buying my own car':'car',
#     'car':'car',
#     'car purchase':'car',
#     'cars':'car',
#     'purchase of a car':'car',
#     'second-hand car purchase':'car',
#     'to buy a car':'car',
#     'to own a car':'car',
#     'education':'education',
#     'getting an education':'education',
#     'getting higher education':'education',
#     'profile education':'education',
#     'supplementary education':'education',
#     'to get a supplementary education':'education',
#     'university education':'education',
#     'going to university':'education',
#     'to become educated':'education',
#     'having a wedding':'wedding',
#     'to have a wedding':'wedding',
#     'wedding ceremony':'wedding',
#     'housing':'house',
#     'housing renovation':'house',
#     'housing transactions':'house',
#     'purchase of my own house':'house',
#     'purchase of the house':'house',
#     'purchase of the house for my family':'house'
# }
def purpose_common(purpose):
    if 'property' in purpose:
        return 'property'
    elif 'estate' in purpose:
        return 'property'
    elif 'hous' in purpose:
        return 'property'
    elif 'car' in purpose:
        return 'car'
    elif 'educ' in purpose:
        return 'education'
    elif 'univ' in purpose:
        return 'education'
    elif 'wedd' in purpose:
        return 'wedding'
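
The same keyword matching can be sketched as a lookup table, which keeps the rules in one place. The names `PURPOSE_KEYWORDS` and `categorize_purpose`, and the `'other'` fallback, are assumptions for illustration; the function above returns `None` when nothing matches.

```python
# Keyword fragments are checked in order; the first match decides the category.
PURPOSE_KEYWORDS = [
    ('property', 'property'),
    ('estate', 'property'),
    ('hous', 'property'),
    ('car', 'car'),
    ('educ', 'education'),
    ('univ', 'education'),
    ('wedd', 'wedding'),
]


def categorize_purpose(purpose, default='other'):
    """Return the category of the first keyword fragment found in `purpose`."""
    for fragment, category in PURPOSE_KEYWORDS:
        if fragment in purpose:
            return category
    # Hypothetical fallback so unmatched purposes are visible in value_counts().
    return default


for text in ['buying a second-hand car', 'going to university', 'housing renovation']:
    print(text, '->', categorize_purpose(text))
```

An explicit fallback label makes any purpose that slips through the rules show up as its own category instead of as a missing value.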

In [96]:
new_df['purpose_categorize'] = new_df['purpose'].apply(purpose_common)

In [97]:
new_df['purpose_categorize'].value_counts()

property     10763
car           4284
education     3995
wedding       2310
Name: purpose_categorize, dtype: int64

In [98]:
new_df[['total_income']]

Unnamed: 0,total_income
0,40620.102
1,17932.802
2,23341.752
3,42820.568
4,25378.572
...,...
21347,35966.698
21348,24959.969
21349,14347.610
21350,39054.888


In [99]:
new_df['total_income'].describe()

count     21352.000000
mean      26464.393364
std       15725.852146
min        3306.762000
25%       17222.623000
50%       23136.302000
75%       31321.653000
max      362496.645000
Name: total_income, dtype: float64

In [100]:
def income_categorize(income):
    if income <= 50000:
        return "low"
    elif income <= 100000:
        return "middle"
    elif income > 100000:
        return "high"

In [101]:
new_df['income_categorize'] = new_df['total_income'].apply(income_categorize)

In [102]:
new_df['income_categorize'].value_counts()

low       20034
middle     1219
high         99
Name: income_categorize, dtype: int64
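
As an alternative sketch, the same three bands can be produced with `pd.cut`, which bins a whole column at once. The thresholds are the ones used in `income_categorize`; the sample values are illustrative, with the first and last taken from the `describe()` output above.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([3306.762, 50000.0, 50000.01, 100000.0, 362496.645])

# Intervals are closed on the right by default, so 50000 falls in 'low'
# and 100000 falls in 'middle', matching the <= comparisons above.
categories = pd.cut(
    incomes,
    bins=[-np.inf, 50000, 100000, np.inf],
    labels=['low', 'middle', 'high'],
)
print(categories.tolist())
```

The result is an ordered categorical column, so sorting and grouping follow the low, middle, high order automatically.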

## Hypothesis Checking

**Is there a correlation between having children and the probability of loan default?**

In [103]:
pivot_table_children = new_df.pivot_table(index='children', columns='debt', values='total_income', aggfunc='count')
pivot_table_children[1] = pivot_table_children[1].fillna(0)
pivot_table_children['percentage_debt'] = pivot_table_children[1]/(pivot_table_children[1]+pivot_table_children[0])
pivot_table_children['percentage_debt'] = pivot_table_children['percentage_debt'].apply("{:,.2%}".format)
pivot_table_children

debt,0,1,percentage_debt
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,12963.0,1058.0,7.55%
1,4397.0,442.0,9.13%
2,1912.0,202.0,9.56%
3,301.0,27.0,8.23%
4,37.0,4.0,9.76%
5,9.0,0.0,0.00%


Customers without children have the lowest non-payment rate among the groups of meaningful size: 7.55%, compared with 8.23% to 9.76% for customers with one to four children. This may be because they have no child-related expenses, which makes repayment easier. The 0% rate for customers with 5 children looks surprising, but that group contains only 9 customers, so it is too small to support a conclusion. Whether their income plays a role would need a separate check.
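
Group sizes in the table above differ widely, from 9 customers with 5 children to 14,021 customers without children. A rough way to see how much that matters is a confidence interval for each rate. The sketch below uses the Wilson score interval, which is an assumption introduced here and not part of the original analysis; the counts are taken from the table.

```python
import math


def wilson_interval(defaults, total, z=1.96):
    """Approximate 95% Wilson score interval for a default rate."""
    p = defaults / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# 5 children: 0 defaults out of 9 customers.
low_5, high_5 = wilson_interval(0, 9)
# No children: 1058 defaults out of 12963 + 1058 = 14021 customers.
low_0, high_0 = wilson_interval(1058, 14021)

print(f"5 children:  {low_5:.2%} to {high_5:.2%}")
print(f"no children: {low_0:.2%} to {high_0:.2%}")
```

Under this approximation the interval for the 5-children group stretches from 0% to roughly 30%, while the interval for customers without children is less than one percentage point wide.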

**Is there a correlation between family status and the probability of loan default?**

In [104]:
pivot_table_family = new_df.pivot_table(index='family_status', columns='debt', values='total_income', aggfunc='count')
pivot_table_family['percentage_debt'] = pivot_table_family[1]/(pivot_table_family[1]+pivot_table_family[0])
pivot_table_family['percentage_debt'] = pivot_table_family['percentage_debt'].apply("{:,.2%}".format)
pivot_table_family

debt,0,1,percentage_debt
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
civil partnership,3743,386,9.35%
divorced,1100,85,7.17%
married,11363,927,7.54%
unmarried,2521,273,9.77%
widow / widower,892,62,6.50%


Widowed and divorced customers have the lowest non-payment rates (6.50% and 7.17%). This may be because most of them live independently and have fewer family or child responsibilities. The highest non-payment rate is among unmarried customers (9.77%), followed by those in a civil partnership (9.35%).

**How does the loan purpose affect the default rate?**

In [105]:
pivot_table_credit = new_df.pivot_table(index='purpose_categorize', columns='debt', values='total_income', aggfunc='count')
pivot_table_credit['percentage_debt'] = pivot_table_credit[1]/(pivot_table_credit[1]+pivot_table_credit[0])
pivot_table_credit['percentage_debt'] = pivot_table_credit['percentage_debt'].apply("{:,.2%}".format)
pivot_table_credit

debt,0,1,percentage_debt
purpose_categorize,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
car,3884,400,9.34%
education,3625,370,9.26%
property,9984,779,7.24%
wedding,2126,184,7.97%


Car loans have the highest default rate (9.34%), closely followed by education loans (9.26%). In the case of cars, this may be because the borrower must cover not only the purchase price but also additional costs such as taxes. Property loans have the lowest default rate (7.24%). A possible explanation is that property is repaid in regular monthly or yearly installments and can itself be used to generate income, for example by running a business from it.
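
The three pivot tables above follow the same steps, so they can be sketched as one reusable helper. Because 'debt' is coded as 0 or 1, its sum per group is the number of defaults. The function name `default_rate_by` and the toy values are assumptions for illustration.

```python
import pandas as pd


def default_rate_by(df, column, target='debt'):
    """Count customers and defaults per group and compute the default rate."""
    summary = df.groupby(column)[target].agg(customers='count', defaults='sum')
    summary['default_rate'] = summary['defaults'] / summary['customers']
    # Highest default rate first, so the riskiest group is at the top.
    return summary.sort_values('default_rate', ascending=False)


# Toy data (hypothetical values, not the project data).
toy = pd.DataFrame({
    'income_categorize': ['low', 'low', 'low', 'low', 'middle', 'middle', 'high'],
    'debt':              [1, 0, 0, 0, 1, 0, 0],
})

rates = default_rate_by(toy, 'income_categorize')
print(rates)
```

Keeping the customer count next to the rate makes small groups easy to spot. The same call could be applied to `new_df` with 'income_categorize' to examine the income hypothesis.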

## Conclusion

The analysis begins with us forming hypotheses. These hypotheses include:
1. Someone fails to repay the loan due to other expenses for their spouse and children.
2. Someone fails to repay the loan because their income turns out to be insufficient.
3. Differences in loan purposes can be a factor in someone's failure to repay the loan.

Before testing these hypotheses, we first perform data preprocessing, including:
1. Converting values in the education column to all lowercase.
2. Correcting incorrect values in the children column.
3. Changing negative values in the days_employed column to positive values.
4. Removing 0 values in the dob_years column.
5. Removing invalid values such as 'XNA' in the gender column.
6. Removing duplicate values.

After all the data is clean, we proceed to fill in the missing values:
* Total_income column:
1. Categorizing the dob_years column into age ranges.
2. Then, I investigate which categories have a significant impact on someone's income. It turns out that occupation and age are significant factors in a person's income.
3. Afterward, we determine which value is more suitable for filling in the missing values, the mean or the median.
4. It turns out that the median value is suitable for filling in the missing values due to the presence of significant outliers.
* Days_employed column:
1. The steps are similar to those for the total_income column.
2. Here, I use the mean value because the outliers are not as significant. Thus, it is fair to fill in the missing values with the mean.

In filling the missing values, there was one value left unfilled, specifically in row 5907. Therefore, I filled it according to the median and mean values corresponding to the age group of 50-59.

Afterwards, I categorized the total_income and purpose columns for ease of analysis.

The hypothesis testing began and yielded the following:
1. Customers with no children have a lower non-payment rate (7.55%) than customers with one to four children (8.23% to 9.76%). This may be because they have no child-related expenses, making it easier for them to repay the loan. Customers with 5 children show a 0% rate, but that group contains only 9 customers, so no conclusion can be drawn from it.
2. Widowed and divorced customers have the lowest non-payment rates (6.50% and 7.17%). This may be because most of them live independently and have fewer family or child responsibilities. The highest non-payment rate is among unmarried customers (9.77%).
3. Customers who borrow to buy a car have the highest default rate (9.34%). This may be because buying a car involves not only repaying the purchase cost but also additional expenses such as taxes. The lowest default rate is found among customers who borrow for property (7.24%), possibly because property is repaid in regular monthly or yearly installments.

Across all the tables above, the single highest non-payment rate is among unmarried customers (9.77%). This may be related to the purposes for which they borrow, but the tables do not break down loan purpose by family status, so that explanation remains untested. The second hypothesis, on insufficient income, was also not checked in the tables above.