# Analyzing borrowers’ risk of defaulting

This project prepares a report for a bank’s loan division. Our goal is to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. We already have some data on customers’ creditworthiness.

This report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

<b>Goals:</b>

- Find out whether a customer's marital status and number of children have an impact on whether they will default on a loan.

<b>Stages:</b>

<b>Step 1.</b> Open the data file `/datasets/credit_scoring_eng.csv` and have a look at the general information.

<b>Step 2.</b> Preprocess the data:

- Identify and fill in missing values
- Replace the real number data type with the integer type
- Delete duplicate data
- Categorize the data

<b>Explain:</b>

- Which missing values you identified
- Possible reasons these missing values were present
- Which method you used to fill in missing values
- Which method you used to find and delete duplicate data and why
- Possible reasons why duplicate data was present
- Which method you used to change the data type and why
- Which dictionaries you've selected for this dataset and why

The data may contain artifacts, or values that don't correspond to reality (for instance, a negative number of days employed). This kind of thing happens when you're working with real data. You need to describe the possible reasons such data may have turned up and process it.

<b>Step 3.</b> Answer these questions:

- Is there a connection between having kids and repaying a loan on time?
- Is there a connection between marital status and repaying a loan on time?
- Is there a connection between income level and repaying a loan on time?
- How do different loan purposes affect on-time loan repayment?

Interpret your answers. Explain what the results you obtained mean.

<b>Step 4.</b> Write an overall conclusion.

## General Information

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np

# Loading the data
credit_scoring = pd.read_csv('/datasets/credit_scoring_eng.csv')
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

In [2]:
credit_scoring.shape# To see how many rows and columns our dataset has

(21525, 12)

In [3]:
credit_scoring.head(10)# let's print the first 10 rows

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


Comment: In the 'days_employed' column, some of the entries are negative, and row 4 shows an implausibly large value (340266 days). These might be errors. The first 10 applicants don't seem to have any debt; further analysis is required to see if this is accurate.

In [4]:
credit_scoring.info()# Get info on data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Comment: There seem to be missing values in the 'total_income' and 'days_employed' columns: each has 19351 non-null entries out of 21525.

In [5]:
print(credit_scoring['days_employed'].isna().sum())  # Counting the missing values in the first column with missing data


2174


Comment: Based on the info above and below, it looks like the 'days_employed' and 'total_income' columns have the same number of missing values.

In [6]:
missing_values_share = credit_scoring.isnull().sum() * 100 / len(credit_scoring)  # Percentage of missing values in each column
missing_values_share


children             0.000000
days_employed       10.099884
dob_years            0.000000
education            0.000000
education_id         0.000000
family_status        0.000000
family_status_id     0.000000
gender               0.000000
income_type          0.000000
debt                 0.000000
total_income        10.099884
purpose              0.000000
dtype: float64
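Equal counts do not by themselves show that the gaps fall in the same rows. A minimal sketch of how that could be checked, using a small made-up table (`toy` and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; the values are made up.
toy = pd.DataFrame({
    "days_employed": [-8437.7, np.nan, -926.2, np.nan],
    "total_income": [40620.1, np.nan, 40922.2, np.nan],
})

# Boolean masks: True where each column is missing.
missing_days = toy["days_employed"].isna()
missing_income = toy["total_income"].isna()

# If the two masks are identical, the gaps occur in exactly the same rows.
same_rows = bool((missing_days == missing_income).all())
print(same_rows)
```

If the same check returned `True` on the real data, it would suggest that both fields were skipped together for those applicants rather than lost independently.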

Comment: In the 'days_employed' column, we'll go ahead and convert the negative values to positive ones. After that, we'll fill the missing values with the average 'days_employed'.

In [7]:
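credit_scoring['days_employed'] = credit_scoring['days_employed'].abs()  # Converting negative values to positive
days_employed_mean = credit_scoring['days_employed'].mean()
credit_scoring['days_employed'] = credit_scoring['days_employed'].fillna(days_employed_mean)  # Filling missing values with the mean
credit_scoring['days_employed'].isnull().sum()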


0
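The two steps above (take the absolute value, then fill the gaps with the mean) can be sketched on a tiny made-up column to make the effect visible. The `toy` frame and `fill_value` name are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Made-up values: two negatives, one positive, one missing.
toy = pd.DataFrame({"days_employed": [-100.0, 200.0, np.nan, -300.0]})

# Treat the sign as an entry artifact and keep the magnitude.
toy["days_employed"] = toy["days_employed"].abs()

# mean() skips NaN by default, so this is the mean of 100, 200 and 300.
fill_value = toy["days_employed"].mean()

# Assigning the result back avoids relying on in-place changes to a column selection.
toy["days_employed"] = toy["days_employed"].fillna(fill_value)
print(toy["days_employed"].tolist())
```

Note that the mean is sensitive to extreme values, so a few very large entries can pull the fill value well above what is typical.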

**Intermediate conclusion**

Comment: Next we'll do the same thing for the 'total_income' column. We'll fill in the 'total_income' missing values with the average 'total_income'.

In [8]:
total_income_mean = credit_scoring['total_income'].mean()  # Mean of the available 'total_income' values
credit_scoring['total_income'] = credit_scoring['total_income'].fillna(total_income_mean)  # Filling missing values with the mean
credit_scoring['total_income'].isnull().sum()


0

Comment: Now there aren't any missing values in our dataset.

In [9]:
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**Possible reasons for missing values in data**

Comment: It's unclear why there might be missing values. A few possible explanations could be that the data was not submitted correctly, or there might have been a systems error.


**Intermediate conclusion**

In [10]:
duplicates = credit_scoring.duplicated().sum()
print(f'There are {duplicates} duplicated rows.')# Checking for duplicated rows


There are 54 duplicated rows.


**Intermediate conclusion**

In [11]:
credit_scoring['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

**Conclusions**

Comment: We seem to have 54 duplicated rows and a lot of inconsistent naming conventions under the 'education' column. We can correct those as well. 

## Data transformation

We're going to go through each column to see if there are any other mistakes or inconsistencies. We'll begin by removing duplicates.

In [12]:
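credit_scoring.drop_duplicates()  # Note: this returns a new DataFrame; the result is not assigned, so credit_scoring itself is unchanged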

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


In [13]:
credit_scoring.duplicated().sum()

54

In [14]:
credit_scoring['education'].unique()# Let's see all values in education column to check if and what spellings will need to be fixed


array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [15]:
credit_scoring['education'] = credit_scoring['education'].str.lower()  # Changing all the values to lowercase
print(credit_scoring.loc[:, 'education'].unique())

["bachelor's degree" 'secondary education' 'some college'
 'primary education' 'graduate degree']


In [16]:
credit_scoring.duplicated().sum()

71

In [17]:
credit_scoring.loc[credit_scoring.duplicated(), :]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,66914.728907,41,secondary education,1,married,0,F,employee,0,26787.568355,purchase of the house for my family
3290,0,66914.728907,58,secondary education,1,civil partnership,1,F,retiree,0,26787.568355,to have a wedding
4182,1,66914.728907,34,bachelor's degree,0,civil partnership,1,F,employee,0,26787.568355,wedding ceremony
4851,0,66914.728907,60,secondary education,1,civil partnership,1,F,retiree,0,26787.568355,wedding ceremony
5557,0,66914.728907,58,secondary education,1,civil partnership,1,F,retiree,0,26787.568355,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0,66914.728907,64,secondary education,1,married,0,F,retiree,0,26787.568355,supplementary education
21032,0,66914.728907,60,secondary education,1,married,0,F,retiree,0,26787.568355,to become educated
21132,0,66914.728907,47,secondary education,1,married,0,F,employee,0,26787.568355,housing renovation
21281,1,66914.728907,30,bachelor's degree,0,married,0,F,employee,0,26787.568355,buy commercial real estate


Comment: We have cleaned up the naming convention in the 'education' column, but there still seem to be 71 duplicates in our dataset. The earlier `drop_duplicates()` call did not save its result, and lowercasing made more rows identical (54 before, 71 now). From here, we can call the `drop_duplicates` method again and save the result back to the original 'credit_scoring' dataset.
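To illustrate why the duplicate count rose after lowercasing, here is a small sketch on made-up rows. The `toy` frame is an assumption, and the `reset_index` call is an optional extra that the notebook itself does not use:

```python
import pandas as pd

# Three rows that differ only in the capitalization of 'education'.
toy = pd.DataFrame({
    "education": ["secondary education", "Secondary Education", "secondary education"],
    "education_id": [1, 1, 1],
})

# duplicated() compares rows exactly, so the capitalized variant is not counted yet.
before = int(toy.duplicated().sum())

toy["education"] = toy["education"].str.lower()

# After normalizing case, the variant matches the other rows as well.
after = int(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame, so the result has to be assigned.
toy = toy.drop_duplicates().reset_index(drop=True)
print(before, after, len(toy))
```

Normalizing text columns before counting duplicates gives a more complete picture than counting first.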
    

In [18]:
credit_scoring= credit_scoring.drop_duplicates()
credit_scoring.shape

(21454, 12)

In [19]:
credit_scoring.duplicated().sum()

0

Comment: Now our dataset looks cleaner without any duplicates.

In [20]:
credit_scoring['education']# Checking all the values in the column to make sure we fixed them

0          bachelor's degree
1        secondary education
2        secondary education
3        secondary education
4        secondary education
                ...         
21520    secondary education
21521    secondary education
21522    secondary education
21523    secondary education
21524    secondary education
Name: education, Length: 21454, dtype: object

Comment: Next we'll check the 'children' column

In [21]:
credit_scoring['children'].unique()# Checking all unique values in the 'children' column


array([ 1,  0,  3,  2, -1,  4, 20,  5])

In [22]:
credit_scoring['children'].value_counts()

 0     14091
 1      4808
 2      2052
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

Comment: There seem to be 47 applicants with -1 children. Since this is a mistake, we'll convert the -1 to a regular 1. The 76 applicants with 20 children also look suspicious, but we leave them unchanged.

In [23]:
credit_scoring['children']= credit_scoring['children'].abs() #converting the negative values to positive
credit_scoring['children'].value_counts() #double checking the result

0     14091
1      4855
2      2052
3       330
20       76
4        41
5         9
Name: children, dtype: int64

Comment: Now the 47 have been combined with the 4808 others who also have 1 child. 
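An alternative to `abs()` is to map only the known artifact and separately flag values that look implausible. This is a sketch on made-up data; the `MAX_PLAUSIBLE_CHILDREN` threshold is a hypothetical choice, not something taken from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({"children": [0, 1, -1, 20, 2]})

# Map only the known artifact; every other value is left untouched.
toy["children"] = toy["children"].replace({-1: 1})

# Hypothetical threshold: treat anything above 10 as implausible and review it.
MAX_PLAUSIBLE_CHILDREN = 10
suspicious = toy[toy["children"] > MAX_PLAUSIBLE_CHILDREN]
print(suspicious)
```

Flagging instead of silently correcting keeps the decision about such rows explicit.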

In [24]:
credit_scoring.duplicated().sum()

0

In [25]:
credit_scoring.isnull().sum()


children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

In [26]:
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21454 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21454 non-null  int64  
 1   days_employed     21454 non-null  float64
 2   dob_years         21454 non-null  int64  
 3   education         21454 non-null  object 
 4   education_id      21454 non-null  int64  
 5   family_status     21454 non-null  object 
 6   family_status_id  21454 non-null  int64  
 7   gender            21454 non-null  object 
 8   income_type       21454 non-null  object 
 9   debt              21454 non-null  int64  
 10  total_income      21454 non-null  float64
 11  purpose           21454 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


In [27]:
credit_scoring['children'].isna().sum()# Checking the `children` column again to make sure it's all fixed

0

Comment: Next we'll check the `days_employed` column for missing or otherwise problematic values.

In [28]:
credit_scoring['days_employed'].isnull().sum()  # Checking for missing values in `days_employed`


0

In [29]:
credit_scoring['days_employed']

0          8437.673028
1          4024.803754
2          5623.422610
3          4124.747207
4        340266.072047
             ...      
21520      4529.316663
21521    343937.404131
21522      2113.346888
21523      3112.481705
21524      1984.507589
Name: days_employed, Length: 21454, dtype: float64

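Comment: Some values, such as 340266 days (more than 900 years), are clearly unrealistic; they are left as they are here. Next we'll convert the 'days_employed' values to integers so the column looks cleaner.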

In [30]:
credit_scoring['days_employed']= credit_scoring['days_employed'].astype(int)
credit_scoring['days_employed']

0          8437
1          4024
2          5623
3          4124
4        340266
          ...  
21520      4529
21521    343937
21522      2113
21523      3112
21524      1984
Name: days_employed, Length: 21454, dtype: int64
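Values such as 340266 days cannot be real work experience. One possible way to flag them, sketched here on three rows copied from the output above, is to compare the implied years of employment with the client's age. The `years_employed` and `implausible` column names are assumptions introduced for illustration:

```python
import pandas as pd

# Three (days_employed, dob_years) pairs taken from rows shown earlier.
toy = pd.DataFrame({
    "days_employed": [8437, 340266, 2113],
    "dob_years": [42, 53, 38],
})

# Convert days to approximate years.
toy["years_employed"] = toy["days_employed"] / 365

# Nobody can have worked for longer than they have been alive.
toy["implausible"] = toy["years_employed"] > toy["dob_years"]

# Share of rows flagged as implausible.
share = toy["implausible"].mean()
print(toy["implausible"].tolist())
```

How to treat the flagged rows (for example, replacing them with a group median) would be a separate decision.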

Comment: Now we'll look at the client's age and whether there are any issues there.

In [31]:
credit_scoring['dob_years'].unique()# Checking the `dob_years` for suspicious values

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51,  0, 59, 29, 60, 55, 58, 71, 22, 73,
       66, 69, 19, 72, 70, 74, 75])

In [32]:
credit_scoring['dob_years'].isnull().sum()

0

In [33]:
credit_scoring['dob_years'].value_counts()

35    616
40    607
41    605
34    601
38    597
42    596
33    581
39    572
31    559
36    554
44    545
29    544
30    537
37    536
48    536
50    513
43    512
32    509
49    508
28    503
45    496
27    493
52    484
56    483
47    477
54    476
46    472
53    459
57    456
58    454
51    446
59    443
55    443
26    408
60    374
25    357
61    354
62    348
63    269
24    264
64    260
23    252
65    193
22    183
66    182
67    167
21    111
0     101
68     99
69     85
70     65
71     56
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

Comment: Here we can see that there are 101 applicants with the age '0'. We'll just drop rows with the age '0'.

In [34]:
credit_scoring.loc[credit_scoring['dob_years'] == 0, :]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
99,0,346541,0,secondary education,1,married,0,F,retiree,0,11406.644000,car
149,0,2664,0,secondary education,1,divorced,3,F,employee,0,11228.230000,housing transactions
270,3,1872,0,secondary education,1,married,0,F,employee,0,16346.633000,housing renovation
578,0,397856,0,secondary education,1,married,0,F,retiree,0,15619.310000,construction of own property
1040,0,1158,0,bachelor's degree,0,divorced,3,F,business,0,48639.062000,to own a car
...,...,...,...,...,...,...,...,...,...,...,...,...
19829,0,66914,0,secondary education,1,married,0,F,employee,0,26787.568355,housing
20462,0,338734,0,secondary education,1,married,0,F,retiree,0,41471.027000,purchase of my own house
20577,0,331741,0,secondary education,1,unmarried,4,F,retiree,0,20766.202000,property
21179,2,108,0,bachelor's degree,0,married,0,M,business,0,38512.321000,building a real estate


In [35]:
credit_scoring.drop(credit_scoring[credit_scoring['dob_years'] == 0].index, inplace= True)#dropping '0' values in the age column


In [36]:
credit_scoring['dob_years'].value_counts() #checking to see if the '0' values have been removed

35    616
40    607
41    605
34    601
38    597
42    596
33    581
39    572
31    559
36    554
44    545
29    544
30    537
48    536
37    536
50    513
43    512
32    509
49    508
28    503
45    496
27    493
52    484
56    483
47    477
54    476
46    472
53    459
57    456
58    454
51    446
55    443
59    443
26    408
60    374
25    357
61    354
62    348
63    269
24    264
64    260
23    252
65    193
22    183
66    182
67    167
21    111
68     99
69     85
70     65
71     56
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

Comment: Now all the rows where the age of the applicant is 0 have been removed from the dataset.


Comment: Now we'll check the `family_status` column.

In [37]:
credit_scoring['family_status'].unique()# Let's see the values for the column

array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

In [38]:
credit_scoring['family_status'].isnull().sum()# Checking if there are missing values

0

In [39]:
credit_scoring['family_status'].value_counts()


married              12290
civil partnership     4130
unmarried             2794
divorced              1185
widow / widower        954
Name: family_status, dtype: int64

Comment: There don't seem to be any obvious issues in the 'family_status' column. Now we'll check the `gender` column to see what kind of values there are and what problems we may need to address.

In [40]:
credit_scoring['gender']# Let's see the values in the column

0        F
1        F
2        M
3        M
4        F
        ..
21520    F
21521    F
21522    M
21523    M
21524    F
Name: gender, Length: 21353, dtype: object

In [41]:
credit_scoring['gender'].isnull().sum() # Checking to see if there are any missing values in the 'gender' column

0

In [42]:
credit_scoring['gender'].unique() # Checking for unique values in 'gender' column

array(['F', 'M', 'XNA'], dtype=object)
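The `unique()` output includes 'XNA' alongside 'F' and 'M'. A quick sketch of how one could count such values and isolate the affected rows, on made-up data (the `KNOWN_GENDERS` set is an assumption):

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["F", "M", "F", "XNA", "M", "F"]})

# How often each value occurs.
counts = toy["gender"].value_counts()

# Assumed set of expected values; anything else is treated as unexpected.
KNOWN_GENDERS = {"F", "M"}
unexpected = toy[~toy["gender"].isin(KNOWN_GENDERS)]
print(counts)
```

If only a handful of rows are affected, keeping or dropping them is unlikely to change the overall results.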

Comment: The 'XNA' value in the 'gender' column is left unchanged here. Now we'll check the `income_type` column to see what kind of values there are and what problems we may need to address.

In [43]:
credit_scoring['income_type'].unique()# Let's see the values in the column

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

In [44]:
credit_scoring['income_type'].isnull().sum() #Checking to see if there are missing values

0

Comment: There don't seem to be any issues in the 'income_type' column. Now let's see if we have any duplicates in our data.

In [45]:
duplicated_rows = credit_scoring[credit_scoring.duplicated()]# Checking duplicates
print(duplicated_rows.count())

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64


## Working with missing values

### Restoring missing values in `total_income`

Comment: Next we want to categorize the values in the 'dob_years' column to make the data easier to read. People over 65 years old are labeled 'Retired', people between 18 and 65 are labeled 'Adult', and people under 18 are labeled 'Minor'.

In [46]:
def age(dob_years):
    # Returns the age category for a given age in years
    if dob_years > 65:
        return 'Retired'
    elif 17 < dob_years <= 65:
        return 'Adult'
    else:
        return 'Minor'

In [47]:
print(age(37))# Testing if the function works

Adult
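The same three age bands could also be produced without a custom function by using `pd.cut`. This is a sketch rather than the method used in this notebook; the upper bound of 120 is an assumed limit chosen only to close the last interval:

```python
import pandas as pd

# Boundary ages chosen to exercise each edge of the bands.
ages = pd.Series([17, 18, 65, 66])

age_category = pd.cut(
    ages,
    bins=[0, 17, 65, 120],  # intervals (0, 17], (17, 65], (65, 120]
    labels=["Minor", "Adult", "Retired"],
    include_lowest=True,  # also include 0 in the first interval
)
print(age_category.tolist())
```

Testing the boundary values (17, 18, 65, 66) is a simple way to confirm that the bands match the written description.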


In [48]:
credit_scoring['age_category'] = credit_scoring['dob_years'].apply(age) # Applying the 'age_category' column to the dataset
print(credit_scoring.head(10))

   children  days_employed  dob_years            education  education_id  \
0         1           8437         42    bachelor's degree             0   
1         1           4024         36  secondary education             1   
2         0           5623         33  secondary education             1   
3         3           4124         32  secondary education             1   
4         0         340266         53  secondary education             1   
5         0            926         27    bachelor's degree             0   
6         0           2879         43    bachelor's degree             0   
7         0            152         50  secondary education             1   
8         2           6929         35    bachelor's degree             0   
9         0           2188         41  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1

In [49]:
credit_scoring['age_category']# Checking values in the new column

0          Adult
1          Adult
2          Adult
3          Adult
4          Adult
          ...   
21520      Adult
21521    Retired
21522      Adult
21523      Adult
21524      Adult
Name: age_category, Length: 21353, dtype: object

In [50]:
credit_scoring_nonull = credit_scoring.dropna()# Creating a table without missing values and print a few of its rows to make sure it looks fine

credit_scoring_nonull.head(5)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,Adult
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,Adult
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,Adult
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,Adult
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,Adult


In [51]:
credit_scoring_nonull['total_income'].mean()# the mean of the 'total_income' values

26793.762710827545

In [52]:
credit_scoring['total_income'].median() # The median of the 'total_income' values

24964.052000000003

In [53]:
credit_scoring['total_income'].max() # The maximum value in the 'total_income' column

362496.645

In [54]:
credit_scoring['total_income'].min() # The minimum value in the 'total_income' column

3306.762

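Comment: Because the missing values were already filled earlier, dropping NaN rows changes nothing at this point. The mean (about 26,794) is somewhat higher than the median (about 24,964), which is consistent with a few very high incomes (the maximum is 362,496.645) pulling the mean up.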

In [55]:
credit_scoring['total_income']= credit_scoring['total_income'].astype(int) #Changing the data type from float to integer
credit_scoring['total_income']

0        40620
1        17932
2        23341
3        42820
4        25378
         ...  
21520    35966
21521    24959
21522    14347
21523    39054
21524    13127
Name: total_income, Length: 21353, dtype: int64

In [56]:
credit_scoring['total_income'].isnull().sum() #checking for missing values

0

In [57]:
income_median = credit_scoring['total_income'].median()

credit_scoring['total_income'] = credit_scoring['total_income'].fillna(income_median)  # No effect here: no missing values remain at this point

credit_scoring['total_income'].isnull().sum()

0
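A single overall mean or median ignores the fact that income can differ a lot between employment types. As an alternative sketch (not the approach taken in this notebook), missing incomes could be filled with the median of each `income_type` group. The data below is made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "total_income": [20000.0, 30000.0, np.nan, 15000.0, np.nan],
})

# For every row, the median income of that row's income_type (NaN values are skipped).
group_median = toy.groupby("income_type")["total_income"].transform("median")

# Fill each gap with the median of its own group.
toy["total_income"] = toy["total_income"].fillna(group_median)
print(toy["total_income"].tolist())
```

The median is used here because it is less affected by a few very high incomes than the mean.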

###  Restoring values in `days_employed`

In [58]:
days_employed_median = credit_scoring['days_employed'].median()# Checking the median in the 'days_employed' column
print(days_employed_median)
print()


2595.0



Comment: Check that the total number of values in this column matches the number of values in other ones.

In [59]:
credit_scoring['total_income'].shape# Checking the entries in all columns to make sure we fixed all missing values

(21353,)

In [60]:
credit_scoring['days_employed'].shape

(21353,)

In [61]:
credit_scoring['dob_years'].shape

(21353,)

Comment: Confirmed. They all have the same number of rows now.

## Categorization of data

In [62]:
credit_scoring.info()#Overview of column names and values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21353 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21353 non-null  int64 
 1   days_employed     21353 non-null  int64 
 2   dob_years         21353 non-null  int64 
 3   education         21353 non-null  object
 4   education_id      21353 non-null  int64 
 5   family_status     21353 non-null  object
 6   family_status_id  21353 non-null  int64 
 7   gender            21353 non-null  object
 8   income_type       21353 non-null  object
 9   debt              21353 non-null  int64 
 10  total_income      21353 non-null  int64 
 11  purpose           21353 non-null  object
 12  age_category      21353 non-null  object
dtypes: int64(7), object(6)
memory usage: 2.3+ MB


Comment: Let's check for unique values

In [63]:
credit_scoring['education'].unique()# Checking unique values

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

In [64]:
credit_scoring['family_status'].unique() # Checking unique values

array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

In [65]:
credit_scoring['gender'].unique() # Checking unique values

array(['F', 'M', 'XNA'], dtype=object)

In [66]:
credit_scoring['income_type'].unique() # Checking unique values

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

In [67]:
credit_scoring['purpose'].unique() # Checking unique values

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

Comment: Based on the unique values in the 'purpose' column, the main categories are: Realestate, Car Purchase, Education, and Wedding. We'll categorize the data along these themes.

In [68]:
def loan_purpose(purpose):
    # Categorizes a single purpose string into one of the 4 categories
    if 'house' in purpose:
        return 'Realestate'
    elif 'estate' in purpose:
        return 'Realestate'
    elif 'housing' in purpose:
        return 'Realestate'
    elif 'property' in purpose:
        return 'Realestate'
    elif 'car' in purpose:
        return 'Car Purchase'
    elif 'education' in purpose:
        return 'Education'
    else:
        # Caution: any purpose not matched above (e.g. 'to become educated') also ends up here
        return 'wedding'

print(loan_purpose('to have a wedding'))  # Checking that the function works on a single value

wedding
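Because the function treats everything unmatched as a wedding loan, a purpose such as 'to become educated' would be mislabeled. A possible stricter variant, sketched below, keeps the keywords in an ordered list and returns an explicit fallback. The names `PURPOSE_KEYWORDS` and `categorize_purpose`, the extra 'educated' keyword, and the 'other' label are assumptions added for illustration:

```python
# (keyword, category) pairs, checked in order; category names follow the function above.
PURPOSE_KEYWORDS = [
    ("house", "Realestate"),
    ("estate", "Realestate"),
    ("housing", "Realestate"),
    ("property", "Realestate"),
    ("car", "Car Purchase"),
    ("education", "Education"),
    ("educated", "Education"),  # assumed extra keyword for 'to become educated'
    ("wedding", "wedding"),
]


def categorize_purpose(purpose):
    """Return the category of the first keyword found in the purpose string."""
    for keyword, category in PURPOSE_KEYWORDS:
        if keyword in purpose:
            return category
    # Explicit fallback so unmatched purposes are visible instead of mislabeled.
    return "other"


print(categorize_purpose("to become educated"))
```

After applying such a function, counting the rows labeled 'other' shows whether any purposes still need a keyword.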


In [69]:
credit_scoring['purpose'] = credit_scoring['purpose'].apply(loan_purpose)  # Replacing each purpose with its category


Comment: Next we'll work on the 'debt' column

In [70]:
credit_scoring['debt'].value_counts()#checking on the debt versus default values


0    19620
1     1733
Name: debt, dtype: int64

In [71]:
'''
def debt_rename(value):#Renaming the values in the 'debt' column
        if value == 1:
            return('Defaulted')
        else:
            return('Paid')
    
'''

Comment: In the 'debt' column, we wanted to show clearly who 'Paid' and who 'Defaulted'.

** I disabled this function so that the column stays numeric and can easily be used in a pivot table later on.

In [72]:
#credit_scoring['debt']= credit_scoring['debt'].apply(debt_rename) #applying the name change to the function
credit_scoring['debt'].value_counts() #checking to see if it worked

0    19620
1     1733
Name: debt, dtype: int64

Comment: Here we can clearly see 19,620 people paid their debts, while 1,733 people defaulted.
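
The two counts translate into an overall default rate of roughly 8.1%, which is a useful baseline for the group comparisons that follow. A small sketch of the arithmetic, using the counts from the output above (the variable names are illustrative):

```python
# Counts copied from the value_counts() output above.
paid, defaulted = 19620, 1733

# Share of borrowers who defaulted.
default_rate = defaulted / (paid + defaulted)
print(f"Overall default rate: {default_rate:.2%}")
```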

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

In [73]:
children_debt = credit_scoring.pivot_table(index= ['children'], values= ['debt'], aggfunc= 'mean').sort_values(('debt'), ascending=True)
# Check the children data and paying back on time

children_debt# Calculating default-rate based on the number of children

| children | debt |
|---|---|
| 5 | 0.0 |
| 0 | 0.075453 |
| 3 | 0.082317 |
| 1 | 0.091341 |
| 2 | 0.095145 |
| 4 | 0.097561 |
| 20 | 0.106667 |


**Conclusion**

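Borrowers with no children have the lowest default rate among the main groups (about 7.5%), while those with one to four children default somewhat more often (roughly 8.2% to 9.8%). The differences are small, so the number of children is at most a weak indicator. The 0% rate for 5 children and the 10.7% rate for 20 children should be treated with caution: these groups are likely very small, and a value of 20 children is probably a data-entry artifact rather than a real family size.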
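
A group's default rate is easier to interpret when the group size is shown next to it; a rate of 0.0, for example, may rest on only a handful of borrowers. The sketch below shows one way to report both, using a small made-up frame (`toy`) in place of `credit_scoring`. The data and the column names `borrowers`, `defaults`, and `default_rate` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for credit_scoring; values are illustrative only.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 1, 5],
    "debt":     [0, 0, 0, 1, 0, 1, 0, 0],
})

# Report group size and number of defaults next to the default rate.
summary = (
    toy.groupby("children")["debt"]
       .agg(borrowers="count", defaults="sum", default_rate="mean")
       .sort_values("default_rate")
)
print(summary)
```

In the toy data the group with 5 children has a 0% default rate, but it contains a single borrower, so the rate says very little.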

**Is there a correlation between family status and paying back on time?**

In [74]:
family_status_debt = credit_scoring.pivot_table(index= ['family_status'], values= ['debt'], aggfunc= 'mean').sort_values(('debt'), ascending=True)
# Check the family status data and paying back on time
family_status_debt
# Calculating default-rate based on family status

| family_status | debt |
|---|---|
| widow / widower | 0.06499 |
| divorced | 0.07173 |
| married | 0.075427 |
| civil partnership | 0.093462 |
| unmarried | 0.097709 |


**Conclusion**

Widowed (6.5%) and divorced (7.2%) borrowers have the lowest default rates, followed by married borrowers (7.5%). Borrowers in a civil partnership (9.3%) and unmarried borrowers (9.8%) default most often.

**Is there a correlation between income level and paying back on time?**

In [75]:
income_debt = credit_scoring.pivot_table(index= ['total_income'], values= ['debt'], aggfunc= 'mean').sort_values(('debt'), ascending=True)
# Check the income level data and paying back on time
income_debt
# Calculating default-rate per raw income value; this yields one group per distinct income, so the incomes are binned in the next cell

| total_income | debt |
|---|---|
| 24394 | 0.0 |
| 30066 | 0.0 |
| 30069 | 0.0 |
| 30070 | 0.0 |
| 30077 | 0.0 |
| ... | ... |
| 45805 | 1.0 |
| 45850 | 1.0 |
| 11984 | 1.0 |
| 15202 | 1.0 |


In [76]:
def income_cat(income):
    if income <= 30000:
        return 'E'
    elif income <= 50000:
        return 'D'
    elif income <= 200000:
        return 'C'
    elif income <= 1000000:
        return 'B'
    else:
        return 'A'
credit_scoring['total_income_category'] = credit_scoring['total_income'].apply(income_cat)
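
The same binning can be expressed with `pd.cut`, which assigns all rows at once instead of calling a function per row. This is a sketch under the assumption that the thresholds are identical to those in `income_cat`; the sample incomes are partly taken from the table above and partly made up to cover every category.

```python
import numpy as np
import pandas as pd

# Sample incomes: some from the output above, the rest hypothetical.
incomes = pd.Series([24394, 30000, 30066, 45805, 200000, 250000, 1500000])

# Bin edges mirror income_cat; intervals are right-inclusive, matching the `<=` checks.
bins = [-np.inf, 30000, 50000, 200000, 1000000, np.inf]
labels = ["E", "D", "C", "B", "A"]

categories = pd.cut(incomes, bins=bins, labels=labels)
print(categories.tolist())
```

Because `pd.cut` closes each interval on the right by default, an income of exactly 30000 lands in category E, just as it does in `income_cat`.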

In [77]:
credit_scoring.pivot_table(index= ['total_income_category'], values= ['debt'], aggfunc= 'mean').sort_values(('debt'), ascending=True)



| total_income_category | debt |
|---|---|
| C | 0.069625 |
| D | 0.075022 |
| E | 0.083942 |
| B | 0.090909 |


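Here we can see that 'total_income' is not a good indicator of whether someone will pay their loan on time. The default rate does not rise or fall steadily with income: it is lowest for category C (7.0%) and highest for category B (9.1%), with D (7.5%) and E (8.4%) in between. No borrower falls into category A.

**Conclusion**

Income level alone does not show a clear relationship with paying back on time. The categories differ by about two percentage points at most and follow no consistent order. Looking at income type rather than income level, entrepreneurs and students appeared most likely to pay their loans on time, while people who are unemployed or on paternity/maternity leave appeared most likely to default, although that breakdown is not shown in the tables above.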

**How does credit purpose affect the default rate?**

In [78]:
purpose_debt = credit_scoring.pivot_table(index= ['purpose'], values= ['debt'], aggfunc= 'mean').sort_values(('debt'), ascending=True)
purpose_debt# Check the percentages for default rate for each credit purpose and analyze them


| purpose | debt |
|---|---|
| Realestate | 0.072371 |
| wedding | 0.082866 |
| Education | 0.093053 |
| Car Purchase | 0.093371 |


Here we can see that real estate loans have the lowest default rate (7.2%), followed by wedding loans (8.3%). Education loans (9.3%) and car loans (9.3%) have the highest rates, and the two are nearly identical.

In [79]:
gender_debt = credit_scoring.pivot_table(index= ['gender'], values= ['debt'], aggfunc= 'mean').sort_values(('debt'), ascending=True)

gender_debt

| gender | debt |
|---|---|
| XNA | 0.0 |
| F | 0.070132 |
| M | 0.102621 |


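Comment: Based on the above data, women (7.0%) are more likely to pay their loans on time than men (10.3%). 'XNA' appears to be a placeholder for an unrecorded gender rather than a separate group, and its 0% rate most likely rests on very few records, so no conclusion should be drawn from it.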

# General Conclusion 

Looking through this risk-of-defaulting data, we were able to identify and fix many issues. There were missing values, which we either dropped or filled in with averages. There were duplicates, which we dropped. Some values were initially negative, and we converted them to positive integers so that the data could be kept. For the 'debt' column, we originally wanted to display 'Paid' versus 'Defaulted' labels, but we quickly realized that keeping the values as integers makes the calculations much easier.

In the end, we compared the default rate across several different classifications. The clearest differences appeared in only 3 categories:

- Family Status
- Loan Purpose
- Gender

Within these 3 categories, we found that a person is most likely to pay off their loan on time if they are female, if the loan was for real estate or a wedding, and if they are widowed or divorced.