<a id="goal"></a>
## Project Description and Goal


The project is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children affect the likelihood of defaulting on a loan. The bank already has some data on customers’ creditworthiness.

The report will be considered when building a **credit score** for a potential customer. **Credit scoring** is used to evaluate a potential borrower's ability to repay a loan.

## Step 1. Open the data file and study the general information

In [1]:
import pandas as pd

In [2]:
credit_scoring.head(5)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


In [3]:
credit_scoring.tail(5)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.61,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car
21524,2,-1984.507589,40,secondary education,1,married,0,F,employee,0,13127.587,to buy a car


In [4]:
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


## Conclusion

<font color="blue">
    
The data table has 12 columns and 21525 rows.<br>
Columns `days_employed` and `total_income` each have 2174 missing values.<br>
Column `days_employed` contains negative values, and some positive values are implausible: converted to years, they exceed any possible working life (for example, 340266 days is more than 900 years).
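
The plausibility check described above can be sketched on a few values copied from the `head()` output. This is only an illustration: the 60-year cap (`MAX_PLAUSIBLE_YEARS`) is an assumption made for this example, not a rule from the dataset.

```python
import pandas as pd

# Toy values copied from the head() output above; not the full dataset.
days_employed = pd.Series([-8437.673028, -4024.803754, 340266.072047])

# Convert the absolute number of days to years of employment.
years_employed = days_employed.abs() / 365

# Hypothetical cap on a plausible working life, chosen only for this sketch.
MAX_PLAUSIBLE_YEARS = 60
implausible = years_employed > MAX_PLAUSIBLE_YEARS

print(years_employed.round(1).tolist())
print(implausible.tolist())
```

Only the third value is flagged, which matches the observation that the large positive entries cannot be day counts.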

<a id="pre"></a>
## Step 2. Data preprocessing

### Processing missing values

Rename `dob_years` to `age` and `total_income` to `monthly_income`. Only the column names change here; the income values themselves are not converted:

In [5]:
# rename() touches only the two columns we want and works on current pandas versions,
# where set_axis(..., inplace=True) is no longer available
credit_scoring = credit_scoring.rename(columns={'dob_years': 'age', 'total_income': 'monthly_income'})

In [6]:
credit_scoring.describe()

Unnamed: 0,children,days_employed,age,education_id,family_status_id,debt,monthly_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


Convert negative values in the `days_employed` column to positive values:

In [7]:
credit_scoring.loc[credit_scoring['days_employed']<0 ,'days_employed']=-1 * credit_scoring['days_employed']    
credit_scoring['days_employed']  

0          8437.673028
1          4024.803754
2          5623.422610
3          4124.747207
4        340266.072047
             ...      
21520      4529.316663
21521    343937.404131
21522      2113.346888
21523      3112.481705
21524      1984.507589
Name: days_employed, Length: 21525, dtype: float64

In [8]:
def day(days):
    if days> 20000:
        return days/24
    return days
credit_scoring['days_employed']=credit_scoring['days_employed'].apply(day)
credit_scoring['days_employed']

0         8437.673028
1         4024.803754
2         5623.422610
3         4124.747207
4        14177.753002
             ...     
21520     4529.316663
21521    14330.725172
21522     2113.346888
21523     3112.481705
21524     1984.507589
Name: days_employed, Length: 21525, dtype: float64
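
The conversion above can also be written without `apply`. The following is a minimal sketch on toy values, assuming (as the notebook does) that entries above 20000 were recorded in hours; `HOURS_THRESHOLD` is a name introduced here for illustration.

```python
import pandas as pd

# Toy values taken from the output above (already positive), plus one missing value.
days = pd.Series([8437.673028, 340266.072047, None])

# Notebook assumption: values above 20000 were recorded in hours, not days.
HOURS_THRESHOLD = 20000

# Series.mask replaces the entries where the condition is True and keeps the rest.
converted = days.mask(days > HOURS_THRESHOLD, days / 24)

print(converted.tolist())
```

Missing values compare as `False` against the threshold, so they pass through unchanged and can still be filled later.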

Check the number and the share of missing values in each column:

In [9]:
credit_scoring.isnull().sum()

children               0
days_employed       2174
age                    0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
monthly_income      2174
purpose                0
dtype: int64

In [10]:
credit_scoring.isnull().sum()/len(credit_scoring)

children            0.000000
days_employed       0.100999
age                 0.000000
education           0.000000
education_id        0.000000
family_status       0.000000
family_status_id    0.000000
gender              0.000000
income_type         0.000000
debt                0.000000
monthly_income      0.100999
purpose             0.000000
dtype: float64

<font color="blue">
    
The missing values are in quantitative columns and make up about 10% of the rows, so it is better to fill them with the median or the mean than to drop them.<br>
The `age` column also contains 0, which is not plausible, so it has to be replaced with the median or the mean as well.<br>
Let's check whether these three columns have outliers!

In [11]:
num_columns = ['days_employed','monthly_income','age']
credit_scoring[num_columns].mean()

days_employed      4641.641176
monthly_income    26787.568355
age                  43.293380
dtype: float64

In [12]:
credit_scoring[num_columns].median()

days_employed      2194.220567
monthly_income    23202.870000
age                  42.000000
dtype: float64
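
To see why a mean above the median hints at high outliers, consider a toy series. The numbers below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy incomes: four typical values and one very high one.
incomes = pd.Series([20000, 22000, 23000, 25000, 360000])

mean_income = incomes.mean()      # pulled upward by the single large value
median_income = incomes.median()  # stays in the middle of the typical values

print(f"mean={mean_income:.0f}, median={median_income:.0f}")

# Filling a gap with the median keeps the imputed value close to a typical customer.
with_gap = pd.Series([20000, None, 23000, 25000, 360000])
filled = with_gap.fillna(with_gap.median())
print(filled.tolist())
```

A single extreme value moves the mean far from the bulk of the data, while the median barely changes.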

In all three columns the mean is greater than the median, which points to high outliers, so we use the median to fill in the missing values:

In [13]:
credit_scoring['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

In [14]:
credit_scoring['days_employed']= credit_scoring['days_employed'].fillna(credit_scoring['days_employed'].median())

In [15]:
credit_scoring_grouped_mean=credit_scoring.groupby('income_type')['monthly_income'].mean()
credit_scoring_grouped_mean

income_type
business                       32386.793835
civil servant                  27343.729582
employee                       25820.841683
entrepreneur                   79866.103000
paternity / maternity leave     8612.661000
retiree                        21940.394503
student                        15712.260000
unemployed                     21014.360500
Name: monthly_income, dtype: float64

In [16]:
credit_scoring_grouped_median=credit_scoring.groupby('income_type')['monthly_income'].median()
credit_scoring_grouped_median

income_type
business                       27577.2720
civil servant                  24071.6695
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18962.3180
student                        15712.2600
unemployed                     21014.3605
Name: monthly_income, dtype: float64

In [17]:
credit_scoring[credit_scoring['monthly_income'].isnull()]['income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64

In [18]:
# transform() returns a result aligned with the original index, so the assignment is safe
credit_scoring['monthly_income']=credit_scoring.groupby('income_type')['monthly_income'].transform(lambda x: x.fillna(x.median()))

In [19]:
credit_scoring['monthly_income'].isnull().sum()

0
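
The group-wise filling used above can be checked on a small frame. This is a sketch with made-up rows; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame with one missing income per income type.
df = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'monthly_income': [20000.0, 30000.0, None, 18000.0, None],
})

# Median of each income type, broadcast back to every row of that type.
group_median = df.groupby('income_type')['monthly_income'].transform('median')

# Fill each gap with the median of the row's own group.
df['monthly_income'] = df['monthly_income'].fillna(group_median)

print(df)
```

Each missing value receives the median of its own income type rather than the overall median.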

In [20]:
credit_scoring.loc[credit_scoring['age']==0 ,'age']= 42
credit_scoring['age'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51, 59, 29, 60, 55, 58, 71, 22, 73, 66,
       69, 19, 72, 70, 74, 75], dtype=int64)

The `gender` column contains the value 'XNA'. It appears in only one row, so we can drop it:

In [21]:
credit_scoring['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [22]:
credit_scoring=credit_scoring[credit_scoring['gender']!='XNA']
credit_scoring[credit_scoring['gender']=='XNA']

Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,monthly_income,purpose


In the `children` column we should fix the values -1 and 20:

In [23]:
credit_scoring['children'].value_counts()

 0     14148
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [24]:
credit_scoring.loc[credit_scoring['children']<0 ,'children']= -1* credit_scoring['children']
credit_scoring.loc[credit_scoring['children']==20 , 'children']=2
credit_scoring['children'].unique()

array([1, 0, 3, 2, 4, 5], dtype=int64)

In [25]:
credit_scoring.describe()

Unnamed: 0,children,days_employed,age,education_id,family_status_id,debt,monthly_income
count,21524.0,21524.0,21524.0,21524.0,21524.0,21524.0,21524.0
mean,0.479744,4394.549122,43.491358,0.817181,0.972542,0.080886,26435.764393
std,0.755539,5131.652276,12.218156,0.548092,1.420357,0.272667,15687.294175
min,0.0,24.141633,19.0,0.0,0.0,0.0,3306.762
25%,0.0,1025.593536,34.0,1.0,0.0,0.0,17247.3565
50%,0.0,2194.220567,42.0,1.0,0.0,0.0,22815.1035
75%,1.0,4779.672511,53.0,1.0,1.0,0.0,31287.232
max,5.0,18388.949901,75.0,4.0,4.0,1.0,362496.645


### Conclusion

<font color ="blue">
    
The `dob_years` and `total_income` columns were renamed to `age` and `monthly_income`.

In the `days_employed` column, negative values were made positive, and implausibly large values (above 20000, probably recorded in hours) were converted to days by dividing by 24.

The child counts -1 and 20, which are probably typos, were corrected to 1 and 2.

An age of 0 is either a typo or a sign that the customer did not want to give their age; it was replaced with the median (42).

There are missing values in the `days_employed` and `monthly_income` columns, possibly because customers forgot or were unwilling to provide this information.

In the `days_employed` column, the missing values were replaced with the column median.

In the `monthly_income` column, the missing values were filled with the median of the customer's income type, because each income type has a different income level.

In all three columns (`days_employed`, `monthly_income` and `age`) the mean is greater than the median, which points to high outliers, so the median was used for filling.

The table was checked once more: it has no missing values and no implausible values left.

### Data type replacement

We change the type of the `days_employed` column to integer, because the number of days a customer has worked does not need a fractional part. The same conversion is applied to `monthly_income`:

In [26]:
credit_scoring['days_employed']=credit_scoring['days_employed'].astype('int')

In [27]:
credit_scoring['monthly_income']=credit_scoring['monthly_income'].astype('int')

In [28]:
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21524 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21524 non-null  int64 
 1   days_employed     21524 non-null  int32 
 2   age               21524 non-null  int64 
 3   education         21524 non-null  object
 4   education_id      21524 non-null  int64 
 5   family_status     21524 non-null  object
 6   family_status_id  21524 non-null  int64 
 7   gender            21524 non-null  object
 8   income_type       21524 non-null  object
 9   debt              21524 non-null  int64 
 10  monthly_income    21524 non-null  int32 
 11  purpose           21524 non-null  object
dtypes: int32(2), int64(5), object(5)
memory usage: 2.6+ MB


### Conclusion

<font color ="blue">
    
The `days_employed` and `monthly_income` columns were converted to integers, because neither the number of days worked nor the income needs a fractional part.
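
One detail of this conversion is worth knowing: `astype('int')` drops the fractional part instead of rounding. The sketch below uses three values from the outputs above; rounding first is shown only as an alternative, not as what the notebook does.

```python
import pandas as pd

# Values copied from earlier outputs of the days_employed column.
days = pd.Series([8437.673028, 24.141633, 1984.507589])

# astype('int') truncates toward zero.
truncated = days.astype('int')

# Alternative (not used in the notebook): round to the nearest day first.
rounded = days.round().astype('int')

print(truncated.tolist())
print(rounded.tolist())
```

For this analysis the difference of at most one day is negligible, so either choice would be acceptable.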

### Processing duplicates

The `education` column has duplicate categories that differ only in letter case, so we convert them all to lowercase:

In [29]:
credit_scoring['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             667
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

In [30]:
credit_scoring['education']=credit_scoring['education'].str.lower()
credit_scoring['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             743
primary education        282
graduate degree            6
Name: education, dtype: int64

In [31]:
credit_scoring.duplicated().sum()

72

In [32]:
credit_scoring=credit_scoring.drop_duplicates()

In [33]:
credit_scoring.duplicated().sum()

0

In [34]:
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21452 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21452 non-null  int64 
 1   days_employed     21452 non-null  int32 
 2   age               21452 non-null  int64 
 3   education         21452 non-null  object
 4   education_id      21452 non-null  int64 
 5   family_status     21452 non-null  object
 6   family_status_id  21452 non-null  int64 
 7   gender            21452 non-null  object
 8   income_type       21452 non-null  object
 9   debt              21452 non-null  int64 
 10  monthly_income    21452 non-null  int32 
 11  purpose           21452 non-null  object
dtypes: int32(2), int64(5), object(5)
memory usage: 2.0+ MB


### Conclusion

<font color ="blue">

The `education` column had duplicate categories that differed only in letter case, so we converted them all to lowercase.

The table also contained 72 fully duplicated rows, which were dropped. The duplicates may come from a record being entered twice by mistake, or from a customer applying for a loan several times so that the same information was stored again.
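
The order of the two steps matters: rows that differ only in letter case are not counted as duplicates until the text is normalized. A minimal sketch with made-up rows follows; the `reset_index` call is an optional addition that the notebook itself does not perform.

```python
import pandas as pd

# Toy rows: the first two differ only in the letter case of `education`.
df = pd.DataFrame({
    'education': ['Secondary Education', 'secondary education', "BACHELOR'S DEGREE"],
    'age': [33, 33, 42],
})

before = df.duplicated().sum()   # case differences hide the duplicate

df['education'] = df['education'].str.lower()
after = df.duplicated().sum()    # the duplicate is now visible

# Optional: renumber the rows after dropping (not done in the notebook).
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(before, after, len(deduplicated))
```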

<a id="Categorizing"></a>
### Categorizing Data

We group income into levels to make the analysis easier:

In [35]:
def level_income(income):
    if income <=15000:
        return 'low'
    if income >15000 and income<=23000:
        return 'medium-low'
    if income >23000 and income<=30000:    
        return 'medium-high'
    return 'high'

In [36]:
credit_scoring['level_income']=credit_scoring['monthly_income'].apply(level_income)
credit_scoring['level_income']

0               high
1         medium-low
2        medium-high
3               high
4        medium-high
            ...     
21520           high
21521    medium-high
21522            low
21523           high
21524            low
Name: level_income, Length: 21452, dtype: object

In [37]:
credit_scoring['level_income'].value_counts()

medium-low     7230
high           5919
medium-high    4558
low            3745
Name: level_income, dtype: int64
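
The same binning can be expressed with `pd.cut`, which may be easier to adjust when thresholds change. This is an alternative sketch, not the notebook's method; the boundaries are the ones used in `level_income` above, and the test incomes are invented to probe the edges.

```python
import pandas as pd

# Invented incomes placed right at and just above each boundary.
incomes = pd.Series([15000, 15001, 23000, 23001, 30000, 30001])

# Intervals are right-closed by default: (-inf, 15000], (15000, 23000], ...
levels = pd.cut(
    incomes,
    bins=[-float('inf'), 15000, 23000, 30000, float('inf')],
    labels=['low', 'medium-low', 'medium-high', 'high'],
)

print(levels.astype(str).tolist())
```

Because the intervals are closed on the right, a boundary value such as 23000 falls into the lower level, which matches the `<=` comparisons in `level_income`.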

In [38]:
credit_scoring['children'].value_counts()

0    14089
1     4855
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

We group the number of children into the categories 0, 1, 2, and 3 or more, because there are few customers with 4 or 5 children. Merging them with the 3-children group gives a larger category, and analysis on more data is more reliable:

In [39]:
def children_grouped(children):
    if children ==0:
        return '0'
    if children ==1:
        return '1'
    if children ==2:
        return '2'
    return '3 or more'

In [40]:
credit_scoring['children_grouped']=credit_scoring['children'].apply(children_grouped)
credit_scoring['children_grouped']

0                1
1                1
2                0
3        3 or more
4                0
           ...    
21520            1
21521            0
21522            1
21523    3 or more
21524            2
Name: children_grouped, Length: 21452, dtype: object

In [41]:
credit_scoring['children_grouped'].value_counts()

0            14089
1             4855
2             2128
3 or more      380
Name: children_grouped, dtype: int64

The `purpose` column also contains similar items phrased differently, so we can group them under a common theme:

In [42]:
credit_scoring['purpose'].value_counts()

wedding ceremony                            791
having a wedding                            767
to have a wedding                           765
real estate transactions                    675
buy commercial real estate                  661
housing transactions                        652
buying property for renting out             651
transactions with commercial real estate    650
housing                                     646
purchase of the house                       646
purchase of the house for my family         638
construction of own property                635
property                                    633
transactions with my real estate            627
building a real estate                      624
buy real estate                             620
purchase of my own house                    620
building a property                         619
housing renovation                          607
buy residential real estate                 606
buying my own car                       

In [43]:
def theme(query):
    if 'wedding' in query: 
        return 'wedding'
    if 'car' in query: 
        return 'car'
    if 'educat' in query or 'university' in query: 
        return 'education'
    if 'commercial'in query  or 'rent' in query: 
        return 'commercial estate'
    if 'construction' in query or 'build' in query or 'renovation' in query or 'housing' in query:
        return 'housing'
    return 'purchase residential estate'

In [44]:
credit_scoring['purpose']=credit_scoring['purpose'].apply(theme)        
credit_scoring['purpose'].value_counts()

purchase residential estate    5065
car                            4306
education                      4013
housing                        3783
wedding                        2323
commercial estate              1962
Name: purpose, dtype: int64

Check that there is no missing value:

In [45]:
credit_scoring['purpose'].isnull().sum()

0
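
The keyword matching in `theme` depends on the order of the checks, which is easier to see when the rules are written as data. The sketch below is a table-driven restatement under the same keywords; the names `RULES` and `DEFAULT_THEME` are introduced here for illustration.

```python
# Ordered rules: the first rule with a matching keyword wins.
RULES = [
    (('wedding',), 'wedding'),
    (('car',), 'car'),
    (('educat', 'university'), 'education'),
    (('commercial', 'rent'), 'commercial estate'),
    (('construction', 'build', 'renovation', 'housing'), 'housing'),
]
DEFAULT_THEME = 'purchase residential estate'


def theme(query):
    """Return the first theme whose keyword occurs as a substring of the query."""
    for keywords, label in RULES:
        if any(keyword in query for keyword in keywords):
            return label
    return DEFAULT_THEME


# Purposes copied from the outputs above.
for purpose in ['to have a wedding', 'buying my own car', 'supplementary education',
                'buying property for renting out', 'housing renovation', 'property']:
    print(purpose, '->', theme(purpose))
```

Substring matching is simple but can misfire on unrelated words that happen to contain a keyword, so it is worth checking the resulting counts, as the notebook does.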

### Conclusion

<font color ="blue">
    
Some columns in the dataset were categorized to make them more appropriate for analysis:

1- Income levels were created and saved in a new column, `level_income`:

|level_income                  |numbers|
|------                        |------ |
|low <= 15000                  |3745   |
|15000 < medium-low <= 23000   |7230   |
|23000 < medium-high <= 30000  |4558   |
|high > 30000                  |5919   |

The thresholds were meant to make the groups roughly equal in size, although the resulting groups still range from 3745 to 7230 customers.


2- The `children` column was grouped into the categories 0, 1, 2, and 3 or more, because there are few customers with 4 or 5 children; merging them with the 3-children group makes the analysis of that category more reliable.

   
|children  |numbers|
|------    |------ |
|0         |14089  |
|1         |4855   |
|2         |2128   |
|3 or more |380    |  

3- The `purpose` column contained similar items phrased differently, so they were grouped under common themes.

|purpose                       |numbers|
|------                        |------ |
|purchase residential estate   |5065   |
|car                           |4306   |
|education                     |4013   |
|housing                       |3783   |
|wedding                       |2323   |
|commercial estate             |1962   |
    
As we can see, the largest group of borrowers by number of children is the group without children. By income, the largest group is medium-low (between 15,000 and 23,000). By loan purpose, the most common reason is purchasing residential real estate, followed by buying a car; the least common is commercial real estate.

<a id="question"></a>
## Step 3. Answer these questions

- **Is there a relation between having kids and repaying a loan on time?**

In [46]:
credit_scoring['children_grouped'].value_counts()

0            14089
1             4855
2             2128
3 or more      380
Name: children_grouped, dtype: int64

In [47]:
credit_scoring_grouped=credit_scoring.pivot_table(index='children_grouped', values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped
credit_scoring_grouped.sort_values('ratio', ascending=False)

children_grouped,count (debt),sum (debt),ratio
2,2128,202,9.492481
1,4855,445,9.165808
3 or more,380,31,8.157895
0,14089,1063,7.544893


<font color ="blue">
    
According to these figures, customers with one or two children have a higher default rate (9.2% and 9.5%) than customers with three or more children (8.2%) and customers without children (7.5%). It can therefore be inferred that the number of children is related to repaying a loan on time, although the relation is not linear. The group with three or more children is small (380 customers), so its rate is the least certain.
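
The rate table above can also be built with a plain `groupby` and named aggregation, which gives flat column names instead of the two-level header produced by `pivot_table`. This is an alternative sketch on made-up rows; the column name `defaults` is introduced here for illustration.

```python
import pandas as pd

# Toy data: `debt` is 1 if the customer defaulted, 0 otherwise.
df = pd.DataFrame({
    'children_grouped': ['0', '0', '0', '0', '1', '1', '2', '2'],
    'debt': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Count customers and defaults per group, then derive the default rate in percent.
summary = df.groupby('children_grouped')['debt'].agg(count='count', defaults='sum')
summary['ratio'] = summary['defaults'] / summary['count'] * 100

print(summary.sort_values('ratio', ascending=False))
```

Since `debt` only takes the values 0 and 1, the ratio is the group mean of `debt` multiplied by 100.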

- **Is there a relation between marital status and repaying a loan on time?**

In [48]:
credit_scoring['family_status'].value_counts()

married              12339
civil partnership     4149
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64

In [49]:
credit_scoring_grouped=credit_scoring.pivot_table(index='family_status', values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped.sort_values('ratio', ascending=False)

family_status,count (debt),sum (debt),ratio
unmarried,2810,274,9.75089
civil partnership,4149,388,9.351651
married,12339,931,7.545182
divorced,1195,85,7.112971
widow / widower,959,63,6.569343


<font color ="blue">
    
The results suggest that marital status is related to repaying a loan on time. Unmarried customers have the highest default rate (9.8%) and widows and widowers the lowest (6.6%). In other words, out of every 100 unmarried customers about ten default, while out of every 100 widowed customers fewer than seven do. Customers who are divorced or widowed have the lowest risk of default.
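
How much weight a difference between two rates deserves depends on the group sizes. As a rough, hedged check that is not part of the original analysis, the sketch below computes approximate 95% intervals with the normal approximation for a proportion, using counts copied from the table above. The helper `default_rate_interval` is a name introduced here.

```python
import math


def default_rate_interval(defaults, count, z=1.96):
    """Return (rate, low, high) using the normal approximation for a proportion."""
    rate = defaults / count
    half_width = z * math.sqrt(rate * (1 - rate) / count)
    return rate, rate - half_width, rate + half_width


# (defaults, customers) copied from the family_status table above.
groups = {
    'unmarried': (274, 2810),
    'widow / widower': (63, 959),
}

for name, (defaults, count) in groups.items():
    rate, low, high = default_rate_interval(defaults, count)
    print(f"{name}: {rate:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under this approximation the intervals for the two extreme groups do not overlap, which supports reading the gap between them as more than noise; a formal test would be needed for a stronger claim.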

- **Is there a relation between income level and repaying a loan on time?**

In [50]:
credit_scoring['level_income'].value_counts()

medium-low     7230
high           5919
medium-high    4558
low            3745
Name: level_income, dtype: int64

In [51]:
credit_scoring_grouped=credit_scoring.pivot_table(index='level_income', values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped.sort_values('ratio', ascending=False)

level_income,count (debt),sum (debt),ratio
medium-low,7230,627,8.672199
medium-high,4558,380,8.33699
low,3745,298,7.957276
high,5919,436,7.366109


<font color ="blue">
    
According to these figures, customers with a high income (more than 30,000) have the lowest default rate (7.4%). Low-income customers (up to 15,000) also have a comparatively low rate (8.0%); a possible explanation is that people with low incomes are more cautious about taking out and repaying loans. Middle-income customers (between 15,000 and 30,000) have the highest default rates, and within this group the medium-low level (between 15,000 and 23,000) carries the higher risk (8.7% versus 8.3%). The level of income is therefore related to repaying the loan.


- **How do different loan purposes affect on-time repayment of the loan?**

In [52]:
credit_scoring['purpose'].value_counts()

purchase residential estate    5065
car                            4306
education                      4013
housing                        3783
wedding                        2323
commercial estate              1962
Name: purpose, dtype: int64

In [53]:
credit_scoring_grouped=credit_scoring.pivot_table(index='purpose', values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped.sort_values('ratio', ascending=False)

purpose,count (debt),sum (debt),ratio
car,4306,403,9.359034
education,4013,370,9.220035
wedding,2323,186,8.006888
commercial estate,1962,151,7.696228
housing,3783,273,7.216495
purchase residential estate,5065,358,7.068115


<font color ="blue">

The results show that when the purpose of the loan is a car or education, the risk of default is noticeably higher than for a wedding or real estate. The default rate for car and education loans is above 9%, while the rate for weddings and the real estate categories is about 8% or lower.

In [54]:
credit_scoring_grouped=credit_scoring.pivot_table(index=['purpose','level_income'], values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped.sort_values('ratio', ascending=False)

purpose,level_income,count (debt),sum (debt),ratio
car,medium-high,873,90,10.309278
education,medium-low,1361,140,10.286554
car,medium-low,1489,147,9.872398
wedding,low,401,38,9.476309
education,medium-high,877,83,9.464082
wedding,medium-low,804,75,9.328358
car,high,1203,104,8.645054
education,low,726,61,8.402204
car,low,741,62,8.367072
commercial estate,high,553,46,8.318264


<font color ="blue">
When we look at the default rate by income level and loan purpose at the same time, we find that when the purpose is education or a car and the customer belongs to the low or middle income levels, the risk is higher (rate = 9.6%) than when the purpose is the same but the customer belongs to the high-income level (rate of about 8.4%). Analysing the rates by two criteria at once reveals patterns that the single-criterion tables hide.

In [55]:
credit_scoring_grouped=credit_scoring.pivot_table(index=['purpose','children_grouped'], values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped.sort_values('ratio', ascending=False)

purpose,children_grouped,count (debt),sum (debt),ratio
wedding,3 or more,36,5,13.888889
car,2,415,50,12.048193
education,2,418,47,11.244019
purchase residential estate,3 or more,91,10,10.989011
car,1,974,104,10.677618
education,1,876,90,10.273973
housing,2,376,37,9.840426
commercial estate,1,471,45,9.55414
wedding,1,534,51,9.550562
education,0,2642,229,8.667676


<font color ="blue">
Looking at the default rate by loan purpose and number of children together leads to an interesting observation. It is not surprising that car and education loans are high risk, but when the purpose is purchasing residential real estate and the customer has three or more children, the default rate also goes up (approximately 11%). A possible explanation is that these families buy larger, more expensive homes, while three or more children increase the household's running costs and reduce its ability to repay the loan regularly. The highest rate in the table (wedding, three or more children) is based on only 36 customers, so it should be read with caution.
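
When two criteria are combined, some groups become very small and their rates unstable. The sketch below flags such groups using the first four rows of the table above; the cut-off `MIN_GROUP_SIZE` is a hypothetical value chosen for illustration, and the column name `defaults` stands in for the `sum` column.

```python
import pandas as pd

# First four rows of the purpose / children_grouped table above.
summary = pd.DataFrame({
    'purpose': ['wedding', 'car', 'education', 'purchase residential estate'],
    'children_grouped': ['3 or more', '2', '2', '3 or more'],
    'count': [36, 415, 418, 91],
    'defaults': [5, 50, 47, 10],
})

summary['ratio'] = summary['defaults'] / summary['count'] * 100

# Hypothetical cut-off: groups below it are marked as too small to trust.
MIN_GROUP_SIZE = 100
summary['small_sample'] = summary['count'] < MIN_GROUP_SIZE

print(summary)
```

With this cut-off, both "3 or more" rows are flagged, so their high rates are better treated as hints than as firm findings.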

In [56]:
credit_scoring_grouped=credit_scoring.pivot_table(index=['purpose','family_status'], values='debt' , aggfunc=['count','sum'])
credit_scoring_grouped['ratio']=credit_scoring_grouped['sum']/credit_scoring_grouped['count']*100
credit_scoring_grouped.sort_values('ratio', ascending=False)

purpose,family_status,count (debt),sum (debt),ratio
education,civil partnership,404,60,14.851485
car,unmarried,637,82,12.872841
car,civil partnership,434,51,11.751152
education,unmarried,577,62,10.745234
purchase residential estate,civil partnership,440,45,10.227273
commercial estate,civil partnership,195,19,9.74359
car,widow / widower,218,20,9.174312
commercial estate,widow / widower,105,9,8.571429
housing,unmarried,584,50,8.561644
commercial estate,unmarried,271,23,8.487085


<font color ="blue">
The results of this section lead to an interesting point. Not surprisingly, the first four rows of the table relate to cars and education, as in the previous tables. More interesting is that when the purpose of the loan is purchasing residential real estate and the marital status is civil partnership, the risk of default is high enough to place this group in the top five by default rate.

## Conclusion

<font color ="blue">

**1- There is a relationship between having kids and repaying a loan on time.  
The default rate by number of children is shown below:**
 
|children_grouped| count| sum | ratio|        
|----------------|------|-----|------|	
|2               |2128  |  202|9.492481|
|1               |4855  |  445|9.165808|
|3 or more       |380   |   31|8.157895|
|0               |14089 | 1063|7.544893|

According to these figures, customers with one or two children have a higher default rate (9.2% and 9.5%) than customers with three or more children (8.2%) and customers without children (7.5%). It can therefore be inferred that the number of children is related to repaying a loan on time, although the relation is not linear.


**2- There is a relationship between marital status and repaying a loan on time.  
The default rate by `family_status` is shown below:**
   
|family_status     | count| sum | ratio|        
|----------------  |------|-----|------|	
|unmarried         |2810  | 274 |9.750890|
|civil partnership |4149  | 388 |9.351651|
|married           |12339 | 931 |7.545182|
|divorced          |1195  | 85  |7.112971|
|widow / widower   |959   | 63  |6.569343|

As the results show, marital status appears to affect the regular repayment of loan installments. Unmarried borrowers have the highest default rate (9.8%), and widows and widowers have the lowest (6.6%). In other words, out of every 100 unmarried borrowers about ten default, while out of every 100 widowed borrowers only about seven do. Divorced and widowed borrowers have the lowest risk of default.


  **3- There is a relationship between income level and default rate.
  The default rate by `level_income` is shown below:**
  

| level_income    | count| sum | ratio|        
|---------------- |------|-----|------|	
|medium-low       |5775  | 494 |8.554113|
| medium-high     |6014  | 513 |8.530096|
| low             |3745  | 298 |7.957276|
| high            |5918  | 436 |7.367354| 
    
According to the statistics, people with high incomes (more than 30,000) have the lowest default rate (7.4%). Low-income people (15,000 or less) also have a comparatively low rate (8.0%); a plausible explanation is that people with low incomes are more cautious about taking out and repaying loans. Middle-income people (between 15,000 and 30,000) have the highest default rates, and within this group the lower middle class (between 15,000 and 22,000) is marginally riskier (8.55% versus 8.53%). The level of income therefore appears to affect loan repayment.

  **4- Different loan purposes affect on-time repayment of the loan.
  The default rate by loan purpose is shown below:**
  
| purpose	                 | count| sum | ratio|        
|----------------            |------|-----|------|	
|car                         |4306  | 403 |9.359034|
| education                  |4013  | 370 |9.220035|
| wedding                    |2323  | 186 |8.006888|
| commercial estate          |1962  | 151 |7.696228| 
| housing                    |3783  | 273 |7.216495| 
| purchase residential estate|5065  | 358 |7.068115|   
    

The results show that when the purpose of the loan is a car or education, the risk of default is noticeably higher than for loans taken out for a wedding or for real estate. The default rate for car and education loans is above 9%, while the rate for wedding and real estate loans is about 8% or lower.


## Step 4. General conclusion

<font color ="blue">
    
### [Goal](#goal)

The goal was to analyze borrowers' risk of default: to find out whether a customer’s marital status, number of children, income, and loan purpose have an impact on whether they will default on a loan.


### [Data Preprocessing](#pre)
In order to work with appropriate and valid data, the following tasks were performed: 

  - Processing missing values 
  - Type replacement 
  - Processing duplicates 

### [Categorizing Data](#Categorizing)

Some columns in the dataset were categorized to make them more appropriate for analysis:

1- Income levels were created and saved in a new column, `level_income`:
    
|level_income               |numbers|
|------                     |------ |
|Low<=15000                 |3745   |
|15000<Medium-low<=22000    |5775   |
|22000<Medium-high<=30000   |6014   |
|high >30000                |5918   |
    
The thresholds were chosen to keep the number of people in each group approximately equal.
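
One way to produce such levels is `pd.cut` with the thresholds from the table above. This is a sketch, not necessarily the method used earlier in the notebook; the sample values are the first five `total_income` entries from the dataset preview.

```python
import pandas as pd

# First five total_income values from the dataset preview.
total_income = pd.Series([40620.102, 17932.802, 23341.752, 42820.568, 25378.572])

# Bin edges follow the table above; right=True makes each interval
# closed on the right, e.g. 15000 < medium-low <= 22000.
bins = [float("-inf"), 15000, 22000, 30000, float("inf")]
labels = ["low", "medium-low", "medium-high", "high"]

level_income = pd.cut(total_income, bins=bins, labels=labels, right=True)
print(level_income.tolist())
```

Applied to the full column, `level_income.value_counts()` would give the group sizes to compare against the table.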


2- The `children` column was grouped into four categories (0, 1, 2, and 3 or more). There are few records with 4 or 5 children, so they were combined with the 3-children records, because an analysis based on more data is more reliable.
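
The grouping rule can be sketched as a small helper. The function name `group_children` is introduced here for illustration, and the sketch assumes the child counts are already cleaned, non-negative integers.

```python
def group_children(n_children):
    """Map a child count to one of four groups: '0', '1', '2', '3 or more'."""
    # Assumes n_children is a non-negative integer (already cleaned).
    if n_children >= 3:
        return "3 or more"
    return str(n_children)


# children values from the first five rows of the dataset preview
children = [1, 1, 0, 3, 0]
children_grouped = [group_children(n) for n in children]
print(children_grouped)
```

With the DataFrame loaded, the same helper could be applied as `credit_scoring['children'].apply(group_children)`.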

   
|children  |numbers|
|------    |------ |
|0         |14089  |
|1         |4855   |
|2         |2128   |
|3 or more |380    |  

3- The `purpose` column contains similar items expressed in different ways, so they were grouped under common themes:
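
A simple way to collapse free-text purposes into themes is keyword matching. The keyword rules below are illustrative assumptions, not the notebook's actual mapping; in particular, how entries are split between `housing` and `purchase residential estate` is a guess.

```python
# Illustrative keyword rules (assumptions), checked in order; first match wins.
PURPOSE_RULES = [
    ("wedding", "wedding"),
    ("car", "car"),
    ("educat", "education"),
    ("commercial", "commercial estate"),
    ("hous", "housing"),
    ("estate", "purchase residential estate"),
    ("property", "purchase residential estate"),
]


def group_purpose(purpose):
    """Return the first category whose keyword occurs in the purpose text."""
    text = purpose.lower()
    for keyword, category in PURPOSE_RULES:
        if keyword in text:
            return category
    return "other"  # fallback for entries that match no rule


# Purpose strings taken from the dataset preview.
for p in ["car purchase", "to have a wedding", "supplementary education",
          "housing transactions", "property"]:
    print(f"{p!r} -> {group_purpose(p)}")
```

Any entry that falls into `other` signals a wording the rules do not cover yet.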
    
|purpose                       |numbers|
|------                        |------ |
|purchase residential estate   |5065   |
|car                           |4306   |
|education                     |4013   |
|housing                       |3783   |
|wedding                       |2323   |
|commercial estate             |1962   |    
    
As the tables show, the largest group by number of children is borrowers without children. By income, the largest group is medium-high (between 22,000 and 30,000). By loan purpose, the most common reason is purchasing residential estate, followed by buying a car; the least common is commercial estate.

After completing the four steps related to data preprocessing, it is time to analyze the results.

### [Key Findings](#question)

**1- There is a relationship between the number of children and repaying a loan on time.  
The default rate by number of children is shown below:**
 
                                
|children_grouped| count| sum | ratio|        
|----------------|------|-----|------|	
|2               | 2128 |  202|9.492481|
|1               |4855  |  445|9.165808|
|3 or more       |380   |   31|8.157895|
|0               |14089 | 1063|7.544893|

    
According to the statistics, people with one or two children have a higher default rate than people with three or more children or with no children: 9.2% and 9.5% versus 8.2% and 7.5%. It can therefore be inferred that the number of children affects on-time repayment, although the effect is not linear.



  **2- There is a relationship between marital status and repaying a loan on time.
   The default rate by `family_status` is shown below:**
   
    
|family_status     | count| sum | ratio|        
|----------------  |------|-----|------|	
|unmarried         |2810  | 274 |9.750890|
|civil partnership |4149  | 388 |9.351651|
| married          |12339 |	931 |7.545182|
|divorced          |1195  | 85  |7.112971|    
| widow / widower  |959   | 63  |6.569343|
    
As the results show, marital status appears to affect the regular repayment of loan installments. Unmarried borrowers have the highest default rate (9.8%), and widows and widowers have the lowest (6.6%). In other words, out of every 100 unmarried borrowers about ten default, while out of every 100 widowed borrowers only about seven do. Divorced and widowed borrowers have the lowest risk of default.



  **3- There is a relationship between income level and default rate.
  The default rate by `level_income` is shown below:**
  
    
| level_income    | count| sum | ratio|        
|---------------- |------|-----|------|	
|medium-low       |5775  | 494 |8.554113|
| medium-high     |6014  | 513 |8.530096|
| low             |3745  | 298 |7.957276|
| high            |5918  | 436 |7.367354|    
  

According to the statistics, people with high incomes (more than 30,000) have the lowest default rate (7.4%). Low-income people (15,000 or less) also have a comparatively low rate (8.0%); a plausible explanation is that people with low incomes are more cautious about taking out and repaying loans. Middle-income people (between 15,000 and 30,000) have the highest default rates, and within this group the lower middle class (between 15,000 and 22,000) is marginally riskier (8.55% versus 8.53%). The level of income therefore appears to affect loan repayment.


  **4- Different loan purposes affect on-time repayment of the loan.
  The default rate by loan purpose is shown below:**
  
| purpose	                 | count| sum | ratio|        
|----------------            |------|-----|------|	
|car                         |4306  | 403 |9.359034|
| education                  |4013  | 370 |9.220035|
| wedding                    |2323  | 186 |8.006888|
| commercial estate          |1962  | 151 |7.696228| 
| housing                    |3783  | 273 |7.216495| 
| purchase residential estate|5065  | 358 |7.068115|   
    
The results show that when the purpose of the loan is a car or education, the risk of default is noticeably higher than for loans taken out for a wedding or for real estate. The default rate for car and education loans is above 9%, while the rate for wedding and real estate loans is about 8% or lower.


### Final conclusions and Recommendations

In general, the bank is advised to be more careful in the following cases:

  1. Loan applicants with one or two children.
  2. Applicants who are unmarried or in a civil partnership.
  3. Loans taken out to buy a car or to pay for education.
  4. Applicants whose income is between 15,000 and 30,000 Euros.
  5. Applicants with three or more children who apply for a home loan.
  6. Applicants in a civil partnership who borrow to buy a house.

For these cases, a more thorough credit assessment and more reliable collateral are recommended.
