# Analyzing borrowers’ risk of defaulting



# Contents <a id='back'></a>

* [Introduction](#intro)
    * [Goal](#goal)
    * [Stages](#stages)
* [Data review](#data_review)
    * [Data exploration](#data_exploration)
* [Data transformation](#data_transformation)
    * [education](#education)
    * [children](#children)
    * [days_employed](#days_employed)
    * [dob_years](#dob_years)
    * [family_status](#family_status)
    * [gender](#gender)
    * [income_type](#income_type)
    * [Duplicates](#duplicates)
* [Working with missing values](#missing_values)
    * [Restoring missing values in `total_income`](#restoring_total_income)
    * [Restoring values in `days_employed`](#restoring_days_employed)
* [Findings](#end)

## Introduction <a id='intro'></a>
The project is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

The report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

### Goal<a id='goal'></a>
Test 4 hypotheses:
1. Is there a connection between having kids and repaying a loan on time?
2. Is there a connection between marital status and repaying a loan on time?
3. Is there a connection between income level and repaying a loan on time?
4. How do different loan purposes affect on-time loan repayment?

### Stages <a id='stages'></a>
Data on customers’ creditworthiness is stored in the file `/datasets/credit_scoring_eng.csv`. There is no information about the quality of the data, so we will need to explore it before testing the hypotheses.

First, we'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, we will try to account for the most critical problems.

## Data review <a id='data_review'></a>



Load the credit scoring data with pandas and explore it.

In [83]:
import pandas as pd

In [84]:
df=pd.read_csv('/datasets/credit_scoring_eng.csv')

### Data exploration <a id='data_exploration'></a>

Let's explore the data through its description and its first 10 rows.

In [85]:
df.describe()


Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


In [86]:
df.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


The table contains 12 columns and 21525 rows.

Let's obtain general information with one command.


In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan


The columns store different data types: int64, float64 and object.
No issues were found in the column names.


The number of non-null values differs between columns. This means the data contains missing values, specifically in the `days_employed` and `total_income` columns.
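
The per-column count of missing values can also be read directly, without comparing non-null counts by eye. The following is a minimal sketch on a made-up toy table (the values are illustrative, not taken from the bank's data):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up for illustration).
toy = pd.DataFrame({
    'children': [0, 1, 2, 0],
    'days_employed': [-120.5, np.nan, 340000.0, np.nan],
    'total_income': [21000.0, np.nan, 18000.0, np.nan],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Share of missing values in each column (the mean of a boolean mask).
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real table, the same two calls on `df` should list `days_employed` and `total_income` as the only columns with missing values.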


In [88]:
df_filtered=df[df['days_employed'].isna()]
df_filtered

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


The missing values in the `days_employed` and `total_income` columns look symmetrical. But we cannot be sure, because the table was filtered only by the `days_employed` column.

Let's count the rows where the values are missing in both columns.

In [89]:
df_tot_mis_values=df[(df['days_employed'].isna())&(df['total_income'].isna())]
len(df_tot_mis_values)

2174

**Intermediate conclusion**\
The number of rows in the filtered table, 2174, matches the number of missing values in each column (21525 - 19351 = 2174).\
The missing values are symmetrical: they appear simultaneously in the `days_employed` and `total_income` columns.
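
Another way to confirm the symmetry is to compare the two missing-value masks row by row. This is a sketch on toy data; the column names match the dataset, the values are invented:

```python
import numpy as np
import pandas as pd

# Toy columns with the same missing-value pattern in both (made-up values).
toy = pd.DataFrame({
    'days_employed': [-120.5, np.nan, 340000.0, np.nan],
    'total_income': [21000.0, np.nan, 18000.0, np.nan],
})

days_missing = toy['days_employed'].isna()
income_missing = toy['total_income'].isna()

# True when every row is missing in both columns or present in both.
same_rows = bool((days_missing == income_missing).all())

# Rows where exactly one of the two values is missing (XOR of the masks).
only_one_missing = int((days_missing ^ income_missing).sum())

print(same_rows, only_one_missing)
```

If `only_one_missing` is 0 on the real data, no row has a value in one column and a gap in the other.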


Let's check what percentage of the data is missing.

The missing data are related to work: days employed and income. So the income type should be taken into consideration when filling the missing data.\
Let's check whether the identifying characteristic `income_type` is also missing in the rows with missing values.\
Then we will compare the distribution of income types in the rows with missing values with the distribution in the whole dataset.



In [90]:
print('The percentage of all missing values is:', f'{len(df_tot_mis_values)/len(df):.2%}')
print()
print(df_tot_mis_values['income_type'].value_counts(normalize=True))
print()
print(df['income_type'].value_counts(normalize=True))

The percentage of all missing values is: 10.10%

employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64

employee                       0.516562
business                       0.236237
retiree                        0.179141
civil servant                  0.067782
unemployed                     0.000093
entrepreneur                   0.000093
student                        0.000046
paternity / maternity leave    0.000046
Name: income_type, dtype: float64


The distribution of `income_type` is similar in the rows with missing values and in the whole dataset. The only difference is that the rows with missing values contain no `unemployed`, `paternity / maternity leave` or `student` clients, categories that are very rare in the whole dataset as well.
This means that the income type is not the reason for the missing values.
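
To compare two distributions without scrolling between printed outputs, they can be placed side by side. A minimal sketch, assuming two toy series in place of `df_tot_mis_values['income_type']` and `df['income_type']`:

```python
import pandas as pd

# Made-up category values standing in for the real income types.
all_rows = pd.Series(['employee', 'employee', 'business', 'retiree', 'student', 'employee'])
missing_rows = pd.Series(['employee', 'business', 'retiree'])

# Align both distributions on the category labels; categories absent
# from one side become 0 instead of NaN.
comparison = pd.concat(
    [
        missing_rows.value_counts(normalize=True).rename('missing_rows'),
        all_rows.value_counts(normalize=True).rename('all_rows'),
    ],
    axis=1,
).fillna(0)

# Difference in shares between the two tables.
comparison['difference'] = comparison['missing_rows'] - comparison['all_rows']
print(comparison)
```

Small values in the `difference` column indicate that the rows with missing values are distributed like the rest of the data.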

**Possible reasons for missing values in data**

We will check the distributions of `education` and `dob_years` to search for a pattern.



In [91]:
print(df_tot_mis_values['education'] .str.lower().value_counts(normalize=True))
print()
print(df['education'].str.lower().value_counts(normalize=True))
print()
print(df_tot_mis_values['dob_years'].value_counts(normalize=True))
print()
print(df['dob_years'].value_counts(normalize=True))

secondary education    0.708372
bachelor's degree      0.250230
some college           0.031739
primary education      0.009660
Name: education, dtype: float64

secondary education    0.707689
bachelor's degree      0.244367
some college           0.034564
primary education      0.013101
graduate degree        0.000279
Name: education, dtype: float64

34    0.031739
40    0.030359
31    0.029899
42    0.029899
35    0.029439
36    0.028979
47    0.027139
41    0.027139
30    0.026679
28    0.026219
57    0.025759
58    0.025759
54    0.025299
38    0.024839
56    0.024839
37    0.024379
52    0.024379
39    0.023459
33    0.023459
50    0.023459
51    0.022999
45    0.022999
49    0.022999
29    0.022999
43    0.022999
46    0.022079
55    0.022079
48    0.021159
53    0.020239
44    0.020239
60    0.017939
61    0.017479
62    0.017479
64    0.017019
32    0.017019
27    0.016559
23    0.016559
26    0.016099
59    0.015639
63    0.013339
25    0.010580
24    0.009660
66    0.009200
6

**Intermediate conclusion**

The distribution of `family_status` has not changed.\
The distribution of `education` has changed slightly: primary education makes up about 1.0% of the rows with missing values versus 1.3% of the whole dataset, and the rows with missing values contain no graduate degree.
The distribution of `dob_years` has shifted slightly as well. In the rows with missing values, age 34 accounts for about 3.2%, while in the whole dataset it accounts for about 2.8%, a difference of about 0.4 percentage points. We will take this into consideration too when we replace the missing values.
    

**Conclusions**

We cannot identify a clear pattern behind the missing values.
They affect only 10% of the data. Therefore, we will fill them according to income type and education.


Next, we will process all the columns before filling the missing values.



## Data transformation <a id='data_transformation'></a>

We will process all columns separately by counting their unique values, correcting them and rechecking them.

### `education`<a id='education'></a>



In [92]:
df.education.unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In the `education` column, some values contain uppercase letters. We will transform all values to lowercase using `.str.lower()`.

In [93]:
df['education']=df['education'].str.lower()

In [94]:
df['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

### `children` <a id='children'></a>


In [95]:
df['children'].value_counts(normalize=True)

 0     0.657329
 1     0.223833
 2     0.095470
 3     0.015331
 20    0.003531
-1     0.002184
 4     0.001905
 5     0.000418
Name: children, dtype: float64

In the `children` column, we have two strange values: -1 and 20. \
-1 cannot be a real number of children, and 20 could be a typing mistake for 0 or 2. \
Because both values appear far less frequently than the rest of the data, we will replace them with the median value.


In [96]:
children_median=df['children'].median()
print(children_median)
df['children']=df['children'].replace(-1,children_median)
df['children']=df['children'].replace(20,children_median)

0.0


In [97]:
print(df['children'].unique())
print(df['children'].value_counts(normalize=True))

[1. 0. 3. 2. 4. 5.]
0.0    0.663043
1.0    0.223833
2.0    0.095470
3.0    0.015331
4.0    0.001905
5.0    0.000418
Name: children, dtype: float64
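
The replacement above turns the column into floats because the median is returned as `0.0`. A possible variation, shown here on a made-up toy series, casts the median to an integer first and replaces both values in one call so the column keeps its integer dtype:

```python
import pandas as pd

# Toy series with the two implausible values (-1 and 20); values are made up.
children = pd.Series([1, 0, 20, -1, 3, 0, 0])

# The median of this toy series is 0; casting it keeps the column integer-typed.
median_children = int(children.median())

# Replace both implausible values in a single call.
cleaned = children.replace({-1: median_children, 20: median_children})
print(cleaned.tolist())
```

This is only a sketch of an alternative; the notebook's later outputs reflect the float version.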


### `days_employed` <a id='days_employed'></a>


In [98]:
print(df.days_employed.unique())
print()
df.days_employed.value_counts()

[-8437.67302776 -4024.80375385 -5623.42261023 ... -2113.3468877
 -3112.4817052  -1984.50758853]



-327.685916     1
-1580.622577    1
-4122.460569    1
-2828.237691    1
-2636.090517    1
               ..
-7120.517564    1
-2146.884040    1
-881.454684     1
-794.666350     1
-3382.113891    1
Name: days_employed, Length: 19351, dtype: int64

We have two sorts of problematic data:
* negative values - these could come from a technical issue in the submission of the values. We will convert all negative values to positive values using `abs()`.
* missing values - we will deal with them later on.


In [99]:
df.days_employed=abs(df['days_employed'])

In [100]:
df.days_employed.unique()

array([8437.67302776, 4024.80375385, 5623.42261023, ..., 2113.3468877 ,
       3112.4817052 , 1984.50758853])

### `dob_years` <a id='dob_years'></a>


In [101]:
df.dob_years.value_counts(normalize=True)

35    0.028664
40    0.028293
41    0.028200
34    0.028014
38    0.027782
42    0.027735
33    0.026992
39    0.026620
31    0.026016
36    0.025784
44    0.025412
29    0.025319
30    0.025087
48    0.024994
37    0.024948
50    0.023879
43    0.023833
32    0.023693
49    0.023600
28    0.023368
45    0.023089
27    0.022904
56    0.022625
52    0.022485
47    0.022300
54    0.022253
46    0.022067
58    0.021417
57    0.021370
53    0.021324
51    0.020813
59    0.020627
55    0.020581
26    0.018955
60    0.017515
25    0.016585
61    0.016492
62    0.016353
63    0.012497
64    0.012311
24    0.012265
23    0.011800
65    0.009013
66    0.008502
22    0.008502
67    0.007758
21    0.005157
0     0.004692
68    0.004599
69    0.003949
70    0.003020
71    0.002695
20    0.002369
72    0.001533
19    0.000650
73    0.000372
74    0.000279
75    0.000046
Name: dob_years, dtype: float64


Less than 1% of the values are equal to 0. It is the lowest age in the data, and the next lowest is 19, so 0 is not a plausible age. The distribution also looks slightly unusual after the age of 65. We will delete the rows where the age is 0.

In [102]:
df.drop(df.loc[df['dob_years']==0].index,inplace=True)

In [103]:
df['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51, 59, 29, 60, 55, 58, 71, 22, 73, 66,
       69, 19, 72, 70, 74, 75])

### `family_status` <a id='family_status'></a>

In [104]:
df['family_status'].unique()

array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

We did not detect any issues in this column.


In [105]:
df['family_status'].value_counts()

married              12331
civil partnership     4156
unmarried             2797
divorced              1185
widow / widower        955
Name: family_status, dtype: int64

### `gender` <a id='gender'></a>


In [106]:
df.gender.value_counts()

F      14164
M       7259
XNA        1
Name: gender, dtype: int64

In this column, we have one unidentified value, `'XNA'`. It appears in just one row. Therefore, we will erase this row using `drop()`. The index will be reset later, when the duplicates are dropped.


In [107]:
df.drop(df[df['gender']=='XNA'].index,inplace=True)

In [108]:
df.gender.unique()

array(['F', 'M'], dtype=object)

### `income_type` <a id='income_type'></a>

In [109]:
df.income_type.value_counts()

employee                       11064
business                        5064
retiree                         3836
civil servant                   1453
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [110]:
df[df['income_type']=='paternity / maternity leave'].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
20845,2.0,3296.759962,39,secondary education,1,married,0,F,paternity / maternity leave,1,8612.661,car


For `paternity / maternity leave`, we have only one row. We checked it and the client's gender is `F`. Therefore, we will replace this value with `maternity leave`.

In [111]:
df['income_type']=df['income_type'].replace('paternity / maternity leave','maternity leave')

In [112]:
print(df['income_type'].unique())
df['income_type'].value_counts()

['employee' 'retiree' 'business' 'civil servant' 'unemployed'
 'entrepreneur' 'student' 'maternity leave']


employee           11064
business            5064
retiree             3836
civil servant       1453
unemployed             2
entrepreneur           2
student                1
maternity leave        1
Name: income_type, dtype: int64

### Duplicates  <a id='duplicates'></a>
Now let's see the duplicates in our data.

In [113]:
print(df.duplicated().sum())
df[df.duplicated()]

71


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0.0,,41,secondary education,1,married,0,F,employee,0,,purchase of the house for my family
3290,0.0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
4182,1.0,,34,bachelor's degree,0,civil partnership,1,F,employee,0,,wedding ceremony
4851,0.0,,60,secondary education,1,civil partnership,1,F,retiree,0,,wedding ceremony
5557,0.0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0.0,,64,secondary education,1,married,0,F,retiree,0,,supplementary education
21032,0.0,,60,secondary education,1,married,0,F,retiree,0,,to become educated
21132,0.0,,47,secondary education,1,married,0,F,employee,0,,housing renovation
21281,1.0,,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


It looks like all duplicates have missing values. We can check for duplicates in:
* a filtered table containing only the rows with missing values in `days_employed` and `total_income`;
* a filtered table that excludes those rows.

In [114]:
no_miss_values=df.dropna(subset=('days_employed','total_income')).duplicated().sum()
no_miss_values

0

In [115]:
df_tot_mis_values=df[(df['days_employed'].isna())&(df['total_income'].isna())]
# Count duplicates among the rows with missing values (no dropna here, it would empty the table)
df_tot_mis_values.duplicated().sum()

71

We can conclude that the duplicates come from the rows with missing values. We can drop them, because we will fill those missing values from existing ones and the rows would remain duplicated.
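
Rows with missing values can count as duplicates because `DataFrame.duplicated()` treats two `NaN` values in the same column as equal. A minimal sketch on invented rows:

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 are identical, including the missing income (made-up values).
toy = pd.DataFrame({
    'dob_years': [41, 41, 58, 41],
    'total_income': [np.nan, np.nan, 21000.0, 25000.0],
    'purpose': ['wedding', 'wedding', 'car', 'wedding'],
})

# duplicated() marks every repeat of an earlier row; NaN matches NaN here.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# Drop the repeats and renumber the rows.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_dups, len(deduplicated))
```

Once income is missing, rows that share all remaining attributes become indistinguishable, which is consistent with every duplicate found here having missing values.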

In [116]:
df=df.drop_duplicates().reset_index(drop=True)

In [117]:
df.duplicated().sum()

0


We detected some issues with the data:

* Case-variant duplicates in the `education` column, which we converted to lowercase.
* Implausible values (-1 and 20) in the `children` column. Less than 1% of the data was changed.
* Negative values in `days_employed` were changed to positive values. The 10% of missing values have not been filled yet.
* In the `dob_years` column, we had 0 as an age, which represented less than 1% of the data. We dropped those rows.
* In the `family_status` column, no issues were found.
* In the `gender` column, there was a single `XNA` value, which we removed.
* In the `income_type` column, we changed the value `paternity / maternity leave` to `maternity leave` according to the client's gender.
* We dropped 71 duplicates, all of which had missing values in `days_employed` and `total_income`.


# Working with missing values<a id='missing_values'></a>

To process the missing values in the `days_employed` and `total_income` columns, we will check the mean and median in multiple categories to understand which grouping fills the missing values most accurately.



## Restoring missing values in `total_income`<a id='restoring_total_income'></a>
We will try to fill the missing values in the `total_income` column by categories.

First, we will create a new age group column using the function `group_age`.


In [118]:
def group_age(dob_years):
    
    if dob_years < 30:
        return '19-29'
    elif dob_years < 40:
        return '30-39'
    elif dob_years < 50:
        return '40-49'
    elif dob_years< 60:
        return '50-59'
    elif dob_years< 70:
        return '60-69'
    else:
        return '70+'

In [119]:
df['age_group'] = df['dob_years'].apply(group_age)

In [120]:
df['age_group'].value_counts(normalize=True)

30-39    0.265174
40-49    0.250749
50-59    0.218106
19-29    0.148885
60-69    0.109170
70+      0.007915
Name: age_group, dtype: float64
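
The same age groups could also be produced without a row-by-row function, using `pd.cut`. This is a sketch of an alternative rather than the approach used in this notebook; the bin edges mirror `group_age` and the ages are made up:

```python
import pandas as pd

# A few made-up ages, including the boundary values of each group.
ages = pd.Series([19, 29, 30, 45, 59, 60, 69, 70, 75])

# Left-closed bins: [0, 30), [30, 40), ..., [70, inf), matching group_age.
bins = [0, 30, 40, 50, 60, 70, float('inf')]
labels = ['19-29', '30-39', '40-49', '50-59', '60-69', '70+']

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

`right=False` makes each bin include its lower edge, so 30 falls into `30-39` just as it does with the `dob_years < 40` test.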

We will check the mean and median of total income for each category, using the rows where total income is not missing.
Then we will choose the appropriate categories and the value to fill with - mean or median.

Let's check the mean and median of total income grouped by income type.

In [121]:
total_income_type=df.groupby('income_type').agg({'total_income':['mean','median']})
total_income_type



Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,mean,median
income_type,Unnamed: 1_level_2,Unnamed: 2_level_2
business,32397.307219,27563.0285
civil servant,27361.316126,24083.5065
employee,25824.679592,22815.1035
entrepreneur,79866.103,79866.103
maternity leave,8612.661,8612.661
retiree,21939.310393,18969.149
student,15712.26,15712.26
unemployed,21014.3605,21014.3605


Let's check the mean and the median of total income grouped by education.

In [125]:
total_by_education=df.groupby('education').agg({'total_income':['mean','median']})
total_by_education


Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,mean,median
education,Unnamed: 1_level_2,Unnamed: 2_level_2
bachelor's degree,33172.428387,28054.531
graduate degree,27960.024667,25161.5835
primary education,21144.882211,18741.976
secondary education,24600.353617,21839.4075
some college,29035.057865,25608.7945


Let's check the mean and the median of total income grouped by age group.

In [126]:
total_by_age=df.groupby('age_group').agg({'total_income':['mean','median']})
total_by_age

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,mean,median
age_group,Unnamed: 1_level_2,Unnamed: 2_level_2
19-29,25531.501098,22735.911
30-39,28312.479963,24667.528
40-49,28551.375635,24764.229
50-59,25811.700327,22203.0745
60-69,23242.812818,19817.44
70+,20125.658331,18751.324


Let's check the mean and the median of total income grouped by family status.

In [127]:
total_by_family=df.groupby('family_status').agg({'total_income':['mean','median']})
total_by_family

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,mean,median
family_status,Unnamed: 1_level_2,Unnamed: 2_level_2
civil partnership,26702.249322,23195.636
divorced,27202.683563,23584.9695
married,27045.38353,23377.708
unmarried,26943.601742,23139.404
widow / widower,23006.808776,20523.267


Let's check the mean and the median of total income grouped by number of children.

In [129]:
total_by_children=df.groupby('children').agg({'total_income':['mean','median']})
total_by_children

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,mean,median
children,Unnamed: 1_level_2,Unnamed: 2_level_2
0.0,26425.873208,23033.33
1.0,27405.559686,23661.403
2.0,27489.198728,23136.1155
3.0,29366.910652,25191.619
4.0,27289.829647,24981.634
5.0,27268.84725,29816.2255


In all the categories that we checked, the mean and the median of total income differ, and the mean is higher than the median in most groups, which suggests that the mean is pulled up by high incomes. Because we did not find a pattern explaining the missing values, and because the income distribution differs from one category to another, we will fill with the median, grouped by income type, education and age group.
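
The grouped-median fill can be sketched on toy data as follows. The grouping columns here are reduced to two for brevity, and the fallback to a coarser grouping is an illustrative assumption for groups that have no known income at all:

```python
import numpy as np
import pandas as pd

# Made-up rows: the single student has no known income in their group.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree', 'student'],
    'education': ['secondary'] * 6,
    'total_income': [20000.0, 30000.0, np.nan, 15000.0, np.nan, np.nan],
})

# Median per fine-grained group, broadcast back to every row of that group.
fine = toy.groupby(['income_type', 'education'])['total_income'].transform('median')

# Median per coarser group, used only where the fine group has no data.
coarse = toy.groupby('education')['total_income'].transform('median')

toy['total_income_filled'] = toy['total_income'].fillna(fine).fillna(coarse)
print(toy)
```

`transform('median')` returns a series aligned with the original rows, which is why it can be passed straight to `fillna`.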

In [130]:
df['total_income'] = df['total_income'].fillna(df.groupby(['income_type','education','age_group'])['total_income'].transform('median'))



Let's check the missing values in the data.


In [131]:
df['total_income'].isna().sum()

3

We still have 3 missing values. Let's look at them.

In [132]:
df[df['total_income'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
1296,1.0,,70,primary education,3,civil partnership,1,F,employee,0,,transactions with commercial real estate,70+
5907,0.0,,58,bachelor's degree,0,married,0,M,entrepreneur,0,,buy residential real estate,50-59
8095,0.0,,64,primary education,3,civil partnership,1,F,civil servant,0,,to have a wedding,60-69


It looks like we cannot fill these values with the three-way grouping because no income values exist for employees with primary education who are 70+, entrepreneurs with a bachelor's degree who are 50-59 years old, or civil servants with primary education who are 60-69 years old.
So we will fill the remaining missing values using age and education only. The table below shows the median total income by age group and education; the fill itself uses the mean total income grouped by `education` and exact age (`dob_years`).

In [134]:
df.groupby(['age_group','education']).agg({'total_income':'median'})


Unnamed: 0_level_0,Unnamed: 1_level_0,total_income
age_group,education,Unnamed: 2_level_1
19-29,bachelor's degree,25623.604
19-29,primary education,25036.25625
19-29,secondary education,20518.483
19-29,some college,22167.237
30-39,bachelor's degree,28401.157
30-39,graduate degree,18187.3015
30-39,primary education,19546.341
30-39,secondary education,22177.715
30-39,some college,28266.089
40-49,bachelor's degree,29484.586


In [135]:
df['total_income'] = df['total_income'].fillna(df.groupby(['education','dob_years'])['total_income'].transform('mean'))


Let's check for missing values.

In [136]:
df.isna().sum()

children               0
days_employed       2093
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
age_group              0
dtype: int64

##  Restoring values in `days_employed`<a id='restoring_days_employed'></a>

We will not fill the missing values in `days_employed` because we will not use this column to answer the four questions in our goal. We will drop the column from our data.


In [137]:
df.drop('days_employed', inplace=True, axis=1)

The duplicates were already dropped and the index reset earlier, so let's look at the resulting table and check again for missing values.

In [138]:
df

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
0,1.0,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49
1,1.0,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39
2,0.0,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39
3,3.0,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39
4,0.0,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59
...,...,...,...,...,...,...,...,...,...,...,...,...
21347,1.0,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions,40-49
21348,0.0,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car,60-69
21349,1.0,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property,30-39
21350,3.0,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car,30-39


In [139]:
df.isna().sum()

children            0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
age_group           0
dtype: int64

Now we have no duplicates or missing values in our data. Let's proceed with categorising our data to answer our goals.

## Categorization of data

Our goals are to answer these questions:
* Is there a connection between having kids and repaying a loan on time? We will use the `children` and `debt` columns.
* Is there a connection between marital status and repaying a loan on time? We will categorise the family status as single or with partner.
* Is there a connection between income level and repaying a loan on time? We will group the total income into ranges.
* How do different loan purposes affect on-time loan repayment? We will create a function that reduces each loan purpose to its main category.

Let's keep in mind what the values of `debt` mean:
* the value 0 means that the client had no debt on loan repayment, i.e. repaid on time;
* the value 1 means that the client had a debt on loan repayment, i.e. did not repay on time.

Let's check all unique values for our categories.

In [140]:
print(df.debt.unique())
print(df.children.unique())
print(df.family_status.unique())
print(df.purpose.unique())

[0 1]
[1. 0. 3. 2. 4. 5.]
['married' 'civil partnership' 'widow / widower' 'divorced' 'unmarried']
['purchase of the house' 'car purchase' 'supplementary education'
 'to have a wedding' 'housing transactions' 'education' 'having a wedding'
 'purchase of the house for my family' 'buy real estate'
 'buy commercial real estate' 'buy residential real estate'
 'construction of own property' 'property' 'building a property'
 'buying a second-hand car' 'buying my own car'
 'transactions with commercial real estate' 'building a real estate'
 'housing' 'transactions with my real estate' 'cars' 'to become educated'
 'second-hand car purchase' 'getting an education' 'car'
 'wedding ceremony' 'to get a supplementary education'
 'purchase of my own house' 'real estate transactions'
 'getting higher education' 'to own a car' 'purchase of a car'
 'profile education' 'university education'
 'buying property for renting out' 'to buy a car' 'housing renovation'
 'going to university']



As mentioned above, we will create two functions:
* `assign_partners`, which identifies whether a client has a partner or is single;
* `assign_purpose`, which simplifies the `purpose` column.



In [146]:
def assign_partners(family_status):
    # Married clients and clients in a civil partnership count as having a partner
    if family_status == 'married' or family_status == 'civil partnership':
        return 'with partner'
    else:
        return 'single'

In [147]:
df['partner_status']=df['family_status'].apply(assign_partners)

In [148]:
df['partner_status'].value_counts()

with partner    16419
single           4933
Name: partner_status, dtype: int64

In [149]:
def assign_purpose(purpose):
    # Housing is checked first; each keyword needs its own `in` test
    if 'hous' in purpose or 'estate' in purpose or 'propert' in purpose:
        return 'house'
    elif 'wedd' in purpose:
        return 'wedding'
    elif 'car' in purpose:
        return 'car'
    elif 'educat' in purpose or 'uni' in purpose:
        return 'education'
    else:
        return 'other'

In [150]:
df['purpose_category'] = df['purpose'].apply(assign_purpose)

In [151]:
df['purpose_category'].value_counts()

house        10763
car           4284
education     3995
wedding       2310
Name: purpose_category, dtype: int64
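
The same stem-based grouping can also be written as a lookup table, which keeps the keywords in one place and makes it easy to test a few strings by hand. This is only a sketch: `PURPOSE_KEYWORDS` and `categorize_purpose` are hypothetical names, not part of the notebook.

```python
# Hypothetical keyword table; order matters because the first match wins.
PURPOSE_KEYWORDS = [
    ('house', ('hous', 'estate', 'propert')),
    ('wedding', ('wedd',)),
    ('car', ('car',)),
    ('education', ('educat', 'uni')),
]


def categorize_purpose(purpose):
    """Return the first category whose keyword occurs in the purpose text."""
    for category, keywords in PURPOSE_KEYWORDS:
        if any(keyword in purpose for keyword in keywords):
            return category
    return 'other'


print(categorize_purpose('going to university'))  # education
print(categorize_purpose('buying my own car'))    # car
print(categorize_purpose('vacation'))             # other
```

Checking an input that should fall through to `'other'` is a quick way to confirm that no branch matches unconditionally.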

Next, we create a function that groups total income into income levels using these thresholds:
* low income (`Low income`): total income per month of 17000 or less
* lower-middle income (`low-midel income`): total income per month above 17000 and up to 22000
* upper-middle income (`upper-midel income`): total income per month above 22000 and up to 31000
* upper income (`Upper income`): total income per month above 31000


These thresholds are based on the minimum, the 25%, 50% and 75% values, and the maximum of `total_income`; see below.

In [152]:
df['total_income'].describe()

count     21352.000000
mean      26482.517903
std       15747.112790
min        3306.762000
25%       17199.970250
50%       22933.643500
75%       31655.503250
max      362496.645000
Name: total_income, dtype: float64

In [153]:
def income_level(total_income):
    if total_income <= 17000:
        return 'Low income'
    elif 17000 < total_income <= 22000:
        return 'low-midel income'
    elif 22000 < total_income <= 31000:
        return 'upper-midel income'
    else:
        return 'Upper income'

In [154]:
df['income_level']=df['total_income'].apply(income_level)

In [155]:
df['income_level'].value_counts()

upper-midel income    6268
Upper income          5632
Low income            5199
low-midel income      4253
Name: income_level, dtype: int64
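
As an alternative to a row-by-row function, the same binning could be done with `pd.cut`. The sketch below assumes the same thresholds and reuses the notebook's label spelling; the sample values are the minimum and quartiles from `describe()` plus the boundary value 17000, not a slice of `df`.

```python
import pandas as pd

# Bin edges follow the thresholds in the text; labels reuse the notebook's spelling.
bins = [0, 17000, 22000, 31000, float('inf')]
labels = ['Low income', 'low-midel income', 'upper-midel income', 'Upper income']

# Sample incomes for illustration only.
sample = pd.Series([3306.762, 17000.0, 17199.97, 22933.64, 31655.50])

# right=True (the default) closes each bin on the right, matching the <= tests.
levels = pd.cut(sample, bins=bins, labels=labels)
print(levels.tolist())
```

Because the result is an ordered categorical, later `groupby` output would list the levels from lowest to highest instead of alphabetically.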

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

In [156]:
kids_debt=pd.DataFrame(df.groupby('debt')['children'].value_counts())
kids_debt

| debt | children | count |
| --- | --- | --- |
| 0 | 0.0 | 13076 |
| 0 | 1.0 | 4351 |
| 0 | 2.0 | 1845 |
| 0 | 3.0 | 301 |
| 0 | 4.0 | 37 |
| 0 | 5.0 | 9 |
| 1 | 0.0 | 1067 |
| 1 | 1.0 | 441 |
| 1 | 2.0 | 194 |
| 1 | 3.0 | 27 |


In [157]:
df.groupby('children')['debt'].mean().map('{:.2%}'.format)

children
0.0    7.54%
1.0    9.20%
2.0    9.51%
3.0    8.23%
4.0    9.76%
5.0    0.00%
Name: debt, dtype: object

**Conclusion**

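Between 8% and 10% of individuals who have 1 to 4 children cannot repay their loans, compared with 7.54% of individuals with no children. It is therefore easier to repay loans when you have fewer children. The 0% rate for 5 children rests on only 9 borrowers, which is too few to draw a conclusion.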
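
A default rate is easier to judge when the group size sits next to it, as the 5-children group shows. One way to get both in a single table is a named aggregation; the sketch below uses a tiny made-up frame (`toy`) rather than the real `df`.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would use df instead.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 5],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

# count = group size, sum = number of defaults, mean = default rate.
summary = toy.groupby('children')['debt'].agg(
    borrowers='count', defaults='sum', default_rate='mean'
)
print(summary)
```

Groups with very few borrowers can then be flagged or merged before their rates are compared.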


**Is there a correlation between family status and paying back on time?**

In [158]:
status_debt=pd.DataFrame(df.groupby(['debt'])['partner_status'].value_counts())
status_debt

| debt | partner_status | count |
| --- | --- | --- |
| 0 | with partner | 15106 |
| 0 | single | 4513 |
| 1 | with partner | 1313 |
| 1 | single | 420 |


In [159]:
df.groupby(['partner_status',])['debt'].mean().map('{:.2%}'.format)

partner_status
single          8.51%
with partner    8.00%
Name: debt, dtype: object

**Conclusion**

8.51% of single individuals cannot repay their loans, compared with 8.00% of individuals who are married or in a civil partnership. This suggests that it is slightly easier to repay a loan when you have a partner, although the difference is only about half a percentage point.
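
The notebook itself does not test whether this gap is statistically meaningful. As a rough check, a two-proportion z-test (a technique added here, not used in the original analysis) can be computed from the counts in the `partner_status` table using only the standard library.

```python
import math

# Counts from the notebook's partner_status table.
single_defaults, single_total = 420, 4933
partner_defaults, partner_total = 1313, 16419

p_single = single_defaults / single_total
p_partner = partner_defaults / partner_total

# Pooled default rate under the null hypothesis of equal rates.
pooled = (single_defaults + partner_defaults) / (single_total + partner_total)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / single_total + 1 / partner_total))

z = (p_single - p_partner) / std_err
# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f'single: {p_single:.2%}, with partner: {p_partner:.2%}')
print(f'z = {z:.2f}, p = {p_value:.3f}')
```

An absolute z value below roughly 1.96 would mean the difference is not significant at the 5% level, in which case the conclusion about partners should be read as tentative.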



**Is there a correlation between income level and paying back on time?**

In [160]:
income_lvl_debt=pd.DataFrame(df.groupby(['debt'])['income_level'].value_counts())
income_lvl_debt

| debt | income_level | count |
| --- | --- | --- |
| 0 | upper-midel income | 5721 |
| 0 | Upper income | 5230 |
| 0 | Low income | 4790 |
| 0 | low-midel income | 3878 |
| 1 | upper-midel income | 547 |
| 1 | Low income | 409 |
| 1 | Upper income | 402 |
| 1 | low-midel income | 375 |


In [161]:
df.groupby(['income_level',])['debt'].mean().map('{:.2%}'.format)

income_level
Low income            7.87%
Upper income          7.14%
low-midel income      8.82%
upper-midel income    8.73%
Name: debt, dtype: object

**Conclusion**

The default rate is similar across all income levels, ranging from 7.14% to 8.82%, and it does not rise or fall steadily with income. We can conclude that income level does not affect the ability to repay the loan.
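
Listing the levels from lowest to highest income, rather than alphabetically, makes it easier to see whether the rate trends with income. The sketch below rebuilds the rates from the counts in the `income_level` table; the `counts` and `shares` names are illustrative.

```python
import pandas as pd

# Counts copied from the notebook's income_level table, ordered by income.
counts = pd.DataFrame(
    {
        'repaid': [4790, 3878, 5721, 5230],
        'defaulted': [409, 375, 547, 402],
    },
    index=['Low income', 'low-midel income', 'upper-midel income', 'Upper income'],
)

# Row-wise shares: each row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares['defaulted'].map('{:.2%}'.format))
```

Read in this order, the rate goes up from the lowest level to the two middle levels and then down again for the highest level, so there is no steady trend.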


**How does credit purpose affect the default rate?**

In [162]:
df.groupby('debt')['purpose_category'].value_counts()

debt  purpose_category
0     house               9984
      car                 3884
      education           3625
      wedding             2126
1     house                779
      car                  400
      education            370
      wedding              184
Name: purpose_category, dtype: int64

In [163]:
df.groupby('purpose_category')['debt'].mean().map('{:.2%}'.format)

purpose_category
car          9.34%
education    9.26%
house        7.24%
wedding      7.97%
Name: debt, dtype: object

**Conclusion**

House loans are the most common and have the lowest default rate (7.24%); wedding loans are the least common, with a default rate of 7.97%.
Car and education loans have the highest default rates: about 9% of borrowers in each of these categories could not repay their loans.



# General Conclusion 

To answer our four questions, we had to:
* replace missing values in total income with the median for groups defined by income type, education and age, then derive an income level from the result.
* drop the days employed column, which had missing values, because it was not relevant to our analysis.
* drop the duplicated rows that existed due to the missing values.
* process the data in education to avoid duplicated values.
* replace negative and implausibly high values of the number of children with the median value.
* drop rows with an age of 0 and group the other ages into 10-year ranges.
* group family status into single or with partner to help answer our second question.
* group the loan purposes into a few simpler categories.


We can conclude that in our data:
* about 8% of all borrowers (1,733 of 21,352) could not repay their loan.
* having no children helps with repaying the loan.
* it is slightly easier to repay the loan with a partner.
* there is no correlation between income level and the ability to repay the loan.
* it is easier to repay a loan for a long-term purpose (a house) than for a shorter-term one (a car or education).
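
The overall default rate can be checked directly from the counts in the `purpose_category` table. This is plain arithmetic on the published counts; the dictionary names are illustrative.

```python
# Counts from the purpose_category table (debt == 0 and debt == 1).
repaid = {'house': 9984, 'car': 3884, 'education': 3625, 'wedding': 2126}
defaulted = {'house': 779, 'car': 400, 'education': 370, 'wedding': 184}

total_defaults = sum(defaulted.values())
total_borrowers = total_defaults + sum(repaid.values())
overall_rate = total_defaults / total_borrowers

print(f'{total_defaults} of {total_borrowers} borrowers defaulted ({overall_rate:.2%})')
```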


We recommend re-examining the whole data set (without the missing values), particularly for the third question. To better assess how the ability to repay the loan depends on income level, the income ranges could be changed and the days employed could be taken into account.
