# Borrower reliability study — banking data analysis

## Data overview

In [1]:
# importing libraries
import pandas as pd

In [2]:
# reading csv file and assigning the result to variable
data = pd.read_csv('https://code.s3.yandex.net/datasets/data.csv')

In [3]:
# displaying the head of the DataFrame
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу


- The `days_employed` column contains negative values, which is impossible for a number of days worked. The data upload probably went wrong.

In [4]:
# displaying information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**Conclusion**

- There are gaps in the `days_employed` and `total_income` columns (19351 non-null values out of 21525 in each). It is likely that these borrowers could not confirm their work experience and income. As a first approach, the gaps will be filled with the median for the borrower's education level and occupation (`income_type`);
- Convert the values in the `education` column to lowercase;
- Convert the `total_income` and `days_employed` columns to integer type.
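
As a quick arithmetic check of how large the gaps are, the counts printed by `data.info()` above can be turned into a share of missing rows. This is only a sketch that reuses the two numbers from that output; the variable names are illustrative.

```python
# Row counts copied from the data.info() output above.
total_rows = 21525
non_null_rows = 19351  # same count for days_employed and total_income

missing_rows = total_rows - non_null_rows
missing_share = missing_rows / total_rows

print(missing_rows)                    # number of rows with a gap
print('{:.2%}'.format(missing_share))  # share of rows with a gap
```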

## Data preprocessing

### Addressing gaps

In [5]:
# counting the gaps
data.isnull().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [6]:
# checking for null values
data[data['total_income'].isnull()].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,среднее,1,гражданский брак,1,M,пенсионер,0,,сыграть свадьбу
26,0,,41,среднее,1,женат / замужем,0,M,госслужащий,0,,образование
29,0,,63,среднее,1,Не женат / не замужем,4,F,пенсионер,0,,строительство жилой недвижимости
41,0,,50,среднее,1,женат / замужем,0,F,госслужащий,0,,сделка с подержанным автомобилем
55,0,,54,среднее,1,гражданский брак,1,F,пенсионер,1,,сыграть свадьбу


In [7]:
# calculating the percentage of gaps in days_employed 
nulls_percentage_data_employed = data['days_employed'].isnull().sum() / len(data)
display('{:.2%}'.format(nulls_percentage_data_employed))

'10.10%'

In [8]:
# calculating the percentage of gaps in total_income
nulls_percentage_total_income = data['total_income'].isnull().sum() / len(data)
display('{:.2%}'.format(nulls_percentage_total_income))

'10.10%'

In [9]:
# filling in the gaps in total_income with the median value grouped by level of education and occupation
total_income_median = data.dropna()
total_income_median = total_income_median.groupby(['education', 'income_type'])['total_income'].median()
data['median_total_income'] = data.apply(lambda row: total_income_median[row['education']][row['income_type']], axis=1)
data['total_income'] = data['total_income'].fillna(data['median_total_income'])

In [10]:
# there are a lot of negative values in the days_employed column
# maybe the data was uploaded incorrectly
# replace negative numbers with positive ones
data['days_employed'] = abs(data['days_employed'])
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,median_total_income
0,1,8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья,166164.078024
1,1,4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля,136818.115423
2,0,5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья,133399.107243
3,3,4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование,136818.115423
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу,114483.373934


In [11]:
# filling in the gaps in days_employed with the median value grouped by level of education and occupation
days_employed_median = data.dropna()
days_employed_median = days_employed_median.groupby(['education', 'income_type'])['days_employed'].median()
data['median_days_employed'] = data.apply(lambda row: days_employed_median[row['education']][row['income_type']], axis=1)
data['days_employed'] = data['days_employed'].fillna(data['median_days_employed'])

# deleting columns
del data['median_total_income']
del data['median_days_employed']

In [12]:
# counting gaps
data.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

In [13]:
# displaying information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


- The gaps in the `days_employed` and `total_income` columns occur in the same rows. There may be reasons for this, such as a lack of income and work experience;
- The gaps were filled with medians grouped by education level and occupation.
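
The group-median fill can also be sketched more compactly with `groupby(...).transform('median')`, which returns one median per row and avoids the temporary helper columns. The example below is an alternative illustration on made-up toy values, not the code used in this notebook.

```python
import pandas as pd

# Toy data; the income values are made up for illustration only.
toy = pd.DataFrame({
    'education': ['среднее', 'среднее', 'среднее', 'высшее', 'высшее', 'высшее'],
    'income_type': ['сотрудник'] * 6,
    'total_income': [100000.0, 120000.0, None, 200000.0, 260000.0, None],
})

# Median of each (education, income_type) group, aligned to the original rows.
# Missing values are skipped when the median is computed.
group_median = toy.groupby(['education', 'income_type'])['total_income'].transform('median')

# Fill only the gaps; existing values are left untouched.
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

One thing to keep in mind with either approach: a group in which every value is missing has no median, so its gaps would remain unfilled.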

### Addressing anomalies and corrections

In [14]:
# checking for anomalies
data['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

The number of children cannot be negative, and 20 stands out strongly from the rest of the sample. Both values are most likely typos, so we replace -1 with 1 and 20 with 2.

In [15]:
# changing 20 to 2
data.loc[data['children'] == 20, 'children'] = 2

# changing -1 to 1
data.loc[data['children'] == -1, 'children'] = 1

# checking for anomalies
data['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

In [16]:
# checking for anomalies
data['dob_years'].value_counts()

35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

We leave the 101 rows with age 0 unchanged, since age is not used in any of the research questions.

In [17]:
# checking for anomalies
data['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

The `gender` column contains one row with the value XNA. Since it occurs only once and most borrowers are female, we replace it with F.

In [18]:
# changing 'XNA' to 'F' in the gender column
data.loc[data['gender'] == 'XNA', 'gender'] = 'F'
# let's check that everything is fine
data['gender'].value_counts()

F    14237
M     7288
Name: gender, dtype: int64

There are anomalies in the data, but there are few of them, so they are unlikely to affect the results of the study. The most obvious ones have been corrected.
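
To put a number on "few", the anomaly counts from the `value_counts()` outputs above can be expressed as shares of the 21525 rows. This is a small arithmetic sketch; the dictionary keys are just labels chosen for the example.

```python
# Counts copied from the value_counts() outputs above.
total_rows = 21525
anomalies = {
    'children (-1 or 20)': 47 + 76,
    'dob_years (0)': 101,
    'gender (XNA)': 1,
}

# Share of all rows affected by each anomaly.
shares = {name: count / total_rows for name, count in anomalies.items()}

for name, share in shares.items():
    print('{}: {:.2%}'.format(name, share))
```

Each anomaly affects well under 1% of the rows.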

### Addressing data types

In [19]:
# changing the total_income type to int
data['total_income'] = data['total_income'].astype(int)

In [20]:
# changing the days_employed type to int
data['days_employed'] = data['days_employed'].astype('int')

In [21]:
# displaying information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null int64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null int64
purpose             21525 non-null object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


The `total_income` and `days_employed` columns were converted to `int`, since precision to the hundredths is not needed.
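
Note that `astype(int)` drops the fractional part rather than rounding. The sketch below shows the difference on three income values taken from the first rows of the table; rounding first is shown only as an alternative, not as what this notebook does.

```python
import pandas as pd

# Three total_income values from the first rows of the dataset.
income = pd.Series([253875.639453, 112080.014102, 145885.952297])

truncated = income.astype(int)        # drops the fractional part
rounded = income.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

For incomes of this size the difference of at most one unit does not matter for the analysis.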

### Addressing duplicates

Values in the `education` column differ in letter case. All text columns need to be converted to the same case; otherwise, implicit duplicates may appear.

In [22]:
# lowercasing data 
data['education'] = data['education'].str.lower()
data['family_status'] = data['family_status'].str.lower()
data['income_type'] = data['income_type'].str.lower()
data['purpose'] = data['purpose'].str.lower()

# displaying the first rows of the DataFrame
data.head(3)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,8437,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875,покупка жилья
1,1,4024,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080,приобретение автомобиля
2,0,5623,33,среднее,1,женат / замужем,0,M,сотрудник,0,145885,покупка жилья


In [23]:
# checking for duplicates
data.duplicated().sum()

54

In [24]:
# removing duplicates
data.drop_duplicates(inplace=True)

In [25]:
# checking for duplicates
data.duplicated().sum()

0

- There were few duplicates (54 rows), and they were removed;
- Explicit duplicates were removed with the `drop_duplicates` method. The index was not reset, so the remaining rows keep their original labels;
- To expose implicit duplicates, the text columns were converted to lowercase beforehand;
- Duplicates may have appeared because one person submitted several applications or because an error occurred when the data was uploaded.
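
The effect of letter case on duplicate detection can be illustrated with a tiny made-up table: two rows that differ only in the case of `education` are not counted as duplicates until the column is lowercased. The `reset_index(drop=True)` call at the end is optional and simply renumbers the remaining rows.

```python
import pandas as pd

# Toy rows; the first two differ only in the letter case of 'education'.
toy = pd.DataFrame({
    'education': ['среднее', 'Среднее', 'высшее'],
    'income_type': ['сотрудник', 'сотрудник', 'сотрудник'],
    'total_income': [145885, 145885, 253875],
})

duplicates_before = toy.duplicated().sum()

# After lowercasing, the implicit duplicate becomes an explicit one.
toy['education'] = toy['education'].str.lower()
duplicates_after = toy.duplicated().sum()

# Drop the duplicate and renumber the index from zero.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(duplicates_before, duplicates_after, len(deduplicated))
```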

### Creating dictionary dataframes and decomposing the original dataframe

In [26]:
# creating a dataframe with each unique value from education corresponding to education_id
df_education = data[['education', 'education_id']]

# removing duplicates
df_education = df_education.drop_duplicates().reset_index(drop = True)

In [27]:
# creating a dataframe with each unique value from family_status corresponding to family_status_id
df_family_status = data[['family_status', 'family_status_id']]

# removing duplicates
df_family_status = df_family_status.drop_duplicates().reset_index(drop=True)

In [28]:
# removing the education column from the data table, leaving only education_id
del data['education']

In [29]:
# removing the family_status column from data table, leaving only family_status_id
del data['family_status']

Two dictionaries were created from the `education` and `family_status` columns and their corresponding IDs. The text columns were then removed from the main table, leaving only the IDs.
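
The idea behind the decomposition can be sketched on a toy table: the text label is stored once per ID in a lookup table, and a merge on the ID restores it when needed. The names `education_lookup`, `facts` and `restored` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy table with a text column and its numeric ID.
toy = pd.DataFrame({
    'education': ['высшее', 'среднее', 'среднее'],
    'education_id': [0, 1, 1],
    'debt': [0, 0, 1],
})

# Lookup table: one row per unique (ID, label) pair.
education_lookup = toy[['education_id', 'education']].drop_duplicates().reset_index(drop=True)

# Main table without the text column.
facts = toy.drop(columns='education')

# The label can be restored at any time by merging on the ID.
restored = facts.merge(education_lookup, on='education_id', how='left')
print(restored)
```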

### Categorization of income

In [30]:
# creating a function that assigns categories based on salary ranges to borrowers
def income_category(income):
    if income <= 30000:
        return 'E'
    if income <= 50000:
        return 'D'
    if income <= 200000:
        return 'C'
    if income <= 1000000:
        return 'B'
    return 'A'

In [31]:
# creating a total_income_category column with salary categories
data['total_income_category'] = data['total_income'].apply(income_category)

# displaying the head of the DataFrame
data.head(3)

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,gender,income_type,debt,total_income,purpose,total_income_category
0,1,8437,42,0,0,F,сотрудник,0,253875,покупка жилья,B
1,1,4024,36,1,0,F,сотрудник,0,112080,приобретение автомобиля,C
2,0,5623,33,1,0,M,сотрудник,0,145885,покупка жилья,C


Based on salary ranges, a `total_income_category` column was created for each borrower.
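
The same ranges can also be expressed with `pd.cut`, which bins a numeric column in one call. The sketch below is an alternative to the `income_category` function, assuming the same boundaries; by default each bin includes its right edge, so 30000 still falls into category E.

```python
import numpy as np
import pandas as pd

# Sample incomes chosen to hit the category boundaries.
income = pd.Series([25000, 30000, 30001, 253875, 1500000])

# Bin edges match the thresholds in income_category; bins are (left, right].
bins = [-np.inf, 30000, 50000, 200000, 1000000, np.inf]
labels = ['E', 'D', 'C', 'B', 'A']

category = pd.cut(income, bins=bins, labels=labels)
print(category.tolist())
```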

### Categorization of loan purposes

In [32]:
# checking unique loan purposes
data['purpose'].unique()

array(['покупка жилья', 'приобретение автомобиля',
       'дополнительное образование', 'сыграть свадьбу',
       'операции с жильем', 'образование', 'на проведение свадьбы',
       'покупка жилья для семьи', 'покупка недвижимости',
       'покупка коммерческой недвижимости', 'покупка жилой недвижимости',
       'строительство собственной недвижимости', 'недвижимость',
       'строительство недвижимости', 'на покупку подержанного автомобиля',
       'на покупку своего автомобиля',
       'операции с коммерческой недвижимостью',
       'строительство жилой недвижимости', 'жилье',
       'операции со своей недвижимостью', 'автомобили',
       'заняться образованием', 'сделка с подержанным автомобилем',
       'получение образования', 'автомобиль', 'свадьба',
       'получение дополнительного образования', 'покупка своего жилья',
       'операции с недвижимостью', 'получение высшего образования',
       'свой автомобиль', 'сделка с автомобилем',
       'профильное образование', 'высшее об

In [33]:
# creating a function that assigns category based on data from the purpose column
def purpose_category(purpose):
    if 'автомоб' in purpose:
        return 'car loans'
    if 'жил' in purpose:
        return 'real estate loans'
    if 'недвиж' in purpose:
        return 'real estate loans'
    if 'свадьб' in purpose:
        return 'wedding loans'
    if 'образ' in purpose:
        return 'education loans'

In [34]:
# creating a purpose_category column with categories
data['purpose_category'] = data['purpose'].apply(purpose_category)

# displaying the head of the DataFrame
data.head(3)

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,gender,income_type,debt,total_income,purpose,total_income_category,purpose_category
0,1,8437,42,0,0,F,сотрудник,0,253875,покупка жилья,B,real estate loans
1,1,4024,36,1,0,F,сотрудник,0,112080,приобретение автомобиля,C,car loans
2,0,5623,33,1,0,M,сотрудник,0,145885,покупка жилья,C,real estate loans


**Conclusion**

A `purpose_category` column was created that assigns each borrower a loan purpose category. There are 4 categories in total:
car loans, real estate loans, wedding loans, and education loans.
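
Keyword matching of this kind returns nothing for a purpose that contains none of the stems, so it can be useful to make that case explicit. The sketch below uses the same stems as `purpose_category` but adds a fallback; the function name `purpose_category_checked`, the label `'other'` and the sample purpose `'ремонт'` are hypothetical additions for illustration.

```python
# Word stems and the category they map to, in the order they are checked.
PURPOSE_KEYWORDS = [
    ('автомоб', 'car loans'),
    ('жил', 'real estate loans'),
    ('недвиж', 'real estate loans'),
    ('свадьб', 'wedding loans'),
    ('образ', 'education loans'),
]

def purpose_category_checked(purpose):
    """Return the first matching category, or 'other' if no stem matches."""
    for stem, category in PURPOSE_KEYWORDS:
        if stem in purpose:
            return category
    return 'other'  # hypothetical fallback, not present in the original function

# The last purpose is made up to trigger the fallback.
purposes = ['покупка жилья', 'сыграть свадьбу', 'автомобили', 'получение образования', 'ремонт']
categories = [purpose_category_checked(p) for p in purposes]
print(categories)
```

Counting the rows labelled `'other'` is then a simple way to confirm that every purpose in the data was recognized.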

## Data analysis

### Does the number of children affect the loan repayment on time?

In [35]:
# checking for values in the children column
data['children'].value_counts()

0    14107
1     4856
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

In [36]:
# creating a function that assigns a category based on data from the children column
# combining families with 3 or more children into one category
def make_children_category(children):
    if children == 1:
        return '1 child'
    if children == 2:
        return '2 children'
    if children >= 3:
        return '3 or more children'
    return 'No children'

# creating a children_category column
data['children_category'] = data['children'].apply(make_children_category)

# checking for values
data['children_category'].value_counts()

No children           14107
1 child                4856
2 children             2128
3 or more children      380
Name: children_category, dtype: int64

In [37]:
# creating a function that returns the share of borrowers with debt (sum of debt flags / number of borrowers) as a percentage string
def make_proportion(debt_series):
    return str(round((debt_series.sum() / debt_series.count()) * 100, 2)) + '%'

In [38]:
# creating a pivot table with the number of borrowers with debt, the total number of borrowers, and the share with debt
data_pivot = data.pivot_table(index = ['children_category'], values = ['debt'], aggfunc = ['sum', 'count', make_proportion])

In [39]:
# sorting the pivot table
data_pivot = data_pivot.sort_values(by=('make_proportion', 'debt'))
data_pivot

children_category,sum (debt),count (debt),make_proportion (debt)
No children,1063,14107,7.54%
3 or more children,31,380,8.16%
1 child,445,4856,9.16%
2 children,202,2128,9.49%


- Borrowers without children have the lowest share of debt (7.54%).

- Families with one or two children have the highest shares (9.16% and 9.49%). However, the difference is under two percentage points.

- For families with three or more children, the share (8.16%) is close to that of borrowers without children, although this group contains only 380 borrowers.
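
Because `make_proportion` returns text such as `'9.49%'`, sorting by that column compares strings, which works here only because every share is below 10%. A numeric variant can be sketched with `groupby` and named aggregations; the toy data and the column names `debtors`, `borrowers` and `debt_share` below are assumptions made for the example.

```python
import pandas as pd

# Toy data; debt is 1 if the borrower had a debt, otherwise 0.
toy = pd.DataFrame({
    'children_category': ['No children'] * 4 + ['1 child'] * 2 + ['2 children'] * 2,
    'debt': [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of ones, so it stays numeric and sorts correctly.
summary = (
    toy.groupby('children_category')['debt']
    .agg(debtors='sum', borrowers='count', debt_share='mean')
    .sort_values('debt_share')
)

# Convert to percent only for display.
summary['debt_share_pct'] = (summary['debt_share'] * 100).round(2)
print(summary)
```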

### Does marital status affect loan repayment on time?

In [40]:
# returning the family_status column to data
family_status = data.merge(df_family_status, on = 'family_status_id', how = 'right')

In [41]:
# creating a pivot table with the number of borrowers with debt, the total number of borrowers, and the share with debt
data_pivot = family_status.pivot_table(index=['family_status'], values=['debt'], aggfunc=['sum', 'count', make_proportion])

In [42]:
# sorting the pivot table
data_pivot = data_pivot.sort_values(by=('make_proportion', 'debt'))
data_pivot

family_status,sum (debt),count (debt),make_proportion (debt)
вдовец / вдова,63,959,6.57%
в разводе,85,1195,7.11%
женат / замужем,931,12344,7.54%
гражданский брак,388,4163,9.32%
не женат / не замужем,274,2810,9.75%


- Widowed and divorced borrowers have the lowest risk of delinquency (6.57% and 7.11%).

- Officially married borrowers are close behind (7.54%).

- Unmarried borrowers and those in a civil partnership have the highest risk of delinquency (9.75% and 9.32%).

- People who are in official relationships probably take out loans with greater care.

### Does income level affect the loan repayment on time?

In [43]:
# creating a function that maps an income category letter to its income range
def income_category_rubles(income):
    if income == 'E':
        return '0–30,000'
    if income == 'D':
        return '30,001–50,000'
    if income == 'C':
        return '50,001–200,000'
    if income == 'B':
        return '200,001–1,000,000'
    return '1,000,001 and more'

In [44]:
# adding income_category_range to data
data['income_category_range'] = data['total_income_category'].apply(income_category_rubles)
data.head()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,gender,income_type,debt,total_income,purpose,total_income_category,purpose_category,children_category,income_category_range
0,1,8437,42,0,0,F,сотрудник,0,253875,покупка жилья,B,real estate loans,1 child,"200,001–1,000,000"
1,1,4024,36,1,0,F,сотрудник,0,112080,приобретение автомобиля,C,car loans,1 child,"50,001–200,000"
2,0,5623,33,1,0,M,сотрудник,0,145885,покупка жилья,C,real estate loans,No children,"50,001–200,000"
3,3,4124,32,1,0,M,сотрудник,0,267628,дополнительное образование,B,education loans,3 or more children,"200,001–1,000,000"
4,0,340266,53,1,1,F,пенсионер,0,158616,сыграть свадьбу,C,wedding loans,No children,"50,001–200,000"


In [45]:
# creating a pivot table with the number of borrowers with debt, the total number of borrowers, and the share with debt
data_pivot = data.pivot_table(index='income_category_range', values = ['debt'], aggfunc=['sum', 'count', make_proportion])

In [46]:
# sorting the pivot table
data_pivot = data_pivot.sort_values(by=('make_proportion', 'debt'))
data_pivot

income_category_range,sum (debt),count (debt),make_proportion (debt)
"30,001–50,000",21,350,6.0%
"200,001–1,000,000",364,5224,6.97%
"1,000,001 and more",2,25,8.0%
"50,001–200,000",1352,15850,8.53%
"0–30,000",2,22,9.09%


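Borrowers with an income of 30,001–50,000 have the lowest share of debt (6.0%), followed by the 200,001–1,000,000 group (6.97%). The largest group, 50,001–200,000, is at 8.53%. The lowest income group (0–30,000) shows the highest share (9.09%), but it and the highest income group contain only 22 and 25 borrowers, so their figures should be treated with caution.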
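
Group sizes differ a lot between the income categories, so it can help to flag small groups before comparing shares. The sketch below rebuilds the pivot table above from its printed counts; the threshold `MIN_GROUP_SIZE = 100` is an arbitrary assumption for illustration, not a value used in this study.

```python
import pandas as pd

# Counts copied from the pivot table above.
income_pivot = pd.DataFrame(
    {
        'debtors': [21, 364, 2, 1352, 2],
        'borrowers': [350, 5224, 25, 15850, 22],
    },
    index=['30,001–50,000', '200,001–1,000,000', '1,000,001 and more',
           '50,001–200,000', '0–30,000'],
)

# Share of borrowers with debt, in percent.
income_pivot['debt_share_pct'] = (income_pivot['debtors'] / income_pivot['borrowers'] * 100).round(2)

# Hypothetical cutoff below which a group is considered too small to compare.
MIN_GROUP_SIZE = 100
income_pivot['small_group'] = income_pivot['borrowers'] < MIN_GROUP_SIZE

print(income_pivot)
```

In a group of 22 or 25 borrowers, one additional debtor would shift the share by about four percentage points.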

### Does the purpose of a loan affect the loan repayment on time?

In [47]:
# creating a pivot table with the number of borrowers with debt, the total number of borrowers, and the share with debt
data_pivot = data.pivot_table(index='purpose_category', values = ['debt'], aggfunc=['sum', 'count', make_proportion])

In [48]:
# sorting the pivot table
data_pivot = data_pivot.sort_values(by=('make_proportion', 'debt'))
data_pivot

purpose_category,sum (debt),count (debt),make_proportion (debt)
real estate loans,782,10814,7.23%
wedding loans,186,2335,7.97%
education loans,370,4014,9.22%
car loans,403,4308,9.35%


Real estate loans carry the lowest risk (7.23%), followed by wedding loans (7.97%). Car and education loans carry a higher risk (9.35% and 9.22%).

## Final conclusion:

Borrowers without children have the lowest share of debt. Families with one or two children have the highest shares. However, the difference is not that great in general. For families with three or more children, the share is approximately the same as for borrowers without children.


Widowed and divorced people are at the lowest risk of delinquency. People who are officially married are close behind them. Unmarried people and those in a civil partnership are at the highest risk of delinquency. People who are in official relationships probably take out loans with greater care.

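Borrowers with an income of 30,001–50,000 are the least indebted, followed by the 200,001–1,000,000 group. The largest group, 50,001–200,000, has a higher share of debt. The lowest income group shows the highest share, but the lowest and highest income groups are too small (22 and 25 borrowers) for firm conclusions.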

Real estate loans carry the lowest risk, followed by wedding loans. Car and education loans carry a higher risk.

**Details:**

**Does the number of children affect the loan repayment on time?**

- No children - 7.54%
- 3 or more children - 8.16%
- 1 child - 9.16%
- 2 children - 9.49%

- The absolute difference between the minimum and maximum values - 1.95 percentage points
- The relative difference between the minimum and maximum values - 20.55%

The relative difference r between two values x and y is the difference divided by the larger value:
if x > y, then r = ((x - y) / x) * 100; if x < y, then r = ((y - x) / y) * 100.
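
As a worked example, the two differences can be computed from the rounded shares listed in this section. The sketch assumes that the difference is divided by the larger value, which is consistent with the relative differences reported here; the helper name `spread` is illustrative.

```python
def spread(shares):
    """Return (absolute, relative) difference between the largest and smallest share.

    The absolute difference is in percentage points; the relative difference
    is a percentage of the largest share.
    """
    highest = max(shares)
    lowest = min(shares)
    absolute = round(highest - lowest, 2)
    relative = round((highest - lowest) / highest * 100, 2)
    return absolute, relative

# Shares (in percent) copied from the lists in this section.
children = [7.54, 8.16, 9.16, 9.49]
marital = [6.57, 7.11, 7.54, 9.32, 9.75]
income = [6.0, 6.97, 8.0, 8.53, 9.09]
purpose = [7.23, 7.97, 9.22, 9.35]

for name, shares in [('children', children), ('marital status', marital),
                     ('income', income), ('purpose', purpose)]:
    print(name, spread(shares))
```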

**Does marital status affect loan repayment on time?**

- widow / widower - 6.57%
- divorced - 7.11%
- married - 7.54%
- cohabiting couples - 9.32%
- not married - 9.75%


- The absolute difference between the minimum and maximum values - 3.18 percentage points
- The relative difference between the minimum and maximum values - 32.62%

**Does income level affect the loan repayment on time?**

- 30,001–50,000 - 6.0%
- 200,001–1,000,000 - 6.97%
- 1,000,001 and above - 8.0%
- 50,001–200,000 - 8.53%
- 0–30,000 - 9.09%


- The absolute difference between the minimum and maximum values - 3.09 percentage points
- The relative difference between the minimum and maximum values - 33.99%

**Does the purpose of a loan affect the loan repayment on time?**

- real estate loans - 7.23%
- wedding loans - 7.97%
- education loans - 9.22%
- car loans - 9.35%


- The absolute difference between the minimum and maximum values - 2.12 percentage points
- The relative difference between the minimum and maximum values - 22.67%