# Research on the reliability of bank borrowers

The customer is the credit department of a bank. The goal is to find out whether a client's marital status and number of children affect whether the loan is repaid on time. The input data is the bank's statistics on customer solvency.

The results of the study will be taken into account when building a **credit scoring** model — a special system that evaluates the ability of a potential borrower to repay a loan to a bank.

## Data preview 

In [22]:
import pandas as pd
data = pd.read_csv('/datasets/data.csv')
display(data.head(10)) # getting the table
data.info() #  getting the general information
data['children'].value_counts()
data['dob_years'].value_counts()
data['education'].value_counts()
data['family_status'].value_counts()
data['gender'].value_counts()
data['income_type'].value_counts()
data['total_income'].min()
data['total_income'].max()
data['purpose'].value_counts()


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


свадьба                                   797
на проведение свадьбы                     777
сыграть свадьбу                           774
операции с недвижимостью                  676
покупка коммерческой недвижимости         664
покупка жилья для сдачи                   653
операции с жильем                         653
операции с коммерческой недвижимостью     651
жилье                                     647
покупка жилья                             647
покупка жилья для семьи                   641
строительство собственной недвижимости    635
недвижимость                              634
операции со своей недвижимостью           630
строительство жилой недвижимости          626
покупка недвижимости                      624
покупка своего жилья                      620
строительство недвижимости                620
ремонт жилью                              612
покупка жилой недвижимости                607
на покупку своего автомобиля              505
заняться высшим образованием      

**Conclusion**

The table has 12 columns and 21525 rows. Five columns have the object dtype, five are int64, and two are float64.

The table contains data about the borrower in the following form: 
* **children** — number of children in the family, 
* **days_employed** — total work experience in days, 
* **dob_years** — age of the client in years,
* **education** — level of education of the client, 
* **education_id** — identifier of the level of education, 
* **family_status** — marital status,
* **family_status_id** — identifier of marital status, 
* **gender** — gender of the client, 
* **income_type** — type of employment, 
* **debt** — whether there was a debt on repayment of loans, 
* **total_income** — monthly income, 
* **purpose** — the purpose of getting a loan. 


The column names contain no mistakes, but the number of non-null values differs between columns, which means that some data is missing. The education column mixes upper and lower case, and the days_employed column contains negative values and abnormally large values. The gender column contains one XNA value, possibly because of an error when the table was exported or filled in. This data must be processed before it can be analyzed correctly.

Each row describes one borrower. Some columns hold general information (age, gender, education, number of children, type of employment, marital status). The others relate directly to the client's banking history (whether there was a debt on loan repayment, monthly income, the purpose of the loan).

There is enough data for the analysis, but the gaps in the data must be handled first.
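The problems listed above can also be collected programmatically. The sketch below is a minimal illustration on a small made-up frame; the `quality_report` helper and the toy values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Collect a few simple data-quality indicators for a borrowers table."""
    return {
        # number of missing values in each column that has any
        "missing": {c: int(n) for c, n in df.isna().sum().items() if n > 0},
        # negative work experience is impossible and signals a data problem
        "negative_days_employed": int((df["days_employed"] < 0).sum()),
        # distinct spellings that collapse into one value after lowercasing
        "education_case_variants": int(
            df["education"].nunique() - df["education"].str.lower().nunique()
        ),
    }

# toy data, invented for illustration only
toy = pd.DataFrame({
    "days_employed": [-8437.7, None, 340266.1, -152.8],
    "total_income": [253875.6, None, 158616.1, 135823.9],
    "education": ["высшее", "Среднее", "среднее", "СРЕДНЕЕ"],
})
report = quality_report(toy)
print(report)
```

Running the same function on the full table would give one compact summary to check again after preprocessing.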




## Data preprocessing

### Processing of the gaps

In [23]:
# using isna() to find the gaps and len() to count them
len(data[data['days_employed'].isna()]) # 2174
len(data[data['total_income'].isna()]) # 2174

# filling the gaps in days_employed and total_income with the column median
data['days_employed'] = data['days_employed'].fillna(data['days_employed'].median())
data['total_income'] = data['total_income'].fillna(data['total_income'].median())

# replacing 0 in dob_years with the average age
data.loc[data['dob_years'] == 0, 'dob_years'] = data['dob_years'].mean()
data.info() 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  float64
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(3), int64(4), object(5)
memory usage: 2.0+ MB


**Conclusion**

The first step is to detect the gaps. We used the isna() method to find them and len() to count them: there are 2174 gaps in each of the 'days_employed' and 'total_income' columns. These columns are important for the study, and the gaps make up about 10% of the rows, so the rows cannot simply be deleted. Both columns are quantitative variables, and gaps in such variables are filled with a typical value, usually the mean or the median; here the median was used. The zeros in the 'dob_years' column were replaced with the average value. The check with info() shows that all the gaps are now filled in. Such errors may have appeared when the data was exported, or they may be due to human error.
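A single median over the whole column ignores differences between groups of clients. As a possible refinement, and not what the notebook does, each gap could be filled with the median of the client's income_type group. The sketch below assumes a hypothetical helper `fill_with_group_median` and uses invented toy values.

```python
import pandas as pd

def fill_with_group_median(df, column, group_col):
    """Return `column` with NaN replaced by the median of its `group_col` group."""
    # transform() returns a Series aligned with df, one group median per row
    group_median = df.groupby(group_col)[column].transform("median")
    return df[column].fillna(group_median)

# toy data, invented for illustration only
toy = pd.DataFrame({
    "income_type": ["сотрудник", "сотрудник", "сотрудник", "пенсионер", "пенсионер"],
    "total_income": [100000.0, 140000.0, None, 90000.0, None],
})
toy["total_income"] = fill_with_group_median(toy, "total_income", "income_type")
print(toy)
```

Groups that consist only of gaps would stay empty with this approach, so a fallback to the overall median may still be needed.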

### Replacing the data type

In [24]:
# removing the '-' sign from days_employed
data['days_employed'] = data['days_employed'].abs()
data['days_employed'] = data['days_employed'].astype('int')
# values above 30000 are abnormally large; dividing them by 24 assumes they were recorded in hours
data.loc[data['days_employed'] > 30000, 'days_employed'] = data['days_employed'] / 24
data['days_employed'] = data['days_employed'].astype('int')

# converting real numbers to integers
data['total_income'] = data['total_income'].astype('int')

data['dob_years'] = data['dob_years'].astype('int')


**Conclusion**

There are several ways to change a data type, including to_numeric(), but that function is meant for converting string values to numbers. In this case, we convert real numbers to integers using the astype('int') method.
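It may help to remember that astype('int') drops the fractional part instead of rounding. The short sketch below shows the difference on three illustrative values.

```python
import pandas as pd

values = pd.Series([253875.639453, 0.9, -926.185831])

truncated = values.astype("int")        # drops the fractional part (toward zero)
rounded = values.round().astype("int")  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

For incomes of this magnitude the difference is at most one unit, so truncation does not affect the analysis.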

### Processing duplicates

In [25]:
data.duplicated().sum() # found 54 duplicates
# they are removed with drop_duplicates() below, after the hidden duplicates are checked

# checking hidden duplicates in 'children' column
data['children'].unique()
#array([ 1,  0,  3,  2, -1,  4, 20,  5])
data['children'].value_counts() 
data['children'] = data['children'].replace(-1, 1)
data['children'] = data['children'].replace(20, 2)
data['children'].value_counts()


0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

In [26]:
data['gender'].value_counts()
# the single anomalous value in 'gender' (XNA, about 0.005% of the rows) is not used in the study, so it can be left as is


F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [27]:
# checking for hidden duplicates in 'education'
data['education'].unique()
data['education'] = data['education'].str.lower()
data['education'].value_counts()



среднее                15233
высшее                  5260
неоконченное высшее      744
начальное                282
ученая степень             6
Name: education, dtype: int64

In [28]:
# checking for hidden duplicates in family_status
data['family_status'].unique()

array(['женат / замужем', 'гражданский брак', 'вдовец / вдова',
       'в разводе', 'Не женат / не замужем'], dtype=object)

In [29]:
# checking for hidden duplicates in income_type
data['income_type'].unique()

array(['сотрудник', 'пенсионер', 'компаньон', 'госслужащий',
       'безработный', 'предприниматель', 'студент', 'в декрете'],
      dtype=object)

In [9]:
# checking for hidden duplicates in purpose
data['purpose'].unique()

array(['покупка жилья', 'приобретение автомобиля',
       'дополнительное образование', 'сыграть свадьбу',
       'операции с жильем', 'образование', 'на проведение свадьбы',
       'покупка жилья для семьи', 'покупка недвижимости',
       'покупка коммерческой недвижимости', 'покупка жилой недвижимости',
       'строительство собственной недвижимости', 'недвижимость',
       'строительство недвижимости', 'на покупку подержанного автомобиля',
       'на покупку своего автомобиля',
       'операции с коммерческой недвижимостью',
       'строительство жилой недвижимости', 'жилье',
       'операции со своей недвижимостью', 'автомобили',
       'заняться образованием', 'сделка с подержанным автомобилем',
       'получение образования', 'автомобиль', 'свадьба',
       'получение дополнительного образования', 'покупка своего жилья',
       'операции с недвижимостью', 'получение высшего образования',
       'свой автомобиль', 'сделка с автомобилем',
       'профильное образование', 'высшее об

In [30]:
# checking for hidden duplicates in debt
data['debt'].unique()

array([0, 1])

In [11]:
data = data.drop_duplicates().reset_index(drop=True)
data.duplicated().sum()

0

**Conclusion**

We used the duplicated() method to find duplicates and the sum() method to count them. This identified the rows that match completely. To remove the duplicates, we called data.drop_duplicates().reset_index(drop=True); reset_index(drop=True) rebuilds the index without creating a column with the old index values.

The unique() method was applied to the columns education, children, debt, purpose, income_type and family_status, and value_counts() to gender, in order to find hidden duplicates. We found repeated values that differ only in case in the education column and converted everything to lowercase using the str.lower() method. In the children column, the suspicious values -1 and 20 were found. Such values may appear when the data is loaded, and they probably stand for 1 child and 2 children respectively. Their share is insignificant, so we replaced them with 1 and 2 using the replace() method.

After checking the specified columns for duplicates, all duplicates were deleted using the drop_duplicates() method.
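The order of the steps matters: rows that differ only in letter case are not reported by duplicated() until the text is normalized. The sketch below shows this on a small invented frame.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "education": ["среднее", "Среднее", "СРЕДНЕЕ", "высшее"],
    "children": [0, 0, 0, 1],
})

before = int(toy.duplicated().sum())   # case variants count as different rows

toy["education"] = toy["education"].str.lower()
after = int(toy.duplicated().sum())    # the first three rows are now identical

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(before, after, len(deduplicated))
```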



### Lemmatization

In [31]:
# importing the pymystem3 library
from pymystem3 import Mystem
from collections import Counter
m = Mystem() 
text = data['purpose'].unique()
separator = ' '
text = separator.join(text)
lemmas = (m.lemmatize(text))
display(Counter(lemmas)) 

# assigning each purpose to a general category by its lemmas
def lemmas_purpose(row):
    purpose = row['purpose']
    purpose_lemmas = m.lemmatize(purpose)
    if ('недвижимость' in purpose_lemmas or 'жилье' in purpose_lemmas):
        return 'недвижимость'
    elif ('свадьба' in purpose_lemmas or 'сыграть' in purpose_lemmas):
        return 'свадьба'
    elif 'автомобиль' in purpose_lemmas:
        return 'автомобиль'
    elif 'образование' in purpose_lemmas:
        return 'образование'
    else:
        return 'цель не определена'

data['general_purposes'] = data.apply(lemmas_purpose, axis = 1)
display(data.head())
data['general_purposes'].value_counts()
    
 

Counter({'покупка': 10,
         ' ': 96,
         'жилье': 7,
         'приобретение': 1,
         'автомобиль': 9,
         'дополнительный': 2,
         'образование': 9,
         'сыграть': 1,
         'свадьба': 3,
         'операция': 4,
         'с': 5,
         'на': 4,
         'проведение': 1,
         'для': 2,
         'семья': 1,
         'недвижимость': 10,
         'коммерческий': 2,
         'жилой': 2,
         'строительство': 3,
         'собственный': 1,
         'подержать': 2,
         'свой': 4,
         'со': 1,
         'заниматься': 2,
         'сделка': 2,
         'получение': 3,
         'высокий': 3,
         'профильный': 1,
         'сдача': 1,
         'ремонт': 1,
         '\n': 1})

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,general_purposes
0,1,8437,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875,покупка жилья,недвижимость
1,1,4024,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080,приобретение автомобиля,автомобиль
2,0,5623,33,среднее,1,женат / замужем,0,M,сотрудник,0,145885,покупка жилья,недвижимость
3,3,4124,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628,дополнительное образование,образование
4,0,14177,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616,сыграть свадьбу,свадьба


недвижимость    10840
автомобиль       4315
образование      4022
свадьба          2348
Name: general_purposes, dtype: int64

**Conclusion**

We lemmatized the values in 'purpose' and counted the lemmas, which revealed the main purposes of the loans. To do this, we imported the pymystem3 library, collected all the unique values of the purpose column and lemmatized them. Counter was used to count how often each lemma occurs. Then we wrote a function that lemmatizes each cell of the purpose column and checks which category it belongs to. This gave four general loan purposes: real estate, car, education and wedding.
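The same grouping can be approximated without a lemmatizer. The sketch below replaces lemmatization with plain substring matching on hand-picked word stems, which is a different and cruder technique than the one used in the notebook. The stems and the helper `categorize_purpose` are assumptions made for illustration.

```python
# (category, stems) pairs checked in order; the stems are hand-picked assumptions
CATEGORY_STEMS = [
    ("недвижимость", ("недвижим", "жиль")),
    ("свадьба", ("свадьб",)),
    ("автомобиль", ("автомоб",)),
    ("образование", ("образован",)),
]

def categorize_purpose(purpose: str) -> str:
    """Map a free-text loan purpose to a general category by substring match."""
    text = purpose.lower()
    for category, stems in CATEGORY_STEMS:
        if any(stem in text for stem in stems):
            return category
    return "цель не определена"

# a few purposes taken from the output of data['purpose'].unique()
purposes = ["покупка жилья", "сыграть свадьбу", "свой автомобиль",
            "получение образования", "операции с недвижимостью"]

# classify each distinct purpose once; the dict could then be passed to Series.map
purpose_to_category = {p: categorize_purpose(p) for p in purposes}
print(purpose_to_category)
```

Classifying only the distinct purposes and mapping the result back avoids processing every row separately, and the same idea would also work with the lemmatizer.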

### Categorization of the data

In [32]:
# getting categories
data[['education']].drop_duplicates()


Unnamed: 0,education
0,высшее
1,среднее
13,неоконченное высшее
31,начальное
2963,ученая степень


In [33]:
data[['family_status']].drop_duplicates()


Unnamed: 0,family_status
0,женат / замужем
4,гражданский брак
18,вдовец / вдова
19,в разводе
24,Не женат / не замужем


In [34]:
data[['family_status_id', 'family_status']].drop_duplicates().set_index('family_status_id')


Unnamed: 0_level_0,family_status
family_status_id,Unnamed: 1_level_1
0,женат / замужем
1,гражданский брак
2,вдовец / вдова
3,в разводе
4,Не женат / не замужем


In [35]:
data[['income_type']].drop_duplicates()


Unnamed: 0,income_type
0,сотрудник
4,пенсионер
5,компаньон
26,госслужащий
3133,безработный
5936,предприниматель
9410,студент
20845,в декрете


In [36]:
data[['purpose']].drop_duplicates()


Unnamed: 0,purpose
0,покупка жилья
1,приобретение автомобиля
3,дополнительное образование
4,сыграть свадьбу
6,операции с жильем
7,образование
8,на проведение свадьбы
9,покупка жилья для семьи
10,покупка недвижимости
11,покупка коммерческой недвижимости


**Conclusion**

At this stage we examined the dictionaries of the categorical variables. Categorization shows which categories the dataframe contains and in what format they are recorded.
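A dictionary of this kind can be turned into a plain Python dict that maps an identifier to its category name. The sketch below uses an invented sample of the family_status columns; the variable names are illustrative.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "family_status_id": [0, 1, 0, 4, 1],
    "family_status": ["женат / замужем", "гражданский брак", "женат / замужем",
                      "Не женат / не замужем", "гражданский брак"],
})

# one entry per category: the "dictionary" of marital statuses
status_dict = (
    toy[["family_status_id", "family_status"]]
    .drop_duplicates()
    .set_index("family_status_id")["family_status"]
    .to_dict()
)
print(status_dict)

# the text column can be restored from the identifier at any time
restored = toy["family_status_id"].map(status_dict)
```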

## Research questions

- Is there a correlation between having children and repayment of the loan on time?

In [42]:
# creating a table for children and debt
children_pivot = data.pivot_table(index = ['children'], values = 'debt', aggfunc = ['sum', 'count', 'mean'])
children_pivot['mean'] = children_pivot['mean'] * 100 # counting the percentage
children_pivot.columns = ['has_debt', 'total', '%'] # creating columns
children_pivot = children_pivot.sort_values(by = 'children', ascending = False) # sorting
display(children_pivot)

Unnamed: 0_level_0,has_debt,total,%
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5,0,9,0.0
4,4,41,9.756098
3,27,330,8.181818
2,202,2131,9.479118
1,445,4865,9.146968
0,1063,14149,7.512898


**Conclusion**

According to the pivot table, the largest share of debtors is among families with 4 children (9.8%), followed by families with 2 children (9.5%) and families with 1 child (9.1%). For families with 3 children the share is 8.2%, about 1 to 1.6 percentage points lower. The group with 5 children contains only 9 clients, which is too few to draw conclusions. Among clients without children the share of debtors is 7.5%. Taken together, clients with children have a debt rate of about 9.2% (678 of 7376), against 7.5% for clients without children. The difference is about 1.7 percentage points and does not seem significant. It is also worth bearing in mind that the absolute number of debtors among clients without children is higher.
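The comparison between clients with and without children can be checked by pooling the counts from the pivot table instead of averaging the group percentages. The sketch below is a minimal illustration; `pooled_rate` is a hypothetical helper and the counts are copied from the table above.

```python
# (clients with debt, total clients) per number of children, from the pivot table
children_table = {
    0: (1063, 14149),
    1: (445, 4865),
    2: (202, 2131),
    3: (27, 330),
    4: (4, 41),
    5: (0, 9),
}

def pooled_rate(rows):
    """Share of debtors, in percent, over several groups taken together."""
    rows = list(rows)
    debtors = sum(d for d, _ in rows)
    total = sum(t for _, t in rows)
    return 100 * debtors / total

with_children = pooled_rate(v for k, v in children_table.items() if k > 0)
without_children = pooled_rate([children_table[0]])
difference = with_children - without_children  # in percentage points

print(round(with_children, 2), round(without_children, 2), round(difference, 1))
```

Pooling weights each group by its size, so the tiny groups with 4 and 5 children barely influence the result.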

- Is there a correlation between marital status and repayment of the loan on time?

In [41]:
# creating a table for family_status and debt
family_status_pivot = data.pivot_table(index = ['family_status'], values = 'debt', aggfunc = ['sum', 'count', 'mean'])
family_status_pivot['mean'] = family_status_pivot['mean'] * 100 
family_status_pivot.columns = ['has_debt', 'total', '%'] 
family_status_pivot = family_status_pivot.sort_values(by = 'family_status', ascending = False) 
display(family_status_pivot)

Unnamed: 0_level_0,has_debt,total,%
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
женат / замужем,931,12380,7.520194
гражданский брак,388,4177,9.288963
вдовец / вдова,63,960,6.5625
в разводе,85,1195,7.112971
Не женат / не замужем,274,2813,9.740491


**Conclusion**

Unmarried people (9.7%) and people in a civil marriage (9.3%) have the largest shares of debtors. For married and divorced people the shares are 7.5% and 7.1% respectively. Widowed clients have the smallest share (6.6%). The difference between the groups does not exceed 3.2 percentage points. We can conclude that marital status has practically no effect on the repayment of the loan.
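The spread between the marital status groups can be recomputed directly from the counts in the table. The sketch below is only an illustration with the counts copied from the output above; the variable names are hypothetical.

```python
# (clients with debt, total clients) per marital status, from the pivot table
family_table = {
    "женат / замужем": (931, 12380),
    "гражданский брак": (388, 4177),
    "вдовец / вдова": (63, 960),
    "в разводе": (85, 1195),
    "Не женат / не замужем": (274, 2813),
}

# share of debtors in percent for every group
rates = {status: 100 * d / t for status, (d, t) in family_table.items()}

highest = max(rates, key=rates.get)
lowest = min(rates, key=rates.get)
spread = rates[highest] - rates[lowest]  # in percentage points

print(highest, lowest, round(spread, 1))
```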

- Is there a correlation between the type of income and repayment of the loan on time?

In [40]:
# creating a table for income_type and debt

income_type_pivot = data.pivot_table(index = ['income_type'], values = 'debt', aggfunc = ['sum', 'count', 'mean'])
income_type_pivot['mean'] = income_type_pivot['mean'] * 100 
income_type_pivot.columns = ['has_debt', 'total', '%'] 
income_type_pivot = income_type_pivot.sort_values(by = 'income_type', ascending = False) 
display(income_type_pivot)

Unnamed: 0_level_0,has_debt,total,%
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
студент,0,1,0.0
сотрудник,1061,11119,9.542225
предприниматель,0,2,0.0
пенсионер,216,3856,5.60166
компаньон,376,5085,7.394297
госслужащий,86,1459,5.894448
в декрете,1,1,100.0
безработный,1,2,50.0


**Conclusion**

In this table, employees of companies have the largest share of debtors (9.5%). Partners come second with 7.4%, while civil servants and pensioners have the lowest shares, 5.9% and 5.6%. The groups of students, entrepreneurs, people on maternity leave and the unemployed contain only one or two clients each, so they are not statistically meaningful. Among the remaining groups there were no significant fluctuations in debt repayment, which leads to the conclusion that there is no relationship between income type and debt repayment.
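Groups with only one or two clients produce extreme percentages such as 0% or 100%, so it can be useful to exclude them before comparing. The sketch below does this with a minimum group size; the threshold of 30 is an arbitrary assumption, and the counts are copied from the table above.

```python
# (clients with debt, total clients) per income type, from the pivot table
income_table = {
    "студент": (0, 1),
    "сотрудник": (1061, 11119),
    "предприниматель": (0, 2),
    "пенсионер": (216, 3856),
    "компаньон": (376, 5085),
    "госслужащий": (86, 1459),
    "в декрете": (1, 1),
    "безработный": (1, 2),
}

MIN_GROUP_SIZE = 30  # arbitrary threshold, an assumption for this sketch

# keep only groups large enough for the percentage to be meaningful
reliable = {
    income_type: round(100 * debtors / total, 1)
    for income_type, (debtors, total) in income_table.items()
    if total >= MIN_GROUP_SIZE
}
print(reliable)
```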


- How does the purpose of the loan affect its repayment on time?

In [43]:
# creating a table for general_purposes and debt
pivot_purposes = data.pivot_table(index = ['general_purposes'], values = 'debt', aggfunc = ['sum', 'count', 'mean'])
pivot_purposes['mean'] = pivot_purposes['mean'] * 100
pivot_purposes.columns = ['has_debt', 'total', '%']
pivot_purposes = pivot_purposes.sort_values(by = '%', ascending = False)
display(pivot_purposes)

Unnamed: 0_level_0,has_debt,total,%
general_purposes,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
автомобиль,403,4315,9.339513
образование,370,4022,9.199403
свадьба,186,2348,7.921635
недвижимость,782,10840,7.214022


**Conclusion**

The purpose of the loan also does not significantly affect its repayment: the share of debtors ranges from 7.2% to 9.3% across the categories.
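The four pivot tables in this section follow the same pattern, so they could be produced by one helper. The sketch below is one possible way to do it with groupby and named aggregation; `debt_share_by` and the column name `share` (used instead of `%`) are assumptions, and the data is invented.

```python
import pandas as pd

def debt_share_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count debtors, clients and the share of debtors (in %) per category."""
    table = df.groupby(column)["debt"].agg(has_debt="sum", total="count", share="mean")
    table["share"] = table["share"] * 100
    return table.sort_values("share", ascending=False)

# toy data, invented for illustration only
toy = pd.DataFrame({
    "general_purposes": ["автомобиль", "автомобиль", "свадьба", "свадьба",
                         "свадьба", "свадьба"],
    "debt": [1, 0, 0, 0, 1, 0],
})
result = debt_share_by(toy, "general_purposes")
print(result)
```

Calling the helper with 'children', 'family_status', 'income_type' or 'general_purposes' would give tables of the same shape, all sorted by the share of debtors.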

## General conclusion

In this project, statistics on the solvency of the bank's clients were analyzed in order to determine **whether the marital status and the number of children of a client affect the repayment of the loan on time**.

Before the analysis, the data was loaded from a CSV file and the dataframe was examined using the info() and value_counts() methods. This gave general information about the data (the size of the dataframe and the data types in it) and revealed potential problems: gaps, duplicates and abnormal values. The columns of the dataframe have the object, int64 and float64 dtypes.

Then a step-by-step **data preprocessing** took place.

First, I located the gaps in the days_employed and total_income columns using the isna() method and filled them in, and replaced the zeros in dob_years.
Second, the data types were converted to int with the astype() method. In the days_employed column, the "-" sign was removed, the real numbers were replaced with integers, and the abnormally large values were divided by 24, on the assumption that they had been recorded in hours rather than days.
Next, I searched for obvious and hidden duplicates using the unique() and value_counts() methods for the columns education, children, debt, purpose, income_type, family_status and gender. I found duplicates caused by different case in the education column and converted everything to lowercase using the str.lower() method. In the children column, the suspicious values -1 and 20 were found. Values like these could appear when the data was exported. I assumed that they stand for 1 child and 2 children respectively. The share of these values was insignificant, so I replaced them with 1 and 2 using the replace() method. The remaining anomalies either did not influence the analysis or made up an insignificant share of the data.

At the next stage of preprocessing, the purpose column was lemmatized in order to identify the general categories of loan purposes. To do this, I wrote a function and saved its result in a separate general_purposes column.

Next, the data was categorized: dictionaries of the categorical variables were extracted. They are needed for the main task of the project, which is to find out how the marital status and the number of children of a client relate to the repayment of the loan on time.

In the third part of the analysis I created 4 tables:
1) Pivot table for **children and debt**. It was analyzed in order to answer the question whether there was a correlation between the number of children and the repayment of the loan on time. According to the table, the largest share of debtors is among families with 4 children (9.8%), followed by families with 2 children (9.5%) and families with 1 child (9.1%). For families with 3 children the share is 8.2%, about 1 to 1.6 percentage points lower. Among clients without children the share of debtors is 7.5%. Taken together, clients with children have a debt rate of about 9.2%, against 7.5% for clients without children. The difference is about 1.7 percentage points and does not seem significant.

2) Pivot table for **family_status and debt**. It was analyzed in order to answer the question whether there was a correlation between marital status and repayment of the loan on time. The table shows that unmarried people (9.7%) and people in a civil marriage (9.3%) have the largest shares of debtors. For married and divorced people the shares are 7.5% and 7.1% respectively. Widowed clients have the smallest share (6.6%). The difference between the groups does not exceed 3.2 percentage points. Thus, we can conclude that marital status has practically no effect on the repayment of the loan.

3) Pivot table for **income_type and debt**. It was analyzed in order to answer the question whether there was a correlation between the type of income and repayment of the loan on time. In this table, employees of companies have the largest share of debtors (9.5%). Partners come second with 7.4%, while civil servants and pensioners have the lowest shares, 5.9% and 5.6%. The groups of students, entrepreneurs, people on maternity leave and the unemployed contain only one or two clients each, so they are not statistically meaningful. Among the remaining groups there were no significant fluctuations in debt repayment, which leads to the conclusion that there is no correlation between income type and debt repayment.

4) Pivot table for **general_purposes and debt**. It was analyzed in order to answer the question whether there was a correlation between the purpose of the loan and its repayment on time. The purpose of the loan also does not significantly affect its repayment: the share of debtors ranges from 7.2% to 9.3% across the categories.

According to the results of the study, the number of children and marital status do not significantly affect the repayment of the loan on time. Across the groups of meaningful size by number of children and marital status, the share of debtors lies between 6.6% and 9.8%. The data also shows that the income type and the purpose of the loan do not affect the repayment of the loan on time.

Based on this analysis, the customer should first of all supplement the data for further analysis and keep looking for other categories that could have a stronger impact on the repayment of the loan on time. It should also be kept in mind that the data keeps being replenished and that, if the bank's credit policy changes, the data may show new relationships that will be useful in the future.
