## Research of Borrowers' Reliability

This is a simple project mostly focused on data preparation and manipulation. Here I want to get some simple answers from the dataset that I have.

The customer is the bank's credit department. The goal is to understand whether a client's marital status and number of children affect whether the loan is repaid on time. Let's assume that the data provided by the bank is statistics on the solvency of its customers.

The results of the study will be taken into account when building a credit scoring model — a special system that evaluates the ability of a potential borrower to repay a loan to a bank.

The table contains data on the bank's clients: marital status, debt, education, income, gender. The data is organized in columns of the following types: text, integer, or floating-point number.

The columns of the table are the following:

* `children` - the number of children in the family; 
* `days_employed` - total work experience in days;
* `dob_years` -  the client's age in years;
* `education` - the level of education of the client;
* `education_id` - id of the level of education;
* `family_status` - marital status;
* `family_status_id` -  id of the marital status;
* `gender` - gender of the client;
* `income_type` - type of employment;
* `debt` - whether there was a debt on the repayment of loans;
* `total_income` - total income;
* `purpose` - purpose of the loan.

The study includes the following stages:

1. Data cleanup: search for missing data, restore data, delete duplicates.
2. Data lemmatization.
3. Data categorization.
4. Conclusions for each section separately and one general conclusion.

## Data investigation

The first thing to do is to check the general look and feel of the data.

In [70]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [71]:
# libraries import
import pandas as pd
!pip install pymystem3==0.1.10
from pymystem3 import Mystem
m = Mystem()

# read the data from the CSV file
data = pd.read_csv("/content/drive/MyDrive/da_portfolio/data_credit.csv")

# looking into the table
display(data.head(10))

# general information about the dataset
display(data.info())

# statistics of the table
display(data.describe())



Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


None

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,167422.3
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,102971.6
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,20667.26
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,103053.2
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,145017.9
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,203435.1
max,20.0,401755.400475,75.0,4.0,4.0,1.0,2265604.0


From the general view of the data, I can see that not all columns contain data in a form that can be used for analysis.

For example, the `days_employed` column has negative values and floating-point values, and the minimum number of children is `-1`. In the columns with text data, the text is not unified: both capital and lowercase letters are used, and typos are possible.

From the general information on the table, it can be seen that not all columns have the same number of non-null values: `days_employed` and `total_income` each have 19351 out of 21525.
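The checks described above can be sketched on a small sample. The `sample` frame below is a hypothetical stand-in built from values like those in the preview, not the real dataset; it only shows how each kind of problem can be counted.

```python
import pandas as pd

# Hypothetical sample imitating the problems seen in the preview
sample = pd.DataFrame({
    "education": ["высшее", "среднее", "Среднее", "СРЕДНЕЕ", "ВЫСШЕЕ"],
    "children": [1, 0, -1, 20, 2],
    "days_employed": [-8437.67, -4024.80, 340266.07, -926.19, None],
})

# Mixed case inflates the number of distinct education levels
raw_levels = sample["education"].nunique()
normalized_levels = sample["education"].str.lower().nunique()

# Negative child counts and missing work experience
negative_children = int((sample["children"] < 0).sum())
missing_days = int(sample["days_employed"].isna().sum())

print(raw_levels, normalized_levels, negative_children, missing_days)
```

Comparing `nunique()` before and after lowercasing is a quick way to see how much of the variety in a text column is just letter case.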


The columns `children`, `family_status`, `total_income` and `debt` are the most important for the testing of the different assumptions, so the main effort regarding the data cleaning for further analysis will be applied to them, the remaining columns will be processed 

## Data cleanup

Let's start with the missing information. From the general information on the data, I can see that not all columns have the same number of non-null values. This means that some columns are missing data. These columns are:
* `days_employed` 
* `total_income`

In [72]:
# check the column names for extra spaces
data.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose'],
      dtype='object')

Next, I need to know how much data is missing.

In [73]:
# check the percentage of missing data in each column
pd.DataFrame(round((data.isna().mean()*100),2)).style.background_gradient("coolwarm")

Unnamed: 0,0
children,0.0
days_employed,10.1
dob_years,0.0
education,0.0
education_id,0.0
family_status,0.0
family_status_id,0.0
gender,0.0
income_type,0.0
debt,0.0


10% is too much data to drop, so I will fill the missing values with the median.

In [74]:
# fill nan of the total_income by the median from income_type
data["total_income"] = data.groupby("income_type")["total_income"].transform(lambda x: x.fillna(x.median()))

# fill nan of the days_employed  by the median from income_type
data["days_employed"] = data.groupby("income_type")["days_employed"].transform(lambda x: x.fillna(x.median()))

In [75]:
# check up of the missing data
data.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

Before filling, there were 2174 missing values in the `total_income` column (19351 non-null values out of 21525 rows).

This number matches the number of missing values in the `days_employed` column, which is also 2174. Possibly some of the clients were unemployed at the time of data collection, so they found it difficult to answer the question about earnings.
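Equal counts do not by themselves prove that the gaps sit in the same rows. A minimal sketch of how this could be checked is below; the `toy` frame is a hypothetical example, and whether the real dataset passes this check is an assumption I have not verified here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where both columns are missing in the same rows
toy = pd.DataFrame({
    "days_employed": [-100.0, np.nan, 300.0, np.nan],
    "total_income": [1000.0, np.nan, 2000.0, np.nan],
})

# True only if every row is either missing in both columns or present in both
same_rows = bool((toy["days_employed"].isna() == toy["total_income"].isna()).all())
print(same_rows)
```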

Given that the column has 21525 rows in total, 2174 missing values make up around 10% of all rows. From my point of view, 10% is a lot, so I did not delete these rows, as this could affect the final result.

I filled in the gaps with the median value of the corresponding `income_type` category. This is done in a way where the median income of the "pensioner" category does not affect the median income of the "employee" category, and vice versa. Each category has its own median income and the gaps are filled accordingly.

By the same principle, the gaps in the `days_employed` column are filled.
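To make the per-category fill concrete, here is a small sketch on made-up numbers. The `toy` frame and its values are hypothetical; the technique is the same group-wise median, written with `transform("median")` instead of a lambda.

```python
import pandas as pd

# Hypothetical incomes with one gap per employment type
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee",
                    "pensioner", "pensioner", "pensioner"],
    "total_income": [100.0, 200.0, None, 50.0, 70.0, None],
})

# Median of each group, aligned with the original rows (NaN values are skipped)
group_median = toy.groupby("income_type")["total_income"].transform("median")

# Only the gaps are replaced; existing values stay untouched
toy["total_income"] = toy["total_income"].fillna(group_median)
print(toy)
```

The employee gap is filled with the employee median and the pensioner gap with the pensioner median, so the two groups do not influence each other.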

For the analysis, I will use the data from the `total_income` column. The data type of this column is `float`. To simplify further analysis, I will change the data type to `int`.

In [76]:
# changing the data type
data["total_income"] = data["total_income"].astype("int")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  int64  
 11  purpose           21525 non-null  object 
dtypes: float64(1), int64(6), object(5)
memory usage: 2.0+ MB


The data type in the column that will be used for analysis has been changed and this can be seen in the data information (`total_income 21525 non-null int64`).

I will not use the data from the `days_employed` column, so I do not change the data type of this column.


I'm going to search for duplicate rows. To do this, I will first convert all string values to lowercase. Since some of the columns contain numeric data, I can't apply the `str.lower()` method to the whole table, so I apply it only to the columns with text values, one column at a time.



In [77]:
# going lowercase
data["education"] = data["education"].str.lower()
data["family_status"] = data["family_status"].str.lower()
data["income_type"] = data["income_type"].str.lower()
data["purpose"] = data["purpose"].str.lower()


In [78]:
# checking for duplicates
data.duplicated().sum()

71

In [79]:
# remove duplicates
data = data.drop_duplicates().reset_index(drop=True)

In [80]:
# checking for duplicates again
data.duplicated().sum()

0

Duplicates are removed.
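The order of the two steps matters: rows that differ only by letter case are not counted as duplicates until the text is normalized. A minimal sketch on a hypothetical `toy` frame, which also applies the lowercasing to a list of columns in one step:

```python
import pandas as pd

# Hypothetical rows: the first two differ only in the case of `education`
toy = pd.DataFrame({
    "education": ["Среднее", "среднее", "ВЫСШЕЕ"],
    "family_status": ["женат / замужем", "женат / замужем", "гражданский брак"],
    "income_type": ["сотрудник", "сотрудник", "компаньон"],
    "purpose": ["покупка жилья", "покупка жилья", "образование"],
})

text_columns = ["education", "family_status", "income_type", "purpose"]

duplicates_before = int(toy.duplicated().sum())

# Lowercase every listed text column
toy[text_columns] = toy[text_columns].apply(lambda col: col.str.lower())

duplicates_after = int(toy.duplicated().sum())
print(duplicates_before, duplicates_after)
```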

## Lemmatization

I am using the pymystem3 library to lemmatize the data.

In [81]:
# checking which loan purposes the dataset contains
data["purpose"].value_counts()

свадьба                                   791
на проведение свадьбы                     768
сыграть свадьбу                           765
операции с недвижимостью                  675
покупка коммерческой недвижимости         661
операции с жильем                         652
покупка жилья для сдачи                   651
операции с коммерческой недвижимостью     650
жилье                                     646
покупка жилья                             646
покупка жилья для семьи                   638
строительство собственной недвижимости    635
недвижимость                              633
операции со своей недвижимостью           627
строительство жилой недвижимости          624
покупка недвижимости                      621
покупка своего жилья                      620
строительство недвижимости                619
ремонт жилью                              607
покупка жилой недвижимости                606
на покупку своего автомобиля              505
заняться высшим образованием      

In [82]:
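# creating the lemma function
# the returned labels match the `purpose_lem` values shown in the outputs below
def lemm(purpose):
    # Mystem returns a list of tokens (lemmas and separators); join them back into one string
    lemma = ''.join(m.lemmatize(purpose))
    if ("жилье" in lemma) or ("недвижимость" in lemma):
        return "недвижимость"  # real estate
    elif "автомобиль" in lemma:
        return "автомобиль"  # car
    elif "образование" in lemma:
        return "образование"  # education
    else:
        return "праздник"  # celebrations (weddings)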
 

# applying function to the table
data['purpose_lem'] = data['purpose'].apply(lemm)

In [83]:
# checking the result
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_lem
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875,покупка жилья,недвижимость
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080,приобретение автомобиля,автомобиль
2,0,-5623.42261,33,среднее,1,женат / замужем,0,M,сотрудник,0,145885,покупка жилья,недвижимость
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628,дополнительное образование,образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616,сыграть свадьбу,праздник


For the analysis, the loan purposes were brought into a unified format using lemmatization. I used the pymystem3 library because the text data in the table is in Russian, and pymystem3 can lemmatize such data without additional effort.

The result is stored in a separate column - `purpose_lem`.
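For comparison, the same grouping can be approximated without a lemmatizer by matching word stems. This is a different technique from the Mystem-based function, shown only as a sketch: the `PURPOSE_STEMS` mapping, the function name and the English labels are my assumptions, and the stems were picked by hand to cover the purposes listed above.

```python
# Hypothetical stem-to-category mapping (not part of the original notebook)
PURPOSE_STEMS = {
    "жиль": "real estate",
    "недвижим": "real estate",
    "автомобил": "car",
    "образован": "education",
    "свадьб": "celebrations",
}


def categorize_purpose(purpose: str) -> str:
    """Return the category of the first stem found in the purpose text."""
    text = purpose.lower()
    for stem, category in PURPOSE_STEMS.items():
        if stem in text:
            return category
    return "other"


for example in ["покупка жилья", "на покупку своего автомобиля", "сыграть свадьбу"]:
    print(example, "->", categorize_purpose(example))
```

Stem matching is simpler to run, but it relies on a hand-made list, while lemmatization handles word forms I did not think of in advance.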



## Data categorization


Now I categorize the data.

In [84]:
# converting negative values to positive (for example, -1 becomes 1)
data['children'] = data['children'].abs()

# count values of 'children'
data['children'].value_counts()

0     14091
1      4855
2      2052
3       330
20       76
4        41
5         9
Name: children, dtype: int64

The value of `20` children looks like an outlier. I am not saying a family cannot have 20 kids, but it is very unlikely that there are 76 such families among roughly 21,500 clients. Perhaps someone, when filling out the form, accidentally added `0` to `2 children`, or maybe the form itself displayed the information incorrectly.

Since I do not know where to put this data (for example, whether it would be correct to add it to the `2 children` category), and its presence in the table creates an unnecessary outlier, I decided to delete the rows with the `20 children` value.

The number of rows in the table allows this (76 is about 0.4% of the total number of clients). Even if this percentage includes families that actually have 20 children, they are too few to affect the conclusions.
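The share quoted above can be reproduced from the `value_counts()` output. The counts below are copied from that output; the variable names are only for illustration.

```python
# Counts per number of children, taken from the value_counts() output above
counts = {0: 14091, 1: 4855, 2: 2052, 3: 330, 20: 76, 4: 41, 5: 9}

total_rows = sum(counts.values())
outlier_rows = counts[20]

# Share of rows with 20 children, in percent
share_pct = round(outlier_rows / total_rows * 100, 2)
print(total_rows, share_pct)
```

The total here is the row count after duplicates were removed, so the share comes out at roughly 0.35%, which rounds to the 0.4% mentioned above.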

In [85]:
# removing outliers
data = data.drop(data[data.children == 20].index)

# checking the result
data['children'].value_counts()

0    14091
1     4855
2     2052
3      330
4       41
5        9
Name: children, dtype: int64

The number of children in a family is now at most five and there are no outliers, so we can move on to the actual data analysis.


Categorizing data by income is necessary, because the data table currently has too many unique values in the income column. By dividing them into categories, I make it easier to analyze without deleting or replacing the data. To categorize income data, I use quartiles.

In [86]:
# before building the income pivot table,
# I categorize income to make the data easier to work with;
# to do this, I find the quartiles of the income column
data['total_income'].quantile([0.25,0.5,0.75])

0.25    107528.75
0.50    142594.00
0.75    195795.50
Name: total_income, dtype: float64

In [87]:
# a function that assigns an income category, depending on the quartile
def income_category(income):
    if income <= 107528:
        return "low"
    elif income <= 142594:
        return "medium-low"
    elif income <= 195795:
        return "medium-high"
    else:
        return "high"

data['total_income_cat'] = data['total_income'].apply(income_category)

Income data is categorized, divided into four types from high to low. This categorization will make it easier to understand the results of the analysis, since the values in the results will also be divided into categories.
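The quartile thresholds do not have to be typed in by hand. A possible alternative, sketched on hypothetical incomes, is `pd.qcut`, which computes the quartile boundaries itself and, like the `<=` comparisons in `income_category`, includes the right edge of each bin.

```python
import pandas as pd

# Hypothetical incomes, not taken from the dataset
incomes = pd.Series([50_000, 90_000, 120_000, 140_000,
                     160_000, 190_000, 250_000, 400_000])

labels = ["low", "medium-low", "medium-high", "high"]

# q=4 splits the values into quartiles; labels are assigned from lowest to highest
income_cat = pd.qcut(incomes, q=4, labels=labels)
print(income_cat.tolist())
```

With this approach the categories stay in sync with the data if rows are added or removed, because the boundaries are recomputed each time.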

## Now let's answer some questions


**Is there a relationship between having children and paying back the loan on time?**

In [88]:
# let's make a pivot table for 'children'

data.pivot_table(index='children', values='debt', aggfunc=['count','sum','mean'])

children,count,sum,mean
0,14091,1063,0.075438
1,4855,445,0.091658
2,2052,194,0.094542
3,330,27,0.081818
4,41,4,0.097561
5,9,0,0.0


Let's express the number of clients with unpaid loans as a percentage of all clients in each category:

* 7.5% without children
* 9.2% with one child
* 9.5% with two children
* 8.2% with three children
* 9.8% with four children
* 0% with five children

The percentages show that clients with children fall into arrears more often than clients without children, with the highest rate (9.8%) among clients with four children. The groups with four and five children are small (41 and 9 clients), so their rates should be read with caution.
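The percentages can also be computed directly instead of being read off the `mean` column. Below is a minimal sketch on a hypothetical `toy` frame; the column names `clients`, `debtors` and `debt_rate_pct` are my own and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical data: `debt` is 1 for a client with arrears, 0 otherwise
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 1, 1],
})

# Group size, number of debtors and share of debtors per group
summary = toy.groupby("children")["debt"].agg(
    clients="count", debtors="sum", debt_rate="mean"
)
summary["debt_rate_pct"] = (summary["debt_rate"] * 100).round(1)
print(summary)
```

Keeping the group size next to the rate makes it easier to notice when a percentage is based on only a handful of clients.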

**Is there a relationship between marital status and repayment of the loan on time?**

In [89]:
# pivot table for 'family_status'
data.pivot_table(index='family_status', values='debt', aggfunc=['count','sum','mean'])

family_status,count,sum,mean
в разводе,1193,84,0.070411
вдовец / вдова,955,63,0.065969
гражданский брак,4139,385,0.093018
женат / замужем,12290,928,0.075509
не женат / не замужем,2801,273,0.097465


Let's express the number of clients with unpaid loans as a percentage of all clients in each category:

* 7.0% divorced
* 6.6% widowed
* 9.3% de facto relationship
* 7.6% married
* 9.7% single

From the percentages I can see that single clients have the highest share of loan arrears, followed by clients in a de facto relationship.

**Is there a relationship between the level of income and repayment of the loan on time?**

There are too many income values in the data table and almost all of them are unique. To simplify the analysis, income was divided into categories in the `Data categorization` step.

In [90]:
# pivot table for 'total_income_cat'
data.pivot_table(index='total_income_cat', values='debt', aggfunc=['count','sum','mean'])


total_income_cat,count,sum,mean
high,5345,381,0.071282
low,5345,427,0.079888
medium-high,5227,445,0.085135
medium-low,5461,480,0.087896


Below is the percentage of clients with loan arrears out of all clients in each category:

* 7.1% high
* 8.0% low
* 8.5% medium-high
* 8.8% medium-low

From the percentages I can see that income has little effect on loan arrears when income is divided into four categories: the rates range from 7.1% to 8.8%.

However, the two highest rates both belong to the middle-income categories. Since the categories are built from quartiles, each one holds roughly the same number of clients, so the difference is not explained by group size.

I would recommend paying attention to the income level when approving a loan and paying particular attention to clients whose income level belongs to one of the middle-income categories.
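The pivot table above lists the categories alphabetically (high, low, medium-high, medium-low), which makes the trend harder to read. One possible fix, sketched on a hypothetical `toy` frame, is to turn the column into an ordered categorical so that grouped results follow the income order.

```python
import pandas as pd

order = ["low", "medium-low", "medium-high", "high"]

# Hypothetical data using the same category names as `total_income_cat`
toy = pd.DataFrame({
    "total_income_cat": ["high", "low", "medium-high", "medium-low", "low", "high"],
    "debt": [0, 1, 0, 1, 0, 0],
})

# An ordered categorical keeps the logical order instead of the alphabetical one
toy["total_income_cat"] = pd.Categorical(
    toy["total_income_cat"], categories=order, ordered=True
)

rates = toy.groupby("total_income_cat", observed=False)["debt"].mean()
print(rates)
```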



**How do different loan goals affect its repayment?**

In [91]:
# pivot table for 'purpose_lem'
data.pivot_table(index='purpose_lem', values='debt', aggfunc=['count','sum','mean'])


purpose_lem,count,sum,mean
автомобиль,4290,401,0.093473
недвижимость,10775,780,0.07239
образование,3998,369,0.092296
праздник,2315,183,0.07905


Let's express the number of clients with loan arrears as a percentage of all clients in each category:

* 9.3% car (`автомобиль`)
* 7.2% real estate (`недвижимость`)
* 9.2% education (`образование`)
* 7.9% weddings and other celebrations (`праздник`)


The percentages show that clients whose loan purpose is buying a car or education have the highest share of loan arrears.


### Step 4. General conclusion

The analysis of the client data leads to the following conclusions:

* **Does the number of children affect loan repayment:** yes, it does. Clients with two or more children have an arrears rate of about 9%, with the highest value, 9.8%, among clients with four children.
* **Does marital status affect loan repayment:** yes, it does. The highest arrears rates belong to single clients (9.7%) and clients in a de facto relationship (9.3%).
* **Does the level of income affect loan arrears:** yes, it does, since the highest arrears rate, 8.8%, belongs to clients with a medium income.
* **Does the purpose of the loan affect loan arrears:** yes, it does. The highest arrears rate, 9.3%, belongs to loans taken to buy a car.


