## Research of Borrowers' Reliability

This straightforward project primarily focuses on data preparation and manipulation. The goal is to extract insights from the provided dataset.

The customer for this analysis is a bank's credit department, interested in understanding whether a client's marital status and number of children impact their loan repayment timeliness. We assume that the bank's dataset represents customer solvency statistics.

These findings will be considered when developing a credit scoring model – a specialized system for evaluating a potential borrower's ability to repay a loan.

The dataset describes the bank's clients, including marital status, number of children, education, income, gender, and loan debt. The columns hold text, integer, and floating-point values.

The table columns are as follows:

* `children` - the number of children in the family;
* `days_employed` - total work experience in days;
* `dob_years` - the client's age in years;
* `education` - the client's level of education;
* `education_id` - identifier of the education level;
* `family_status` - marital status;
* `family_status_id` - identifier of the marital status;
* `gender` - the client's gender;
* `income_type` - type of employment;
* `debt` - whether the client has had loan arrears;
* `total_income` - total income;
* `purpose` - purpose of the loan.

The Borrowers' Reliability study consists of the following stages:

1. Data cleanup: searching for missing data, restoring data, and removing duplicates.
2. Data lemmatization.
3. Data categorization.
4. Drawing conclusions for each section separately and forming an overall conclusion.

## Data investigation

The first step is to get a general overview of the data.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# libraries import
import pandas as pd
!pip install pymystem3==0.1.10
from pymystem3 import Mystem
m = Mystem()

# read the data from the CSV file
data = pd.read_csv("/content/drive/MyDrive/da_portfolio/data_credit.csv")

# looking into the table
display(data.head(10))

# general information about the dataset
display(data.info())

# statistics of the table
display(data.describe())

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymystem3==0.1.10
  Downloading pymystem3-0.1.10-py3-none-any.whl (10 kB)
Installing collected packages: pymystem3
  Attempting uninstall: pymystem3
    Found existing installation: pymystem3 0.2.0
    Uninstalling pymystem3-0.2.0:
      Successfully uninstalled pymystem3-0.2.0
Successfully installed pymystem3-0.1.10


Installing mystem to /root/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


None

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,167422.3
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,102971.6
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,20667.26
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,103053.2
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,145017.9
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,203435.1
max,20.0,401755.400475,75.0,4.0,4.0,1.0,2265604.0


Upon an initial examination of the data, it's evident that not all columns contain information in a format suitable for analysis.

For instance, the `days_employed` column contains negative values and floating-point values, while the minimum number of children is "-1". In columns with textual data, the text is not standardized: both uppercase and lowercase letters are used, and typos may be present.

The general information about the table also shows that `days_employed` and `total_income` have fewer non-null values than the other columns, so some values are missing.

The columns `children`, `family_status`, `total_income`, and `debt` are crucial for testing various assumptions, so the primary focus will be on cleaning these columns for further analysis. The remaining columns will also be processed as needed.
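
The issues noted in this overview can be counted directly. The following is a minimal sketch on a made-up `toy` frame: its values are assumptions chosen to mimic the problems seen above, not actual rows from the bank's dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "children": [1, -1, 0, 20],
    "days_employed": [-8437.67, 340266.07, -926.19, -152.78],
    "education": ["высшее", "Среднее", "СРЕДНЕЕ", "среднее"],
})

# How many values are negative in each numeric column
negative_children = (toy["children"] < 0).sum()
negative_days = (toy["days_employed"] < 0).sum()

# How many distinct spellings exist before and after lowercasing
spellings_raw = toy["education"].nunique()
spellings_lower = toy["education"].str.lower().nunique()

print(negative_children, negative_days, spellings_raw, spellings_lower)
```

A gap between the two `nunique()` results is a quick sign that a text column mixes letter cases.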

## Data cleanup

Let's begin by addressing the missing data. The overview shows that not all columns have the same number of non-null values, so some columns have gaps. The affected columns are:
* `days_employed` 
* `total_income`

In [5]:
# check the column names for extra spaces
data.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose'],
      dtype='object')

Next, I check what share of the data is missing in each column.

In [6]:
# check the share of missing values per column, in percent
pd.DataFrame(round((data.isna().mean()*100),2)).style.background_gradient("coolwarm")

Unnamed: 0,0
children,0.0
days_employed,10.1
dob_years,0.0
education,0.0
education_id,0.0
family_status,0.0
family_status_id,0.0
gender,0.0
income_type,0.0
debt,0.0


About 10% of the rows is too much to drop, so I will fill the gaps with median values.

In [7]:
# fill missing total_income values with the median of the same income_type
data["total_income"] = data.groupby("income_type")["total_income"].transform(lambda x: x.fillna(x.median()))

# fill missing days_employed values with the median of the same income_type
data["days_employed"] = data.groupby("income_type")["days_employed"].transform(lambda x: x.fillna(x.median()))

In [8]:
# check that no missing values remain
data.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

Before the fill, 2,174 values were missing in the `total_income` column.

Interestingly, the `days_employed` column had the same number of missing entries, 2,174. It's possible that some clients were unemployed at the time of data collection, making it difficult for them to provide information on their earnings.

Considering that the total number of rows in the column is 21,525, the 2,174 missing entries account for approximately 10% of the total rows. In my opinion, 10% is a significant portion, so I will not remove these rows, as doing so may impact the final results.
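
The arithmetic behind these figures can be reproduced from the `data.info()` output shown earlier (21,525 rows in total, 19,351 non-null values in `total_income`). This is only a worked check of the numbers; the variable names are illustrative.

```python
# Row counts taken from the data.info() output above
total_rows = 21525
non_null_income = 19351

# Number of missing values and their share of all rows, in percent
missing = total_rows - non_null_income
missing_share = missing / total_rows * 100

print(missing, round(missing_share, 1))
```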

To address this issue, I filled in the gaps with median values based on the respective categories. This ensures that the median income of the "pensioner" category, for example, does not influence the median income of the "employee" category, and vice versa. Each category has its own median income, and the missing values are filled accordingly.

Following the same principle, the missing values in the `days_employed` column have been filled in.
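
To make the group-wise fill easier to follow, here is a minimal sketch on a made-up `toy` frame. The category names and income values are assumptions for illustration, not rows from the bank's dataset, and the sketch uses `transform("median")` followed by `fillna`, which should behave the same way as the lambda used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "pensioner", "pensioner"],
    "total_income": [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Median of each income_type, broadcast back to the original row order
group_median = toy.groupby("income_type")["total_income"].transform("median")

# Each missing value takes the median of its own group only
toy["total_income"] = toy["total_income"].fillna(group_median)

print(toy)
```

The missing "employee" value is filled from the employee median and the missing "pensioner" value from the pensioner median, so the two groups do not influence each other.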

For the analysis, I will use the data from the `total_income` column. The data type of this column is `float`. To simplify further analysis, I will change its data type to `int`.

In [9]:
# changing the data type
data["total_income"] = data["total_income"].astype("int")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  int64  
 11  purpose           21525 non-null  object 
dtypes: float64(1), int64(6), object(5)
memory usage: 2.0+ MB


The data type in the column for analysis has been updated, as reflected in the dataset information (`total_income 21525 non-null int64`).

Since the `days_employed` column will not be utilized in this analysis, I have opted not to modify the data type for this particular column.


Next, I look for duplicate rows. Text values that differ only in letter case would hide duplicates, so I first convert all string values to lowercase. Because some columns are numeric, the `str.lower()` method cannot be applied to the whole dataset, so I apply it to each text column individually.



In [10]:
# convert the text columns to lowercase
data["education"] = data["education"].str.lower()
data["family_status"] = data["family_status"].str.lower()
data["income_type"] = data["income_type"].str.lower()
data["purpose"] = data["purpose"].str.lower()


In [11]:
# checking for duplicates
data.duplicated().sum()

71

In [12]:
# remove duplicates
data = data.drop_duplicates().reset_index(drop=True)

In [13]:
# checking for duplicates again
data.duplicated().sum()

0

All 71 duplicate rows have been removed.
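
As a possible variation on the steps above, the text columns can be detected automatically rather than listed by hand. The sketch below runs on a made-up `toy` frame and is an assumption about how one might generalize the approach, not the notebook's own code. Note that it would also lowercase columns such as `gender`, which the notebook leaves untouched.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "education": ["Среднее", "среднее", "ВЫСШЕЕ"],
    "children": [0, 0, 1],
})

# Lowercase every column that holds strings; numeric columns are skipped
for column in toy.columns:
    if is_string_dtype(toy[column]):
        toy[column] = toy[column].str.lower()

# Rows that differed only in letter case are now exact duplicates
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(deduplicated)
```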

## Lemmatization

I use the pymystem3 library to lemmatize the loan purposes.

In [14]:
# check which loan purposes occur in the dataset
data["purpose"].value_counts()

свадьба                                   791
на проведение свадьбы                     768
сыграть свадьбу                           765
операции с недвижимостью                  675
покупка коммерческой недвижимости         661
операции с жильем                         652
покупка жилья для сдачи                   651
операции с коммерческой недвижимостью     650
покупка жилья                             646
жилье                                     646
покупка жилья для семьи                   638
строительство собственной недвижимости    635
недвижимость                              633
операции со своей недвижимостью           627
строительство жилой недвижимости          624
покупка недвижимости                      621
покупка своего жилья                      620
строительство недвижимости                619
ремонт жилью                              607
покупка жилой недвижимости                606
на покупку своего автомобиля              505
заняться высшим образованием      

In [15]:
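# map a loan purpose to a category based on its lemmas
def lemm(purpose):
    # Mystem returns a list of lemmas; join them into one string for keyword search
    lemma = ' '.join(m.lemmatize(purpose))
    if ("жилье" in lemma) or ("недвижимость" in lemma):
        return "real estate"
    elif "автомобиль" in lemma:
        return "car"
    elif "образование" in lemma:
        return "education"
    else:
        # everything else (wedding-related purposes in this dataset)
        return "celebrations"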
 

# applying function to the table
data['purpose_lem'] = data['purpose'].apply(lemm)

In [16]:
# checking the result
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_lem
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875,покупка жилья,real estate
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080,приобретение автомобиля,car
2,0,-5623.42261,33,среднее,1,женат / замужем,0,M,сотрудник,0,145885,покупка жилья,real estate
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628,дополнительное образование,education
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616,сыграть свадьбу,celebrations


To facilitate the analysis, I standardized the loan purposes through lemmatization. Since the text data in the table is in Russian, I used the pymystem3 library, which simplifies working with such data. The new column, `purpose_lem`, stores the purpose category derived from the lemmas: real estate, car, education, or celebrations.
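
One possible variation is to keep the keywords in a lookup table, so that new categories can be added without touching the branching logic. The sketch below is an assumption, not the notebook's method: `CATEGORY_KEYWORDS` and `categorize_lemmas` are hypothetical names, the function expects a list of lemmas such as the one returned by `Mystem().lemmatize`, and unlike `lemm` it matches weddings explicitly and returns `"other"` for anything unrecognized.

```python
# Hypothetical keyword table: lemma -> purpose category
CATEGORY_KEYWORDS = {
    "жилье": "real estate",
    "недвижимость": "real estate",
    "автомобиль": "car",
    "образование": "education",
    "свадьба": "celebrations",
}


def categorize_lemmas(lemmas, default="other"):
    """Return the category of the first keyword found in a list of lemmas."""
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in lemmas:
            return category
    return default


# Lemma lists written by hand here, in the shape lemmatize() produces
housing = categorize_lemmas(["покупка", " ", "жилье", "\n"])
wedding = categorize_lemmas(["сыграть", " ", "свадьба", "\n"])
unknown = categorize_lemmas(["ремонт", "\n"])

print(housing, wedding, unknown)
```

An explicit default makes unexpected purposes visible instead of silently counting them as celebrations.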



## Data categorization


Now I need to categorize the data.

In [17]:
# replace negative values with their absolute values
data["children"] = data["children"].abs()

# count the values of 'children'
data["children"].value_counts()

0     14091
1      4855
2      2052
3       330
20       76
4        41
5         9
Name: children, dtype: int64

The value `20` for the number of children looks like an outlier. While it's not impossible for a family to have 20 children, 76 such cases among the 21,454 remaining clients is highly improbable. It's possible that someone made a mistake while filling out the form, or that there was an error in how the information was recorded.

It's unclear how to reassign these rows (for example, whether they should count as `2` children), and keeping them would distort the analysis, so I've decided to remove the rows with `20` children. Given the size of the table, this is acceptable: the 76 rows represent about 0.4% of all customers. Even if some families really do have 20 children, there are too few of them to be statistically meaningful.

In [18]:
# removing outliers
data = data.drop(data[data.children == 20].index)

# checking the result
data["children"].value_counts()

0    14091
1     4855
2     2052
3      330
4       41
5        9
Name: children, dtype: int64

The number of children per family now ranges from zero to five and there are no outliers, so we can move on to the actual data analysis.
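
The outlier removal can also be expressed with a boolean mask, which makes it easy to report the share of affected rows before dropping them. This is a minimal sketch on a made-up `toy` frame, so the share it prints has nothing to do with the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"children": [0, 1, 20, 2, 20, 0]})

# True for rows holding the suspicious value
outlier_mask = toy["children"] == 20

# The mean of a boolean mask is the fraction of True values
outlier_share = outlier_mask.mean() * 100

# Keep only the rows that are not flagged
filtered = toy[~outlier_mask].reset_index(drop=True)

print(round(outlier_share, 1), filtered["children"].tolist())
```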


Categorizing data by income is necessary, because the data table currently has too many unique values in the income column. By dividing them into categories, I make it easier to analyze without deleting or replacing the data. To categorize income data, I use quartiles.

In [19]:
# before building an income pivot table,
# I categorize the income to make the data easier to work with;
# to do this, I find the quartiles of the total_income column
data["total_income"].quantile([0.25,0.5,0.75])

0.25    107528.75
0.50    142594.00
0.75    195795.50
Name: total_income, dtype: float64

In [20]:
# a function that assigns an income category, depending on the quartile
def income_category(income):
    if income <= 107528:
        return "low"
    elif income <= 142594:
        return "medium-low"
    elif income <= 195795:
        return "medium-high"
    else:
        return "high"

data["total_income_cat"] = data["total_income"].apply(income_category)

Income is now split into four categories, from low to high. This makes the results of the analysis easier to read, since they are reported per category rather than per individual income value.
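
As an alternative to hard-coding the quartile boundaries, pandas can compute them and assign labels in one step with `pd.qcut`. The sketch below is an assumption about how this could look; the `incomes` values are invented, and the label names simply mirror the ones used above.

```python
import pandas as pd

# Toy incomes, invented for illustration only
incomes = pd.Series([50_000, 90_000, 120_000, 140_000,
                     160_000, 190_000, 250_000, 400_000])

labels = ["low", "medium-low", "medium-high", "high"]

# q=4 splits the values at the 25th, 50th and 75th percentiles;
# each bin includes its upper edge, like the <= comparisons above
income_cat = pd.qcut(incomes, q=4, labels=labels)

print(income_cat.value_counts())
```

With this approach the boundaries follow the data automatically if rows are added or removed.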

## Now let's answer some questions


**Is there a relationship between having children and paying back the loan on time?**

In [21]:
# pivot table for 'children'

data.pivot_table(index="children", values="debt", aggfunc=["count","sum","mean"])

Unnamed: 0_level_0,count,sum,mean
Unnamed: 0_level_1,debt,debt,debt
children,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,14091,1063,0.075438
1,4855,445,0.091658
2,2052,194,0.094542
3,330,27,0.081818
4,41,4,0.097561
5,9,0,0.0


The share of customers with loan arrears in each category:

* 7.5% without children
* 9.2% with one child
* 9.5% with two children
* 8.2% with three children
* 9.8% with four children
* 0% with five children

These shares show that customers with children fall into arrears more often than customers without children. The rate is highest for four children, but that group has only 41 clients, and the 0% for five children rests on just 9 clients, so both figures should be read with caution.
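
Since the same kind of table is needed for every question in this section, the calculation could be wrapped in a small helper. This is a sketch under assumptions: `debt_rate_by` is a hypothetical name, it uses `groupby` with named aggregation instead of `pivot_table`, and the `toy` frame is invented for illustration.

```python
import pandas as pd


def debt_rate_by(df, column):
    """Count clients and debtors per category and compute the arrears rate in percent."""
    summary = df.groupby(column)["debt"].agg(
        clients="count", debtors="sum", rate="mean"
    )
    summary["rate"] = (summary["rate"] * 100).round(1)
    # Highest arrears rate first
    return summary.sort_values("rate", ascending=False)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 1, 1],
})

result = debt_rate_by(toy, "children")
print(result)
```

Keeping the `clients` column next to the rate helps spot categories that are too small to support a conclusion.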

**Is there a relationship between marital status and repayment of the loan on time?**

In [22]:
# pivot table for 'family_status'
data.pivot_table(index="family_status", values="debt", aggfunc=["count","sum","mean"])

Unnamed: 0_level_0,count,sum,mean
Unnamed: 0_level_1,debt,debt,debt
family_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
в разводе,1193,84,0.070411
вдовец / вдова,955,63,0.065969
гражданский брак,4139,385,0.093018
женат / замужем,12290,928,0.075509
не женат / не замужем,2801,273,0.097465


The share of customers with loan arrears in each category:

* 7.0% divorced
* 6.6% widowed
* 9.3% de facto relationship
* 7.6% married
* 9.7% single

These shares show that single clients have the highest rate of loan arrears, followed by clients in a de facto relationship.

**Is there a relationship between the level of income and repayment of the loan on time?**

The table contains too many income values, and almost all of them are unique. To make the analysis easier, income was divided into categories in the data categorization step.

In [23]:
# pivot table for 'total_income_cat'
data.pivot_table(index="total_income_cat", values="debt", aggfunc=["count","sum","mean"])


Unnamed: 0_level_0,count,sum,mean
Unnamed: 0_level_1,debt,debt,debt
total_income_cat,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
high,5345,381,0.071282
low,5345,427,0.079888
medium-high,5227,445,0.085135
medium-low,5461,480,0.087896


The share of customers with loan arrears in each category:

* 7.1% high
* 8.0% low
* 8.5% medium-high
* 8.8% medium-low

These shares show that income has only a small effect on loan arrears when it is divided into four categories: the rates differ by less than two percentage points.

The two middle categories have the highest rates, while the high-income category has the lowest. Since the categories are based on quartiles and hold roughly the same number of clients, this difference is not explained by group size.

I would recommend taking income level into account when approving a loan, with particular attention to clients in the two middle-income categories.



**How do different loan goals affect its repayment?**

In [24]:
# pivot table for 'purpose_lem'
data.pivot_table(index="purpose_lem", values="debt", aggfunc=["count","sum","mean"])


Unnamed: 0_level_0,count,sum,mean
Unnamed: 0_level_1,debt,debt,debt
purpose_lem,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
car,4290,401,0.093473
celebrations,2315,183,0.07905
education,3998,369,0.092296
real estate,10775,780,0.07239


The share of customers with loan arrears in each category:

* 9.3% car
* 7.2% real estate
* 9.2% education
* 7.9% various celebrations and holidays


These shares show that customers borrowing for a car or for education have the highest rates of loan arrears.


## Conclusion

From the analysis of customer data, the following conclusions can be drawn:

* **Does the number of children affect loan repayment?** Yes, it does. About 7.5% of customers without children have loan arrears, compared with roughly 8.2% to 9.8% of customers with one to four children. The groups with four and five children are too small for firm conclusions.
* **Does marital status affect loan repayment?** Yes, it does. The highest shares of loan arrears are among single customers (~9.7%) and clients living in a de facto relationship (~9.3%).
* **Does the level of income affect the debt?** Yes, but only slightly. The highest share of loan arrears is ~8.8% for medium-low-income customers, and the lowest is ~7.1% for high-income customers.
* **Does the purpose of the loan affect the debt?** Yes, it does. The highest shares of arrears are for car loans (~9.3%) and education loans (~9.2%).

