# Research on the Reliability of Borrowers


## Open the table and study the general information about the data.

__Task 1.__

Import the pandas library. Load the data from the CSV file into a dataframe and save it to the variable 'data'.

In [1]:
import pandas as pd

try:
    data = pd.read_csv('/Users/daniyardjumaliev/Jupyter/Projects/datasets/borrowers.csv')
except FileNotFoundError:
    # The local copy is missing, so fall back to the hosted dataset.
    data = pd.read_csv('https://code.s3.yandex.net/datasets/data.csv')

__Task 2.__

Display the first 20 rows of the data dataframe on the screen.

In [2]:
data.head(20)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


__Task 3.__

Display basic information about the dataframe using the info() method.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


## Data Preprocessing

### Removing Missing Values

__Task 4.__

Display the number of missing values for each column using a combination of two methods.

In [4]:
data.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

__Task 5.__

There are missing values in two columns. One of them is 'days_employed', which we will handle in the next step. The other column with missing values is 'total_income', which contains data on income. Income is strongly influenced by employment type, so we need to fill in the missing values in this column with the median income for each income type from the 'income_type' column. For example, a person with the employment type 'employee' should have the missing value in the 'total_income' column filled with the median income among all records with the same type of employment.

In [5]:
for t in data['income_type'].unique():
    data.loc[(data['income_type'] == t) & (data['total_income'].isna()), 'total_income'] = \
    data.loc[(data['income_type'] == t), 'total_income'].median()
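
The same group-wise fill can be sketched without an explicit loop. The example below is a minimal illustration on a made-up toy frame (the values and the English income types are assumptions, not rows from the dataset). It uses `groupby(...).transform('median')` to broadcast each group's median back to the rows and `fillna()` to fill only the gaps.

```python
import pandas as pd

# Toy frame with hypothetical values that mimics the two relevant columns.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [100.0, None, 300.0, 50.0, None],
})

# Median income of each income_type group, aligned with the original rows.
# Missing values are skipped when the median is computed.
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill only the missing incomes; existing values stay untouched.
toy['total_income'] = toy['total_income'].fillna(group_median)

print(toy)
```

Both approaches should give the same result here. The loop makes the per-type logic explicit, while `transform` keeps it to two statements.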

### Handling Anomalies

__Task 6.__

The data may contain artifacts (anomalies): values that do not reflect reality and appeared because of some error. One such artifact is a negative number of days of work experience in the days_employed column. Artifacts like this are normal in real data. Let's process the values in this column by replacing all negative values with positive ones using the abs() method.

In [6]:
data['days_employed'] = data['days_employed'].abs()

__Task 7.__

For each type of employment, let's display the median value of work experience days_employed in days.

In [7]:
data.groupby('income_type')['days_employed'].agg('median')

income_type
безработный        366413.652744
в декрете            3296.759962
госслужащий          2689.368353
компаньон            1547.382223
пенсионер          365213.306266
предприниматель       520.848083
сотрудник            1574.202821
студент               578.751554
Name: days_employed, dtype: float64

For two types, unemployed ('безработный') and retirees ('пенсионер'), the medians are anomalously large: about 365,000 days, or roughly 1,000 years. It is difficult to correct such values, so we will leave them as they are. Moreover, we will not need this column for our research.

__Task 8.__

Let's display the list of unique values in the children column.

In [8]:
data['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5])

__Task 9.__

There are two anomalous values in the children column. Let's remove the rows that contain such anomalous values from the data dataframe.

In [9]:
data = data[(data['children'] != -1) & (data['children'] != 20)]

__Task 10.__

Let's once again display the list of unique values in the children column to ensure that the artifacts have been removed.

In [10]:
data['children'].unique()

array([1, 0, 3, 2, 4, 5])

### Removing Missing Values (Continued)

__Task 11.__

Let's fill in the missing values in the days_employed column with the median values for each income type income_type.

In [11]:
for t in data['income_type'].unique():
    data.loc[(data['income_type'] == t) & (data['days_employed'].isna()), 'days_employed'] = \
    data.loc[(data['income_type'] == t), 'days_employed'].median()

__Task 12.__

Let's make sure that all missing values are filled in. To double-check, we'll once again display the count of missing values for each column using two methods.

In [12]:
data.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

### Changing Data Types

__Task 13.__

Let's change the data type in the total_income column from floating-point to integer using the astype() method.

In [13]:
data['total_income'] = data['total_income'].astype(int)

### Duplicate Handling

__Task 14.__

Let's handle implicit duplicates in the education column. In this column, there are identical values written differently, using both uppercase and lowercase letters. We will convert them to lowercase. We should also check the other columns.

In [14]:
data['education'] = data['education'].str.lower()

__Task 15.__

Let's display the number of duplicate rows in the data. If there are any duplicate rows, we will remove them.

In [15]:
data.duplicated().sum()

71

In [16]:
data = data.drop_duplicates()
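
The order of these two steps matters: rows that differ only in letter case are not counted as duplicates until the text is normalized. The sketch below shows this on a made-up toy frame (the values are assumptions for illustration). Resetting the index afterwards is an optional extra step that the notebook itself does not perform.

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0 and 1 differ only in letter case.
toy = pd.DataFrame({
    'education': ['Higher', 'higher', 'SECONDARY', 'secondary'],
    'dob_years': [42, 42, 36, 50],
})

# Before normalization no row is an exact copy of another.
duplicates_before = toy.duplicated().sum()

# Lowercasing turns the implicit duplicate into an explicit one.
toy['education'] = toy['education'].str.lower()
duplicates_after = toy.duplicated().sum()

# Drop the duplicates and renumber the rows so the index has no gaps.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(duplicates_before, duplicates_after, len(deduplicated))
```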

### Categorization of Data

__Task 16.__

Based on the specified ranges, let's create a total_income_category column in the data dataframe with the following categories:

- 0–30000 — 'E';
- 30001–50000 — 'D';
- 50001–200000 — 'C';
- 200001–1000000 — 'B';
- 1000001 and above — 'A'.

For example, we will assign category 'E' to a borrower with an income of 25000, and category 'B' to a client with an income of 235000. We'll achieve this by using a custom function named categorize_income() and the apply() method.

In [17]:
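def categorize_income(income):
    """Map a numeric income to a category from 'E' (lowest) to 'A' (highest)."""
    try:
        if 0 <= income <= 30000:
            return 'E'
        elif 30001 <= income <= 50000:
            return 'D'
        elif 50001 <= income <= 200000:
            return 'C'
        elif 200001 <= income <= 1000000:
            return 'B'
        elif income >= 1000001:
            return 'A'
    except TypeError:
        # Non-numeric input cannot be compared; leave the category empty.
        return None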

In [18]:
data['total_income_category'] = data['total_income'].apply(categorize_income)
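
As an alternative to a custom function, the same ranges can be expressed as bin edges for `pd.cut`. This is a minimal sketch rather than the approach used in this notebook. The sample incomes are made up, and the result matches the function only for whole-number incomes, which is the case here because total_income was converted to int.

```python
import pandas as pd

# Bin edges taken from the category ranges above. With the default right=True
# every upper edge is inclusive, so 30000 falls into 'E' and 30001 into 'D'.
bins = [0, 30000, 50000, 200000, 1000000, float('inf')]
labels = ['E', 'D', 'C', 'B', 'A']

# Hypothetical incomes chosen to sit on and around the boundaries.
incomes = pd.Series([25000, 30000, 30001, 235000, 1500000])

# include_lowest=True makes the first interval include an income of exactly 0.
categories = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)

print(categories.tolist())
```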

__Task 17.__

Let's display the list of unique purposes for taking a loan from the purpose column.

In [19]:
data['purpose'].unique()

array(['покупка жилья', 'приобретение автомобиля',
       'дополнительное образование', 'сыграть свадьбу',
       'операции с жильем', 'образование', 'на проведение свадьбы',
       'покупка жилья для семьи', 'покупка недвижимости',
       'покупка коммерческой недвижимости', 'покупка жилой недвижимости',
       'строительство собственной недвижимости', 'недвижимость',
       'строительство недвижимости', 'на покупку подержанного автомобиля',
       'на покупку своего автомобиля',
       'операции с коммерческой недвижимостью',
       'строительство жилой недвижимости', 'жилье',
       'операции со своей недвижимостью', 'автомобили',
       'заняться образованием', 'сделка с подержанным автомобилем',
       'получение образования', 'автомобиль', 'свадьба',
       'получение дополнительного образования', 'покупка своего жилья',
       'операции с недвижимостью', 'получение высшего образования',
       'свой автомобиль', 'сделка с автомобилем',
       'профильное образование', 'высшее об

__Task 18.__

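Let's create a function that, based on the data in the purpose column, generates a new column purpose_category containing the following categories (the dataset is in Russian, so the code stores the Russian labels shown in parentheses):

- 'Car-related Operations' ('операции с автомобилем'),
- 'Real Estate Operations' ('операции с недвижимостью'),
- 'Wedding Expenses' ('проведение свадьбы'),
- 'Education Expenses' ('получение образования').

For example, if the purpose column contains the value 'на покупку своего автомобиля' (to buy one's own car), then the purpose_category column should contain 'операции с автомобилем' ('Car-related Operations').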

We'll achieve this by using a custom function named categorize_purpose() and the apply() method. We'll examine the data in the purpose column and determine which substrings will help us correctly categorize it.

In [20]:
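def categorize_purpose(purpose):
    """Map a free-text loan purpose to one of four categories by Russian word stem."""
    try:
        if 'автом' in purpose:  # car
            return 'операции с автомобилем'
        elif 'жил' in purpose or 'недвиж' in purpose:  # housing or real estate
            return 'операции с недвижимостью'
        elif 'свад' in purpose:  # wedding
            return 'проведение свадьбы'
        elif 'образов' in purpose:  # education
            return 'получение образования'
    except TypeError:
        # Non-string input (for example NaN) does not support the `in` check.
        pass
    # No stem matched, or the input was not a string.
    return 'нет категории'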

In [21]:
data['purpose_category'] = data['purpose'].apply(categorize_purpose)
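
The stem matching can also be written as a lookup table, which keeps the stems and labels in one place if more categories are added later. This is a sketch only: the names `PURPOSE_KEYWORDS` and `categorize_purpose_by_table` are hypothetical, while the stems and labels are the ones used in categorize_purpose() above.

```python
# Hypothetical lookup table: each entry pairs the word stems to search for
# with the category label to return. Order matters, as in the if/elif chain.
PURPOSE_KEYWORDS = [
    (('автом',), 'операции с автомобилем'),             # car
    (('жил', 'недвиж'), 'операции с недвижимостью'),    # housing or real estate
    (('свад',), 'проведение свадьбы'),                  # wedding
    (('образов',), 'получение образования'),            # education
]


def categorize_purpose_by_table(purpose):
    """Return the first category whose stem occurs in the purpose text."""
    for stems, category in PURPOSE_KEYWORDS:
        if any(stem in purpose for stem in stems):
            return category
    # Explicit fallback so unmatched purposes are easy to spot.
    return 'нет категории'


print(categorize_purpose_by_table('на покупку подержанного автомобиля'))
```

After applying either function, a quick check such as counting the values of purpose_category shows whether any purpose fell through to the fallback label.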

### Step 3. Explore the Data and Answer the Questions

#### 3.1 Is there a correlation between the number of children and the timely repayment of the loan?

In [22]:
data_children = data.loc[:,['children','debt']]
data_children_all = data_children.groupby('children')['debt'].count()

data_children_debt = data[data['debt'] == 1]
data_children_debt = data_children_debt.loc[:,['children','debt']]
data_children_debt = data_children_debt.groupby('children')['debt'].count()

data_children_no_debt = data[data['debt'] == 0]
data_children_no_debt = data_children_no_debt.loc[:,['children','debt']]
data_children_no_debt = data_children_no_debt.groupby('children')['debt'].count()

result = data_children_debt / data_children_all
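
Because debt is a 0/1 flag, the same counts and the debt rate can be obtained in a single grouped aggregation: the count gives the number of clients, the sum gives the number of debtors, and the mean gives the share of debtors. The sketch below runs on a made-up toy frame (the values and the output column names are assumptions). One practical difference is that a group with no debtors still appears, with a rate of 0, whereas dividing two separately filtered counts yields NaN for such a group.

```python
import pandas as pd

# Toy frame with hypothetical values; debt is 1 for a client who had a delay.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

# One pass over the groups: clients, debtors and the share of debtors.
summary = toy.groupby('children')['debt'].agg(
    clients='count',
    debtors='sum',
    debt_rate='mean',
)

# Clients who repaid on time are the remainder.
summary['on_time'] = summary['clients'] - summary['debtors']

print(summary)
```

The same pattern applies to the other groupings in this step by swapping the grouping column.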

The groups are categorized by the number of children in the family.

__Total number of clients:__

|Number of Children|Number of Clients|
|-------:|----:|
|0       |14091|
|1       |4808|
|2       |2052|
|3       |330|
|4       |41|
|5       |9|

__Total number of debtors:__

|Number of Children|Debtors|
|-------:|----:|
|0       |1063|
|1       |444|
|2       |194|
|3       |27|
|4       |4|
|5       |0|

__Total number of clients who repaid on time:__

|Number of Children|Clients|
|-------:|----:|
|0       |13028|
|1       |4364|
|2       |1858|
|3       |303|
|4       |37| 
|5       |9| 

__Percentage of debtors from the total number of clients:__

|Number of Children|% Debtors|
|-------:|----:|
|0       |7.54%|
|1       |9.23%|
|2       |9.45%|
|3       |8.18%|
|4       |9.76%|
|5       |0.00%|

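**Conclusion:** _Among the groups large enough to compare, clients with no children have the lowest share of debtors (7.54%). The 0.00% rate for clients with five children is based on only 9 clients and should not be relied on. Clients without children are therefore more likely to repay loans on time than clients with children._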

#### 3.2 Is there a correlation between marital status and the timely repayment of the loan?

In [23]:
data_family = data.loc[:,['family_status','debt']]
data_family_all = data_family.groupby('family_status')['debt'].count()

data_family_debt = data[data['debt'] == 1]
data_family_debt = data_family_debt.loc[:,['family_status','debt']]
data_family_debt = data_family_debt.groupby('family_status')['debt'].count()

data_family_no_debt = data[data['debt'] == 0]
data_family_no_debt = data_family_no_debt.loc[:,['family_status','debt']]
data_family_no_debt = data_family_no_debt.groupby('family_status')['debt'].count()

result = data_family_debt / data_family_all
data_family_all

family_status
Не женат / не замужем     2796
в разводе                 1189
вдовец / вдова             951
гражданский брак          4134
женат / замужем          12261
Name: debt, dtype: int64

The groups are categorized by marital status.

__Total number of clients:__

|Marital Status      | Number of Clients|
|--------------------|-----|
|Not Married   |2796|
|Divorced             |1189|
|Widowed          |951|
|Civil Partnership       |4134|
|Married        |12261|

__Total number of debtors:__

|Marital Status     | Debtors|
|--------------------|-----|
|Not Married   |273|
|Divorced             |84|
|Widowed          |63|
|Civil Partnership       |385|
|Married       |927|

__Total number of clients who repaid on time:__

|Marital Status      | On Time Repayment|
|--------------------|-----|
|Not Married   |2523|
|Divorced             |1105|
|Widowed          |888|
|Civil Partnership       |3749|
|Married        |11334|

__Percentage of debtors from the total number of clients:__

|Marital Status      | % Debtors|
|--------------------|-----|
|Not Married   |9.76%|
|Divorced             |7.06%|
|Widowed          |6.62%|
|Civil Partnership       |9.31%|
|Married        |7.56%|

**Conclusion:** _People who are either previously married or currently married are more likely to repay loans on time._

#### 3.3 Is there a correlation between income level and the timely repayment of the loan?

In [24]:
data_income = data.loc[:,['total_income_category','debt']]
data_income_all = data_income.groupby('total_income_category')['debt'].count()

data_income_debt = data[data['debt'] == 1]
data_income_debt = data_income_debt.loc[:,['total_income_category','debt']]
data_income_debt = data_income_debt.groupby('total_income_category')['debt'].count()

data_income_no_debt = data[data['debt'] == 0]
data_income_no_debt = data_income_no_debt.loc[:,['total_income_category','debt']]
data_income_no_debt = data_income_no_debt.groupby('total_income_category')['debt'].count()

result = data_income_debt / data_income_all
data_income_all

total_income_category
A       25
B     5014
C    15921
D      349
E       22
Name: debt, dtype: int64

The client groups are categorized based on income level.

__Total number of clients:__

|Income Level          |Number of Clients|
|----------------------|------|
|Income above 1,000,000 rubles        |25 |
|Income from 200,001 to 1,000,000 rubles|5014 |
|Income from 50,001 to 200,000 rubles   |15921 |
|Income from 30,001 to 50,000 rubles    |349 |
|Income up to 30,000 rubles              |22 |

__Total number of debtors:__

|Income Level          |Debtors|
|----------------------|------|
|Income above 1,000,000 rubles        |2 |
|Income from 200,001 to 1,000,000 rubles|354 |
|Income from 50,001 to 200,000 rubles   |1353 |
|Income from 30,001 to 50,000 rubles    |21 |
|Income up to 30,000 rubles              |2 |

__Total number of clients who repaid on time:__

|Income Level          |On Time Repayment|
|----------------------|------|
|Income above 1,000,000 rubles        |23 |
|Income from 200,001 to 1,000,000 rubles|4660 |
|Income from 50,001 to 200,000 rubles   |14568 |
|Income from 30,001 to 50,000 rubles    |328 |
|Income up to 30,000 rubles              |20 |

__Percentage of debtors from the total number of clients:__

|Income Level          |% Debtors|
|----------------------|------|
|Income above 1,000,000 rubles        |8.00% |
|Income from 200,001 to 1,000,000 rubles|7.06% |
|Income from 50,001 to 200,000 rubles   |8.50% |
|Income from 30,001 to 50,000 rubles    |6.02% |
|Income up to 30,000 rubles              |9.09% |

**Conclusion:** _The income groups above 1,000,000 rubles and up to 50,000 rubles contain too few clients, so drawing conclusions across all five groups would be incorrect. That leaves only two groups, and among them clients with incomes from 200,001 to 1,000,000 rubles are less likely to have payment delays. However, these findings should not be treated as statistically significant._
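
To make the concern about small groups concrete, one option is to attach an uncertainty range to each percentage. The sketch below uses the Wilson score interval, which is a technique chosen for this illustration and is not part of the original analysis. The function name `wilson_interval` is hypothetical, `z=1.96` corresponds to an approximate 95% level, and the counts are the debtor and client totals from the tables above.

```python
import math


def wilson_interval(debtors, clients, z=1.96):
    """Return the (low, high) Wilson score interval for debtors / clients."""
    p = debtors / clients
    denom = 1 + z ** 2 / clients
    centre = (p + z ** 2 / (2 * clients)) / denom
    half = z * math.sqrt(p * (1 - p) / clients + z ** 2 / (4 * clients ** 2)) / denom
    return centre - half, centre + half


# (debtors, clients) per income category, copied from the tables above.
groups = {
    'A': (2, 25),
    'B': (354, 5014),
    'C': (1353, 15921),
    'D': (21, 349),
    'E': (2, 22),
}

for name, (debtors, clients) in groups.items():
    low, high = wilson_interval(debtors, clients)
    print(f'{name}: {debtors / clients:.2%} (interval {low:.2%} to {high:.2%})')
```

The intervals for the groups with a few dozen clients come out far wider than those for groups B and C, which is the reason the comparison is limited to the two large groups.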

#### 3.4 How do different loan purposes affect timely loan repayment?

In [25]:
data_purpose = data.loc[:,['purpose_category','debt']]
data_purpose_all = data_purpose.groupby('purpose_category')['debt'].count()

data_purpose_debt = data[data['debt'] == 1]
data_purpose_debt = data_purpose_debt.loc[:,['purpose_category','debt']]
data_purpose_debt = data_purpose_debt.groupby('purpose_category')['debt'].count()

data_purpose_no_debt = data[data['debt'] == 0]
data_purpose_no_debt = data_purpose_no_debt.loc[:,['purpose_category','debt']]
data_purpose_no_debt = data_purpose_no_debt.groupby('purpose_category')['debt'].count()

result = data_purpose_debt / data_purpose_all

data_purpose_no_debt

purpose_category
операции с автомобилем      3879
операции с недвижимостью    9971
получение образования       3619
проведение свадьбы          2130
Name: debt, dtype: int64

Groups are categorized based on the purpose of the loan:

__Total number of clients:__

|Loan Purpose            |Number of Clients|
|-----------------------|---------|
|Car Operations     |4279|
|Real Estate Operations   |10751|
|Education       |3988|
|Wedding         |2313|

__Total number of debtors:__

|Loan Purpose             |Debtors|
|-----------------------|---------|
|Car Operations     |400|
|Real Estate Operations   |780|
|Education       |369|
|Wedding         |183|

__Total number of on-time repayments:__

|Loan Purpose             |On Time|
|-----------------------|---------|
|Car Operations     |3879|
|Real Estate Operations   |9971|
|Education       |3619|
|Wedding         |2130|

__Percentage of debtors out of the total number of clients:__


|Loan Purpose             |% Debtors|
|-----------------------|---------|
|Car Operations    |9.35%|
|Real Estate Operations   |7.26%|
|Education       |9.25%|
|Wedding         |7.91%|

**Conclusion:** _Loans related to real estate are more often repaid on time. Clients taking out loans for cars and education are less reliable._

#### 3.5 Provide possible reasons for missing data in the source data.

*Answer:* I believe that in most cases, the reason for missing data is human error. It can be either due to carelessness or situations where there is no suitable data to categorize, and the operator leaves it blank. If the clients themselves fill out the forms, then errors and omissions may occur more frequently.

Additionally, data quality can be significantly affected by errors in data processing, storage, and transmission from the server. It's also possible that there are coding errors in data collection processes.

#### 3.6 Explain why filling in missing values with the median is the best solution for quantitative variables.

*Answer:* Because median values are less influenced by extreme outliers in the data compared to the mean.
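
A small worked example shows the effect. The incomes below are made up for illustration; adding a single extreme value shifts the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical incomes with no outliers: mean and median coincide.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]

# The same sample with one extreme income added.
with_outlier = incomes + [2_000_000]

print(mean(incomes), median(incomes))            # both are 50000
print(mean(with_outlier), median(with_outlier))  # mean 375000, median 52500
```

A missing income filled with the mean of the second sample would be set to 375,000, far above what most of these clients earn, while the median stays close to the typical value.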

### Step 4: General Conclusion.

1. Is there a correlation between the number of children and loan repayment on time?
Yes, there is a correlation. In this client sample, _14,091_ people do not have children, and _7,240_ have one or more children. The analysis found that the group without children has a debt rate of _7.54%_, while the groups with one to four children range from _8.18%_ to _9.76%_. The group with five children has no debtors, but it contains only 9 clients. This supports the hypothesis that clients without children are more likely to repay loans on time.

2. Is there a correlation between marital status and loan repayment on time?
Yes, there is a correlation. The sample contains _2,796_ unmarried clients, _4,134_ clients in civil partnerships, _12,261_ married clients, _951_ widows and widowers, and _1,189_ divorced clients. The debt rate for clients who are married, divorced, or widowed is _7.56%_ or lower. Meanwhile, clients who are not married or are in a civil partnership have a rate of _9.31%_ or higher. This supports the hypothesis that marital status affects on-time loan repayment.

3. Is there a correlation between income level and loan repayment on time?
In this category, it is not possible to conduct a precise analysis because there is too little data for 3 out of the 5 income groups, specifically:

|Income Level          |Number of Clients|
|----------------------|------|
|Income above 1,000,000 rubles        |25 |
|Income from 200,001 to 1,000,000 rubles|5014 |
|Income from 50,001 to 200,000 rubles   |15921 |
|Income from 30,001 to 50,000 rubles    |349 |
|Income up to 30,000 rubles              |22 |

Given such a significant imbalance, I consider counting and analyzing this data to be impractical.

4. How do different loan purposes affect timely loan repayment?
Yes, the loan purpose is related to repayment. The analysis shows that the most reliable borrowers are those taking loans for real estate operations.

|Loan Purpose            |Number of Clients|
|-----------------------|---------|
|Car Operations     |4279|
|Real Estate Operations   |10751|
|Education       |3988|
|Wedding         |2313|

Although real estate loans make up the largest group, they also have the lowest share of debtors.

|Loan Purpose             |% Debtors|
|-----------------------|---------|
|Car Operations     |9.35%|
|Real Estate Operations   |7.26%|
|Education       |9.25%|
|Wedding         |7.91%|

**General Conclusion:** When granting a loan, it is recommended to consider the client's marital status, the number of children in the family, and the purpose of the loan. An ideal borrower is __a married client without children who is seeking a mortgage__.