# Analyzing borrowers’ risk of defaulting

The goal of this project is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

This report will be considered when building a **credit score** for a potential customer. A **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

## Description of the data

/datasets/credit_scoring_eng.csv

children: the number of children in the family

days_employed: how long the customer has been working

dob_years: the customer’s age

education: the customer’s education level

education_id: identifier for the customer’s education

family_status: the customer’s marital status

family_status_id: identifier for the customer’s marital status

gender: the customer’s gender

income_type: the customer’s income type

debt: whether the customer has ever defaulted on a loan

total_income: monthly income

purpose: reason for taking out a loan

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [4]:
data = pd.read_csv('/datasets/credit_scoring_eng.csv')


data.columns = ['children', 'days_employed','age', 'education','education_id','family_status','family_status_id','gender','income_type','debt','total_income','purpose']
data.info()
data.head()
data.tail()
data.describe()
data.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
age                 21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


(21525, 12)

In [5]:
data['days_employed'] = data['days_employed'].abs()

In [3]:

# value_counts() ignores NaN, so its sum is the number of non-null entries per column.
for column in data.columns:
    print(data[column].value_counts().sum())


21525
19351
21525
21525
21525
21525
21525
21525
21525
21525
19351
21525


### Step 2. Data preprocessing

### Processing missing values

In [None]:

# Fill missing employment durations with the median.
data['days_employed'] = data['days_employed'].fillna(value=data['days_employed'].median())
print(data['days_employed'].isnull().any())

# The maximum age is 75 years, so days_employed cannot exceed 75 * 365 = 27375.
data.loc[data['days_employed'] >= 27375, 'days_employed'] = data['days_employed'].median()

# 20 children is treated as unrecoverable (rows dropped); -1 is treated as a typo for 1.
data.drop(data[data['children'] == 20].index, inplace=True)
data.loc[data['children'] == -1, 'children'] = 1

# Text columns: fill gaps with the placeholder 'undefined'.
for column in ['purpose', 'education', 'family_status', 'gender', 'income_type']:
    data[column] = data[column].fillna(value='undefined')

# Numeric columns: fill gaps with the median.
# 'debt' is a 0/1 integer column with no missing values, so it is left untouched
# (filling it with the string 'undefined' would mix types).
for column in ['age', 'family_status_id', 'total_income']:
    data[column] = data[column].fillna(value=data[column].median())


# Confirm that no column has missing values left.
for column in data.columns:
    print(data[column].value_counts().sum())



I assume (in real life I would obviously contact the source of the data) that a value of -1 in children is a typo for 1. The value 20 is probably a typo as well, most likely for 2 rather than 0, but we can't be sure, so I chose to delete those rows. The value can't be accurate: 76 customers with 20 children each is too much of a coincidence.

Only days_employed and total_income contain NaN values: each has 19351 non-null entries out of 21525, so 2174 values are missing in each. We replace the NaN values first, then look for other problematic values manually.
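The check on the children column can be sketched on a small made-up frame. The data below is invented for illustration, and the name `toy` is hypothetical; the sketch only assumes, as the text does, that -1 stands for 1 and that rows with 20 are dropped.

```python
import pandas as pd

# Toy data (made up for illustration): -1 and 20 mimic the suspicious values.
toy = pd.DataFrame({'children': [0, 1, 2, -1, 20, 0, 3, 20]})

# value_counts() makes out-of-range values easy to spot.
print(toy['children'].value_counts().sort_index())

# Assumption from the text: -1 is a typo for 1, and 20 cannot be recovered.
cleaned = toy[toy['children'] != 20].copy()
cleaned['children'] = cleaned['children'].replace(-1, 1)

remaining_values = sorted(cleaned['children'].unique().tolist())
print(remaining_values)
```

Filtering into a new frame instead of dropping in place keeps the original rows available in case the assumption about 20 turns out to be wrong.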


### Conclusion

I replaced missing values with 'undefined' in text columns and with the median in numeric columns, since the median is better suited than the mean to data with a wide range of values.
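A minimal sketch of why the median is the safer fill value when a column has a wide range. The incomes below are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Made-up monthly incomes with one extreme value, for illustration only.
incomes = [18000, 21000, 24000, 26000, 30000, 400000]

income_mean = mean(incomes)      # pulled upwards by the single outlier
income_median = median(incomes)  # stays close to the typical value

print(income_mean, income_median)
```

With the mean as the fill value, every customer with a missing income would be assigned a figure that only one customer in this toy sample actually reaches.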

### Data type replacement

In [None]:
import nltk

# Normalise the case of 'education' first, so that rows differing only in
# letter case are recognised as duplicates, then drop the duplicates.
data['education'] = data['education'].str.lower()
data = data.drop_duplicates().reset_index(drop=True)


purposes = data['purpose'].unique()

# Tokenise every unique purpose (word_tokenize needs NLTK's tokenizer data
# to be downloaded beforehand).
new_purposes = [nltk.word_tokenize(purpose) for purpose in purposes]

# For each keyword, collect the tokenised purposes that mention it.
# This is only used to inspect which keywords cover the raw purposes.
keywords = ['house', 'education', 'estate', 'car', 'property', 'wedding']
purposes_by_keyword = {
    keyword: [tokens for tokens in new_purposes if keyword in tokens]
    for keyword in keywords
}

        

data['purpose'].value_counts()

def purpose_group(purpose):
    """Map a free-text loan purpose to one of a few categories."""
    # Substring checks, applied in order; the first match wins.
    keyword_to_group = [
        ('house', 'house'),
        ('education', 'education'),
        ('educated', 'education'),
        ('estate', 'estate'),
        ('car', 'car'),
        ('property', 'property'),
        ('wedding', 'wedding'),
    ]
    for keyword, group in keyword_to_group:
        if keyword in purpose:
            return group
    return 'other'

# Replace the free-text purpose with its category.
data['purpose'] = data['purpose'].apply(purpose_group)
print(data['purpose'].unique())

# Confirm that every column is still fully populated after the cleanup.
for column in data.columns:
    print(data[column].value_counts().sum())

- Is there a relation between marital status and repaying a loan on time?

In [None]:
column_1 = data['family_status_id']
column_2 = data['debt']
correlation = column_1.corr(column_2)
print(correlation)

by_marital_status = data.groupby("family_status_id")["debt"].sum()
marital_status_all = data["family_status_id"].value_counts()


percentage_marital_status = by_marital_status / (marital_status_all / 100)
print('percentage of people with debt for each marital status:', percentage_marital_status)




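Since 1 means there is debt and 0 means there is no debt, summing debt within each family status gives the number of customers with debt in that group. Dividing this by the group size from data["family_status_id"].value_counts() gives the percentage of customers with debt for each of the 5 family status categories.
Because family_status_id is a category label rather than an ordered quantity, the correlation coefficient is only a rough indicator, and the per-group percentages are the more direct measure. They show that customers in categories 1 and 4 have debt most often but, as both the corr function and the percentages show, the difference is relatively small.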

Conclusion: the percentage of people with debt is low in every group, and the difference between groups is small, which is consistent with the weak correlation.
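The per-group percentage can also be obtained in one step, because the mean of a 0/1 column is the share of ones. The sketch below uses an invented toy frame (the name `toy` and its values are hypothetical) and checks that both routes give the same numbers.

```python
import pandas as pd

# Toy data, invented for illustration; 'debt' is 1 for a past default.
toy = pd.DataFrame({
    'family_status_id': [0, 0, 0, 0, 1, 1, 4, 4],
    'debt':             [0, 0, 0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the default rate.
default_rate = toy.groupby('family_status_id')['debt'].mean() * 100

# The same quantity computed as in the cell above: defaults / group size.
by_sum = toy.groupby('family_status_id')['debt'].sum()
by_count = toy['family_status_id'].value_counts()
same_rate = by_sum / (by_count / 100)

print(default_rate)
```

Both versions rely on pandas aligning the two Series by their index, so the order in which value_counts() returns the groups does not matter.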

- Is there a relation between income level and repaying a loan on time?

In [None]:
column_3 = data['total_income']
column_4 = data['debt']
correlation1 = column_3.corr(column_4)
print(correlation1)


def total_income_group(income):
    if income <= 50000:
        return 'low'
    if income <= 100000:
        return 'medium'
    if income <= 150000:
        return 'high'
    return 'very_high'

# Store the category in a separate column ('income_group' is a new name)
# so that the numeric total_income stays available.
data['income_group'] = data['total_income'].apply(total_income_group)
print(data['income_group'].value_counts())

# Customers with debt in each income group, and the size of each group.
by_total_income = data.groupby('income_group')['debt'].sum()
total_income_all = data['income_group'].value_counts()

percentage_total_income = by_total_income / (total_income_all / 100)
print('percentage of people with debt for total_income:', percentage_total_income)


A negative correlation means that the variables move in opposite directions (an inverse correlation): the lower the income, the more likely the customer is to have debt, and vice versa. At the same time, the correlation is not very strong.
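The income bands from total_income_group can also be built with pandas.cut. This is a sketch of an alternative, not the method used in the notebook, and the incomes below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up incomes for illustration, chosen to sit on and around the band edges.
income = pd.Series([12000, 50000, 50001, 100000, 149999, 150000, 320000])

# right=True (the default) makes each bin include its upper edge,
# matching the "<=" comparisons in total_income_group.
bins = [-np.inf, 50000, 100000, 150000, np.inf]
labels = ['low', 'medium', 'high', 'very_high']
income_group = pd.cut(income, bins=bins, labels=labels)

print(income_group.tolist())
```

Keeping the edges in a single list makes it easier to try different bands and see whether the conclusion depends on where the cut-offs are placed.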

### Conclusion

- How do different loan purposes affect on-time repayment of the loan?

In [None]:
by_purpose = data.groupby("purpose")["debt"].sum()
purpose_all = data["purpose"].value_counts()
percentage_purpose = by_purpose / (purpose_all / 100)
print('percentage of people with debt for purpose:', percentage_purpose)

    


### Conclusion

### Step 4. General conclusion

The overall conclusion, based on the correlation between the 'debt' column and the other relevant columns, is that there is very little correlation between debt and the other variables. The percentage of people with debt is small in every group, and the difference between categories is marginal.
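The same comparison can be repeated for any grouping column, for example the number of children mentioned in the project goal. The helper below is a hypothetical sketch (`default_rate_by` is not part of the notebook), shown on invented toy data.

```python
import pandas as pd


def default_rate_by(frame, column):
    """Return group size and default rate (%) for each value of `column`."""
    summary = frame.groupby(column)['debt'].agg(customers='count', default_rate='mean')
    summary['default_rate'] = summary['default_rate'] * 100
    return summary


# Toy data, invented for illustration; 'debt' is 1 for a past default.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2],
    'debt':     [0, 0, 1, 0, 0, 1, 0, 0],
})

result = default_rate_by(toy, 'children')
print(result)
```

Reporting the group size next to the rate helps to judge whether a difference between categories rests on enough customers to be meaningful.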