# Analyzing borrowers' risk of defaulting

Your project is to prepare a report for a bank's loan division. You'll need to find out whether a customer's marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers' creditworthiness.

Your report will be considered when building a **credit score** for a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

## Import Packages

In [65]:
#install the profiling package before importing it
!pip install "pandas-profiling[notebook]"
import pandas as pd
import numpy as np
import researchpy as rp
import nltk
from nltk.stem.snowball import SnowballStemmer
from pandas_profiling import ProfileReport
import warnings
#silence warnings to keep the finished notebook readable
warnings.filterwarnings('ignore')



### Details

| Package          | Description                                                          |
|:-----------------|:---------------------------------------------------------------------|
| pandas           | For data preprocessing and basic descriptive statistics.             |
| researchpy       | For looking at summary statistics.                                   |
| nltk             | For natural language processing.                                     |
| pandas_profiling | For creating a profile report on the dataset.                        |
| numpy            | For flagging missing values with np.where.                           |
| warnings         | For ignoring warnings after finishing the project for readability.   |

## Importing and Cleaning Data

### Import Data

In [51]:
#read the data from the local path, falling back to the shared dataset path
try:
    credit_scoring = pd.read_csv('/Users/bnoah/data/credit_scoring_eng.csv')
except FileNotFoundError:
    credit_scoring = pd.read_csv('/datasets/credit_scoring_eng.csv')

### Profile Report

In [6]:
credit_score_report = ProfileReport(credit_scoring, title="Credit Score Profiling Report")
credit_score_report.to_widgets()


### Data Cleaning

In [52]:
#remove all instances where the number of children is -1
credit_scoring = credit_scoring.loc[credit_scoring['children'] >= 0]
#remove all instances where gender is XNA (copy so later assignments act on an independent frame)
credit_scoring = credit_scoring.loc[credit_scoring['gender'] != 'XNA'].copy()
#set all days employed values equal to their absolute value
credit_scoring['days_employed'] = credit_scoring['days_employed'].abs()
#rename dob_years column to age
credit_scoring.rename(columns={'dob_years':'age'}, inplace=True)
#remove all instances where age is equal to zero
credit_scoring = credit_scoring.loc[credit_scoring['age'] != 0].copy()

### Conclusion

In this section, I focused on cleaning the data. First, I noticed 47 observations with a negative number of children. Because these were so rare, I dropped them rather than investigating further. I also removed the single observation where gender was "XNA", since I do not know what that value stands for and it occurred only once. As for `days_employed`, negative values do not make sense, yet 82% of the values are negative. My best guess is that these values were recorded with the wrong sign. For this reason, I replaced `days_employed` with its absolute value, but in a real project I would ask whoever created the data set what caused the issue. Lastly, I renamed `dob_years` to `age`, which I find easier to understand, and removed all 101 observations where age was 0.
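The counts quoted above can be gathered in one pass before any rows are dropped. The following is a minimal sketch on a small made-up frame: the `sample` data and the `anomaly_counts` helper are hypothetical and not part of the original notebook, and only the raw column names (`children`, `gender`, `days_employed`, `dob_years`) come from the dataset.

```python
import pandas as pd

# Hypothetical toy data that mimics the raw columns; all values are invented.
sample = pd.DataFrame({
    'children': [0, -1, 2, 1],
    'gender': ['F', 'M', 'XNA', 'F'],
    'days_employed': [-1200.5, 340000.0, -87.0, None],
    'dob_years': [35, 0, 52, 41],
})

def anomaly_counts(df):
    """Count the suspicious values discussed above, before cleaning."""
    return {
        'negative_children': int((df['children'] < 0).sum()),
        'gender_xna': int((df['gender'] == 'XNA').sum()),
        # share of non-missing days_employed values that are negative
        'share_negative_days': float((df['days_employed'].dropna() < 0).mean()),
        'age_zero': int((df['dob_years'] == 0).sum()),
    }

counts = anomaly_counts(sample)
print(counts)
```

Recording these counts before filtering makes it easier to justify each drop by how rare the problem is.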

## Data preprocessing

### Processing missing values

In [53]:
#creating a variable for observations with missing data
credit_scoring['missing'] = np.where(credit_scoring['days_employed'].isna(),1,0)
#check if those observations are correlated with any of the other numeric variables
display(credit_scoring.select_dtypes('number').corr()['missing'])
#drop the missing variable from the dataframe
credit_scoring.drop('missing',axis=1,inplace=True)

children            0.003316
days_employed            NaN
age                 0.009092
education_id       -0.010283
family_status_id    0.000543
debt               -0.002898
total_income             NaN
missing             1.000000
Name: missing, dtype: float64

### Conclusion

The first thing I noticed is that `total_income` and `days_employed` are missing for exactly the same observations. This means that if one of the variables is missing at random, the other should be too. Overall, about 10% of the values are missing for both variables. Looking at the correlations above, the missingness indicator is not meaningfully correlated with any of the other numeric variables (every coefficient is close to zero), which is consistent with the data being missing completely at random.
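The claim that the two columns are missing on the same rows can be checked directly. This is a small sketch on invented data, not the notebook's own check; the `sample` frame and the variable names `same_rows` and `missing_rate` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; all values are invented for illustration.
sample = pd.DataFrame({
    'total_income': [25000.0, np.nan, 31000.0, np.nan, 18000.0],
    'days_employed': [1200.0, np.nan, 450.0, np.nan, 3000.0],
    'age': [34, 51, 28, 45, 60],
})

income_missing = sample['total_income'].isna()
days_missing = sample['days_employed'].isna()

# True when both columns are missing on exactly the same rows
same_rows = bool((income_missing == days_missing).all())
# Share of rows with a missing income value
missing_rate = float(income_missing.mean())

print(same_rows, missing_rate)
```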

### Data type replacement

In [54]:
#look at the min, max, median and mean for the total_income and days_employed variables
display(credit_scoring.agg({'total_income': ["min", "max", "median", "mean"],
                    'days_employed': ["min", "max", "median", "mean"]}))
#filled in the missing values with the mean for total_income and median for days_employed
credit_scoring['total_income'] = credit_scoring['total_income'].fillna(credit_scoring['total_income'].mean())
credit_scoring['days_employed'] = credit_scoring['days_employed'].fillna(credit_scoring['days_employed'].median())
#change real numbers to integers
credit_scoring['days_employed'] = credit_scoring['days_employed'].astype(int)
credit_scoring['total_income'] = credit_scoring['total_income'].astype(int)

|        | total_income | days_employed |
|:-------|:-------------|:--------------|
| min    | 3306.762     | 24.141633     |
| max    | 362496.645   | 401755.400475 |
| median | 23200.877    | 2195.688842   |
| mean   | 26799.115951 | 66923.717886  |


### Conclusion

I started by looking at the descriptive statistics for `total_income` and `days_employed`. For `total_income`, the mean (about 26,799) and median (about 23,201) are relatively close, so I used the mean to replace the missing values. The `days_employed` variable is heavily influenced by outliers: its mean (about 66,924) is roughly 30 times its median (about 2,196). For this reason, I replaced its missing values with the median. Lastly, I converted both variables to integers.
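The reason for preferring the median when outliers are present can be seen on a handful of numbers. The sketch below uses an invented series (`days` is hypothetical, not taken from the dataset) with one implausibly large entry and one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mostly modest tenures plus one implausibly large entry.
days = pd.Series([200.0, 900.0, 1500.0, 2200.0, 3100.0, 400000.0, np.nan])

mean_days = days.mean()      # pulled upward by the extreme value
median_days = days.median()  # barely affected by the extreme value

# Impute the missing entry with the median, then convert to integers
filled = days.fillna(median_days).astype(int)

print(mean_days, median_days, filled.iloc[-1])
```

A single extreme value moves the mean far from the typical observation, while the median stays near the middle of the data, so the imputed value remains plausible.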

### Processing duplicates

In [55]:
#convert all string variables to lower case
for x in credit_scoring.select_dtypes(include='object').columns:
    credit_scoring[x] = credit_scoring[x].str.lower()
#count the number of duplicates
print('Number of duplicates: ', credit_scoring.duplicated().sum())
#drop duplicates
credit_scoring.drop_duplicates(inplace = True)

Number of duplicates:  71


### Conclusion

First, I converted all string variables to lower case so that rows differing only in capitalization are caught as duplicates. I then found 71 duplicate rows and dropped them.
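To illustrate why the case conversion comes first, here is a minimal sketch on invented rows; the `sample` frame and its values are hypothetical. Two rows that differ only in capitalization are not counted as duplicates until the text is normalized.

```python
import pandas as pd

# Hypothetical rows: the first two differ only in letter case.
sample = pd.DataFrame({
    'education': ["Bachelor's Degree", "bachelor's degree", 'secondary education'],
    'total_income': [25000, 25000, 18000],
})

# Duplicates found on the raw text
before = int(sample.duplicated().sum())

# Lower-case the string column, then count again
normalized = sample.copy()
normalized['education'] = normalized['education'].str.lower()
after = int(normalized.duplicated().sum())

print(before, after)
```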

### Categorizing Data

In [56]:
#create a function that groups people by their age group
def age_group(age):
    if age <= 18:
        return 'child'
    elif age <= 44:
        return 'young adult'
    elif age <= 64:
        return 'adult'
    return 'elderly'
#add an age group column to the dataframe
credit_scoring['age_group'] = credit_scoring['age'].apply(age_group)

#create a function that groups people by their income level
def income_group(income):
    if income <= 40100:
        return 'low income'
    elif income < 120400:
        return 'middle income'
    return 'upper income'
#add income group column to the dataframe
credit_scoring['income_group'] = credit_scoring['total_income'].apply(income_group)

#create a second function that groups people as either below or above mean income
#compute the mean once instead of recomputing it for every row
mean_income = credit_scoring['total_income'].mean()
def income_below_avg(income):
    if income < mean_income:
        return 'below average'
    return 'above average'
#add the above/below average income column to the dataframe
credit_scoring['income_avg'] = credit_scoring['total_income'].apply(income_below_avg)

#create a function that is 1 if you have any children and 0 if you have no children
def any_children(children):
    if children > 0:
        return 1
    return 0
#add a variable to the dataframe that is binary and equal to 1 if the person has a child
credit_scoring['any_children'] = credit_scoring['children'].apply(any_children)

#create an English Snowball stemmer
english_stemmer = SnowballStemmer('english')
#looks at the stem of each word in a sentence/phrase and classifies the purpose of their loan
def loan_purpose_general(purpose):
    for word in purpose.split(' '):
        stemmed_word = english_stemmer.stem(word)
        if stemmed_word == 'hous':
            return 'house'
        elif stemmed_word == 'estat':
            return 'house'
        elif stemmed_word == 'properti':
            return 'house'
        elif stemmed_word == 'car':
            return 'car'
        elif stemmed_word == 'wed':
            return 'wedding'
        elif stemmed_word == 'educ':
            return 'education'
        elif stemmed_word == 'univers':
            return 'education'
#add the loan general purpose column to the dataframe
credit_scoring['loan_purpose_general'] = credit_scoring['purpose'].apply(loan_purpose_general)

#create a function that is 1 if the person is married and equal to 0 if they are not.
def married(family_status):
    if family_status == 'married':
        return 1
    return 0
credit_scoring['married'] = credit_scoring['family_status'].apply(married)

#### Updated Profile Report

In [61]:
credit_score_newvar_report = ProfileReport(credit_scoring[['age_group','income_group','income_avg','any_children','loan_purpose_general','married']], 
                                           title="Credit Score New Variable Profiling Report")
credit_score_newvar_report.to_widgets()


### Conclusion

The variables I wanted to categorize are `age`, `total_income`, `children`, `purpose`, and `family_status`.
- For `age_group`, I used the age groups in the United States census.
- For `income_group`, I used a [2020 U.S. News article](https://money.usnews.com/money/personal-finance/family-finance/articles/where-do-i-fall-in-the-american-economic-class-system) to classify low, middle, and upper income groups.
- For `income_avg`, I separated those above and below the average income in this dataset.
- For `any_children`, I separated people with and without children.
- For `loan_purpose_general`, I used stemming to group the `purpose` variable by the general reason for the loan.
- For `married`, I separated married and non-married people.
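As an alternative to a row-by-row function, the age grouping can be expressed with `pd.cut`. This is only a sketch of an equivalent approach, not the method used in the notebook; the `ages` series is invented, and the bin edges are copied from the `age_group` function.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges.
ages = pd.Series([18, 19, 44, 45, 64, 65])

# Right-closed bins reproduce the <= comparisons in age_group()
age_bins = [-np.inf, 18, 44, 64, np.inf]
age_labels = ['child', 'young adult', 'adult', 'elderly']
age_groups = pd.cut(ages, bins=age_bins, labels=age_labels).astype(str)

print(age_groups.tolist())
```

Because `pd.cut` closes each bin on the right by default, an age of exactly 18, 44, or 64 falls in the lower group, matching the original function.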

## Answer these questions

### Is there a relation between having kids and repaying a loan on time?

In [77]:
display(rp.summary_cont(credit_scoring.groupby('any_children')['debt']))
rp.summary_cont(credit_scoring.groupby('children')['debt'])





| any_children | N     | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:-------------|:------|:-------|:-------|:-------|:-------------|:-------------|
| 0            | 14021 | 0.0755 | 0.2641 | 0.0022 | 0.0711       | 0.0798       |
| 1            | 7284  | 0.0925 | 0.2898 | 0.0034 | 0.0859       | 0.0992       |






| children | N     | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:---------|:------|:-------|:-------|:-------|:-------------|:-------------|
| 0        | 14021 | 0.0755 | 0.2641 | 0.0022 | 0.0711       | 0.0798       |
| 1        | 4792  | 0.092  | 0.2891 | 0.0042 | 0.0838       | 0.1002       |
| 2        | 2039  | 0.0951 | 0.2935 | 0.0065 | 0.0824       | 0.1079       |
| 3        | 328   | 0.0823 | 0.2753 | 0.0152 | 0.0524       | 0.1122       |
| 4        | 41    | 0.0976 | 0.3004 | 0.0469 | 0.0027       | 0.1924       |
| 5        | 9     | 0.0    | 0.0    | 0.0    |              |              |
| 20       | 75    | 0.1067 | 0.3108 | 0.0359 | 0.0352       | 0.1782       |


#### Conclusion

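Comparing those with and without children, the default rate was 9.25% for those with children and 7.55% for those without. The 95% confidence intervals above do not overlap (8.59% to 9.92% versus 7.11% to 7.98%), so those with children appear more likely to default on their loan. Among those with children, however, the intervals for one, two, and three children overlap, so there is no clear evidence that the default rate depends on the number of children. The groups with four or more children are too small to support any conclusion.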
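The group means and intervals in the tables can be approximated with plain pandas. The sketch below is an assumption-laden illustration: the `sample` data and the `default_rate_ci` helper are hypothetical, and it uses a normal-approximation critical value of 1.96, so its bounds may differ slightly from the `researchpy` output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'debt' is 1 for a default, 0 otherwise.
sample = pd.DataFrame({
    'any_children': [0] * 10 + [1] * 10,
    'debt': [1] + [0] * 9 + [1, 1, 1] + [0] * 7,
})

def default_rate_ci(df, group_col, z=1.96):
    """Default rate per group with a normal-approximation 95% interval."""
    stats = df.groupby(group_col)['debt'].agg(['count', 'mean', 'std'])
    stats['se'] = stats['std'] / np.sqrt(stats['count'])
    stats['ci_low'] = stats['mean'] - z * stats['se']
    stats['ci_high'] = stats['mean'] + z * stats['se']
    return stats

summary = default_rate_ci(sample, 'any_children')
print(summary)
```

Since `debt` is a 0/1 indicator, its mean within a group is that group's default rate.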

### Is there a relation between marital status and repaying a loan on time?

In [78]:
#look at summary statistics comparing those that are married and unmarried
display(rp.summary_cont(credit_scoring.groupby('married')['debt']))
#look at summary statistics comparing different non married family statuses
rp.summary_cont(credit_scoring[credit_scoring['married'] != 1].groupby('family_status')['debt'])





| married | N     | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:--------|:------|:-------|:-------|:-------|:-------------|:-------------|
| 0       | 9044  | 0.0891 | 0.2849 | 0.003  | 0.0832       | 0.095        |
| 1       | 12261 | 0.0755 | 0.2642 | 0.0024 | 0.0708       | 0.0802       |






| family_status     | N    | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:------------------|:-----|:-------|:-------|:-------|:-------------|:-------------|
| civil partnership | 4124 | 0.0936 | 0.2913 | 0.0045 | 0.0847       | 0.1025       |
| divorced          | 1181 | 0.072  | 0.2586 | 0.0075 | 0.0572       | 0.0867       |
| unmarried         | 2789 | 0.0979 | 0.2972 | 0.0056 | 0.0868       | 0.1089       |
| widow / widower   | 950  | 0.0653 | 0.2471 | 0.008  | 0.0495       | 0.081        |


#### Conclusion

Based on the summary tables, married people are less likely to default on a loan (7.55% versus 8.91% for everyone else). Next, I wanted to see whether there were differences among the non-married family statuses. Although these differences may be less significant, among those who are not married, widowed and divorced borrowers default less often than those who are unmarried or in civil partnerships.
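One way to put a number on the married versus non-married comparison is a two-proportion z-test. This test is not part of the original analysis; the sketch below applies it to the rounded rates and group sizes from the first summary table, and the `two_proportion_z` helper is a hypothetical name.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Rates and group sizes read from the married / non-married summary table
z_stat, p_value = two_proportion_z(0.0891, 9044, 0.0755, 12261)
print(round(z_stat, 2), p_value)
```

Because the inputs are rounded to four decimals, the result should be read as approximate.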

### Is there a relation between income level and repaying a loan on time?

In [79]:
#summary stats for each income group 
display(rp.summary_cont(credit_scoring.groupby('income_group')['debt']))
#summary stats for those above and below the income average
rp.summary_cont(credit_scoring.groupby('income_avg')['debt'])






| income_group  | N     | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:--------------|:------|:-------|:-------|:-------|:-------------|:-------------|
| low income    | 18523 | 0.0831 | 0.276  | 0.002  | 0.0791       | 0.0871       |
| middle income | 2734  | 0.0691 | 0.2537 | 0.0049 | 0.0596       | 0.0786       |
| upper income  | 48    | 0.0833 | 0.2793 | 0.0403 | 0.0022       | 0.1644       |






| income_avg    | N     | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:--------------|:------|:-------|:-------|:-------|:-------------|:-------------|
| above average | 9492  | 0.0772 | 0.267  | 0.0027 | 0.0719       | 0.0826       |
| below average | 11813 | 0.0846 | 0.2782 | 0.0026 | 0.0795       | 0.0896       |


#### Conclusion

Looking at the first classification of income groups, those in the low income group were more likely to default on their loan than those in the middle income group (8.31% versus 6.91%). Since the upper income group had only 48 observations, it is no surprise that we cannot infer much about its default rate. Next, looking at average income, there is no clear difference in default rates between those above and below the average, since the two confidence intervals overlap.
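The effect of the small upper income group can be made concrete with the standard error formula, SD divided by the square root of N. The sketch below plugs in the SD and N values from the `income_group` table; the `standard_error` helper is a hypothetical name used only for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of a mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

# SD and N read from the income_group table
se_upper = standard_error(0.2793, 48)
se_low = standard_error(0.2760, 18523)

print(round(se_upper, 4), round(se_low, 4))
```

With nearly identical standard deviations, the upper income group's standard error is many times larger purely because of its sample size, which is why its interval is so wide.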

### How do different loan purposes affect on-time repayment of the loan?

In [80]:
#summary stats for different purposes
rp.summary_cont(credit_scoring.groupby('loan_purpose_general')['debt'])






| loan_purpose_general | N     | Mean   | SD     | SE     | 95% CI lower | 95% CI upper |
|:---------------------|:------|:-------|:-------|:-------|:-------------|:-------------|
| car                  | 4273  | 0.0934 | 0.291  | 0.0045 | 0.0846       | 0.1021       |
| education            | 3985  | 0.0928 | 0.2903 | 0.0046 | 0.0838       | 0.1019       |
| house                | 10739 | 0.0725 | 0.2594 | 0.0025 | 0.0676       | 0.0774       |
| wedding              | 2308  | 0.0797 | 0.2709 | 0.0056 | 0.0687       | 0.0908       |


#### Conclusion

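Those who take out a loan for a car or education are more likely to default than those who take out a loan for a house. Wedding loans also have a lower default rate than car and education loans, but the difference is not as statistically clear as it is for houses, because the wedding confidence interval overlaps with the car and education intervals.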
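The comparison of intervals can be written as a small check. The sketch below copies the 95% bounds from the loan purpose table; the `intervals_overlap` helper and the `overlap_with_house` name are hypothetical and used only for illustration.

```python
# 95% intervals (lower, upper) copied from the loan purpose table
intervals = {
    'car': (0.0846, 0.1021),
    'education': (0.0838, 0.1019),
    'house': (0.0676, 0.0774),
    'wedding': (0.0687, 0.0908),
}

def intervals_overlap(a, b):
    """Return True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Compare every other purpose against house loans
overlap_with_house = {
    purpose: intervals_overlap(bounds, intervals['house'])
    for purpose, bounds in intervals.items() if purpose != 'house'
}
print(overlap_with_house)
```

Non-overlapping intervals point to a real difference, but overlapping intervals do not prove that two rates are equal, so the wedding result is best read as inconclusive.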

## General conclusion

In general, this project helped us better understand some key factors in default rates on loans. 

**Cleaning Summary:** 
1. I noticed that 82% of the `days_employed` values were negative, which makes no logical sense. I assumed the negative signs were entered by mistake and took the absolute value of `days_employed`. In the real world, however, this issue would need to be discussed with whoever gathered the data to pin down the exact error.
2. I looked at the missing values for the `total_income` and `days_employed` variables. After looking at the correlation between the missing data and the other variables, I concluded that the data appear to be missing completely at random, and I imputed the missing values.
3. I looked for duplicate rows and deleted them. There is an argument for keeping duplicates in this situation, since the data has no unique identifier, but because there were only 71 duplicates, deleting them will not change our general conclusions.

**Analysis Summary:**
For our questions of interest, I was able to find a few characteristics that seemed to have an impact on the default rate. 
1. People who have children have higher default rates on a loan. This makes sense because raising children can be very costly and reduces the income available for loan payments.
2. Those who are married are less likely to default on their loan. One possible reason is that marriage increases household income and reduces the effect of one partner losing their job on the household's ability to pay back the loan.
3. Those in the lower-income bracket were more likely to default on a loan than those in the middle-income bracket. Beyond the obvious reason of having a lower income, I think this also arises because low-income jobs are more volatile in terms of monthly income and job security. This idea is discussed in an [article](https://www.cbpp.org/research/poverty-and-inequality/most-workers-in-low-wage-labor-market-work-substantial-hours-in) released in 2018 by the Center on Budget and Policy Priorities.
4. Loans for cars and education have higher default rates than loans for houses. I believe this is due to the ease of buying a car and the uncertainty of finding a quality job with an undergraduate degree.

Overall, while I think there are characteristics here that can help predict loan default rates, there are many social issues with using them for a credit score. I think punishing people for having children or being single would cause significant backlash toward the company. My recommendation would be to focus on more concrete data points such as credit payment history, current loans, etc. 


