<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 3.0**
*by [sigmoidal.ai](https://sigmoidal.ai)*

---

# Credit Risk Assessment

### Credit Risk

**Credit Risk** is the probability that a borrower or counterparty will fail to honor a financial agreement, causing a loss for the lending institution when the client *defaults*<sup><a href="https://www.risk-officer.com/Credit_Risk.htm">1</a>,</sup><sup><a href="https://www.investopedia.com/terms/c/creditrisk.asp">2</a></sup>. This usually happens because the client is unable to repay the loan.

<p align=center>
<img src="img/credit_risk.jpg" width="40%"><br>
<i><sup>Image credits: storyset @ <a href="https://www.freepik.com/author/stories">freepik</a>.</sup></i>
</p>

Although it is impossible to predict exactly which clients will cause losses for the company, ***Credit Risk Management*** is concerned with estimating this probability, that is, with identifying which clients are likely to default on their agreements. This estimate lets companies mitigate losses, for example by charging higher interest rates to clients who represent higher risk, or even by denying loans<sup><a href="https://www.risk-officer.com/Credit_Risk.htm">1</a>,</sup><sup><a href="https://www.investopedia.com/terms/c/creditrisk.asp">2</a></sup>.

One of the frameworks lenders use to evaluate risk is the *5 Cs of Credit*. Although companies measure them in different ways, they offer some insight into the risk of financial loss. The 5 Cs are: **Character**, the client's credit history; **Capital**, the amount of money they have; **Capacity**, their debt-to-income ratio; **Collateral**, assets that can back or act as security for the loan; and **Conditions**, the purpose, amount and rates of the loan<sup><a href="https://www.investopedia.com/terms/f/five-c-credit.asp">3</a></sup>.

However, these are only a few of the characteristics that can be observed. Companies usually hold far more information about their clients. With Machine Learning methods, we can use this information to try to predict whether a client will default.

## Goal

The goal of this analysis is to predict whether a bank client will default on their financial agreements with the lending institution. This evaluation has to happen before the client is granted a loan or a credit card. The prediction should minimize false positives (good clients wrongly flagged as likely defaulters), so as not to penalize them, while still catching enough actual defaulters to prevent losses on the company's side.
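
To make the trade-off between false positives and missed defaulters concrete, here is a minimal sketch of how both can be measured. The labels and predictions are invented for illustration and do not come from the dataset or from any model in this notebook.

```python
import numpy as np

# Toy labels, invented for illustration: 1 = client defaulted, 0 = client paid.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical model output: 1 = predicted to default.
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaulters correctly flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # good clients wrongly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # defaulters missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # good clients correctly cleared

false_positive_rate = fp / (fp + tn)  # share of good clients wrongly flagged
recall = tp / (tp + fn)               # share of actual defaulters caught
precision = tp / (tp + fp)            # share of flagged clients who do default

print(f"FPR={false_positive_rate:.2f}  recall={recall:.2f}  precision={precision:.2f}")
```

Lowering the false positive rate protects good clients, while raising recall protects the company. In practice the two usually pull in opposite directions, so a model is judged on both.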

## Initial hypotheses 

There are a few initial hypotheses that we can think of.

* Clients with previously recorded bankruptcies will be more likely to default.

* Clients with previously recorded defaults will be more likely to default again.

* Clients with lower credit scores will also be more likely to default.

Over the course of this analysis, we will see how these characteristics relate to the observed outcome.
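
As a rough sketch of how such hypotheses could be checked, the default rate can be compared across groups. The rows below are invented for illustration. Only the column names come from the dataset, and using `score_3` as the credit score of interest is an assumption.

```python
import pandas as pd

# Toy rows invented for illustration; the column names match the dataset.
toy = pd.DataFrame({
    "target_default":    [False, False, True, False, True, True, False, True],
    "n_bankruptcies":    [0, 0, 1, 0, 1, 0, 0, 1],
    "n_defaulted_loans": [0, 0, 0, 0, 1, 1, 0, 0],
    "score_3":           [510, 500, 300, 370, 350, 360, 450, 320],
})

# The mean of a boolean column is the share of True values, i.e. the default rate.
rate_by_bankruptcy = toy.groupby("n_bankruptcies")["target_default"].mean()
rate_by_prior_default = toy.groupby("n_defaulted_loans")["target_default"].mean()

# Compare the typical score of defaulters and non-defaulters.
median_score = toy.groupby("target_default")["score_3"].median()

print(rate_by_bankruptcy)
print(rate_by_prior_default)
print(median_score)
```

On the real data, a clearly higher default rate in the groups with previous bankruptcies or defaults, and a lower median score among defaulters, would be consistent with the hypotheses. It would not by itself prove a causal link.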

## About the dataset

In this notebook, we will use a dataset from a Data Science project that was part of a competition held by [Nubank](https://blog.nubank.com.br/nubank-o-que-e/), a digital financial platform from Brazil. The dataset presents several pieces of information about Nubank's clients and whether they defaulted on their financial obligations or not.

In the dataset, we find several columns:

* `ids` = These are the clients' IDs. This column is anonymised. 
* `target_default` = Whether the client defaulted or not. This will be our target variable.
* `score_1` = Credit Score. This column is anonymised.
* `score_2` = Another type of Credit Score. This is also anonymised.
* `score_3` = A third type of Credit Score. Contains the actual numbers.
* `score_4` = A fourth type of Credit Score. Contains the actual numbers.
* `score_5` = A fifth type of Credit Score. Contains the actual numbers.
* `score_6` = A sixth type of Credit Score. Contains the actual numbers.
* `risk_rate` = Unclear. Could be "Interest Rate Risk", which is defined as "the danger that a bank may incur loss or lose money in granting loans (...)".<sup><a href="https://www.sciencedirect.com/topics/economics-econometrics-and-finance/interest-rate-risk">4</a></sup>
* `last_amount_borrowed` = Last amount borrowed by the client.
* `last_borrowed_in_months` = How many months since the last loan.
* `credit_limit` = Limit of credit.
* `reason` = Unclear. This could be the reason for the loan. This is also anonymised.
* `income` = The client's income, probably annual income. 
* `facebook_profile` = Whether the client has a Facebook profile.
* `state` = Geographical state. This is also anonymised.
* `zip` = Zip code. This is also anonymised.
* `channel` = Unclear. This is also anonymised.
* `job_name` = The client's job title. This is also anonymised.
* `real_state` = Unclear. This is also anonymised.
* `ok_since` = Unclear. This is probably a time variable.
* `n_bankruptcies` = Number of previous bankruptcies.
* `n_defaulted_loans` = Number of previous defaulted loans.
* `n_accounts` = Number of accounts.
* `n_issues` = Number of issues.
* `application_time_applied` = The time the application was made.
* `application_time_in_funnel` = How long the application was in "funnel".
* `email` = The client's e-mail provider.
* `external_data_provider_credit_checks_last_2_year` = External data. Credit checks in the last 2 years.
* `external_data_provider_credit_checks_last_month` = External data. Credit checks in the last month.
* `external_data_provider_credit_checks_last_year` = External data. Credit checks in the last year.
* `external_data_provider_email_seen_before` = External data. Unclear. Probably how many times the e-mail was seen previously.
* `external_data_provider_first_name` = External data. First name, unsure if from provider or client. We'll check this information based on the number of unique values.
* `external_data_provider_fraud_score` = External data. Fraud score.
* `lat_lon` = Latitude and Longitude of the client.
* `marketing_channel` = Marketing channel through which the client decided on a loan.
* `profile_phone_number` = Client's phone number. Looks anonymised.
* `reported_income` = Client's reported income.
* `shipping_state` = State of the client's shipping address (for example, `BR-MT`).
* `shipping_zip_code` = Zip code for shipping to the client. Looks anonymised.
* `profile_tags` = Dictionary of tags for each client. It is unclear what the tags mean.
* `user_agent` = Information about the platform used by the client (browser, operating system, et cetera).
* `target_fraud` = Fraud information for another analysis in the same dataset.
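
Several of these columns are easier to judge once their cardinality and format are known. The sketch below uses three rows copied from the first entries of the dataset and keeps only a few columns. It counts unique values and parses the string-encoded `lat_lon` and `profile_tags` fields. The helper column names `lat`, `lon` and `n_tags` are made up for this example.

```python
import ast
import pandas as pd

# Three rows copied from the first entries of the dataset (subset of columns).
sample = pd.DataFrame({
    "external_data_provider_first_name": ["leidelaura", "diocezio", "veralucia"],
    "lat_lon": [
        "(-29.151545708122246, -51.1386461804385)",
        "(-19.687710705798963, -47.94151536525154)",
        "(-28.748023890412284, -51.867279334353995)",
    ],
    "profile_tags": [
        "{'tags': ['n19', 'n8']}",
        "{'tags': ['n6', 'n7', 'nim']}",
        "{'tags': ['n0', 'n17', 'nim', 'da']}",
    ],
})

# Distinct values per column: a count close to the number of rows suggests
# an identifier-like field rather than a category.
n_unique = sample.nunique()

# Both fields are stored as strings; ast.literal_eval parses Python literals
# without executing arbitrary code.
coords = sample["lat_lon"].apply(ast.literal_eval)
sample["lat"] = coords.apply(lambda pair: pair[0])
sample["lon"] = coords.apply(lambda pair: pair[1])
sample["n_tags"] = sample["profile_tags"].apply(
    lambda text: len(ast.literal_eval(text)["tags"])
)

print(n_unique)
print(sample[["lat", "lon", "n_tags"]])
```

On the full dataset, rows with missing values in these columns would need to be handled before parsing, since `ast.literal_eval` expects a string.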

## Importing data

Let's start our analysis by importing our dependencies, setting some parameters and reading our dataset. We will also print the first few entries of the data.

In [3]:
# Dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Options
pd.set_option('display.max_columns', None)

# Fixing the random seed for reproducibility
np.random.seed(6327)

# Defining plot parameters
# plt.style.use('dark_background')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rcParams['font.stretch'] = 'normal'
plt.rcParams['font.style'] = 'normal'
plt.rcParams['font.variant'] = 'normal'

# Reading dataframe
df = pd.read_csv("data/acquisition_train.csv")

In [4]:
# Checking size and first entries
print(df.shape)
df.head(6)

(45000, 43)


Unnamed: 0,ids,target_default,score_1,score_2,score_3,score_4,score_5,score_6,risk_rate,last_amount_borrowed,last_borrowed_in_months,credit_limit,reason,income,facebook_profile,state,zip,channel,job_name,real_state,ok_since,n_bankruptcies,n_defaulted_loans,n_accounts,n_issues,application_time_applied,application_time_in_funnel,email,external_data_provider_credit_checks_last_2_year,external_data_provider_credit_checks_last_month,external_data_provider_credit_checks_last_year,external_data_provider_email_seen_before,external_data_provider_first_name,external_data_provider_fraud_score,lat_lon,marketing_channel,profile_phone_number,reported_income,shipping_state,shipping_zip_code,profile_tags,user_agent,target_fraud
0,343b7e7b-2cf8-e508-b8fd-0a0285af30aa,False,1Rk8w4Ucd5yR3KcqZzLdow==,IOVu8au3ISbo6+zmfnYwMg==,350.0,101.800832,0.259555,108.427273,0.4,25033.92,36.0,0.0,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,65014.12,True,sjJbkqJS7cXalHLBFA+EOQ==,Ernn+uVXCMq/6ARrBCcd+A==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,N5/CE7lSkAfB04hVFFwllw==,,0.0,0.0,18.0,18.0,07:52:34,444,outlook.com,,2,0.0,51.0,leidelaura,645,"(-29.151545708122246, -51.1386461804385)",Invite-email,514-9840782,57849.0,BR-MT,17528,"{'tags': ['n19', 'n8']}",Mozilla/5.0 (Linux; Android 6.0.1; SGP771 Buil...,
1,bc2c7502-bbad-0f8c-39c3-94e881967124,False,DGCQep2AE5QRkNCshIAlFQ==,SaamrHMo23l/3TwXOWgVzw==,370.0,97.062615,0.942655,92.002546,0.24,,,39726.0,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,100018.91,False,xsd3ZdsI3356I3xMxZeiqQ==,rlWIXTBO+VOa34+SpGyhlQ==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,N5/CE7lSkAfB04hVFFwllw==,50.0,0.0,0.0,14.0,14.0,02:34:29,346,gmail.com,0.0,1,0.0,17.0,diocezio,243,"(-19.687710705798963, -47.94151536525154)",Radio-commercial,251-3659293,4902.0,BR-RS,40933,"{'tags': ['n6', 'n7', 'nim']}",Mozilla/5.0 (Linux; Android 5.0.2; SAMSUNG SM-...,
2,669630dd-2e6a-0396-84bf-455e5009c922,True,DGCQep2AE5QRkNCshIAlFQ==,Fv28Bz0YRTVAT5kl1bAV6g==,360.0,100.027073,0.351918,112.892453,0.29,7207.92,36.0,,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,65023.65,,Ygq6MsM98oC8yceExr69Ig==,PjTIDfJsK0DKL9fO7vuW2g==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,n+xK9CfX0bCn77lClTWviw==,,0.0,0.0,10.0,,00:60:02,6,gmail.com,,2,,9.0,veralucia,65,"(-28.748023890412284, -51.867279334353995)",Waiting-list,230-6097993,163679.0,BR-RR,50985,"{'tags': ['n0', 'n17', 'nim', 'da']}",Mozilla/5.0 (Linux; Android 6.0.1; SGP771 Buil...,
3,d235609e-b6cb-0ccc-a329-d4f12e7ebdc1,False,1Rk8w4Ucd5yR3KcqZzLdow==,dCm9hFKfdRm7ej3jW+gyxw==,510.0,101.599485,0.987673,94.902491,0.32,,,54591.0,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,68830.01,False,KET/Pmr6rHp1RJ/P9ymztw==,Cc/kWDLQH3dpHv5HU+pLVA==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiE56f...,n+xK9CfX0bCn77lClTWviw==,,1.0,0.0,19.0,19.0,11:20:49,406,spgov.com,,3,,38.0,venice,815,"(-17.520650158450454, -39.75801139933186)",Waiting-list,261-3543751,1086.0,BR-RN,37825,{'tags': ['n4']},Mozilla/5.0 (Linux; Android 6.0; HTC One X10 B...,
4,9e0eb880-e8f4-3faa-67d8-f5cdd2b3932b,False,8k8UDR4Yx0qasAjkGrUZLw==,+CxEO4w7jv3QPI/BQbyqAA==,500.0,98.474289,0.532539,118.126207,0.18,,,,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,60011.29,True,xsd3ZdsI3356I3xMxZeiqQ==,i036nmJ7rfxo+3EvCD7Jnw==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,n+xK9CfX0bCn77lClTWviw==,,0.0,0.0,11.0,,13:39:03,240,gmail.com,0.0,2,1.0,46.0,darzisa,320,"(-16.574259446978008, -39.90990074785962)",Invite-email,102-3660162,198618.0,BR-MT,52827,"{'tags': ['pro+aty', 'n19', 'da', 'b19']}",Mozilla/5.0 (Linux; Android 7.0; Pixel C Build...,
5,538c1908-bd80-b834-c3f0-238b4f536d3f,False,8k8UDR4Yx0qasAjkGrUZLw==,+CxEO4w7jv3QPI/BQbyqAA==,300.0,101.83704,0.915389,90.711273,0.44,,,61055.0,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,75024.28,False,JrdZzPZEa/YIIGwh8OdTKA==,kfWDI5wYFHdi9PtKFI9EPg==,NCqL3QBx0pscDnx3ixKwXg==,mLVIVxoGY7TUDJ1FyFoSIZi1SFcaBmO01AydRchaEiGYtU...,N5/CE7lSkAfB04hVFFwllw==,,0.0,0.0,9.0,9.0,05:27:02,169,gmail.com,,2,1.0,21.0,teomar,811,"(-6.762413011455668, -35.13224579733013)",Website,787-1678197,160198.0,BR-SP,55266,"{'tags': ['c1', 'n3', 'n9']}",Mozilla/5.0 (Linux; Android 6.0.1; Nexus 6P Bu...,
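
The preview above already hints at two data quality issues: several columns contain missing values, and the third row has `application_time_applied` equal to `00:60:02`, which is not a valid clock time. The sketch below reproduces three columns from these six rows and shows one possible way to quantify both issues. It is an illustration, not necessarily the cleaning approach used later in the analysis.

```python
import pandas as pd

# Values copied from the six rows shown above (subset of columns).
preview = pd.DataFrame({
    "application_time_applied": [
        "07:52:34", "02:34:29", "00:60:02", "11:20:49", "13:39:03", "05:27:02",
    ],
    "last_amount_borrowed": [25033.92, None, 7207.92, None, None, None],
    "credit_limit": [0.0, 39726.0, None, 54591.0, None, 61055.0],
})

# Share of missing values per column, largest first.
missing_share = preview.isna().mean().sort_values(ascending=False)

# errors="coerce" turns unparseable times into NaT instead of raising an error.
parsed = pd.to_datetime(
    preview["application_time_applied"], format="%H:%M:%S", errors="coerce"
)
invalid_times = preview.loc[parsed.isna(), "application_time_applied"]

print(missing_share)
print(invalid_times)
```

The same two checks can be run on the full `df` to see how widespread the missing values and malformed times are.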


# References
1. https://www.risk-officer.com/Credit_Risk.htm
2. https://www.investopedia.com/terms/c/creditrisk.asp
3. https://www.investopedia.com/terms/f/five-c-credit.asp
4. https://www.sciencedirect.com/topics/economics-econometrics-and-finance/interest-rate-risk