# Supervised learning models for credit risk prediction

### Notebook by [José David Rocha](https://github.com/davidrocha9), [Telmo Botelho](https://github.com/Telmo465)
#### Supported by [Luis Paulo Reis](https://web.fe.up.pt/~lpreis/)
#### [Faculdade de Engenharia da Universidade do Porto](https://sigarra.up.pt/feup/en/web_page.inicial)

#### It is recommended to [view this notebook in nbviewer]() for the best overall experience
#### You can also execute the code on this notebook using [Jupyter Notebook](https://jupyter.org/) or [Binder](https://mybinder.org/) (no local installation required)

## Table of contents

1. [Introduction](#Introduction)

2. [License](#License)

3. [Required libraries](#Required-libraries)

4. [The problem domain](#The-problem-domain)

5. [Step 1: Answering the question](#Step-1:-Answering-the-question)

## Introduction

### TO DO

[[ go back to the top ]](#Table-of-contents)

## Required libraries

[[ go back to the top ]](#Table-of-contents)

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.
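
Before running the rest of the notebook, it can help to confirm that these libraries are installed. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is ours and is not part of any of the listed packages. Note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the libraries listed above
# (scikit-learn is imported as `sklearn`).
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If any package is reported as missing, installing the Anaconda distribution mentioned above should provide it.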

## Step 1: Answering the question

[[ go back to the top ]](#Table-of-contents)

The first step in any data analysis project is to define the question or problem we want to solve, and to define a measure (or set of measures) of our success at solving it. The data analysis checklist asks a handful of questions for this purpose, so we begin by working through them.

#### Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?

>We're trying to design a model that predicts credit risk, i.e., that quantifies the creditworthiness of potential borrowers and their ability to honor their debt obligations. This is known as default risk: a borrower defaults when they fail to pay their debt on time. The prediction is based on a set of features, such as the amount of money requested by the borrower, the interest rate of the loan, the loan grade history of the borrower, the annual income of the borrower, the purpose of the loan, the monthly installment, and the term of the loan.

#### Did you define the metric for success before beginning?

> Let's do that now. Since this is a classification task, we can use accuracy (the fraction of loans that are classified correctly) as the metric for the quality of our model. After some research, we concluded that, in the long term, a model with an accuracy below 70% will underperform and will not generate profit. We will try to match or beat this accuracy on loans dating from 2015 to 2017.
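
As a rough illustration of this metric, the sketch below computes accuracy by hand and compares it with the 70% target. The labels are made up for the example and do not come from the dataset; the names `accuracy` and `TARGET_ACCURACY` are ours.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


TARGET_ACCURACY = 0.70  # threshold stated in the text

# Made-up labels for illustration only: 1 = default, 0 = no default.
y_true = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

score = accuracy(y_true, y_pred)
print(f"accuracy = {score:.2f}, meets target: {score >= TARGET_ACCURACY}")
```

In practice, `sklearn.metrics.accuracy_score` computes the same quantity for a fitted model's predictions.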

#### Did you record the experimental design?

> No. We will be using a public dataset, available on [Kaggle](https://www.kaggle.com/rameshmehta/credit-risk-analysis), that contains information about more than 800 000 loans dating from 2015 to 2017.

#### Did you consider whether the question could be answered with the available data?

>The provided dataset has records of more than 800 000 loans, dating from 2015 to 2017. Taking that into account, and after processing and cleaning the data, we are confident that we will be able to make good predictions about the credit risk of loans. On the whole, we believe we have more than enough data to answer the question.

<hr />

```python
import pandas as pd

# Load the loan records and preview the first five rows.
read_data = pd.read_csv('data.csv')
read_data.head()
```

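| Unnamed: 0 | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | default_ind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975.0 | 36 months | 10.65 | 162.87 | B | B2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 1077430 | 1314167 | 2500 | 2500 | 2500.0 | 60 months | 15.27 | 59.83 | C | C4 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 2 | 1077175 | 1313524 | 2400 | 2400 | 2400.0 | 36 months | 15.96 | 84.33 | C | C5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | 1076863 | 1277178 | 10000 | 10000 | 10000.0 | 36 months | 13.49 | 339.31 | C | C1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 4 | 1075358 | 1311748 | 3000 | 3000 | 3000.0 | 60 months | 12.69 | 67.79 | B | B5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |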
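
The preview shows a binary target column, `default_ind`, and several columns with no values in the first rows. A natural next check is how the target classes are distributed and how much of each column is missing. The sketch below assumes `default_ind` is the target and uses a toy frame copied from a subset of the five rows above, so that it runs on its own; in the notebook, `read_data` would be passed instead. The helper name `summarize` is ours.

```python
import pandas as pd

# Toy frame copied from the five rows shown above (subset of columns).
# In the notebook, pass `read_data` to `summarize` instead.
sample = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
    "term": ["36 months", "60 months", "36 months", "36 months", "60 months"],
    "int_rate": [10.65, 15.27, 15.96, 13.49, 12.69],
    "installment": [162.87, 59.83, 84.33, 339.31, 67.79],
    "grade": ["B", "C", "C", "C", "B"],
    "il_util": [float("nan")] * 5,
    "default_ind": [0, 1, 0, 0, 0],
})


def summarize(df, target="default_ind"):
    """Return the share of each target class and the fraction of missing values per column."""
    class_share = df[target].value_counts(normalize=True).sort_index()
    missing_share = df.isna().mean().sort_values(ascending=False)
    return class_share, missing_share


class_share, missing_share = summarize(sample)
print(class_share)
print(missing_share.head())
```

Columns that are almost entirely missing are candidates for removal during cleaning, and a strongly imbalanced target is worth keeping in mind when interpreting accuracy.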
