# Introduction

In this prject, we will walk through the full data science life cycle, from data cleaning and feature selection to machine learning. We will focus on credit modelling, a well known data science problem that focuses on modeling a borrower's credit risk. We'll be working with financial lending data from Lending Club. Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. 

Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data and assign an interest rate to the borrower. The interest rate is the percent in addition to the requested loan amount the borrower has to pay back. 

Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges.

The borrower then makes monthly payments back to Lending Club either over 36 months or over 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.

The data dictionary can be found in the project folder.

_Goal: Build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not_

## Preliminary Cleaning

We'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

To ensure that code runs fast on our platform, we first reduce the size of LoanStats3a.csv by:

 - removing the first line: contains the extraneous text Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action) instead of the column titles, which prevents the dataset from being parsed by the pandas library properly
 - removing the desc column: contains a long text explanation for each loan
 - removing the url column: contains a link to each loan on Lending Club which can only be accessed with an investor account
 - removing all columns containing more than 50% missing values: allows us to move faster since we can spend less time trying to fill these values

In [1]:
import pandas as pd
#remove first line
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
loans_2007.head()
#drop desc and url cols
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
#remove all columns containing more than 50% missing vals
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
#putting dataframe into csv file
loans_2007.to_csv('loans_2007.csv', index=False)

  interactivity=interactivity, compiler=compiler, result=result)


In [2]:
#display the first row and the number of columns
print(loans_2007.iloc[0], loans_2007.shape[1])

loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                    

## Data Cleaning
The Dataframe contains many columns and can be cumbersome to try to explore all at once. We'll break up the columns into 3 groups of 18 columns.

**Group 1:**
After analyzing each column, we can conclude that the following features need to be removed:

 - funded_amnt: leaks data from the future (after the loan is already started to be funded)
 - funded_amnt_inv: also leaks data from the future (after the loan is already started to be funded)
 - grade: contains redundant information as the interest rate column (int_rate)
 - sub_grade: also contains redundant information as the interest rate column (int_rate)
 - emp_title: requires other data and a lot of processing to potentially be useful
 - issue_d: leaks data from the future (after the loan is already completed funded)
 
Recall that Lending Club assigns a grade and a sub-grade based on the borrower's interest rate. While the grade and sub_grade values are categorical, the int_rate column contains continuous values, which are better suited for machine learning.

In [3]:
loans_2007 = loans_2007.drop(["funded_amnt", "funded_amnt_inv", "grade", 
                              "sub_grade", "emp_title", "issue_d"], axis=1)

**Group 2:**
Within this group of columns, we need to drop the following columns:

 - zip_code: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which only can be used to identify the state the borrower lives in)
 - out_prncp: leaks data from the future, (after the loan already started to be paid off)
 - out_prncp_inv: also leaks data from the future, (after the loan already started to be paid off)
 - total_pymnt: also leaks data from the future, (after the loan already started to be paid off)
 - total_pymnt_inv: also leaks data from the future, (after the loan already started to be paid off)
 - total_rec_prncp: also leaks data from the future, (after the loan already started to be paid off)
 
The out_prncp and out_prncp_inv both describe the outstanding principal amount for a loan, which is the remaining amount the borrower still owes. These 2 columns as well as the total_pymnt column describe properties of the loan after it's fully funded and started to be paid off. This information isn't available to an investor before the loan is fully funded and we don't want to include it in our model.

In [4]:
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", 
                              "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

**Group 3:**
In the last group of columns, we need to drop the following columns:

 - total_rec_int: leaks data from the future, (after the loan already started to be paid off)
 - total_rec_late_fee: also leaks data from the future, (after the loan already started to be paid off)
 - recoveries: also leaks data from the future, (after the loan already started to be paid off)
 - collection_recovery_fee: also leaks data from the future, (after the loan already started to be paid off)
 - last_pymnt_d: also leaks data from the future, (after the loan already started to be paid off)
 - last_pymnt_amnt: also leaks data from the future, (after the loan already started to be paid off)
 - hardship_flag: not in data dictionary.
 - disbursement_method: redundant infomation.
 - debt_settlement_flag: not in data dictionary.

In [5]:
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", 
                              "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt",
                             "hardship_flag", "disbursement_method", "debt_settlement_flag"], axis=1)

In [6]:
#display first row and number of cols
print(loans_2007.iloc[0],loans_2007.shape[1])
pd.options.display.max_columns = None


loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               O

### Target Column
We should use the loan_status column, since it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on the borrower. Currently, this column contains text values and we need to convert it to a numerical one for training a model. Let's explore it.

In [7]:
loans_2007['loan_status'].value_counts()

Fully Paid                                             34116
Charged Off                                             5670
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Name: loan_status, dtype: int64

From the LendingClub website and some google searching, here are the explanations:

 - Fully Paid:	loan has been fully paid off.
 - Charged Off: loan for which there is no longer a reasonable expectation of further payments.
 - Does not meet the credit policy: while the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.
 - Does not meet the credit policy: while the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.
 
From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification. We'll remove all the loans that don't contain either Fully Paid and Charged Off as the loan's status and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case.

_Note: There is a class imbalance between the positive and negative cases. While there are 34,116 loans that have been fully paid off, there are only 5,670 that were charged off. This class imbalance is a common problem in binary classification and during training, the model ends up having a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with less observations. The stronger the imbalance, the more biased the model becomes._

In [8]:
#remove all rows from loans_2007 that contain values other than Fully Paid or Charged Off for the loan_status column.
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]

#create mapping dict for replacement
status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

#replace values according to mapping dict
loans_2007 = loans_2007.replace(status_replace)

In [9]:
#remove any columns from loans_2007 that contain only one unique value
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
#display drop_columns so we know which ones were removed
print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


### Progress Update:
So far, we removed many columns that aren't useful for modeling. We also selected our target column and decided to focus our modeling efforts on binary classification. Next, we'll explore the individual features in greater depth and work towards training our first machine learning model.

## Feature Preperation

We will now prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

We start by computing the number of missing values and come up with a strategy for handling them. Then, we'll focus on the categorical columns.

In [10]:
#return the number of null values in each column
null_counts = loans_2007.isnull().sum()
print(null_counts)

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1078
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64


While most of the columns have 0 missing values, 3 columns have 50 or less rows with missing values, and 1 column, pub_rec_bankruptcies, contains 697 rows with missing values. Let's remove columns entirely where more than 1% of the rows for that column contain a null value. In addition, we'll remove the remaining rows containing null values, except emp_length.

In [11]:
#remove the pub_rec_bankruptcies column from loans
loans = loans_2007.drop("pub_rec_bankruptcies", axis=1)
#remove all rows from loans containing any missing values
loans = loans_2007.dropna(axis=0)
#return the counts for each column data type
print(loans.dtypes.value_counts())

float64    11
object     11
int64       1
dtype: int64


While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new Dataframe containing just the object columns so we can explore them in more depth

In [12]:
object_columns_df = loans.select_dtypes(include=['object'])
print(object_columns_df)

             term int_rate emp_length home_ownership verification_status  \
0       36 months   10.65%  10+ years           RENT            Verified   
1       60 months   15.27%   < 1 year           RENT     Source Verified   
2       36 months   15.96%  10+ years           RENT        Not Verified   
3       36 months   13.49%  10+ years           RENT     Source Verified   
4       60 months   12.69%     1 year           RENT     Source Verified   
5       36 months    7.90%    3 years           RENT     Source Verified   
6       60 months   15.96%    8 years           RENT        Not Verified   
7       36 months   18.64%    9 years           RENT     Source Verified   
8       60 months   21.28%    4 years            OWN     Source Verified   
9       60 months   12.69%   < 1 year           RENT            Verified   
10      60 months   14.65%    5 years            OWN        Not Verified   
11      36 months   12.69%  10+ years            OWN     Source Verified   
12      36 m

Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

 - home_ownership: home ownership status, can only be 1 of 4 categorical values according to the data dictionary
 - verification_status: indicates if income was verified by Lending Club
 - emp_length: number of years the borrower was employed upon time of application
 - term: number of payments on the loan, either 36 or 60
 - addr_state: borrower's state of residence
 - purpose: a category provided by the borrower for the loan request
 - title: loan title provided the borrower
 
There are also some columns that represent numeric values, that need to be converted:

 - int_rate: interest rate of the loan in %
 - revol_util: revolving line utilization rate or the amount of credit the borrower is using relative to all available credit
 
Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

 - earliest_cr_line: The month the borrower's earliest reported credit line was opened
 - last_credit_pull_d: The most recent month Lending Club pulled credit for this loan
 
Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.

In [13]:
#display unique value counts of the columnns that seem like they contain categorical values
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for col in cols:
    print(loans[col].value_counts())

RENT        18091
MORTGAGE    16991
OWN          2775
OTHER          96
Name: home_ownership, dtype: int64
Not Verified       15773
Verified           12376
Source Verified     9804
Name: verification_status, dtype: int64
10+ years    8778
< 1 year     4410
2 years      4305
3 years      4033
4 years      3390
5 years      3246
1 year       3148
6 years      2195
7 years      1749
8 years      1458
9 years      1241
Name: emp_length, dtype: int64
 36 months    27538
 60 months    10415
Name: term, dtype: int64
CA    6825
NY    3624
FL    2727
TX    2628
NJ    1792
IL    1475
PA    1472
VA    1345
GA    1333
MA    1275
OH    1173
MD    1009
AZ     820
WA     784
CO     746
NC     730
CT     724
MI     679
MO     653
MN     584
NV     479
SC     456
OR     430
WI     428
AL     426
LA     417
KY     321
OK     292
KS     258
UT     247
AR     232
DC     209
RI     195
NM     178
WV     168
HI     168
NH     159
DE     108
MT      79
WY      78
AK      78
SD      61
VT      52
MS      19


In [14]:
#look at the unique value counts for the purpose and title columns to understand which column to keep
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

debt_consolidation    17971
credit_card            4906
other                  3720
home_improvement       2836
major_purchase         2090
small_business         1730
car                    1480
wedding                 915
medical                 659
moving                  548
house                   364
vacation                345
educational             294
renewable_energy         95
Name: purpose, dtype: int64
Debt Consolidation                                   2127
Debt Consolidation Loan                              1691
Personal Loan                                         620
Consolidation                                         499
debt consolidation                                    483
Home Improvement                                      345
Credit Card Consolidation                             342
Debt consolidation                                    318
Small Business Loan                                   317
Credit Card Loan                                      307


The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them.

It seems like the purpose and title columns do contain overlapping information but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).

We can use the following mapping to clean the emp_length column:

 - "10+ years": 10
 - "9 years": 9
 - "8 years": 8
 - "7 years": 7
 - "6 years": 6
 - "5 years": 5
 - "4 years": 4
 - "3 years": 3
 - "2 years": 2
 - "1 year": 1
 - "< 1 year": 0
 - "n/a": 0
We assume that people who may have been working more than 10 years have only really worked for 10 years. We also assume that people who've worked less than a year or if the information is not available that they've worked for 0. This is an imperfect general heuristic.

Lastly, the addr_state column contains many discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.

In [15]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
#removing '%' signs and converting to float
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
#making categorical values numerical, for analysis
loans = loans.replace(mapping_dict)

In [16]:
#encode applicable columns as dummy variables so we can use them in our model
dummy_loans = pd.get_dummies(loans[["term", "verification_status", "purpose", "term", 'home_ownership']])
#concatenate result to loans dataframe
loans = pd.concat([loans, dummy_loans], axis=1)
#drop original columns
loans = loans.drop(["verification_status", "term", "purpose", "term", "home_ownership"], axis=1)

### Progress Update:

We performed the last amount of data preparation necessary to start training machine learning models. We converted all of the columns to numerical values because those are the only type of value scikit-learn can work with. Now, we'll experiment with training models and evaluate accuracy using cross-validation.

## Making Predictions

An error metric will help us figure out when our model is performing well, and when it's performing poorly. Our objective in this is to make money; we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money.

In this case, we're primarily concerned with false positives and false negatives. Both of these are different types of misclassifications. With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

Since we're viewing this problem from the standpoint of a conservative investor, we need to treat false positives differently than false negatives. A conservative investor would want to minimize risk, and avoid false positives as much as possible. They'd be more okay with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives).

It is important to always be aware of imbalanced classes in machine learning models, and to adjust your error metric accordingly. In this case, we don't want to use accuracy, and should instead use metrics that tell us the number of false positives and false negatives.

This means that we should optimize for:

 - high recall (true positive rate)
 - low fall-out (false positive rate)

In [17]:
print(loans.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37953 entries, 0 to 39749
Data columns (total 40 columns):
loan_amnt                              37953 non-null float64
int_rate                               37953 non-null float64
installment                            37953 non-null float64
emp_length                             37953 non-null int64
annual_inc                             37953 non-null float64
loan_status                            37953 non-null int64
dti                                    37953 non-null float64
delinq_2yrs                            37953 non-null float64
inq_last_6mths                         37953 non-null float64
open_acc                               37953 non-null float64
pub_rec                                37953 non-null float64
revol_bal                              37953 non-null float64
revol_util                             37953 non-null float64
total_acc                              37953 non-null float64
pub_rec_bankruptcies       

As we can see above, our cleaned dataset contains 40 columns, all of which are either the uint8, int64 or the float64 data type. There aren't any null values in any of the columns. This means that we can now apply any machine learning algorithm to our dataset.

A good first algorithm to apply to binary classification problems is logistic regression, for the following reasons:

 - it's quick to train and we can iterate more quickly
 - it's less prone to overfitting than more complex models like decision trees
 - it's easy to interpret

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

lr = LogisticRegression()
train_cols = loans.columns[loans.columns != 'loan_status']
features = loans[train_cols]
target = loans["loan_status"]
#fit a logistic regression to features and target
lr.fit(features, target)
#make predictions on features
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)
print(tpr)
print(fpr)

0.9994319795512638
0.9990259107734268


Unfortunately, even through we're not using accuracy as an error metric, the classifier is, and it isn't accounting for the imbalance in the classes. To get the classifier to correct for imbalanced classes, we will tell the classifier to penalize misclassifications of the less prevalent class more than the other class. This is easy to implement using scikit-learn.

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

#tell classifier to penalize misclassifications
lr = LogisticRegression(class_weight="balanced")
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.6286408532929407
0.6146503019676602


In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

#setting harsher penalties
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.2376218877212913
0.2433274887979739


## Random Forrest

Let's try a different algorithm.

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.`
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.9689482154690903
0.9649327878433664


## Conclusions

Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is likely weighting too heavily on the 1 class, and still mostly predicting 1s. We could fix this by applying a harsher penalty for misclassifications of 0s.

Ultimately, our best model had a false positive rate of 24%, and a true positive rate of 23%. For a conservative investor, this means that they make money as long as the interest rate is high enough to offset the losses from 24% of borrowers defaulting, and that the pool of 23% of borrowers is large enough to make enough interest money to offset the losses.

If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them, and our model is better than that, although we're excluding more loans than a random strategy would. Given this, there's still quite a bit of room to improve:

We can tweak the penalties further.
 - We can try models other than a random forest and logistic regression.
 - We can use some of the columns we discarded to generate better features.
 - We can ensemble multiple models to get more accurate predictions.
 - We can tune the parameters of the algorithm to achieve higher performance.