## Making Predictions
Now that the feature columns have been cleaned and prepared for modelling, I can start using the data to make predictions. 

First, though, I need to remember the class imbalance in my target column, `loan_status`: there are six times as many loans that were paid off on time (`1`) as loans that weren't (`0`). I will keep this in mind as I choose an error metric and build my machine learning models.

### Error Metric
Before I start, I should pick an error metric. Since I am viewing the problem as a conservative investor, I want to minimize risk, which means avoiding false positives (loans predicted to be paid off that actually aren't) as much as possible. Because of the class imbalance, accuracy is misleading here: a model that predicts `1` for every loan would be right most of the time while protecting me from no bad loans at all. Instead, I will measure error using the counts of true and false positives and negatives, and optimize for high recall (true positive rate) and low fall-out (false positive rate).
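
To make the accuracy problem concrete, here is a minimal sketch using made-up labels with the same 6:1 ratio (the counts of 600 and 100 are an assumption for illustration, not figures from the loans data). It scores a naive "model" that approves every loan:

```python
# Hypothetical labels with a 6:1 ratio: 600 repaid (1), 100 not repaid (0).
actual = [1] * 600 + [0] * 100

# A naive "model" that predicts every loan will be paid off.
predicted = [1] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(actual)
tpr = tp / (tp + fn)  # recall: share of good loans correctly approved
fpr = fp / (fp + tn)  # fall-out: share of bad loans wrongly approved

print(round(accuracy, 3), tpr, fpr)
```

The accuracy looks respectable, yet the false positive rate is 1.0: every bad loan gets funded. That is the failure mode the true and false positive rates expose and accuracy hides.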

### Logistic Regression
Logistic regression is a good first algorithm to apply to binary classification problems because it is quick to train, easy to interpret, and less prone to overfitting than more complex models like decision trees. To get a realistic estimate of how the model performs on data it has not seen, I will use k-fold cross-validation, so that each prediction comes from a model that was not trained on that row. This is how I will get my initial predictions.
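
As a rough illustration of what cross-validated predictions look like, the sketch below runs `cross_val_predict` on synthetic data. The dataset from `make_classification`, its size, its class weights, and `max_iter=1000` are all assumptions chosen for the example; they are not the loans data or the settings used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the loans data (assumption: ~86% positives).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.14, 0.86], random_state=0
)

model = LogisticRegression(max_iter=1000)

# With cv=3 the rows are split into three folds. Each fold is predicted by a
# model fitted on the other two, so no row is scored by a model that saw it.
oof_predictions = cross_val_predict(model, X, y, cv=3)

print(oof_predictions.shape)       # one prediction per row
print(np.unique(oof_predictions))  # predicted class labels
```

Because every row receives exactly one out-of-fold prediction, the result can be compared against the full target column directly, which is what the cell below does.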

In [2]:
#bring in data
import pandas as pd
import numpy as np

loans = pd.read_csv('cleaned_loans_2007.csv')

#import algorithms for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

#instantiating the model
lr = LogisticRegression()

#getting feature columns from df
cols = loans.columns
train_cols = cols.drop('loan_status')

#setting feature and target
features = loans[train_cols]
target = loans['loan_status']

#performing 3-fold cross validation
predictions = cross_val_predict(lr, features, target, cv=3)

#convert predictions to pandas series
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
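
The next cell passes `class_weight='balanced'` to `LogisticRegression`. According to the scikit-learn documentation, this mode weights each class by `n_samples / (n_classes * np.bincount(y))`, so errors on the rarer class cost more during training. The sketch below evaluates that formula on a made-up target with a 6:1 ratio (the counts are an assumption, not taken from the loans data):

```python
import numpy as np

# Hypothetical target with a 6:1 ratio of repaid (1) to not repaid (0).
y = np.array([1] * 600 + [0] * 100)

counts = np.bincount(y)  # rows per class, indexed by label: [100, 600]
n_samples = len(y)
n_classes = len(counts)

# The 'balanced' heuristic: n_samples / (n_classes * count of each class).
weights = n_samples / (n_classes * counts)

for label, weight in enumerate(weights):
    print(label, round(float(weight), 3))
```

In this toy case a mistake on a `0` row is penalized six times as heavily as a mistake on a `1` row, which should push the model toward predicting fewer false positives at the cost of some recall.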


In [4]:
#instantiating the model
lr = LogisticRegression(class_weight='balanced')

#performing 3-fold cross validation
predictions = cross_val_predict(lr, features, target, cv=3)

#convert predictions to pandas series
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.6593237240504034
0.3796972395369546
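
Both cells above repeat the same counting logic, so it could be wrapped in a small helper. The function name `tpr_fpr` and the toy labels below are hypothetical, and the sketch assumes both classes appear in the actual labels (otherwise one of the divisions would be by zero):

```python
import numpy as np

def tpr_fpr(predicted, actual):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    # Converting to arrays compares values by position, not by pandas index.
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)

    tp = np.sum((predicted == 1) & (actual == 1))
    fp = np.sum((predicted == 1) & (actual == 0))
    fn = np.sum((predicted == 0) & (actual == 1))
    tn = np.sum((predicted == 0) & (actual == 0))

    return tp / (tp + fn), fp / (fp + tn)

# Toy check: 4 good loans (3 approved), 2 bad loans (1 approved).
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]
tpr, fpr = tpr_fpr(predicted, actual)
print(tpr, fpr)  # 0.75 0.5
```

With a helper like this, each experiment would reduce to a call such as `tpr_fpr(predictions, loans["loan_status"])` after `cross_val_predict`.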
