
# Random Forest Project 

For this project we will explore publicly available data from [LendingClub.com](www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile suggests a high probability of paying you back. We will try to build a model that helps predict this.

We will use lending data from 2007-2010 to predict whether or not a borrower paid back their loan in full. This outcome is recorded in the not.fully.paid column, which serves as the target for the models below.

Here is what the columns represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
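
Two of the columns above are stored in a transformed form: log.annual.inc is a natural log and int.rate is a proportion. As a minimal sketch, the values below are hypothetical (they are not taken from the dataset) and only illustrate how to convert each encoding back to a more familiar unit.

```python
import numpy as np

# Hypothetical values, using the same encoding as the columns described above.
log_annual_inc = 11.35   # natural log of self-reported annual income
int_rate = 0.11          # interest rate stored as a proportion

# Undo the natural log to recover the income in its original unit.
annual_income = np.exp(log_annual_inc)

# Convert the proportion to a percentage.
int_rate_percent = int_rate * 100

print(f"annual income: about {annual_income:,.0f}")
print(f"interest rate: {int_rate_percent:.1f}%")
```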

# Import Libraries

**Import the usual libraries for pandas and plotting. You can import sklearn later on.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Loading the Data

In [None]:
# Load the loan data (the file name must not contain a trailing space)
loans = pd.read_csv('loan_data.csv')

**Check out the info(), head(), and describe() methods on loans.**

In [None]:
loans.info()

In [None]:
loans.head()

In [None]:
loans.describe()

# Exploratory Data Analysis

Let's do some data visualization! We'll use seaborn and pandas' built-in plotting capabilities.

**Creating a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.**

In [None]:
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',bins=30,label='Credit.Policy=1')

loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')


**Creating a histogram of two FICO distributions on top of each other, one for each value of the not.fully.paid column.**

In [None]:
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',bins=30,label='not.fully.paid=1')

loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

**Creating a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.**

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x='purpose', hue='not.fully.paid', data=loans,palette='Set1')

**Let's see the trend between FICO score and interest rate.**

In [None]:
sns.jointplot(x='fico', y='int.rate', data=loans, kind='scatter')

**Creating lmplots to see whether the trend differs between not.fully.paid and credit.policy.**

In [None]:
sns.lmplot(x='fico',y='int.rate',data=loans,hue='credit.policy', col='not.fully.paid',palette='Set1')

In [None]:
loans['purpose'].unique()

In [None]:
cat_feats=['purpose']

In [None]:
# One-hot encode "purpose" into dummy columns; drop_first=True omits the
# first category, so one fewer column is created than there are categories.
final_data = pd.get_dummies(loans, columns=cat_feats, drop_first=True)
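
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The `toy` DataFrame and its values are made up for illustration and are not rows from loan_data.csv.

```python
import pandas as pd

# Toy frame (hypothetical values) with three distinct purposes.
toy = pd.DataFrame({
    "fico": [700, 650, 720, 680],
    "purpose": ["credit_card", "all_other", "educational", "credit_card"],
})

# Categories are ordered alphabetically, so "all_other" is the first one
# and is dropped; it becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(encoded.columns.tolist())
# ['fico', 'purpose_credit_card', 'purpose_educational']
```

A row whose purpose is the dropped category has every dummy column set to zero (or False), so no information is lost.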

In [None]:
final_data.info()

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
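
The split above is purely random. If the target classes are imbalanced, one optional variation (an assumption here, not something the notebook does) is to pass `stratify=y` so both subsets keep roughly the same class ratio. The sketch below uses synthetic arrays, and the 16% positive rate is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; sizes and the positive rate are arbitrary.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.16).astype(int)

# stratify=y_toy keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # 70% / 30% of the rows
print(y_tr.mean(), y_te.mean())  # share of positives in each subset
```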

## Training a Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
loans.head()

**Creating an instance of DecisionTreeClassifier() called dtree and fitting it to the training data.**

In [None]:
dtree=DecisionTreeClassifier()

In [None]:
dtree.fit(X_train,y_train)

## Predictions and Evaluation of Decision Tree
**Creating predictions from the test set and creating a classification report and a confusion matrix.**

In [None]:
predictions=dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions) )
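
As a reading aid for the two outputs above, this sketch shows how the precision and recall in a classification report relate to the cells of a binary confusion matrix. The labels are hand-made toy values, not model output.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 is the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# For binary labels, scikit-learn lays the matrix out as
# [[tn, fp],
#  [fn, tp]]  (rows = true class, columns = predicted class).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)       # 5 1 2 2
print(precision, recall)
```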

## Training the Random Forest model

**Creating an instance of the RandomForestClassifier class and fitting it to our training data from the previous step.**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc=RandomForestClassifier(n_estimators=600)

In [None]:
rfc.fit(X_train,y_train)
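
Once a forest is fitted, its `feature_importances_` attribute gives a rough ranking of which inputs the trees relied on. The sketch below is an optional extra, not part of the original notebook, and runs on synthetic data from `make_classification`; the column names are placeholders. With the notebook's objects, the equivalent would use `rfc.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (not the loan features).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
cols = [f"feature_{i}" for i in range(6)]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```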

## Predictions and Evaluation

Predicting the class labels for the test set (to be compared against y_test).

In [None]:
predictions=rfc.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))
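
Because `predictions` is reused for both models, the two reports are never shown side by side. One way to compare them directly is to loop over the models and collect a few metrics, as in this sketch. It uses synthetic, imbalanced data; the class weights, sample size, and `n_estimators=100` are arbitrary assumptions, so the numbers say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data as a stand-in for the loan features.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=101
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and record accuracy plus
# recall on the minority class (label 1).
results = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, preds),
        "recall_class_1": recall_score(y_te, preds, pos_label=1),
    }

for name, scores in results.items():
    print(name, scores)
```

Looking at recall on the minority class alongside accuracy matters when one class dominates, since a model can score high accuracy while missing most of the rarer cases.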