# Week_3_1: Identifying safe loans with decision trees

[LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to [default](https://en.wikipedia.org/wiki/Default_(finance)).

You will use LendingClub data to predict whether a loan will be paid off in full or will be [charged off](https://en.wikipedia.org/wiki/Charge-off) and possibly go into default. In this assignment you will:

* Use SFrames to do some feature engineering.
* Train a decision tree on the LendingClub dataset.
* Visualize the tree.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Train a complex tree model and compare it to a simple tree model.

In [1]:
import sframe
loans = sframe.SFrame('lending-club-data.gl/')

[INFO] SFrame v1.8.3 started. Logging /tmp/sframe_server_1457995043.log


# Explore the target column

In [2]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)

In [3]:
loans = loans.remove_column('bad_loans')

In [4]:
import numpy as np
good = loans['safe_loans'] == 1
print 'fraction of safe loans:', good.sum()/float(len(loans))

fraction of safe loans: 0.811185331996


# Features for classification algorithm

In [5]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

# Sample data to balance classes

In [6]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print "Number of safe loans  : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)

Number of safe loans  : 99457
Number of risky loans : 23150


In [7]:
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

loans_data = risky_loans.append(safe_loans)
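
The balancing step above keeps every risky loan and a random fraction of the safe loans, where the fraction is the risky-to-safe ratio. The following is a minimal plain-Python sketch of that idea using the class counts printed above. `undersample` is a hypothetical helper (not part of SFrame), and it assumes each row is kept independently with the given probability, so the result is only approximately balanced.

```python
import random

def undersample(majority, minority_size, seed=1):
    """Keep each majority item with probability minority_size / len(majority)."""
    rng = random.Random(seed)
    keep_fraction = minority_size / float(len(majority))
    kept = [item for item in majority if rng.random() < keep_fraction]
    return kept, keep_fraction

# Class counts printed by the notebook above
n_safe, n_risky = 99457, 23150

# Stand-in row ids; the real rows live in the SFrame
kept_safe, keep_fraction = undersample(list(range(n_safe)), n_risky)

# Share of safe loans after balancing; should be close to 0.5
balanced_share = len(kept_safe) / float(len(kept_safe) + n_risky)
print('kept %d safe loans (fraction %.4f), safe share %.3f'
      % (len(kept_safe), keep_fraction, balanced_share))
```

Undersampling the majority class discards data, but it keeps a classifier from scoring well simply by always predicting "safe".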

# One-hot encoding

In [8]:
categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)
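
To make the encoding loop above concrete, here is a small plain-Python sketch of one-hot encoding on a toy table. The rows and the `one_hot_encode` helper are hypothetical and only illustrate the transformation; the `feature.value` column naming is an assumption chosen to resemble what `unpack` with `column_name_prefix` produces.

```python
def one_hot_encode(rows, feature):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted(set(row[feature] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the categorical one
        new_row = {k: v for k, v in row.items() if k != feature}
        # Add one indicator column per category; absent categories get 0
        for category in categories:
            new_row[feature + '.' + category] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded

# Hypothetical toy rows, not taken from the LendingClub data
toy_loans = [
    {'grade': 'A', 'dti': 10.5},
    {'grade': 'B', 'dti': 22.1},
    {'grade': 'A', 'dti': 3.0},
]
encoded = one_hot_encode(toy_loans, 'grade')
print(encoded[0])
```

The encoding is needed because scikit-learn's `DecisionTreeClassifier` expects a numeric feature matrix.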

# Split data

In [9]:
train_data, validation_data = loans_data.random_split(0.8, seed=1)

# Build a decision tree classifier

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
train_labels = np.array(train_data['safe_loans'])
train_feature_matrix = train_data.remove_column('safe_loans')
train_feature_matrix = train_feature_matrix.to_numpy()

In [12]:
decision_tree_model = DecisionTreeClassifier(max_depth=6)
decision_tree_model.fit(train_feature_matrix, train_labels)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [13]:
small_model = DecisionTreeClassifier(max_depth=2)
small_model.fit(train_feature_matrix, train_labels)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
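
Both models above report `criterion='gini'`, meaning candidate splits are scored by Gini impurity. As a rough illustration (a sketch, not scikit-learn's implementation), the impurity of a node is one minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its children. The helper names below are hypothetical.

```python
def gini_impurity(labels):
    """1 - sum_k p_k^2, where p_k is the share of class k among the labels."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    n = float(len(left) + len(right))
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

mixed_node = [+1, +1, -1, -1]                 # evenly mixed node
perfect_split = weighted_gini([+1, +1], [-1, -1])
useless_split = weighted_gini([+1, -1], [+1, -1])
print('%.2f %.2f %.2f' % (gini_impurity(mixed_node), perfect_split, useless_split))
```

A lower weighted impurity means purer children, so a split that separates the classes completely scores 0.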

# Visualizing a learned model

In [14]:
from sklearn import tree
import pydot
import graphviz

In [15]:
tree.export_graphviz(small_model, out_file='tree.dot')

In [16]:
graph = pydot.graph_from_dot_file('tree.dot')
#graph.write_png('tree.png')

# Making predictions

In [17]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data

sample_validation_labels = np.array(sample_validation_data['safe_loans'])
sample_validation_feature_matrix = sample_validation_data.remove_column('safe_loans')
sample_validation_feature_matrix = sample_validation_feature_matrix.to_numpy()


In [18]:
decision_tree_model.predict(sample_validation_feature_matrix)

array([ 1, -1, -1,  1])

In [19]:
sample_validation_labels

array([ 1,  1, -1, -1])

In [20]:
print '50% accuracy'

50% accuracy
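
The 50% figure can be checked by comparing the two arrays printed above element by element. A minimal sketch in plain Python, with `accuracy` as a hypothetical helper:

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / float(len(labels))

predictions = [1, -1, -1, 1]   # output of decision_tree_model.predict above
labels = [1, 1, -1, -1]        # sample_validation_labels above

sample_accuracy = accuracy(labels, predictions)
print('%.0f%% accuracy' % (100 * sample_accuracy))
```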


# Explore probability predictions

In [21]:
decision_tree_model.predict_proba(sample_validation_feature_matrix)

array([[ 0.34156543,  0.65843457],
       [ 0.53630646,  0.46369354],
       [ 0.64750958,  0.35249042],
       [ 0.20789474,  0.79210526]])

In [22]:
decision_tree_model.classes_

array([-1,  1])

In [23]:
print '4th loan'

4th loan
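
The columns of `predict_proba` follow the order of `classes_`, so with `classes_` equal to `[-1, 1]` the second column is the probability of a safe loan. The sketch below uses the probabilities printed above to find the loan most likely to be safe and to recover the class predictions by taking the more probable class in each row. Variable names are illustrative.

```python
classes = [-1, 1]   # decision_tree_model.classes_ above
probabilities = [   # decision_tree_model.predict_proba output above
    [0.34156543, 0.65843457],
    [0.53630646, 0.46369354],
    [0.64750958, 0.35249042],
    [0.20789474, 0.79210526],
]

# Column holding P(safe loan)
safe_column = classes.index(1)
safe_probs = [row[safe_column] for row in probabilities]

# 0-based index of the loan with the highest probability of being safe
most_likely_safe = max(range(len(safe_probs)), key=lambda i: safe_probs[i])

# Predicted class is the one with the larger probability in each row
predicted = [classes[row.index(max(row))] for row in probabilities]
print('loan #%d is most likely safe; predictions: %s' % (most_likely_safe + 1, predicted))
```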


# Tricky predictions!

In [24]:
small_model.predict_proba(sample_validation_feature_matrix)

array([[ 0.41896585,  0.58103415],
       [ 0.59255339,  0.40744661],
       [ 0.59255339,  0.40744661],
       [ 0.23120112,  0.76879888]])

In [25]:
small_model.classes_

array([-1,  1])

In [26]:
small_model.predict(sample_validation_feature_matrix)

array([ 1, -1, -1,  1])

# Visualize the prediction on a tree

# Evaluating the accuracy

In [27]:
decision_tree_model.score(train_feature_matrix, train_labels)

0.64052761659144641

In [28]:
small_model.score(train_feature_matrix, train_labels)

0.61350204169353106

In [29]:
validation_labels = np.array(validation_data['safe_loans'])
validation_feature_matrix = validation_data.remove_column('safe_loans')
validation_feature_matrix = validation_feature_matrix.to_numpy()

In [30]:
decision_tree_model.score(validation_feature_matrix, validation_labels)

0.63614821197759586

# Big model

In [31]:
big_model = DecisionTreeClassifier(max_depth=10)
big_model.fit(train_feature_matrix, train_labels)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [32]:
big_model.score(train_feature_matrix, train_labels)

0.6637384483129164

In [33]:
big_model.score(validation_feature_matrix, validation_labels)

0.62688496337785438
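
Comparing the scores above shows the usual overfitting pattern: the deeper tree fits the training set better but does worse on the validation set. A short sketch that makes the gap explicit, using only the accuracies reported in this notebook:

```python
# max_depth -> (training accuracy, validation accuracy), copied from the cells above
scores = {
    6: (0.64052761659144641, 0.63614821197759586),
    10: (0.6637384483129164, 0.62688496337785438),
}

# Gap between training and validation accuracy for each depth
gaps = {depth: train - val for depth, (train, val) in scores.items()}

# Depth with the best validation accuracy
best_depth = max(scores, key=lambda d: scores[d][1])

for depth in sorted(scores):
    print('max_depth=%d: train-validation gap = %.4f' % (depth, gaps[depth]))
print('best validation accuracy at max_depth=%d' % best_depth)
```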

# Quantifying the cost of mistakes

In [34]:
predicted_labels = decision_tree_model.predict(validation_feature_matrix)

In [35]:
len(validation_labels) == len(predicted_labels)

True

In [36]:
false_positives = 0
false_negatives = 0
corrects = 0
for i in range(len(validation_labels)):
    if validation_labels[i] != predicted_labels[i]:
        if validation_labels[i] == -1:
            # Risky loan predicted as safe
            false_positives += 1
        else:
            # Safe loan predicted as risky
            false_negatives += 1
    else:
        corrects += 1
print '# of false_positives:', false_positives
print '# of false_negatives:', false_negatives
print '# of corrects:', corrects
        

# of false_positives: 1661
# of false_negatives: 1717
# of corrects: 5906


In [37]:
false_negatives*10000 + false_positives*20000

50390000
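
The cost calculation above weights each false negative at 10,000 and each false positive at 20,000. The same bookkeeping can be written as two small functions. This is a sketch with hypothetical helper names, and the per-mistake costs are taken from the expression in the last cell.

```python
def count_mistakes(labels, predictions):
    """Return (false_positives, false_negatives) for +1/-1 labels."""
    pairs = list(zip(labels, predictions))
    # False positive: risky loan (-1) predicted as safe (+1)
    false_positives = sum(1 for y, p in pairs if y == -1 and p == +1)
    # False negative: safe loan (+1) predicted as risky (-1)
    false_negatives = sum(1 for y, p in pairs if y == +1 and p == -1)
    return false_positives, false_negatives

def total_cost(false_positives, false_negatives, cost_fp=20000, cost_fn=10000):
    """Total cost of mistakes given a cost per false positive and false negative."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Counts reported by the notebook for the validation set
validation_cost = total_cost(false_positives=1661, false_negatives=1717)
print(validation_cost)

# The four-loan sample from earlier in the notebook: one mistake of each kind
sample_fp, sample_fn = count_mistakes([1, 1, -1, -1], [1, -1, -1, 1])
```

Because a false positive is priced at twice a false negative here, a model with the same accuracy but fewer false positives would have a lower total cost.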