In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


In [2]:
df = pd.read_pickle('../cleaned_df.pkl')

## Preparation

## Categorical variables

This data has a lot of categorical variables. I prefer `dmatrices` instead of `pd.get_dummies` as it is a lot cleaner.

In [5]:
from patsy import dmatrices

In [6]:
y, X = dmatrices('delinq ~  + loan_amnt + int_rate + installment + emp_length +'
                 'C(home_ownership) + C(grade) + C(month_issued) + C(year_issued)'
                 '+ C(purpose) + C(addr_state) + inq_last_6mths + pub_rec + revol_bal +open_acc+'
                 'collections_12_mths_ex_med + delinq_2yrs + earliest_cr_line  + fico_range_low'
                 '+ ratio_mth_inc_all_payments + annual_inc',
                 df, return_type='dataframe')

Earliest cr line to the year. Proabbly going to be quite good.

In [7]:
for i in X.columns[:10]:
    print(i,)

Intercept
C(home_ownership)[T.OTHER]
C(home_ownership)[T.OWN]
C(home_ownership)[T.RENT]
C(grade)[T.B]
C(grade)[T.C]
C(grade)[T.D]
C(grade)[T.E]
C(grade)[T.F]
C(grade)[T.G]


In [8]:
y = np.ravel(y)

Create a small testset for testing purposes

In [9]:
y_s, X_s = y[-10000:], X[-10000:]

## A first model: Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegressionCV

In [11]:
model_log = LogisticRegressionCV(cv=5, penalty='l2', verbose=1, max_iter=1000)

In [12]:
fit = model_log.fit(X_s, y_s)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.6s finished


In [13]:
predictions = model_log.predict(X_s)

In [14]:
predictions.mean()

0.0

In [15]:
model_log.score(X_s, y_s)

0.91849999999999998

In [16]:
y_s.mean()

0.081500000000000003

## Train_test_split

In [10]:
from sklearn.cross_validation import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                   random_state=42)

## A Second Model: 

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [8]:
model_rf = RandomForestClassifier(n_estimators=200, oob_score=True, verbose=1, random_state=2143,
                                 min_samples_split=50, n_jobs=-2)

In [21]:
rf_fit = model_rf.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    1.5s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:  1.3min
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:  5.2min
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:  5.2min finished


In [22]:
prediction = rf_fit.predict(X_test)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.8s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    2.9s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    2.9s finished


In [23]:
prob_pred = rf_fit.predict_proba(X_test)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.7s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    3.1s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.1s finished


# Evaluating our Performance

# Finding the Most Important Features

A benefit of random forests is that it is possible to find the most important features in our model. 

## Steps

1. Extract the most important features and sort them by value of importance
1. Standarize the values
2. multiply them such that 100 is the highest value.

In [28]:
features = sorted(list(zip(X.columns, model_rf.feature_importances_)), key=lambda x: x[1],
      reverse=True)

In [29]:
features[:10]

[('int_rate', 0.087367583254206252),
 ('ratio_mth_inc_all_payments', 0.078217366603344093),
 ('revol_bal', 0.071783620370906631),
 ('annual_inc', 0.065117515624413966),
 ('installment', 0.064354902147432155),
 ('loan_amnt', 0.051682988339306381),
 ('earliest_cr_line', 0.044770817935560217),
 ('open_acc', 0.040189360438633512),
 ('fico_range_low', 0.039819423299314226),
 ('emp_length', 0.023505721478127985)]

In [34]:
features = sorted(list(zip(X.columns, rf_fit.feature_importances_)), key=lambda x: x[1],
      reverse=True)

In [35]:
values = [feature[1] for feature in features]

In [36]:
standardized_values =(values - np.mean(values)) / np.std(values)

In [37]:
base_hundred = (100/  standardized_values[0]) * standardized_values

In [38]:
hundred = list(zip([feature[0] for feature in features], base_hundred))

In [39]:
importand_featuers = hundred[:10]

In [40]:
importand_featuers

[('int_rate', 100.00000000000001),
 ('ratio_mth_inc_all_payments', 88.231565394867886),
 ('revol_bal', 79.956884620470973),
 ('annual_inc', 71.383358902621652),
 ('installment', 70.402533292114683),
 ('loan_amnt', 54.104712741740315),
 ('earliest_cr_line', 45.214712812793067),
 ('open_acc', 39.322329604173177),
 ('fico_range_low', 38.846539664087068),
 ('emp_length', 17.864879843806673)]

In [42]:
rf_fit.score(X_test, y_test)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.7s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    2.9s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    2.9s finished


0.89873045078196867

In [55]:
from sklearn import svm

In [56]:
model = svm.SVC(random_state=4134)

In [None]:
svm_fit = model.fit(X_test, y_test)

In [80]:
df.loan_status.value_counts()

Current               191303
Fully Paid            144478
Charged Off            33166
Late (31-120 days)      5748
In Grace Period         2872
Late (16-30 days)       1087
Default                  134
dtype: int64

In [84]:
df.columns

Index(['total_pymnt', 'zip_code', 'member_id', 'id', 'loan_amnt', 'int_rate',
       'installment', 'emp_length', 'home_ownership', 'grade', 'sub_grade',
       'emp_title', 'issue_d', 'loan_status', 'annual_inc',
       'verification_status', 'purpose', 'addr_state', 'inq_last_6mths', 'dti',
       'revol_util', 'mths_since_last_delinq', 'pub_rec', 'revol_bal',
       'open_acc', 'collections_12_mths_ex_med', 'delinq_2yrs',
       'earliest_cr_line', 'fico_range_low', 'last_credit_pull_d',
       'ratio_inc_debt', 'ratio_inc_installment', 'ratio_mth_inc_all_payments',
       'year_issued', 'month_issued', 'delinq'],
      dtype='object')

In [90]:
default_df = df.assign(percentage_paid= df.total_pymnt/df.loan_amnt
                      )[df.loan_status=='Charged Off']

In [92]:
default_df.percentage_paid.mean()

0.45161542363848606

# Dumping the model to pickle

In [45]:
df.int_rate.head()

0    0.1065
1    0.1527
2    0.1596
3    0.1349
4    0.1269
Name: int_rate, dtype: float64

In [14]:
from sklearn.externals import joblib
# Prevent accidental execution
## joblib.dump(rf_fit, 'model/rf_model.pkl');

In [14]:
rf = joblib.load('../model/rf_model.pkl')

In [77]:
model = joblib.load('../model/rf_model.pkl')

# Fitting only Riskier Loans

Since we aren't interested in investing high-grade loans, we can and should retrain our model so that our model learns only features indicative of default only in our risky-loan universe. 

It isn't clear, *a priori* that this model will yield better results. While excluding A, B loans almost surely reduces our asymptotic bias, this may be overwhelemed by the decrease in the amount of data our model has to work with.

In [3]:
df_risky = df[df.grade.isin(['C', 'D', 'E'])]


In [4]:
import sys; 
sys.path.append("../scripts")
from model import create_matrix


In [5]:
y, X = create_matrix(df)

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [12]:
fitted_risky_rf = model_rf.fit(X_train, y_train)

[Parallel(n_jobs=-2)]: Done   1 out of 200 | elapsed:    2.6s remaining:  8.5min
[Parallel(n_jobs=-2)]: Done 200 out of 200 | elapsed:  1.2min finished


In [17]:
# Prevent accidental running
## joblib.dump(fitted_risky_rf, '../rf_risky_model/rf_model.pkl');

In [19]:
probs = fitted_risky_rf.predict_proba(X_test)

[Parallel(n_jobs=7)]: Done   1 out of 200 | elapsed:    0.0s remaining:    7.9s
[Parallel(n_jobs=7)]: Done 200 out of 200 | elapsed:    0.9s finished


In [22]:
default_prob = [i[1] for i in probs]

In [29]:
X_test['default_prob'] = default_prob

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [30]:
X_test['target'] = y_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


# Basic Evaluation

First we do a basic evaluation by sorting the notes by default probability. This is flawed in a number of ways. In particular, it only considers the default rate rather than ROI, which is what we are interested in. 

In [32]:
sorted_X = X_test.sort_values('default_prob')


In [33]:
top_20_percent = sorted_X[:len(sorted_X) // 5]

In [34]:
total_default_rate = X_test.target.mean()

In [35]:
total_default_rate

0.10309662462396159

In [36]:
top_20_percent.target.value_counts()

0    13999
1      494
Name: target, dtype: int64

In [152]:
top_20_percent.target.value_counts()

0    7249
1      44
Name: target, dtype: int64

In [37]:
top_20_mean = top_20_percent.target.mean()

In [38]:
top_20_mean

0.034085420547850687

As defaults eat up half the returns so halving the default rate corresponds to an increase from 9% ROI to 13.5% according to the test set.

# Moving Forward

Additional evaluations can be found in the `Evaluation` notebook. In that section we consider a more rigorous 