# Assignment 2: Lending Club
Anna Pauxberger

Problem Posed: Predict the highest amount of a loan, a customer could get. 

Lending Club Data from: https://www.lendingclub.com/info/download-data.action

1. Data-Preprocessing
2. Split Training and Validation Set
3. Train Random Forest Classification
4. Advise on Loan Amount

# Data-Preprocessing

**Data Imbalance** From the shape of the original data sets we can tell that reject and accept data are imbalanced. We could now either decrease the reject set to match the length of the accept data, artificially create new accept data points, or account for that imbalance when we split the training and test data and train the model. I chose to go with the latter. 

**Variable Choice** We are limited by the variables available in the reject dataset. From those I chose to 
- omit: 
    - application date: because it doesn't match the accept dataset, where date listed on the platform is the only one available. 
    - risk score: Because there is no equivalent score in accept set. Not having a risk score in predicting whether or not a loan will be accepted is a major limitation of this model.
    - state: because a geographical variable is represented in the zip code
    - policy code: because it predicts the acceptance/ rejection (is correlated with it)
- keep: 
    - loan amount
    - debt to income ratio
    - zip code: I only account for the zip code, and use it ordinally. Since states are ordered by region, this could make a positive effect. 
    - employment length: udner one year was set to 0, above 10 years was set to 20 to partially account for the years up to 40 or 50 that a person could have already been employed. This also is a limitation of the model. 
    - purpose (loan title): By classifying the user input into the 10 largest categories, I create a categorical variable for purpose. One weakness is that the 'other' category is still rather large, which could decrease the significance of this variable.

In [3]:
import pandas as pd
import numpy as np

In [4]:
reject_original = pd.read_csv('RejectStatsA.csv', low_memory=False, skiprows=[0])
accept_original = pd.read_csv('LoanStats3a.csv', low_memory=False, skiprows=[0])

In [5]:
reject_original.shape, accept_original.shape # class imbalance

((755491, 9), (42538, 145))

In [6]:
reject = reject_original[['Amount Requested', 'Debt-To-Income Ratio', 'Zip Code', 'Employment Length', 'Loan Title']]
accept = accept_original[['loan_amnt', 'dti', 'zip_code', 'emp_length', 'purpose']]

In [7]:
# explore and drop all nas, because it is only >1% of the data
reject = reject.dropna()
accept = accept.dropna()

In [8]:
# Edits 1: employment length, zip code, dti
pd.options.mode.chained_assignment = None

reject.columns = ['amount', 'dti', 'zip', 'emp', 'purpose']
accept.columns = ['amount', 'dti', 'zip', 'emp', 'purpose']

accept['emp'].replace('< 1 year', '0', inplace=True)
accept['emp'].replace('10+ years', '20', inplace=True) # between 10 and 40, assuming that's a good representation
accept['emp'] = accept['emp'].str.extract('(\d+)', expand=False)
accept['zip'] = accept['zip'].str.extract('(\d+)', expand=False)
accept['zip'] = accept['zip'].str.slice(0,1)

reject['dti'].replace('%', '', inplace=True)
reject['zip'] = reject['zip'].str.extract('(\d+)', expand=False)
reject['zip'] = reject['zip'].str.slice(0,1)
reject['emp'].replace('< 1 year', '0', inplace=True) # could be between 0 and 1, assuming that's the mean or a good repr
reject['emp'].replace('10+ years', '20', inplace=True) # between 10 and 40, assuming that's a good representation
reject['emp'] = reject['emp'].str.extract('(\d+)', expand=False)
reject['dti'] = reject['dti'].str.split('%', expand=True)[0]


reject['accept'] = 0
accept['accept'] = 1

In [9]:
# Edits 2: concat, edit type
frames = [accept, reject]
temp = pd.concat(frames)

# convert to floats
temp.dti = temp.dti.astype(float)
temp.emp = temp.emp.astype(float)
temp.accept = temp.accept.astype(float)
temp.zip = temp.zip.astype(float)
temp.dtypes

amount     float64
dti        float64
zip        float64
emp        float64
purpose     object
accept     float64
dtype: object

In [10]:
# Edits 3: add purpose
temp.purpose.value_counts()[:10]

debt_consolidation    266223
other                  96647
credit_card            58059
home_improvement       44406
car                    44181
small_business         39575
major_purchase         37989
medical                19190
moving                 18180
                       15235
Name: purpose, dtype: int64

In [23]:
def purpose_transformation(purpose):
    purpose = str(purpose).lower()
    if 'debt' in purpose: return 'debt-related'
    if 'credit card' in purpose: return 'credit_card'
    if 'business' in purpose: return 'business'
    if 'home' in purpose or 'house' in purpose: return 'home'
    if 'car' in purpose: return 'car'
    if 'medical' in purpose: return 'medical'
    if 'wedding' in purpose: return 'wedding'
    return 'other'

purpose_temp = temp['purpose'].apply(purpose_transformation)

purpose_dummies = pd.get_dummies(purpose_temp, dtype=float)
lending = pd.concat([temp, purpose_dummies], axis=1)
lending = lending.drop(['purpose'], axis=1)
lending.head()

Unnamed: 0,amount,dti,zip,emp,accept,business,car,credit_card,debt-related,home,medical,other,wedding
0,5000.0,27.65,8.0,20.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2500.0,1.0,3.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2400.0,8.72,6.0,20.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,10000.0,20.0,9.0,20.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,3000.0,17.94,9.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


# Split training and test set

I split the training set to contain 70%, and the test set to be 30%. I stratify with regards to the lending_target to balance according to the initial imbalance problem mentioned above. If I had manipulated more variables, I would have also included a validation set which I would have used as test set while training the models, in order to not inform my model by my last resource test set. 

In [27]:
from sklearn.model_selection import train_test_split

In [51]:
lending_data = lending.loc[:, lending.columns != 'accept']
lending_target = lending.loc[:, lending.columns == 'accept']

X_train, X_test, y_train, y_test = train_test_split(
    lending_data, lending_target, test_size=0.7, random_state=0, stratify=lending_target)

In [29]:
X_train.shape, y_train.shape

((236624, 12), (236624, 1))

In [30]:
X_test.shape, y_test.shape

((552124, 12), (552124, 1))

# Train Random Forest Classification

**Model** Out of logistic regression, logistic regression cross validation and the random forest, the latter performed best according to f1_score. The F1 score is the harmonic mean of precision and recall and thus a holistic performance measure of the model. (Coming with limitations. If I care about precision or recall in particular, for example for clinical trial data, it is crucial to look at the outcomes specifically.) 

** Precision** We can see that we have a 99% precision for predicting 0. Meaning that 99% of all rejects we predict, were actually rejected. For accepted, this is only 25%. Meaning that out of all loans we accept, only 25% would actually get accepted. For practical purposes, this is bad since a lot of people will walk into the offices that we will have to reject. For marketing purposes, this might be good because it makes customers remember our loan giving bank in positive light (as we promise them a loan). 


**Recall** Recall is 86% for rejected loans, meaning out of all loans that were actually rejected, we detected 86%. This time we have a pretty similar value for acceptance, with 87%. In general, I think this performance is not too bad, as this is just a recommendation service. However, we could do better. 

**Cross Validation** I apply cross validation using k-fold CV, which splits the training set into 10 sets, iterates through each using 9 parts as training set and 1 part as test (validation) set. My scores have a mean of 86% accuracy, with a rather small standard deviation (0.00). While this sounds good, it's important to not only look at accuracy, since we could technically have good accuracy scores by predicting one class only, if 86% of our data were that class. 

**Dummy vs Ordinal** I chose to use ordinal type for all variables. Some models (especially logistic regression) perform better when data is put into dummy variables (one hot encoded), since they can then better linearly predict each variable. Random Forest, however, is pretty good at dealing with nominal data. However, since all of my categorical data have rather few categories, I also could have made them dummy variables since it wouldn't have increased the dimensions as much. In a next iteration I would test that.

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, classification_report, confusion_matrix

In [32]:
clf = RandomForestClassifier(n_estimators=100, max_depth=4,random_state=0,class_weight='balanced')
clf.fit(X_train, np.ravel(y_train))

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=4, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [33]:
y_pred = clf.predict(X_train)
y_pred

array([0., 0., 0., ..., 1., 0., 0.])

In [34]:
clf.feature_importances_

array([6.90384274e-02, 3.15204259e-01, 2.69029594e-03, 5.45583833e-01,
       2.37808529e-04, 1.11250688e-03, 6.14519745e-03, 1.54616372e-02,
       9.24729096e-04, 4.04156661e-04, 4.30815266e-02, 1.15622676e-04])

In [35]:
clf.predict_proba(X_train)

array([[0.94142403, 0.05857597],
       [0.92042842, 0.07957158],
       [0.80007106, 0.19992894],
       ...,
       [0.34388511, 0.65611489],
       [0.72104365, 0.27895635],
       [0.79448835, 0.20551165]])

In [38]:
f1_score(y_train, y_pred)

0.3676618014374799

In [39]:
print(classification_report(y_train, y_pred))

             precision    recall  f1-score   support

        0.0       0.99      0.84      0.91    224197
        1.0       0.23      0.87      0.37     12427

avg / total       0.95      0.84      0.88    236624



In [40]:
cv = cross_val_score(clf, X_train, np.ravel(y_train), cv=10)
np.mean(cv), np.std(cv)

(0.8427040722693929, 0.0029150214858213483)

# Advise on Loan Amount

Here, I create an algorithm that takes a user's input (debt, income, zip code, and employment length in years) and calcualtes the highest loan amount this user can get. I loop through different values, which is rather inefficient. A logistic regression could be more effective here, as I would be able to compute a decisino boundary. 

In [334]:
# user input
debt = 1000
income = 2000
zip_code = 234
emp_length = 10     # in years

In [400]:
loan_max = lending.amount.max() + (1/5)*lending.amount.max()
loan_amounts = np.arange(loan_max, 0, -500)

def loan_optimizer(debt, income, zip_code, emp_length):
    loans_secured = [] # 75%, 85%, 95%, 100%
    dti = debt/income
    for loan in loan_amounts:
        user_data = [loan, dti, zip_code, emp_length]
        user_data = np.array(user_data).reshape(1, -1)
        reject_prob = clf.predict_proba(user_data)[0][0]
        accept_prob = clf.predict_proba(user_data)[0][1]
        
        if accept_prob == 1: 
            print('A loan of of ', loan, 
                                   ' $ will get accepted with probability', accept_prob)
            return
        
        if accept_prob > 0.40 and accept_prob < 0.91: 
            print('A loan of of ', loan, 
                                   ' $ will get accepted with probability', accept_prob)
            return
        else: 
            return('Sorry - No loan can be issued for you.')


In [401]:
loan_optimizer(1000, 20000, 234, 10)

A loan of of  1680000.0  $ will get accepted with probability 0.41060282947987403


# Appendix 1: Data Exploration

In [403]:
# Methods used to explore the original data
reject_original.head()
len(reject_original) 
list(reject_original)
reject_original.dtypes
reject_original.isna().sum()

accept_original.head()
len(accept_original)
list(accept_original) 
accept_original.dtypes
accept_original.isna().sum()

reject_original['Risk_Score'].value_counts()
reject_original.Risk_Score.nunique()
accept_original.grade.value_counts()
accept_original.zip_code.nunique()

print()




# Appendix 2: Additional Models with lower F1 Score (thus not used)

# Logistic Regression

### Just logistic regression

In [42]:
from sklearn.linear_model import LogisticRegression

In [43]:
clf = LogisticRegression(random_state=0, solver='lbfgs', class_weight='balanced').fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [52]:
clf.predict(X_test)

array([0., 0., 1., ..., 1., 0., 0.])

In [53]:
clf.predict_proba(X_test) 

  np.exp(prob, prob)


array([[1.        , 0.        ],
       [0.88115979, 0.11884021],
       [0.41153164, 0.58846836],
       ...,
       [0.47706258, 0.52293742],
       [0.82452578, 0.17547422],
       [0.65305918, 0.34694082]])

In [54]:
y_pred = clf.predict(X_train)
f1_score(y_train, y_pred)

0.2408141726347531

In [56]:
print(classification_report(y_train, y_pred))

             precision    recall  f1-score   support

        0.0       0.98      0.80      0.88    224197
        1.0       0.15      0.64      0.24     12427

avg / total       0.93      0.79      0.84    236624



In [57]:
confusion_matrix(y_train, y_pred)

array([[178285,  45912],
       [  4441,   7986]])

### CV logistic regression

In [58]:
from sklearn.linear_model import LogisticRegressionCV

In [59]:
clf = LogisticRegressionCV(cv=5, random_state=0, solver='lbfgs', class_weight='balanced').fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [61]:
clf.predict(X_test)

array([0., 0., 1., ..., 0., 0., 0.])

In [62]:
clf.predict_proba(X_test).shape

  np.exp(prob, prob)


(552124, 2)

In [63]:
clf.predict_proba(X_test)

  np.exp(prob, prob)


array([[1.        , 0.        ],
       [0.87003889, 0.12996111],
       [0.45162017, 0.54837983],
       ...,
       [0.51050384, 0.48949616],
       [0.82529935, 0.17470065],
       [0.63311277, 0.36688723]])

In [64]:
clf.score(X_test, y_test)

0.8072715549405568

In [65]:
y_pred = clf.predict(X_train)
f1_score(y_train, y_pred)

0.261456966516013

In [66]:
clf.coef_

array([[-2.32986918e-05, -3.38731654e-02,  2.27884667e-02,
         1.81527498e-01, -1.16737505e-03,  2.78168308e-02,
        -8.84685361e-03,  1.35777279e-01, -2.17549516e-03,
        -6.93789243e-03, -1.30904318e-01,  6.27183901e-03]])