# Analytics Zoo Neural Network : Prosper Loan Dataset

We are going to look at the Prosper loan dataset, which records a history of loans made by Prosper.

The idea with this dataset is to try to predict whether a loan will be "good" or "bad." There are many ways a loan can be bad, from late payments to defaults, but we will turn this into a binary choice between "good" loans and "bad" loans.

We will use a neural network built with the Keras-style API in Analytics Zoo to predict whether a loan is "good" or "bad."

This lab uses Analytics Zoo. If you want to see a "vanilla" TensorFlow version of this lab, click [here](./prosper-tf.ipynb).

In [2]:
from zoo.common.nncontext import init_nncontext
sc = init_nncontext("Prosper Classification")  # Initialize the Spark context used by Analytics Zoo

## Load the Data

We are going to load the Prosper loan dataset.

Notice that we first load it into a Pandas dataframe. This is fine for a small dataset, but it will not be enough for a large "at scale" notebook.

Notice the very large number of columns. We are going to use only a selection of them.

In [3]:
import pandas as pd

## small file, start with this
#datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv"
## this is a large file
datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data.csv.gz"

data = pd.read_csv(datafile)
data.tail()

Unnamed: 0,Term,LoanStatus,BorrowerRate,ProsperRating (numeric),ProsperScore,ListingCategory,BorrowerState,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,...,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,LoanOriginalAmount,MonthlyLoanPayment,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors,YearsWithCredit
49719,36,1,0.0679,4.0,6.0,Personal,WA,Full-time,69.0,True,...,0.0,1000.0,847.61,4292,132.11,2,0,0.0,194,42
49720,36,1,0.1899,4.0,6.0,Business,CO,Full-time,22.0,False,...,0.0,14250.0,0.02,2000,73.3,0,0,0.0,25,10
49721,36,1,0.2639,2.0,3.0,Reno,FL,Employed,25.0,False,...,0.0,0.0,0.0,2500,101.25,0,0,0.0,26,6
49722,36,0,0.111,6.0,8.0,Other,PA,Employed,21.0,True,...,0.0,33501.0,4815.42,2000,65.57,0,0,0.0,22,22
49723,60,1,0.2605,4.0,5.0,Reno,GA,Full-time,94.0,True,...,0.0,5000.0,3264.37,15000,449.55,0,0,0.0,274,21
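If only a handful of columns are needed, `pd.read_csv` can skip the rest at load time through its `usecols` parameter. The sketch below is illustrative only: it reads a tiny in-memory CSV whose values are copied from the first columns of the rows shown above, and the choice of columns in `wanted` is an assumption, not the notebook's feature set.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real file; values copied from the
# first few columns of the rows printed above.
csv_text = """Term,LoanStatus,BorrowerRate,BorrowerState
36,1,0.0679,WA
36,1,0.1899,CO
36,1,0.2639,FL
36,0,0.111,PA
60,1,0.2605,GA
"""

# usecols makes read_csv parse and keep only the named columns.
wanted = ["Term", "LoanStatus", "BorrowerRate"]  # illustrative selection
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(subset.shape)
print(subset.columns.tolist())
```

The same `usecols` argument should work with the `datafile` URL used in the notebook, which would reduce memory use on the large file.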


## Look at some summary data

Let's look at some summary data here.

We'd like to see the cardinality of the categorical columns.

In [6]:
print(data['LoanStatus'].value_counts())


1    33530
0    16194
Name: LoanStatus, dtype: int64


It looks like "good" loans outnumber "bad" loans about 2 to 1. That is a class imbalance, but not a dramatic one, so we can probably get away with using the data as is.
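To put a number on the imbalance, the sketch below rebuilds a label column with the same counts as the `value_counts()` output above and computes the class proportions. The largest proportion is the accuracy a model would get by always predicting the majority class, which is a useful baseline to keep in mind later. The variable names are illustrative.

```python
import pandas as pd

# Rebuild a label column with the same class counts as printed above.
labels = pd.Series([1] * 33530 + [0] * 16194, name="LoanStatus")

# normalize=True returns fractions instead of raw counts.
proportions = labels.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()

print(proportions.round(3))
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```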

Let's now look at the Employment Status:

In [7]:

print(data['EmploymentStatus'].value_counts())



Full-time        25016
Employed         18393
Self-employed     3045
Part-time         1060
Other              924
Retired            703
Not employed       583
Name: EmploymentStatus, dtype: int64


There are a variety of statuses, although Full-time and Employed (which seem to mean the same thing) make up the majority.

## Converting Categorical columns

The network needs numeric input, so we convert the categorical columns to numeric codes.
Here we convert the **EmploymentStatus** and **ListingCategory** columns.

In [8]:
# use pd.factorize on EmploymentStatus, ListingCategory

data['EmploymentStatusFactor'] = pd.factorize(data['EmploymentStatus'])[0]
data['ListingCategoryFactor'] = pd.factorize(data['ListingCategory'])[0]
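As a small illustration of what `pd.factorize` returns, the sketch below runs it on a made-up series of employment statuses. It returns a pair: integer codes assigned in order of first appearance, and the unique values those codes refer to. The last lines show `pd.get_dummies` as an alternative encoding that does not imply an order between categories. The notebook itself does not use it.

```python
import pandas as pd

# Made-up sample of statuses, for illustration only.
statuses = pd.Series(["Full-time", "Employed", "Full-time", "Retired", "Employed"])

# factorize returns (codes, uniques); the notebook keeps only element [0].
codes, uniques = pd.factorize(statuses)
print(codes.tolist())   # [0, 1, 0, 2, 1]
print(list(uniques))    # ['Full-time', 'Employed', 'Retired']

# Alternative (not used in the notebook): one column per category.
one_hot = pd.get_dummies(statuses)
print(one_hot.shape)    # (5, 3)
```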

## Build feature vectors 

We select the columns named in the `feature_columns` list, so that `features` contains *only* the feature columns, and we keep `LoanStatus` as the label.

In [9]:
feature_columns = ['EmploymentStatusFactor', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategoryFactor']
features = data[feature_columns]
label = data['LoanStatus']
features.tail()

Unnamed: 0,EmploymentStatusFactor,CreditScore,StatedMonthlyIncome,ListingCategoryFactor
49719,1,760.0,10333.333333,10
49720,1,740.0,2333.333333,8
49721,2,660.0,4333.333333,11
49722,2,700.0,8041.666667,5
49723,1,680.0,3875.0,11


## Split Data into training and test

We will split the data into training and test sets. Because we call `train_test_split` without a `test_size`, scikit-learn uses its default of 75% training and 25% test, which matches the row counts printed below.

In [10]:
## Split the data into training and test sets (default: 75% training, 25% test)

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features, label)
print("training set = " , len(train_x))  # Show number of rows in training set
pd.DataFrame(train_x).tail()  # Show last few rows in training set

training set =  37293


Unnamed: 0,EmploymentStatusFactor,CreditScore,StatedMonthlyIncome,ListingCategoryFactor
13718,4,700.0,0.0,1
43723,2,680.0,3500.0,1
1212,2,720.0,8583.333333,5
27957,2,660.0,5000.0,1
42093,1,620.0,4675.0,1


In [11]:
print("test set = " , len(test_x))  # Show number of rows in test set
pd.DataFrame(test_x).tail()   # Show last few rows of test set

test set =  12431


Unnamed: 0,EmploymentStatusFactor,CreditScore,StatedMonthlyIncome,ListingCategoryFactor
42161,1,640.0,4150.0,1
15666,1,740.0,5416.666667,10
13209,3,660.0,1833.333333,1
19552,1,560.0,8456.166667,0
16812,1,640.0,4583.333333,1


In [16]:
print(train_x.keys().tolist())  # Print out the names of the columns

['EmploymentStatusFactor', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategoryFactor']
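When `train_test_split` is called without `test_size`, scikit-learn holds out 25% of the rows. The sketch below, on made-up arrays, shows that default next to an explicit 70/30 split. The `random_state` and `stratify` arguments are optional additions that the notebook does not use: the first makes the split reproducible, the second keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, alternating 0/1 labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Default behaviour: 25% of the rows go to the test set.
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(X, y, random_state=0)

# Explicit 70/30 split, reproducible and stratified by label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(len(X_tr_d), len(X_te_d))
print(len(X_tr), len(X_te))
```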


## Neural Network

Here we are going to define our network. We will do this with the Analytics Zoo Keras style API.

Here's what we have:

 * Input Layer
 * Hidden Layer (4 Neurons), with Dropout
 * Hidden Layer (4 Neurons), with Dropout
 * Output Layer (1 Neuron)
 
This is *binary* classification, so the output layer has a single sigmoid neuron.

We will train with the Adam optimizer, which is a good all-around optimizer that doesn't need us to play much with hyperparameters.

In [None]:
from zoo.pipeline.api.keras.layers import Dense, Dropout
from zoo.pipeline.api.keras.models import Sequential

def build_model(train_x):
  model = Sequential()

  # First hidden layer; input size is the number of feature columns
  model.add(Dense(4, input_dim=len(train_x.columns), activation='tanh'))
  model.add(Dropout(0.2))
  # Second hidden layer
  model.add(Dense(4, activation='tanh'))
  model.add(Dropout(0.2))
  # Output layer: one sigmoid neuron gives a value between 0 and 1
  model.add(Dense(output_dim=1, activation='sigmoid'))
  
  model.compile(loss='binary_crossentropy', optimizer='adam')
  return model


model = build_model(train_x)
model.summary()
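As a sanity check on the network's size, the trainable parameters can be counted by hand. The sketch below assumes each `Dense` layer holds one weight per input-output pair plus one bias per output, which is the standard definition, and that `Dropout` adds no parameters. The helper name `dense_params` is illustrative.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

# 4 input features -> 4 hidden -> 4 hidden -> 1 output
layer_sizes = [4, 4, 4, 1]
per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer, total)  # [20, 20, 5] 45
```

With only 45 parameters this is a very small network, so it trains quickly but has limited capacity.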

Now it's time to train. We will do this with `model.fit()` from Analytics Zoo. Notice the Keras-like `.fit` semantics.

In [None]:
%%time
# Train the model
print("Training begins.")
model.fit(
    train_x.values,
    train_y.values,
    batch_size=500,
    nb_epoch=2)
print("Training completed.")

In [None]:
predictions = model.predict(test_x.values)

## Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.

In [42]:
from sklearn.metrics import accuracy_score
import numpy as np
accuracy_score(test_y, np.rint(np.concatenate(predictions.collect())))

0.6794304561177701

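The results indicate that we get about 68% accuracy, meaning we are right roughly 2/3 of the time. Keep in mind that about 67% of the loans in this dataset are "good" (33530 of 49724), so this is barely better than always predicting "good."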
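Accuracy on its own can hide how the model treats each class, which matters here because "good" loans outnumber "bad" ones about 2 to 1. The sketch below uses small made-up labels and probabilities (not the notebook's predictions) to compare a model's accuracy with an "always good" baseline, and prints a confusion matrix to show the errors per class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up example: 6 "good" (1) and 3 "bad" (0) loans.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.65, 0.3, 0.7])

# Turn sigmoid outputs into class labels with a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
baseline = accuracy_score(y_true, np.ones_like(y_true))  # always predict "good"

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(f"model accuracy:    {acc:.3f}")
print(f"baseline accuracy: {baseline:.3f}")
print(cm)
```

In this made-up case the model and the baseline reach the same accuracy, yet the confusion matrix shows the model catches only one of the three "bad" loans.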

##  Improve Accuracy

### Add more features
Look at the schema of the full dataset. Are there any columns you want to add? Make sure you increase the number of neurons in the hidden layers as you add more features.
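One possible way to extend the feature list is sketched below. The sample rows are copied from the dataframe tails printed earlier, and the choice of extra columns is only an example, not a recommendation. Since `build_model` reads the input size from `len(train_x.columns)`, the input layer follows the new list automatically, while the hidden width of 4 is hard-coded and has to be changed by hand.

```python
import pandas as pd

# Three sample rows copied from the outputs shown earlier (rows 49719-49721).
sample = pd.DataFrame({
    "CreditScore": [760.0, 740.0, 660.0],
    "StatedMonthlyIncome": [10333.333333, 2333.333333, 4333.333333],
    "Term": [36, 36, 36],
    "BorrowerRate": [0.0679, 0.1899, 0.2639],
    "LoanOriginalAmount": [4292, 2000, 2500],
    "MonthlyLoanPayment": [132.11, 73.3, 101.25],
})

base_columns = ["CreditScore", "StatedMonthlyIncome"]
# Example additions, all numeric, so no factorizing is needed.
extra_columns = ["Term", "BorrowerRate", "LoanOriginalAmount", "MonthlyLoanPayment"]

feature_columns = base_columns + extra_columns
features = sample[feature_columns]

# This is the value build_model would pass as input_dim.
input_dim = len(features.columns)
print(input_dim)
```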

## Conclusion

1. Neural network models can be used for both classification and regression.
2. Classification usually uses a cross-entropy loss function rather than MSE.
3. A Multilayer Perceptron network can be built with Keras-style `Dense` layers.
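To illustrate the second point, the sketch below implements binary cross-entropy and mean squared error in NumPy and applies both to made-up predictions. The function names and the `eps` clipping value are illustrative choices. For a confidently wrong prediction, cross-entropy gives a much larger penalty than MSE, which is one common reason it is preferred for classification.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; clipping avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def mean_squared_error(y_true, y_prob):
    """Mean of squared differences between label and predicted probability."""
    return float(np.mean((y_true - y_prob) ** 2))

# Made-up case: both loans are truly "good" (label 1).
y_true = np.array([1.0, 1.0])
confident_right = np.array([0.99, 0.99])
confident_wrong = np.array([0.01, 0.01])

bce_right = binary_cross_entropy(y_true, confident_right)
bce_wrong = binary_cross_entropy(y_true, confident_wrong)
mse_right = mean_squared_error(y_true, confident_right)
mse_wrong = mean_squared_error(y_true, confident_wrong)

print(f"cross-entropy: right={bce_right:.4f} wrong={bce_wrong:.4f}")
print(f"MSE:           right={mse_right:.4f} wrong={mse_wrong:.4f}")
```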

