# Analytics Zoo Neural Network : Prosper Loan Dataset

We are going to look at the prosper loan dataset.  This dataset shows a history of loans made by Prosper.

The idea with this dataset is to try to predict whether a loan will be "good" or "bad." There are many ways a loan can be bad, from late payments to defaults, but we will turn this into a binary choice between "good" loans and "bad" loans.

We will use a neural network in the Keras-style API in Analytics Zoo to do the prediction of "good" loans or "bad.

This uses Analytics Zoo.   If you want to see a "vanilla" Tensorflow version of this lab, click [here](./prosper-tf.ipynb).

In [5]:
from zoo.common.nncontext import init_nncontext
sc = init_nncontext("MPG Regression")  # Initialize the 

## Step 1: Load the Data

We are going to load the data of the prosper loan dataset.

Notice we are first loading this into a Pandas dataframe. This is fine for a small dataset, but we will need more than this for a large "at scale" notebook.

Notice the very large numbers of columns. We are going to take only a selection of those columns.

In [6]:
import pandas as pd

## small file, start with this
#datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv"
## this is a large file
datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data.csv.gz"

data = pd.read_csv(datafile)
data.tail()

Unnamed: 0,Term,LoanStatus,BorrowerRate,ProsperRating (numeric),ProsperScore,ListingCategory,BorrowerState,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,...,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,LoanOriginalAmount,MonthlyLoanPayment,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors,YearsWithCredit
49719,36,1,0.0679,4.0,6.0,Personal,WA,Full-time,69.0,True,...,0.0,1000.0,847.61,4292,132.11,2,0,0.0,194,42
49720,36,1,0.1899,4.0,6.0,Business,CO,Full-time,22.0,False,...,0.0,14250.0,0.02,2000,73.3,0,0,0.0,25,10
49721,36,1,0.2639,2.0,3.0,Reno,FL,Employed,25.0,False,...,0.0,0.0,0.0,2500,101.25,0,0,0.0,26,6
49722,36,0,0.111,6.0,8.0,Other,PA,Employed,21.0,True,...,0.0,33501.0,4815.42,2000,65.57,0,0,0.0,22,22
49723,60,1,0.2605,4.0,5.0,Reno,GA,Full-time,94.0,True,...,0.0,5000.0,3264.37,15000,449.55,0,0,0.0,274,21


## Step 2 : Clean Data

We should get rid of the NA values.

In [4]:
## TODO :  Drop any NA, null values.  
## Hint : Using `.na.drop()`
prosper_clean = data.dropna()

print("Original record count {:,}, cleaned records count {:,},  dropped {:,}"\
      .format(len(data), len(prosper_clean), 
              (len(data) - len(prosper_clean))))
prosper_clean.tail()

Original record count 49,724, cleaned records count 49,724,  dropped 0


Unnamed: 0,Term,LoanStatus,BorrowerRate,ProsperRating (numeric),ProsperScore,ListingCategory,BorrowerState,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,...,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,LoanOriginalAmount,MonthlyLoanPayment,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors,YearsWithCredit
49719,36,1,0.0679,4.0,6.0,Personal,WA,Full-time,69.0,True,...,0.0,1000.0,847.61,4292,132.11,2,0,0.0,194,42
49720,36,1,0.1899,4.0,6.0,Business,CO,Full-time,22.0,False,...,0.0,14250.0,0.02,2000,73.3,0,0,0.0,25,10
49721,36,1,0.2639,2.0,3.0,Reno,FL,Employed,25.0,False,...,0.0,0.0,0.0,2500,101.25,0,0,0.0,26,6
49722,36,0,0.111,6.0,8.0,Other,PA,Employed,21.0,True,...,0.0,33501.0,4815.42,2000,65.57,0,0,0.0,22,22
49723,60,1,0.2605,4.0,5.0,Reno,GA,Full-time,94.0,True,...,0.0,5000.0,3264.37,15000,449.55,0,0,0.0,274,21


## Look at some summary data

Let's look at some summary data here.

We'd like to see the cardinality of the categorical columns.

In [None]:
print(prosper_clean['LoanStatus'].value_counts())


It looks like "good" loans outnumber bad about 2-1.  That's a class imbalance but not a dramatic one. We can probably get away with using it as is.

Let's now look at the Employment Status:

In [None]:

print(prosper_clean['EmploymentStatus'].value_counts())



It appears that there are a variety of different statuses although Full-Time and Employed (which seem to be the same thing), are the majority.

In [None]:
print(prosper_clean['ListingCategory'].value_counts())



Hmmmm... the lions share of these appear to be "debt" -- not very meaningful.  Many of the more specific categories like "boat" and "RV" don't account for many loans.

## Step 3: Converting Categorical columns 

Convert categorical columns to numeric.   
Here let's convert **EmploymentStatus** column

In [None]:
# use pd.factorize on EmploymentStatus, ListingCategory

prosper_clean['EmploymentStatusFactor'] = pd.factorize(prosper_clean['EmploymentStatus'])[0]
prosper_clean['ListingCategoryFactor'] = pd.factorize(prosper_clean['ListingCategory'])[0]

## Step 4: Build feature vectors 

We're going to use this from the `feature_columns` list.  This will get *only* the feature columns.

In [None]:
feature_columns = ['EmploymentStatusFactor', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategoryFactor']
features = prosper_clean[feature_columns]
label = prosper_clean['LoanStatus']

## Step 5: Split Data into training and test.

We will split our the data up into training and test.  (You know the drill by now).

**=> TODO: Split dataset into 70% training, 30% validation**


In [None]:
## TODO :  Split the data into 70% training and 30% test sets 
## Hint : 0.7   , 0.3
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features, label)
print("training set = " , len(train_x))
print("testing set = " , len(test_x))

In [None]:
len(train_x.keys())

## Step 6: Neural Network

Here we are going to define our network. We will do this with the Analytics Zoo Keras style API.

Here's what we have:

 * Input Layer
 * Hidden Layer (4 Neurons), with Dropout
 * Hidden Layer (4 Neurons), with Dropout
 * Output Layer (1 Neuron)
 
 This is *binary* clasification.

In [None]:
from zoo.pipeline.api.keras.layers import Dense, Dropout
from zoo.pipeline.api.keras.models import Sequential

def build_model(train_x):
  model = Sequential()

  model.add(Dense(4, input_dim=len(train_x.columns)))
  model.add(Dropout(0.2))
  model.add(Dense(4))
  model.add(Dropout(0.2))
  model.add(Dense(output_dim=1, activation='sigmoid'))
  
  model.compile(loss='binary_crossentropy', optimizer='adam')
  return model


model = build_model(train_x)
model.summary()

Now it's time to train. We will do this with the `model.fit()` from analytics zoo. Notice the Keras-like `.fit` semantics.

In [20]:
%%time
# Train the model
print("Training begins.")
model.fit(
    train_x.values,
    train_y.values,
    batch_size=500,
    nb_epoch=2)
print("Training completed.")

Training begins.
Training completed.
CPU times: user 370 ms, sys: 10 ms, total: 380 ms
Wall time: 11.6 s


In [21]:
predictions = model.predict(test_x.values)

In [22]:
model.predict(test_x.iloc[0:2].values).collect()

[array([1.], dtype=float32), array([1.], dtype=float32)]

## Step 7: Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.

In [23]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y.values, predictions.collect())


0.671707827206178

The results indicate that we get 67% accuracy, meaning we are right 2/3 of the time.

## Step 8: Improve Accuracy

### Add more features
Look at the schema of the full dataset.  Are there any columns you want to add. Make sure you up the number of neurons in the hidden layer as you add more features.