# Neural Network : Prosper Loan Dataset

We are going to look at the Prosper loan dataset, which records the history of loans made by Prosper.


In [None]:
%matplotlib inline
import time
import datetime
import pandas as pd
import matplotlib.pyplot as plt




try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf


## Step 1: Load the Data

Notice that we first load the data into a pandas DataFrame. This is fine for a small dataset, but a large, "at scale" notebook will need more than a single in-memory DataFrame.

In [None]:
## small file, start with this
#datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv"
## this is a large file
datafile = "https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data.csv.gz"


data = pd.read_csv(datafile)
data
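
One way to go beyond a single in-memory DataFrame is to read the CSV in chunks. The sketch below is only an illustration: the CSV text and its values are made up, and a real pipeline may need a different tool altogether.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Prosper CSV (made-up values, far fewer columns)
csv_text = """CreditScore,StatedMonthlyIncome
700,3500.0
640,2100.0
720,5000.0
600,1800.0
680,4200.0
"""

total_rows = 0
income_sum = 0.0
# chunksize makes read_csv return an iterator of small DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    income_sum += chunk["StatedMonthlyIncome"].sum()

mean_income = income_sum / total_rows
print(total_rows, mean_income)
```

Only one chunk is held in memory at a time, so running totals like these can be computed over files that do not fit in memory.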

In [None]:
## TODO : select a few columns 
## start with: 'LoanStatus',  'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategory'
#select_columns = ['LoanStatus', 'EmploymentStatus', 'CreditScore', '???', '???']


## we can add more later

select_columns = ['LoanStatus',  'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategory']

## Feature Columns
## Note : feature columns can only hold numbers, so use the factorized versions
## of the categorical columns (created in Step 3), not the raw categorical columns.
## And definitely not 'LoanStatus', which is the label (if you are curious, include it and see what happens!)
feature_columns = ['EmploymentStatusFactor', 'CreditScore', 'StatedMonthlyIncome', 'ListingCategoryFactor']




## Step 2: Clean Data

In [None]:
## TODO :  Drop any NA, null values.
## Hint : Use `.dropna()` on the DataFrame
## .copy() gives an independent frame, so adding columns later is safe
prosper_clean = data.dropna().copy()

print("Original record count {:,}, cleaned records count {:,},  dropped {:,}"\
      .format(len(data), len(prosper_clean), 
              (len(data) - len(prosper_clean))))
prosper_clean
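
If dropping every row with any missing value removes too many records, `dropna` can be limited to the columns we actually model on. This is a small sketch on made-up data; the `Notes` column is hypothetical and only there to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 'Notes' is a hypothetical column we do not model on
toy = pd.DataFrame({
    "CreditScore": [700, np.nan, 640, 720],
    "StatedMonthlyIncome": [3500.0, 2100.0, 5000.0, np.nan],
    "Notes": [np.nan, np.nan, "late once", np.nan],
})

# Drop a row if ANY column is missing
all_cols = toy.dropna()
# Drop a row only if one of the listed columns is missing
model_cols = toy.dropna(subset=["CreditScore", "StatedMonthlyIncome"])

print(len(toy), len(all_cols), len(model_cols))
```

In the toy frame, restricting the check to the two modeling columns keeps one more row than the unrestricted call.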

## Look at some summary data

In [None]:
print(prosper_clean['LoanStatus'].value_counts())
print(prosper_clean['EmploymentStatus'].value_counts())
print(prosper_clean['ListingCategory'].value_counts())


**=> What does that say about the cardinality of these categorical columns?**
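
Cardinality (the number of distinct values per column) can also be read off directly with `nunique()`. The values below are made up for illustration and may not match the real dataset.

```python
import pandas as pd

# Made-up category values, for illustration only
toy = pd.DataFrame({
    "LoanStatus": ["Completed", "Defaulted", "Completed", "Completed"],
    "EmploymentStatus": ["Employed", "Full-time", "Employed", "Retired"],
})

# One count of distinct values per column
cardinality = toy.nunique()
print(cardinality)
```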



## Step 3: Converting Categorical columns 

Convert categorical columns to numeric.   
Here we convert the **EmploymentStatus** and **ListingCategory** columns.

In [None]:
# use pd.factorize on EmploymentStatus, ListingCategory

prosper_clean['EmploymentStatusFactor'] = pd.factorize(prosper_clean['EmploymentStatus'])[0]
prosper_clean['ListingCategoryFactor'] = pd.factorize(prosper_clean['ListingCategory'])[0]
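
To see what `pd.factorize` returns, here is a small sketch on a made-up series. It returns a pair: the integer codes (what we keep with `[0]` above) and the unique values those codes point to.

```python
import pandas as pd

# Made-up employment statuses, for illustration only
status = pd.Series(["Employed", "Full-time", "Employed", "Retired", "Full-time"])

# codes: one integer per row; uniques: the distinct values, in order of first appearance
codes, uniques = pd.factorize(status)
print(codes)
print(list(uniques))

# The codes can be mapped back to the original labels
decoded = uniques.take(codes)
```

Codes are assigned in order of first appearance, so the numbers carry no meaning beyond identifying the category.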

## Step 4: Build feature vectors 

In [None]:
features = prosper_clean[feature_columns]
label = prosper_clean['LoanStatus']

## Step 5: Split Data into training and test.

We will split the data into training and test sets.  (You know the drill by now.)

**=> TODO: Split dataset into 70% training, 30% test**


In [None]:
## TODO :  Split the data into 70% training and 30% test sets 
## Hint : 0.7   , 0.3
from sklearn.model_selection import train_test_split
## test_size=0.3 gives the 70/30 split (the default would be 75/25)
train_x, test_x, train_y, test_y = train_test_split(features, label, test_size=0.3)
print("training set = " , len(train_x))
print("testing set = " , len(test_x))
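
A quick sanity check of the 70/30 split on made-up data. The `random_state` value is an arbitrary choice, added here only so the split is reproducible from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features
y = np.array([0, 1] * 5)           # made-up binary labels

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))
```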

In [None]:
len(train_x.keys())

In [None]:
y = tf.keras.utils.to_categorical(train_y)
print(y)
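
For integer class ids, `to_categorical` produces a one-hot matrix. The NumPy sketch below shows roughly the same idea on made-up labels; it is an illustration, not the Keras implementation.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])       # made-up integer class ids
num_classes = labels.max() + 1

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot)
```

Note that the model in Step 6 ends in a single sigmoid unit and is fitted on `train_y`, not on the one-hot `y`, so this cell is only for inspection.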



## Step 6: Neural Network

Note that this uses TensorFlow's Keras interface, which is the standard high-level interface for TensorFlow starting with 2.0.


In [None]:
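def build_model(train_x):
  # Size the input layer from the data instead of hard-coding the feature count
  n_features = train_x.shape[1]
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation=tf.nn.relu, input_shape=(n_features,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(4, activation=tf.nn.relu),
    # Single sigmoid unit: outputs the probability of the positive class
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
  ])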

  model.compile(loss='binary_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])
  return model


model = build_model(train_x)
model.summary()
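
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout has no parameters. A small sketch, assuming four input features and the layer sizes used above (`dense_params` is a helper made up for this example):

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers; Dropout adds no parameters
layer_sizes = [(4, 4), (4, 4), (4, 1)]
per_layer = [dense_params(i, u) for i, u in layer_sizes]
total = sum(per_layer)
print(per_layer, total)
```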

In [None]:

EPOCHS = 10



log_dir="logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


model.fit(
  train_x, train_y,
  epochs=EPOCHS, validation_split = 0.2, verbose=2, callbacks=[tensorboard_callback])

In [None]:
predictions = model.predict(test_x)

## Step 7: Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions > 0.5)
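
The same accuracy calculation can be spelled out with NumPy. The sketch below uses made-up probabilities and labels, and also computes the accuracy of always predicting the most common class, which is a useful reference point when the classes are imbalanced.

```python
import numpy as np

probs = np.array([[0.9], [0.2], [0.6], [0.4]])   # made-up outputs, shape (n, 1) like model.predict
y_true = np.array([1, 0, 0, 0])                  # made-up 0/1 labels

# Flatten to shape (n,) and threshold at 0.5
y_pred = (probs.ravel() > 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

# Accuracy of always predicting the majority class
majority_baseline = max(y_true.mean(), 1 - y_true.mean())
print(accuracy, majority_baseline)
```

In this toy case the model's accuracy is no better than the majority-class baseline, which is why accuracy alone can be misleading.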


## Step 8: Tensorboard

In [None]:
# Colab / Jupyter: load the TensorBoard notebook extension, then launch it inline
%load_ext tensorboard
%tensorboard --logdir logs/fit

# Alternatively, run the following at the command line: tensorboard --logdir logs/fit

## Step 9: Improve Accuracy

### Add more features
Look at the schema of the full dataset.  Are there any columns you want to add? Make sure you increase the number of neurons in the hidden layers as you add more features.