<h1 id="tocheading">Finding Patterns in Data using IBM Power and PowerAI</h1>
<div id="toc"></div>

In this lab we will explore an open source data set and see how the tools that are part of **PowerAI** can be used to discover patterns in the data.  For this lab, we will make use of the Lending Club data set, **scikit-learn, TensorFlow and Keras**.  Here is a brief description of Lending Club.

```
About the authors
Dustin VanStee - Data Scientist
Bob Chesebrough - Data Scientist
IBM Cognitive Systems Solution Center
contact : vanstee@us.ibm.com
```


<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-banner.png" width="800" height="500" align="middle"/>

[Lending Club (LC)](https://www.lendingclub.com/) is the world’s largest online marketplace connecting borrowers and investors. It is transforming the banking system to make credit more affordable and investing more rewarding. Lending Club operates at a lower cost than traditional bank lending programs and passes the savings on to borrowers in the form of lower rates and to investors in the form of solid risk-adjusted returns.

**The DATA**  
The original data set was downloaded from [LC](https://www.lendingclub.com/info/download-data.action) and covers complete loan data for all loans issued from 2007 through 2018, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. Additional features include credit history, number of finance inquiries, address information (zip code and state), and collections, among others. The data is quite rich and is an excellent example of credit risk data.  Interestingly, Goldman Sachs’ new peer-to-peer lending platform called Marcus was built almost entirely using the Lending Club data.

Here is a link to some extra information regarding the fields of the data set.
[Data Dictionary](https://github.com/dustinvanstee/mldl-101/blob/master/lab5-powerai-lc/LCDataDictionary.csv)

**Important**

In this notebook, we will work with the Lending Club data, conduct exploratory analysis, and apply various machine learning techniques to predict borrower default. We took a small sample of loans made in 2016 (130K) to help speed up the processing time for the lab.


Note: to keep verbose code out of the notebook, we make use of a utility Python file called lc_utils.py.  For implementation details, refer to the [python code](https://github.com/dustinvanstee/mldl-101/blob/master/lab5-powerai-lc/lc_utils.py).

### Quick word on the data science method
<img src="https://github.com/dustinvanstee/random-public-files/raw/master/dsx-methodology.png" width="900" height="700" align="middle"/>

Here we will use these simple high level steps to work through a typical data science problem.  This workflow is meant to be a high level guide, but in practice this is a highly iterative approach ...

### Goals

* Perform some initial analysis of the data for **Business Understanding**
* **Prepare the Data** for our visualization and modeling
* **Visualize** the data
* Model using **Dimension Reduction** and **Classification** techniques
* **Evaluate** the approach

## Business/Data Understanding and Preparation
<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-bu-dp.png" width="800" height="500" align="middle"/>

In [None]:
# Environment bootstrapping
# !pip install jupyter-pip
# !pip3 install brunel
# !git fetch origin master
# !git reset --hard origin/master

### Import Libraries

In [None]:
# Code functions that are needed to run this lab
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import time
from datetime import datetime
import math

# Widen the pandas display limits so wide data frames stay readable
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import glob

from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# custom libraries with helper functions for this lab
import myenv as myenv
from lc_utils import *


### Load the Data
Here we load data that was previously downloaded from lendingclub.com.  To keep the lab fast, we restrict the number of loans to ~130K.

In [None]:
loan_df = load_sample_data()
loan_df_orig = loan_df.copy()  # keep an untouched copy of the raw data for later comparison
loan_df.head()

### Descriptive Statistics (1D)
Let's look at some 1D and 2D descriptive statistics for this dataset.

This dataset contains several types of data: numerical, categorical and ranked.  This small module takes you through what is typically done to quickly understand the data.



In [None]:
# This function reports the number of rows/cols,
# information on the types of data,
# and descriptive statistics for each column

quick_overview_1d(loan_df)

Here we can get a quick assessment of the statistics for each column.  
**Quick Question**: what was the average income for the 133K loan applicants?

### Descriptive Statistics (2D)
Since we have over 100 numerical variables, creating a 2D correlation plot for all of them would be time consuming and difficult to interpret.  Let's look at correlations on a smaller scale for now.


In [None]:
# Grab only a subset of columns
cols = ["loan_amnt","annual_inc","dti","fico_range_high","open_acc",'funded_amnt', 'total_acc']
quick_overview_2d(loan_df, cols)

**Quick Question**: Can you find a variable that is negatively correlated with annual_inc in the chart above?  Can you think of a reason for this result?

### Create Loan Default column.  This is the column we will predict later
The **loan_status** column contains the information of whether or not the loan is in default. 

This column has more than just a 'default or paid' status.  Since our goal is to build a simple default classifier, we need to make a new column based on the **loan_status** column.

Here we will look at all the categorical values in **loan_status**, and create a new column called **default** based off that one.


In [None]:
# Create the new 'default' column from loan_status
loan_df = create_loan_default(loan_df)
loan_df.head(3) # scroll to the right, and see the new 'default' column

### Handle Null Values aka NaNs...

One part of the data science process that's especially time consuming is working with unclean data.  This Lending Club data set is a great example of that.  If you look at the dataframe shown above, you will see a number of columns with the indicator **NaN**.  This means 'not a number', and it needs to be dealt with prior to any machine learning steps.  You have many options here.  Some of them are listed below.

* Fill with a value -> impute mean/median/min/max/other
* Drop rows with NaNs
* Drop columns with a large number of NaNs
* Derive the missing value from data in other columns

All these methods are possible, but it's up to the data scientist / domain expert to figure out the best approach.  There is definitely some grey area involved in deciding which approach is best.
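To make the first three options concrete, here is a minimal pandas sketch on a toy data frame.  The column names, the median imputation, and the 50% sparsity threshold are illustrative assumptions, not the choices made inside lc_utils.py.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (column names are illustrative)
toy = pd.DataFrame({
    "annual_inc": [50000.0, np.nan, 72000.0, 61000.0],
    "dti": [12.5, 18.0, np.nan, 22.1],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Option 1: fill with a value, here the column median
imputed = toy.fillna({"annual_inc": toy["annual_inc"].median()})

# Option 2: drop every row that contains at least one NaN
rows_dropped = toy.dropna(axis=0, how="any")

# Option 3: drop columns where more than half of the values are NaN
nan_fraction = toy.isna().mean()
cols_dropped = toy.loc[:, nan_fraction <= 0.5]
```

Note how aggressive option 2 is on this toy frame: only one of the four rows survives, which is why dropping rows is usually a last step rather than a first one.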

First, let's understand which columns have NaNs...

In [None]:
# For every column, count the number of NaNs .... 
# code hint : uses df.isna().sum()

columns_with_nans(loan_df)
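The code hint in the cell above can be expanded into a small helper.  This is only a sketch of the idea; `nan_counts` is a hypothetical name, and the real `columns_with_nans` in lc_utils.py may report its results differently.

```python
import numpy as np
import pandas as pd

def nan_counts(df: pd.DataFrame) -> pd.Series:
    """Return the NaN count of every column that has at least one NaN, largest first."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy frame: 'a' has two NaNs, 'b' has none, 'c' has one
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1, 2, 3],
    "c": [np.nan, "x", "y"],
})
print(nan_counts(toy))
```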


As you can see, we have some work to do to clean up the NaN values.  Beyond NaN values, we also have to transform columns if they aren't formatted correctly, or maybe we want to transform a column based on custom requirements.  

```
Example : column=employee_length , values=[1,2,3,4,5,6,7,8,9,10+] formatted as a string
          transform into 
          column=employee_length, [0_3yrs,4_6yrs,gt_6yrs] (categorical:strings)
```
          
Luckily, we took care to process and clean this data below using a few functions.  In practice, **this is where data scientists spend a large portion of their time**, as cleaning the data requires detailed domain knowledge.  We have made a fair number of assumptions about how to process the data, which we won't go into due to time constraints for the lab.
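The employee length example above could be sketched as follows.  `bucket_employee_length` is a hypothetical helper written for illustration, and it assumes the values arrive exactly as the strings shown in the example (`'1'` through `'10+'`); the actual `handle_employee_length` function may work differently.

```python
import pandas as pd

def bucket_employee_length(value: str) -> str:
    """Map a string such as '3' or '10+' to one of three coarse buckets."""
    years = int(str(value).rstrip("+"))  # '10+' -> 10
    if years <= 3:
        return "0_3yrs"
    if years <= 6:
        return "4_6yrs"
    return "gt_6yrs"

toy = pd.DataFrame({"employee_length": ["1", "4", "6", "7", "10+"]})
toy["employee_length"] = toy["employee_length"].map(bucket_employee_length)
print(toy)
```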

In [None]:
# The following cleaning of the data makes use of the steps shown below.....

#loan_df1 = drop_sparse_numeric_columns(loan_df)
#loan_df2 = drop_columns(loan_df1)
#loan_df3 = impute_columns(loan_df2)
#loan_df4 = handle_employee_length(loan_df3)
#loan_df5 = handle_revol_util(loan_df4)
#loan_df6 = drop_rows(loan_df5)

loan_df = clean_lendingclub_data(loan_df)


In [None]:
# Final Sanity check ....
# If we did our job right, there should not be any NaN's left.  
# Use this convenience function to check

# code hint df.isna().sum()

columns_with_nans(loan_df)

### Data Preparation - Handle Time Objects
Columns that contain date information can be broken down into individual columns such as month, day, or day of week.  For our use case, we will create a new column called `time_history` that indicates how long an applicant has been a borrower.  This is an example of **feature engineering**: using business logic to create a new column (feature) that may have predictive value.

In [None]:
loan_df = create_time_features(loan_df)
loan_df.head(3)
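As a rough illustration of this kind of feature, the sketch below derives a `time_history` value in years from two date columns.  It assumes the dates are stored as strings like `Dec-2016` in columns named `issue_d` and `earliest_cr_line`; both the format and the choice of columns are assumptions, and `create_time_features` may compute the feature differently.

```python
import pandas as pd

toy = pd.DataFrame({
    "issue_d": ["Dec-2016", "Jun-2016"],
    "earliest_cr_line": ["Dec-2006", "Jun-2011"],
})

# Parse the 'Mon-YYYY' strings into timestamps
issue = pd.to_datetime(toy["issue_d"], format="%b-%Y")
first_credit = pd.to_datetime(toy["earliest_cr_line"], format="%b-%Y")

# Length of credit history at loan issue, expressed in (approximate) years
toy["time_history"] = (issue - first_credit).dt.days / 365.25
print(toy)
```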

### Convert Categorical Data to One-Hot Encoding ###

If you look above at the data frame, we are almost ready to start building models.  However, there is one important step to complete.  Notice we have some columns that still contain string data:
```
example column=home_ownership values=[RENT, MORTGAGE, OWN]
```
Most machine learning algorithms expect numerical input, so we need to transform these **categorical columns** into **indicator columns**.

From the example above, the transform would yield 3 new columns

```
example column=RENT values=[0,1]
        column=MORTGAGE values=[0,1]
        column=OWN values=[0,1]
```

Conveniently, pandas has a nice function called **get_dummies** that we will use for this purpose.

In [None]:
# Transform categorical data into binary indicator columns
# code hint, uses pd.get_dummies

loan_df = one_hot_encode_keep_cols(loan_df)
loan_df.head() # once complete, see how many new columns you have!
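For reference, here is what `pd.get_dummies` does on the `home_ownership` example from above.  This is a sketch on a toy frame; `one_hot_encode_keep_cols` may pick columns and name the results differently.  Note that pandas prefixes each indicator column with the original column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [10000, 5000, 20000],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# Replace the string column with one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["home_ownership"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```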

### Final Result after data prep ....

OK, so you made it here. Let's take a look at the final results of your data preparation work.  It may be helpful to **qualitatively compare** your original data frame to this one and see how different they look.  Execute the cells below to get a sense of what the transformations accomplished.

In [None]:
loan_df_orig.head(3)

In [None]:
loan_df.head(3)

### Data Visualization
As you saw, when you 'describe' a data frame, you get a table showing the mean, min, max and other statistics for each column.  This is useful, but it also helps to look at histograms of the data.  Let's visualize some of the distributions from our dataset.


<img src="https://github.com/dustinvanstee/random-public-files/raw/master/data-visualization.png" width="800" height="500" align="middle"/>

In [None]:
# Here we plot distribution charts for all the numerical columns in our dataframe
plot_histograms(loan_df)

# Brunel Example
## The Growth of Lending Club
### Here we use the built-in Brunel Visualization graphics package
Lending Club has been expanding over the years in terms of total loan volume and average loan size.

In [None]:
# Build a statistics data frame keyed on issue date:
# average loan amount and loan count per issue date
vis_df = loan_df.copy()
vis_df['loan_status'] = loan_df_orig['loan_status']
by_date = vis_df.groupby('issue_d')
loan_stats = pd.concat([
    by_date['loan_amnt'].mean().rename('loan_average'),
    by_date['loan_status'].count().rename('loan_count'),
], axis=1)

In [None]:
%brunel data('loan_stats') line x(ISSUE_D) y(loan_average, loan_count) color(#series) tooltip(#all) :: width=900, height=350 


### Modelling Phase

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/modeling.png" width="800" height="500" align="middle"/>

### Train / Test set creation

One of the key points in any machine learning workflow is the **partitioning** of the data set into **train** and **test** sets.  The key idea here is that a model is built using the training data, and evaluated using the test data.  

There are more nuances to how you partition data into train/test sets, but for purposes of this lab we will omit these finer points.

In [None]:
%load_ext autoreload
%autoreload 2
from lc_utils import *

In [None]:
# Instantiate a lendingclub_ml object that holds our train/test sets and the methods used for modeling.
# It is implemented this way so users don't have to keep track of train/test sets for the different
# models we are going to build.

my_analysis = lendingclub_ml(loan_df)

In [None]:
# Create a train / test split of your data set.  The parameter is the fraction of rows held out for the test set
# Returns data in the form of dataframes

my_analysis.create_train_test(test_size=0.4)
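For readers curious what a split like this looks like outside the helper class, here is a minimal sketch using scikit-learn's `train_test_split` on a toy frame.  Stratifying on the `default` column and fixing `random_state` are assumptions added for illustration; the source does not say whether `create_train_test` does either.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "default": [1] * 10 + [0] * 90,  # 10% of the toy loans default
})

# Hold out 40% of the rows for testing, keeping the default rate
# the same in both partitions (stratify is an illustrative choice)
train_df, test_df = train_test_split(
    toy, test_size=0.4, stratify=toy["default"], random_state=42
)
print(len(train_df), len(test_df))
```

Stratifying matters most for skewed targets like loan defaults, where a purely random split can leave one partition with noticeably fewer positive examples.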

### Dimension Reduction
For this modeling exercise we will perform a couple of tasks, **dimension reduction** and **classification** as shown in the following diagram.

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-modeling-workflow.png" width="800" height="500" align="middle"/>

**Dimension Reduction** is useful when you have a large number of columns and would like to reduce them to a compressed representation.  In this lab we will try 2 methods of dimension reduction.  It will be your choice which method to use for the classification part of the lab! (You could even decide to bypass this step if you want ...)


## Dimension Reduction - PCA

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-pca.png"  width="200" height="125" align="middle"/>

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

A simple way to think about PCA is that it helps compress the data in a lossy representation of the original dataset.

This will also be used to help us visualize the data, as you will see below.

In [None]:
# Dimension Reduction using PCA
my_analysis.build_pca_model(n_components=20)

In the chart above, you can see that we get ok results from PCA.  Using the first 20 principal components, we can account for ~50% of the variance described in the dataset.  **Feel free to change the number of principal components above to see if adding more helps with explained variance.**
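To see where an explained-variance number like this comes from, here is a small scikit-learn sketch on synthetic data.  The data, the standardization step, and the choice of 5 components are all assumptions for illustration; `build_pca_model` may preprocess the loan data differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 500 rows, 10 correlated columns driven by 3 hidden factors plus a little noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

# PCA is sensitive to column scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_scaled)
# Fraction of total variance captured by the first k components, k = 1..5
cumulative = np.cumsum(pca.explained_variance_ratio_)
X_reduced = pca.transform(X_scaled)  # compressed, lossy representation

print(cumulative)
print(X_reduced.shape)
```

Because the synthetic columns are driven by only 3 hidden factors, the cumulative curve flattens after the third component.  Real loan data has less tidy structure, which is why many more components are needed there.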

## Dimension Reduction - AutoEncoder

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-ae.png"  width="600" height="375" align="middle"/>

An autoencoder is another method that can be used for dimension reduction.  An autoencoder is a **neural network** that tries to reproduce its own input at its output, under the constraint that it will lose information in the narrow bottleneck layer.  Because of this, it's again a lossy representation.  One key difference between autoencoders and PCA is that an autoencoder can find nonlinear relationships between variables that PCA cannot detect.

In [None]:
# This will build and run your auto-encoder 
# feel free to adjust 
# ae_layers -> needs to be an odd number of layers, and symmetric 
# Regularization -> controls overfitting
# epochs -> number of times to loop thru training set

my_analysis.build_ae_model(ae_layers=[100,25,6,25,100], regularization=0.001, epochs=1)
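The idea behind the autoencoder can be sketched without a deep learning framework.  The example below swaps in scikit-learn's `MLPRegressor`, trained to predict its own input, in place of the Keras model this lab builds; it is a stand-in to show the mechanics, not the implementation behind `build_ae_model`.  The layer sizes and the `encode` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))

# Symmetric hidden layers 8 -> 2 -> 8 around a 2-unit bottleneck;
# alpha is scikit-learn's L2 regularization strength
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), alpha=0.001,
                           max_iter=200, random_state=0)
autoencoder.fit(X, X)  # the training target is the input itself

def encode(model, data, n_encoder_layers=2):
    """Run only the encoder half and return the bottleneck activations."""
    hidden = data
    for weights, bias in zip(model.coefs_[:n_encoder_layers],
                             model.intercepts_[:n_encoder_layers]):
        hidden = np.maximum(hidden @ weights + bias, 0.0)  # ReLU, the MLPRegressor default
    return hidden

codes = encode(autoencoder, X)  # 10 columns compressed to 2
reconstruction_mse = np.mean((autoencoder.predict(X) - X) ** 2)
print(codes.shape, reconstruction_mse)
```

The reconstruction error plays the same role as the loss reported during training: the lower it is, the less information the bottleneck throws away.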

**Class Contest**: let's see who can find the best settings for the neural network to minimize loss!  Just yell out your results and the instructor will add them to the leaderboard!

# Update our train/test dataframes with new columns predicted by our PCA and Autoencoder models

Here we will take the models that we built and pass our data through them.  By doing this, we reduce the number of features in our data set by a significant amount (~177 columns => ~5-20 columns!).

In this step we will add new columns to our train/test data frames for both our PCA model and our autoencoder model.  Don't worry about the details of this step; it's just required for the follow-on visualization and training steps ahead.



In [None]:
my_analysis.update_train_test_df()

### Cool Visualizations using our dimension reduction columns

Next we will plot a few scatterplot grids based on our principal component and autoencoder representations of the data.

We will color each data point using this key
```
Green -> Fully paid or current loan
Red   -> Loan in default
```

In [None]:
# This will take a minute or so ...
my_analysis.visualize_dimred_results(mode='pca')

In [None]:
# This will take a minute or so ...
my_analysis.visualize_dimred_results(mode='ae')

If you can discern a pattern between the red / green dots, it's likely we can use a classifier to automatically separate them! We'll see that in a few more sections.

### Heatmap commentary
A heatmap can be another good visualization tool.  You can use it to get a sense of how the columns correlate with each other.  In the code below, play with the **sortColumn** input.  In the example below we are sorting by principal component 0, which is the component that has the most information encoded in it.  See if you can find out what PC0 might be composed of.  Try it for PC1, or AE0, AE1...

Pro tip: to get the most out of a heatmap, all the data needs to be normalized on a common 0 -> 1 scale so that the coloring of the columns works out.
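The 0 -> 1 normalization mentioned in the pro tip is commonly done with min-max scaling.  Here is a small pandas sketch; `min_max_scale` is a hypothetical helper, and the heatmap function used in this lab may normalize the data in its own way.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to [0, 1]; constant columns become 0."""
    col_min = df.min()
    col_range = (df.max() - col_min).replace(0, 1)  # avoid dividing by zero
    return (df - col_min) / col_range

toy = pd.DataFrame({
    "loan_amnt": [1000, 5000, 9000],
    "dti": [10.0, 20.0, 15.0],
    "flag": [1, 1, 1],  # constant column
})
scaled = min_max_scale(toy)
print(scaled)
```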


In [None]:
# This will take a minute or so ...
bob_heatmap_lc(my_analysis.test_df,sortColumn='PC0',add_corr=1)

The resolution is quite small, but try to find columns that go from solid red on the bottom to black on top. That would be an indication of high correlation with your sort column.

# Final Step - Lending Club Default Prediction

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-modeling-traintest.png"  width="600" height="375" align="middle"/>

Here we will build a classifier to predict whether a loan will default or not.  We will use a
**Deep Learning Classifier**.  You will have 3 options for data sources:
* the raw data
* PCA dimension reduction features
* Autoencoder features

To evaluate our model, we will use a simple contingency table (showing true/false positives/negatives).  However, this is a fairly simplistic method.  Better methods that data scientists use are the F1 score and PR/ROC curves, but those are beyond the scope of this lab.
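As a quick illustration of how the contingency table relates to accuracy and the F1 score, here is a sketch with scikit-learn on ten hand-made labels (1 = default).  The labels are invented for the example and are not results from this lab.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Ten made-up loans: actual outcome vs. predicted outcome
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary 0/1 labels, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

Here accuracy is 0.8 even though the model catches only two of the three defaults, which is the kind of gap that precision, recall and F1 are designed to expose.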

Step 1 here is to set our baseline result.  In this example, we are dealing with a **skewed** dataset.  This means that most people do not default and instead pay their loans off.  If you built a classifier that just predicted no default, you would be right most of the time.  Let's see the stats from our dataset below.

In [None]:
# Set our baseline
my_analysis.train_df['default'].describe()


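As you can see, **only ~12.8% of the applicants default**.  A classifier that always predicts 'no default' would therefore already be right ~87.2% of the time, so any classifier we build must beat that accuracy, or we aren't doing a very good job ;)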
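The baseline can be computed directly from the target column.  The sketch below uses a made-up series with a 13% default rate rather than the lab's data, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up target column: 13 defaults out of 100 loans
default = pd.Series([0] * 87 + [1] * 13, name="default")

default_rate = default.mean()  # fraction of loans that default
# Accuracy of a "classifier" that always predicts the majority class
baseline_accuracy = max(default_rate, 1 - default_rate)

print(default_rate, baseline_accuracy)
```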

In [None]:
# modes
# pca           : principal components only
# ae            : autoencoder components only
# raw           : all the data non reduced
# raw_no_grades : all the data non reduced except the grade info provided by lending club

import re

mode = 'pca' # ae , raw, raw_no_grades

if mode == 'pca':
    x_cols = [x for x in my_analysis.train_df.columns if 'PC' in x]
elif mode == 'ae':
    x_cols = [x for x in my_analysis.train_df.columns if 'AE' in x]
elif mode == 'raw':
    x_cols = [x for x in my_analysis.train_df.columns if 'AE' not in x and 'PC' not in x]
elif mode == 'raw_no_grades':
    x_cols = [x for x in my_analysis.train_df.columns if 'AE' not in x and 'PC' not in x]
    # drop columns whose names start with a grade letter A-G
    x_cols = [x for x in x_cols if not re.match('^[ABCDEFG]', x)]

#print(x_cols)
my_analysis.build_evaluate_dl_classifier(x_cols, epochs=25,batch_size=32,regularization=0.001)


**Class Contest**: see who can get the best accuracy. Shout out your answers to the instructor and see if you can top the leaderboard.
* accuracy = (true positive + true negative) / total

### Credits 
* Bob Chesebrough - CSSC major contributor
* [Hands on Machine Learning - Geron](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/)

### More Learning
* Coursera Deeplearning.ai  (Ng)
* Coursera Machine Learning (Ng)
