<h1 id="tocheading">Finding Patterns in Data using IBM Power and PowerAI</h1>
<div id="toc"></div>

In this lab we will explore an open source data set and discover how the tools that are part of **PowerAI** can help us find patterns in the data.  For this lab, we will make use of the Lending Club data set, **scikit-learn, TensorFlow, and Keras**.  Here is a brief description of Lending Club.

```
About the authors
Dustin VanStee - Data Scientist
Bob Chesebrough - Data Scientist
IBM Systems AI Center of Competence
contact : vanstee@us.ibm.com
```


<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-banner.png" width="800" height="500" align="middle"/>

[Lending Club (LC)](https://www.lendingclub.com/) is the world’s largest online marketplace connecting borrowers and investors. It is transforming the banking system to make credit more affordable and investing more rewarding. Lending Club operates at a lower cost than traditional bank lending programs and passes the savings on to borrowers in the form of lower rates and to investors in the form of solid risk-adjusted returns.

**The DATA**  
The original data set was downloaded from [LC](https://www.lendingclub.com/info/download-data.action) and covers complete loan data for all loans issued from 2007 through 2018, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. Additional features include credit history, number of finance inquiries, address information (zip code and state), and collections, among others. It is quite rich and is an excellent example of credit risk data.  Interestingly, Goldman Sachs’ new peer-to-peer lending platform called Marcus was built almost entirely using the Lending Club data.

Here is a link to some extra information regarding the fields of the data set.
[Data Dictionary](https://github.com/dustinvanstee/mldl-101/blob/master/lab5-powerai-lc/LCDataDictionary.csv)

**Important**

In this notebook, we will play with the Lending Club data, conduct a set of exploratory analyses, and try to apply various machine learning techniques to predict whether a borrower will default. We took a small sample of loans made in 2016 (130K) to help speed up the processing time for the lab.


Note: to remove a lot of the busy, verbose code, we make use of a utility Python file called lc_utils.py.  For implementation details, refer to the [python code](https://github.com/dustinvanstee/mldl-101/blob/master/lab5-powerai-lc/lc_utils.py).

### Quick word on the data science method
<img src="https://github.com/dustinvanstee/random-public-files/raw/master/dsx-methodology.png" width="900" height="700" align="middle"/>

Here we will use these simple high level steps to work through a typical data science problem.  This workflow is meant to be a high level guide, but in practice this is a highly iterative approach ...

### Goals

* Perform some initial analysis of the data for **Business Understanding**
* **Prepare the Data** for our visualization and modeling
* **Visualize** the data
* Model using **Dimension Reduction** and **Classification** techniques
* **Evaluate** the approach

## Business/Data Understanding and Preparation
<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-bu-dp.png" width="800" height="500" align="middle"/>

### Environment bootstrapping
Run the following commands to install a few python packages for later use

In [None]:
# !pip install -q jupyter-pip
# !pip install -q brunel
# import brunel
# !git fetch origin master
# !git reset --hard origin/master

### Import Libraries

In [None]:
# Code functions that are needed to run this lab
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import time
from datetime import datetime
import math

#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import glob

# custom library for some helper functions 
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

import myenv as myenv
import brunel

In [None]:
import sys
sys.path.append("../utils") # go to parent dir

%load_ext autoreload
%autoreload 2
from lc_utils_2020 import *

### Load the Data
Here we load data that was previously downloaded from lendingclub.com.  To keep this lab fast, we restrict the data to roughly 130K loans.

In [None]:
loan_df = load_sample_data('acc')
loan_df_orig = loan_df.copy()
loan_df.head()

#### Reload Data

In [None]:
# Sample the data down to a reasonable size for debugging
fraction = 0.1 # fraction of rows to keep; set to 1.0 to use all of the data
loan_df = loan_df_orig.copy()
if fraction < 1.0 :
    loan_df = loan_df.sample(frac=fraction, replace=False, random_state=1)



### Descriptive Statistics (1D)
Let's look at some 1D and 2D descriptive statistics for this dataset.

This dataset contains several types of data: numerical, categorical, and ranked.  This small module will take you through what is typically done to quickly understand the data.
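As a rough idea of what a 1D overview involves, here is a minimal sketch using plain pandas. The toy data frame and its values are made up for illustration, and this is not necessarily how `quick_overview_1d_v2` is implemented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for loan_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 20000, 7500],
    "annual_inc": [40000.0, 85000.0, np.nan, 52000.0],
    "grade": ["A", "C", "B", "A"],
})

n_rows, n_cols = toy_df.shape                  # number of rows and columns
col_types = toy_df.dtypes                      # data type of each column
numeric_stats = toy_df.describe()              # count, mean, std, min, quartiles, max (numeric columns)
grade_counts = toy_df["grade"].value_counts()  # frequency table for a categorical column

print(n_rows, n_cols)
print(col_types)
print(numeric_stats)
print(grade_counts)
```

Note that `describe()` skips NaN values when computing the mean, which is one reason to check for missing data before trusting the summary.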



In [None]:
# This function provides the number of rows/cols
# Information on the types of data
# and a report of descriptive statistics

quick_overview_1d_v2(loan_df)

Here we can get a quick assessment of the statistics for each column.  
**Quick Question**: What was the average income of the 133K loan applicants?

### Descriptive Statistics (2D)
Since we have over 100 numerical variables, creating a 2D correlation plot for all of them may be time consuming and difficult to interpret.  Let's look at correlations on a smaller scale for now.
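A pairwise correlation table for a handful of columns can be sketched with pandas alone. The toy values below are made up, and `quick_overview_2d` may compute or plot things differently.

```python
import pandas as pd

# Toy data with a subset of the lab's column names; values are invented.
toy_df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 20000, 7500, 15000],
    "annual_inc": [40000, 85000, 120000, 52000, 95000],
    "dti": [25.0, 14.0, 9.0, 22.0, 12.0],
})

cols = ["loan_amnt", "annual_inc", "dti"]
corr = toy_df[cols].corr()  # Pearson correlation by default

print(corr.round(2))
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.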


In [None]:
# Grab only a subset of columns
cols = ["loan_amnt","annual_inc","dti","fico_range_high","open_acc",'funded_amnt', 'total_acc']
quick_overview_2d(loan_df, cols)

**Quick Question**: Can you find a variable that is negatively correlated with annual_inc in the chart above?  Can you think of a reason for this result?

## Data Preparation

### Create Loan Default column.  This is the column we will predict later
The **loan_status** column contains the information of whether or not the loan is in default. 

This column has more than just a 'default or paid' status.  Since our goal is to build a simple default classifier, we need to make a new column based on the **loan_status** column.

Here we will look at all the categorical values in **loan_status**, and create a new column called **default** based off that one.
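The idea can be sketched as a simple mapping from status strings to a binary flag. The status strings and the choice of which ones count as a default are assumptions made for illustration; `create_loan_default` may use a different rule.

```python
import pandas as pd

# Assumption: which statuses count as "default" is a modeling choice.
BAD_STATUSES = {"Charged Off", "Default", "Late (31-120 days)"}

toy_df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off",
                    "Late (31-120 days)", "Fully Paid"],
})

# Inspect the categories first, then derive the binary target column
print(toy_df["loan_status"].value_counts())
toy_df["default"] = toy_df["loan_status"].isin(BAD_STATUSES).astype(int)
print(toy_df)
```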


In [None]:
# function to create loan status .... 
# Todo insert some extra 'noise' here ...
loan_df = create_loan_default(loan_df)
loan_df.head(3) # scroll to the right, and see the new 'default' column

### Data Preparation - Handle Null Values aka NaNs ...

One part of the data science process that's especially time consuming is working with unclean data.  This Lending Club data set is a great example of that.  If you look at the dataframe shown above, you will see a number of columns with the indicator **NaN**.  This means 'not a number' and needs to be dealt with prior to any machine learning steps.  You have many options here.  Some options are listed below...

* Fill with a value -> impute mean/median/min/max/other
* drop rows with NaNs
* drop columns with large number of NaNs 
* use data in other columns to derive

All these methods are possible, but it's up to the data scientist / domain expert to figure out the best approach.  There is definitely some grey area involved in deciding what the best approach is.
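The first three options above can be sketched in pandas as follows. The toy data, the choice of the median, and the 50% sparsity threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 60000.0, 80000.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "loan_amnt": [5000.0, 7000.0, np.nan, 9000.0],
})

# Option 1: fill with a summary value (here the column median)
imputed = toy_df.fillna({"annual_inc": toy_df["annual_inc"].median()})

# Option 2: drop rows that contain any NaN
rows_dropped = toy_df.dropna(axis=0, how="any")

# Option 3: drop columns where more than half of the values are NaN
nan_fraction = toy_df.isna().mean()
cols_dropped = toy_df.loc[:, nan_fraction <= 0.5]

print(imputed, rows_dropped, cols_dropped, sep="\n\n")
```

Deriving a value from other columns (the fourth option) depends on domain knowledge, so it is left out of this sketch.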

First, let's understand which columns have NaNs...
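A per-column NaN count can be sketched with `df.isna().sum()`. The toy frame is made up, and the helper used in this lab may format its report differently.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 60000.0],
    "mths_since_last_delinq": [np.nan, np.nan, 12.0],
    "loan_amnt": [5000.0, 7000.0, 9000.0],
})

nan_counts = toy_df.isna().sum()  # number of NaNs in every column
# Keep only the columns that actually contain NaNs, worst first
nan_report = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(nan_report)
```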

In [None]:
# For every column, count the number of NaNs .... 
# code hint : uses df.isna().sum()

columns_with_nans(loan_df)


As you can see, we have some work to do to clean up the NaN values.  Beyond NaN values, we also have to transform columns if they aren't formatted correctly, or maybe we want to transform a column based on custom requirements.  

```
Example : column=employee_length , values=[1,2,3,4,5,6,7,8,9,10+] formatted as a string
          transform into 
          column=employee_length, [0_3yrs,4_6yrs,gt_6yrs] (categorical:strings)
```
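A transform like the one in the example above could be sketched as follows. The function name `bucket_employee_length` is hypothetical, the input format follows the example (numbers stored as strings, with `10+` as the top value), and `handle_employee_length` in this lab may work differently.

```python
import pandas as pd

def bucket_employee_length(value: str) -> str:
    """Map a years-of-employment string such as '3' or '10+' to a coarse bucket."""
    years = int(str(value).replace("+", ""))
    if years <= 3:
        return "0_3yrs"
    if years <= 6:
        return "4_6yrs"
    return "gt_6yrs"

toy = pd.Series(["1", "3", "4", "6", "7", "10+"], name="employee_length")
bucketed = toy.map(bucket_employee_length)
print(bucketed.tolist())
```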
          
Luckily, we took care to process and clean this data below using a few functions.  In practice, **this is where data scientists spend a large portion of their time**, as cleaning the data requires detailed domain knowledge.  We have made a fair number of assumptions about how to process the data, which we won't go into due to time constraints for the lab.

In [None]:
# OLD FLOW ....
# The following cleaning of the data makes use of the steps shown below.....

#loan_df1 = drop_sparse_numeric_columns(loan_df)
#loan_df2 = drop_columns(loan_df1)
#loan_df3 = impute_columns(loan_df2)
#loan_df4 = handle_employee_length(loan_df3)
#loan_df5 = handle_revol_util(loan_df4)
#loan_df6 = drop_rows(loan_df5)

#loan_df = clean_lendingclub_data(loan_df)


In [None]:
# FASTAI FLOW

#loan_df1 = drop_sparse_numeric_columns(loan_df)
loan_df1 = drop_columns(loan_df)
#loan_df3 = impute_columns(loan_df2)
loan_df2 = handle_employee_length(loan_df1)
loan_df3 = handle_revol_util(loan_df2)
#loan_df6 = drop_rows(loan_df5)

#loan_df = clean_lendingclub_data(loan_df)
loan_df = loan_df3

In [None]:
# Final Sanity check ....
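# If we did our job right in the old flow, there should not be any NaNs left.
# In the fastai flow, imputation is skipped here and FillMissing handles NaNs later.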
# Use this convenience function to check

# code hint df.isna().sum()

#columns_with_nans(loan_df)

### Data Preparation - Handle Time Objects
For columns that contain date information, you may sometimes want to break them down into individual columns like month, day, day of week, etc.  For our use case, we will create a new column called `time_history` that indicates how long an applicant has been a borrower.  This is an example of **feature engineering**: using business logic to create a new column (feature) that may have predictive value.
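A feature like `time_history` could be derived roughly as shown here. The `Mon-YYYY` date format, the use of days as the unit, and the extra `issue_month` column are assumptions for illustration; `create_time_features` may differ.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "earliest_cr_line": ["Jan-2001", "Jun-2010"],
    "issue_d": ["Jan-2016", "Dec-2016"],
})

# Parse the date strings; unparseable values become NaT
for col in ["earliest_cr_line", "issue_d"]:
    toy_df[col] = pd.to_datetime(toy_df[col], format="%b-%Y", errors="coerce")

# Length of credit history, in days, at the time the loan was issued
toy_df["time_history"] = (toy_df["issue_d"] - toy_df["earliest_cr_line"]).dt.days
# Example of breaking a date down into a component column
toy_df["issue_month"] = toy_df["issue_d"].dt.month
print(toy_df)
```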

In [None]:
loan_df = create_time_features(loan_df)
loan_df.earliest_cr_line = pd.to_datetime(loan_df.earliest_cr_line, errors='coerce')
loan_df.issue_d = pd.to_datetime(loan_df.issue_d, errors='coerce')
loan_df.head(3)

### One-Hot Encode Categorical Data ###

If you look at the data frame above, we are almost ready to start building models.  However, there is one important step to complete.  Notice that some columns still contain string data:
```
example column=home_ownership values=[RENT, MORTGAGE, OWN]
```
Most machine learning algorithms only process numerical data, so we need to transform these **categorical columns** into **indicator columns**.

From the example above, the transform would yield 3 new columns

```
example column=RENT values=[0,1]
        column=MORTGAGE values=[0,1]
        column=OWN values=[0,1]
```

Conveniently, pandas has a nice function called **get_dummies** that we will use for this purpose.
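Here is a minimal sketch of `pd.get_dummies` on the `home_ownership` example. The toy data is made up, and `one_hot_encode_keep_cols` may handle column selection differently.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 7500],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# Replace the string column with one 0/1 indicator column per category
encoded = pd.get_dummies(toy_df, columns=["home_ownership"], dtype=int)
print(encoded)
```

Note that pandas prefixes each new column with the original column name (for example `home_ownership_RENT`), which keeps indicator columns from different source columns apart.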

In [None]:
# Skip for fastAI
# Transform categorical data into binary indicator columns
# code hint, uses pd.get_dummies

# loan_df = one_hot_encode_keep_cols(loan_df)
loan_df.head() # once complete, see how many new columns you have!

### Final Result after data preparation ....

OK, so you made it here. Let's take a look at the final results of your data preparation work.  It may be helpful to **qualitatively compare** your original data frame to this one and see how different they look.  Execute the cells below to get a sense of what the transformations accomplished.

In [None]:
loan_df_orig.head(3)

In [None]:
loan_df.head(3)

### Split into Train / Test Dataframes

In [None]:
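from sklearn.model_selection import train_test_split  # explicit import in case lc_utils_2020 does not re-export it

# Hold out 20% of the loans as a test set; random_state makes the split reproducible
train_df, test_df = train_test_split(loan_df, test_size=0.20, random_state=52)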

### Export Data For H2O or other tools

In [None]:
DATE="020320"
train_df.to_csv(path_or_buf="../curateddata/lc_h2o_train_{}.csv".format(DATE),index=False,header=True)
test_df.to_csv(path_or_buf="../curateddata/lc_h2o_test_{}.csv".format(DATE),index=False,header=True)

In [None]:
# gzip


## Data Visualization
As you saw, when you 'describe' a data frame, you get a table of statistics showing the mean, min, max, and other statistics for each column.  This is good, but sometimes it's also good to look at histograms of the data.  Let's visualize some of the distributions from our dataset.
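As a rough sketch of the idea, pandas can draw one histogram per numerical column with `DataFrame.hist`. The randomly generated toy data is only a stand-in, and `plot_histograms` may produce a different layout.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Randomly generated stand-in data; not real Lending Club values
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 35000, size=500),
    "annual_inc": rng.lognormal(mean=11, sigma=0.5, size=500),
})

# One histogram per numeric column; returns an array of matplotlib axes
axes = toy_df.hist(bins=30, figsize=(10, 4))
plt.tight_layout()
```

A skewed column such as income often shows a long right tail that the mean alone would hide.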


<img src="https://github.com/dustinvanstee/random-public-files/raw/master/data-visualization.png" width="800" height="500" align="middle"/>

In [None]:
# Here we plot distribution charts for all the numerical columns in our dataframe
plot_histograms(loan_df)

### Brunel Visualization Examples
Here we use the Brunel visualization package.  The following documentation was useful in preparing the graphs below.
* https://brunel.mybluemix.net/docs/Brunel%20Documentation.pdf

In [None]:
# Build a statistics data frame based on issue date
# aggregate on loan amount
vis_df = loan_df.copy()
vis_df['default'] = loan_df['default']

### Outcome Variable: Loan Status
On the left is the breakdown of all loan status classifications.  On the right is our simple default classification based on our data prep.

In [None]:
a=vis_df.sample(5000) # downsample for speed
%brunel data('a') bar x(loan_status) y(#count:linear) color(loan_status)  percent(#count:overall) tooltip(#all) | stack polar bar y(#count) color(default) percent(#count) tooltip(#all) :: width=1200, height=350 

In [None]:
ldf = vis_df.sample(5000) # downsample for speed

# Count loans per class so the pie shows the share of defaults vs. non-defaults
default_counts = ldf['default'].value_counts()

figure, ax1 = plt.subplots(figsize=(6, 6))
ax1.pie(default_counts, labels=default_counts.index, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.



### Loan Purpose
Let's try to get a sense of why people are borrowing ...

In [None]:
purpose_count = vis_df.groupby('purpose')['loan_status'].count().to_frame().rename(columns = {'loan_status':'count'})
%brunel bubble data('purpose_count') color(COUNT:[blues, reds]) size(COUNT) label(PURPOSE) tooltip(#all)

As you can see, this could go on forever, but hopefully you get a sense of the power of data visualization.

## Modelling Phase

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/modeling.png" width="800" height="500" align="middle"/>

### FastAI Implementation

In [None]:
from fastai.tabular import *
#loan_df.dtypes

fai_df = train_df[:10000].copy()
fai_df = fai_df[fai_df.loan_amnt.notnull()]  # keep only rows with a loan amount

add_datepart(fai_df, 'earliest_cr_line',prefix="ecl_",time=True) # inplace
add_datepart(fai_df, 'issue_d',prefix="iss_",time=True) # inplace

#fai_df.describe()
fai_df.dtypes
display(fai_df.head(5))
print("Fast AI num records = {}".format(len(fai_df)))

In [None]:
#list(df.select_dtypes(include=['object']).columns.values)
quick_overview_1d_v2(fai_df)

In [None]:
### Clean out NaNs but leave categorical !!

#### Setup Transformers and Splits

In [None]:
procs = [FillMissing, Categorify, Normalize]
# Target / Label Column
dep_var   = 'default'

# Categorical Variables
cat_names = list(fai_df.select_dtypes(include=['object','bool','int64']).columns.values)
cat_names.remove('id')
cat_names.remove('default')

print("Total number of categorical columns :{}".format(len(cat_names)))
cat_names = cat_names[0:10]
#cat_names = cat_names[0:4] + cat_names[6:10]

#Continuous Variables
cont_names = list(fai_df.select_dtypes(include=['float64']).columns.values)
cont_names.remove('member_id')

print("Total number of continuous columns :{}".format(len(cont_names)))
cont_names = cont_names[0:50]
#print(type)
#fastai_cols = cat_names + cont_names
# 
print("\nCategoricals ({}): {},{}".format(len(cat_names),cat_names,type(cat_names)))
print("\nContinuous ({}): {}".format(len(cont_names),cont_names))
#print("\nfastai_cols : {}".format(fastai_cols))

# Setup Split
path= ""
split = int(len(fai_df)*0.30)
valid_idx = range(len(fai_df)-split, len(fai_df))
print("\nIndex splits training : 0:{}".format(len(fai_df)-split))
print("Index splits validation : {}".format(valid_idx))
print('\n')

fai_df2 = fai_df[cat_names+cont_names+[dep_var]].copy().reset_index()
columns_with_nans(  fai_df2)

#### Create Tabular Databunch

In [None]:
print("Total number of categorical columns :{}".format(len(cat_names)))
print("Total number of continuous columns :{}".format(len(cont_names)))
print("Total number of columns in fai_df2 :{}".format(len(fai_df2.columns)))

data = TabularDataBunch.from_df(path="",df=fai_df2, 
        dep_var=dep_var, procs=procs, valid_idx=valid_idx,
        cat_names=cat_names, cont_names=cont_names)
#data.train_ds.x.inner_df.size
#print(data.train_ds.cat_names)  # `cont_names` defaults to: set(df)-set(cat_names)-{dep_var}
#print(data.train_ds.cont_names)  # `cont_names` defaults to: set(df)-set(cat_names)-{dep_var}

#      1 data = (TabularList.from_df(df, path=PATH, cat_names=cat_names, procs=procs)
#      2                            .random_split_by_pct()
#----> 3                            .label_from_df(cols=dep_var)
#      4                            .add_test(test)
#      5                            .databunch())
#data.train_ds.x.inner_df.head()

In [None]:
# data.train_ds.x.inner_df.describe()

In [None]:
#dir(data)
#'add_test','add_tfm','batch_size','create','device','dl','dl_tfms','dls',
#'empty_val','export','fix_dl','fix_ds','from_df','is_empty','label_list',
#'load_empty','loss_func','one_batch','one_item','path','remove_tfm','sanity_check',
#'save','show_batch','single_dl','single_ds',
#'test_dl','test_ds','train_dl','train_ds','valid_dl','valid_ds']

# dir(data.train_ds)  # Data Set, _dl is data_loader
# 'c','databunch','export','filter_by_func','get_state','item',
# 'load_empty','load_state','new','predict','process',
# 'set_item','tfm_y','tfmargs','tfmargs_y','tfms','tfms_y',
# 'to_csv','to_df','transform','transform_y'

# data.train_ds.c
# data.train_ds.to_df()

In [None]:
# import re
# for i in range(35000) :
#     a=str(data.train_ds.get(i))
#     if re.search("id #na",a) :
#         print(str(a))
# for i in range(15000) :
#     a=str(data.valid_ds.get(i))
#     if re.search("id #na",a) :
#         print(str(a))
# # ALL valids have #na# !!!

In [None]:
# display(df.iloc[35000:35002])
# print(data.train_ds.get(0))
# print()
# print(data.valid_ds.get(0))
# print()


#### Create Tabular Learner

In [None]:
learn = tabular_learner(data, layers=[200,100], emb_drop=0.2, emb_szs={'addr_state': 2, 'zip_code': 11}, metrics=accuracy)
#learn.summary()


In [None]:
learn.fit_one_cycle(1, 1e-3)

In [None]:
learn.lr_find(start_lr=1e-8,end_lr=1.0)

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(div_factor=100,max_lr=5e-3,cyc_len=10)

In [None]:
#learn.predict(df.iloc[3])

In [None]:
# learn methods
# 'add_time', 'apply_dropout', 'backward', 'bn_wd', 'callback_fns', 'callbacks', 
# 'clip_grad', 'create_opt', 'data', 'destroy', 'dl', 'export', 
# 'fit', 'fit_fc', 'fit_one_cycle', 'freeze', 'freeze_to', 'get_preds', 
# 'init', 'interpret', 'layer_groups', 'load', 'loss_func', 
# 'lr_find', 'lr_finder', 'lr_range', 'metrics', 'mixup', 
# 'model', 'model_dir', 'one_cycle_scheduler', 'opt', 'opt_func', 
# 'path', 'pred_batch', 'predict', 'predict_with_mc_dropout', 
# 'purge', 'recorder', 'save', 'show_results', 'silent', 'split', 
# 'summary', 'to_fp16', 'to_fp32', 'train_bn', 'true_wd', 'unfreeze', 
# 'validate', 'wd'

In [None]:
learn.show_results()


In [None]:
interp = learn.interpret()


In [None]:
# interp.confusion_matrix()
interp.plot_confusion_matrix()
#interp.plot_tab_top_losses(10)

In [None]:
interp.losses

### Train / Test set creation

One of the key points in any machine learning workflow is the **partitioning** of the data set into **train** and **test** sets.  The key idea here is that a model is built using the training data, and evaluated using the test data.  

There are more nuances to how you partition data into train/test sets, but for purposes of this lab we will omit these finer points.
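One such nuance is keeping the share of defaults the same in both partitions. Here is a minimal sketch using scikit-learn's `train_test_split` with the `stratify` argument; the toy data is made up, and the `lendingclub_ml` helper used below may split the data differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 loans, 20% of which defaulted
toy_df = pd.DataFrame({
    "loan_amnt": range(100),
    "default": [0] * 80 + [1] * 20,
})

# Hold out 20% for testing, preserving the default rate in both partitions
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=52, stratify=toy_df["default"]
)

print(len(toy_train), len(toy_test))
print(toy_train["default"].mean(), toy_test["default"].mean())
```

Fixing `random_state` makes the split reproducible, so results can be compared across runs.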

In [None]:
%load_ext autoreload
%autoreload 2
from lc_utils import *

In [None]:
# Instantiate a lendingclub_ml object that will hold our train/test data and contain methods used for testing.
# Implementation done like this to ease the burden on users for keeping track of train/test sets for different
# models we are going to build.

my_analysis = lendingclub_ml(loan_df)

In [None]:
# Create a train / test split of your data set.  Parameter is the test set size as a fraction of the data
# Returns data in the form of dataframes

my_analysis.create_train_test(test_size=0.4)

### Credits 
* Bob Chesebrough - IBM CSSC Data Scientist
* Catherine Cao - IBM FSS Data Scientist
* [Hands on Machine Learning - Geron](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/)

### More Learning
* Coursera Deeplearning.ai  (Ng)
* Coursera Machine Learning (Ng)
