<h1 id="tocheading">Finding Patterns in Data using IBM Power and PowerAI</h1>
<div id="toc"></div>

In this lab we will explore an open source data set and discover how the tools that are part of **PowerAI** can help us find patterns in the data.  For this lab, we will use the Lending Club data set together with the pandas, NumPy, and scikit-learn libraries.  Here is a brief description of Lending Club.



In [None]:
#default_exp ai_essentials

<img src="https://raw.githubusercontent.com/dustinvanstee/aicoc-ai-immersion/master/nb_images/lendingclub_frameworks.png" width="800" height="500" align="middle"/>

[Lending Club (LC)](https://www.lendingclub.com/) is the world’s largest online marketplace connecting borrowers and investors. It is transforming the banking system to make credit more affordable and investing more rewarding. Lending Club operates at a lower cost than traditional bank lending programs and passes the savings on to borrowers in the form of lower rates and to investors in the form of solid risk-adjusted returns.

**The DATA**  
The original data set was downloaded from [LC](https://www.lendingclub.com/info/download-data.action) and covers complete loan data for all loans issued from 2007 through 2018, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. Additional features include credit history, number of finance inquiries, address information (zip code and state), and collections, among others. It is quite rich and is an excellent example of credit risk data.  Interestingly, Goldman Sachs’ new peer-to-peer lending platform called Marcus was built almost entirely using the Lending Club data.

Here is a link to some extra information regarding the fields of the data set.
[Data Dictionary](https://github.com/dustinvanstee/mldl-101/blob/master/lab5-powerai-lc/LCDataDictionary.csv)

**Important**

In this notebook, we will play with the Lending Club data, conduct a set of exploratory analyses, and try to apply various machine learning techniques to predict borrower default. We took a small sample of loans made in 2016 (130K) to help speed up the processing time for the lab.


Note: to remove a lot of the busy verbose code, we are making use of a utility Python file called lc_utils.py.  For implementation details you can refer to the [python code](https://github.com/dustinvanstee/mldl-101/blob/master/lab5-powerai-lc/lc_utils.py).

### Quick word on the data science method
<img src="https://raw.githubusercontent.com/dustinvanstee/aicoc-ai-immersion/master/nb_images/dsx-methodology.png" width="900" height="700" align="middle"/>

Here we will use these simple high-level steps to work through a typical data science problem.  This workflow is meant to be a high-level guide; in practice it is a highly iterative process.

### Goals

* Perform some initial analysis of the data for **Business Understanding**
* **Prepare the Data** for our visualization and modeling
* **Visualize** the data
* Model using **Dimension Reduction** and **Classification** techniques
* **Evaluate** the approach

## Business/Data Understanding and Preparation
<img src="https://raw.githubusercontent.com/dustinvanstee/aicoc-ai-immersion/master/nb_images/iterative_workflow.png" width="800" height="500" align="middle"/>

### Import Libraries

In [None]:
# export

# pick up lc_utils_2020.py file
import sys
sys.path.append("../Tabular") 

In [None]:
#export 
print("Importing Data")
# Code functions that are needed to run this lab
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import time
from datetime import datetime
import math

# Widen the pandas display limits so wide dataframes can be inspected
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import glob

from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# custom libraries with helper functions for this lab
import myenv as myenv
from lc_utils_2020 import *

In [None]:
%load_ext autoreload
%autoreload 2
from lc_utils_2020 import *

### Load the Data
Here we load data that was previously downloaded from lendingclub.com.  To keep the lab fast, we restrict the data to about 130K loans.

**Key functions**
* pd.read_csv
* pd.concat
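
As a rough illustration of how these two functions are typically combined, the sketch below writes two tiny CSV files to a temporary directory, reads each with `pd.read_csv`, and stacks them with `pd.concat`. The file names and column values are invented for the example; the lab's `load_sample_data` helper may work differently.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny CSV files to a temporary directory so the sketch is self-contained.
# (File names and values are made up for illustration.)
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"loan_amnt": [1000, 2500], "grade": ["A", "C"]}).to_csv(
    os.path.join(tmp_dir, "loans_part1.csv"), index=False)
pd.DataFrame({"loan_amnt": [4000], "grade": ["B"]}).to_csv(
    os.path.join(tmp_dir, "loans_part2.csv"), index=False)

# Read every matching file, then stack the pieces into one dataframe.
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "loans_part*.csv")))
parts = [pd.read_csv(path) for path in csv_files]
toy_loans = pd.concat(parts, ignore_index=True)

print(toy_loans.shape)
print(toy_loans)
```

Passing `ignore_index=True` gives the combined dataframe a fresh 0..n-1 index instead of repeating each file's own row numbers.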

In [None]:
# export 
loan_df = load_sample_data('ornl')
loan_df_orig = loan_df.copy()
loan_df.head(30)

### Descriptive Statistics (1D)
Let's look at some 1D and 2D descriptive statistics for this dataset.

This dataset contains several types of data: numerical, categorical, and ranked.  This small module takes you through what is typically done to quickly understand the data.


**Key Functions**
* df[column].value_counts()
* df.describe()
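
A minimal sketch of the two key functions on a toy dataframe (the values are invented; the column names mirror ones used in this lab):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "home_ownership": ["RENT", "MORTGAGE", "RENT", "OWN", "RENT"],
    "annual_inc": [40000.0, 85000.0, 52000.0, 61000.0, 47000.0],
})

# Categorical column: how often does each value occur?
ownership_counts = toy_df["home_ownership"].value_counts()
print(ownership_counts)

# Numerical column: count, mean, std, min, quartiles, max.
income_stats = toy_df["annual_inc"].describe()
print(income_stats)
```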

In [None]:
# export 
print("Descriptive Statistics")
# This function reports the number of rows/cols,
# information on the types of data,
# and descriptive statistics

# quick_overview_1d(loan_df)
categorical_overview,numerical_overview = quick_overview_1d_v2(loan_df)

In [None]:
print("Categorical report")
display(categorical_overview)
print("Numerical report")
display(numerical_overview)

Here we can get a quick assessment of the statistics for each column.  
**Quick Question**: What was the average income of the 133K loan applicants?

### Descriptive Statistics (2D)
Since we have over 100 numerical variables, creating a 2D correlation plot of all of them would be time consuming and difficult to interpret.  Let's look at correlations on a smaller scale for now.
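
For reference, pairwise correlations on a handful of columns can be computed directly with `DataFrame.corr()`. The sketch below uses invented values, and `toy_score` is a hypothetical column that exists only for this example; how `quick_overview_2d` renders its chart is not shown here.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "loan_amnt":   [5000, 10000, 15000, 20000, 25000],
    "funded_amnt": [5000, 9500, 15000, 18000, 25000],
    "toy_score":   [9, 7, 6, 4, 1],   # hypothetical column, falls as loan_amnt rises
})

# Pearson correlation between every pair of columns; values range from -1 to 1.
corr_matrix = toy_df.corr()
print(corr_matrix.round(2))
```

Values near 1 mean two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.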


In [None]:
# Grab only a subset of columns
cols = ["loan_amnt","annual_inc","dti","fico_range_high","open_acc",'funded_amnt', 'total_acc']
quick_overview_2d(loan_df, cols)

**Quick Question**: Can you find a variable in the chart above that is negatively correlated with annual_inc?  Can you think of a reason for this result?

### Create Loan Default column.  This is the column we will predict later
The **loan_status** column contains the information of whether or not the loan is in default. 

This column has more than just a 'default or paid' status.  Since our goal is to build a simple default classifier, we need to make a new column based on the **loan_status** column.

Here we will look at all the categorical values in **loan_status**, and create a new column called **default** based off that one.
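
The idea can be sketched as follows. Which statuses count as a default is a business decision, so the set used here is an assumption for illustration and may not match what `create_loan_default` does.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off",
                    "Late (31-120 days)", "Default"],
})

# Look at the categories first.
print(toy_df["loan_status"].value_counts())

# ASSUMPTION: this set of "bad" statuses is illustrative only.
assumed_default_statuses = {"Charged Off", "Default", "Late (31-120 days)"}

# 1 = treated as a default, 0 = paid or still current.
toy_df["default"] = toy_df["loan_status"].isin(assumed_default_statuses).astype(int)
print(toy_df)
```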


In [None]:
# export
print("Creating Loan Default column")
# function to create loan status .... 
loan_df = create_loan_default(loan_df)

In [None]:
loan_df.head(3) # scroll to the right, and see the new 'default' column

### Data Preparation - Handle Null Values aka NaNs ...

One part of the data science process that's especially time consuming is working with unclean data.  This Lending Club data set is a great example of that.  If you look at the dataframe shown above, you will see a number of columns with the indicator **NaN**.  This means 'not a number', which pandas uses to mark missing values, and it needs to be dealt with prior to any machine learning steps.  You have many options here.  Some are listed below.

* Fill with a value -> impute mean/median/min/max/other
* drop rows with NaNs
* drop columns with large number of NaNs 
* use data in other columns to derive

All these methods are possible, but it's up to the data scientist / domain expert to figure out the best approach.  There is definitely some grey area involved in deciding what's best.
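
A small sketch of several of these options on a toy dataframe. The 60% threshold and the choice of the median are assumptions for the example, and `mostly_empty` is a hypothetical column; the lab's helper functions may apply different rules.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 52000.0, 61000.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],   # hypothetical sparse column
    "dti": [12.5, 18.0, np.nan, 22.0],
})

# Count NaNs per column.
nan_counts = toy_df.isna().sum()
print(nan_counts)

# Option: drop columns where more than 60% of the values are missing.
missing_fraction = toy_df.isna().mean()
dense_df = toy_df.loc[:, missing_fraction <= 0.6]

# Option: impute the remaining gaps with each column's median.
imputed_df = dense_df.fillna(dense_df.median())

# Option: drop any row that still contains a NaN (none remain here).
clean_df = imputed_df.dropna()
print(clean_df)
```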

First, let's understand which columns have NaNs...

In [None]:
# For every column, count the number of NaNs .... 
# code hint : uses df.isna().sum()

columns_with_nans(loan_df)


As you can see, we have some work to do to clean up the NaN values.  Beyond NaN values, we also have to transform columns if they aren't formatted correctly, or maybe we want to transform a column based on custom requirements.  

```
Example : column=employee_length , values=[1,2,3,4,5,6,7,8,9,10+] formatted as a string
          transform into 
          column=employee_length, [0_3yrs,4_6yrs,gt_6yrs] (categorical:strings)
```
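
One way to sketch the transform above is with `pd.cut`. This assumes the raw values are plain strings such as "3" or "10+", as in the example; the real column may need extra parsing, and `handle_employee_length` may be implemented differently.

```python
import pandas as pd

toy_df = pd.DataFrame({"employee_length": ["1", "3", "5", "10+", "7", "4"]})

# Strip the trailing "+" and convert the string to an integer number of years.
years = toy_df["employee_length"].str.replace("+", "", regex=False).astype(int)

# Bin edges are right-inclusive: (-1, 3] -> 0_3yrs, (3, 6] -> 4_6yrs, (6, inf] -> gt_6yrs.
toy_df["employee_length_bin"] = pd.cut(
    years,
    bins=[-1, 3, 6, float("inf")],
    labels=["0_3yrs", "4_6yrs", "gt_6yrs"],
).astype(str)
print(toy_df)
```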
          
Luckily, the functions below take care of processing and cleaning this data.  In practice, **this is where data scientists spend a large portion of their time**, as cleaning the data requires detailed domain knowledge.  We have made a fair number of assumptions about how to process the data, which we won't go into due to time constraints for the lab.

In [None]:
# export
print("Handling Nulls and NaNs")
# The following cleaning of the data makes use of the steps shown below.....

# loan_df1 = drop_sparse_numeric_columns(loan_df, threshold=0.03)
loan_df1 = drop_sparse_columns(loan_df,pct_missing_threshold=0.6)
loan_df2 = impute_columns(loan_df1)
loan_df3 = handle_employee_length(loan_df2)
loan_df4 = handle_revol_util(loan_df3)
loan_df = loan_df4
columns_with_nans(loan_df4)


In [None]:
# Final Sanity check ....
# If we did our job right, there should not be any NaN's left.  
# Use this convenience function to check

# code hint df.isna().sum()

columns_with_nans(loan_df)

### Data Preparation - Handle Time Objects
Sometimes for columns that contain date information, you may want to break them down into individual columns like month, day, day of week etc.  For our use case, we will create a new column called `time_history` that will indicate how long an applicant has been a borrower.  This is an example of **feature engineering**: using business logic to create a new column (feature) that may have predictive value.
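
A rough sketch of this kind of feature engineering is shown below. It assumes the dates are stored as "Mon-YYYY" strings, uses an `issue_d` column as the loan issue date, and measures `time_history` in whole months. All three are assumptions for illustration; `create_time_features` may define the feature differently.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "earliest_cr_line": ["Jan-2001", "Jun-2010", "Dec-1995"],
    "issue_d": ["Mar-2016", "Mar-2016", "Dec-2016"],   # assumed issue-date column
})

# Parse the assumed "Mon-YYYY" strings into timestamps.
earliest = pd.to_datetime(toy_df["earliest_cr_line"], format="%b-%Y")
issued = pd.to_datetime(toy_df["issue_d"], format="%b-%Y")

# Length of credit history in whole months at the time the loan was issued.
toy_df["time_history"] = (
    (issued.dt.year - earliest.dt.year) * 12
    + (issued.dt.month - earliest.dt.month)
)

# Breaking a date into parts is done the same way, e.g. the issue month.
toy_df["issue_month"] = issued.dt.month
print(toy_df)
```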

In [None]:
#loan_df[loan_df["earliest_cr_line"]!="unknown"]

In [None]:
# export
print("Data Preparation Handling Time Objects")
loan_df = create_time_features(loan_df)
loan_df.head(3)

### Convert Categorical Data to One-Hot Encoding ###

If you look above at the data frame, we are almost ready to start building models.  However, there is one important step to complete.  Notice we have some columns that are still built out of string data 
```
example column=home_ownership values=[RENT, MORTGAGE, OWN]
```
Most machine learning algorithms only process numerical data, so we need to transform these **categorical columns** into **indicator columns**.

From the example above, the transform would yield 3 new columns

```
example column=RENT values=[0,1]
        column=MORTGAGE values=[0,1]
        column=OWN values=[0,1]
```

Conveniently, pandas has a function called **get_dummies** that we will use for this purpose.
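
A minimal sketch of `pd.get_dummies` on the `home_ownership` example (toy values). By default pandas prefixes each new column with the original column name, so the output columns are named slightly differently from the simplified example above.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# Replace the string column with one 0/1 indicator column per category.
encoded_df = pd.get_dummies(toy_df, columns=["home_ownership"], dtype=int)
print(encoded_df)
```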

In [None]:
# export
print("Transforming Data into binary indicator columns")
# Transform categorical data into binary indicator columns
# code hint, uses pd.get_dummies

loan_df = one_hot_encode_keep_cols(loan_df)
loan_df.head() # once complete, see how many new columns you have!

### Final Result after data preparation ....

Ok, so you made it here. Let's take a look at the final results of your data preparation work.  It may be helpful to **qualitatively compare** your original data frame to this one and see how different they look.  Execute the cells below to get a sense of what the transformations accomplished.

In [None]:
loan_df_orig.head(3)

In [None]:
loan_df.head(3)

### Data Visualization
As you saw, when you 'describe' a data frame, you get a table of statistics showing you the mean, min, max and other statistics about each column.  This is good, but sometimes it's also good to look at histograms of the data.  Let's visualize some of the distributions from our dataset.


<img src="https://raw.githubusercontent.com/dustinvanstee/aicoc-ai-immersion/master/nb_images/datavis_iterative_workflow.png" width="800" height="500" align="middle"/>

In [None]:
# Here we plot distribution charts for all the numerical columns in our dataframe
plot_histograms(loan_df_orig)

## Data Visualization Examples
### The Growth of Lending Club

In [None]:
# Take a random sample of 5,000 loans to keep the visualizations below fast
vis_df = loan_df_orig.copy().sample(5000)

### Outcome Variable: Loan Status
On the left is the breakdown of all loan status classifications.  On the right is our simple default classification based on our data prep

In [None]:
# Example of a groupby and aggregate 
df=vis_df[['loan_status', 'loan_amnt', 'funded_amnt']].groupby(['loan_status']).agg(['sum', 'count'])
df.columns = ['_'.join(col).strip() for col in df.columns.values]
df=df.reset_index()
df

In [None]:
import seaborn as sns
plt.figure(figsize=(20,5))
sns.set(style="whitegrid")

#tips = sns.load_dataset("tips")
#ax = sns.barplot(x="loan_status", y="loan_amnt_count", data=df)
#ax = sns.barplot(x="loan_status", y="loan_amnt", data=vis_df)
ax = sns.countplot(x="loan_status", hue="home_ownership", data=vis_df)


### Borrowing by State 
Most of the money in terms of absolute borrowing is borrowed by people from California. For average loan amount per state, Alaska ranks on top.

In [None]:
# Create an aggregated dataframe ...
df=vis_df[['addr_state', 'loan_amnt', ]].groupby(['addr_state']).agg(['mean', 'sum'])
df.columns = ['_'.join(col).strip() for col in df.columns.values]
df=df.reset_index()
df


In [None]:
#Plotly Choropleth Example 
import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locations=df['addr_state'], # Spatial coordinates
    z = df['loan_amnt_mean'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "USD",
))

fig.update_layout(
    title_text = '2016 Lending Club Avg Loans by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()    

###  Loan Purpose Wordcloud
Let's try to get a sense of why people are borrowing...

In [None]:
from wordcloud import WordCloud

# Join all loan titles into one lowercase string for the word cloud
text = " ".join(str(purpose).lower() for purpose in loan_df_orig['title'])
wordcloud = WordCloud(background_color="white").generate(text)
# Display the generated image:
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


As you can see, this could go on forever, but hopefully you get a sense of the power of data visualization.

### Modelling Phase

<img src="https://raw.githubusercontent.com/dustinvanstee/aicoc-ai-immersion/master/nb_images/modeling_iterative_workflow.png" width="800" height="500" align="middle"/>

### Train / Test set creation

One of the key points in any machine learning workflow is the **partitioning** of the data set into **train** and **test** sets.  The key idea here is that a model is built using the training data, and evaluated using the test data.  

There are more nuances to how you partition data into train/test sets, but for purposes of this lab we will omit these finer points.
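
For readers who want to see the basic mechanics, here is a sketch using scikit-learn's `train_test_split` on invented data. The `stratify` option, which keeps the default rate similar in both sets, is one of the nuances mentioned above; whether `lendingclub_ml.create_train_test` uses it is not shown in this lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "loan_amnt": rng.integers(1000, 35000, size=100),
    "dti": rng.uniform(0, 40, size=100),
    "default": [1] * 12 + [0] * 88,   # invented labels, 12% defaults
})

toy_X = toy_df.drop(columns=["default"])
toy_y = toy_df["default"]

# Hold out 33% of the rows for testing; stratify keeps the class mix similar.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, stratify=toy_y, random_state=0)

print(toy_X_train.shape, toy_X_test.shape)
print(toy_y_train.mean(), toy_y_test.mean())
```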

In [None]:
%load_ext autoreload
%autoreload 2
from lc_utils_2020 import *

In [None]:
# export
print("Train Test Set Creation")
# Instantiate the lendingclub_ml object that will hold our train/test sets and the methods used for modeling.
# Implemented this way so users don't have to keep track of train/test sets for the different
# models we are going to build.

my_analysis = lendingclub_ml(loan_df)

In [None]:
# export

# Create a train / test split of your data set.  Parameter is the fraction of the data held out as the test set
# Returns data in the form of dataframes

my_analysis.create_train_test(test_size=0.33)

## Dimension Reduction
For this modeling exercise we will perform a couple of tasks, **dimension reduction** and **classification** as shown in the following diagram.

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-modeling-workflow.png" width="800" height="500" align="middle"/>

**Dimension Reduction** is useful in scenarios when you have a large number of columns and you would like to reduce that down to a compressed representation.  In this lab we will try 2 methods of dimension reduction.  It will be your choice to decide which method you want to use for the classification part of the lab! (You could even decide to bypass this if you want...)


### Dimension Reduction - PCA

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-pca.png"  width="200" height="125" align="middle"/>

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

A simple way to think about PCA is that it helps compress the data in a lossy representation of the original dataset.

This will also be used to help us visualize the data as you will see below
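
As a standalone sketch of the technique (not the code behind `build_pca_model`), the example below standardizes some invented, deliberately correlated data, fits scikit-learn's `PCA`, and prints the cumulative explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 rows, 6 columns; columns 3-5 are noisy copies of columns 0-2, so they are correlated.
base = rng.normal(size=(200, 3))
toy_X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

# PCA is sensitive to scale, so standardize each column first.
toy_X_scaled = StandardScaler().fit_transform(toy_X)

# Compress the 6 columns down to 3 principal components.
toy_pca = PCA(n_components=3)
toy_components = toy_pca.fit_transform(toy_X_scaled)

# Fraction of the total variance captured by the first k components.
cumulative_variance = np.cumsum(toy_pca.explained_variance_ratio_)
print(toy_components.shape)
print(cumulative_variance)
```

Because half of the toy columns are near-duplicates of the other half, three components capture almost all of the variance. Real data rarely compresses this cleanly.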

In [None]:
# Dimension Reduction using PCA
# PCA model saved in data structure ..
my_analysis.build_pca_model(n_components=30)

In the chart above, you can see that we get ok results from PCA.  Using the first 20 principal components, we can account for ~50% of the variance described in the dataset.  **Feel free to change the number of principal components above to see if adding more helps with explained variance.**

### Now update our test dataframe with new columns that are predicted by our PCA.  

Here we will now take the models that we built and pass our test data set through them.   By doing this, we will have reduced the number of features in our data set by a significant amount (~177 columns => ~5-20 columns!).

In this step we will add new columns to our test/train data frames for our PCA model.  Don't worry about the details of this step; it's just required for some follow-on visualization and training steps ahead.



In [None]:
my_analysis.add_pca_columns_to_df()

In [None]:
#[x for x in my_analysis.train_df.columns if 'PC' in str(x)]
#for c in my_analysis.train_df.columns: 
#    print(c)
#display(my_analysis.X_train_scaled.head(10))
#my_analysis.train_df.head(10)

### Cool Visualizations using our dimension reduction columns

Next we will plot a few scatterplot grids based on our principal component and autoencoder representations of the data.

We will color each data point using this key
```
Green -> Fully paid or current loan
Red   -> Loan in default
```

In [None]:
# This will take a minute or so ...
my_analysis.visualize_dimred_results(mode='pca')

In [None]:
# Example : Filter based on Principal Components ..
a = my_analysis.test_df
f_df = a[(a['PC0'] < 1) & (a['PC1'] <0)]
display(f_df.describe())
display(loan_df.describe())

If you can discern a pattern between the red / green dots, it's likely we can use a classifier to automatically separate them! We'll see that in a few more sections.

## Lending Club Default Prediction using Random Forest

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-modeling-traintest.png"  width="600" height="375" align="middle"/>

Here we will build a classifier to predict whether a loan will default.  We will use a **Random Forest** algorithm.  You will have 2 options for data sources:
* the raw data
* PCA dimension reduction features

To evaluate our model, we will use a simple contingency table (showing true/false positives/negatives).  However, this is a fairly simplistic method.  Better methods that data scientists use are the F1 score and PR/ROC curves, but that's beyond the scope of this lab.
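
To make the four cells of a contingency table concrete, here is a small sketch with invented labels, treating 1 (default) as the positive class. It uses `pd.crosstab` rather than the plotting helper used later in the lab.

```python
import pandas as pd

# Invented labels: 1 = default (positive class), 0 = paid.
toy_actual = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 0], name="actual")
toy_predicted = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 1, 0], name="predicted")

# Rows are the true labels, columns are the predicted labels.
table = pd.crosstab(toy_actual, toy_predicted)
print(table)

true_negative = table.loc[0, 0]    # paid, predicted paid
false_positive = table.loc[0, 1]   # paid, predicted default
false_negative = table.loc[1, 0]   # default, predicted paid
true_positive = table.loc[1, 1]    # default, predicted default
print(true_negative, false_positive, false_negative, true_positive)
```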

Step 1 here is to set our baseline result.  In this example, we are dealing with a **skewed** dataset.  This means, on average, most people will not default, and they pay their loan off.  If you built a classifier that just predicted no default, you would be right most of the time.  Let's see the stats from our dataset below....

In [None]:
# export 
print("Setting Baseline")
# Set our baseline
my_analysis.train_df['default'].describe()

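As you can see, **only ~12.8% of the applicants default**.  A classifier that always predicts 'no default' would therefore already be right about 87.2% of the time, so any classifier we build must beat that baseline, or we aren't doing a very good job ;)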
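
The baseline arithmetic can be checked with a few lines of NumPy. The labels below are invented to match the ~12.8% default rate quoted above; they are not the lab's data.

```python
import numpy as np

# Toy labels with a 12.8% default rate: 128 defaults in 1,000 loans.
toy_y = np.array([1] * 128 + [0] * 872)

default_rate = toy_y.mean()

# A "classifier" that predicts 'no default' (0) for every loan.
always_paid_predictions = np.zeros_like(toy_y)
baseline_accuracy = (always_paid_predictions == toy_y).mean()

# High accuracy, yet it never identifies a single default.
defaults_caught = int(((always_paid_predictions == 1) & (toy_y == 1)).sum())

print(f"default rate: {default_rate:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"defaults caught: {defaults_caught}")
```

This is why accuracy alone can be misleading on a skewed dataset, and why the confusion matrix is worth inspecting.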

In [None]:
# export

# modes
# pca           : principal components only
# raw           : all the data non reduced
# raw_no_grades : all the data non reduced except the grade info provided by lending club

import re

mode = 'raw' # pca, raw, raw_no_grades
x_cols = []
if mode == 'pca':
    x_cols = [x for x in my_analysis.train_df.columns if 'PC' in x]
elif mode == 'raw':
    x_cols = [x for x in my_analysis.train_df.columns if 'PC' not in x]
    x_cols.remove('default')
elif mode == 'raw_no_grades':
    x_cols = [x for x in my_analysis.train_df.columns if 'PC' not in x]
    # drop columns whose names start with A-G (the one-hot grade columns)
    x_cols = [x for x in x_cols if not re.match('^[ABCDEFG]', x)]
    x_cols.remove('default')

#print(x_cols)


### Random Forest Example

In [None]:
# export 
print("Random Forest Example")
# Build a dataframe with selected columns ...
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = my_analysis.train_df[x_cols]
y = my_analysis.Y_train

In [None]:
# export

## Simple Random Forest Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
clf = RandomForestClassifier(max_depth=5,n_estimators=300, random_state=0)

clf.fit(X,y)



### Confusion Matrix

In [None]:
# export
print("Confusion Matrix")
X_test = my_analysis.test_df[x_cols]
Y_test_predict = np.where(clf.predict(X_test) > 0.5, 1, 0 )

cnf_matrix =confusion_matrix(my_analysis.Y_test, Y_test_predict)
# confusion_matrix orders classes by label value: 0 (paid) first, then 1 (default)
class_names =  ['Paid','Default']
plot_confusion_matrix(cnf_matrix, class_names)

### Explainability : RF Feature Importance

In [None]:
# export
print("RF Feature Importance")
# Random Forest Explainability
#print(len(clf.feature_importances_))
#print(len(X.columns))
rf_feature_importance = dict(zip(X.columns,clf.feature_importances_))


sort_rf_feature_importance = sorted(rf_feature_importance.items(), key=lambda x: x[1], reverse=True)

for i in sort_rf_feature_importance:
    print(i[0], i[1])

## SnapML vs SciKit Learn Demo

In [None]:
type(y)
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [None]:
# Import the LogisticRegression from snap.ml
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score
from pai4sk import LogisticRegression
snapml_lr = LogisticRegression(use_gpu = True, device_ids = [0],
                        num_threads = 256,
                        fit_intercept = True, regularizer = 0.01)

In [None]:
# Training
#t0 = time.time()
#snapml_lr.fit(X.to_numpy(), y.to_numpy())
#print("[snap.ml] Training time (s):  {0:.2f}".format(time.time()-t0))

# Learn more about SnapML here: https://www.zurich.ibm.com/snapml/

### Credits 
* Bob Chesebrough - IBM AICoC Data Scientist
* Catherine Cao - IBM FSS Data Scientist
* Dustin VanStee - IBM AICoC Data Scientist
* Travis Siegfried - AI & Vision Solution Architect

* [Hands on Machine Learning - Geron](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/)
