<h1 id="tocheading">Finding Patterns in Data using IBM Power and PowerAI</h1>
**TODO** : Insert nice banner here
(techu / powerAI / sklearn / tensorflow /keras banner here)
<div id="toc"></div>

In this lab we will explore an open source data set and see how the tools that are part of PowerAI can be used to discover patterns in the data.  We will use the Lending Club data set.  Here is a brief description of Lending Club.


<img src="https://github.com/CatherineCao2016/pics/raw/master/lcintro.png" width="800" height="500" align="middle"/>

[Lending Club (LC)](https://www.lendingclub.com/) is the world’s largest online marketplace connecting borrowers and investors. It is transforming the banking system to make credit more affordable and investing more rewarding. Lending Club operates at a lower cost than traditional bank lending programs and passes the savings on to borrowers in the form of lower rates and to investors in the form of solid risk-adjusted returns.

The original data set was downloaded from [LC](https://www.lendingclub.com/info/download-data.action) and covers complete loan data for all loans issued from 2007 through 2018, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Additional features include credit history, number of finance inquiries, address (zip code and state), and collections, among others. It is quite rich and is really the only quality source of credit risk data we could locate; there is a large amount of value in this kind of data and companies guard it relentlessly. Goldman Sachs’ new peer-to-peer lending platform called Marcus was built almost entirely using the Lending Club data.

**Important**

In this notebook, we will work with the LC data, conduct a set of exploratory analyses, and apply various machine learning techniques to predict borrower default. We took a small sample of loans made in 2017 (150K) to help speed up the processing time for the lab.


Note: to keep verbose code out of the notebook, we make use of a utility Python file called lc_utils.py.  For implementation details you can refer here  **TODO : link**

## Quick word on the data science method
<img src="https://github.com/dustinvanstee/random-public-files/raw/master/dsx-methodology.png" width="900" height="700" align="middle"/>

Here we will use these simple high-level steps to work through a typical data science problem.  The workflow is meant as a guide; in practice the process is highly iterative.


### Problem Understanding

**High level use case** - predict credit default using lendingclub.com data

**TODO** - add some commentary here

**TODO** - link data dictionary, and add some commentary

### Goals

* Perform some initial analysis of the data for **Business Understanding**
* **Prepare the Data** for our visualization and modeling
* **Visualize** the data
* Model using **Dimension Reduction** and **Classification** techniques
* **Evaluate** the approach

## Data Understanding and Preparation
<img src="https://github.com/dustinvanstee/random-public-files/raw/master/data-preparation.png" width="800" height="500" align="middle"/>

### Import Libraries

In [14]:
# Libraries needed to run this lab
import glob
import math
import time
import warnings
from datetime import datetime

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Widen the pandas display limits so wide data frames are not truncated
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Custom libraries with helper functions for this lab
from lc_utils import *
from myenv import *

In [11]:
# TODO : remove later ....

%load_ext autoreload
%autoreload 2
#%unload_ext lc_utils
#from lc_utils import *
# %reload_ext lc_utils


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Load the Data
Here we load data that was previously downloaded from lendingclub.com.  To keep the lab fast, we restrict the number of loans to roughly 180K.

In [13]:
loan_df = load_sample_data()
loan_df.head()

CLASS_ENVIRONMENT = nimbix
**load_sample_data** : Setting data location to /dl-labs/mldl-101/lab5-powerai-lc/


IndexError: list index out of range

### Descriptive Statistics (1D)
Let's look at some 1D and 2D descriptive statistics for this dataset.

This dataset contains several types of data: numerical, categorical, and ranked.  This small module will take you through what is typically done to quickly understand the data.



In [None]:
quick_overview(loan_df)

Here we can get a quick assessment of the statistics for each column.  
**Quick Question**: what was the average income for the 188K loan applicants?

### Descriptive Statistics (2D)
Since we have 113 numerical variables, creating a 2D correlation plot of all of them may be time-consuming and difficult to interpret.

**TODO** write a nice util that shows just a couple vars of interest ....
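
One way to keep a 2D view readable is to restrict the correlation matrix to a handful of columns. The following is only a minimal sketch on made-up rows; the column names are illustrative choices and the selection is an assumption, not something the lab's utilities are known to do.

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration only.
df = pd.DataFrame({
    "loan_amnt": [5000, 10000, 15000, 20000, 25000],
    "installment": [160.0, 330.0, 480.0, 650.0, 800.0],
    "annual_inc": [40000, 90000, 55000, 60000, 75000],
})

# Assumed variables of interest; pick whichever columns matter to you.
cols_of_interest = ["loan_amnt", "installment", "annual_inc"]

# Pairwise Pearson correlation for just those columns.
corr = df[cols_of_interest].corr()
print(corr.round(2))
```

A 3x3 table like this is easy to read at a glance, whereas a 113x113 matrix usually needs a heatmap and some filtering.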

### Create Loan Default column.  This is the column we will predict later
The **loan_status** column contains the information of whether or not the loan is in default. Here we will look at all the categorical values in loan_status, and create a new column based off that one.


In [None]:
loan_df = create_loan_default(loan_df)
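
The idea behind this step can be sketched in plain pandas. This is not the implementation in lc_utils.py; the sample rows and the set of statuses treated as a default are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample of loan_status values; the real column may contain others.
sample = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off",
                    "Late (31-120 days)", "Default", "Current"]
})

# Assumption: these statuses count as a default for this illustration.
BAD_STATUSES = {"Charged Off", "Default", "Late (31-120 days)"}

# Look at the categories first, then derive a 0/1 target column from them.
status_counts = sample["loan_status"].value_counts()
sample["default"] = sample["loan_status"].isin(BAD_STATUSES).astype(int)

print(status_counts)
print(sample)
```

Which statuses belong in the "bad" set is a business decision, so it is worth checking the category counts before settling on a definition.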

### Handle Null Values ... Impute later for key columns ....

* handle null values (fill zero, impute mean/median/min/max/other, drop, etc)
* handle values that need to be re-cast (ie string to int, etc etc)
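
As a rough sketch of the options listed above, the snippet below drops a mostly empty column, imputes a median, and re-casts a string column. The data, the column choices, and the 50% threshold are assumptions for illustration and do not reflect the choices made by the lab's cleaning routine.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the kinds of problems listed above.
df = pd.DataFrame({
    "annual_inc": [50000.0, np.nan, 72000.0, 61000.0],
    "mths_since_last_delinq": [np.nan, np.nan, np.nan, 12.0],
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
})

# Count missing values per column.
nan_counts = df.isna().sum()

# Drop columns that are mostly empty (assumed threshold: more than 50% missing).
sparse_cols = nan_counts[nan_counts / len(df) > 0.5].index
df = df.drop(columns=sparse_cols)

# Impute the remaining numeric gap with the column median.
df["annual_inc"] = df["annual_inc"].fillna(df["annual_inc"].median())

# Re-cast a string column to an integer by extracting its digits.
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

print(df)
```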


In [None]:
loan_df.describe()
columns_with_nans(loan_df)


As you can see, we have some work to do to clean up the NaN values.  Luckily, we have a routine, used below, that processes and cleans this data.  In practice, this is where data scientists spend a large portion of their time, since cleaning data requires detailed domain knowledge.  We have made a fair number of assumptions about how to process the data, which we won't go into due to time constraints for the lab.

In [None]:
#loan_df1 = drop_sparse_numeric_columns(loan_df)
#loan_df2 = drop_columns(loan_df1)
#loan_df3 = impute_columns(loan_df2)
#loan_df4 = handle_employee_length(loan_df3)
#loan_df5 = handle_revol_util(loan_df4)
#loan_df6 = drop_rows(loan_df5)

loan_df = clean_lendingclub_data(loan_df)


In [None]:
# Final Sanity check ....
columns_with_nans(loan_df)

### Data Preparation - Handle Time Objects
Sometimes, for columns that contain date information, you may want to break them down into individual columns such as month, day, or day of week.  For our use case, we will create a new column called `time_history` that indicates how long an applicant has been a borrower.  This is an example of **feature engineering**: using business logic to create a new column (feature) that may have predictive value.

In [None]:
loan_df = create_time_features(loan_df)
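
A feature like `time_history` could be derived along the following lines. This is a sketch under assumptions: it supposes the dates are stored as strings such as `Dec-2017` in columns named `issue_d` and `earliest_cr_line`, and it measures the history in months. The actual helper in lc_utils.py may do this differently.

```python
import pandas as pd

# Hypothetical rows; the column names and date format are assumptions.
df = pd.DataFrame({
    "issue_d": ["Dec-2017", "Jan-2017"],
    "earliest_cr_line": ["Dec-2007", "Jul-2010"],
})

# Parse the month-year strings into timestamps.
issue = pd.to_datetime(df["issue_d"], format="%b-%Y")
earliest = pd.to_datetime(df["earliest_cr_line"], format="%b-%Y")

# Months between the first credit line and the loan issue date.
df["time_history"] = (
    (issue.dt.year - earliest.dt.year) * 12
    + (issue.dt.month - earliest.dt.month)
)

# A date column can also be split into simple calendar parts.
df["issue_month"] = issue.dt.month

print(df)
```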

### Convert Categorical Data to One-Hot Encoding
**TODO** explain here 

In [None]:
loan_df = one_hot_encode_keep_cols(loan_df)
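
One-hot encoding replaces a categorical column with one 0/1 column per category, so that models expecting numeric input can use it. A minimal sketch with `pandas.get_dummies` follows; the sample data is made up and this is not necessarily how `one_hot_encode_keep_cols` is implemented.

```python
import pandas as pd

# Hypothetical rows with one numeric and one categorical column.
df = pd.DataFrame({
    "loan_amnt": [10000, 5000, 12000],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Each category becomes its own indicator column; numeric columns pass through.
encoded = pd.get_dummies(df, columns=["home_ownership"], dtype=int)

print(encoded)
```

Columns with many distinct values (zip codes, for example) expand into many indicator columns, which is one reason the dimension reduction step later in the lab is useful.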

### Final Result after data prep ....

In [None]:
loan_df.head()

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/data-visualization.png" width="800" height="500" align="middle"/>

### Data Visualization
As you saw, when you 'describe' a data frame, you get a table of statistics showing the mean, min, max, and other statistics for each column.  This is good, but sometimes it's also useful to look at histograms of the data.  Let's visualize some of the distributions from our dataset.


In [None]:
plot_histograms(loan_df)
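
To see why a histogram adds information beyond summary statistics, consider the small numeric sketch below. It uses invented values and `numpy.histogram` to compute bin counts without plotting; the bin edges are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical skewed values: most are small, a few are very large.
values = np.array([1, 1, 2, 2, 2, 3, 3, 10, 40, 100], dtype=float)

# Summary statistics alone hide the shape of the distribution.
mean_value = values.mean()
median_value = np.median(values)

# Bin counts show where the observations actually sit.
counts, edges = np.histogram(values, bins=[0, 5, 25, 50, 100])

print("mean:", mean_value, "median:", median_value)
print("counts per bin:", counts.tolist())
```

The mean is pulled far above the median by a few large values, and the bin counts make that long tail visible immediately.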

In [None]:
### A word on visualization libraries.


# Brunel Example
## The Growth of Lending Club
### Here we use the built-in Brunel visualization package
Lending Club has been expanding over the years in terms of total loan volume and average loan size.

In [None]:
# Build a statistics data frame based on issue date
# aggregate on loan amount
# loan_stats = pd.concat([loan_df.groupby('issue_d').mean()['loan_amnt'].to_frame().rename(columns = {'loan_amnt':'loan_average'}), loan_df.groupby('issue_d')['loan_status'].count().to_frame().rename(columns = {'loan_status':'loan_count'})], axis=1)

In [None]:
#%brunel data('loan_stats') line x(ISSUE_D) y(loan_average, loan_count) color(#series) tooltip(#all) :: width=900, height=350 

**TODO** Add more Brunel if i get it in nimbix

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/modeling.png" width="800" height="500" align="middle"/>

### Train / Test set creation

In [None]:
%load_ext autoreload
%autoreload 2
from lc_utils import *

In [None]:
loan_df.head()
my_analysis = lendingclub_ml(loan_df)

In [None]:
# Create a train / test split of your data set.  The parameter is the fraction of rows held out for the test set
# Returns data in the form of dataframes
my_analysis.create_train_test(test_size=0.4)
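
For reference, a comparable split can be made directly with scikit-learn. This is a sketch on made-up data, not the body of `create_train_test`; in particular, stratifying on the target is an assumption and the lab's helper may not do it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set: 100 loans, 20% of which default.
df = pd.DataFrame({
    "loan_amnt": np.arange(100),
    "default": [1] * 20 + [0] * 80,
})

# Hold out 40% of the rows; stratify so both parts keep the same default rate.
train_df, test_df = train_test_split(
    df, test_size=0.4, stratify=df["default"], random_state=0
)

print(len(train_df), len(test_df))
print(train_df["default"].mean(), test_df["default"].mean())
```

Stratifying matters most for skewed targets like loan default, where a purely random split can leave one part with noticeably fewer positive examples.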

In [None]:
### Correlation to default [REMOVE]
#my_analysis.train_df.dtypes
corr_vs_1var(my_analysis.train_df, 'default')
#my_analysis.train_df.head()
#my_analysis.train_df['default'].corr(my_analysis.train_df['TX'])

In [None]:
my_analysis.X_train_scaled.head()

For this modeling exercise we will perform a couple of tasks, dimension reduction and classification as shown in the following diagram.

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-modeling-workflow.png" width="800" height="500" align="middle"/>

**Dimension Reduction** is useful in scenarios where you have a large number of columns and would like to reduce them to a compressed representation.  In this lab we will try 2 methods of dimension reduction.  It is your choice which method to use for the classification part of the lab! (You could even decide to bypass this step if you want.)


## Dimension Reduction - PCA

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-pca.png"  width="200" height="125" align="middle"/>

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

In [None]:
# dim red using PCA
my_analysis.build_pca_model(n_components=50)
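
The effect of PCA can be seen on a small synthetic example. The sketch below uses scikit-learn on random data in which half the columns are near-copies of the other half; the data and the choice of 3 components are assumptions for illustration and do not mirror `build_pca_model`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 6 columns, where columns 3-5 are noisy copies of 0-2.
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

# PCA is sensitive to scale, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the 3 directions of greatest variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance kept by the 3 components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(explained, 4))
```

Because the six columns carry only three independent signals, three components retain almost all of the variance. On real data, `explained_variance_ratio_` is a useful guide for choosing `n_components`.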


## Dimension Reduction - AutoEncoder

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-ae.png"  width="600" height="375" align="middle"/>

**TODO : AE Writeup **

In [None]:
my_analysis.build_ae_model(ae_layers=[100,25,6,25,100], regularization=0.001, epochs=1, folds=2, k_tries=1)

# Now update our test dataframe with new columns that are predicted by our PCA and Autoencoder models.  
Here we take the models that we built and pass our test data set through them.  By doing this, we reduce the number of features in our data set by a significant amount (177 => 5!).

In this step we will add new columns to our test/train data frames for both our PCA model and our autoencoder model.  This is required for some follow-on visualization and training steps ahead.

In [None]:
my_analysis.update_train_test_df()
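
Conceptually, this step fits the reducer on the training data only and then applies the same transformation to both frames. The sketch below shows that pattern for PCA on random data; the `f0`..`f7` feature names and the 2 components are hypothetical, and the real `update_train_test_df` may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical train/test frames with 8 numeric feature columns.
cols = [f"f{i}" for i in range(8)]
train_df = pd.DataFrame(rng.normal(size=(60, 8)), columns=cols)
test_df = pd.DataFrame(rng.normal(size=(40, 8)), columns=cols)

# Fit on the training data only, so no information leaks from the test set.
pca = PCA(n_components=2).fit(train_df[cols])

# Append the projected values as new PC0, PC1 columns on both frames.
for frame in (train_df, test_df):
    scores = pca.transform(frame[cols])
    for i in range(scores.shape[1]):
        frame[f"PC{i}"] = scores[:, i]

print(test_df.columns.tolist())
```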

In [None]:
my_analysis.visualize_dimred_results(mode='pca')

In [None]:
my_analysis.visualize_dimred_results(mode='ae')

In [None]:
bob_heatmap_lc(my_analysis.test_df,sortColumn='PC0',add_corr=1)

# Final Step - Lending Club Default Prediction

<img src="https://github.com/dustinvanstee/random-public-files/raw/master/techu-modeling-traintest.png"  width="600" height="375" align="middle"/>

Here we will build a classifier to predict whether a loan will default.  We will use a **Deep Learning Classifier**.  You will have 3 options for data sources:
* the raw data
* PCA dimension reduction features
* Autoencoder features

To evaluate our model, we will use a simple contingency table.  However, this is a fairly simplistic method.  Better methods that data scientists can use are the F1 score and PR/ROC curves.
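
As an illustration of these evaluation measures, the sketch below computes a contingency table (confusion matrix) and the F1 score with scikit-learn. The labels and predictions are made up and are not results from this lab.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels: 1 = default, 0 = no default.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(cm)
print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.8 while the F1 score for the default class is only 0.5, which shows how accuracy alone can flatter a model on a skewed dataset.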

Step 1 here is to set our baseline result.  In this example, we are dealing with a **skewed** dataset.  This means that most people do not default and pay off their loans.  If you built a classifier that always predicted no default, you would be right most of the time.  Let's see the stats from our dataset below.

In [None]:
# Set our baseline
my_analysis.train_df['default'].describe()
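
The baseline can also be computed explicitly. In the sketch below, the series is synthetic and was constructed to have a 15.5% default rate purely for illustration; with the real data you would use the `default` column of the training frame instead.

```python
import pandas as pd

# Synthetic 0/1 target with 155 defaults out of 1000 loans (assumed for illustration).
default = pd.Series([0] * 845 + [1] * 155)

# The mean of a 0/1 column is the fraction of positive cases.
default_rate = default.mean()

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = max(default_rate, 1 - default_rate)

print(default_rate, baseline_accuracy)
```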


As you can see, only 15.5% of the applicants default.  A classifier that always predicts no default would therefore be right about 84.5% of the time.  Any classifier we build must beat that accuracy, or we aren't doing a very good job ;)

In [None]:
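mode = 'pca'  # one of: 'pca', 'ae', 'all'

if mode == 'pca':
    x_cols = [x for x in my_analysis.train_df.columns if 'PC' in x]
elif mode == 'ae':
    x_cols = [x for x in my_analysis.train_df.columns if 'AE' in x]
elif mode == 'all':
    # Raw features: everything except the target and the PCA / autoencoder columns
    # (assumes the target column is named 'default', as used above)
    x_cols = [x for x in my_analysis.train_df.columns
              if x != 'default' and 'PC' not in x and 'AE' not in x]

my_analysis.build_evaluate_dl_classifier(x_cols, epochs=25)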


In [None]:
94.6% !
plot learning rate !!

### Credits (TODO)
* [Data Preparation](https://apsportal.ibm.com/analytics/notebooks/9ef75f73-140a-4618-9292-1de51f5f331c/view?projectid=399ab81a-5140-4d51-a5df-b6e82d51db85&context=analytics)
* [Data Visualization](https://apsportal.ibm.com/analytics/notebooks/24dd6830-8a01-4bde-b42d-1d040079af16/view?projectid=399ab81a-5140-4d51-a5df-b6e82d51db85&context=analytics)
* [Modelling/Eval/Deploy](https://apsportal.ibm.com/analytics/notebooks/2d84420b-e005-4b95-99f6-a56148305fbc/view?projectid=399ab81a-5140-4d51-a5df-b6e82d51db85&context=analytics)
* [Web App]()


