### Credit Risk
Predict whether people are a good or bad credit risk  
Last modified: May 16th 2017
### Skytree 16.0 Python SDK Demonstration
Execution of the code in this demonstration produces the equivalent project in the Skytree graphical user interface  
In the Jupyter (IPython) notebook, all cells can be executed at once via "Run All" or executed individually, or the code can be exported as a .py file

## 1. Business Problem
 - A credit card company has customers who have varying degrees of credit risk
 - They would like to predict which new applicants are likely to be a bad credit risk
 - This will enable better decisions to be made about whom to extend credit to and how much
 - We will build a supervised machine learning model to predict credit risk for this company

## 2. Set up Skytree
Here, we import the needed Skytree modules for this project  
Any Python module may also be imported as part of the dataflow

In [1]:
import skytree
import skytree.prediction
from skytree import DatasetConfig
from skytree.prediction import AutoModelConfig, GbtConfig, GbtrConfig

hostname = 'http://localhost:8080/v1'
username = 'trial@infosys.com'
password = 'Infosys1' # Plain text passwords can be avoided using the Python getpass module
datadir = '/user/infosys/datasets/'
projectname = 'Credit Risk Prediction'
projectdesc = 'Predict whether people are a good or bad credit risk'

skytree.authenticate(username, password, hostname)
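
As the comment on the password line notes, the plain-text password can be avoided  
A minimal sketch, assuming the password is either supplied in an environment variable or typed at a prompt. The variable name `SKYTREE_PASSWORD` and the helper `read_password` are hypothetical and not part of the SDK; only the standard `getpass` and `os` modules are used

```python
import getpass
import os


def read_password(env=None, prompt=getpass.getpass):
    """Return a password without hard-coding it in the notebook.

    SKYTREE_PASSWORD is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    password = env.get('SKYTREE_PASSWORD')
    if not password:
        # Fall back to an interactive prompt that does not echo the input
        password = prompt('Skytree password: ')
    return password


# Possible usage in place of the literal password:
# password = read_password()
# skytree.authenticate(username, password, hostname)
```

The prompt function is a parameter only so that the helper can be exercised without a terminal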

Create our project to include our datasets, models, results, and plots

In [2]:
project = skytree.create_project(projectname, projectdesc)

## 3. Data Preparation
### 3.1: Load Data
The create_dataset() project method loads data directly from HDFS  
Alternatively, a local file path can be specified  
Data can include numerical, categorical, sparse, and text columns  
Options not shown include the dataset delimiter and the dataset configuration, e.g., the ID column  
The .ready() blocking call ensures that this step in the dataflow is executed before further steps are attempted  
Datasets can also be retrieved by ID within an existing project, and viewed by functions such as .summary()

In [3]:
data = project.create_dataset(
    url = 'hdfs://{0}credit_risk_prediction.csv'.format(datadir),  # datadir already ends with '/'
    has_header = True, 
    missing_value = '?',
    name = 'Credit Data'
).ready()

### 3.2: Add ID Column
The data as supplied does not include a column of unique row identifiers, so we add one  
For training, ID is optional, but it is required for model testing or deployment  
This is because file line ordering is not necessarily preserved on a Hadoop distributed system

In [4]:
data = data.add_unique_id_column(
    'ID',
    name = 'Credit Data With ID'
).ready()
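
To illustrate why an ID matters when line order is not preserved, here is a small plain-Python sketch that matches predictions back to rows by ID rather than by position  
All values are invented for illustration and the code does not use the SDK

```python
# Rows and predictions may come back in different orders on a distributed system
rows = [
    {'ID': 1, 'class': 'good'},
    {'ID': 2, 'class': 'bad'},
    {'ID': 3, 'class': 'good'},
]
predictions = [
    {'ID': 3, 'predicted': 'good'},
    {'ID': 1, 'predicted': 'bad'},
    {'ID': 2, 'predicted': 'bad'},
]

# Index the predictions by ID, then look each row up by its own ID
by_id = dict((p['ID'], p['predicted']) for p in predictions)
matched = [(r['ID'], r['class'], by_id[r['ID']]) for r in rows]
print(matched)
```

Pairing the two lists by position instead would silently compare the wrong rows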

### 3.3: Commit Dataset
By default, Skytree does not load the full dataset immediately, since it may be large  
To indicate that the whole dataset is to be used, we "commit" the dataset

In [5]:
data.commit().ready()

<id=1640286499329128763, name="Credit Data With ID">

### 3.4: Transform Snippets
Skytree transform snippets allow arbitrary data preparation at scale using PySpark  
Shared snippets, whether supplied with the software or user-defined, can also be used directly in the GUI without coding  

Here, we normalize the current_balance column since it is on a significantly different scale to the others  
We could also normalize all columns or just run an algorithm where normalization is not needed, such as decision trees

Snippets can be run in "preview" form, which shows the result of the transform on the top 20 rows and so allows interactive development, or in "execute" form, which applies the snippet to the whole dataset

In [6]:
print(skytree.get_snippet_by_name('Normalize_unit'))

Name: Normalize_unit
Description: Rescaling to between minimum (default: 0) and maximum (default:1 ).
Id: 6590250852614889942
Visibility: Public
Input Arguments:
	colToNormalize of type: COLUMN with allowed types: [u'LongType', u'DoubleType'] from: df
Parameters:
	min of type: float kind: OPTIONAL default value: None
	max of type: float kind: OPTIONAL default value: None
Output Arguments:
	odf of type: DATAFRAME with allowed types: None from: None
Code:

# unitization with zero minimum ((x - min)/(max - min))
if min is None:
    min = df.agg({colToNormalize: 'min'}).collect()[0][0]

if max is None:
    max = df.agg({colToNormalize: 'max'}).collect()[0][0]

range = max - min
if range==0:
    odf = df.withColumn(colToNormalize, 0)
else:
    odf = df.withColumn(colToNormalize, (df[colToNormalize] - min)/range )
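
The same rescaling, (x - min)/(max - min), can be checked by hand on a few numbers  
This is a plain-Python sketch of the formula only, not the PySpark snippet itself; the function name and the balances are made up for illustration

```python
def normalize_unit(values, lo=None, hi=None):
    """Rescale values with (x - min) / (max - min)."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map it to zeros
        return [0.0 for _ in values]
    return [(x - lo) / float(span) for x in values]


balances = [250, 1000, 4000, 1750]  # made-up values, not from the dataset
print(normalize_unit(balances))
```

The smallest value maps to 0 and the largest to 1, with the others placed proportionally in between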



In [7]:
data.configure_snippets_transform() \
    .addSnippet('Normalize_unit') \
        .setInputVar('colToNormalize', 'current_balance', dataset_id=data.id) \
        .setOutputVar('odf', assign_name='transformed', \
                      output_dataset_name='Credit Data Transformed', id_column='ID') \
    .preview()

SUCCESS:
>>> 
>>> import re
>>> import os
>>> from hdfs.ext.kerberos import KerberosClient
>>> from hdfs import InsecureClient
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql.types import *
>>> import datasets
>>> from datasets import DatasetReader, DatasetWriter, DatasetInfo
>>> import sys
>>> 
>>> sqlContext = SQLContext(sc)
>>> 
>>> ds_info = datasets.DatasetInfo('/user/skytree/skytree/data_json/4609396799457303962/data/datasets/1640286499329128763/1640286499329128763.transformed.data', '/user/skytree/skytree/data_json/4609396799457303962/data/datasets/1640286499329128763/1640286499329128763.transformed.header', True, ',', 'ID', '?')
>>> reader = datasets.DatasetReader(ds_info)
>>> df = reader.spark_dataframe(sc, sqlContext)
>>> colToNormalize = 'current_balance'
>>> min = None
>>> max = None
>>> 
>>> 
>>> # unitization with zero minimum ((x - min)/(max - min))
>>> if min is None:
...     min = df.agg({colToNormalize: 'min'

u'SUCCESS'

In [8]:
output = data.configure_snippets_transform() \
    .addSnippet('Normalize_unit') \
        .setInputVar('colToNormalize', 'current_balance', dataset_id=data.id) \
        .setOutputVar('odf', assign_name='transformed', \
                      output_dataset_name='Credit Data Transformed', id_column='ID') \
    .execute()
    
data_transformed = output[0]
data_transformed = data_transformed.ready()

### 3.5: Split Into Training and Testing Sets
Splits can also be done into more than 2 files, e.g., training, tuning, and testing sets  
Splitting a file is not required in Skytree: machine learning training can be done with holdout, cross-validation, or Monte Carlo cross-validation on a single file, with portions of the dataset held out as appropriate  
Or, the user can specify a separate dataset to use for model tuning  
A separate testing set file is required for Predict & Evaluate, shown below
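
As an illustration of cross-validation on a single file, here is a minimal plain-Python sketch of dividing row indices into folds  
It is not the SDK's implementation; the function name and the way the seed is used are assumptions

```python
import random


def kfold_indices(n_rows, num_folds, seed):
    """Yield (train, tune) index lists; each row is held out exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    for k in range(num_folds):
        held_out = order[k::num_folds]   # every num_folds-th row, offset by k
        held = set(held_out)
        train = [i for i in order if i not in held]
        yield train, held_out


for train, tune in kfold_indices(10, 5, 123456):
    print('%d training rows, %d tuning rows' % (len(train), len(tune)))
```

A single holdout corresponds to using only one such split, and Monte Carlo cross-validation to repeating a random holdout several times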

In [9]:
training, testing = data_transformed.split(
    split_ratios = [7,3],
    split_seed = 123456,
    names = ['Credit Training','Credit Testing'], 
    configs = [DatasetConfig(id_column='ID'), 
               DatasetConfig(id_column='ID')])

training.ready()
testing.ready()

<id=7941020574107658642, name="Credit Testing">
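
The effect of `split_ratios = [7,3]` with a fixed seed can be pictured with a plain-Python sketch  
This is an assumption about the general idea (a seeded shuffle cut in proportion to the ratios), not the SDK's actual algorithm; `split_rows` is a hypothetical helper

```python
import random


def split_rows(rows, split_ratios, split_seed):
    """Shuffle rows reproducibly and cut them into parts sized by the ratios."""
    shuffled = list(rows)
    random.Random(split_seed).shuffle(shuffled)
    total = float(sum(split_ratios))
    parts, start, cumulative = [], 0, 0
    for ratio in split_ratios:
        cumulative += ratio
        end = int(round(len(shuffled) * cumulative / total))
        parts.append(shuffled[start:end])
        start = end
    return parts


training_rows, testing_rows = split_rows(range(1000), [7, 3], 123456)
print('%d training rows, %d testing rows' % (len(training_rows), len(testing_rows)))
```

Passing three ratios, e.g., `[6, 2, 2]`, gives training, tuning, and testing parts in the same way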

## 4. Run Machine Learning
### 4.1: Gradient Boosted Tree Grid Search
The first model is a gradient boosted tree (GBT) grid search with some values of tree depth and learning rate  
We begin by setting up the model configuration  
Not all of these options are required, and many more are available, e.g., point and score weights, ensemble GBT, class imbalance handling, other GBT hyperparameters and classification metrics, precision at k, rank-based loss functions, stochastic GBT, and more

In [10]:
# Import default GBT model configuration
gbt_gs_config = GbtConfig()

# K-fold cross-validation
gbt_gs_config.num_folds = 5

# Search 3 tree depths ...
gbt_gs_config.tree_depth = [2,3,4]

# ... in combination with 3 learning rates = grid of 9 points
gbt_gs_config.learning_rate = [0.02,0.04,0.06]

# Number of trees is tuned automatically over 10%:10%:100% of the maximum = 10:10:100 trees here -> 9 x 10 = 90 grid points
gbt_gs_config.num_trees = [100]

# Make search reproducible
gbt_gs_config.holdout_seed = 123456

# Tune for Gini index
gbt_gs_config.testing_objective = 'GINI'
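
The grid size quoted in the comments above can be checked by enumerating the combinations  
This sketch only counts grid points with the standard `itertools` module; it does not call the SDK, and the variable names are illustrative

```python
from itertools import product

tree_depths = [2, 3, 4]
learning_rates = [0.02, 0.04, 0.06]
max_trees = 100

# Tree counts at 10%, 20%, ..., 100% of the maximum
tree_counts = [max_trees * pct // 100 for pct in range(10, 101, 10)]

grid = list(product(tree_depths, learning_rates, tree_counts))
print(len(tree_depths) * len(learning_rates))  # depth/rate combinations
print(len(grid))                               # total grid points
```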

#### Training
Train GBT on the training dataset

In [11]:
gbt_gs_model = skytree.prediction.learn(
    training, 
    objective_column = 'class', 
    config = gbt_gs_config,
    name = 'GBT Grid Search'
).ready()

#### Testing
Apply trained model to unseen testing data  
This is equivalent to running "Predict & Evaluate" in the GUI  
The results (ROC curve, etc.) can be seen under the "Results" tab, or accessed from the SDK via .summarize()

In [12]:
gbt_gs_results = gbt_gs_model.test(
    testing,
    name = 'Testing GBT Grid Search'
).ready()
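
The testing objective used here is the Gini index  
A minimal sketch of how such a metric can be computed from scores, assuming the definition common in credit scoring, Gini = 2 x AUC - 1. The SDK's exact definition is not shown in this notebook, and the labels and scores below are invented

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini index under the assumed definition 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
print(gini(labels, scores))
```

Under this definition a perfect ranking gives 1 and a random ranking gives about 0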

### 4.2: Gradient Boosted Tree Smart Search
By default the smart search has various limits on its parameter ranges  
These can be manually adjusted if desired  
Here, we decrease the maximum number of trees because this is a small demonstration dataset  
We then run setup, training, and testing, this time using a holdout instead of cross-validation  
The seeds and regularization are set manually so that the demo is reproducible  
If not set, the seeds used are recorded as part of the model parameters

In [13]:
gbt_ss_config = GbtConfig()

gbt_ss_config.holdout_ratio = 0.3
gbt_ss_config.smart_search = True
gbt_ss_config.smart_search_iterations = 50
gbt_ss_config.num_trees = {'max': 100}
gbt_ss_config.testing_objective = 'GINI'

gbt_ss_config.holdout_seed = 234567
gbt_ss_config.smart_search_seed = 234567
gbt_ss_config.table_sampling_seed = 234567
gbt_ss_config.regularization = False

In [14]:
gbt_ss_model = skytree.prediction.learn(
    training,
    'class',
    gbt_ss_config,
    name = 'GBT Smart Search'
).ready()

In [15]:
gbt_ss_results = gbt_ss_model.test(
    testing,
    name = 'Testing GBT Smart Search'
).ready()

### 4.3: AutoModel
AutoModel is able to run multiple machine learning algorithms  
It acts as a generalization of Smart Search by both navigating the algorithm hyperparameter spaces and choosing between algorithms at each iteration  
Some of the algorithms require numerical-only data. Therefore, AutoModel ignores categorical columns (which can include missing values) for these algorithms so that they can be run  
This means that AutoModel can be run directly on the same credit data as the grid and smart searches above

Note that AutoModel performance is competitive with the other methods (Smart Search and grid search), but it is simpler to run because the user does not have to choose an algorithm or set algorithm hyperparameters  
In fact, AutoModel can be run with even fewer auto_config parameters than those given here

In [16]:
auto_config = AutoModelConfig() 

auto_config.holdout_ratio = 0.3
auto_config.holdout_seed = 234567
auto_config.smart_search_iterations = 30
auto_config.smart_search_seed = 234567
auto_config.testing_objective = 'GINI'

auto_model = skytree.prediction.learn(
    training,
    'class',
    auto_config,
    name = 'Automodel'
).ready()

auto_results = auto_model.test(
    testing,
    name = 'Testing AutoModel'
).ready()

### 4.4: AutoModel with AutoFeaturization
In addition to running the required prerequisite transforms, AutoModel can also be run with auto-featurization  
Currently, Skytree's built-in auto-featurization is quite basic, but it will rapidly expand in future releases  
Auto-featurization creates a new dataset with normalized numerical columns, horizontalized categorical columns, missing values imputed, and zero variance columns removed  
This means that columns that would be ignored when only the prerequisite transforms are run can still be utilized in AutoModel
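
The featurization steps listed above can be pictured with a small plain-Python sketch  
It assumes that "horizontalized" means one 0/1 indicator column per category level, and it assumes mean imputation for numerical columns and most-frequent-value imputation for categorical ones. Skytree's actual choices are not shown in this notebook. The `region` column and all values are invented

```python
from collections import Counter


def auto_featurize(rows, numeric, categorical):
    """Return (column names, rows of floats) after the four featurization steps."""
    columns = {}
    for name in numeric:
        present = [r[name] for r in rows if r[name] is not None]
        mean = sum(present) / float(len(present))      # assumption: mean imputation
        filled = [mean if r[name] is None else r[name] for r in rows]
        lo, hi = min(filled), max(filled)
        span = float(hi - lo)
        columns[name] = [(v - lo) / span if span else 0.0 for v in filled]
    for name in categorical:
        present = [r[name] for r in rows if r[name] is not None]
        mode = Counter(present).most_common(1)[0][0]   # assumption: mode imputation
        filled = [mode if r[name] is None else r[name] for r in rows]
        for level in sorted(set(filled)):
            # One 0/1 indicator column per category level
            columns['%s=%s' % (name, level)] = [float(v == level) for v in filled]
    # Remove zero-variance columns (every value identical)
    kept = sorted(k for k, v in columns.items() if len(set(v)) > 1)
    return kept, [[columns[k][i] for k in kept] for i in range(len(rows))]


rows = [
    {'current_balance': 100.0, 'purpose': 'car', 'region': 'x'},
    {'current_balance': None, 'purpose': 'car', 'region': 'x'},
    {'current_balance': 300.0, 'purpose': 'tv', 'region': 'x'},
    {'current_balance': 200.0, 'purpose': None, 'region': 'x'},
]
names, matrix = auto_featurize(rows, ['current_balance'], ['purpose', 'region'])
print(names)
```

The constant `region` column disappears, while `purpose` becomes two indicator columns that a numerical-only algorithm can use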

In [17]:
auto_model_f = skytree.prediction.learn(
    training,
    'class',
    auto_config,
    name = 'Automodel with AutoFeaturize',
    autoFeaturize = True
).ready()

As above, AutoModel can then be run on an unseen testing set  
With auto-featurization switched on, the resulting interpretation is in terms of the columns created after featurization (e.g., the horizontalized ones)  
This may be improved in future releases

In [18]:
auto_results_f = auto_model_f.test(
    testing,
    name = 'Testing AutoModel with AutoFeaturize'
).ready()

### 4.5: Other Predictions
#### Regression
Other predictions can be performed on the data, e.g., regression to predict someone's credit usage instead of their risk  
Here, we run GBT regression with Smart Search  
As with classification, many further options are available such as other regression metrics

In [19]:
gbtr_ss_config = GbtrConfig()

gbtr_ss_config.holdout_ratio = 0.3
gbtr_ss_config.smart_search = True
gbtr_ss_config.smart_search_iterations = 30
gbtr_ss_config.testing_objective = 'MEAN_ABSOLUTE_ERROR'

gbtr_ss_config.holdout_seed = 234567
gbtr_ss_config.smart_search_seed = 234567
gbtr_ss_config.table_sampling_seed = 234567
gbtr_ss_config.regularization = False

gbtr_ss_model = skytree.prediction.learn(
    training,
    'credit_usage',
    gbtr_ss_config,
    name = 'GBTR Smart Search'
).ready()

gbtr_ss_results = gbtr_ss_model.test(
    testing,
    name = 'Testing GBTR Smart Search'
).ready()

#### Multiclass
Multiclass classification allows prediction of a column with more than 2 classes, e.g., overdraft  
Running this model on a testing set yields a multiclass confusion matrix viewable under the GUI "Results" tab  
For multiclass, Gini index is not defined, so the testing objective is accuracy

In [20]:
gbt_mc_config = GbtConfig()

gbt_mc_config.holdout_ratio = 0.3
gbt_mc_config.smart_search = True
gbt_mc_config.smart_search_iterations = 30
gbt_mc_config.testing_objective = 'ACCURACY'

gbt_mc_config.holdout_seed = 234567
gbt_mc_config.smart_search_seed = 234567
gbt_mc_config.table_sampling_seed = 234567
gbt_mc_config.regularization = False

gbt_mc_model = skytree.prediction.learn(
    training,
    'over_draft',
    gbt_mc_config,
    name = 'GBT Smart Search Multiclass'
).ready()

gbt_mc_results = gbt_mc_model.test(
    testing,
    name = 'Testing GBT Smart Search Multiclass'
).ready()
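
For reference, accuracy and the counts behind a multiclass confusion matrix can be computed as in the following plain-Python sketch  
The class labels are invented placeholders, not the actual values of the over_draft column, and the helper names are hypothetical

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; pairs with equal labels are correct predictions."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of rows whose predicted label equals the true label."""
    counts = confusion_counts(true_labels, predicted_labels)
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / float(len(true_labels))


truth = ['low', 'low', 'mid', 'high', 'mid', 'high']
guess = ['low', 'mid', 'mid', 'high', 'mid', 'low']
print(confusion_counts(truth, guess))
print(accuracy(truth, guess))
```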

### 4.6: Other Models
Other machine learning algorithms that can be run in the GUI and SDK include generalized linear models, random decision forest, and support vector machine  
From the command line, we also support clustering, nearest neighbors, density estimation, dimension reduction, and recommendation

## 5. Plots
We show partial dependence, capture deviation, and predicted versus true value plots  
Each plot is also easily generated in the GUI
### 5.1: AutoModel 1D Partial Dependence Plots
Partial dependence plots (PDPs) show how the prediction changes with the value of a variable, which gives more insight than variable importances or performance metrics alone

For example, the variable importances, viewable in the GUI, show that the most significant predictor of credit risk in the AutoModel is whether or not the person has an overdraft, followed by credit purpose and current balance

The PDPs here show additional information, such as risk decreasing with higher credit usage and current balance

In [21]:
pdp_1d_1 = auto_model.visualize_pdp("PDP 1D: Credit Usage","credit_usage").ready()
pdp_1d_2 = auto_model.visualize_pdp("PDP 1D: Current Balance","current_balance").ready()
pdp_1d_3 = auto_model.visualize_pdp("PDP 1D: Overdraft","over_draft").ready()
pdp_1d_4 = auto_model.visualize_pdp("PDP 1D: Purpose","purpose").ready()
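
The idea behind a one-dimensional partial dependence can be sketched in plain Python: fix one variable at each value of interest for every row, and average the model's predictions  
This is the generic definition, not the SDK's code; `toy_model` and its coefficients are invented stand-ins for a trained model

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)        # copy so the original row is untouched
            modified[feature] = value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Invented stand-in for a trained model's predicted risk."""
    return 0.8 - 0.1 * row['credit_usage'] + 0.05 * row['cc_age']


rows = [{'credit_usage': 1, 'cc_age': 2}, {'credit_usage': 3, 'cc_age': 4}]
print(partial_dependence(toy_model, rows, 'credit_usage', [0, 1, 2]))
```

A two-dimensional PDP follows the same pattern with two variables fixed at each point of a 2D grid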

### 5.2: AutoModel 2D Partial Dependence Plots
Two-dimensional PDPs show further insight  
For example, low credit usage is more important than age for risk  
Also, while overdraft is significant, the subset of people for whom "overdraft" in this data means not having a checking account at all is even more significant

In [22]:
pdp_2d_1 = auto_model.visualize_pdp("PDP 2D: Age, Credit History","cc_age","credit_history").ready()
pdp_2d_2 = auto_model.visualize_pdp("PDP 2D: Age, Overdraft","cc_age","over_draft").ready()
pdp_2d_3 = auto_model.visualize_pdp("PDP 2D: Age, Purpose","cc_age","purpose").ready()
pdp_2d_4 = auto_model.visualize_pdp("PDP 2D: Credit Usage, Age","credit_usage","cc_age").ready()
pdp_2d_5 = auto_model.visualize_pdp("PDP 2D: Credit Usage, Credit History","credit_usage","credit_history").ready()
pdp_2d_6 = auto_model.visualize_pdp("PDP 2D: Credit Usage, Overdraft","credit_usage","over_draft").ready()
pdp_2d_7 = auto_model.visualize_pdp("PDP 2D: Credit Usage, Purpose","credit_usage","purpose").ready()
pdp_2d_8 = auto_model.visualize_pdp("PDP 2D: Overdraft, Credit History","over_draft","credit_history").ready()
pdp_2d_9 = auto_model.visualize_pdp("PDP 2D: Overdraft, Purpose","over_draft","purpose").ready()

### 5.3: Capture Deviation Plot
Capture deviation shows how well the class probabilities were modeled  
Ideally the predicted class probability always matches the true class probability as seen from the testing set, yielding points on the diagonal line, which corresponds to a capture deviation of zero  
Here we show the capture deviation for credit risk

In [23]:
from skytree.prediction import Plot, PlotRegistry

# List the plot types registered for test results
registry = PlotRegistry(project)
for entry in registry.findall("TESTRESULT"):
    print(entry["name"])
    print(entry["id"])

# Show the argument specification of the capture deviation plot
plot_registry_id = "1084506945850907"
print(registry.findById(plot_registry_id)["argsSpec"])

plot_args = "{ \"columns\": [\"Objective Column\", \"Predicted Probabilities\",\
\"Predicted Labels\", \"Predicted Categories\"],\
\"summarizer\": [ { \"name\": \"numBuckets\", \"kind\": \"fixed\",\
\"type\": \"integer\", \"default\": 10 } ], \"plot\": [] }"
plot_name = "Capture Deviation for Credit Risk"
plot = auto_results.visualize(plot_args, plot_name, plot_registry_id).ready()

Capture Deviation Plot (Classification)
1084506945850907
Prediction Vs Actual (Regression)
5094586945860902
{
    "columns": ["Objective Column", "Predicted Probabilities", "Predicted Labels", "Predicted Categories"],
    "summarizer": [
        {
            "name": "numBuckets",
            "kind": "optional",
            "type" : "integer",
            "default": 10
        }
    ],
    "plot": []
}


### 5.4: Predicted Versus True Values
For regression problems, we can see predicted versus true values for a testing set  
This shows predicted versus true for credit usage  
The model shows some correlation, but with significant spread

In [24]:
# List the plot types registered for test results
registry = PlotRegistry(project)
for entry in registry.findall("TESTRESULT"):
    print(entry["name"])
    print(entry["id"])

# Show the argument specification of the predicted versus actual plot
plot_registry_id = "5094586945860902"
print(registry.findById(plot_registry_id)["argsSpec"])

plot_args = "{ \"columns\": [\"Objective Column\", \"Predicted Targets\"],\
\"summarizer\": [ { \"name\": \"returnSize\", \"kind\": \"fixed\",\
\"type\": \"integer\", \"default\": 10000 } ], \"plot\": [] }"
plot_name = "Predicted vs. Actual for Credit Usage"
plot = gbtr_ss_results.visualize(plot_args, plot_name, plot_registry_id).ready()

Capture Deviation Plot (Classification)
1084506945850907
Prediction Vs Actual (Regression)
5094586945860902
{
    "columns": ["Objective Column", "Predicted Targets"],
    "summarizer": [
        {
            "name": "returnSize",
            "kind": "fixed",
            "type" : "integer",
            "default": 10000
        }
    ],
    "plot": []
}
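
The escaped JSON strings passed to .visualize() are easier to read and edit when built from a Python dictionary  
A sketch using the standard `json` module; `make_plot_args` is a hypothetical helper, and it is an assumption that the SDK is insensitive to key order and whitespace in this string

```python
import json


def make_plot_args(columns, summarizer_name, default):
    """Build the plot argument string from plain Python values."""
    spec = {
        'columns': columns,
        'summarizer': [{
            'name': summarizer_name,
            'kind': 'fixed',
            'type': 'integer',
            'default': default,
        }],
        'plot': [],
    }
    return json.dumps(spec, sort_keys=True)


plot_args = make_plot_args(['Objective Column', 'Predicted Targets'], 'returnSize', 10000)
print(plot_args)
```

The resulting string parses to the same JSON structure as the hand-escaped version used for the regression plot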


## 6. Conclusions
 - We built a supervised machine learning model using Skytree to predict whether people are a good or bad credit risk
 - The model classified about 74% of people correctly
 - The most significant predictors of credit risk are presence of an overdraft, followed by credit purpose and current balance
 - The probability of being a bad credit risk, and the reasons given, determine the most appropriate course of action for each applicant
 - The customers win by receiving better screening
 - The business wins, via less money being lost to extending undue credit to customers who are a bad risk