# Using AutoML Toolkit to Simplify Loan Risk Analysis XGBoost Model Optimization
## Sample Dataset

We originally created the [Loan Risk Analysis with XGBoost series of notebooks](https://pages.databricks.com/rs/094-YMS-629/images/loan-risk-analysis.zip?_ga=2.178442724.1979726501.1567454379-69048996.1529121564&_gac=1.213533088.1565662714.EAIaIQobChMIhLzsjOT-4wIVB7vsCh36IAQXEAAYASAAEgKAK_D_BwE) to show how to build and optimize a loan risk model using GLM, GBT, and XGBoost. In this notebook, we extend and simplify the process of building an XGBoost model by using the [Databricks Labs AutoML Toolkit](https://github.com/databrickslabs/automl-toolkit).

### Business Problem

Accurately assessing the risk of a loan application can save a lender the cost of holding too many risky assets. Rather than relying only on a credit score or credit history, which track how reliable borrowers are, we will generate a score of how profitable a loan will be compared to past loans. The combination of credit scores, credit history, and a profitability score can help improve the bottom line for a financial institution.

An interpretable model that a loan officer can use before performing a full underwriting can provide an immediate estimate and response for the borrower and an informative view for the lender.

<a href="https://ibb.co/cuQYr6"><img src="https://preview.ibb.co/jNxPym/Image.png" alt="Image" border="0"></a>

### References
* [Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning](https://databricks.com/blog/2018/08/09/loan-risk-analysis-with-xgboost-and-databricks-runtime-for-machine-learning.html)
* [Databricks Labs AutoML Toolkit](https://github.com/databrickslabs/automl-toolkit)

### Dependencies
The following dependencies must be configured and installed for this demo notebook to run successfully.
* Use Databricks Runtime 5.4+
* [MLflow 0.9.1](https://pypi.org/project/mlflow/0.9.1/) (pip install) | [MLflow Client 0.9.1](https://mvnrepository.com/artifact/org.mlflow/mlflow-client/0.9.1) (Maven)
* [XGBoost 0.90](https://xgboost.readthedocs.io/en/latest/jvm/) | [XGBoost4j Spark 0.90](https://mvnrepository.com/artifact/ml.dmlc/xgboost4j-spark/0.90)
* Import AutoML
  * Download or create the JAR from the [Databricks AutoML Toolkit](https://github.com/databrickslabs/automl-toolkit)
  * Create Library with this JAR ([Azure](https://docs.azuredatabricks.net/user-guide/libraries.html) | [AWS](https://docs.databricks.com/user-guide/libraries.html))

In [3]:
%sh 
wget -P /dbfs/tmp/loan-risk-analysis/ https://pages.databricks.com/rs/094-YMS-629/images/loan-risk-analysis-full-cleansed.parquet

In [4]:
## Bring in the dataset
from pyspark.sql.functions import col, expr, when

# Encode the label as 1 (bad loan) or 0 (good loan), drop the `bad_loan` and
# `net` columns, and keep a 2.5% sample (without replacement, seed 42)
source_data = spark.read.parquet("/tmp/loan-risk-analysis/loan-risk-analysis-full-cleansed.parquet")\
  .withColumn("label", when((col("bad_loan") == "true"), 1).otherwise(0))\
  .drop(col("bad_loan"))\
  .drop(col("net"))\
  .sample(False, 0.025, 42)\
  .repartition(192)

# Create Temp View 
source_data.createOrReplaceTempView("source_data")


# Split by issue year: training (<= 2015) and validation (> 2015)
dataset_train = source_data.where(expr("issue_year <= 2015")).cache()
dataset_valid = source_data.where(expr("issue_year > 2015")).cache()
dataset_train.createOrReplaceTempView("dataset_train")
dataset_valid.createOrReplaceTempView("dataset_valid")

In [5]:
display(source_data)

term,home_ownership,purpose,addr_state,verification_status,application_type,loan_amnt,emp_length,annual_inc,dti,delinq_2yrs,revol_util,total_acc,credit_length_in_years,int_rate,issue_year,label
60 months,MORTGAGE,home_improvement,OH,Verified,INDIVIDUAL,16000.0,5.0,60000.0,11.1,0.0,0.0,19.0,7.0,17.57,2015.0,1
36 months,MORTGAGE,debt_consolidation,NY,Not Verified,INDIVIDUAL,4000.0,10.0,79000.0,18.37,0.0,89.6,40.0,16.0,20.99,2014.0,0
60 months,MORTGAGE,home_improvement,OH,Verified,INDIVIDUAL,20000.0,6.0,78000.0,13.86,1.0,77.4,31.0,17.0,22.74,2017.0,0
36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,5000.0,10.0,63000.0,22.0,2.0,81.1,20.0,17.0,11.49,2016.0,0
60 months,MORTGAGE,debt_consolidation,MA,Not Verified,INDIVIDUAL,16000.0,1.0,75000.0,18.58,0.0,33.7,44.0,13.0,13.99,2015.0,1
60 months,MORTGAGE,debt_consolidation,WA,Not Verified,INDIVIDUAL,19700.0,10.0,80000.0,22.49,0.0,57.7,35.0,11.0,13.33,2015.0,0
36 months,MORTGAGE,house,MI,Verified,INDIVIDUAL,30000.0,10.0,120000.0,8.23,3.0,68.1,20.0,12.0,24.5,2014.0,0
60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,17500.0,10.0,111000.0,8.18,0.0,38.1,23.0,15.0,11.55,2013.0,0
36 months,MORTGAGE,debt_consolidation,MA,Not Verified,INDIVIDUAL,12000.0,1.0,56000.0,9.62,0.0,29.0,52.0,19.0,7.62,2013.0,0
36 months,RENT,credit_card,NV,Not Verified,INDIVIDUAL,4000.0,1.0,40000.0,14.85,0.0,29.0,34.0,17.0,12.29,2015.0,0


In [6]:
print("Sample Counts: source:" , source_data.count() , ", train:" , dataset_train.count() , ", test:" , dataset_valid.count())
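
The year-based split used above can be illustrated without Spark. The following is a minimal sketch in plain Python, using the `issue_year` values from the ten sample rows displayed earlier; the helper name `split_by_year` is hypothetical and not part of the notebook. It shows that every training row precedes every validation row in time, so the model is validated on loans issued after those it was trained on.

```python
# issue_year values copied from the ten sample rows displayed above
sample_issue_years = [2015, 2014, 2017, 2016, 2015, 2015, 2014, 2013, 2013, 2015]

CUTOFF_YEAR = 2015  # same cutoff as the notebook's where() filters


def split_by_year(years, cutoff):
    """Hypothetical helper: rows up to the cutoff train, later rows validate."""
    train = [y for y in years if y <= cutoff]
    valid = [y for y in years if y > cutoff]
    return train, valid


train_years, valid_years = split_by_year(sample_issue_years, CUTOFF_YEAR)
print("train:", len(train_years), "valid:", len(valid_years))  # train: 8 valid: 2
```

On the full sample the proportions will differ; the ten rows here are only an illustration.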

## Goal: Build a model for identifying bad loans
Can we build a model based on our set of features (e.g., state, application type, annual income) to determine whether a loan is bad? The `label` column is `0` for a good loan and `1` for a bad loan.

## Configure the AutoML Toolkit
The key settings are defined in the next cell; they include folder names and attributes such as `labelColumn`.

In [9]:
## Generic configuration
experimentNamePrefix = "/Users/marygrace.moesta@databricks.com/AutoML"
RUNVERSION = "5"
labelColumn = "label"
runExperiment = "runRF_" + RUNVERSION
projectName = "mg_AutoML_Demo"
modelSaveFolder = "/tmp/mgm/ml/automl/"

## This is the configuration of the hardware available (default of 4, 4, and 4)
nodeCount = 8
coresPerNode = 16
totalCores = nodeCount * coresPerNode
driverCores = 30

## Save locations
mlFlowModelSaveDirectory = "dbfs:" + modelSaveFolder + "models/" + projectName + "/"
inferenceConfigSaveLocation = "dbfs:" + modelSaveFolder + "inference/" + projectName + "/"

### Configure Overrides 
An important aspect of the AutoML Toolkit is the ability to modify the generic maps with your own overrides. In general, you can start with the defaults and change them as you become more familiar with the toolkit and want more control over how it works. More information is available in the [AutoML Toolkit API Documentation](https://github.com/databrickslabs/automl-toolkit/blob/master/APIDOCS.md).
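
The general idea of layering overrides on top of defaults can be sketched in plain Python. This is only an illustration under stated assumptions: the `DEFAULTS` values and the `apply_overrides` helper are hypothetical, and the toolkit's real default map and merge logic may differ.

```python
# Hypothetical defaults for illustration only; the toolkit defines its own.
DEFAULTS = {
    "labelCol": "label",
    "scoringMetric": "areaUnderROC",
    "tunerKFold": 5,
    "tunerTrainPortion": 0.8,
}


def apply_overrides(defaults, overrides):
    """Return a new config where override values replace the defaults.

    Unknown keys raise an error so that a typo in a key name is caught
    instead of being silently ignored.
    """
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown configuration keys: {unknown}")
    merged = dict(defaults)  # copy, so the defaults stay untouched
    merged.update(overrides)
    return merged


config = apply_overrides(DEFAULTS, {"tunerKFold": 1, "tunerTrainPortion": 0.7})
print(config["tunerKFold"], config["scoringMetric"])  # 1 areaUnderROC
```

Rejecting unknown keys is a design choice of this sketch, not a documented behavior of the toolkit.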

In [11]:
cntx = dbutils.entry_point.getDbutils().notebook().getContext()
# api_token = cntx.apiToken().get()
# api_url = cntx.apiUrl().get()
notebook_path = cntx.notebookPath().get()
generic_overrides = {
  "labelCol": labelColumn,
  "scoringMetric": "areaUnderROC",
  "dataPrepCachingFlag": False,
  "autoStoppingFlag": True,            
  "tunerAutoStoppingScore": 0.91,
  "tunerParallelism": driverCores,
  "tunerKFold": 1,  ## normally should be >=5
  "tunerSeed": 42,  ## for reproducibility
  "tunerInitialGenerationArraySeed": 42,
  "tunerTrainPortion": 0.7,
  "tunerTrainSplitMethod": "stratified",
  "tunerInitialGenerationMode": "permutations",
  "tunerInitialGenerationPermutationCount": 8,
  "tunerInitialGenerationIndexMixingMode": "linear",
  "tunerFirstGenerationGenePool": 16,
  "tunerNumberOfGenerations": 3,
  "tunerNumberOfParentsToRetain": 2,
  "tunerNumberOfMutationsPerGeneration": 4,
  "tunerGeneticMixing": 0.8,
  "tunerGenerationalMutationStrategy": "fixed",
  "tunerEvolutionStrategy": "batch",
  "tunerHyperSpaceInferenceFlag": True,
  "tunerHyperSpaceInferenceCount": 400000,
  "tunerHyperSpaceModelType": "XGBoost",
  "tunerHyperSpaceModelCount": 8,
  "mlFlowLoggingFlag": True,
  "mlFlowLogArtifactsFlag": False,
#   "mlFlowTrackingURI": api_url,
  "mlFlowExperimentName": experimentNamePrefix +"/" + projectName+ "/" + runExperiment,
#   "mlFlowAPIToken": api_token,
  "mlFlowModelSaveDirectory": mlFlowModelSaveDirectory,
  "mlFlowLoggingMode": "bestOnly",
  "mlFlowBestSuffix": "_best",
  "inferenceConfigSaveLocation": inferenceConfigSaveLocation
  }
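
Several of the `tuner*` keys above (generations, parents to retain, mutations per generation, genetic mixing) describe a generational search over hyperparameters. The toy sketch below shows one way such a search can work. It is not the toolkit's algorithm: the search space, the `toy_score` function, and the way the parameters are used are all assumptions made for illustration, with parameter names only loosely mirroring the configuration keys.

```python
import random

# Hypothetical search space: (low, high) bounds per hyperparameter.
SEARCH_SPACE = {"max_depth": (2, 10), "eta": (0.01, 0.5)}


def toy_score(params):
    """Stand-in for a validation metric; peaks at max_depth=6, eta=0.1."""
    return 1.0 - abs(params["max_depth"] - 6) / 10 - abs(params["eta"] - 0.1)


def random_candidate(rng):
    return {
        "max_depth": rng.randint(*SEARCH_SPACE["max_depth"]),
        "eta": rng.uniform(*SEARCH_SPACE["eta"]),
    }


def make_child(parent_a, parent_b, rng, mixing=0.8):
    # Crossover: take each gene from parent_a with probability `mixing`.
    child = {k: (parent_a[k] if rng.random() < mixing else parent_b[k])
             for k in parent_a}
    # Mutation: re-draw one randomly chosen gene from the search space.
    gene = rng.choice(sorted(child))
    child[gene] = random_candidate(rng)[gene]
    return child


def evolve(generations=3, first_pool=16, parents_to_retain=2,
           children_per_generation=4, seed=42):
    rng = random.Random(seed)  # seeded for reproducibility
    population = [random_candidate(rng) for _ in range(first_pool)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=toy_score, reverse=True)
        parents = ranked[:parents_to_retain]
        best_per_generation.append(toy_score(parents[0]))
        children = [make_child(*rng.sample(parents, 2), rng)
                    for _ in range(children_per_generation)]
        # Retained parents carry over, so the best score never gets worse.
        population = parents + children
    return max(population, key=toy_score), best_per_generation


best, history = evolve()
print(best, history)
```

Because the best parents are carried into each new generation, the best score in this sketch can only stay the same or improve from one generation to the next.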

## Calculate Feature Importance
Determine the most important features within our dataset.

In [13]:
## Calculate Feature Importance 
from py_auto_ml.exploration.feature_importance import FeatureImportance

fi_importances_package = FeatureImportance("XGBoost", "classifier",  source_data, 20.0,"count",generic_overrides)

In [14]:
## Display the feature importance 
display(fi_importances_package.importances)

Feature,Importance
int_rate,7.0
dti,4.0
term,3.0
issue_year,3.0
annual_inc,2.0


In [15]:
## Isolate only the top fields 
display(fi_importances_package.top_fields)

feature
int_rate
dti
issue_year
term
annual_inc


## Select Model Features
Use `top_fields` to specify which features to use for our model.

In [17]:
top_fields = fi_importances_package.top_fields.select("feature").rdd.flatMap(lambda x: x).collect()

selection_fields = source_data.select([c for c in source_data.columns if c in top_fields])

In [18]:
display(selection_fields)

term,addr_state,verification_status,annual_inc,dti,int_rate
36 months,OH,Verified,49000.0,17.56,6.92
36 months,CO,Verified,30000.0,19.2,17.27
60 months,PA,Verified,54610.0,21.85,23.43
36 months,PA,Not Verified,53000.0,8.76,17.76
36 months,NY,Verified,110000.0,8.52,9.99
36 months,AR,Verified,43686.0,31.79,13.33
36 months,GA,Verified,38000.0,16.2,11.99
60 months,PA,Verified,156000.0,10.98,11.14
60 months,NJ,Not Verified,63000.0,10.46,10.49
36 months,MS,Verified,63024.0,18.43,11.47


### Use AutoML AutomationRunner
Use the AutoML AutomationRunner to build, train, evaluate, and tune your ML model.

In [20]:
model_family = "XGBoost"
prediction_type = "classifier"
run_type = "confusion"

## Kickoff Automation runner
from py_auto_ml.automation_runner import AutomationRunner

runner = AutomationRunner(model_family,
                         prediction_type,
                         source_data,
                         run_type,
                         generic_overrides)

### Review Best Model Parameters
Go to MLflow to see the [best model](https://demo.cloud.databricks.com/#mlflow/experiments/4250387/runs/c704b97ed2ee4710af1cd7a328706d58) with all of its metrics and parameters.
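
The scoring metric configured earlier, `areaUnderROC`, can be read as the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan. The sketch below computes it from that definition in plain Python; the function name and the four example scores are made up for illustration, and Spark computes the metric with its own implementation.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up example: label 1 = bad loan, score = predicted probability of bad
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(area_under_roc(example_labels, example_scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic, so it suits small illustrations rather than full datasets.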

### Review Confusion Matrix
As this is a binary classifier, we want to minimize false negatives (bad loans identified as good) and maximize true positives (bad loans correctly identified as bad).

In [23]:
cmdf = runner.confusion_data.select("label", "prediction", "count")

display(cmdf)

label,prediction,count
0,0.0,13026
1,0.0,3526
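
The counts in the output above can be turned into the usual summary metrics. The sketch below treats a bad loan (`label = 1`) as the positive class, which is an assumption consistent with the definition of false negatives given earlier; combinations that do not appear in the output are taken to be zero.

```python
# (label, prediction) -> count, copied from the output above
counts = {(0, 0): 13026, (1, 0): 3526}

# Positive class = bad loan (label 1); missing combinations count as 0
tp = counts.get((1, 1), 0)  # bad loans predicted bad
fn = counts.get((1, 0), 0)  # bad loans predicted good
fp = counts.get((0, 1), 0)  # good loans predicted bad
tn = counts.get((0, 0), 0)  # good loans predicted good

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
recall = tp / (tp + fn) if (tp + fn) else 0.0
false_negative_rate = fn / (tp + fn) if (tp + fn) else 0.0

print(round(accuracy, 3), recall, false_negative_rate)  # 0.787 0.0 1.0
```

For this particular output, accuracy is about 0.787 while recall for bad loans is 0, because no loan was predicted as bad. This is why accuracy alone is a poor guide when the classes are imbalanced.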


In [24]:
# Source code for plotting confusion matrix is based on `plot_confusion_matrix` 
# via https://runawayhorse001.github.io/LearningApacheSpark/classification.html#decision-tree-classification
import matplotlib.pyplot as plt
import numpy as np
import itertools

def plot_confusion_matrix(cm, title):
  # Clear Plot
  plt.gcf().clear()

  # Configure figure
  fig = plt.figure(1)
  
  # Configure plot
  classes = ['Bad Loan', 'Good Loan']
  plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
  plt.title(title)
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes, rotation=45)
  plt.yticks(tick_marks, classes)

  # Normalize and establish threshold
  normalize=False
  fmt = 'd'
  thresh = cm.max() / 2.

  # Iterate through the confusion matrix cells
  for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
      plt.text(j, i, format(cm[i, j], fmt),
               horizontalalignment="center",
               color="white" if cm[i, j] > thresh else "black")

  # Final plot configurations
  plt.tight_layout()
  plt.ylabel('True label')
  plt.xlabel('Predicted label') 
  
  # Display images
  image = fig
  
  # Show plot
  #fig = plt.show()
  
  # Save plot
  fig.savefig("confusion-matrix.png")

  # Display Plot
  display(image)
  
  # Close Plot
  plt.close(fig)

In [25]:
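# Convert to pandas
cm_pdf = cmdf.toPandas()

# Build a 2x2 matrix indexed by [true label, predicted label].
# Combinations missing from the results (e.g. no loans predicted as bad)
# are filled with 0 instead of silently changing the matrix shape.
order = [1, 0]  # 1 = bad loan, 0 = good loan, matching the plot's class order
cm = (cm_pdf.assign(prediction=cm_pdf["prediction"].astype(int))
      .pivot_table(index="label", columns="prediction", values="count",
                   aggfunc="sum", fill_value=0)
      .reindex(index=order, columns=order, fill_value=0)
      .values.astype(int))

# Plot confusion matrix
plot_confusion_matrix(cm, "Confusion Matrix")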