# Predicting Bankruptcy using SageMaker Autopilot


## Introduction

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.

Predicting corporate bankruptcy is important for any wholesale or capital markets credit business, and it is a core input to credit risk management.

---
## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:
- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
#install arff package, this package is used to read the bankruptcy data which is in ARFF format
!pip install --upgrade pip 
!pip install --upgrade arff
#install s3fs - this package is used by pandas to read file from s3
!pip install --upgrade s3fs
!pip install wget

### Import python packages
First, let's import the packages we need.
The bankruptcy data is in ARFF format, so the import list includes the `arff` module from `scipy.io`, which is used to load it below.

In [None]:
import io
import json
import sys
import time
from time import gmtime, sleep, strftime
from urllib.parse import urlparse

import boto3
import botocore
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.io import arff
# plot_precision_recall_curve was removed in scikit-learn 1.2 and is not used below
from sklearn.metrics import (auc, f1_score, precision_recall_curve,
                             precision_score, recall_score, roc_curve)
from sklearn.model_selection import train_test_split

import sagemaker
import wget
from sagemaker import get_execution_role
from sagemaker.automl.automl import AutoML

### Import the dataset.

You will use the [Polish companies bankruptcy dataset](https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). It has 64 features and one target attribute. More details are found here:
Zieba, M., Tomczak, S. K., & Tomczak, J. M. (2016). Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications.

In [None]:
sagemaker_session = sagemaker.Session()

timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
role = get_execution_role()
url = "https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/uci_polish_bankruptcy/data.zip"
bankruptcy_file = wget.download(url)

In [None]:
!unzip -o data.zip

First, let's take a quick look at the dataset.

In [None]:
bankruptcydata1 = arff.loadarff("1year.arff")
bankruptcy_df = pd.DataFrame(bankruptcydata1[0])
bankruptcy_df["class"] = bankruptcy_df["class"].map(int)
bankruptcy_df.head()

Class 0 = not bankrupt (i.e., did not file for bankruptcy) and class 1 = bankrupt (filed for bankruptcy). As you can see, other than `amount`, the columns are anonymized in the dataset. All column descriptions are available in the html page. The column names were saved in the `bankruptcyfeatures.csv` file. Your goal is to predict which companies will file for bankruptcy next month.

### Mapping Feature Names to the DataFrame
You create an attribute-to-feature-name mapping in `bankruptcyfeatures.csv` from the column descriptions in the html page. You will use this file to rename the columns of `bankruptcy_df`. Please note that the column names are just for display. **Autopilot does not need column names.**

Note that the target attribute **class** is mapped to **bankrupt** to make it clearer.


In [None]:
feature_names = pd.read_csv("bankruptcyfeatures.csv", header=0)
bankruptcy_df.columns = np.array(feature_names["economic_factor"])

Now check if the dataset is balanced. See if the number of bankruptcies represents roughly half of the dataset.
Also check if the dataset has any NaN values.

In [None]:
# our target is to predict bankrupt column
target_variable = "bankrupt"
print(bankruptcy_df[target_variable].value_counts())
# check for null values
print(bankruptcy_df.isnull().values.any())

As you can see, the **bankrupt** records amount to only around 4% of the **not bankrupt** records.
You can also see that the dataset features have many NaN values. You will let Autopilot handle these NaN values.
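
The two checks above (class balance and missing values) can be sketched without any dependencies. This is a minimal illustration on made-up toy data, not the notebook's actual DataFrame; the helper names `summarize_labels` and `count_missing` are hypothetical.

```python
import math
from collections import Counter


def summarize_labels(labels):
    """Return the count per class and the share of positive (1) rows."""
    counts = Counter(labels)
    positive_share = counts[1] / len(labels)
    return counts, positive_share


def count_missing(rows):
    """Count NaN cells per column index in a list of equal-length rows."""
    missing = Counter()
    for row in rows:
        for col, value in enumerate(row):
            if isinstance(value, float) and math.isnan(value):
                missing[col] += 1
    return missing


# Toy stand-in for the bankruptcy data: 96 healthy firms, 4 bankrupt ones.
toy_labels = [0] * 96 + [1] * 4
# Toy feature rows with some missing cells.
toy_rows = [[0.1, float("nan")], [0.3, 0.2], [float("nan"), float("nan")]]

counts, positive_share = summarize_labels(toy_labels)
missing = count_missing(toy_rows)
print(dict(counts), f"{positive_share:.1%}", dict(missing))
```

With such a small positive share, plain accuracy is a poor yardstick, which is why the notebook later compares the AUC and F1 objectives.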

Before training, you need to split the data into train and test data sets.  The test data will be used to measure the ability of the Autopilot generated model to generalize to previously unseen data. You will use an 80-20 ratio of training versus testing data.

In [None]:
train, test = train_test_split(bankruptcy_df, test_size=0.2, random_state=100)

In [None]:
print(f"Training dataset size = {train.shape}")
print(f"Test dataset size = {test.shape}")
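
Because bankruptcies are rare, a purely random split can leave the test set with noticeably more or fewer positives than the training set. The notebook uses a plain random split; the sketch below only illustrates the alternative idea of a stratified split on toy data (scikit-learn's `train_test_split` exposes this through its `stratify` argument). The function `stratified_split` is a hypothetical helper written for this illustration.

```python
import random


def stratified_split(rows, label_of, test_size=0.2, seed=100):
    """Split rows so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: 1000 (id, label) pairs with 4% positives.
toy = [(i, 1 if i < 40 else 0) for i in range(1000)]
train_rows, test_rows = stratified_split(toy, label_of=lambda r: r[1])

train_positives = sum(label for _, label in train_rows)
test_positives = sum(label for _, label in test_rows)
print(len(train_rows), len(test_rows), train_positives, test_positives)
```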

## Configure Autopilot

Name the job **automl-bankruptcy**, then create a session with the SageMaker client. You need an **S3** bucket to store the train/test data and all other artifacts Autopilot produces. This notebook uses the default **S3** bucket, but you can create your own bucket if you wish. The training and test data from the previous steps are uploaded to the **S3** bucket under "train" and "test" respectively. `training_data[target_variable]` holds the target (bankrupt 1, not bankrupt 0). The **S3Uri** field in `input_data_config` tells Autopilot where the training data is located, and **TargetAttributeName** indicates the target variable for the training job.

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset.

The required inputs for invoking an Autopilot job are:

- Amazon S3 location for the input dataset and for all output artifacts
- Name of the column of the dataset you want to predict (`bankrupt` in this case)
- An IAM role

Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order, is expected to have a header row.
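
Since the target is looked up by column name, it can help to confirm that the training CSV really starts with a header row containing the target before uploading it. The following is a small sketch using only the standard library; the helper `header_contains_target` and the feature name `net_profit_to_assets` are made up for illustration.

```python
import csv
import io


def header_contains_target(csv_text, target):
    """Return True if the first CSV row lists the target column name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return target in header


# Toy training file: one hypothetical feature column plus the target column.
toy_csv = "net_profit_to_assets,bankrupt\n0.12,0\n-0.40,1\n"

print(header_contains_target(toy_csv, "bankrupt"))
```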


In [None]:
auto_ml_job_name = "automl-bankruptcy"
sm = boto3.client("sagemaker")
session = sagemaker.Session()

prefix = f"sagemaker/{auto_ml_job_name}"
bucket = session.default_bucket()
training_data = train
X_test = test.drop(columns=[target_variable])
y_test = test[target_variable]

test_data = X_test

train_file = "train_data.csv"
training_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print(f"Train data uploaded to: {train_data_s3_path}")

test_file = "test_data.csv"
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print(f"Test data uploaded to: {test_data_s3_path}")
input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/{prefix}/train",
            }
        },
        "TargetAttributeName": target_variable,
    }
]

In [None]:
timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())

Now, you need to create the Autopilot job. The function below shows an example. It caps the number of candidate models Autopilot explores (attribute `max_candidates`) at 250 and sets `ProblemType='BinaryClassification'`. Please note that you do not need to set `ProblemType` and `MetricName`. If you do not set these two fields, Autopilot will automatically determine the type of supervised learning problem by analyzing the data (for a binary classification problem the default metric is F1). We set `MetricName` (parameter `job_objective`) to AUC or F1 (the value of `eval_obj` when the function is called). More info: [options for the job configuration](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-auto-ml-job.html).

Note: define the functions here and use them later from a calling function. 

In [None]:
def create_automl_object(eval_obj, base_job_name):

    target_attribute_name = target_variable
    role = get_execution_role()
    automl = AutoML(
        role=role,
        target_attribute_name=target_attribute_name,
        base_job_name=base_job_name,
        sagemaker_session=session,
        problem_type="BinaryClassification",
        job_objective={"MetricName": eval_obj},
        max_candidates=250,
    )
    return automl
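
For orientation, the `AutoML` arguments above correspond to fields of the lower-level `CreateAutoMLJob` request described in the job configuration link. The sketch below only builds such a request as a plain dictionary and does not call AWS; the helper `build_automl_request`, the bucket name, and the role ARN are placeholders invented for this example.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_uri, target,
                         metric, role_arn, max_candidates=250):
    """Assemble a CreateAutoMLJob-style request as a plain dictionary."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                    }
                },
                "TargetAttributeName": target,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": "BinaryClassification",
        "AutoMLJobObjective": {"MetricName": metric},
        "AutoMLJobConfig": {
            "CompletionCriteria": {"MaxCandidates": max_candidates}
        },
        "RoleArn": role_arn,
    }


# Placeholder values; substitute your own bucket and role.
request = build_automl_request(
    job_name="automl-bankruptcy-AUC",
    train_s3_uri="s3://example-bucket/sagemaker/automl-bankruptcy/train",
    output_s3_uri="s3://example-bucket/sagemaker/automl-bankruptcy/output",
    target="bankrupt",
    metric="AUC",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
# With boto3 this could be submitted as:
#   boto3.client("sagemaker").create_auto_ml_job(**request)
print(request["AutoMLJobObjective"], request["AutoMLJobConfig"])
```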

After the AutoML object is created, call the fit() function to train the AutoML object.

In [None]:
def automl_fit(automl, base_job_name):
    automl.fit(train_data_s3_path, job_name=base_job_name, wait=False, logs=False)

After you create the Autopilot job, monitor its status. The function below checks the job status every 30 seconds and exits the loop once the status is ‘Completed’, ‘Failed’, or ‘Stopped’.
While the job is still running, the loop prints **InProgress**.

In [None]:
def check_status(automl):
    describe_response = automl.describe_auto_ml_job()
    print(describe_response)
    job_run_status = describe_response["AutoMLJobStatus"]

    # Poll every 30 seconds until the job reaches a terminal state.
    while job_run_status not in ("Failed", "Completed", "Stopped"):
        sleep(30)
        describe_response = automl.describe_auto_ml_job()
        job_run_status = describe_response["AutoMLJobStatus"]
        print(job_run_status)
    # Report the real final status: the job may have failed or been stopped.
    print(f"Job finished with status: {job_run_status}")
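
A polling loop like this one runs forever if the job never reaches a terminal state. One possible variation, sketched here under the assumption that a hard time limit is wanted, takes the status lookup as a function so it can be exercised without AWS. `wait_for_job` and its parameters are hypothetical names, not part of the SageMaker SDK.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=4 * 60 * 60,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal state is seen or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Demo with a fake status source and a no-op sleep, so it finishes instantly.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

In the notebook, `get_status` could be `lambda: automl.describe_auto_ml_job()["AutoMLJobStatus"]`.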

Select the best candidate and check the accuracy. 

In [None]:
def get_best_candidate(automl):
    best_candidate = automl.describe_auto_ml_job()["BestCandidate"]
    best_candidate_name = best_candidate["CandidateName"]
    print(best_candidate)
    print("\n")
    print(f"CandidateName: {best_candidate_name}")
    print(f"FinalAutoMLJobObjectiveMetricName: {best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName']}")
    print(f"FinalAutoMLJobObjectiveMetricValue: {best_candidate['FinalAutoMLJobObjectiveMetric']['Value']}")
    return best_candidate_name, best_candidate

Create a model from the best candidate. In addition to the predicted label, you want the probability of the prediction. This probability is used later to plot the ROC and precision/recall curves.

In [None]:
def create_model(automl, best_candidate_name, best_candidate, timestamp_suffix):
    model_name = f"automl-bankruptcy-model-{timestamp_suffix}"
    inference_response_keys = ["predicted_label", "probability"]
    model = automl.create_model(
        name=best_candidate_name,
        candidate=best_candidate,
        inference_response_keys=inference_response_keys,
    )
    return model

You may also select multiple candidates (example by Objective, in this case AUC).

Once the model is created, run a Transform job to get inferences (i.e., predictions about bankruptcy) for the test dataset and save them in S3. It is worth noting that when you deploy the model as an endpoint or create a Transformer, SageMaker handles the deployment of the feature engineering pipeline and the ML algorithm, so end users can send the data in its raw format for inference.

In [None]:
def create_transformer(model, best_candidate, eval_obj):
    s3_transform_output_path = f"s3://{bucket}/{prefix}/inference-results/"
    output_path = f"{s3_transform_output_path}{best_candidate['CandidateName']}/"
    transformer = model.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        assemble_with="Line",
        output_path=output_path,
    )
    transformer.transform(
        data=test_data_s3_path, split_type="Line", content_type="text/csv", wait=False
    )
    return transformer

Finally, we read the inference/predicted data into a pandas DataFrame.

The functions below read the file from S3 (generated by `create_transformer`) and create a DataFrame with `label` (predicted 0/1) and `proba` (probability of the prediction).

In [None]:
def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip("/")
    s3_resource = boto3.resource("s3")
    pred_body = None
    # Poll S3 until the transform job has written its output file.
    while pred_body is None:
        try:
            obj = s3_resource.Object(bucket_name, f"{prefix}/{file_name}")
            pred_body = obj.get()["Body"].read().decode("utf-8")
            print("prediction file is available in S3")
        except botocore.exceptions.ClientError:
            print("prediction file not yet available in S3, sleeping for 2 minutes")
            time.sleep(120)
    return pred_body

def return_pred_df(transformer):
    print("***predict output path ***")
    print(transformer.output_path, "{}.out".format(test_file))
    pred_csv = get_csv_from_s3(transformer.output_path, "{}.out".format(test_file))
    data = pd.read_csv(io.StringIO(pred_csv), header=None)
    data.columns = ["label", "proba"]
    return data
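
To make the file handling concrete, the sketch below shows how the output object key is derived from the transformer output path and the input file name (with the `.out` suffix used above), and how a two-column prediction body can be parsed. It runs on an invented sample body and an invented bucket name, and it assumes each row holds an integer label followed by a probability, matching the `inference_response_keys` requested earlier. The helper names are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse


def output_location(output_path, input_file):
    """Return (bucket, key) of the transform output for one input file."""
    parsed = urlparse(output_path)
    prefix = parsed.path.strip("/")
    return parsed.netloc, f"{prefix}/{input_file}.out"


def parse_predictions(body):
    """Parse 'label,probability' rows into two parallel lists."""
    labels, probabilities = [], []
    for row in csv.reader(io.StringIO(body)):
        if not row:  # skip blank lines
            continue
        labels.append(int(row[0]))
        probabilities.append(float(row[1]))
    return labels, probabilities


# Invented example values, for illustration only.
bucket_name, key = output_location(
    "s3://example-bucket/sagemaker/automl-bankruptcy/inference-results/candidate-1/",
    "test_data.csv",
)
sample_body = "0,0.03\n1,0.87\n0,0.41\n"
labels, probabilities = parse_predictions(sample_body)
print(bucket_name, key, labels, probabilities)
```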

We can download the Candidate Definition notebook from the S3 location printed by the function below.
We can also download the Data Exploration notebook to see the details of Autopilot's data analysis. This report provides insights about the dataset you provided as input to the AutoML job.

In [None]:
def download_notebooks(automl, eval_obj):
    print(f"download CandidateDefinitionNotebookLocation for {eval_obj}")
    print(
        automl.describe_auto_ml_job()["AutoMLJobArtifacts"][
            "CandidateDefinitionNotebookLocation"
        ]
    )
    print(f"download DataExplorationNotebookLocation for {eval_obj}")
    print(
        automl.describe_auto_ml_job()["AutoMLJobArtifacts"][
            "DataExplorationNotebookLocation"
        ]
    )

The wrapper function `run_automl_process` is called with the objectives AUC and F1. It calls the functions defined above to create the AutoML object, run the training process, create a model from the best trained job, and finally return the predicted data.

In [None]:
def run_automl_process(eval_obj):
    timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
    base_job_name = f"{auto_ml_job_name}-{eval_obj}{timestamp_suffix}"
    print(base_job_name)
    automl = create_automl_object(eval_obj, base_job_name)
    automl_fit(automl, base_job_name)
    check_status(automl)
    best_candidate_name, best_candidate = get_best_candidate(automl)
    model = create_model(automl, best_candidate_name, best_candidate, timestamp_suffix)
    transformer = create_transformer(model, best_candidate, eval_obj)
    pred_df = return_pred_df(transformer)
    pred_df.to_csv(f'data_{eval_obj}_bankruptcy.csv', index=False)
    download_notebooks(automl, eval_obj)
    return pred_df

Now we are ready to run the Autopilot job. We call the wrapper function `run_automl_process` with the objectives AUC and F1.

In [None]:
print("*********running with eval objective AUC***********")
data_auc = run_automl_process("AUC")

In [None]:
print("*********running with eval objective F1***********")
data_f1 = run_automl_process("F1")

For each objective, the wrapper function has already run a Transform job to get inferences (i.e., predictions about bankruptcy) for the test dataset and saved them to S3.

Now, we plot the ROC curve, which shows the true positive rate (in this dataset, bankrupt companies correctly predicted as bankrupt) against the false positive rate (companies predicted as bankrupt that are not bankrupt in the ground truth). The area under this curve is the AUC. The higher the prediction quality of the classification model, the more the ROC curve is pulled toward the top left and the closer the AUC is to 1.

In [None]:
from sklearn import metrics

colors = ["blue", "green"]
model_names = ["Objective : AUC", "Objective : F1"]
models = [data_auc, data_f1]
for i in range(len(models)):
    fpr, tpr, _ = metrics.roc_curve(y_test, models[i]["proba"])
    auc_score = metrics.auc(fpr, tpr)
    plt.plot(
        fpr,
        tpr,
        label=str(f"Auto Pilot {auc_score:.3f} {model_names[i]}"),
        color=colors[i],
    )

plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.legend(loc="lower right")
plt.title("ROC Curve")
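
To show what the plot and the AUC number represent, here is a dependency-free sketch that builds ROC points by sweeping the score threshold from high to low and integrates them with the trapezoid rule. It is an illustration on four toy rows, not a replacement for `sklearn.metrics`; `roc_points` and `trapezoid_auc` are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row sharing this score so ties produce a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Integrate the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and scores (1 = bankrupt).
toy_truth = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = trapezoid_auc(roc_points(toy_truth, toy_scores))
print(toy_auc)
```

A model that ranks every bankrupt company above every healthy one reaches an AUC of 1.0, while random scores give about 0.5.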

In [None]:
colors = ["blue", "green"]
model_names = ["Objective : AUC", "Objective : F1"]
models = [data_auc, data_f1]

print("model ", "F1 ", "precision ", "recall ")
for i in range(0, len(models)):
    precision, recall, _ = precision_recall_curve(y_test, models[i]["proba"])
    print(
        model_names[i],
        f1_score(y_test, np.array(models[i]["label"])),
        precision_score(y_test, models[i]["label"]),
        recall_score(y_test, models[i]["label"]),
    )
    plt.plot(recall, precision, color=colors[i], label=model_names[i])

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend(loc="upper right")
plt.show()
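
The precision, recall, and F1 values printed above come from counting true positives, false positives, and false negatives. The sketch below computes them by hand on toy data and shows how moving the decision threshold trades precision for recall, which is what the precision-recall curve traces out. The scores are treated as positive-class scores, an assumption of this illustration, and the helper names are hypothetical.

```python
def labels_at_threshold(scores, threshold=0.5):
    """Predict 1 (bankrupt) when the score reaches the threshold."""
    return [1 if score >= threshold else 0 for score in scores]


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy ground truth and scores: three bankrupt firms among eight.
toy_truth = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.2, 0.7, 0.1, 0.3, 0.05, 0.4]

strict = precision_recall_f1(toy_truth, labels_at_threshold(toy_scores, 0.5))
lenient = precision_recall_f1(toy_truth, labels_at_threshold(toy_scores, 0.15))
print(strict, lenient)
```

Lowering the threshold catches every bankrupt firm in this toy set, at the cost of more false alarms.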

---
## Conclusion <a name="Conclusion"></a>
We can see that, with very little data science knowledge, we are able to create a moderately accurate prediction for a complex financial event like bankruptcy. From the ROC and precision-recall plots, we can also see that Autopilot handled highly imbalanced data reasonably well. We think the reason for the 62% recall (rather than a higher score) is that the bankruptcy dataset is missing some important features of a bankruptcy filing, such as short-term liquidity and short-term funding sources.

---
## Cleanup <a name="Cleanup"></a>
The Autopilot job creates many underlying artifacts such as dataset splits, preprocessing scripts, or preprocessed data, etc. This code, when un-commented, deletes them. This operation deletes all the generated models and the auto-generated notebooks as well.

In [None]:
# s3 = boto3.resource('s3')
# s3_bucket = s3.Bucket(bucket)

# s3_bucket.objects.filter(Prefix=prefix).delete()

Finally, we can delete the model by calling `delete_model()` on the transformer:

In [None]:
# transformer.delete_model()