## Introduction

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook will use the AWS SDKs to simply create and deploy a machine learning model.

In this notebook we will work through an example of credit card fraud detection using SageMaker Autopilot.


## Setup

This notebook was created and tested on an ml.m4.xlarge notebook instance. It requires a version of the SageMaker Python SDK older than 2.0.0; the first code cell below checks the installed version and downgrades it if necessary.
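
The version check can also be written as a small pure function, which is easier to test than inline cell code. This is a minimal sketch; `needs_v1_downgrade` is a hypothetical helper name and not part of the SageMaker SDK.

```python
def needs_v1_downgrade(version: str) -> bool:
    """Return True when the installed SDK is 2.x or newer.

    Hypothetical helper: this notebook targets SageMaker SDK 1.x,
    so any major version >= 2 means a downgrade is needed.
    """
    major = int(version.split(".")[0])
    return major >= 2


print(needs_v1_downgrade("1.72.1"))  # the version reported in this notebook's output
print(needs_v1_downgrade("2.16.0"))  # an example 2.x version string
```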

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [61]:
import sys
import sagemaker
import time
!pip install --upgrade pip
# This notebook targets SageMaker Python SDK 1.x; downgrade if 2.x or newer is installed
if int(sagemaker.__version__.split('.')[0]) >= 2:
    !{sys.executable} -m pip install "sagemaker>=1.71.0,<2.0.0"
    print("Installing previous SageMaker version. Please restart the kernel")
else:
    print(sagemaker.__version__)
    print("Version is good")
# Install s3fs, which pandas uses to read files from S3
!pip install --upgrade s3fs
# Install the arff package for reading ARFF files (not needed for the CSV data used here)
!pip install --upgrade arff
# Install the Kaggle CLI used below to download the dataset
!pip install kaggle

Collecting pip
  Downloading pip-20.3-py2.py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 12.9 MB/s eta 0:00:01
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.2.4
    Uninstalling pip-20.2.4:
      Successfully uninstalled pip-20.2.4
Successfully installed pip-20.3
1.72.1
Version is good
Collecting botocore<1.17.45,>=1.17.44
  Using cached botocore-1.17.44-py2.py3-none-any.whl (6.5 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.18.18
    Uninstalling botocore-1.18.18:
      Successfully uninstalled botocore-1.18.18
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.15.8 requires botocore<1.19.0,>=1.18.8, but you have botocore 1.17.44 which is incompatible.
awscli 1.18.149 requires botocore=

In [62]:
import io
import json
import time
from time import gmtime, strftime, sleep
from urllib.parse import urlparse

import boto3
import botocore
import kaggle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.automl.automl import AutoML
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

To run this notebook, you first need the Credit Card Fraud dataset from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud). The cell below downloads it with the Kaggle CLI, which requires your Kaggle API credentials to be configured.

In [69]:
!kaggle datasets download -d mlg-ulb/creditcardfraud -o -q
!unzip -q creditcardfraud.zip 



First, let's take a quick look at the dataset.

In [70]:
fraud_df = pd.read_csv('creditcard.csv')

In [71]:
fraud_df.head(5)

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


Here, class 0 = No Fraud and class 1 = Fraud. As we can see, apart from Time and Amount, the columns (V1 to V28) are anonymized.


Now, we treat every attribute except *Class* as a predictor feature and use *Class* as the target attribute.

We also need to split the data into a training set and a test set. We keep **70%** of the data for training and **30%** for testing, using the scikit-learn utility `train_test_split`.
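
When the positive class is rare, a plain random split can leave very few fraud rows in one of the two parts. The sketch below illustrates a stratified 70/30 split in pure Python on made-up labels. It is not what the cell that follows does; with scikit-learn, passing `stratify=` to `train_test_split` has the same effect. `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=100):
    """Split row indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 990 legitimate rows and 10 fraud rows
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "fraud rows in the test set")
```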

In [72]:
target_variable = 'Class'
print (fraud_df[target_variable].value_counts())
train, test = train_test_split(fraud_df, test_size=.3, random_state=100)

0    284315
1       492
Name: Class, dtype: int64


**Please note that the binary label column *Class* is highly imbalanced (only 492 of the 284,807 transactions, about 0.17%, are fraudulent), a typical occurrence in financial use cases.**  
We will verify how well Autopilot handles this highly imbalanced data set.
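
The class counts printed above give the fraud share directly. They also show why plain accuracy is a poor metric here: a model that always predicts class 0 would still score very high accuracy. A short calculation using only those counts:

```python
# Class counts taken from the value_counts() output above
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_share = n_fraud / n_total
# Accuracy of a trivial model that labels every transaction as "No Fraud"
always_zero_accuracy = n_legit / n_total

print(f"fraud share: {fraud_share:.4%}")
print(f"accuracy of always predicting 0: {always_zero_accuracy:.4%}")
```

This is the reason the notebook compares the AUC and F1 objectives rather than accuracy.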

### Now, we will configure SageMaker Autopilot.
We name the job **automl-creditcard-fraud** and create a SageMaker client and session. We need an **S3** bucket to store the train/test data and all other artifacts Autopilot produces. We use the default **S3** bucket here, but you can create your own. The training and test data from the previous steps are uploaded to the **S3** bucket under "train" and "test" respectively. The training data keeps the `Class` column as the target (credit card fraud 0/1), while the test file contains the features only. The **S3Uri** field in `input_data_config` points Autopilot to the training data location, and **TargetAttributeName** indicates the target variable for the training job.

In [73]:
auto_ml_job_name = 'automl-creditcard-fraud'
sm = boto3.client('sagemaker')
session = sagemaker.Session()

prefix = 'sagemaker/' + auto_ml_job_name
bucket = session.default_bucket()
training_data = train
X_test = test.drop(columns = [target_variable])
y_test = test[target_variable]
test_data = X_test

train_file = 'train_data.csv'
training_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

test_file = 'test_data.csv'
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': target_variable
    }
  ]

Train data uploaded to: s3://sagemaker-us-east-1-245779447069/sagemaker/automl-creditcard-fraud/train/train_data.csv
Test data uploaded to: s3://sagemaker-us-east-1-245779447069/sagemaker/automl-creditcard-fraud/test/test_data.csv
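
The two uploads use different CSV layouts: the training file keeps the header row and the `Class` column so that Autopilot can find the target, while the test file drops both so that only raw feature rows are sent for inference. A small standard-library sketch of the same convention on made-up rows (the helper names and values are illustrative only):

```python
import csv
import io

# Made-up rows standing in for the real dataset
rows = [
    {"Time": 0.0, "Amount": 149.62, "Class": 0},
    {"Time": 1.0, "Amount": 378.66, "Class": 1},
]
fieldnames = ["Time", "Amount", "Class"]
target = "Class"


def to_training_csv(rows, fieldnames):
    """Header row plus every column, including the target."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_inference_csv(rows, fieldnames, target):
    """No header and no target column: features only."""
    features = [name for name in fieldnames if name != target]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in features])
    return buf.getvalue()


print(to_training_csv(rows, fieldnames))
print(to_inference_csv(rows, fieldnames, target))
```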


Now we need to create the Autopilot job. We set the maximum number of candidate models, each trained with different parameters, to 200 (attribute `max_candidates`). We also set `ProblemType='BinaryClassification'` and set `MetricName` (parameter `job_objective`) to AUC or F1 (`eval_obj`). Please note that you do not need to set `ProblemType` and `MetricName`: if you leave these two fields unset, Autopilot automatically determines the type of supervised learning problem by analyzing the data (for a binary classification problem, the default metric is F1). You can find all options for the job configuration here (https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-auto-ml-job.html).

Note that depending on the number of candidates you run (here we set it to 200), it may take a couple of hours to run both the AUC and F1 jobs.
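
For readers who prefer the low-level API, the same settings map onto a `create_auto_ml_job` request. The sketch below only builds the request dictionary. The field names mirror the job description printed later in this notebook, while `build_automl_request`, the bucket, and the role ARN are placeholders introduced for illustration. Treating the problem type and the objective metric as a pair is an assumption of this sketch and should be checked against the API reference.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                         target, problem_type=None, metric_name=None,
                         max_candidates=200):
    """Assemble keyword arguments for a CreateAutoMLJob request (sketch)."""
    # Assumption: problem type and objective metric are set together or not at all.
    if (problem_type is None) != (metric_name is None):
        raise ValueError("set problem_type and metric_name together, or neither")

    request = {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            "TargetAttributeName": target,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        "RoleArn": role_arn,
    }
    if problem_type is not None:
        request["ProblemType"] = problem_type
        request["AutoMLJobObjective"] = {"MetricName": metric_name}
    return request


request = build_automl_request(
    job_name="automl-creditcard-fraud",
    train_s3_uri="s3://example-bucket/sagemaker/automl-creditcard-fraud/train",    # placeholder
    output_s3_uri="s3://example-bucket/sagemaker/automl-creditcard-fraud/output",  # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",                         # placeholder
    target="Class",
    problem_type="BinaryClassification",
    metric_name="AUC",
)
print(request["AutoMLJobObjective"])
```

The resulting dictionary could then be passed as `boto3.client('sagemaker').create_auto_ml_job(**request)`.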

In [86]:
def create_automl_object(eval_obj,base_job_name):
    
    target_attribute_name = target_variable
    role = get_execution_role()
    automl = AutoML(role=role,
                    target_attribute_name=target_attribute_name,
                    base_job_name=base_job_name,
                    sagemaker_session=session,
                    problem_type='BinaryClassification',
                    job_objective={'MetricName': eval_obj},
                    max_candidates=200) # Including a max candidates will let you limit the number of AutoPilot jobs to run
    return automl

After the AutoML object is created, we call the fit() function to train the AutoML object.

In [75]:
def automl_fit(automl,base_job_name):
    automl.fit(train_file, job_name=base_job_name, wait=False, logs=False)

In [76]:
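def check_status(automl):
    describe_response = automl.describe_auto_ml_job()
    print(describe_response)
    job_run_status = describe_response['AutoMLJobStatus']

    # Poll every 30 seconds until the job reaches a terminal state
    while job_run_status not in ('Failed', 'Completed', 'Stopped'):
        describe_response = automl.describe_auto_ml_job()
        job_run_status = describe_response['AutoMLJobStatus']
        print(job_run_status)
        sleep(30)
    # A failed or stopped job has no usable best candidate, so stop here
    if job_run_status != 'Completed':
        raise RuntimeError('AutoML job ended with status ' + job_run_status)
    print('completed')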
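
The polling loop above runs until the job reaches a terminal state, however long that takes. A variant with an overall timeout can be sketched as follows. `wait_for_status` and its parameters are hypothetical names, and the status source is passed in as a function so the logic can be tried without calling AWS.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_status(get_status, poll_seconds=30, timeout_seconds=4 * 3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal state is seen or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated job that completes on the third poll (no real waiting)
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

With the notebook's objects, `get_status` could be `lambda: automl.describe_auto_ml_job()['AutoMLJobStatus']`.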

In [77]:
def get_best_candidate(automl):
    best_candidate = automl.describe_auto_ml_job()['BestCandidate']
    best_candidate_name = best_candidate['CandidateName']
    print(best_candidate)
    print('\n')
    print("CandidateName: " + best_candidate_name)
    print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
    print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))
    return best_candidate_name,best_candidate

We now create a model from the **best candidate**. In addition to the predicted label, we want the **probability** of the prediction; this probability is used later to plot the ROC and precision-recall curves.

In [78]:
def create_model(automl,best_candidate_name,best_candidate,timestamp_suffix):
    model_name = 'automl-cardfraud-model-' + timestamp_suffix
    inference_response_keys = ['predicted_label', 'probability']
    model = automl.create_model(name=best_candidate_name,
                                candidate=best_candidate,inference_response_keys=inference_response_keys)
    return model
                                

Once the model is created, we run a Transform job to get inferences (i.e., fraud predictions) for the test data set and save them to S3. It is worth noting that when you deploy the model as an endpoint or create a Transform job, SageMaker handles the deployment of the feature engineering pipeline and the ML algorithm, so end users can send the data in its raw format for inference.

In [79]:
def create_transformer(model,best_candidate,eval_obj):
    s3_transform_output_path = 's3://{}/{}/inference-results/'.format(bucket, prefix);
    output_path = s3_transform_output_path + best_candidate['CandidateName'] +'/'
    transformer=model.transformer(instance_count=1, 
                              instance_type='ml.m5.xlarge',
                              assemble_with='Line',
                              output_path=output_path)
    transformer.transform(data=test_data_s3_path, split_type='Line', content_type='text/csv', wait=False)
    return transformer
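
Batch transform writes each result under the configured output path, using the input file name with `.out` appended, which is why the next cell looks for `test_data.csv.out`. The sketch below derives the bucket and object key from an output path. `expected_output_location` is a hypothetical helper, and the bucket and candidate names are placeholders.

```python
from urllib.parse import urlparse


def expected_output_location(output_path, input_file_name):
    """Return (bucket, key) where the transform result is expected to appear."""
    parsed = urlparse(output_path)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    return bucket, f"{prefix}/{input_file_name}.out"


bucket_name, key = expected_output_location(
    "s3://example-bucket/sagemaker/automl-creditcard-fraud/inference-results/candidate-1/",
    "test_data.csv",
)
print(bucket_name)
print(key)
```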

In [80]:
def return_pred_df(transformer):
    print ('***predict output path ***')
    print (transformer.output_path, '{}.out'.format(test_file))
    pred_csv = get_csv_from_s3(transformer.output_path,'{}.out'.format(test_file))
    data=pd.read_csv(io.StringIO(pred_csv), header=None)
    data.columns= ['label', 'proba']    
    return data
def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip('/')
    s3 = boto3.resource('s3')
    pred_body = None
    # The transform job was started with wait=False, so poll until the output object can be read
    while pred_body is None:
        try:
            obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
            pred_body = obj.get()["Body"].read().decode('utf-8')
            print('prediction file is available in s3')
        except botocore.exceptions.ClientError as e:
            print('prediction file still not available in s3, sleeping for 2 minutes')
            time.sleep(120)
    return pred_body
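
Each line of the transform output holds the two requested response keys in order: the predicted label, then the probability. As a dependency-free illustration of the parsing that `return_pred_df` does with pandas, here is a sketch on made-up output text (`parse_predictions` is a hypothetical helper, and the values are not real results):

```python
import csv
import io


def parse_predictions(text):
    """Parse 'label,probability' lines into a list of (int, float) tuples."""
    predictions = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        label, proba = row
        predictions.append((int(label), float(proba)))
    return predictions


# Made-up sample in the same two-column layout
sample = "0,0.0012\n1,0.9731\n0,0.0450\n"
for label, proba in parse_predictions(sample):
    print(label, proba)
```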


Autopilot also generates two notebooks. The Candidate Definition notebook can be downloaded from the S3 location printed by the function below. The Data Exploration notebook shows the details of Autopilot's data analysis and provides insights about the dataset you supplied as input to the AutoML job.

In [81]:
def download_notebooks(automl,eval_obj):
    print ("download CandidateDefinitionNotebookLocation for " + eval_obj)
    print (automl.describe_auto_ml_job()['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation'])
    print ("download DataExplorationNotebookLocation for " + eval_obj)
    print (automl.describe_auto_ml_job()['AutoMLJobArtifacts']['DataExplorationNotebookLocation'])

In [82]:
def run_automl_process(eval_obj):
    timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
    base_job_name = 'automl-card-fraud-' + eval_obj + timestamp_suffix
    print (base_job_name)
    automl = create_automl_object(eval_obj,base_job_name)
    automl_fit(automl,base_job_name)
    check_status(automl)
    best_candidate_name,best_candidate=get_best_candidate(automl)
    model = create_model(automl,best_candidate_name,best_candidate,timestamp_suffix)
    transformer=create_transformer(model,best_candidate,eval_obj)
    pred_df = return_pred_df(transformer)
    download_notebooks(automl,eval_obj)
    return pred_df
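
Job names built this way sit right at the length limit. SageMaker restricts AutoML job names to 32 characters (an assumption based on the CreateAutoMLJob API reference; verify it for your environment), and `automl-card-fraud-AUC` followed by the `%d-%H-%M-%S` suffix is exactly 32 characters long. A hedged sketch that builds the name the same way and checks its length (`make_job_name` is a hypothetical helper):

```python
from time import gmtime, strftime

MAX_JOB_NAME_LENGTH = 32  # assumed limit, see the note above


def make_job_name(eval_obj, now=None):
    """Build the job name as run_automl_process does and check its length."""
    suffix = strftime("%d-%H-%M-%S", now if now is not None else gmtime())
    name = "automl-card-fraud-" + eval_obj + suffix
    if len(name) > MAX_JOB_NAME_LENGTH:
        raise ValueError(f"job name {name!r} is {len(name)} characters long")
    return name


print(make_job_name("AUC"))
print(make_job_name("F1"))
```

A longer metric name, such as `Accuracy`, would push the name past the limit, so a shorter prefix would be needed in that case.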

Now we are ready to run the Autopilot job. We call the wrapper function `run_automl_process` twice, once with the AUC objective and once with the F1 objective.

In [87]:
print ('*********running with eval objective AUC***********')
data_auc =run_automl_process('AUC')
print ('*********running with eval objective F1***********')
data_f1 = run_automl_process('F1')

*********running with eval objective AUC***********
automl-card-fraud-AUC02-23-24-20


's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


{'AutoMLJobName': 'automl-card-fraud-AUC02-23-24-20', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:245779447069:automl-job/automl-card-fraud-auc02-23-24-20', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-245779447069/auto-ml-input-data/train_data.csv'}}, 'TargetAttributeName': 'Class'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-east-1-245779447069/'}, 'RoleArn': 'arn:aws:iam::245779447069:role/service-role/AmazonSageMaker-ExecutionRole-20200330T163636', 'AutoMLJobObjective': {'MetricName': 'AUC'}, 'ProblemType': 'BinaryClassification', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 100}, 'SecurityConfig': {'EnableInterContainerTrafficEncryption': False}}, 'CreationTime': datetime.datetime(2020, 12, 2, 23, 24, 21, 570000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 12, 2, 23, 24, 21, 570000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'St

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


completed
{'CandidateName': 'tuning-job-1-5dda68c52e924b0c80-080-95aab5e1', 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:auc', 'Value': 0.9828900098800659}, 'ObjectiveStatus': 'Succeeded', 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:245779447069:processing-job/db-1-d8c4bcd892d34c5eb1f85f660233fe799384c1230d0a4707b8417df8a8', 'CandidateStepName': 'db-1-d8c4bcd892d34c5eb1f85f660233fe799384c1230d0a4707b8417df8a8'}, {'CandidateStepType': 'AWS::SageMaker::TrainingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:245779447069:training-job/automl-car-dpp5-1-a270026acb9d46fe8d63ec2adb5e9e39a67c8b31aff04', 'CandidateStepName': 'automl-car-dpp5-1-a270026acb9d46fe8d63ec2adb5e9e39a67c8b31aff04'}, {'CandidateStepType': 'AWS::SageMaker::TransformJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:245779447069:transform-job/automl-car-dpp5-csv-1-abb41109dc804e73b3f003b1b10a682c1a1ab4f84', 'CandidateS

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


{'AutoMLJobName': 'automl-card-fraud-F103-00-26-10', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:245779447069:automl-job/automl-card-fraud-f103-00-26-10', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-245779447069/auto-ml-input-data/train_data.csv'}}, 'TargetAttributeName': 'Class'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-east-1-245779447069/'}, 'RoleArn': 'arn:aws:iam::245779447069:role/service-role/AmazonSageMaker-ExecutionRole-20200330T163636', 'AutoMLJobObjective': {'MetricName': 'F1'}, 'ProblemType': 'BinaryClassification', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 100}, 'SecurityConfig': {'EnableInterContainerTrafficEncryption': False}}, 'CreationTime': datetime.datetime(2020, 12, 3, 0, 26, 12, 315000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 12, 3, 0, 26, 15, 296000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'Analyzi

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


completed
{'CandidateName': 'tuning-job-1-291fb65b80b9434eb7-016-a6e808cc', 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:f1', 'Value': 0.9365100264549255}, 'ObjectiveStatus': 'Succeeded', 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:245779447069:processing-job/db-1-ea54cb9b1fab4dafba0dd99b2c7a2b511fe8b444c88b4c5899abe8785f', 'CandidateStepName': 'db-1-ea54cb9b1fab4dafba0dd99b2c7a2b511fe8b444c88b4c5899abe8785f'}, {'CandidateStepType': 'AWS::SageMaker::TrainingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:245779447069:training-job/automl-car-dpp5-1-40c039b935a44d6c97f75fd3b73cb1ccd706741ee4fb4', 'CandidateStepName': 'automl-car-dpp5-1-40c039b935a44d6c97f75fd3b73cb1ccd706741ee4fb4'}, {'CandidateStepType': 'AWS::SageMaker::TransformJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:245779447069:transform-job/automl-car-dpp5-csv-1-2027b9ed4b804a0cbedf8a9e9228ba31d164a504e', 'CandidateSt

Now we plot the ROC curve, which shows the true positive rate (fraud cases correctly flagged) against the false positive rate (transactions predicted as fraud that are not fraud in the ground truth) across decision thresholds. The area under this curve (AUC) summarizes it in a single number. The higher the prediction quality of the classification model, the more the ROC curve bends toward the top left corner.
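
AUC also has a direct interpretation: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case, with ties counted as one half. The dependency-free sketch below computes it that way on a tiny made-up example. `pairwise_auc` is a hypothetical helper; it is quadratic in the number of rows, so it is meant for illustration rather than for the full test set.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 (fraud, non-fraud) pairs are ordered correctly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```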

In [1]:
from sklearn import metrics
colors = ['blue','green']
model_names = ['Objective : AUC','Objective : F1']
models = [data_auc,data_f1]
for i in range(0,len(models)):
    fpr, tpr, _ = metrics.roc_curve(y_test, models[i]['proba'])
    auc_score = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label=str('Auto Pilot {:.2f} '+ model_names[i]).format(auc_score),color=colors[i])

plt.xlim([-0.1,1.1])
plt.ylim([-0.1,1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.title('ROC Curve')
plt.show()


The precision-recall curve shows the trade-off between precision and recall. The best models have a curve that stays flat at high precision and drops steeply only as recall approaches 1. The higher both precision and recall are, the more the curve bends toward the upper right corner.
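
For reference, precision, recall, and F1 at a single threshold follow directly from the confusion counts. A minimal pure-Python sketch on made-up labels (`precision_recall_f1` is a hypothetical helper; the notebook itself uses the scikit-learn functions):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 fraud cases, 3 caught, plus 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```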

In [None]:
colors = ['blue','green']
model_names = ['Objective : AUC','Objective : F1']
models = [data_auc,data_f1]

print ('model ', 'F1 ', 'precision ', 'recall ')
for i in range(0,len(models)):
    precision, recall, _ = precision_recall_curve(y_test, models[i]['proba'])
    print (model_names[i],f1_score(y_test, np.array(models[i]['label'])),precision_score(y_test, models[i]['label']),recall_score(y_test, models[i]['label']) )
    plt.plot(recall,precision,color=colors[i],label=model_names[i])
        
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='upper right')
plt.show() 

---
## Conclusion <a name="Conclusion"></a>
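We can see that, with very little data science knowledge, we were able to build an accurate fraud prediction model for the credit card fraud dataset: the best candidates reached a validation AUC of about 0.98 and a validation F1 of about 0.94 in the two jobs above. From the ROC and precision-recall plots, we can also see that Autopilot handled the highly imbalanced data well.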

---
## Cleanup <a name="Cleanup"></a>

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The code below, when uncommented, deletes them. This operation also deletes all the generated models and the auto-generated notebooks.

In [3]:
# s3 = boto3.resource('s3')
# s3_bucket = s3.Bucket(bucket)

# s3_bucket.objects.filter(Prefix=prefix).delete()

In [None]:
#transformer.delete_model()