# EmOpti Workshop - AutoML

The `Python 3 (Data Science)` kernel works well with this notebook.

In [None]:
import sagemaker
import boto3
from sagemaker import get_execution_role

region = boto3.Session().region_name

session = sagemaker.Session()
s3bucket = session.default_bucket()
s3prefix = "emopti"

role = get_execution_role()
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

### Upload the dataset to Amazon S3
Copy the training and test files, in CSV format, to Amazon Simple Storage Service (Amazon S3) so that Amazon SageMaker can use them.
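
The job configuration later in this notebook refers to the uploaded training file by a hand-written S3 URI, so it helps to know what URI to expect from the upload. The sketch below assumes that uploading a single file places it at `s3://<bucket>/<key_prefix>/<filename>`. The helper `expected_s3_uri` and the bucket name `my-bucket` are hypothetical and used only for illustration.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI a single-file upload is assumed to produce."""
    # Only the file name is kept; the local directory is not part of the key.
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


# "my-bucket" is a placeholder, not the notebook's real default bucket.
print(expected_s3_uri("my-bucket", "emopti/automl/data", "data/train.csv"))
```

Comparing this value with the string printed by the upload cell is a quick way to catch a prefix typo before starting a job.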

In [None]:
train_filename = 'train.csv'
test_filename = 'test.csv'

train_data_s3path = session.upload_data(bucket=s3bucket, path=f'data/{train_filename}', key_prefix=f'{s3prefix}/automl/data')
print("Train data uploaded to: " + train_data_s3path)

test_data_s3path = session.upload_data(bucket=s3bucket, path=f'data/{test_filename}', key_prefix=f'{s3prefix}/automl/data')
print("Test data uploaded to: " + test_data_s3path)

## Setting up the SageMaker Autopilot Job<a name="Settingup"></a>

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset. 

The required inputs for invoking an Autopilot job are:
* Amazon S3 location for input dataset and for all output artifacts
* Name of the column of the dataset you want to predict (`calc_disp` in this case) 
* An IAM role

Currently Autopilot supports only tabular datasets in CSV format. 

Either every file must have a header row, or the first file of the dataset, when the files are sorted in alphabetical (lexical) order, must have one.
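
Because the target column is identified by name, a missing header row is an easy mistake to make. A minimal sketch of a local check follows, using only the standard library. The helper `header_has_target` and the sample rows are hypothetical; only the column name `calc_disp` comes from this notebook.

```python
import csv
import io


def header_has_target(csv_text, target):
    """Return True if the first CSV row contains the target column name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the text has no rows
    return target in [name.strip() for name in header]


# Hypothetical sample; the feature columns are made up for illustration.
sample = "age,heart_rate,calc_disp\n54,88,ADMIT\n31,72,DISCHARGE\n"
print(header_has_target(sample, "calc_disp"))
```

To check a real file, read its first lines with `open(...)` and pass the text to the same helper.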


In [None]:
auto_ml_job_config = {
    "CompletionCriteria": {
        "MaxCandidates": 5
    }
}

input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f's3://{s3bucket}/{s3prefix}/automl/data/train.csv'
            }
        },
        "TargetAttributeName": "calc_disp",
    }
]

output_data_config = {
    "S3OutputPath": f's3://{s3bucket}/{s3prefix}/automl/results/training'
}

You can also specify the type of problem you want to solve with your dataset (`Regression`, `MulticlassClassification`, or `BinaryClassification`). If you are not sure, SageMaker Autopilot infers the problem type from statistics of the target column (the column you want to predict).

You can limit the running time of a SageMaker Autopilot job in two ways: by capping the number of pipeline evaluations (each evaluation is called a `Candidate` because it generates a candidate model), or by capping the total time allocated to the job. Under default settings, this job takes about four hours to run. The duration varies between runs because of the exploratory process Autopilot uses to find optimal training parameters.
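
The two kinds of limit can be combined in the same `CompletionCriteria` block. The sketch below builds such a block. The helper `build_completion_criteria` is hypothetical, and the time-limit values are arbitrary examples, not recommendations. The key names `MaxAutoMLJobRuntimeInSeconds` and `MaxRuntimePerTrainingJobInSeconds` are taken from the `CreateAutoMLJob` API and should be checked against the current API reference.

```python
def build_completion_criteria(max_candidates=None,
                              max_job_seconds=None,
                              max_training_job_seconds=None):
    """Return an AutoMLJobConfig dict containing only the limits that were set."""
    criteria = {}
    if max_candidates is not None:
        criteria["MaxCandidates"] = int(max_candidates)
    if max_job_seconds is not None:
        # Upper bound for the whole Autopilot job.
        criteria["MaxAutoMLJobRuntimeInSeconds"] = int(max_job_seconds)
    if max_training_job_seconds is not None:
        # Upper bound for each individual training job.
        criteria["MaxRuntimePerTrainingJobInSeconds"] = int(max_training_job_seconds)
    return {"CompletionCriteria": criteria}


# Same candidate cap as this notebook, plus an example one-hour job limit.
print(build_completion_criteria(max_candidates=5, max_job_seconds=3600))
```

Calling the helper with only `max_candidates=5` reproduces the configuration used in this notebook.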

## Launching the SageMaker Autopilot Job<a name="Launching"></a>

You can now launch the Autopilot job by calling the `create_auto_ml_job` API. 

In [None]:
from time import gmtime, strftime, sleep

auto_ml_job_name = f'automl-job-{strftime("%Y%m%d-%H%M", gmtime())}'
print("AutoMLJobName: " + auto_ml_job_name)

sm.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig=auto_ml_job_config,
    # ProblemType='BinaryClassification'|'MulticlassClassification'|'Regression',
    ProblemType='BinaryClassification',
    # AutoMLJobObjective = {'MetricName': 'Accuracy'|'MSE'|'F1'|'F1macro'|'AUC'}
    AutoMLJobObjective={'MetricName': 'F1'},    
    RoleArn=role,
)

## Tracking SageMaker Autopilot job progress<a name="Tracking"></a>
A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 
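
The job moves through these steps while its top-level status stays in progress, so tracking amounts to polling until a terminal status appears. The sketch below shows one way to write that loop with a timeout. The helper `wait_for_job`, its parameters, and the fake responses are hypothetical. In the notebook, `describe` would wrap the real describe call instead.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_job(describe, status_key, poll_seconds=30,
                 timeout_seconds=4 * 60 * 60, sleep=time.sleep):
    """Call describe() until status_key reaches a terminal state or time runs out."""
    waited = 0
    while True:
        response = describe()
        status = response[status_key]
        print(status + " - " + response.get("AutoMLJobSecondaryStatus", ""))
        if status in TERMINAL_STATES:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake responses stand in for the service so the sketch runs offline.
fake_responses = iter([
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "ModelTuning"},
    {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
])
final = wait_for_job(lambda: next(fake_responses), "AutoMLJobStatus",
                     sleep=lambda seconds: None)
```

Passing the describe call as a function keeps the same loop usable for both the Autopilot job and the later transform job, which differ only in the status key.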

In [None]:
%%time
print("JobStatus - Secondary Status")
print("------------------------------")


describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

# Poll every 30 seconds until the job reaches a terminal status.
while job_run_status not in ("Failed", "Completed", "Stopped"):
    sleep(30)
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )

In [None]:
describe_response

## Results

Now use the `describe_auto_ml_job` API to look up the best candidate selected by the SageMaker Autopilot job.

In [None]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]
print(best_candidate)
print("\n")
print("CandidateName: " + best_candidate_name)
print(
    "FinalAutoMLJobObjectiveMetricName: "
    + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"]
)
print(
    "FinalAutoMLJobObjectiveMetricValue: "
    + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
)

### Perform batch inference using the best candidate

Now that you have successfully completed the SageMaker Autopilot job on the dataset, create a model from any of the candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). 

In [None]:
model_name = f'automl-model-{strftime("%Y%m%d-%H%M", gmtime())}'

model = sm.create_model(
    Containers=best_candidate["InferenceContainers"], ModelName=model_name, ExecutionRoleArn=role
)

print("Model ARN corresponding to the best candidate is : {}".format(model["ModelArn"]))

You can run batch inference with Amazon SageMaker batch transform. The same model can also be deployed for online inference using Amazon SageMaker hosting.

In [None]:
transform_job_name = f'automl-transform-{strftime("%Y%m%d-%H%M", gmtime())}'

transform_input = {
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix", 
            "S3Uri": f's3://{s3bucket}/{s3prefix}/automl/data/test.csv'
        }},
    "ContentType": "text/csv",
    "CompressionType": "None",
    "SplitType": "Line",
}

transform_output = {
    "S3OutputPath": f"s3://{s3bucket}/{s3prefix}/automl/results/inference",
}

transform_resources = {
    "InstanceType": "ml.m5.4xlarge", 
    "InstanceCount": 1
}

sm.create_transform_job(
    TransformJobName=transform_job_name,
    ModelName=model_name,
    TransformInput=transform_input,
    TransformOutput=transform_output,
    TransformResources=transform_resources,
)

Watch the transform job for completion.

In [None]:
print("JobStatus")
print("----------")


describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response["TransformJobStatus"]
print(job_run_status)

# Poll every 30 seconds until the transform job reaches a terminal status.
while job_run_status not in ("Failed", "Completed", "Stopped"):
    sleep(30)
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response["TransformJobStatus"]
    print(job_run_status)

In [None]:
describe_response

Get the predictions (the output of the transform job):

In [None]:
import pandas as pd

s3_output_key = f"{s3prefix}/automl/results/inference/test.csv.out"
local_inference_results_path = "automl-inference_results.csv"

s3 = boto3.resource("s3")
inference_results_bucket = s3.Bucket(s3bucket)
inference_results_bucket.download_file(s3_output_key, local_inference_results_path)

df_preds = pd.read_csv(local_inference_results_path, sep=";")
pd.set_option("display.max_rows", 20)  # Keep the output on one page
df_preds

### View other candidates explored by SageMaker Autopilot
You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric.
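
Sorting by the final metric can also be done locally once the candidate list has been fetched. The sketch below ranks candidates from best to worst, assuming a metric where higher is better (as with F1). The helper `rank_candidates` and the toy entries are hypothetical. The toy dictionaries mimic only the two fields this notebook reads from each candidate.

```python
def rank_candidates(candidates, descending=True):
    """Return candidates ordered by their final objective metric value."""
    return sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=descending,
    )


# Made-up names and scores, for illustration only.
toy_candidates = [
    {"CandidateName": "cand-a", "FinalAutoMLJobObjectiveMetric": {"Value": 0.71}},
    {"CandidateName": "cand-b", "FinalAutoMLJobObjectiveMetric": {"Value": 0.84}},
    {"CandidateName": "cand-c", "FinalAutoMLJobObjectiveMetric": {"Value": 0.62}},
]

for index, candidate in enumerate(rank_candidates(toy_candidates), start=1):
    value = candidate["FinalAutoMLJobObjectiveMetric"]["Value"]
    print(f"{index}  {candidate['CandidateName']}  {value}")
```

For an error metric such as MSE, where lower is better, pass `descending=False`.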

In [None]:
candidates = sm.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name, SortBy="FinalObjectiveMetricValue"
)["Candidates"]
# Print a 1-based rank, the candidate name, and its final metric value.
for index, candidate in enumerate(candidates, start=1):
    print(
        str(index)
        + "  "
        + candidate["CandidateName"]
        + "  "
        + str(candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
    )

Load the labels for the test data and show only the `ADMIT` rows, so that you can see how many admissions the test set contains.

In [None]:
from sklearn.metrics import confusion_matrix

df_test = pd.read_csv('data/test_labels.csv')
df_test[df_test['DISCHARGE'] == 'ADMIT']


### Confusion Matrix
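
A confusion matrix only makes sense when the true labels and the predictions have the same length and the same row order. The standard-library sketch below shows the layout on toy data: rows are true labels, columns are predicted labels, and labels are sorted alphabetically, which matches the convention of scikit-learn's `confusion_matrix`. The helpers `confusion_counts` and `f1_for` and the toy label lists are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) with rows = true labels and columns = predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def f1_for(label, y_true, y_pred):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy labels, not taken from the workshop data.
y_true = ["ADMIT", "ADMIT", "DISCHARGE", "DISCHARGE", "DISCHARGE"]
y_pred = ["ADMIT", "DISCHARGE", "DISCHARGE", "DISCHARGE", "ADMIT"]

labels, matrix = confusion_counts(y_true, y_pred)
print(labels)
print(matrix)
print(f1_for("ADMIT", y_true, y_pred))
```

The length check mirrors the error scikit-learn raises on mismatched inputs, which is worth keeping in mind when slicing the predictions before comparing them with the labels.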

In [None]:
# Both inputs must have the same number of rows, in the same order.
cm = confusion_matrix(df_test, df_preds[1:])
cm

In [None]:
import numpy as np
import seaborn as sns
import matplotlib

#labels = [f'True Neg\n{cm[0][0]}', f'False Pos\n{cm[0][1]}', f'False Neg\n{cm[1][0]}', f'True Pos\n{cm[1][1]}']
#labels = np.asarray(labels).reshape(2,2)
ax = sns.heatmap(cm, annot=True, fmt='', cmap='Blues')
ax.set_xticklabels(['ADMIT', 'DISCHARGE'])
ax.set_yticklabels(['ADMIT', 'DISCHARGE'])
ax.set(ylabel = "True Label", xlabel = "Predicted Label")



### Candidate Definition Notebook
    
SageMaker Autopilot also auto-generates a Candidate Definitions notebook. You can use this notebook to step interactively through the steps Autopilot took to arrive at the best candidate. You can also use it to override runtime parameters such as parallelism, hardware, the algorithms explored, and the feature extraction scripts.
    
The notebook can be downloaded from the following Amazon S3 location:

In [None]:
s3notebook = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["AutoMLJobArtifacts"][
    "CandidateDefinitionNotebookLocation"
]
s3notebook

In [None]:
!aws s3 cp $s3notebook .

### Data Exploration Notebook
SageMaker Autopilot also auto-generates a Data Exploration notebook, which can be downloaded from the following Amazon S3 location:

In [None]:
s3notebook = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["AutoMLJobArtifacts"][
    "DataExplorationNotebookLocation"
]
s3notebook

In [None]:
!aws s3 cp $s3notebook .