## Create Data Exploration and Candidate Notebooks with SageMaker Autopilot

### Introduction
Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs.

### Problem Definition
Reference: https://www.kaggle.com/kevinarvai/clinvar-conflicting

[ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) is a public resource containing annotations about human genetic variants. These variants are (usually manually) classified by clinical laboratories on a categorical spectrum ranging from benign, through likely benign, uncertain significance, and likely pathogenic, to pathogenic. Variants that have conflicting classifications (from laboratory to laboratory) can cause confusion when clinicians or researchers try to interpret whether the variant has an impact on the disease of a given patient.
The objective is to predict whether a ClinVar variant will have conflicting classifications. This is presented here as a binary classification problem, where each record in the dataset is a genetic variant.

### Acknowledgements
Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K, Katz K, Liu C, Maddipatla Z, Malheiro A, McDaniel K, Ovetsky M, Riley G, Zhou G, Holmes JB, Kattman BL, Maglott DR. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018 Jan 4. PubMed PMID: 29165669.

### Setup

Let's start by specifying:

* The Region name and the SageMaker session.
* The S3 bucket and prefix that you want to use for training and model data. This should be within the same Region as the notebook instance, training, and hosting.
* The IAM role ARN used to give training and hosting access to your data.

In [None]:
import sagemaker
import boto3
import os, jmespath
from sagemaker import get_execution_role
import pandas as pd
from time import gmtime, strftime, sleep

region = boto3.Session().region_name

session = sagemaker.Session()
bucket = session.default_bucket()

prefix = 'sagemaker/autopilot-vc'

role = get_execution_role()

sm = boto3.Session().client(service_name='sagemaker',region_name=region)

### Get datalake bucket

In [None]:
cfn = boto3.client('cloudformation')

project_name = os.environ.get('RESOURCE_PREFIX')
resources = cfn.describe_stacks(StackName='{0}-Pipeline'.format(project_name))
query = 'Stacks[].Outputs[?OutputKey==`DataLakeBucket`].OutputValue'
data_lake_bucket = jmespath.search(query, resources)[0][0]
print(data_lake_bucket)
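
The JMESPath query above can also be read as a plain dictionary traversal. The following is a minimal sketch, assuming the usual shape of a describe_stacks response; the helper get_stack_output and the sample response are hypothetical and only illustrate the lookup.

```python
def get_stack_output(describe_stacks_response, output_key):
    """Return the OutputValue for output_key from a describe_stacks response."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output["OutputValue"]
    raise KeyError("Stack output not found: " + output_key)


# Hypothetical response, trimmed to the fields the lookup needs
sample_response = {
    "Stacks": [
        {
            "StackName": "example-Pipeline",
            "Outputs": [
                {"OutputKey": "DataLakeBucket", "OutputValue": "example-data-lake-bucket"},
            ],
        }
    ]
}

print(get_stack_output(sample_response, "DataLakeBucket"))
```

When the output is missing, indexing the query result with [0][0] raises an IndexError with no context, whereas this version raises a KeyError that names the missing output.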

### Dataset
Let's load the raw data into a dataframe. The raw data is stored in S3 in the file clinvar_conflicting.csv, which was downloaded from the following location: https://github.com/arvkevi/clinvar-kaggle/blob/master/clinvar_conflicting.csv

In [None]:
# Load the raw data into a dataframe from S3
# (reading s3:// paths with pandas requires the s3fs package)
raw_data = pd.read_csv("s3://{0}/annotation/clinvar/conflicting/clinvar_conflicting.csv".format(data_lake_bucket))

# Take 80% of the data for training
train_data = raw_data.sample(frac=0.8, random_state=200)

# Take the remaining 20% for testing
test_data = raw_data.drop(train_data.index)

# Save the train and test data as CSV files and upload them to S3
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

test_file = 'test_data.csv'
test_data.to_csv(test_file, index=False, header=True)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)

train_data.head()
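
Because the split above is a simple random sample rather than a stratified one, it can be worth checking that the label proportions are similar in both subsets. The following is a minimal sketch using only the standard library; class_balance is a hypothetical helper and the label lists are made up for illustration.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of records per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration; not taken from the ClinVar data
example_train_labels = [0, 0, 0, 1, 0, 0, 1, 0]
example_test_labels = [0, 1]

print(class_balance(example_train_labels))
print(class_balance(example_test_labels))
```

In this notebook the same helper could be applied to train_data['CLASS'] and test_data['CLASS'], since iterating over a pandas Series yields its values.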


### Setting up the SageMaker Autopilot Job
After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset.

The required inputs for invoking an Autopilot job are:

* Amazon S3 location for input dataset and for all output artifacts
* Name of the column of the dataset you want to predict (CLASS in this case)
* An IAM role



In [None]:
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'CLASS'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

You can also specify the type of problem you want to solve with your dataset (Regression, MulticlassClassification, BinaryClassification). In case you are not sure, SageMaker Autopilot will infer the problem type based on statistics of the target column (the column you want to predict).

You have the option to limit the running time of a SageMaker Autopilot job by providing either the maximum number of pipeline evaluations or candidates (one pipeline evaluation is called a Candidate because it generates a candidate model) or providing the total time allocated for the overall Autopilot job. Under default settings, this job takes about four hours to run. This varies between runs because of the nature of the exploratory process Autopilot uses to find optimal training parameters.
For our model, we generate only the candidate notebooks and explore them ourselves instead of running the complete default experiment. This is done by setting the flag "GenerateCandidateDefinitionsOnly=True".
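
The optional settings described above map onto additional arguments of create_auto_ml_job. The following is a sketch, assuming the boto3 parameter names ProblemType, AutoMLJobObjective, and AutoMLJobConfig with its CompletionCriteria. The helper build_optional_job_args, the F1 metric, and the limits are illustrative choices and do not come from this notebook.

```python
def build_optional_job_args(problem_type=None, metric_name=None,
                            max_candidates=None, max_runtime_seconds=None):
    """Build optional keyword arguments for create_auto_ml_job."""
    args = {}
    if problem_type is not None:
        # Assumption: the service expects an objective metric whenever
        # the problem type is set explicitly
        if metric_name is None:
            raise ValueError("metric_name is required when problem_type is set")
        args["ProblemType"] = problem_type
    if metric_name is not None:
        args["AutoMLJobObjective"] = {"MetricName": metric_name}

    criteria = {}
    if max_candidates is not None:
        criteria["MaxCandidates"] = max_candidates
    if max_runtime_seconds is not None:
        criteria["MaxAutoMLJobRuntimeInSeconds"] = max_runtime_seconds
    if criteria:
        args["AutoMLJobConfig"] = {"CompletionCriteria": criteria}
    return args


# Illustrative values only
optional_args = build_optional_job_args(
    problem_type="BinaryClassification",
    metric_name="F1",
    max_candidates=10,
    max_runtime_seconds=3600,
)
print(optional_args)
```

The resulting dictionary could be passed as **optional_args alongside the required arguments. The completion criteria matter mainly for a full experiment, because a job that only generates candidate definitions does not train any candidates.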

### Launching the SageMaker Autopilot Job
You can now launch the Autopilot job by calling the create_auto_ml_job API.

**NOTE: The name of the Autopilot job is important because it is used to create the names for all the resources created by SageMaker, such as the model name and the endpoint name.**
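
Since the job name is reused in derived resource names, it can help to validate it before submitting the job. The following is a sketch under stated assumptions: the 32-character limit and the alphanumeric-plus-hyphen pattern are assumed constraints on AutoMLJobName that should be checked against the current API reference, and make_job_name is a hypothetical helper.

```python
import re
from time import gmtime, strftime

# Assumed constraints on AutoMLJobName; verify against the current API reference
MAX_JOB_NAME_LENGTH = 32
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_job_name(base, when=None):
    """Append a UTC timestamp to base and check the assumed naming rules."""
    suffix = strftime("%Y%m%d-%H%M%S", gmtime() if when is None else when)
    name = base + "-" + suffix
    if len(name) > MAX_JOB_NAME_LENGTH:
        raise ValueError("Job name longer than %d characters: %s" % (MAX_JOB_NAME_LENGTH, name))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("Job name has unsupported characters: " + name)
    return name


print(make_job_name("automl-vc"))
```

Including the year and month in the suffix keeps names distinct across months, whereas a suffix built from the day of the month alone can repeat.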

In [None]:
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-vc-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      GenerateCandidateDefinitionsOnly=True,
                      RoleArn=role)



print('JobStatus - Secondary Status')
print('------------------------------')

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']


We will now wait for SageMaker Autopilot to generate the candidate notebooks.

In [None]:
# Poll every 30 seconds until the job reaches a terminal state.
# Sleeping first avoids an extra wait after the final status is reported.
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    sleep(30)
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']
    print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])



# Fetch the artifact locations with a single describe call
job_artifacts = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']
candidate_nb = job_artifacts['CandidateDefinitionNotebookLocation']
data_nb = job_artifacts['DataExplorationNotebookLocation']

print("Data Exploration Notebook: " + data_nb)
print("------------------------------------------------------------------------------------------------")
print("Candidate Generation Notebook: " + candidate_nb)
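
The polling loop above runs until the job reaches a terminal state, however long that takes. The following is a sketch of the same idea with an upper bound on the waiting time. The helper wait_for_job and its parameters are hypothetical, and the describe function is passed in so that the sketch runs without AWS access.

```python
import time


def wait_for_job(describe, job_name, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(AutoMLJobName=...) until a terminal status or a timeout."""
    terminal_states = ("Completed", "Failed", "Stopped")
    waited = 0
    while True:
        response = describe(AutoMLJobName=job_name)
        status = response["AutoMLJobStatus"]
        print(status + " - " + response.get("AutoMLJobSecondaryStatus", ""))
        if status in terminal_states:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError("Job %s still %s after %d seconds" % (job_name, status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for sm.describe_auto_ml_job, returning canned responses
_fake_responses = iter([
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
])


def fake_describe(AutoMLJobName):
    return next(_fake_responses)


final = wait_for_job(fake_describe, "example-job", sleep=lambda seconds: None)
```

In this notebook the call would be wait_for_job(sm.describe_auto_ml_job, auto_ml_job_name). Checking that the returned status is Completed before reading AutoMLJobArtifacts is a reasonable precaution, since a failed or stopped job may not have produced the notebooks.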


### Downloading the Autopilot candidate notebooks
Now that SageMaker Autopilot has analyzed our data and created the candidate notebooks, let's download them and explore.

In [None]:
!aws s3 cp $data_nb .
!aws s3 cp $candidate_nb .
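
If the AWS CLI is not available in the notebook environment, the same download can be done with boto3. The following is a minimal sketch. The helpers split_s3_uri and download_from_s3 are hypothetical, and the example URI is made up.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: " + s3_uri)
    return parsed.netloc, parsed.path.lstrip("/")


def download_from_s3(s3_uri, local_dir="."):
    """Download one S3 object into local_dir and return the local path."""
    import os
    import boto3  # imported here so split_s3_uri works without boto3 installed

    bucket_name, key = split_s3_uri(s3_uri)
    local_path = os.path.join(local_dir, os.path.basename(key))
    boto3.client("s3").download_file(bucket_name, key, local_path)
    return local_path


# Made-up URI for illustration
print(split_s3_uri("s3://example-bucket/some/prefix/notebook.ipynb"))
```

With the variables from the previous cell, download_from_s3(data_nb) and download_from_s3(candidate_nb) would place both notebooks in the current directory.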

### Analyzing the candidate notebooks
Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development-notebook-output.html

During the analysis phase of the AutoML job, two notebooks are created that describe the plan that Autopilot follows to generate candidate models. A candidate model consists of a (pipeline, algorithm) pair. First, there’s a data exploration notebook that describes what Autopilot learned about the data that you provided. Second, there’s a candidate generation notebook, which uses the information about the data to generate candidates.

You can run both notebooks in SageMaker or locally if you have installed the SageMaker Python SDK. You can share the notebooks just like any other SageMaker Studio notebook. The notebooks are created for you to conduct experiments. For example, you could edit the following items in the notebooks:

* the preprocessors used on the data

* the number of hyperparameter optimization (HPO) runs and their parallelism

* the algorithms to try

* the instance types used for the HPO jobs

* the hyperparameter ranges

You are encouraged to modify the candidate generation notebook and use it as a learning tool. This lets you learn how the decisions made during the machine learning process affect your results.
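
Before editing a generated notebook, it can help to see how it is organized. The following is a sketch that lists the markdown headings of an .ipynb file using only the standard library, relying on the fact that notebooks are stored as JSON with a list of cells. The helper notebook_outline and the tiny example notebook are hypothetical.

```python
import json
import os
import tempfile


def notebook_outline(notebook_path):
    """Return the markdown heading lines found in a .ipynb file."""
    with open(notebook_path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    headings = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of strings
        text = source if isinstance(source, str) else "".join(source)
        headings.extend(line.strip() for line in text.splitlines() if line.startswith("#"))
    return headings


# Tiny made-up notebook so the sketch runs without the generated files
example_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text\n", "## Part\n"]},
        {"cell_type": "code", "metadata": {}, "source": "# a comment, not a heading",
         "outputs": [], "execution_count": None},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.ipynb")
    with open(example_path, "w", encoding="utf-8") as handle:
        json.dump(example_notebook, handle)
    outline = notebook_outline(example_path)

print(outline)
```

After the download step, the same function could be pointed at SageMakerAutopilotCandidateDefinitionNotebook.ipynb to locate the sections that cover preprocessors, algorithms, and hyperparameter ranges.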

### Next Steps
You can now switch over to the two notebooks. Feel free to change parameters and modify them as needed for your final ML model deployment. At the end of the candidate notebook, you will have a hosted model on SageMaker with an endpoint. We have provided a notebook "variant_predictor.ipynb" that runs predictions on the model using the test data we saved earlier. So, to summarize the next steps:
* Explore and run the SageMakerAutopilotDataExplorationNotebook.ipynb notebook.
* Explore and run the SageMakerAutopilotCandidateDefinitionNotebook.ipynb notebook.
* Explore and run the variant_predictor.ipynb notebook.