# Model training

In this notebook, we will train a binary classification model using Amazon SageMaker Autopilot. The SageMaker Extension provides a script that starts this process: it uploads the training data to the selected S3 bucket, then creates and starts the Autopilot job. Please refer to the Extension <a href="https://github.com/exasol/sagemaker-extension/blob/main/doc/user_guide/user_guide.md#execution-of-training" target="_blank" rel="noopener">User Guide</a> for a detailed description of the service.

We will run SQL queries using the <a href="https://jupysql.ploomber.io/en/latest/quick-start.html" target="_blank" rel="noopener">JupySQL</a> SQL Magic.

## Prerequisites

Prior to using this notebook, the following steps need to be completed:
1. [Configure the sandbox](../sandbox_config.ipynb).
2. [Initialize the SageMaker Extension](sme_init.ipynb).
3. [Load the MAGIC Gamma Telescope data](../data/data_telescope.ipynb).

## Setup

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

### Job name

Each Autopilot job needs a new unique name. We will build one from the current timestamp.

In [None]:
from datetime import datetime
sb_config.save('JOB_NAME', 'CLS' + datetime.now().strftime('%Y%m%d%H%M%S'))

# Here is the job name we are going to use in this and the following notebooks.
sb_config.JOB_NAME
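
The naming scheme above can be wrapped in a small helper that also checks the result before it is saved. This is only a sketch: `make_job_name` is a hypothetical helper, not part of the sandbox utilities, and the validation pattern and the 32-character limit reflect our understanding of the SageMaker constraints on AutoML job names, so please treat them as assumptions.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed SageMaker constraint on AutoML job names: alphanumeric characters
# with optional inner hyphens, at most 32 characters in total.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_JOB_NAME_LENGTH = 32


def make_job_name(prefix: str = "CLS", now: Optional[datetime] = None) -> str:
    """Build a job name from a prefix and a second-resolution timestamp."""
    now = now or datetime.now()
    name = prefix + now.strftime("%Y%m%d%H%M%S")
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid job name: {name!r}")
    return name
```

Note that two jobs created within the same second would get the same name; for interactive use in a notebook this is unlikely to matter.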

Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation of <a href="https://github.com/exasol/sqlalchemy-exasol" target="_blank" rel="noopener">sqlalchemy-exasol</a> for details on how to connect to the database using Exasol SQLAlchemy driver.

In [None]:
%run ../utils/jupysql_init.ipynb

## Start training

Let's define a few variables for our experiment.

<b>Note that the path for input data should be unique for each experiment.</b> Alternatively, all data files should be cleared after the experiment is finished; currently, this has to be done manually. Autopilot uses all data files found in this directory. If the directory contains stale files from previous experiments, then at best the training pipeline will fail; at worst, a wrong model will be built.

In [None]:
# URI of the S3 bucket
S3_BUCKET_URI = f"s3://{sb_config.AWS_BUCKET}"

# Path in the S3 bucket where the input data will be uploaded.
S3_OUTPUT_PATH = "ida_dataset_path"

# Input table name.
INPUT_TABLE_NAME = "TELESCOPE_TRAIN"

# Name of the view extending input table (see below why it is necessary).
INPUT_VIEW_NAME = "Z_" + INPUT_TABLE_NAME

# Name of the column in the input table which is the prediction target.
TARGET_COLUMN = "CLASS"

# The maximum number of model candidates.
MAX_CANDIDATES = 2
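
One possible way to satisfy the uniqueness note above is to derive the input path from the job name, so that every experiment gets its own directory. The helper below is a hypothetical sketch; it assumes that the extension accepts a nested path in `s3_output_path`, which we have not verified here.

```python
def unique_input_path(base_path: str, job_name: str) -> str:
    """Return a per-experiment sub-path of the form '<base_path>/<job_name>'."""
    # Strip surrounding slashes so the result never contains '//'.
    return f"{base_path.strip('/')}/{job_name}"
```

For example, `S3_OUTPUT_PATH = unique_input_path("ida_dataset_path", sb_config.JOB_NAME)` would place the data of each job in a separate directory.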

### Prepare data

When we use our model for batch predictions, we will need to identify the samples in the batch, because the order of labeled samples in the output may not match the order of unlabeled samples in the input. For that purpose, we will extend the features with an artificial column that serves as a placeholder for a sample ID. During model training, we set this column to a constant value, which should make it non-influential for the prediction.

Future versions of the SageMaker Extension are expected to be doing this step for us.

First, we need to get a list of features.

In [None]:
%%sql column_names <<
SELECT COLUMN_NAME
FROM SYS.EXA_ALL_COLUMNS
WHERE COLUMN_SCHEMA = '{{sb_config.SCHEMA}}' AND COLUMN_TABLE='{{INPUT_TABLE_NAME}}'

In [None]:
column_names = ', '.join(f'[{name[0]}]' for name in column_names)

Now let's create a view extending the input table.

In [None]:
%%sql
CREATE OR REPLACE VIEW {{sb_config.SCHEMA}}."{{INPUT_VIEW_NAME}}" AS
SELECT CAST(0 AS INT) AS SAMPLE_ID, {{column_names}} FROM {{INPUT_TABLE_NAME}}
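
To see what the two cells above produce together, the statement can also be assembled in plain Python. The sketch below mirrors the column quoting and the view definition used in this notebook; `build_view_sql` is a hypothetical helper, and the schema and column names in the usage note are illustrative only.

```python
from typing import Iterable


def build_view_sql(schema: str, view_name: str, table_name: str,
                   columns: Iterable[str]) -> str:
    """Compose the CREATE VIEW statement that adds a constant SAMPLE_ID column."""
    # Quote every column name the same way the notebook does.
    column_list = ", ".join(f"[{column}]" for column in columns)
    return (
        f'CREATE OR REPLACE VIEW {schema}."{view_name}" AS\n'
        f"SELECT CAST(0 AS INT) AS SAMPLE_ID, {column_list} FROM {table_name}"
    )
```

Calling `print(build_view_sql("MY_SCHEMA", "Z_TELESCOPE_TRAIN", "TELESCOPE_TRAIN", ["FLENGTH", "CLASS"]))` shows the statement before it is sent to the database, which can help when the view creation fails.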

### Create Autopilot job

The script below exports the data to the AWS S3 bucket. The export is performed in parallel, which makes it highly efficient. The script then calls Amazon SageMaker Autopilot, which automates the end-to-end machine learning workflow, to build a model. The script does not wait until the training is completed, which may take a while. The next script will allow us to monitor the progress of the Autopilot training pipeline.

<img src="utils/sme_training.png"/>
<center>Model training with Autopilot</center>

In [None]:
%config SqlMagic.named_parameters=True

In [None]:
%%sql
EXECUTE SCRIPT "{{sb_config.SCHEMA}}"."SME_TRAIN_WITH_SAGEMAKER_AUTOPILOT"(
'{
    "job_name"                          : "{{sb_config.JOB_NAME}}",
    "aws_credentials_connection_name"   : "{{sb_config.SME_AWS_CONN}}",
    "aws_region"                        : "{{sb_config.AWS_REGION}}",
    "iam_sagemaker_role"                : "{{sb_config.AWS_ROLE}}",
    "s3_bucket_uri"                     : "{{S3_BUCKET_URI}}",
    "s3_output_path"                    : "{{S3_OUTPUT_PATH}}",
    "input_schema_name"                 : "{{sb_config.SCHEMA}}",
    "input_table_or_view_name"          : "{{INPUT_VIEW_NAME}}",
    "target_attribute_name"             : "{{TARGET_COLUMN}}",
    "max_candidates"                    : {{MAX_CANDIDATES}}
}')
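
The JSON argument above is written by hand, so a stray quote or a missing value can be hard to spot. As an alternative sketch, the parameters could be collected in a dictionary and serialized with `json.dumps`. Both helpers below are hypothetical, and the list of required keys is an assumption based on the parameters used in this notebook; the User Guide remains the reference.

```python
import json

# Keys assumed to be mandatory, taken from the call in this notebook.
REQUIRED_KEYS = (
    "job_name", "aws_credentials_connection_name", "aws_region",
    "iam_sagemaker_role", "s3_bucket_uri", "s3_output_path",
    "input_schema_name", "input_table_or_view_name", "target_attribute_name",
)


def build_training_params(params: dict) -> str:
    """Serialize the training parameters to JSON, rejecting empty required values."""
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError(f"Missing training parameters: {missing}")
    return json.dumps(params, indent=4)


def to_sql_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"
```

The result of `to_sql_literal(build_training_params(params))` could then be passed as the single argument of the training script.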

We don't need the input view anymore since the data has been uploaded into an S3 bucket. Let's delete it.

In [None]:
%%sql
DROP VIEW {{sb_config.SCHEMA}}."{{INPUT_VIEW_NAME}}"

## Poll training status

As mentioned above, the model training runs asynchronously. We can monitor its progress by polling the Autopilot job status. Please run this script periodically until the status is `Completed`.

In [None]:
%%sql
EXECUTE SCRIPT "{{sb_config.SCHEMA}}"."SME_POLL_SAGEMAKER_AUTOPILOT_JOB_STATUS"(
    '{{sb_config.JOB_NAME}}',
    '{{sb_config.SME_AWS_CONN}}',
    '{{sb_config.AWS_REGION}}'
)
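
Instead of re-running the cell by hand, the polling could be automated with a simple loop. The sketch below is not part of the extension: `wait_for_job` and its `get_status` argument are hypothetical, with `get_status` standing for any callable that returns the current job status as a string (for example, a wrapper around the query above). The set of terminal statuses is our assumption about Autopilot job states.

```python
import time
from typing import Callable

# Statuses after which the job is assumed not to change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status: Callable[[], str],
                 interval_sec: float = 60.0,
                 timeout_sec: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status until a terminal status is returned or the timeout expires."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_sec:
            raise TimeoutError(f"Job still '{status}' after {waited:.0f} seconds")
        sleep(interval_sec)
        waited += interval_sec
```

The `sleep` parameter exists only to make the loop easy to test without real delays.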

Once the job status becomes `Completed`, the model is ready to be deployed and used for prediction. This will be demonstrated in the [next notebook](sme_deploy_model.ipynb).

## Troubleshoot the job

If the job fails, the code below may help with troubleshooting. It prints a detailed description of the job status, including the reason for failure.

In [None]:
import os
from sagemaker import AutoML

os.environ["AWS_DEFAULT_REGION"] = sb_config.AWS_REGION
os.environ["AWS_ACCESS_KEY_ID"] = sb_config.AWS_ACCESS_KEY_ID
os.environ["AWS_SECRET_ACCESS_KEY"] = sb_config.AWS_SECRET_ACCESS_KEY

automl = AutoML.attach(auto_ml_job_name=sb_config.JOB_NAME)
automl.describe_auto_ml_job()
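
The job description returned above is a fairly large dictionary. A short summary of the fields most relevant for troubleshooting can be extracted as sketched below. `summarize_job` is a hypothetical helper; it assumes the description contains the `AutoMLJobStatus`, `AutoMLJobSecondaryStatus` and, for failed jobs, `FailureReason` keys.

```python
def summarize_job(description: dict) -> str:
    """Return a short, human-readable summary of an Autopilot job description."""
    status = description.get("AutoMLJobStatus", "Unknown")
    secondary = description.get("AutoMLJobSecondaryStatus", "Unknown")
    lines = [f"Status: {status}", f"Secondary status: {secondary}"]
    # The failure reason is expected only when the job has failed.
    reason = description.get("FailureReason")
    if reason:
        lines.append(f"Failure reason: {reason}")
    return "\n".join(lines)
```

It could be used as `print(summarize_job(automl.describe_auto_ml_job()))`.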

Another useful check is to verify that the input data has been uploaded to the S3 bucket correctly. Generally, the data is split into a number of batches. The following command prints a list of CSV files, one per batch. Each file name consists of the name of the input data view and the batch number. There should be no other files in the input data directory.

The files can be inspected further by downloading them to a local machine with the `aws s3 cp` command.

We assume that the required environment variables were set when the previous cell was executed.

In [None]:
aws_command = f'aws s3 ls s3://{sb_config.AWS_BUCKET}/{S3_OUTPUT_PATH} --recursive'
!{aws_command}
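
The check for stale files can also be done programmatically. The sketch below parses a listing in the `aws s3 ls --recursive` format (date, time, size, key) and reports every file that does not look like a batch of the input view. `find_unexpected_files` is a hypothetical helper, and the exact naming of the batch files is an assumption: we only require that the file name starts with the view name and ends with `.csv`.

```python
from typing import List


def find_unexpected_files(listing: str, view_name: str) -> List[str]:
    """Return the keys in an 'aws s3 ls --recursive' listing that are not batch files."""
    unexpected = []
    for line in listing.splitlines():
        # Each line is expected to hold: date, time, size, key.
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        key = parts[3]
        file_name = key.rsplit("/", 1)[-1]
        if not (file_name.startswith(view_name) and file_name.endswith(".csv")):
            unexpected.append(key)
    return unexpected
```

In a notebook, the listing could be captured with `listing = !{aws_command}` and passed in as `"\n".join(listing)`. An empty result means that no unexpected files were found.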