# Water Demand Prediction with Amazon SageMaker Autopilot


---

This notebook's CI test result for ap-southeast-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/autopilot|autopilot_california_housing.ipynb)

---

_**Using Autopilot to Predict House Prices in California**_


Kernel `Python 3 (Data Science)` works well with this notebook. You will have the best experience running this within SageMaker Studio.

---

## Contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Prepare Training Data](#Data)
1. [Train](#Settingup)
1. [Autopilot Results](#Results)
1. [Evaluate Using Test Data](#Evaluate)
1. [Cleanup](#Cleanup)


---

## Introduction

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (without any human input) or with human guidance, without code through SageMaker Studio or scripted using the AWS SDKs. This notebook will use the AWS SDKs to simply create and deploy a machine learning model without doing any feature engineering manually. We will also explore the auto-generated feature importance report.

Demand modelling is done per location. Features for demand forecasting:

* ```doy```
* ```Aquifer```
* ```PE```
* ```month```
* ```mday```
* ```is_holiday```
* ```wday```
* ```API```
* ```lagAPI```
* ```Tmax```
* ```Tmaxlag1```
* ```Sun```
* ```Sunlag1```
* ```WVRain```
* ```KerburnRain```
* ```Rain_L7DAYS```
* ```Rain_L6DAYS```
* ```Rain_L5DAYS```
* ```Rain_L4DAYS```
* ```Rain_L3DAYS```
* ```Rain_L2DAYS```
* ```Rain_L1DAYS```
* ```Season```
* ```Rainlag1```
* ```Rainlag2```
* ```Rainlag3```
* ```PElag1```
* ```PElag2```
* ```PElag3```
* ```ANcyc```
* ```storage```
* ```Storagelag1```
* ```(site name) fTemp```
* ```(site name) fPrecp```
* ```(site name) cm```
* ```(site name) stat```
* ```Restriction level```
* ```site name```(target)

What we're going to try to predict is the site water demand for a wellington region. We will let Autopilot perform feature engineering, model selection, model tuning, and give us the best candidate model ready to use for inferences.

---
## Setup

_This notebook was created and tested on a ml.m5.large notebook instance._

Let's start by specifying:

- The S3 bucket and prefix to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting. The following code will use SageMaker's default S3 bucket (and create one if it doesn't exist).
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. The following code will use the SageMaker execution role.

## declare functions needed for job scheduling later

In [1]:
import re

def convert_input_to_output_key(input_key):
    # Split the input key by '/'
    parts = input_key.split('/')
    # Remove the first part 'TransformedInputs'
    parts.pop(0)
    # Remove the last part which is the file name
    parts.pop()
    # Join the remaining parts to form the output key
    output_key = '/'.join(parts)
    return output_key

def extract_number(input_string):
    # Use regular expression to find the number before '.csv'
    match = re.search(r'_(\d+)\.csv$', input_string)
    if match:
        return int(match.group(1))
    else:
        return None

def schedule_job(best_candidate_name, input_chunk_key, output_chunk_key, instance_type, bucket_name):
    # Create a SageMaker session
    sagemaker_session = sagemaker.Session()
    timestamp_suffix = strftime("%Y%m%d-%H%M%S", gmtime())
    chunk_idx = extract_number(input_chunk_key)
    transform_job_name=f'{best_candidate_name}-c{chunk_idx}-' + timestamp_suffix
    print(f"BatchTransformJob ({instance_type}): {transform_job_name} on {input_chunk_key}")
    print(f"BatchTransformJob output: {output_chunk_key}")
    input_prefix = input_chunk_key
    output_prefix = output_chunk_key
    
    response = sm.create_transform_job(
        TransformJobName=transform_job_name, 
        ModelName=best_candidate_name,
        MaxPayloadInMB=20,
        BatchStrategy="MultiRecord",
        ModelClientConfig={
            'InvocationsTimeoutInSeconds': 3600
        },
        TransformInput={
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://{}/{}'.format(bucket_name, input_prefix)
                }
            },
            'ContentType': 'text/csv',
            'SplitType': 'Line'
        },
        TransformOutput={
            'S3OutputPath': 's3://{}/{}'.format(bucket_name, output_prefix),
            'AssembleWith': 'Line',
        },
        TransformResources={
            'InstanceType': instance_type, #'ml.c5.4xlarge', 'ml.m5.12xlarge',
            'InstanceCount': 1
        }
        )
    return transform_job_name

def check_job_status(transform_job_name):
    while True:
        describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
        job_run_status = describe_response["TransformJobStatus"]
        if job_run_status in ("Failed", "Completed", "Stopped"):
            print(f"{datetime.datetime.now()} {describe_response['TransformJobStatus']}")
            break
        print(f"{datetime.datetime.now()} {describe_response['TransformJobStatus']}")
        sleep(60)

def get_job_status(job):
    describe_response = sm.describe_transform_job(TransformJobName=job)
    job_run_status = describe_response["TransformJobStatus"]
    return job_run_status

# def schedule_batch_transform_jobs(best_candidate_name, input_chunk_key_list, output_chunk_key_list, instance_types, bucket_name):
#     running_jobs = {instance_type: [] for instance_type in instance_types}
#     max_parallel_jobs = 4
#     input_index = 0

#     while input_index < len(input_chunk_key_list):
#         for instance_type in instance_types:
#             while len(running_jobs[instance_type]) < max_parallel_jobs and input_index < len(input_chunk_key_list):
#                 # Schedule a new job
#                 input_chunk_key = input_chunk_key_list[input_index]
#                 output_chunk_key = output_chunk_key_list[input_index]
#                 try:
#                     transform_job_name = schedule_job(best_candidate_name, input_chunk_key, output_chunk_key, instance_type, bucket_name)
#                     running_jobs[instance_type].append(transform_job_name)
#                     input_index += 1
#                 except:
#                     print(f"schedule job exception with {instance_type}, switch to the next instance_type")
#                     break

#         # Check the status of running jobs and remove completed jobs from the list
#         all_running_jobs = []
#         for instance_type in instance_types:
#             for job in running_jobs[instance_type]:
#                 check_job_status(job)
#             job_status_x = [job for job in running_jobs[instance_type] if get_job_status(job) not in ("Failed", "Completed", "Stopped")]
#             running_jobs[instance_type] = job_status_x
#             all_running_jobs.extend(job_status_x)
        
#         if len(all_running_jobs) > 0:
#             sleep(60)  # jobs running, Wait before checking again
#         else:
#             sleep(1)  # no jobs running, don't sleep

#     # Wait for all remaining jobs to complete
#     for instance_type in instance_types:
#         while running_jobs[instance_type]:
#             for job in running_jobs[instance_type]:
#                 check_job_status(job)
#             job_status_x = [job for job in running_jobs[instance_type] if get_job_status(job) not in ("Failed", "Completed", "Stopped")]
#             running_jobs[instance_type] = job_status_x
#             if len(job_status_x) > 0:
#                 sleep(60)  # jobs running, Wait before checking again
#             else:
#                 sleep(1)  # no jobs running, don't sleep

In [2]:
from io import StringIO
import boto3

s3 = boto3.client("s3")

def list_csv_files(bucket_name, key_path):
    csv_files = []
    paginator = s3.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=key_path)
    
    for page in page_iterator:
        if 'Contents' in page:
            for content in page['Contents']:
                if content['Key'].endswith('.csv'):
                    csv_files.append(content['Key'])
    
    return csv_files

def read_csv_files_to_dataframes(bucket_name, csv_files):
    dataframes = []
    for key in csv_files:
        # Get the object from S3
        obj = s3.get_object(Bucket=bucket_name, Key=key)
        # Read the CSV file content
        data = obj['Body'].read().decode('utf-8')
        # Convert to DataFrame
        df = pd.read_csv(StringIO(data))
        dataframes.append(df)
    return dataframes


In [3]:
# declare all auto_ml_job already completed in Canvas
auto_ml_job_dict = {
    'NorthWellingtonMoa': 'Canvas1734649444174',
    'WellingtonLowLevel': 'Canvas1734648978161',
    'Petone': 'Canvas1733434154045',
    'WellingtonHighWestern': 'Canvas1733085655509',
    'WellingtonHighMoa': 'Canvas1733372214860',
    'NorthWellingtonPorirua': 'Canvas1733369877242',
    'Porirua': 'Canvas1733437572452',
    'Wainuiomata': 'Canvas1734649248674',
    'UpperHutt': 'Canvas1734649294393',
    'LowerHutt': 'Canvas1734649384856'
}

### Launching the SageMaker Autopilot Job<a name="Launching"></a>

You can now launch the Autopilot job by calling the `create_auto_ml_job` API. 

To evaluate the model on previously unseen data, we will test it against the test dataset we prepared earlier. For that, we don't necessarily need to deploy the model to an endpoint, we can simply run a batch transform job to get predictions for our unlabeled test dataset.

# Use models already trained Canvas for Inference

## data prep, find simulation folders, find site folders under the respective simulation folder

In [7]:
import boto3
from io import StringIO

# Initialize the S3 client
s3 = boto3.client('s3')

# Define the S3 bucket and prefix
bucket_name = 'niwa-water-demand-modelling'
prefix = 'InferenceData/'

# List objects in the specified S3 path
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)

# Loop through the objects and look for CSV files containing "Final_"
csv_files = []
for obj in response.get('Contents', []):
    key = obj['Key']
    if key.endswith('.csv'):
        csv_files.append(key)


In [9]:
csv_files

['InferenceData/LowerHutt/Lower Hutt.csv',
 'InferenceData/NorthWellingtonMoa/North Wellington Moa.csv',
 'InferenceData/NorthWellingtonPorirua/North Wellington Porirua.csv',
 'InferenceData/Petone/Petone.csv',
 'InferenceData/Porirua/Porirua.csv',
 'InferenceData/UpperHutt/Upper Hutt.csv',
 'InferenceData/Wainuiomata/Wainuiomata.csv',
 'InferenceData/WellingtonHighMoa/Wellington High Moa.csv',
 'InferenceData/WellingtonHighWestern/Wellington High Western.csv',
 'InferenceData/WellingtonLowLevel/Wellington Low Level.csv']

In [26]:
# # Example usage
# bucket_name = 'niwa-water-demand-modelling'
# key_path = 'TransformedOutputs/Simulation/'
# target_files = list_csv_files(bucket_name, key_path)
# for key in list(auto_ml_job_dict.keys()):
#     key = f"/{key}/"
#     key_inputs = [e for e in csv_files if key in e]
#     key_files = ["/".join(e.split("/")[1:]) for e in target_files if key in e]
#     unfinished = [e for e in key_inputs if e not in key_files]
#     print(f"{key}: {len(key_files)}")
#     # find out which input file is not covered
#     print(f"{key}: {len(unfinished)} files not processed: {unfinished}")

16
16
16
16
16
0
0
0
0
0


In [11]:
# create empty target files
target_files = []

## loop through all sites and their Canvas model, create_model to register the model if needed, find all inference files for that site, clean up the csv file by only selecting necessary columns, remove the header. After batch transform job completes, it will save to the destination s3 buckets

In [14]:
def schedule_batch_transform_jobs(best_candidate_name_list, input_chunk_key_list, output_chunk_key_list, instance_types, bucket_name):
    running_jobs = {instance_type: [] for instance_type in instance_types}
    max_parallel_jobs = 4
    input_index = 0

    while input_index < len(input_chunk_key_list):
        for instance_type in instance_types:
            while len(running_jobs[instance_type]) < max_parallel_jobs and input_index < len(input_chunk_key_list):
                # Schedule a new job
                input_chunk_key = input_chunk_key_list[input_index]
                output_chunk_key = output_chunk_key_list[input_index]
                best_candidate_name = best_candidate_name_list[input_index]
                try:
                    transform_job_name = schedule_job(
                        best_candidate_name, 
                        input_chunk_key, 
                        output_chunk_key, 
                        instance_type, 
                        bucket_name
                    )
                    running_jobs[instance_type].append(transform_job_name)
                    input_index += 1
                except:
                    print(f"schedule job exception with {instance_type}, switch to the next instance_type")
                    break

        # Check the status of running jobs and remove completed jobs from the list
        all_running_jobs = []
        for instance_type in instance_types:
            for job in running_jobs[instance_type]:
                check_job_status(job)
            job_status_x = [job for job in running_jobs[instance_type] if get_job_status(job) not in ("Failed", "Completed", "Stopped")]
            running_jobs[instance_type] = job_status_x
            all_running_jobs.extend(job_status_x)
        
        if len(all_running_jobs) > 0:
            sleep(60)  # jobs running, Wait before checking again
        else:
            sleep(1)  # no jobs running, don't sleep

    # Wait for all remaining jobs to complete
    for instance_type in instance_types:
        while running_jobs[instance_type]:
            for job in running_jobs[instance_type]:
                check_job_status(job)
            job_status_x = [job for job in running_jobs[instance_type] if get_job_status(job) not in ("Failed", "Completed", "Stopped")]
            running_jobs[instance_type] = job_status_x
            if len(job_status_x) > 0:
                sleep(60)  # jobs running, Wait before checking again
            else:
                sleep(1)  # no jobs running, don't sleep

In [15]:
import sagemaker
import boto3
import datetime
from io import StringIO
import io
import pandas as pd
import numpy as np
# from sagemaker.model import Model
from sagemaker import get_execution_role
from time import gmtime, strftime, sleep
from botocore.exceptions import ClientError

role = get_execution_role()
region = boto3.Session().region_name

# Initialize the SageMaker client
sagemaker_client = boto3.client('sagemaker')

# This is the client we will use to interact with SageMaker Autopilot
sm = boto3.Session().client(service_name="sagemaker", region_name=region)
input_file_list = []
output_file_list = []
best_candidate_name_list = []

for site_name, auto_ml_job_name in list(auto_ml_job_dict.items()):
    # Describe the AutoML job using the V2 API
    # auto_ml_job_name_1 = "Canvas1734649444174"
    response = sagemaker_client.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)
    
    # Extract the best candidate details
    best_candidate = response['BestCandidate']
    best_candidate_name = best_candidate['CandidateName']
    model_artifacts = best_candidate['InferenceContainers'][0]['ModelDataUrl']
    image_uri = best_candidate['InferenceContainers'][0]['Image']
    best_candidate_containers = best_candidate['InferenceContainers'] 

    # check if model exist
    try:
        response = sm.describe_model(ModelName=best_candidate_name)
        print(f"Model {best_candidate_name} exists. Loading the model.")
    # Load the model logic here
    except ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print(f"Model {best_candidate_name} does not exist. Creating the model.")
            response = sm.create_model(
                ModelName=best_candidate_name,
                ExecutionRoleArn=role,
                Containers=best_candidate_containers
            )
            print(f"Model {best_candidate_name} created successfully.")
        else:
            print(f"Unexpected error: {e}")

    # Find all s3 paths for input files
    input_files = [e for e in csv_files if site_name in e]
    # Find all s3 paths for output files
    result_files = [e for e in target_files if site_name in e]

    for csv_file in input_files:
        # check if input already has output file generate
        result_found = [j for j in result_files if csv_file in j]
        if len(result_found)>0:
            print(f"result found for {csv_file}, continue to next")
            continue
                
        print(f"s3://{bucket_name}/{csv_file}")
        obj = s3.get_object(Bucket=bucket_name, Key=csv_file)
        # Read the CSV file content
        data = obj['Body'].read().decode('utf-8')
        # Convert to DataFrame
        df = pd.read_csv(StringIO(data))
        target = csv_file.split("/")[-1].split(".csv")[0]
        columns = [e for e in df.columns if e not in ["Date", target, "replicate"]]
        
        # Upload the CSV string to S3
        file_name = csv_file.split("/")[-1]
        
        # Split the data into smaller chunks
        chunk_size = 5000  # Adjust the chunk size as needed
        chunks = [df[i:i + chunk_size] for i in range(0, df.shape[0], chunk_size)]

        # Upload the chunks to S3
        model_name_list = []
        input_chunk_key_list = []
        for idx, chunk in enumerate(chunks):
            csv_buffer = StringIO()
            chunk[columns].to_csv(csv_buffer, index=False, header=False)
            chunk_csv_path = csv_file.replace(".csv", f"_{idx}.csv")
            chunk_key = f"TransformedInputs/{chunk_csv_path}"
            s3.put_object(Bucket=bucket_name, Key=chunk_key, Body=csv_buffer.getvalue())
            print(f"Uploaded chunk {idx} to s3://{bucket_name}/{chunk_key}")
            input_chunk_key_list.append(chunk_key)
            model_name_list.append(best_candidate_name)
        input_file_list.extend(input_chunk_key_list)
        # schedule jobs for all chunked csvs
        output_chunk_key_list = []
        for e in input_chunk_key_list:
            output_chunk_key_1 = convert_input_to_output_key(e)
            output_chunk_key = f"TransformedOutputs/{output_chunk_key_1}"
            output_chunk_key_list.append(output_chunk_key)
        output_file_list.extend(output_chunk_key_list)
        best_candidate_name_list.extend(model_name_list)



Model Canvas1734649444174-trial-t1-1 exists. Loading the model.
s3://niwa-water-demand-modelling/InferenceData/NorthWellingtonMoa/North Wellington Moa.csv
Uploaded chunk 0 to s3://niwa-water-demand-modelling/TransformedInputs/InferenceData/NorthWellingtonMoa/North Wellington Moa_0.csv
Model Canvas1734648978161-trial-t1-1 exists. Loading the model.
s3://niwa-water-demand-modelling/InferenceData/WellingtonLowLevel/Wellington Low Level.csv
Uploaded chunk 0 to s3://niwa-water-demand-modelling/TransformedInputs/InferenceData/WellingtonLowLevel/Wellington Low Level_0.csv
Model Canvas1733434154045-trial-t1-1 exists. Loading the model.
s3://niwa-water-demand-modelling/InferenceData/Petone/Petone.csv
Uploaded chunk 0 to s3://niwa-water-demand-modelling/TransformedInputs/InferenceData/Petone/Petone_0.csv
Model Canvas1733085655509-trial-t1-1 exists. Loading the model.
s3://niwa-water-demand-modelling/InferenceData/WellingtonHighWestern/Wellington High Western.csv
Uploaded chunk 0 to s3://niwa-wat

In [16]:
best_candidate_name_list, input_file_list, output_file_list

(['Canvas1734649444174-trial-t1-1',
  'Canvas1734648978161-trial-t1-1',
  'Canvas1733434154045-trial-t1-1',
  'Canvas1733085655509-trial-t1-1',
  'Canvas1733372214860-trial-t1-1',
  'Canvas1733369877242-trial-t1-1',
  'Canvas1733437572452-trial-t1-1',
  'Canvas1733437572452-trial-t1-1',
  'Canvas1734649248674-trial-t1-1',
  'Canvas1734649294393-trial-t1-1',
  'Canvas1734649384856-trial-t1-1'],
 ['TransformedInputs/InferenceData/NorthWellingtonMoa/North Wellington Moa_0.csv',
  'TransformedInputs/InferenceData/WellingtonLowLevel/Wellington Low Level_0.csv',
  'TransformedInputs/InferenceData/Petone/Petone_0.csv',
  'TransformedInputs/InferenceData/WellingtonHighWestern/Wellington High Western_0.csv',
  'TransformedInputs/InferenceData/WellingtonHighMoa/Wellington High Moa_0.csv',
  'TransformedInputs/InferenceData/NorthWellingtonPorirua/North Wellington Porirua_0.csv',
  'TransformedInputs/InferenceData/NorthWellingtonPorirua/North Wellington Porirua_0.csv',
  'TransformedInputs/Inferen

In [17]:
# schedule jobs by master input&output file list
instance_types = ['ml.c5.xlarge', 'ml.m5.2xlarge', 'ml.m5.xlarge']
schedule_batch_transform_jobs(
    best_candidate_name_list, 
    input_file_list, 
    output_file_list, 
    instance_types, 
    bucket_name
)


BatchTransformJob (ml.c5.xlarge): Canvas1734649444174-trial-t1-1-c0-20250203-005944 on TransformedInputs/InferenceData/NorthWellingtonMoa/North Wellington Moa_0.csv
BatchTransformJob output: TransformedOutputs/InferenceData/NorthWellingtonMoa
BatchTransformJob (ml.c5.xlarge): Canvas1734648978161-trial-t1-1-c0-20250203-005945 on TransformedInputs/InferenceData/WellingtonLowLevel/Wellington Low Level_0.csv
BatchTransformJob output: TransformedOutputs/InferenceData/WellingtonLowLevel
BatchTransformJob (ml.c5.xlarge): Canvas1733434154045-trial-t1-1-c0-20250203-005946 on TransformedInputs/InferenceData/Petone/Petone_0.csv
BatchTransformJob output: TransformedOutputs/InferenceData/Petone
BatchTransformJob (ml.c5.xlarge): Canvas1733085655509-trial-t1-1-c0-20250203-005947 on TransformedInputs/InferenceData/WellingtonHighWestern/Wellington High Western_0.csv
BatchTransformJob output: TransformedOutputs/InferenceData/WellingtonHighWestern
BatchTransformJob (ml.m5.2xlarge): Canvas1733372214860-tr

## concatenate all chunk predictions and join back with datetime in original input file, this is done by site, when data from all sites are concatenated, they will be joined together to form a single output result by experiment

In [18]:
# post processing of output files
for site_name, auto_ml_job_name in list(auto_ml_job_dict.items()):
    # Find all s3 paths for input files
    input_files = [e for e in csv_files if site_name in e]
    for csv_file in input_files:
        # List objects in the specified S3 output path
        # output_chunk_key_list[0]
        file_name = csv_file.split("/")[-1]
        input_key = f"TransformedInputs/{csv_file}"
        output_key = convert_input_to_output_key(input_key) 
        output_prefix = f"TransformedOutputs/{output_key}"
        response = s3.list_objects_v2(Bucket=bucket_name, Prefix=output_prefix)
        
        # Loop through the objects and look for CSV files ends "csv.out"
        output_files = []
        for obj in response.get('Contents', []):
            key = obj['Key']
            if key.endswith('.csv.out'):
                output_files.append(key)
        # concatenate all files and save as csv format to s3
        df_list = []
        for output_file in sorted(output_files):
            print(f"s3://{bucket_name}/{output_file}")
            obj = s3.get_object(Bucket=bucket_name, Key=output_file)
            # Read the CSV file content
            data = obj['Body'].read().decode('utf-8')
            # Convert to DataFrame
            df = pd.read_csv(StringIO(data), names=[file_name.replace(".csv", "")])
            df_list.append(df)
        df_all = pd.concat(df_list, axis=0)
        output_key = f"{output_prefix}/{file_name}"
        csv_buffer = StringIO()
        df_all.to_csv(csv_buffer, index=False)
        s3.put_object(Bucket=bucket_name, Key=output_key, Body=csv_buffer.getvalue())
        print(f"Uploaded merged prediction to s3://{bucket_name}/{output_key}")

s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/NorthWellingtonMoa/North Wellington Moa_0.csv.out
Uploaded merged prediction to s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/NorthWellingtonMoa/North Wellington Moa.csv
s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/WellingtonLowLevel/Wellington Low Level_0.csv.out
Uploaded merged prediction to s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/WellingtonLowLevel/Wellington Low Level.csv
s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/Petone/Petone_0.csv.out
Uploaded merged prediction to s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/Petone/Petone.csv
s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/WellingtonHighWestern/Wellington High Western_0.csv.out
Uploaded merged prediction to s3://niwa-water-demand-modelling/TransformedOutputs/InferenceData/WellingtonHighWestern/Wellington High Western.csv
s3://niwa-water-

In [None]:
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket(bucket)

print(s3_bucket)
job_outputs_prefix = "{}/output/{}".format(prefix, auto_ml_job_name)
print(job_outputs_prefix)

# Delete S3 objects
s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()

We then delete all the experiment and model resources created by the Autopilot experiment.

In [None]:
def cleanup_experiment_resources(experiment_name):
    trials = sm.list_trials(ExperimentName=experiment_name)["TrialSummaries"]
    print("TrialNames:")
    for trial in trials:
        trial_name = trial["TrialName"]
        print(f"\n{trial_name}")

        components_in_trial = sm.list_trial_components(TrialName=trial_name)
        print("\tTrialComponentNames:")
        for component in components_in_trial["TrialComponentSummaries"]:
            component_name = component["TrialComponentName"]
            print(f"\t{component_name}")
            sm.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)
            try:
                # comment out to keep trial components
                sm.delete_trial_component(TrialComponentName=component_name)
            except:
                # component is associated with another trial
                continue
            # to prevent throttling
            time.sleep(5)
        sm.delete_trial(TrialName=trial_name)
    sm.delete_experiment(ExperimentName=experiment_name)
    print(f"\nExperiment {experiment_name} deleted")


def cleanup_autopilot_models(autopilot_job_name):
    print("{0}:\n".format(autopilot_job_name))
    response = sm.list_models(NameContains=autopilot_job_name)

    for model in response["Models"]:
        model_name = model["ModelName"]
        print(f"\t{model_name}")
        sm.delete_model(ModelName=model_name)
        # to prevent throttling
        time.sleep(3)

In [None]:
cleanup_experiment_resources("{0}-aws-auto-ml-job".format(auto_ml_job_name))

In [None]:
cleanup_autopilot_models(auto_ml_job_name)

Finally, the following code, when uncommented, will delete the local files used in this demo.

In [None]:
import shutil
import glob
import os


def delete_local_files():
    base_path = ""
    dir_list = glob.iglob(os.path.join(base_path, "{0}*".format(auto_ml_job_name)))

    for path in dir_list:
        if os.path.isdir(path):
            shutil.rmtree(path)

    if os.path.exists("CaliforniaHousing"):
        shutil.rmtree("CaliforniaHousing")

    if os.path.exists("cal_housing.tgz"):
        os.remove("cal_housing.tgz")

    if os.path.exists("SageMakerAutopilotCandidateDefinitionNotebook.ipynb"):
        os.remove("SageMakerAutopilotCandidateDefinitionNotebook.ipynb")

    if os.path.exists("SageMakerAutopilotDataExplorationNotebook.ipynb"):
        os.remove("SageMakerAutopilotDataExplorationNotebook.ipynb")

    if os.path.exists("test_data_no_target.csv"):
        os.remove("test_data_no_target.csv")

    if os.path.exists("test_data.csv"):
        os.remove("test_data.csv")

    if os.path.exists("train_data.csv"):
        os.remove("train_data.csv")


## UNCOMMENT TO CLEAN UP LOCAL FILES
# delete_local_files()

**Note: If you enabled automatic endpoint creation, you will need to delete the endpoint manually.**

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/autopilot|autopilot_california_housing.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/autopilot|autopilot_california_housing.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/autopilot|autopilot_california_housing.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/autopilot|autopilot_california_housing.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/autopilot|autopilot_california_housing.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/autopilot|autopilot_california_housing.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/autopilot|autopilot_california_housing.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/autopilot|autopilot_california_housing.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/autopilot|autopilot_california_housing.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/autopilot|autopilot_california_housing.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/autopilot|autopilot_california_housing.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/autopilot|autopilot_california_housing.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/autopilot|autopilot_california_housing.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/autopilot|autopilot_california_housing.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/autopilot|autopilot_california_housing.ipynb)
