In [1]:
import sagemaker
import boto3
from sagemaker import get_execution_role
import pandas as pd

region = boto3.Session().region_name

session = sagemaker.Session()
bucket = 'lawsnic-aiml-east2'
prefix = 'kaggle/house-prices-advanced-regression-techniques'

role = get_execution_role()

sm = boto3.Session().client(service_name='sagemaker',region_name=region)

In [2]:
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/input/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'SalePrice'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

autoMLJobConfig={
        'CompletionCriteria': {
            'MaxCandidates': 10
        }
}

autoMLJobObjective = {
      "MetricName": "R2"
}

test_data_s3_path = 's3://{}/{}/input/test.csv'.format(bucket,prefix)

## Launching the SageMaker Autopilot Job
You can now launch the Autopilot job by calling the create_auto_ml_job API. The equivalent AWS CLI command is documented at https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-auto-ml-job.html

In [3]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-house-price-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=autoMLJobConfig,
                      AutoMLJobObjective=autoMLJobObjective,
                      ProblemType="Regression",
                      RoleArn=role)

AutoMLJobName: automl-house-price-13-14-11-36


{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:791580863750:automl-job/automl-house-price-13-14-11-36',
 'ResponseMetadata': {'RequestId': 'b0b1ec31-0010-47e1-991b-02a5625ee4b0',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b0b1ec31-0010-47e1-991b-02a5625ee4b0',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '101',
   'date': 'Wed, 13 Jul 2022 14:11:37 GMT'},
  'RetryAttempts': 0}}
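
The job name above is kept short because SageMaker limits the length of job names. Below is a minimal sketch of a helper that builds the same kind of timestamped name and fails early when it grows too long. The function name `make_job_name` is hypothetical, and the 32-character limit is an assumption taken from the CreateAutoMLJob API reference, so it is worth checking against the current documentation.

```python
from time import gmtime, strftime

# Assumed limit for AutoMLJobName; verify against the current API reference.
MAX_AUTOML_JOB_NAME_LENGTH = 32


def make_job_name(base, now=None, max_length=MAX_AUTOML_JOB_NAME_LENGTH):
    """Return '<base>-<day>-<hour>-<minute>-<second>' or raise if it is too long."""
    # Same timestamp format as the notebook cell above.
    suffix = strftime('%d-%H-%M-%S', now if now is not None else gmtime())
    name = '{}-{}'.format(base, suffix)
    if len(name) > max_length:
        raise ValueError(
            'Job name {!r} has {} characters; the limit is {}'.format(
                name, len(name), max_length))
    return name
```

With this helper, the name used above could be built as `make_job_name('automl-house-price')`.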

## Tracking SageMaker Autopilot job progress<a name="Tracking"></a>
A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top-performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline).

In [4]:
print ('JobStatus - Secondary Status')
print('------------------------------')


describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print (describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']
    
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']
    
    print (describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
    sleep(30)

JobStatus - Secondary Status
------------------------------
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress 
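
The loop above prints one line per poll, which produces long runs of identical lines. Below is a minimal sketch of a polling helper that prints only when the status changes and gives up after a timeout. The name `wait_for_job`, its parameters, and the default timeout are hypothetical. The `describe` argument is any zero-argument callable that returns a describe response, so the helper can be exercised without AWS access.

```python
import time

# Terminal statuses, as used in the notebook's polling loops.
TERMINAL_STATUSES = ('Failed', 'Completed', 'Stopped')


def wait_for_job(describe, status_key, secondary_key=None,
                 poll_seconds=30, timeout_seconds=4 * 3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until the job reaches a terminal status.

    Prints a line only when the (status, secondary status) pair changes
    and returns the list of distinct pairs seen, in order.
    """
    deadline = clock() + timeout_seconds
    seen = []
    while True:
        response = describe()
        status = response[status_key]
        secondary = response.get(secondary_key) if secondary_key else None
        pair = (status, secondary)
        if not seen or seen[-1] != pair:
            seen.append(pair)
            print(status if secondary is None
                  else '{} - {}'.format(status, secondary))
        if status in TERMINAL_STATUSES:
            return seen
        if clock() >= deadline:
            raise TimeoutError(
                'Job still {} after {} seconds'.format(status, timeout_seconds))
        sleep(poll_seconds)
```

For the Autopilot job this could be called as `wait_for_job(lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name), 'AutoMLJobStatus', 'AutoMLJobSecondaryStatus')`. The same helper fits the transform job later in the notebook with `'TransformJobStatus'` as the status key.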

## Results

Now use the describe_auto_ml_job API to look up the best candidate selected by the SageMaker Autopilot job. 

In [5]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

{'AutoMLJobName': 'automl-house-price-13-14-11-36',
 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:791580863750:automl-job/automl-house-price-13-14-11-36',
 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://lawsnic-aiml-east2/kaggle/house-prices-advanced-regression-techniques/input/train'}},
   'TargetAttributeName': 'SalePrice',
   'ContentType': 'text/csv;header=present',
   'ChannelType': 'training'}],
 'OutputDataConfig': {'S3OutputPath': 's3://lawsnic-aiml-east2/kaggle/house-prices-advanced-regression-techniques/output'},
 'RoleArn': 'arn:aws:iam::791580863750:role/service-role/AmazonSageMaker-ExecutionRole-20220707T123330',
 'AutoMLJobObjective': {'MetricName': 'R2'},
 'ProblemType': 'Regression',
 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 10}},
 'CreationTime': datetime.datetime(2022, 7, 13, 14, 11, 36, 908000, tzinfo=tzlocal()),
 'EndTime': datetime.datetime(2022, 7, 13, 14, 49, 57, 871000, tzinfo=tzlocal()),
 'L

In [6]:

best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print(best_candidate)
print('\n')
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

{'CandidateName': 'automl-house-price-13-14-11-36lG-006-a9e0941a', 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:r2', 'Value': 0.8751699924468994}, 'ObjectiveStatus': 'Succeeded', 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:791580863750:processing-job/automl-house-price-13-14-11-36-db-1-428b2cdb4a964a8c8016c769f44', 'CandidateStepName': 'automl-house-price-13-14-11-36-db-1-428b2cdb4a964a8c8016c769f44'}, {'CandidateStepType': 'AWS::SageMaker::TrainingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:791580863750:training-job/automl-house-price-13-14-11-36-dpp0-1-0a6f825d04314b9fbc98d7543', 'CandidateStepName': 'automl-house-price-13-14-11-36-dpp0-1-0a6f825d04314b9fbc98d7543'}, {'CandidateStepType': 'AWS::SageMaker::TransformJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:791580863750:transform-job/automl-house-price-13-14-11-36-dpp0-csv-1-9f44f9abe309462a9d5ef', 'CandidateStepName': 
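
The lookup above indexes the response with ['BestCandidate'] directly. A job that failed or was stopped early may not include that key, so the sketch below returns None instead of raising. The function name `summarize_best_candidate` is hypothetical; the dictionary keys are the ones printed in the output above.

```python
def summarize_best_candidate(describe_response):
    """Return (name, metric_name, metric_value), or None without a best candidate."""
    best = describe_response.get('BestCandidate')
    if best is None:
        return None
    # The metric entry is read defensively in case it is absent.
    metric = best.get('FinalAutoMLJobObjectiveMetric', {})
    return best['CandidateName'], metric.get('MetricName'), metric.get('Value')
```

It would be called on the full response, for example `summarize_best_candidate(sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name))`.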

### Perform batch inference using the best candidate

Now that you have successfully completed the SageMaker Autopilot job on the dataset, create a model from any of the candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). 

In [7]:
model_name = 'automl-housePrice-model-' + timestamp_suffix

model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-2:791580863750:model/automl-houseprice-model-13-14-11-36
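
The model is created from the candidate's InferenceContainers list, which is what makes it an inference pipeline: the containers run in the order given. Below is a minimal sketch for inspecting that list before creating the model. The function name `summarize_containers` is hypothetical, and the 'Image' and 'ModelDataUrl' keys are assumed from the AutoML container definition in the API reference.

```python
def summarize_containers(candidate):
    """Return one (position, image, model_data_url) tuple per pipeline container."""
    containers = candidate.get('InferenceContainers', [])
    return [
        (position, container.get('Image'), container.get('ModelDataUrl'))
        for position, container in enumerate(containers, start=1)
    ]
```

Calling `summarize_containers(best_candidate)` shows how many stages the pipeline has and where each stage's model artifact is stored.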


In [8]:
transform_job_name = 'automl-housePric-transform-' + timestamp_suffix

transform_input = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': test_data_s3_path
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    }

transform_output = {
        'S3OutputPath': 's3://{}/{}/inference-results'.format(bucket,prefix),
    }

transform_resources = {
        'InstanceType': 'ml.m5.4xlarge',
        'InstanceCount': 1
    }

sm.create_transform_job(TransformJobName = transform_job_name,
                        ModelName = model_name,
                        TransformInput = transform_input,
                        TransformOutput = transform_output,
                        TransformResources = transform_resources
)

{'TransformJobArn': 'arn:aws:sagemaker:us-east-2:791580863750:transform-job/automl-housepric-transform-13-14-11-36',
 'ResponseMetadata': {'RequestId': '15e1fbd7-00d6-49f3-9282-461573b6ce42',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '15e1fbd7-00d6-49f3-9282-461573b6ce42',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '115',
   'date': 'Wed, 13 Jul 2022 15:04:48 GMT'},
  'RetryAttempts': 0}}

In [9]:
print ('JobStatus')
print('----------')


describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print (job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print (job_run_status)
    sleep(30)

JobStatus
----------
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
Completed


In [10]:
s3_output_key = '{}/inference-results/test.csv.out'.format(prefix)
local_inference_results_path = 'inference_results.csv'

s3 = boto3.resource('s3')
inference_results_bucket = s3.Bucket(bucket)

print(s3_output_key)

kaggle/house-prices-advanced-regression-techniques/inference-results/test.csv.out
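
The key printed above follows the batch transform convention of storing each result under the output path with the input file name plus an .out suffix. Below is a minimal sketch that derives the key from the two S3 locations instead of hard-coding it. The function name `transform_output_key` is hypothetical, and it assumes a single input file.

```python
import posixpath
from urllib.parse import urlparse


def transform_output_key(input_s3_uri, output_s3_path):
    """Return (bucket, key) of the result object for a single input file."""
    # File name of the input object, e.g. 'test.csv'.
    input_name = posixpath.basename(urlparse(input_s3_uri).path)
    output = urlparse(output_s3_path)
    # Batch transform appends '.out' to the input file name.
    key = posixpath.join(output.path.lstrip('/'), input_name + '.out')
    return output.netloc, key
```

Here it would be called as `transform_output_key(test_data_s3_path, transform_output['S3OutputPath'])`.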


In [28]:
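inference_results_bucket.download_file(s3_output_key, local_inference_results_path)

# The output file holds one prediction per line and has no header row,
# so read_csv uses the first line as the column name (see the output below).
data = pd.read_csv(local_inference_results_path, sep=';')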
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data

Unnamed: 0,51427.2734375
0,124217.953125
1,158185.343750
2,185572.406250
3,195268.968750
4,188283.312500
...,...
1454,82366.335938
1455,83315.015625
1456,166360.281250
1457,112267.234375


In [33]:
test_data = pd.read_csv("./test.csv")
#display(test_data)

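data['Id'] = test_data['Id']
# Rename the prediction column (the first column) to the name Kaggle expects.
# rename() is used because assigning into data.columns.values is unreliable.
data = data.rename(columns={data.columns[0]: 'SalePrice'})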
#data.to_csv('./submission.csv')

#display(data)

new_data = data[['Id','SalePrice']].copy()
display(new_data)
new_data.to_csv('./submission.csv', index=False)

# Kaggle submission https://www.kaggle.com/submissions/27426344/27426344.raw received a score of 0.12704

Unnamed: 0,Id,SalePrice
0,1461,124217.953125
1,1462,158185.343750
2,1463,185572.406250
3,1464,195268.968750
4,1465,188283.312500
...,...,...
1454,2915,82366.335938
1455,2916,83315.015625
1456,2917,166360.281250
1457,2918,112267.234375
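
The assignment data['Id'] = test_data['Id'] aligns on the index and does not raise if the two frames have different row counts, so an off-by-one between predictions and Ids can go unnoticed. In the run shown, the first line of the output file was read as the column header, which makes a row-count check a cheap safeguard. Below is a dependency-free sketch using the standard csv module rather than pandas; the function name `write_submission` is hypothetical.

```python
import csv
import io


def write_submission(ids, predictions, out_file):
    """Write Id,SalePrice rows, refusing to pair lists of different lengths."""
    if len(ids) != len(predictions):
        raise ValueError(
            '{} ids but {} predictions'.format(len(ids), len(predictions)))
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['Id', 'SalePrice'])
    writer.writerows(zip(ids, predictions))


# Small in-memory example using the first two rows shown above.
buffer = io.StringIO()
write_submission([1461, 1462], [124217.953125, 158185.34375], buffer)
```

With the notebook's frames, the inputs could be `test_data['Id'].tolist()` and `data['SalePrice'].tolist()`, with `out_file` opened via `open('./submission.csv', 'w', newline='')`.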


### View other candidates explored by SageMaker Autopilot
You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric.

In [None]:
# cell 14
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
for index, candidate in enumerate(candidates, start=1):
    print(str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
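
The loop above reads FinalAutoMLJobObjectiveMetric from every candidate. A candidate that did not finish may lack that entry, so the sketch below skips such candidates and returns the rest sorted best-first. The function name `rank_candidates` and the `higher_is_better` flag are hypothetical; for R2, as used in this notebook, higher values are better.

```python
def rank_candidates(candidates, higher_is_better=True):
    """Return (name, value) pairs sorted best-first, skipping unscored candidates."""
    scored = [
        (c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])
        for c in candidates
        if 'FinalAutoMLJobObjectiveMetric' in c
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)
```

Calling `rank_candidates(candidates)` on the list fetched above gives the same ranking without relying on the server-side sort order.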

### Candidate Generation Notebook
    
SageMaker Autopilot also auto-generates a Candidate Definitions notebook. This notebook can be used to interactively step through the steps SageMaker Autopilot took to arrive at the best candidate. It can also be used to override runtime parameters such as parallelism, hardware used, algorithms explored, feature extraction scripts, and more.
    
The notebook can be downloaded from the following Amazon S3 location:

In [None]:
# cell 15
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']


### Data Exploration Notebook
SageMaker Autopilot also auto-generates a Data Exploration notebook, which can be downloaded from the following Amazon S3 location:

In [None]:
# cell 16
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
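
Both notebook locations are returned as S3 URIs, while the download_file call used earlier in this notebook takes a bucket and a key separately. Below is a minimal sketch of a helper that splits such a URI. The function name `split_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('Not an S3 URI: {!r}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')
```

The resulting pair can be passed to `boto3.resource('s3').Bucket(bucket_name).download_file(key, local_path)`, where `local_path` is a file name of your choice.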
