# Train and Deploy (Based on lab)


In [4]:
# !pip3 install -U sagemaker

In [5]:
import os
import boto3
import sagemaker
import pandas as pd

# Set up the SageMaker role and session
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name

# Set up the S3 bucket and prefix
bucket = sess.default_bucket()
prefix = "predictive-maintenance-feature-store"

# Load the data from S3
s3_client = boto3.client('s3')
s3_key = 'root/AAI-540_Predictive-Maintenance-for-Pharmaceutical-Manufacturing-Equipment/predictive_maintenance_dataset.csv'

response = s3_client.get_object(Bucket=bucket, Key=s3_key)
df = pd.read_csv(response['Body'])

# Print dataset info
print("Dataset loaded successfully. Shape:", df.shape)
print("\nFirst few rows of the dataset:")
print(df.head())


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
Dataset loaded successfully. Shape: (124494, 12)

First few rows of the dataset:
       date    device  failure    metric1  metric2  metric3  metric4  metric5  \
0  1/1/2015  S1F01085        0  215630672       55        0       52        6   
1  1/1/2015  S1F0166B        0   61370680        0        3        0        6   
2  1/1/2015  S1F01E6Y        0  173295968        0        0        0       12   
3  1/1/2015  S1F01JE0        0   79694024        0        0        0        6   
4  1/1/2015  S1F01R2B        0  135970480        0        0        0       15   

   metric6  metric7  metric8  metric9  
0   407438        0        0        7  
1   403174        0        0        0  
2   237394        0        0        0  
3   410186        0        0        0  
4   313173        0        0       

In [6]:
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')

# List all S3 buckets
response = s3.list_buckets()

# Print the bucket names
print("Available buckets:")
for b in response['Buckets']:  # separate loop name so the `bucket` variable set earlier is not overwritten
    print(f'  {b["Name"]}')


Available buckets:
  aws-athena-query-results-807494057176-us-east-1
  sagemaker-us-east-1-807494057176


## Data preparation

In [7]:
import boto3
import pandas as pd
import numpy as np

# Initialize S3 client
s3 = boto3.client("s3")

# Define the file details
filename = "predictive_maintenance_dataset.csv"
bucket = "sagemaker-us-east-1-807494057176"  # The bucket you identified
key = "root/AAI-540_Predictive-Maintenance-for-Pharmaceutical-Manufacturing-Equipment/predictive_maintenance_dataset.csv"

# Download the file from S3
s3.download_file(bucket, key, filename)

# Load the data into a pandas DataFrame
data = pd.read_csv(filename)

# Specify the columns for the dataset
data.columns = [
    "date",
    "device",
    "failure",
    "metric1",
    "metric2",
    "metric3",
    "metric4",
    "metric5",
    "metric6",
    "metric7",
    "metric8",
    "metric9"
]

# Save the data to a new CSV file
data.to_csv("data.csv", sep=",", index=False)

# Display a random sample of 8 rows from the dataset
print(data.sample(8))

              date    device  failure    metric1  metric2  metric3  metric4  \
57103    3/19/2015  W1F0Z3G1        0   77249240        0        0        0   
80935    5/12/2015  Z1F0R8QZ        0  239701824        0       15        0   
108623   7/31/2015  W1F1DPSA        0  238820120        0        0        0   
19078    1/23/2015  W1F0Z4SP        0    2344904        0        0        0   
76394     5/1/2015  S1F0FW8K        0  131957144        0        0        0   
38590    2/20/2015  S1F0S387        0  204621608        0        2        0   
123000  10/11/2015  S1F135GA        0  213125872        0        0        0   
24900    1/31/2015  Z1F0E1CS        0  136402416        0        0        0   

        metric5  metric6  metric7  metric8  metric9  
57103        15   252142        0        0        0  
80935        95   247988        0        0      155  
108623       13       46        0        0        0  
19078         8   192925        0        0        3  
76394        92   

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124494 entries, 0 to 124493
Data columns (total 12 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   date     124494 non-null  object
 1   device   124494 non-null  object
 2   failure  124494 non-null  int64 
 3   metric1  124494 non-null  int64 
 4   metric2  124494 non-null  int64 
 5   metric3  124494 non-null  int64 
 6   metric4  124494 non-null  int64 
 7   metric5  124494 non-null  int64 
 8   metric6  124494 non-null  int64 
 9   metric7  124494 non-null  int64 
 10  metric8  124494 non-null  int64 
 11  metric9  124494 non-null  int64 
dtypes: int64(10), object(2)
memory usage: 11.4+ MB


#### Key observations:

* The dataset has 124,494 observations and 12 columns.
* The first column is the date attribute, which might not be useful for training a machine learning model but could be useful for time-based analysis. We may choose to drop it for training but retain it for later analysis or output.
* The second column is device, which represents the device ID. We will drop this before training but can add it back to the final output alongside failure probability predictions.
* The third column is failure, which is the target variable. This indicates whether a device has failed (0 = No Failure; 1 = Failure). This will be the label we use for training the model.
* There are 9 numeric features (metric1 through metric9) that represent various operational metrics from the devices. These features will be used for training and inference.

### Train, Test, Validate and Production data
Because the records are time-stamped, we split the data chronologically rather than at random: records through April 30, 2015 form the training set, May and June the validation set, July and August the test set, and September through early November the production set. With these cut-offs the four sets hold roughly 61%, 18%, 15%, and 6% of the rows. We drop the 'device' field from the training, validation, and testing sets because a device ID is not a useful training feature. The production set keeps 'device': we filter it out before running inference so that the input features match those of the training set, then use it to join the inference results back to each device.

In [9]:
import pandas as pd
import numpy as np

# Convert 'date' to datetime format if not already done
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Sort the dataset by date to maintain chronological order
df = df.sort_values(by='date')

# Define the cut-off points for each split
train_cutoff = '2015-04-30'  # Training up to April
val_cutoff = '2015-06-30'    # Validation in May and June
test_cutoff = '2015-08-31'   # Testing in July and August
prod_cutoff = '2015-11-02'   # Production in September to November

# Create the time-based splits
data_train = df[df['date'] <= train_cutoff].drop(['device'], axis=1)  # Training set (drop 'device')
data_val = df[(df['date'] > train_cutoff) & (df['date'] <= val_cutoff)].drop(['device'], axis=1)  # Validation set
data_test = df[(df['date'] > val_cutoff) & (df['date'] <= test_cutoff)].drop(['device'], axis=1)  # Testing set
data_prod = df[(df['date'] > test_cutoff) & (df['date'] <= prod_cutoff)]  # Production set (keep 'device')

# Output the shapes of the splits to verify
print("Training set shape:", data_train.shape)
print("Validation set shape:", data_val.shape)
print("Testing set shape:", data_test.shape)
print("Production set shape:", data_prod.shape)

Training set shape: (76377, 11)
Validation set shape: (21799, 11)
Testing set shape: (18877, 11)
Production set shape: (7441, 12)
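
The date-based assignment used above can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: the helper name `assign_split` and the sample dates are assumptions, while the cut-off dates are the ones from the cell above.

```python
from datetime import date


def assign_split(day, train_end, val_end, test_end):
    """Return the split name for a record dated `day` (cut-offs are inclusive)."""
    if day <= train_end:
        return "train"
    if day <= val_end:
        return "validation"
    if day <= test_end:
        return "test"
    return "production"


# Cut-off dates taken from the notebook cell above.
cutoffs = (date(2015, 4, 30), date(2015, 6, 30), date(2015, 8, 31))

# Hypothetical sample dates, chosen to land on both sides of each boundary.
sample_days = [
    date(2015, 1, 1),
    date(2015, 4, 30),
    date(2015, 5, 1),
    date(2015, 7, 15),
    date(2015, 9, 1),
]
splits = [assign_split(d, *cutoffs) for d in sample_days]
print(splits)
```

Splitting by date keeps every validation, test, and production record later than the training records, which avoids evaluating on data that precedes what the model has seen.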


Let's upload these datasets to S3.

In [10]:
# Save each dataset to CSV files and upload them to S3
# 1. Training dataset
train_file = "train_data.csv"
data_train.to_csv(train_file, index=False, header=False)
sess.upload_data(train_file, key_prefix="{}/train".format(prefix))

# 2. Validation dataset
validation_file = "validation_data.csv"
data_val.to_csv(validation_file, index=False, header=False)
sess.upload_data(validation_file, key_prefix="{}/validation".format(prefix))

# 3. Testing dataset
test_file = "test_data.csv"
data_test.to_csv(test_file, index=False, header=False)
sess.upload_data(test_file, key_prefix="{}/test".format(prefix))

# 4. Production dataset (keeping the 'device' column)
production_file = "production_data.csv"
data_prod.to_csv(production_file, index=False, header=False)
sess.upload_data(production_file, key_prefix="{}/production".format(prefix))

# Output confirmation
print("Datasets uploaded successfully to S3 bucket '{}' with prefix '{}'.".format(bucket, prefix))

Datasets uploaded successfully to S3 bucket 'sagemaker-us-east-1-807494057176' with prefix 'predictive-maintenance-feature-store'.


In [11]:
# 5. Batch Inference dataset (dropping 'date', 'device', and 'failure')
batch_file_noID = "batch_data_noID.csv"
data_batch_noID = data_prod.drop(columns=['date', 'device', 'failure'])
data_batch_noID.to_csv(batch_file_noID, index=False, header=False)
sess.upload_data(batch_file_noID, key_prefix="{}/batch".format(prefix))

# Output confirmation
print(f"Datasets uploaded successfully to S3 bucket '{bucket}' with prefix '{prefix}'.")

Datasets uploaded successfully to S3 bucket 'sagemaker-us-east-1-807494057176' with prefix 'predictive-maintenance-feature-store'.


---

## Training job and model creation

The cell below uses the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off the training job using both our training set and validation set. Note that the objective is set to 'binary:logistic', which trains a model to output a probability between 0 and 1 (here, the probability that a device fails).
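
To make the 'binary:logistic' output concrete, here is a minimal sketch of how a raw score is mapped to a probability with the logistic function, and how a probability could be turned into a 0/1 label. The function names and the 0.5 threshold are assumptions for illustration; the notebook itself does not choose a threshold.

```python
import math


def sigmoid(margin):
    """Map a raw score (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 label. The 0.5 default is an assumption."""
    return 1 if probability >= threshold else 0


# A score of 0 sits exactly on the decision boundary.
print(sigmoid(0.0))

# A small probability, like the ones returned later in this notebook,
# falls below the assumed threshold and maps to "no failure".
print(classify(0.0026))
```

With a rare positive class, a threshold lower than 0.5 may be more appropriate; that choice depends on the cost of missed failures versus false alarms.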

In [12]:
# Save CSV without headers
data_train.to_csv("train_data.csv", index=False, header=False)
data_val.to_csv("validation_data.csv", index=False, header=False)


In [13]:
print(data_train.columns)
print(data_val.columns)

Index(['date', 'failure', 'metric1', 'metric2', 'metric3', 'metric4',
       'metric5', 'metric6', 'metric7', 'metric8', 'metric9'],
      dtype='object')
Index(['date', 'failure', 'metric1', 'metric2', 'metric3', 'metric4',
       'metric5', 'metric6', 'metric7', 'metric8', 'metric9'],
      dtype='object')


In [14]:
# Drop the 'date' column from training and validation datasets
data_train = data_train.drop(columns=['date'])
data_val = data_val.drop(columns=['date'])

In [15]:
print(data_train.columns)
print(data_val.columns)

Index(['failure', 'metric1', 'metric2', 'metric3', 'metric4', 'metric5',
       'metric6', 'metric7', 'metric8', 'metric9'],
      dtype='object')
Index(['failure', 'metric1', 'metric2', 'metric3', 'metric4', 'metric5',
       'metric6', 'metric7', 'metric8', 'metric9'],
      dtype='object')


In [16]:
# Save the cleaned datasets
data_train.to_csv("train_data.csv", index=False, header=False)
data_val.to_csv("validation_data.csv", index=False, header=False)

In [17]:
# Upload to S3
sess.upload_data("train_data.csv", key_prefix="{}/train".format(prefix))
sess.upload_data("validation_data.csv", key_prefix="{}/validation".format(prefix))

's3://sagemaker-us-east-1-807494057176/predictive-maintenance-feature-store/validation/validation_data.csv'

In [18]:
from time import gmtime, strftime
import sagemaker

# Generate a unique job name
job_name = "xgb-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, job_name)

# Retrieve the XGBoost container image for the current region
image = sagemaker.image_uris.retrieve(
    framework="xgboost", region=boto3.Session().region_name, version="1.7-1"
)

# Create the XGBoost estimator
sm_estimator = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size=50,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sess,
)

# Set the hyperparameters for the XGBoost model
sm_estimator.set_hyperparameters(
    objective="binary:logistic",  # Binary classification
    max_depth=5,                  # Maximum depth of trees
    eta=0.2,                      # Learning rate
    gamma=4,                      # Minimum loss reduction required for a split
    min_child_weight=6,           # Minimum sum of instance weight in a child
    subsample=0.8,                # Subsample ratio
    verbosity=0,                  # Silent training output
    num_round=100,                # Number of boosting rounds
)

# Define the input data channels for training and validation
train_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)

validation_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validation".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)

data_channels = {"train": train_data, "validation": validation_data}

# Start the training job and monitor logs
sm_estimator.fit(inputs=data_channels, job_name=job_name, logs=True)

INFO:sagemaker:Creating training-job with name: xgb-2024-10-19-23-41-22


2024-10-19 23:41:24 Starting - Starting the training job...
2024-10-19 23:41:39 Starting - Preparing the instances for training...
2024-10-19 23:42:26 Downloading - Downloading the training image......
2024-10-19 23:43:22 Training - Training image download completed. Training in progress..
[2024-10-19 23:43:27.224 ip-10-2-112-115.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2024-10-19 23:43:27.246 ip-10-2-112-115.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2024-10-19:23:43:27:INFO] Imported framework sagemaker_xgboost_container.training
[2024-10-19:23:43:27:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2024-10-19:23:43:27:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-19:23:43:27:INFO] Running XGBoost Sagemaker in algorithm mode
[2024-10-19:23:43:27:INFO] Determined 0 GPU(s) available o

In [19]:
# Check output location
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, job_name)
print(output_location)

s3://sagemaker-us-east-1-807494057176/predictive-maintenance-feature-store/output/xgb-2024-10-19-23-41-22


---

## Batch Transform



#### 1. Create a transform job with the default configurations


In [20]:
from sagemaker.transformer import Transformer
import sagemaker

# Set up the transformer with the trained model
sm_transformer = sm_estimator.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",  # Set instance type for batch processing
    output_path="s3://{}/{}/batch-inference-output".format(bucket, prefix)  # Output path
)

# Input location (batch data without 'date', 'device', and 'failure' columns)
input_location = "s3://{}/{}/batch/batch_data_noID.csv".format(bucket, prefix)

# Start the Batch Transform job
sm_transformer.transform(
    data=input_location,
    content_type="text/csv",
    split_type="Line"  # Process each line of the CSV as an individual input
)

# Wait for the transform job to complete
sm_transformer.wait()

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-10-19-23-44-09-820
INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-10-19-23-44-10-505


..........................................
[2024-10-19:23:51:04:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-19:23:51:04:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-19:23:51:04:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    

Let's inspect the output of the Batch Transform job in S3. It should contain one failure probability per input record.

In [21]:
import re


def get_csv_output_from_s3(s3uri, batch_file):
    file_name = "{}.out".format(batch_file)
    match = re.match("s3://([^/]+)/(.*)", "{}/{}".format(s3uri, file_name))
    output_bucket, output_prefix = match.group(1), match.group(2)
    s3.download_file(output_bucket, output_prefix, file_name)
    return pd.read_csv(file_name, sep=",", header=None)
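
The helper above splits an S3 URI into bucket and key with a regular expression. As an alternative sketch, the standard library's `urllib.parse.urlparse` can do the same split. The function name `split_s3_uri` and the example URI are assumptions for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, shaped like the transform output location used here.
bucket_name, object_key = split_s3_uri(
    "s3://example-bucket/some-prefix/batch-inference-output/batch_data_noID.csv.out"
)
print(bucket_name)
print(object_key)
```

Unlike `re.match`, which returns `None` for a malformed URI and then fails on `.group(1)`, this version raises a descriptive `ValueError`.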

In [22]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Unnamed: 0,0
0,0.00261
1,0.000298
2,6.6e-05
3,0.000669
4,0.000269
5,0.000107
6,0.000446
7,0.000225


#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. Batch Transform can also exclude an ID column through __input_filter__, so a separate ID-free file in S3 is not strictly required. The next cell sets only __join_source__, because `batch_data_noID.csv` has no ID column to exclude.

* Setting __input_filter__ to "$[1:]" excludes column 0 (the 'ID') before inference and keeps everything from column 1 to the last column (all the features or predictors). Use it only when column 0 of the input file really is an ID.

* Setting __join_source__ to "Input" joins the input data with the inference results.

* Leaving __output_filter__ at its default ('$') saves the joined input and inference results as the output.
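
The effect of the three attributes can be emulated locally on a single CSV row. The sketch below is an assumption-laden illustration, not SageMaker code: `apply_filter` handles only the filter forms that appear in this notebook ("$", a slice such as "$[1:]", and an index list such as "$[0,-1]"), and `fake_predict` is a hypothetical stand-in for the model.

```python
def apply_filter(columns, spec):
    """Emulate the few JSONPath forms used in this notebook on a list of columns."""
    if spec == "$":
        return list(columns)
    inner = spec[2:-1]  # strip the leading "$[" and the trailing "]"
    if ":" in inner:
        start, stop = inner.split(":")
        return list(columns[(int(start) if start else None):(int(stop) if stop else None)])
    return [columns[int(i)] for i in inner.split(",")]


def emulate_transform(row, predict, input_filter="$", join_source=None, output_filter="$"):
    """Filter the input, predict, optionally join with the input, filter the output."""
    features = apply_filter(row, input_filter)
    prediction = predict(features)
    joined = list(row) + [prediction] if join_source == "Input" else [prediction]
    return apply_filter(joined, output_filter)


seen = []  # records what the stand-in model actually receives


def fake_predict(features):
    seen.append(features)
    return 0.5  # made-up probability


# Made-up row: an ID in column 0 followed by three feature values.
row = ["W1F1CL1K", 139851104, 0, 0]
result = emulate_transform(row, fake_predict, "$[1:]", "Input", "$[0,-1]")
print(result)   # ['W1F1CL1K', 0.5]
print(seen[0])  # [139851104, 0, 0]
```

For a file laid out like `production_data.csv` (date, device, failure, then the nine metrics), the analogous settings would presumably be `input_filter="$[3:]"` and `output_filter="$[1,-1]"`; that combination is not run in this notebook.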

In [23]:
# Required configurations for the batch transform
sm_transformer.assemble_with = "Line"
sm_transformer.accept = "text/csv"

# Input location (batch data without 'date', 'device', and 'failure' columns)
input_location = "s3://{}/{}/batch/{}".format(bucket, prefix, batch_file_noID)

# Start a transform job
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    join_source="Input",  # Join the input data with the prediction results
)

# Wait for the transform job to complete
sm_transformer.wait()

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-10-19-23-55-49-400


.......................................
[2024-10-20:00:02:19:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-20:00:02:19:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-20:00:02:19:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout c

Let's inspect the output of the Batch Transform job in S3. Each row should contain the nine input features followed by the predicted failure probability.

In [24]:
# Ensure we use the correct file name (batch_data_noID.csv) for the output
batch_file_noID = "batch_data_noID.csv"

# Retrieve and display the output from S3 using the correct batch file
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,139851104,0,0,112,20,376564,0,0,0,0.00261
1,207060288,0,0,0,19,372208,0,0,0,0.000298
2,145452104,0,0,0,7,31,0,0,0,6.6e-05
3,12870608,0,34,0,23,323213,0,0,205,0.000669
4,154655112,0,7,0,8,262558,0,0,0,0.000269
5,146152808,0,0,0,9,221471,0,0,0,0.000107
6,68643440,0,0,0,19,353212,0,0,1,0.000446
7,1450432,0,0,0,9,287598,0,0,0,0.000225


In [25]:
# Re-add the 'device' column to the output dataframe
output_with_device = pd.concat([data_prod['device'].reset_index(drop=True), output_df], axis=1)

# View the final dataframe with predictions and device column
print(output_with_device.head())

     device          0  1   2    3   4       5  6  7    8         9
0  W1F1CL1K  139851104  0   0  112  20  376564  0  0    0  0.002610
1  W1F1BTB2  207060288  0   0    0  19  372208  0  0    0  0.000298
2  W1F1C9HM  145452104  0   0    0   7      31  0  0    0  0.000066
3  W1F1CHY9   12870608  0  34    0  23  323213  0  0  205  0.000669
4  W1F1CJ3G  154655112  0   7    0   8  262558  0  0    0  0.000269


#### 3. Update the output filter to keep only ID and prediction results
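Let's change __output_filter__ to "$[0,-1]" so that the output keeps only column 0 (intended to be the 'ID') and the last column (the inference result, i.e. the probability that a given device fails). Note that `batch_data_noID.csv` contains no ID column, so in the cell below column 0 is `metric1`, and `input_filter="$[1:]"` removes `metric1` from the features sent to the model. This is the likely reason the probabilities below differ from those of the earlier runs.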

In [26]:
# start another transform job
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    input_filter="$[1:]",
    join_source="Input",
    output_filter="$[0,-1]",
)
sm_transformer.wait()

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-10-20-00-02-54-250


.............................................
[2024-10-20:00:10:24:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-20:00:10:24:INFO] No GPUs detected (normal if no gpus installed)
[2024-10-20:00:10:24:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
 

Now, let's inspect the output of the Batch Transform job in S3 again. It should show two columns: column 0 of the input file (here `metric1`, since `batch_data_noID.csv` has no device column) and the corresponding probability of failure.

In [27]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Unnamed: 0,0,1
0,139851104,0.003703
1,207060288,0.003703
2,145452104,0.003224
3,12870608,0.007542
4,154655112,0.005833
5,146152808,0.00352
6,68643440,0.003703
7,1450432,0.00352


In summary, we can use three attributes, __input_filter__, __join_source__, and __output_filter__, to:
1. Filter / select useful features from the input dataset. e.g. exclude ID columns.
2. Associate the prediction results with their corresponding input records.
3. Filter the original or joined results before saving to S3. e.g. keep ID and probability columns only.

## Create a SageMaker Model from the artifacts of a training job

In [30]:
import sagemaker

# Create a SageMaker session
sagemaker_session = sagemaker.Session()

# List training jobs
training_jobs = sagemaker_session.sagemaker_client.list_training_jobs()

# Display training jobs
print(training_jobs)


{'TrainingJobSummaries': [{'TrainingJobName': 'xgb-2024-10-19-23-41-22', 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:807494057176:training-job/xgb-2024-10-19-23-41-22', 'CreationTime': datetime.datetime(2024, 10, 19, 23, 41, 22, 815000, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2024, 10, 19, 23, 43, 45, 742000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2024, 10, 19, 23, 43, 46, 259000, tzinfo=tzlocal()), 'TrainingJobStatus': 'Completed'}, {'TrainingJobName': 'xgb-2024-10-16-21-59-14', 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:807494057176:training-job/xgb-2024-10-16-21-59-14', 'CreationTime': datetime.datetime(2024, 10, 16, 21, 59, 14, 306000, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2024, 10, 16, 22, 1, 35, 715000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2024, 10, 16, 22, 1, 36, 113000, tzinfo=tzlocal()), 'TrainingJobStatus': 'Completed'}, {'TrainingJobName': 'xgb-2024-10-16-17-53-36', 'TrainingJobArn': 'arn:aws:sage

In [31]:
import boto3

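# Initialize the low-level SageMaker client.
# NOTE: this rebinds the name `sagemaker`, so the SageMaker Python SDK module
# imported earlier is no longer reachable under that name in later cells.
sagemaker = boto3.client("sagemaker")

# Name for the new model (it reuses the timestamp of the last transform job above)
model_name = "sagemaker-xgboost-2024-10-20-00-02-54-250"
print(f"Model Name: {model_name}")

# Retrieve the training job details.
# NOTE: this is an earlier training job; the job trained above is stored in
# `job_name` (xgb-2024-10-19-23-41-22). Pass job_name to use that model instead.
info = sagemaker.describe_training_job(TrainingJobName="xgb-2024-10-16-21-59-14")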

# Get the model artifacts S3 path
model_data = info["ModelArtifacts"]["S3ModelArtifacts"]
print(f"Model Data S3 Path: {model_data}")

# Define the primary container
primary_container = {
    "Image": image,  # Make sure the 'image' variable is defined (this could be the XGBoost image URI)
    "ModelDataUrl": model_data
}

# Create the SageMaker Model resource (create_model does not register it in the Model Registry)
create_model_response = sagemaker.create_model(
    ModelName=model_name, 
    ExecutionRoleArn=role,  # Ensure 'role' contains the appropriate SageMaker execution role ARN
    PrimaryContainer=primary_container
)

# Print the ARN of the created model
print(f"Model ARN: {create_model_response['ModelArn']}")


Model Name: sagemaker-xgboost-2024-10-20-00-02-54-250
Model Data S3 Path: s3://sagemaker-us-east-1-807494057176/predictive-maintenance-feature-store/output/xgb-2024-10-16-21-59-14/xgb-2024-10-16-21-59-14/output/model.tar.gz
Model ARN: arn:aws:sagemaker:us-east-1:807494057176:model/sagemaker-xgboost-2024-10-20-00-02-54-250


In [32]:
# Inspect Training Job Details
info

{'TrainingJobName': 'xgb-2024-10-16-21-59-14',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:807494057176:training-job/xgb-2024-10-16-21-59-14',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-807494057176/predictive-maintenance-feature-store/output/xgb-2024-10-16-21-59-14/xgb-2024-10-16-21-59-14/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'eta': '0.2',
  'gamma': '4',
  'max_depth': '5',
  'min_child_weight': '6',
  'num_round': '100',
  'objective': 'binary:logistic',
  'subsample': '0.8',
  'verbosity': '0'},
 'AlgorithmSpecification': {'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:mae',
    'Regex': '.*\\[[0-9]+\\].*#011train-mae:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:aucpr',
    'Regex': '.*\\[[0-9]+\\].*#011validation-aucpr:([-+]?[0-9]*\\.?[0-9]+(?

In [33]:
# Create Endpoint Configuration


# Create an endpoint config name based on the current timestamp
# so that we can search endpoint configs by creation time.
endpoint_config_name = 'project-endpoint-config-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Specify the instance type for hosting the model
instance_type = 'ml.m5.xlarge'

# Create the endpoint configuration
endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,  # This name will be used in the CreateEndpoint request.
    ProductionVariants=[
        {
            "VariantName": "variant1",        # Name for the production variant
            "ModelName": model_name,          # Name of the model to deploy
            "InstanceType": instance_type,    # Compute instance type
            "InitialInstanceCount": 1         # Number of instances to launch initially
        }
    ]
)

# Print the created endpoint configuration ARN
print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")


Created EndpointConfig: arn:aws:sagemaker:us-east-1:807494057176:endpoint-config/project-endpoint-config-2024-10-20-00-12-20


In [34]:
# Deploy our model to real-time endpoint

endpoint_name = 'project-endpoint' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())                            


create_endpoint_response = sagemaker.create_endpoint(
                                            EndpointName=endpoint_name, 
                                            EndpointConfigName=endpoint_config_name) 

In [35]:
import time  # Ensure time library is imported for sleep

# Describe the endpoint status and wait until it's 'InService'
def wait_for_endpoint(endpoint_name):
    while True:
        print("Checking Endpoint Status...")
        res = sagemaker.describe_endpoint(EndpointName=endpoint_name)
        state = res["EndpointStatus"]

        if state == "InService":
            print(f"Endpoint {endpoint_name} is now InService and ready to use.")
            break
        elif state == "Creating":
            print(f"Endpoint {endpoint_name} is still creating. Waiting for 60 seconds...")
            time.sleep(60)
        else:
            print(f"Error: Endpoint {endpoint_name} creation failed with status '{state}'.")
            print("Please check the SageMaker Console for more details.")
            break

# Call the function to monitor the endpoint status
wait_for_endpoint(endpoint_name)


Checking Endpoint Status...
Endpoint project-endpoint2024-10-20-00-12-24 is still creating. Waiting for 60 seconds...
Checking Endpoint Status...
Endpoint project-endpoint2024-10-20-00-12-24 is still creating. Waiting for 60 seconds...
Checking Endpoint Status...
Endpoint project-endpoint2024-10-20-00-12-24 is still creating. Waiting for 60 seconds...
Checking Endpoint Status...
Endpoint project-endpoint2024-10-20-00-12-24 is still creating. Waiting for 60 seconds...
Checking Endpoint Status...
Endpoint project-endpoint2024-10-20-00-12-24 is now InService and ready to use.
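
The polling loop above runs until the endpoint leaves the 'Creating' state, with no upper bound on the wait. A bounded variant is sketched below. The helper name `wait_until`, its parameters, and the fake status source are assumptions for illustration; in the notebook, `get_status` would wrap the `describe_endpoint` call.

```python
import time


def wait_until(get_status, done="InService", pending=("Creating",),
               max_attempts=30, delay=60, sleep=time.sleep):
    """Poll get_status() until it returns `done`, failing fast on other states."""
    for _ in range(max_attempts):
        status = get_status()
        if status == done:
            return status
        if status not in pending:
            raise RuntimeError(f"Unexpected status: {status!r}")
        sleep(delay)
    raise TimeoutError(f"Still pending after {max_attempts} attempts")


# Fake status source standing in for describe_endpoint(...)["EndpointStatus"].
statuses = iter(["Creating", "Creating", "InService"])
final = wait_until(lambda: next(statuses), sleep=lambda seconds: None)
print(final)  # InService
```

boto3 also ships a built-in waiter for this case, obtained with `get_waiter("endpoint_in_service")` on the SageMaker client, which may be preferable to a hand-written loop.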


In [36]:
# Invoke Endpoint

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region)

response = sagemaker_runtime.invoke_endpoint(
                            EndpointName=endpoint_name,
                            ContentType='text/csv',
                            Body=data_batch_noID.to_csv(header=None, index=False).strip('\n').split('\n')[0]
                            )
print(response['Body'].read().decode('utf-8'))

0.002609927672892809



In [37]:
# Examine Response Body

response

{'ResponseMetadata': {'RequestId': 'fdca1f8d-6835-45f3-bb21-2c84c2920b3f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fdca1f8d-6835-45f3-bb21-2c84c2920b3f',
   'x-amzn-invoked-production-variant': 'variant1',
   'date': 'Sun, 20 Oct 2024 00:16:27 GMT',
   'content-type': 'text/csv; charset=utf-8',
   'content-length': '21',
   'connection': 'keep-alive'},
  'RetryAttempts': 0},
 'ContentType': 'text/csv; charset=utf-8',
 'InvokedProductionVariant': 'variant1',
 'Body': <botocore.response.StreamingBody at 0x7f08fab7f730>}

In [43]:
import pandas as pd

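# Load the original validation data (saved without a header row, hence header=None)
validation_data = pd.read_csv("validation_data.csv", header=None)

# Extract the probability column from the output DataFrame.
# NOTE: output_df holds the batch predictions for batch_data_noID.csv (the
# production set), so the rows are paired with validation labels by position
# only. For a meaningful evaluation, run a transform job on the validation
# features and use that output here.
probabilities = output_df[[1]].reset_index(drop=True)
probabilities.columns = ['probability']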

# Extract the label column from column 0 of the original validation data
labels = validation_data.iloc[:, 0].reset_index(drop=True)
labels.name = 'label'

# Combine probabilities and labels into the final DataFrame with probability first
validation_with_predictions = pd.concat([probabilities, labels], axis=1)

# Save to CSV with 'probability' and 'label' columns
validation_with_predictions.to_csv("validation_with_predictions.csv", index=False)

# Display the first few rows to confirm
print(validation_with_predictions.head(8))


   probability  label
0     0.003703      0
1     0.003703      0
2     0.003224      0
3     0.007542      0
4     0.005833      0
5     0.003520      0
6     0.003703      0
7     0.003520      0
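
One possible next step with the `probability` and `label` columns is to threshold the probabilities and count how the predictions compare with the labels. The following is a minimal sketch under stated assumptions: the helper `confusion_counts` and the 0.5 threshold are hypothetical and are not taken from the notebook, and the eight rows printed above are hard-coded so the example runs without the CSV file.

```python
def confusion_counts(probabilities, labels, threshold=0.5):
    """Count TP, FP, TN and FN for binary labels after thresholding.

    Assumes labels are 0 or 1; the 0.5 threshold is illustrative only.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for probability, label in zip(probabilities, labels):
        predicted = int(probability >= threshold)
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# The eight rows printed above, hard-coded so the sketch runs offline
sample_probabilities = [0.003703, 0.003703, 0.003224, 0.007542,
                        0.005833, 0.003520, 0.003703, 0.003520]
sample_labels = [0, 0, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(sample_probabilities, sample_labels)
print(counts)
```

The same function accepts the DataFrame columns directly, for example `confusion_counts(validation_with_predictions["probability"], validation_with_predictions["label"])`. With failures this rare, a threshold well below 0.5 may be more appropriate, and it would need to be chosen from the full validation set rather than from these eight rows.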


In [44]:
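# Delete the endpoint so it stops incurring charges

# Use an explicitly named boto3 client instead of rebinding the name of the `sagemaker` SDK module
sm_client = boto3.client("sagemaker", region_name=region)
sm_client.delete_endpoint(EndpointName=endpoint_name)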

{'ResponseMetadata': {'RequestId': 'bee39720-7804-42a8-bbc6-c0c72a7c0a35',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'bee39720-7804-42a8-bbc6-c0c72a7c0a35',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 20 Oct 2024 01:40:53 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
