# Task 5: Use batch transform to get inferences from a large dataset

## Task 5.1: Environment setup

Import the required packages, then set up the execution role, the SageMaker and Amazon S3 clients, and the default bucket.

In [1]:
#install-dependencies
import time

import boto3
import sagemaker
from sagemaker.transformer import Transformer

# IAM role that SageMaker assumes to access your resources
role = sagemaker.get_execution_role()

# boto3 session and the clients for SageMaker and Amazon S3
sess = boto3.Session()
region = sess.region_name
sm = sess.client('sagemaker')
s3_client = sess.client("s3")

# SageMaker session, default bucket, and the key prefix used in this lab
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/mlasms'

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Upload the model artifact (**model.tar.gz**) from the training and tuning lab to the default Amazon Simple Storage Service (Amazon S3) bucket. Then set up a model using **create_model** and configure **ModelDataUrl** to reference the uploaded model artifact.

In [2]:
#set-up-model
# Upload the trained model artifact to your Amazon S3 bucket
s3_client.upload_file(Filename="model.tar.gz", Bucket=bucket, Key=f"{prefix}/models/model.tar.gz")

# Set a date to use in the model name
create_date = time.strftime("%Y-%m-%d-%H-%M-%S")
model_name = 'income-model-{}'.format(create_date)

# Retrieve the container image
container = sagemaker.image_uris.retrieve(
    region=region,
    framework='xgboost',
    version='1.5-1'
)

# Set up the model
income_model = sm.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': f's3://{bucket}/{prefix}/models/model.tar.gz',
    }
)

Upload the batch records to the default Amazon S3 bucket.

In [3]:
#upload-dataset
s3_client.upload_file(Filename="batch_data.csv", Bucket=bucket, Key=f"{prefix}/batch_data.csv", ExtraArgs={"ContentType": "text/csv;charset=utf-8"})
batch_path = f"s3://{bucket}/{prefix}/batch_data.csv"
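
Before uploading, it can help to sanity-check the batch file. The built-in XGBoost container generally expects CSV input without a header row and without the label column, so the sketch below checks that every row has the same number of columns and that the first row is numeric. The helper name `check_batch_csv` and the demo file are hypothetical and not part of the lab; in the notebook you would pass "batch_data.csv".

```python
import csv
import os
import tempfile


def check_batch_csv(path):
    """Return (row_count, column_count); raise ValueError on ragged rows or a likely header."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = [row for row in csv.reader(f) if row]
    if not rows:
        raise ValueError("file is empty")
    width = len(rows[0])
    for i, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(f"line {i} has {len(row)} columns, expected {width}")
    try:
        for value in rows[0]:
            float(value)
    except ValueError:
        raise ValueError("first row is not numeric; it may be a header") from None
    return len(rows), width


# Demo on a small temporary file (hypothetical data, not the lab dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        f.write("57,2,4,9\n24,0,0,0\n")
    shape = check_batch_csv(demo_path)

print(shape)
```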

## Task 5.2: Create a batch transform job

Batch transform automatically manages the processing of large datasets within the limits of specified parameters. When a batch transform job starts, SageMaker initializes compute instances and distributes the inference or preprocessing workload between them. Batch transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances.

Use batch transform when you need to get inferences from large datasets or when you don't need a persistent endpoint.
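
As a rough illustration of what "within the limits of specified parameters" can mean: when the input is split by line and records are batched into a single request (the "MultiRecord" strategy described later in this task), each request has to stay under a maximum payload size, which the SageMaker documentation gives as 6 MB by default. The sketch below groups lines into mini-batches under a byte limit. It is an assumption-laden illustration, not SageMaker's actual batching logic; the function name `batch_records` and the tiny 14-byte limit are hypothetical.

```python
def batch_records(lines, max_payload_bytes):
    """Group newline-terminated records into mini-batches no larger than max_payload_bytes."""
    batches, current, size = [], [], 0
    for line in lines:
        record = line.rstrip("\n") + "\n"
        n = len(record.encode("utf-8"))
        if n > max_payload_bytes:
            raise ValueError("a single record exceeds the payload limit")
        # Start a new mini-batch when adding this record would exceed the limit.
        if current and size + n > max_payload_bytes:
            batches.append("".join(current))
            current, size = [], 0
        current.append(record)
        size += n
    if current:
        batches.append("".join(current))
    return batches


# Three 7-byte records with a 14-byte limit yield two mini-batches.
records = ["57,2,4\n", "24,0,0\n", "54,1,4\n"]
batches = batch_records(records, max_payload_bytes=14)
print(batches)
```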

To create a batch transform job, you need to set the following options:
- **model_name**: The name of your model.
- **instance_type**: The type of Amazon Elastic Compute Cloud (Amazon EC2) instance to use; for example, "ml.c4.xlarge".
- **instance_count**: The number of EC2 instances to use.
- **assemble_with**: The way in which the output is assembled. Valid values are "Line" and "None".
- **strategy**: The strategy used to decide how to batch records in a single request. Valid values are "MultiRecord" and "SingleRecord".
- **accept**: The MIME type of the output that the job requests from the model container; for example, "text/csv".
- **output_path**: The Amazon S3 location for saving the transform result. If not specified, results are stored in the default bucket.

In [4]:
#create-batch-transformer
transformer = Transformer(
    model_name=model_name,
    instance_type="ml.m4.xlarge",
    instance_count=1,
    assemble_with="Line",
    strategy="MultiRecord",
    accept="text/csv",
    output_path="s3://{}/{}/batch-transform/test".format(bucket, prefix)
)
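
For comparison, the same settings can be expressed as a request for the low-level `create_transform_job` API in boto3. The sketch below only builds the request dictionary; the helper name `build_transform_job_request`, the job name, and the bucket name are hypothetical. The input settings (content type, split type, and join source) mirror the `transform` call in the next cell.

```python
def build_transform_job_request(job_name, model_name, input_uri, output_uri,
                                instance_type="ml.m4.xlarge", instance_count=1):
    """Build keyword arguments for the boto3 SageMaker create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "BatchStrategy": "MultiRecord",            # strategy
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",             # content_type
            "SplitType": "Line",                   # split_type
        },
        "TransformOutput": {
            "S3OutputPath": output_uri,            # output_path
            "Accept": "text/csv",                  # accept
            "AssembleWith": "Line",                # assemble_with
        },
        "TransformResources": {
            "InstanceType": instance_type,         # instance_type
            "InstanceCount": instance_count,       # instance_count
        },
        "DataProcessing": {"JoinSource": "Input"},  # join_source
    }


# Hypothetical names for illustration only.
request = build_transform_job_request(
    job_name="income-batch-demo",
    model_name="income-model-demo",
    input_uri="s3://amzn-s3-demo-bucket/sagemaker/mlasms/batch_data.csv",
    output_uri="s3://amzn-s3-demo-bucket/sagemaker/mlasms/batch-transform/test",
)
print(request["TransformResources"])
```

In the notebook, passing this dictionary as `sm.create_transform_job(**request)` would start the job without the **Transformer** class.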

Use the test dataset as your customer records and run the batch transform job. The job can take as long as 10 minutes to run with this set of customer records.

Refer to [Use Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) for more information about batch transform jobs.

In [5]:
#run-batch-transform-job
transformer.transform(batch_path, content_type="text/csv", split_type="Line", join_source="Input")
transformer.wait()

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-11-06-16-49-56-075


.......................................
  from pandas import MultiIndex, Int64Index
[2024-11-06:16:56:31:INFO] No GPUs detected (normal if no gpus installed)
[2024-11-06:16:56:31:INFO] No GPUs detected (normal if no gpus installed)
[2024-11-06:16:56:31:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  in
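
The call to `transformer.wait()` blocks and streams the container logs. If you only need the final status, a minimal polling sketch around the boto3 `describe_transform_job` call might look like the following. The function name `wait_for_transform_job` and its timeout defaults are assumptions for illustration; the client is passed in as a parameter so that the notebook's `sm` client can be used.

```python
import time


def wait_for_transform_job(sm_client, job_name, poll_seconds=30, timeout_seconds=3600):
    """Poll a transform job until it completes; raise if it fails, stops, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = sm_client.describe_transform_job(TransformJobName=job_name)
        status = desc["TransformJobStatus"]
        if status == "Completed":
            return desc
        if status in ("Failed", "Stopped"):
            reason = desc.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Transform job {job_name} ended as {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Transform job {job_name} is still {status}")
        time.sleep(poll_seconds)
```

In the notebook, this could be called as `wait_for_transform_job(sm, "<job name from the log above>")`.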

## Task 5.3: View the prediction data in Amazon S3

The batch transform job stores the output in the bucket and folder that you specified when you set up the transformer. You can view the prediction results in Amazon S3, either in the AWS Management Console or in the notebook.

If you want to download and view the output from the console, navigate to Amazon S3, open the bucket whose name starts with **sagemaker-**, and navigate to the object located in **/sagemaker/mlasms/batch-transform/test**. Download the **batch_data.csv.out** object and open it with a text editor. The file contains hundreds of predicted values for the customer records that you ran through the batch transform job.
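
The output object name follows from the input: for a single input object, as in this lab, the result is written under the output path with **.out** appended to the input file name. A small sketch of that naming rule, using a hypothetical helper `expected_output_uri` and a placeholder bucket name:

```python
import posixpath


def expected_output_uri(input_uri, output_path):
    """Return the S3 URI where the result for a single input object is expected."""
    name = posixpath.basename(input_uri.rstrip("/"))
    return f"{output_path.rstrip('/')}/{name}.out"


# Placeholder bucket name for illustration only.
output_uri = expected_output_uri(
    "s3://amzn-s3-demo-bucket/sagemaker/mlasms/batch_data.csv",
    "s3://amzn-s3-demo-bucket/sagemaker/mlasms/batch-transform/test",
)
print(output_uri)
```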

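A sample of the output can also be displayed in the notebook. Because the job was run with **join_source** set to "Input", each output line contains the input record followed by the predicted value in the last column.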


In [6]:
!aws s3 cp --recursive $transformer.output_path ./
!head batch_data.csv.out

download: s3://sagemaker-us-west-2-440570968020/sagemaker/mlasms/batch-transform/test/batch_data.csv.out to ./batch_data.csv.out
57,2,4,9,0,3,0,0,0,15024,0,60,0.9968317151069641
24,0,0,0,1,1,2,0,0,0,0,40,0.004997196141630411
54,1,4,3,0,3,0,0,0,0,0,50,0.8091089725494385
56,0,3,13,2,0,1,0,0,0,0,40,0.0222619641572237
44,0,4,3,2,1,1,0,0,0,0,24,0.1606707125902176
48,1,5,11,0,3,0,0,0,0,0,65,0.8538702130317688
22,2,1,1,1,0,2,0,0,0,0,20,0.0015747727593407035
40,0,3,7,1,0,5,1,0,0,0,55,0.030271224677562714
55,0,3,7,0,2,0,0,0,0,0,45,0.4514922797679901
64,0,0,0,0,1,0,0,0,0,0,65,0.39015331864356995
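
To work with the predictions in code, you can split each output line into the input features and the trailing score. The sketch below parses three lines copied from the sample above and turns each score into a class label. The 0.5 threshold and the reading of the score as a probability are assumptions, since the notebook does not show the model's training objective; the function name `split_predictions` is hypothetical.

```python
import csv
import io

# First three lines of the sample output shown above.
sample = """57,2,4,9,0,3,0,0,0,15024,0,60,0.9968317151069641
24,0,0,0,1,1,2,0,0,0,0,40,0.004997196141630411
54,1,4,3,0,3,0,0,0,0,0,50,0.8091089725494385
"""


def split_predictions(text, threshold=0.5):
    """Return (features, score, label) tuples; the last column is the predicted score."""
    results = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        *features, score = row
        value = float(score)
        # Assumed decision rule: label 1 when the score reaches the threshold.
        results.append((features, value, int(value >= threshold)))
    return results


predictions = split_predictions(sample)
labels = [label for _, _, label in predictions]
print(labels)
```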


### Conclusion

Congratulations! You have used Amazon SageMaker to successfully run a batch transform job.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.