# Getting Started with AWS Batch for SageMaker Training jobs

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

---

This sample notebook demonstrates how to submit simple 'hello world' jobs to an [AWS Batch job queue](https://aws.amazon.com/batch/) using an [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator). You can run any cell in this notebook interactively to experiment with your queue. Batch runs your jobs automatically as capacity in your service environment becomes available.

## Setup and Configure Training Job Variables
The sample jobs need only a single instance for a short time. Change any of the constants below to adjust the example to your liking.

In [1]:
INSTANCE_TYPE = "ml.g5.xlarge"
INSTANCE_COUNT = 1
MAX_RUN_TIME = 300  # seconds; passed to the Estimator as max_run
TRAINING_JOB_NAME = "hello-world-simple-job"  # passed to the Estimator as base_job_name

In [None]:
import logging

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logging.getLogger("botocore.client").setLevel(level=logging.WARN)
logger = logging.getLogger(__name__)

from sagemaker.session import Session
from sagemaker import image_uris

session = Session()

image_uri = image_uris.retrieve(
    framework="pytorch",
    region=session.boto_session.region_name,
    version="2.5",
    instance_type=INSTANCE_TYPE,
    image_scope="training",
)

## Create Sample Resources
The diagram below shows the Batch resources we'll create for this example.

![The Resources to Create](batch_getting_started_resources.png "Example Job Queue and Service Environment Resources")

You can use the [Batch Console](https://console.aws.amazon.com/batch) to create these resources, or you can run the cell below. The `create_resources` function skips any resources that already exist.

In [None]:
from sagemaker.aws_batch.boto_client import get_batch_boto_client
from utils.aws_batch_resource_management import AwsBatchResourceManager, create_resources

# This job queue name needs to match the Job Queue created in AWS Batch.
JOB_QUEUE_NAME = "my-sm-training-fifo-jq"
SERVICE_ENVIRONMENT_NAME = "my-sm-training-fifo-se"

# Create ServiceEnvironment and JobQueue
resource_manager = AwsBatchResourceManager(get_batch_boto_client())
resources = create_resources(
    resource_manager, JOB_QUEUE_NAME, SERVICE_ENVIRONMENT_NAME, max_capacity=1
)
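
The skip-if-exists behavior described above can be pictured with a small stand-alone sketch. It does not call AWS Batch and is not the implementation of the `create_resources` helper in `utils`; `ensure_created`, `existing_names`, and `create_fn` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable


def ensure_created(
    name: str,
    existing_names: Iterable[str],
    create_fn: Callable[[str], None],
) -> bool:
    """Create a resource only if its name is not already present.

    Returns True if create_fn was called, False if creation was skipped.
    """
    if name in set(existing_names):
        return False
    create_fn(name)
    return True


# In-memory stand-ins for "resources that already exist" (illustration only).
existing = ["my-sm-training-fifo-se"]
created = []

# The service environment already exists, so it is skipped.
print(ensure_created("my-sm-training-fifo-se", existing, created.append))
# The job queue is missing, so create_fn is called for it.
print(ensure_created("my-sm-training-fifo-jq", existing, created.append))
print(created)
```

Because the check happens before every create call, re-running the cell is safe: resources left over from a previous run are reused rather than duplicated.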

## Create Hello World Estimator 
Now that our resources are created, we'll construct a simple Estimator.

In [4]:
from sagemaker.estimator import Estimator
from sagemaker import get_execution_role

role = get_execution_role()
hello_world_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=INSTANCE_COUNT,
    instance_type=INSTANCE_TYPE,
    volume_size=1,
    base_job_name=TRAINING_JOB_NAME,
    container_entry_point=["echo", "Hello", "World"],
    max_run=MAX_RUN_TIME,
)

# Typically you would call hello_world_estimator.fit() here. Instead, we submit the estimator to a queue below.

## Create TrainingQueue object
Using our queue is as easy as referring to it by name in the TrainingQueue constructor. The TrainingQueue class in the SageMaker Python SDK provides built-in support for working with Batch queues.

In [None]:
from sagemaker.aws_batch.training_queue import TrainingQueue, TrainingQueuedJob

# Construct the queue object using the SageMaker Python SDK
queue = TrainingQueue(JOB_QUEUE_NAME)
logger.info(f"Using queue: {queue.queue_name}")

## Submit Some Training Jobs
Submit a job to the queue by calling `queue.submit`. This particular job doesn't require any data, but in general you provide data through the `inputs` argument.

In [None]:
# Submit first job
training_queued_job_1: TrainingQueuedJob = queue.submit(
    training_job=hello_world_estimator, inputs=None
)
logger.info(
    f"Submitted job '{training_queued_job_1.job_name}' to TrainingQueue '{queue.queue_name}'"
)

# Submit second job
training_queued_job_2: TrainingQueuedJob = queue.submit(
    training_job=hello_world_estimator, inputs=None
)
logger.info(
    f"Submitted job '{training_queued_job_2.job_name}' to TrainingQueue '{queue.queue_name}'"
)
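
If you want more than two jobs in the queue, the two submissions above generalize to a loop. The following is a minimal sketch: `submit_copies` is a hypothetical helper, not part of the SageMaker Python SDK, and it relies only on the `queue.submit(training_job=..., inputs=...)` call already used in this notebook.

```python
from typing import Any, List


def submit_copies(queue: Any, estimator: Any, count: int) -> List[Any]:
    """Submit the same estimator `count` times and return the queued-job handles."""
    jobs = []
    for _ in range(count):
        # Same call as in the cell above; inputs=None because the job reads no data.
        jobs.append(queue.submit(training_job=estimator, inputs=None))
    return jobs


# Example usage in this notebook (not executed here):
# more_jobs = submit_copies(queue, hello_world_estimator, 3)
```

Keeping the returned handles in a list makes it straightforward to describe or terminate each job later.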

## Terminate a Job in the Queue
The next cell shows how to terminate a queued job.

In [None]:
logger.info(f"Terminating job: {training_queued_job_2.job_name}")
training_queued_job_2.terminate()
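
To clear out every job that is still waiting, rather than one job at a time, you could combine `list_jobs` with `terminate`. This is a sketch under assumptions: `terminate_waiting_jobs` and `WAITING_STATUSES` are hypothetical names, and the choice of which statuses count as "waiting" (everything assumed to come before STARTING) is an assumption you may want to adjust.

```python
from typing import Any, List, Sequence

# Assumption: these statuses all precede STARTING, so the job has not begun running.
WAITING_STATUSES = ("SUBMITTED", "PENDING", "RUNNABLE", "SCHEDULED")


def terminate_waiting_jobs(
    queue: Any, statuses: Sequence[str] = WAITING_STATUSES
) -> List[str]:
    """Terminate every job in the given statuses and return the names of those jobs."""
    terminated = []
    for status in statuses:
        for job in queue.list_jobs(status=status):
            job.terminate()
            terminated.append(job.job_name)
    return terminated


# Example usage in this notebook (not executed here):
# logger.info(f"Terminated: {terminate_waiting_jobs(queue)}")
```

Note that this affects every waiting job in the queue, including jobs submitted by other users of the same queue.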

## Monitor Job Status
The next cell shows how to list the jobs that have been submitted to the TrainingQueue. The TrainingQueue can list jobs by status, and each job can be described individually for more details. Once a TrainingQueuedJob has reached the STARTING status, the logs can be printed from the underlying SageMaker training job.

In [None]:
import time


def list_jobs_in_training_queue(training_queue: TrainingQueue):
    """
    Lists all jobs in a TrainingQueue grouped by their status.

    This function retrieves jobs with different statuses (SUBMITTED, PENDING, RUNNABLE,
    SCHEDULED, STARTING, RUNNING, SUCCEEDED, FAILED) from the specified TrainingQueue
    and logs their names and current status.

    Args:
        training_queue (TrainingQueue): The TrainingQueue to query for jobs.

    Returns:
        None: This function doesn't return a value but logs job information.
    """
    statuses = [
        "SUBMITTED",
        "PENDING",
        "RUNNABLE",
        "SCHEDULED",
        "STARTING",
        "RUNNING",
        "SUCCEEDED",
        "FAILED",
    ]

    # Query the queue once per status and collect the results in order.
    all_jobs = []
    for status in statuses:
        all_jobs.extend(training_queue.list_jobs(status=status))

    for job in all_jobs:
        job_status = job.describe().get("status", "")
        logger.info(f"Job : {job.job_name} is {job_status}")


def monitor_training_queued_job(job: TrainingQueuedJob):
    """
    Monitors a TrainingQueuedJob until it reaches an active or terminal state.

    This function continuously polls the status of the specified TrainingQueuedJob
    until it transitions to one of the following states: STARTING, RUNNING,
    SUCCEEDED, or FAILED. Once the job reaches one of these states, the function
    retrieves and displays the job's logs.

    Args:
        job (TrainingQueuedJob): The TrainingQueuedJob to monitor.

    Returns:
        None: This function doesn't return a value but displays job logs.
    """
    while True:
        job_status = job.describe().get("status", "")

        if job_status in {"STARTING", "RUNNING", "SUCCEEDED", "FAILED"}:
            break

        logger.info(f"Job : {job.job_name} is {job_status}")
        time.sleep(5)

    # Print training job logs
    job.get_estimator().logs()


logger.info(f"Listing all jobs in queue '{queue.queue_name}'...")
list_jobs_in_training_queue(queue)

logger.info(f"Polling job status for '{training_queued_job_1.job_name}'")
monitor_training_queued_job(training_queued_job_1)
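
The polling loop in `monitor_training_queued_job` runs until the job leaves the waiting states, which can take a long time if the service environment has no free capacity. A bounded variant is sketched below. `wait_for_status`, `timeout_seconds`, and `poll_seconds` are hypothetical names; the only job calls used are `describe()` and `job_name`, as in the cell above.

```python
import time
from typing import Any, Collection


def wait_for_status(
    job: Any,
    target_statuses: Collection[str],
    timeout_seconds: float = 600.0,
    poll_seconds: float = 5.0,
) -> str:
    """Poll job.describe() until the status is in target_statuses.

    Returns the status that ended the wait, or raises TimeoutError if the
    deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = job.describe().get("status", "")
        if status in target_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job.job_name} is still '{status}' after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Example usage in this notebook (not executed here):
# wait_for_status(training_queued_job_1, {"STARTING", "RUNNING", "SUCCEEDED", "FAILED"})
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by changes to the system clock.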

## Optional: Delete AWS Batch Resources
This cell shows how to delete the AWS Batch ServiceEnvironment and JobQueue. This step is completely optional; uncomment the code below to delete the resources created earlier in this notebook.

In [None]:
from utils.aws_batch_resource_management import delete_resources

# delete_resources(resource_manager, resources)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/build_and_train_models|sm-training-queues|sm-training-queues_getting_started_with_estimator.ipynb)
