## Setting up


If you plan to run this notebook from a local environment, you need access to an IAM role with the permissions that SageMaker requires. You can find more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [None]:
import os

os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import sagemaker
import boto3

sagemaker_session = sagemaker.Session()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session.default_bucket()}")
print(f"sagemaker session region: {sagemaker_session.boto_region_name}")
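
When running locally, it can help to confirm that the value stored in `role` is a full role ARN rather than a bare role name. The following is a minimal sketch of such a check, not part of the SageMaker SDK; the helper name `is_role_arn`, the account ID, and the list of partitions in the pattern are assumptions made for illustration.

```python
import re

# Matches IAM role ARNs in the standard, China, and GovCloud partitions.
# The partition list is an assumption of this sketch.
_ROLE_ARN_RE = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$"
)


def is_role_arn(value: str) -> bool:
    """Return True if `value` looks like a full IAM role ARN."""
    return bool(_ROLE_ARN_RE.match(value))


# Hypothetical account ID, for illustration only.
print(is_role_arn("arn:aws:iam::123456789012:role/sagemaker_execution_role"))  # True
print(is_role_arn("sagemaker_execution_role"))  # False
```

In the notebook, `is_role_arn(role)` could be called right after the `try`/`except` block that resolves the role.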

Note that SageMaker uses the latest [AWS Deep Learning Container (DLC)](https://aws.amazon.com/machine-learning/containers/) by default. If you want to use your own DLC, set the `use_ecr_image` flag to `True` and set the `ecr_image` variable. If you use FSx, set `use_fsx` to `True` and use the same `subnet_config` and `security_group_config` that were used when launching the SageMaker notebook instance.

In [None]:
use_ecr_image = False
use_fsx = False
kwargs = {}

if use_ecr_image:
    ecr_image = "<ECR_IMAGE_URI>"
    kwargs["image_uri"] = ecr_image

if use_fsx:
    subnet_config = ["<SUBNET_CONFIG_ID>"]
    security_group_config = ["<SECURITY_GROUP_CONFIG>"]
    kwargs["subnets"] = subnet_config
    kwargs["security_group_ids"] = security_group_config
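
Because the cell above ships with template placeholders such as `<ECR_IMAGE_URI>`, it is easy to flip a flag and forget to fill in the value. The sketch below shows one way to build the same `kwargs` dictionary while rejecting unfilled placeholders. The helper `build_estimator_kwargs` is hypothetical and not part of the notebook or the SDK, and the IDs in the usage lines are made up.

```python
def build_estimator_kwargs(ecr_image=None, subnets=None, security_group_ids=None):
    """Collect optional estimator arguments, rejecting unfilled placeholders."""
    kwargs = {}
    if ecr_image is not None:
        kwargs["image_uri"] = ecr_image
    if subnets or security_group_ids:
        # The notebook cell sets both values together, so this sketch
        # requires both as well.
        if not (subnets and security_group_ids):
            raise ValueError("subnets and security_group_ids must be set together")
        kwargs["subnets"] = list(subnets)
        kwargs["security_group_ids"] = list(security_group_ids)
    # Look for template placeholders such as "<ECR_IMAGE_URI>".
    for key, value in kwargs.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if item.startswith("<") and item.endswith(">"):
                raise ValueError(f"{key} still contains the placeholder {item}")
    return kwargs


# Hypothetical IDs, for illustration only.
print(build_estimator_kwargs(subnets=["subnet-0abc"], security_group_ids=["sg-0def"]))
```

Calling the helper with no arguments returns an empty dictionary, which matches the notebook's default of `use_ecr_image = False` and `use_fsx = False`.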

## Configuring Training Job

We will now set the hyperparameters and define the estimator object for our training job.  Since we are using DeepSpeed, we must provide a DeepSpeed config JSON file, which is located in the `code/` folder. 
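
For readers who have not written a DeepSpeed config before, here is a minimal sketch of what such a JSON file might contain and how it could be generated. The keys are standard DeepSpeed configuration keys, but every value is an illustrative assumption and not the configuration this project actually uses. The file name `deepspeed_config.json` is taken from the commented-out hyperparameter in the cell that follows.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only; tune them for the actual model and hardware.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}


def write_deepspeed_config(config: dict, directory: str) -> Path:
    """Write `config` as deepspeed_config.json inside `directory`."""
    path = Path(directory) / "deepspeed_config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Written to a temporary directory so the sketch does not overwrite the
# real file in code/.
with tempfile.TemporaryDirectory() as tmp:
    written = write_deepspeed_config(deepspeed_config, tmp)
    print(written.name)
    print(json.loads(written.read_text())["zero_optimization"]["stage"])
```

Passing `"./code"` as the directory would place the file where the training container can find it under `/opt/ml/code/`.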

We will use the `PyTorch` estimator class and configure it to use the `torch_distributed` distribution, which launches the training job using `torchrun`, a popular launcher for PyTorch-based distributed training jobs.
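
The estimator cell in this notebook keeps its `distribution` argument commented out, so the sketch below shows what the `torch_distributed` setting described above would look like if passed to the `PyTorch` estimator. The world-size helper is hypothetical. It assumes one process per GPU, and the GPU counts listed are for the two instance types named in this notebook.

```python
# Distribution setting for the SageMaker PyTorch estimator that requests
# the torchrun launcher.
distribution = {"torch_distributed": {"enabled": True}}

# GPUs per instance for the instance types mentioned in this notebook.
GPUS_PER_INSTANCE = {"ml.p4d.24xlarge": 8, "ml.g5.12xlarge": 4}


def expected_world_size(instance_type: str, instance_count: int) -> int:
    """Number of training processes, assuming one process per GPU."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


print(expected_world_size("ml.p4d.24xlarge", 1))  # 8
```

The dictionary would be passed as `distribution=distribution` when constructing the estimator.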

In [None]:
hyperparameters = {
    # "gradient_checkpointing": True,
    # "batch_size": 64,
    # "epochs": 2,
    # "max_steps": 50,
    # "deepspeed_config": "/opt/ml/code/deepspeed_config.json",
}

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="start.py",
    framework_version="2.3.0",
    py_version="py311",
    source_dir="./code",
    hyperparameters=hyperparameters,
    # distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    role=role,
    instance_count=1,  # default: 2
    instance_type="ml.p4d.24xlarge",  # default: ml.g5.12xlarge
    keep_alive_period_in_seconds=600,
    volume_size=500,  # EBS volume size, in GB
    # max_run=1800,
    input_mode='FastFile',  # Available options: File | Pipe | FastFile
    base_job_name="pytorch-training-fastfile",
    sagemaker_session=sagemaker_session,
    debugger_hook_config=False,
    disable_output_compression=True,
    **kwargs,
)
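
In script mode, SageMaker framework containers pass each entry in `hyperparameters` to the entry point as a `--name value` command-line argument. The notebook does not show `start.py`, so the following is only a sketch of how such a script might parse the hyperparameters listed above. The function names and all default values are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse boolean hyperparameters, which arrive as strings like 'True'."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {value!r}")


def parse_hyperparameters(argv=None) -> argparse.Namespace:
    """Parse the hyperparameters named in the notebook; defaults are illustrative."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gradient_checkpointing", type=str_to_bool, default=False)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--max_steps", type=int, default=-1)
    parser.add_argument("--deepspeed_config", type=str, default=None)
    # parse_known_args tolerates extra arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--batch_size", "32", "--gradient_checkpointing", "True"])
print(args.batch_size, args.gradient_checkpointing)
```

Using `parse_known_args` is a design choice for this sketch: it keeps the script from failing if additional arguments are passed that it does not declare.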

## Executing the training job
We can now start our training job with the `.fit()` method.

In [None]:
# Start the training job with our uploaded datasets as input channels
estimator.fit(
    {
        "surface": "s3://datalab/goldwind/surface/",
        "upper": "s3://datalab/goldwind/upper/",
        "pretrained_model": "s3://datalab/goldwind/pretrained_model/",
        "aux_data": "s3://datalab/goldwind/aux_data/",
    },
    wait=True,
)
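
Each key passed to `.fit()` becomes an input channel. Inside the training container, a channel is exposed under `/opt/ml/input/data/<channel>`, and that path is also available in the `SM_CHANNEL_<NAME>` environment variable. As a sketch of how the entry point might locate the four channels used here (the helper `channel_dirs` is hypothetical and not taken from `start.py`):

```python
import os

# Channel names used in the .fit() call of this notebook.
CHANNELS = ("surface", "upper", "pretrained_model", "aux_data")


def channel_dirs(environ=None) -> dict:
    """Map each channel name to its directory inside the training container."""
    env = os.environ if environ is None else environ
    return {
        name: env.get(f"SM_CHANNEL_{name.upper()}", f"/opt/ml/input/data/{name}")
        for name in CHANNELS
    }


# With an empty environment the helper falls back to the default paths.
print(channel_dirs({})["pretrained_model"])
```

Accepting the environment as a parameter keeps the helper easy to test outside a training container.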

## Terminate the warm pool cluster if no longer needed

Once you have finished experimenting, you can terminate the warm pool cluster to reduce billed time. Setting `KeepAlivePeriodInSeconds` to `0` on the latest training job does this.

In [None]:
sagemaker_session.update_training_job(
    estimator.latest_training_job.job_name, resource_config={"KeepAlivePeriodInSeconds": 0}
)
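
To confirm that the warm pool has actually been released, the `WarmPoolStatus` field of the training job description can be inspected. The sketch below works on a plain dictionary so that it runs without AWS access. The helper name is hypothetical, and treating the `Available` and `InUse` statuses as "still retained" is an assumption of this sketch.

```python
def warm_pool_is_active(description: dict) -> bool:
    """Return True if the job description reports a warm pool that is still retained."""
    status = description.get("WarmPoolStatus", {}).get("Status")
    return status in ("Available", "InUse")


# Hand-written example responses, for illustration only.
print(warm_pool_is_active({"WarmPoolStatus": {"Status": "Available"}}))
print(warm_pool_is_active({"WarmPoolStatus": {"Status": "Terminated"}}))
```

In the notebook, the description could be fetched with `sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=estimator.latest_training_job.job_name)` and passed to the helper.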