## SageMaker training with your own VPC/subnet

This sample shows how to:

- Set up a VPC to run the training job.
- Run a SageMaker training job using the VPC/subnet and data from S3.
- Save artifacts to S3, which should go through the VPC endpoint created by the stack.
- Tear down the infrastructure.


**Please make sure the CIDR block in setup/cfn_sm_vpc.yaml does not conflict with your existing VPC. You can also change the FSx storage capacity (currently set to 1.2 TB) depending on the size of your datasets.**
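
One way to check for a CIDR conflict before creating the stack is to compare the template's block against the blocks already in use. The sketch below uses only Python's standard `ipaddress` module. The CIDR values are hypothetical placeholders, not the ones in `cfn_sm_vpc.yaml`; in practice the existing blocks could be read from the `CidrBlock` field returned by the EC2 `describe_vpcs` call.

```python
import ipaddress


def find_cidr_conflicts(new_cidr, existing_cidrs):
    """Return the existing CIDR blocks that overlap with new_cidr."""
    new_net = ipaddress.ip_network(new_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if ipaddress.ip_network(cidr).overlaps(new_net)
    ]


# Hypothetical values for illustration only
template_cidr = "10.0.0.0/16"
existing_vpc_cidrs = ["10.0.1.0/24", "172.31.0.0/16"]

conflicts = find_cidr_conflicts(template_cidr, existing_vpc_cidrs)
if conflicts:
    print("Conflicting CIDR blocks:", conflicts)
else:
    print("No conflicts found")
```

An empty result suggests the template's block can be used as-is; otherwise, edit the CIDR in the template before creating the stack.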

In [None]:
# Imports
import os
import time
import boto3
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch


# Inputs
region = "us-west-2"  # update this if your region is different
region_az = "us-west-2c"  # customize this as needed.
cfn_stack_name = 'vpc-training'  # cloudformation stack name

# Clients
cfn_client = boto3.client("cloudformation", region_name=region)


In [None]:
# Setup infrastructure using CloudFormation
with open("cfn_sm_vpc.yaml", "r") as f:
    template_body = f.read()
    
create_stack_response = cfn_client.create_stack(
    StackName=cfn_stack_name,
    TemplateBody=template_body,
    Parameters=[
        {
            'ParameterKey': 'AZ',
            'ParameterValue': region_az
        }
    ]
)

create_stack_response

In [None]:
# Wait for the stack to be created; this takes ~10 minutes.
stack_id = create_stack_response['StackId']

while True:
    response = cfn_client.describe_stacks(
        StackName=stack_id
    )
    status = response['Stacks'][0]['StackStatus']
    if status == "CREATE_IN_PROGRESS":
        print("Create in progress. Waiting..")
        time.sleep(30)
    elif status == "CREATE_COMPLETE":
        print("Stack created!")
        break
    else:
        # Any other status (e.g. ROLLBACK_IN_PROGRESS) means creation did not succeed
        print(f"Error creating stack (status: {status}) - check the CFN console")
        break
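
As an alternative to the manual polling loop above, boto3 ships a `stack_create_complete` waiter for CloudFormation. The sketch below wraps it in a small helper. The helper name and its default delay and attempt counts are assumptions chosen for illustration (30 s × 40 attempts is about 20 minutes); the client is passed in, so the notebook's `cfn_client` can be reused.

```python
def wait_for_stack_creation(cfn_client, stack_name, delay_seconds=30, max_attempts=40):
    """Block until the stack reaches CREATE_COMPLETE using boto3's built-in waiter.

    The waiter raises botocore.exceptions.WaiterError if the stack fails
    or the attempts run out, so a failed creation stops the notebook here.
    """
    waiter = cfn_client.get_waiter("stack_create_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )


# Usage in this notebook (requires AWS credentials):
# wait_for_stack_creation(cfn_client, stack_id)
```

Because the waiter raises on failure, later cells do not run against a stack that was rolled back.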

In [None]:
# Get stack outputs
describe_response = cfn_client.describe_stacks(
    StackName=stack_id
)

outputs = describe_response['Stacks'][0]['Outputs']

for output in outputs:
    if output['OutputKey'] == 'sg':
        sec_group = output['OutputValue']
    elif output['OutputKey'] == 'privatesubnet':
        private_subnet_id = output['OutputValue']


print("Security Group ID:", sec_group)
print("Private Subnet ID:", private_subnet_id)


Now we create a SageMaker training job that launches its cluster in this VPC/subnet.

In [None]:

# Pass container paths as hyperparameters so train.py knows where to read data and write artifacts
SM_TRAIN_DIR = "/opt/ml/input/data/train"  # path where the "train" channel (S3 here, or FSx) is available in the training container
hyperparameters = {}
hyperparameters["model-dir"] = "/opt/ml/model"
hyperparameters["training-dir"] = SM_TRAIN_DIR
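
SageMaker passes these hyperparameters to the entry point as command-line arguments (`--model-dir ... --training-dir ...`). The `train.py` used by this sample is not shown here, so the following is only a sketch of how such a script might read them. The function name is hypothetical, and the fallbacks to the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables are an assumption about how one would want defaults to behave.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker forwards as CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    parser.add_argument(
        "--training-dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    # parse_known_args tolerates any extra arguments the platform may add
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(
    ["--model-dir", "/opt/ml/model", "--training-dir", "/opt/ml/input/data/train"]
)
print("Model dir:", args.model_dir)
print("Training dir:", args.training_dir)
```

Note that `argparse` exposes `--model-dir` as `args.model_dir`, converting the dash to an underscore.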

In [None]:
# Set up the estimator and launch the training job
instance_type = "ml.m5.xlarge"
instance_count = 1
base_job_name = 'sagemaker-vpc-training-sample'

estimator = PyTorch(
    entry_point="train.py",
    source_dir=os.getcwd(),
    instance_type=instance_type,
    role=get_execution_role(),
    instance_count=instance_count,
    framework_version="1.13.1",
    py_version="py39",
    hyperparameters=hyperparameters,
    base_job_name=base_job_name,
    subnets=[private_subnet_id],
    security_group_ids=[sec_group],
    max_retry_attempts=30,
)

# Replace the placeholder below with the S3 location of your training data
estimator.fit({"train": "s3://path"})
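
To confirm that the job actually ran inside the VPC, one option is to read back the `VpcConfig` that SageMaker records for the training job. The sketch below assumes a SageMaker boto3 client is passed in; the helper name is hypothetical, and the job name in the usage comment comes from the estimator's `latest_training_job` attribute.

```python
def get_job_vpc_config(sm_client, job_name):
    """Return (subnets, security_group_ids) recorded for a training job.

    Both lists are empty if the job was not launched in a VPC.
    """
    description = sm_client.describe_training_job(TrainingJobName=job_name)
    vpc_config = description.get("VpcConfig", {})
    return vpc_config.get("Subnets", []), vpc_config.get("SecurityGroupIds", [])


# Usage in this notebook (requires AWS credentials):
# sm_client = boto3.client("sagemaker", region_name=region)
# subnets, groups = get_job_vpc_config(sm_client, estimator.latest_training_job.name)
# print(private_subnet_id in subnets, sec_group in groups)
```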


### Clean up resources

Tear down the CloudFormation stack to delete the VPC, its associated resources, and the FSx file system so that you stop incurring costs.

In [None]:
# Delete the stack

delete_response = cfn_client.delete_stack(
    StackName=stack_id
)

delete_response
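
`delete_stack` returns immediately, before the resources are actually removed. A minimal sketch for blocking until deletion finishes uses boto3's `stack_delete_complete` waiter. The helper name and the timing defaults are assumptions. Passing the stack ID rather than the stack name is the safer choice here, since a deleted stack can no longer be looked up by name.

```python
def delete_stack_and_wait(cfn_client, stack_name, delay_seconds=30, max_attempts=60):
    """Request stack deletion, then block until CloudFormation reports it complete."""
    cfn_client.delete_stack(StackName=stack_name)
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )


# Usage in this notebook (requires AWS credentials):
# delete_stack_and_wait(cfn_client, stack_id)
```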