# Training Job in Internet-free Mode

If you want to isolate your training data and training container from the rest of the Internet, create the training job in a private subnet. A private subnet is a subnet in your VPC without a route to an Internet Gateway. By default, this means that no inbound calls from the Internet can reach your container and your container cannot make outbound calls to the Internet. If the training container needs to access your S3 resources, you must **explicitly** add a VPC endpoint and attach it to the route table of your private subnet to allow traffic to the data in your S3 bucket.

In this notebook, we will walk through an example of creating such a training job. We will

- Build a simple training image
- Set up a VPC
- Set up a private subnet in the VPC
- Set up a security group in the VPC
- Create a training job in your private subnet and security group and watch it fail (because it cannot access your S3 resource)
- Add a VPC endpoint to allow traffic to S3
- Create another training job in your private subnet and watch it succeed

If you are not familiar with VPC security configuration, the following materials can help:
- [Security in Amazon Virtual Private Cloud](https://docs.aws.amazon.com/vpc/latest/userguide/security.html)
- [Training and Inference Containers in Internet-Free Mode](https://docs.aws.amazon.com/sagemaker/latest/dg/mkt-algo-model-internet-free.html)

It's okay if you don't understand everything in the official docs above; the code samples in this notebook will help you grasp those concepts.

In [None]:
# import libraries
import boto3
import pprint
import datetime
import time

pp = pprint.PrettyPrinter(indent=1)

## Permissions

If you are running this notebook on an EC2 instance with an IAM user (you) as the default profile, you will need policies that allow you to create a VPC, subnet, security group, route table, and VPC endpoint.

Likewise, if you are running this notebook on a SageMaker notebook instance or in Studio, the service role needs those permissions as well.
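
As a rough illustration of the permissions described above, the sketch below builds an IAM policy document covering the EC2 calls made in this notebook. The action list is an assumption derived from the API calls used in the cells that follow. It is not an official or least-privilege policy, and the cleanup cell at the end would additionally need the matching `Delete*` actions.

```python
import json

# Hypothetical policy covering the EC2 calls used in this notebook.
# "Resource": "*" is only for illustration; scope it down for real use.
vpc_setup_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateVpc",
                "ec2:CreateSubnet",
                "ec2:CreateSecurityGroup",
                "ec2:CreateRouteTable",
                "ec2:AssociateRouteTable",
                "ec2:CreateVpcEndpoint",
                "ec2:CreateTags",
                "ec2:DescribeVpcs",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeVpcEndpointServices",
            ],
            "Resource": "*",
        }
    ],
}

# IAM expects the policy as a JSON string
policy_json = json.dumps(vpc_setup_policy, indent=2)
print(policy_json)
```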


## Build a training image

We follow the same procedure for building a training image as in [this notebook](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create-training-job.ipynb). We will refer to this image as `example-image`. Please go through that notebook if you are not familiar with the `CreateTrainingJob` API.

In [None]:
# create a repo in your ECR 

ecr = boto3.client('ecr')

try:
    # The repository might already exist
    # in your ECR
    cr_res = ecr.create_repository(
        repositoryName='example-image')
    pp.pprint(cr_res)
except Exception as e:
    print(e)

In [None]:
%%sh
# build the image
cd container/

# tag it as example-image:latest
docker build -t example-image:latest .
    
# test the container
python local_test/test_container.py

account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com

# Give docker your ECR login password
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account

# Fullname of the repo
fullname=$ecr_account/example-image:latest

# Tag the image with the fullname
docker tag example-image:latest $fullname

# Push to ECR
docker push $fullname

## Create a VPC

You can think of Amazon VPC as the cloud counterpart of a traditional network in a data center.

The following are the key concepts for VPCs: 
* Virtual private cloud (VPC): A virtual network dedicated to your AWS account.
* Subnet: A range of IP addresses in your VPC.
* Route table: A set of rules, called routes, that are used to determine where network traffic is directed.
* Internet gateway: A gateway that you attach to your VPC to enable communication between resources in your VPC and the internet.
* VPC endpoint: Enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC do not require public IP addresses to communicate with resources in the service. Traffic between your VPC and the other service does not leave the Amazon network. For more information, see [AWS PrivateLink and VPC endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html).
* CIDR block: Classless Inter-Domain Routing. An internet protocol address allocation and route aggregation methodology. For more information, see [Classless Inter-Domain Routing](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_notation) in Wikipedia.

All of these concepts are explained in the [official docs](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html). 
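
To make the CIDR notation concrete, here is a small sketch using Python's standard `ipaddress` module on the two blocks used later in this notebook (`10.0.0.0/20` for the VPC and `10.0.0.0/28` for the subnet). It only does arithmetic on the address ranges and makes no AWS calls.

```python
import ipaddress

vpc_block = ipaddress.ip_network("10.0.0.0/20")
subnet_block = ipaddress.ip_network("10.0.0.0/28")

# A /N block contains 2 ** (32 - N) IPv4 addresses
print(vpc_block.num_addresses)     # 2 ** (32 - 20)
print(subnet_block.num_addresses)  # 2 ** (32 - 28)

# A subnet's CIDR block must fall inside the VPC's CIDR block
print(subnet_block.subnet_of(vpc_block))

# AWS reserves 5 addresses in every subnet (the first four and the last one)
usable_in_subnet = subnet_block.num_addresses - 5
print(usable_in_subnet)
```

A `/28` is the smallest subnet size AWS allows, so it leaves only a handful of usable addresses. That is enough for the single-instance training job in this notebook, but a larger block would be needed for bigger jobs.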


In [None]:
# Create a VPC in your default region

ec2 = boto3.client('ec2')

vpc_res = ec2.create_vpc(
    CidrBlock='10.0.0.0/20', # 2^(32 - 20) = 4096 private ipv4 addrs
    AmazonProvidedIpv6CidrBlock=False,
    DryRun=False,
    TagSpecifications=[
        {
            'ResourceType': 'vpc', 
            'Tags':[
                {
                    'Key': 'Name',
                    'Value': 'hello-world'
                },
            ]
        },
    ]
)

pp.pprint(vpc_res)

In [None]:
# inspect this VPC in details

vpc_des = ec2.describe_vpcs(
    VpcIds=[vpc_res['Vpc']['VpcId']]
    )
pp.pprint(vpc_des['Vpcs'])

In [None]:
# create a subnet in the VPC (the route table is associated later)

def get_first_availability_zone():
    region_name = boto3.Session().region_name
    avz_res = ec2.describe_availability_zones(
        Filters=[
            {
                "Name": "region-name",
                "Values": [region_name]
            }
        ],
        AllAvailabilityZones=True,
    )
    
    # return the first standard availability zone (skips other zone types)
    for az in avz_res['AvailabilityZones']:
        if az['ZoneType'] == 'availability-zone':
            return az
    return None
    
def create_subnet(vpc_id, cidr_block, dry_run):
    """Create a subnet in the first availability zone in your current region"""
    az = get_first_availability_zone()
    if az is not None:
        subnet_res = ec2.create_subnet(
            AvailabilityZone = az['ZoneName'], #'us-west-2a',
            VpcId = vpc_id,                    # vpc_res['Vpc']['VpcId'],
            CidrBlock= cidr_block,             #'100.68.0.18/18',
            DryRun=dry_run                     # True,
        )
        return subnet_res
    else:
        raise RuntimeError("No availability zone found in the current region")
        
sn_res = create_subnet(
    vpc_id=vpc_res['Vpc']['VpcId'],
    cidr_block='10.0.0.0/28', # 2^(32 - 28) = 16 private ipv4 addrs in this subnet
    dry_run=False)

pp.pprint(sn_res)

In [None]:
# create a security group

sg_res = ec2.create_security_group(
    Description='security group for SageMaker instances',
    GroupName='sagemaker-private',
    VpcId=vpc_res['Vpc']['VpcId'],
    TagSpecifications=[
        {
            "ResourceType": "security-group",
            "Tags" : [
                {   
                    "Key": "Service", # Tag the sec gp by service, this can be used to filter sec gps
                    "Value": "SageMaker" 
                }
            ]
        }
    ]

)

pp.pprint(sg_res)

In [None]:
# inspect the security group in detail

ec2.describe_security_groups(
    GroupIds=[
        sg_res['GroupId']
    ]
)
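
A newly created security group has no inbound rules and allows all outbound traffic. For a distributed training job (more than one instance), the instances need to reach each other, which is commonly done with a self-referencing inbound rule. The sketch below only builds the arguments for `authorize_security_group_ingress`. The helper name `self_ingress_rule` and the group id are hypothetical, and the single-instance job in this notebook does not need this rule.

```python
def self_ingress_rule(group_id):
    """Build kwargs that allow all traffic between members of one security group."""
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "-1",  # -1 means all protocols and ports
                "UserIdGroupPairs": [{"GroupId": group_id}],
            }
        ],
    }

rule = self_ingress_rule("sg-0123456789abcdef0")  # placeholder group id
print(rule)

# With the objects from this notebook, the call would look like:
# ec2.authorize_security_group_ingress(**self_ingress_rule(sg_res['GroupId']))
```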

## Create a training job
Now let's create a training job within the private subnet you just created. First, let's download some helper functions for creating a service role for SageMaker.

In [None]:
%%bash
# remove any previously downloaded copy of the helper module
rm -f iam_helpers.py

wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py

In [None]:
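# set up service role for SageMaker
# create_execution_role is assumed to be provided by the iam_helpers.py file downloaded above
from iam_helpers import create_execution_role

iam = boto3.client('iam')
sts = boto3.client('sts')
caller = sts.get_caller_identity()

if ':user/' in caller['Arn']: # as IAM user
    # either paste in an existing role_arn, or create a new role and attach
    # AmazonSageMakerFullAccess
    role_name = 'example-sm'
    role_arn = create_execution_role(role_name=role_name)['Role']['Arn']
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
    )
elif 'assumed-role' in caller['Arn']: # on SageMaker infra
    # caller['Arn'] is an STS ARN (arn:aws:sts::<account>:assumed-role/<role>/<session>),
    # but CreateTrainingJob expects the IAM role ARN, so look the role up by name
    # (this needs the iam:GetRole permission)
    assumed_role_name = caller['Arn'].split('/')[1]
    role_arn = iam.get_role(RoleName=assumed_role_name)['Role']['Arn']
else:
    print("I assume you are on an EC2 instance launched with an IAM role")
    role_arn = caller['Arn']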
    

In [None]:
# some helpers
def current_time():
    """Return a timestamp such as 2021-01-31-12-00-00 for use in resource names"""
    return datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

def account_id():
    return boto3.client('sts').get_caller_identity()['Account']


To make this notebook self-contained, we will create a bucket and upload some data to it to pass to the training container, as we did in the [basic create training job notebook](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create-training-job.ipynb). You don't have to do so: if you already have a bucket that the SageMaker service can access (i.e. a bucket whose name contains `sagemaker`, see the `AmazonSageMakerFullAccess` policy), you can use that bucket instead.


In [None]:
# create a bucket for SageMaker in your region if it does not exist

def create_bucket():
    """Create an S3 bucket that is intended to be used for short term"""
    bucket = f"sagemaker-{current_time()}"

    region_name = boto3.Session().region_name
    s3_client = boto3.client('s3')
    if region_name == 'us-east-1':
        # us-east-1 is the default location and rejects an explicit LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={
                'LocationConstraint': region_name
            })
    return bucket

# replace it with your own SageMaker-accessible bucket 
# if you don't want to create a new one

bucket = create_bucket()

In [None]:
# upload some mock data to your bucket
import os

s3 = boto3.client('s3')
input_prefix = 'input_data'

for fname in os.listdir('data'):
    with open(os.path.join('data', fname), 'rb') as f:
        # separate prefix and file name so the keys sit under input_data/
        key = input_prefix + '/' + fname
        s3.upload_fileobj(f, bucket, key)

In [None]:
sm_cli = boto3.client('sagemaker')


# name training job
training_job_name = 'example-training-job-{}'.format(current_time())

data_path = "s3://" + bucket + '/' + input_prefix

# location that SageMaker saves the model artifacts
output_prefix = 'output/'
output_path = "s3://" + bucket + '/' + output_prefix

# ECR URI of your image
region = boto3.Session().region_name
account = account_id()
image_uri = "{}.dkr.ecr.{}.amazonaws.com/example-image:latest".format(account, region)

algorithm_specification = {
    'TrainingImage': image_uri,
    'TrainingInputMode': 'File',
}


input_data_config = [
    {
        'ChannelName': 'train',
        'DataSource':{
            'S3DataSource':{
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    },
    {
        'ChannelName': 'test',
        'DataSource':{
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    }
]


    
vpc_config = {
    # security groups need to be configured to communicate 
    # with each other for distributed training job
    'SecurityGroupIds' : [
         sg_res['GroupId']
    ],
    'Subnets': [
        sn_res['Subnet']['SubnetId']
    ]
}

output_data_config = {
    'S3OutputPath': output_path
}

resource_config = {
    'InstanceType': 'ml.m5.large',
    'InstanceCount':1,
    'VolumeSizeInGB':5
}

stopping_condition={
    'MaxRuntimeInSeconds':120,
}

enable_network_isolation=True

In [None]:
ct_res = sm_cli.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algorithm_specification,
    RoleArn=role_arn,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    VpcConfig=vpc_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition,
    EnableNetworkIsolation=enable_network_isolation,
    EnableManagedSpotTraining=False,
)

In [None]:
# watch the training job fail
stopped = False
while not stopped:
    tj_state = sm_cli.describe_training_job(TrainingJobName=training_job_name)
    
    if tj_state['TrainingJobStatus'] in ['Completed', 'Stopped', 'Failed']:
        stopped=True
    else:
        print("Training in progress")
        time.sleep(30)

if tj_state['TrainingJobStatus'] == 'Failed':
    print("Training job failed ")
    print("Failed Reason: {}".format(tj_state['FailureReason']))
else:
    print("Training job completed")

This is the failure message we expect to see. The subnet you created is isolated from the Internet and you have not created any mechanism for it to access your S3 resource. Therefore, SageMaker failed to download data from there. The error message also suggests ways to fix it: either add a route to an S3 VPC endpoint or add a NAT device. We will explore the first option.
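
The polling loop used to watch the job can be factored into a small helper, sketched below. The names `wait_for_training_job` and `FakeSageMakerClient` are hypothetical. The fake client only stands in for `boto3.client('sagemaker')` so the helper can be exercised without AWS access, and its failure reason is a placeholder, not real SageMaker output.

```python
import time

TERMINAL_STATES = ("Completed", "Stopped", "Failed")

def wait_for_training_job(sm_client, job_name, poll_seconds=30, max_polls=120):
    """Poll DescribeTrainingJob until the job reaches a terminal state.

    Returns the last describe_training_job response, or raises TimeoutError
    if the job is still running after max_polls polls.
    """
    for _ in range(max_polls):
        state = sm_client.describe_training_job(TrainingJobName=job_name)
        if state["TrainingJobStatus"] in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


class FakeSageMakerClient:
    """Stand-in for boto3.client('sagemaker'), used only to exercise the helper."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def describe_training_job(self, TrainingJobName):
        return next(self._responses)


fake = FakeSageMakerClient([
    {"TrainingJobStatus": "InProgress"},
    {"TrainingJobStatus": "Failed", "FailureReason": "example reason"},
])
result = wait_for_training_job(fake, "example-training-job", poll_seconds=0)
print(result["TrainingJobStatus"], result.get("FailureReason"))
```

With the real client from this notebook, the call would be `wait_for_training_job(sm_cli, training_job_name)`. boto3 also provides a built-in `training_job_completed_or_stopped` waiter on the SageMaker client that could serve the same purpose.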

## Add a VPC endpoint

A VPC endpoint enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC do not require public IP addresses to communicate with resources in the service. **Traffic between your VPC and the other service does not leave the Amazon network**. For more information, see [AWS PrivateLink and VPC endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html). 

There are three types of VPC endpoints at the time this notebook is published:

A **gateway** endpoint serves as a target for a route in your route table for traffic destined for the AWS service. You can specify an endpoint policy to attach to the endpoint, which will control access to the service from your VPC. You can also specify the VPC route tables that use the endpoint.

An **interface** endpoint is a network interface in your subnet that serves as an endpoint for communicating with the specified service. You can specify the subnets in which to create an endpoint, and the security groups to associate with the endpoint network interface.

A **GatewayLoadBalancer** endpoint is a network interface in your subnet that serves as an endpoint for communicating with a Gateway Load Balancer that you've configured as a VPC endpoint service.

Both gateway and interface endpoints are viable options here. We will use a gateway endpoint in this notebook: we create a route table, associate it with the private subnet, and attach the endpoint to that route table.
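
The gateway endpoint description above mentions that an endpoint policy can control access to the service from your VPC. As a minimal sketch, assuming you want to limit the endpoint to a single bucket, the policy could be built as follows. The helper name `s3_endpoint_policy`, the bucket name, and the chosen actions are illustrative assumptions. A policy this restrictive might block other S3 access that a job depends on, so test it before relying on it.

```python
import json

def s3_endpoint_policy(bucket_name):
    """Build an endpoint policy that only allows access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself (ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
                ],
            }
        ],
    }

# placeholder bucket name; in this notebook you would pass `bucket`
policy_document = json.dumps(s3_endpoint_policy("sagemaker-example-bucket"))
print(policy_document)

# The JSON string could then be supplied through the PolicyDocument
# parameter of ec2.create_vpc_endpoint when the endpoint is created.
```

If no policy is supplied, the endpoint gets a default policy that allows full access to the service, which is what the rest of this notebook relies on.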

In [None]:
# Check out service name for S3

services = ec2.describe_vpc_endpoint_services()
for s in services['ServiceNames']:
    if 's3' in s:
        print(s)

In [None]:
# Create a route table
rt_res = ec2.create_route_table(
    VpcId=vpc_res['Vpc']['VpcId'],
    TagSpecifications=[
        {
            "ResourceType": 'route-table',
            'Tags': [
                {
                    'Key': 'Service',
                    'Value': 'SageMaker'
                }
            ]
        }
    ]
)

pp.pprint(rt_res)

In [None]:
# Associate the route table with the subnet

ass_rt_res = ec2.associate_route_table(
    RouteTableId=rt_res['RouteTable']['RouteTableId'],
    SubnetId=sn_res['Subnet']['SubnetId']
)

pp.pprint(ass_rt_res)

In [None]:
# Create a gateway endpoint 

region_name = boto3.Session().region_name

iep_res = ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId=vpc_res['Vpc']['VpcId'],
    ServiceName=f'com.amazonaws.{region_name}.s3',
    RouteTableIds=[rt_res['RouteTable']['RouteTableId']],
    
    # you don't need to add a tag, it is only
    # used as a convenient way to filter through your 
    # endpoints in the future 
    TagSpecifications=[    
        {
            'ResourceType': 'vpc-endpoint',
            'Tags' : [
                {
                    'Key' : 'Service',
                    'Value': 'SageMaker'
                }
            ]
        }
    ]
    
)

pp.pprint(iep_res)

Now you have added a gateway endpoint to the route table of the subnet. This endpoint allows the subnet to talk to your S3 bucket **privately**. The traffic between the subnet and your S3 bucket does not leave the AWS network. Let's create another training job to verify that the training container can access the data in your S3 bucket.
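
One way to confirm the setup is to look at the routes in the route table: a private subnet should have no route targeting an internet gateway (`igw-...`), and after adding the gateway endpoint there should be a route targeting it (`vpce-...`). The sketch below runs that check on a hand-written sample shaped like one entry of a `describe_route_tables` response. The helper name `summarize_routes` and all ids in the sample are hypothetical.

```python
def summarize_routes(route_table):
    """Report which kinds of targets appear in one route table description."""
    targets = [route.get("GatewayId", "") for route in route_table.get("Routes", [])]
    return {
        "has_internet_gateway": any(t.startswith("igw-") for t in targets),
        "has_vpc_endpoint": any(t.startswith("vpce-") for t in targets),
    }

# Hand-written sample in the shape of one entry of
# ec2.describe_route_tables(...)['RouteTables']; the ids are placeholders.
sample_route_table = {
    "RouteTableId": "rtb-0123456789abcdef0",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/20", "GatewayId": "local"},
        {"DestinationPrefixListId": "pl-00000000", "GatewayId": "vpce-0123456789abcdef0"},
    ],
}

summary = summarize_routes(sample_route_table)
print(summary)
```

With the objects from this notebook, the input would come from `ec2.describe_route_tables(RouteTableIds=[rt_res['RouteTable']['RouteTableId']])['RouteTables'][0]`.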

In [None]:
training_job_name = 'example-training-job-{}'.format(current_time())

ct_res = sm_cli.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algorithm_specification,
    RoleArn=role_arn,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    VpcConfig=vpc_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition,
    EnableNetworkIsolation=enable_network_isolation,
    EnableManagedSpotTraining=False,
)


In [None]:
# watch it succeed

stopped = False
while not stopped:
    tj_state = sm_cli.describe_training_job(TrainingJobName=training_job_name)
    
    if tj_state['TrainingJobStatus'] in ['Completed', 'Stopped', 'Failed']:
        stopped=True
    else:
        print("Training in progress")
        time.sleep(30)

if tj_state['TrainingJobStatus'] == 'Failed':
    print("Training job failed ")
    print("Failed Reason: {}".format(tj_state['FailureReason']))
else:
    print("Training job completed")

## Review and clean up

Let's review what you did in this notebook: you have created 
- a VPC
- a subnet inside the VPC
- a security group inside the VPC

The VPC is isolated from the Internet, because you did not add an Internet Gateway to it. 
You created a training job in the subnet. The traffic in and out of the SageMaker instance running your training container is controlled by the security group permissions. You verified that this training job failed, because SageMaker could not download data from your S3 bucket.

Next, you added 
- a route table to your subnet
- an S3 Gateway Endpoint to the route table

Then you verified that once you added the S3 Gateway Endpoint to your VPC, the same training job succeeded.

In [None]:
# delete the entire VPC and its associated resources 
# adapted from https://gist.github.com/alberto-morales/b6d7719763f483185db27289d51f8ec5

def vpc_cleanup(vpcid):
    """Remove VPC from AWS
    Set your region/access-key/secret-key from env variables or boto config.
    :param vpcid: id of vpc to delete
    """
    if not vpcid:
        return
    print('Removing VPC ({}) from AWS'.format(vpcid))
    ec2 = boto3.resource('ec2')
    ec2client = ec2.meta.client
    vpc = ec2.Vpc(vpcid)
    # reset the vpc to the default dhcp_options, which detaches any custom set
    dhcp_options_default = ec2.DhcpOptions('default')
    if dhcp_options_default:
        dhcp_options_default.associate_with_vpc(
            VpcId=vpc.id
        )
    # detach and delete all gateways associated with the vpc
    for gw in vpc.internet_gateways.all():
        vpc.detach_internet_gateway(InternetGatewayId=gw.id)
        gw.delete()

    # delete any instances
    for subnet in vpc.subnets.all():
        for instance in subnet.instances.all():
            instance.terminate()
    
    # delete all subnets
    for subnet in vpc.subnets.all():
        for interface in subnet.network_interfaces.all():
            interface.delete()
        subnet.delete()
    
    # delete all route table associations
    for rt in vpc.route_tables.all():
        for rta in rt.associations:
            if not rta.main:
                rta.delete()
            
        try:
            rt.delete()
        except Exception as e:
            pass
    
    # delete our endpoints
    for ep in ec2client.describe_vpc_endpoints(
            Filters=[{
                'Name': 'vpc-id',
                'Values': [vpcid]
            }])['VpcEndpoints']:
        ec2client.delete_vpc_endpoints(VpcEndpointIds=[ep['VpcEndpointId']])
    # delete our security groups
    for sg in vpc.security_groups.all():
        if sg.group_name != 'default':
            sg.delete()
    # delete any vpc peering connections
    for vpcpeer in ec2client.describe_vpc_peering_connections(
            Filters=[{
                'Name': 'requester-vpc-info.vpc-id',
                'Values': [vpcid]
            }])['VpcPeeringConnections']:
        ec2.VpcPeeringConnection(vpcpeer['VpcPeeringConnectionId']).delete()
    # delete non-default network acls
    for netacl in vpc.network_acls.all():
        if not netacl.is_default:
            netacl.delete()
    # delete network interfaces
 
    # finally, delete the vpc
    ec2client.delete_vpc(VpcId=vpcid)
    return

vpc_cleanup(vpc_res["Vpc"]['VpcId'])