# Training Job in Internet-free Mode

If you want to isolate your training data and training container from the rest of the Internet, then you should create the training job in a private subnet. A private subnet is a subnet in your VPC without a route to Internet Gateway. This means, by default, no inbounding calls to your container from the Internet is possible and your container cannot make outbounding calls to the Internet. If you need the training container to access your S3 resource, you need to **explicitly** add a VPC endpoint and attach it to the route table of your private subnet to allow traffic to your data in S3 bucket. 

In this notebook, we will walk through an example of creating such a training job. We will

- Build a simple training image
- Set up a VPC
- Set up a private subnet in the VPC
- Set up a security group in the VPC
- Create a training job in your private subnet && security group and watch it to fail (because it cannot access your S3 resource)
- Add a VPC endpoint to allow traffic to S3
- Create another training job in your private subnet and watch it to succeeed 

If you are not familiar with VPC security configuration, the following materials can help you
- [Security in Amazon Virtual Private Cloud](https://docs.aws.amazon.com/vpc/latest/userguide/security.html)
- [Training and Inference Containers in Internet-Free Mode](https://docs.aws.amazon.com/sagemaker/latest/dg/mkt-algo-model-internet-free.html)

It's okay if you don't understand everything from the official docs above, the code samples you will see in this notebook will help you grasp those concepts. 

In [1]:
# import libraries
import boto3
import pprint
import datetime
import time

pp = pprint.PrettyPrinter(indent=1)

## Permissions

If you are running this notebook on an EC2 instance with an IAM user (you) as the default profile, then you will need policies to allow you to create VPC / Subnet / Secruity group / VPC endpoint.

Likewise, if you are running this notebook on a SageMaker notebook instance or Studio, the service role needs to have those permission as well. 

First, get some helper functions for creating service role. 

In [17]:
%%bash
file=$(ls . | grep iam_helpers.py)

if [ -f "$file" ]
then
    rm $file
fi

wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py


--2021-03-19 03:36:31--  https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3350 (3.3K) [text/plain]
Saving to: ‘iam_helpers.py’

     0K ...                                                   100% 55.9M=0s

2021-03-19 03:36:31 (55.9 MB/s) - ‘iam_helpers.py’ saved [3350/3350]



In [82]:
iam = boto3.client('iam')
# get the ARN of the user
caller_arn = boto3.client('sts').get_caller_identity()['Arn']
import json

def create_execution_role(role_name="basic-role"):
    """Create an service role to procure services on your behalf

    
    Args:
        role_name (str): name of the role
    
    Return:
        dict
    """    
    # if the role already exists, delete it
    # Note: you need to make sure the role is not
    # used in production, because the code below
    # will delete the role and create a new one
    
    def find_role(role_res, role_name):
        for r in role_res['Roles']:
            if r['RoleName'] == role_name:
                return True
        return False

    def delete_role(role_res, role_name):
        if find_role(role_res, role_name):
            role = boto3.resource('iam').Role(role_name)
            for p in role.attached_policies.all():
                role.detach_policy(PolicyArn=p.arn)

            iam.delete_role(RoleName=role.name)
            print('role deleted')
            return

    role_res = iam.list_roles(MaxItems=10)
    find_role(role_res, role_name)

    while 'Marker' in role_res:
        role_res = iam.list_roles(MaxItems=10, Marker=role_res['Marker'])
        delete_role(role_res, role_name)
    
    # Trust policy document
    trust_relation_policy_doc = {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": caller_arn, # Allow caller to take this role
            "Service": [
              "sagemaker.amazonaws.com" # Allow SageMaker to take the role
            ],
          },
          "Action": "sts:AssumeRole",
        }
      ]
    }
    
    
    res = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_relation_policy_doc)
    )
    return res


create_execution_role('example-role-to-be-deleted')


role deleted


{'Role': {'Path': '/',
  'RoleName': 'example-role-to-be-deleted',
  'RoleId': 'AROA2ATYEUMKCOZEEXRED',
  'Arn': 'arn:aws:iam::688520471316:role/example-role-to-be-deleted',
  'CreateDate': datetime.datetime(2021, 3, 19, 20, 34, 4, tzinfo=tzlocal()),
  'AssumeRolePolicyDocument': {'Version': '2012-10-17',
   'Statement': [{'Effect': 'Allow',
     'Principal': {'AWS': 'arn:aws:iam::688520471316:user/hongshan',
      'Service': ['sagemaker.amazonaws.com']},
     'Action': 'sts:AssumeRole'}]}},
 'ResponseMetadata': {'RequestId': 'ce5d3309-9cd4-4334-ad5a-f943cb40e951',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ce5d3309-9cd4-4334-ad5a-f943cb40e951',
   'content-type': 'text/xml',
   'content-length': '893',
   'date': 'Fri, 19 Mar 2021 20:34:03 GMT'},
  'RetryAttempts': 0}}

In [None]:

def find_role(role_res, role_name):
    for r in role_res['Roles']:
        if r['RoleName'] == role_name:
            return True
    return False

def delete_role(role_name):
    print('role deleted')

role_res = iam.list_roles(MaxItems=10)

while True:
    if 'Marker' in role_res:
        m = role_res['Marker']
        nx_role_res = iam.list_roles(MaxItems=10, Marker=m)
    else:
        if find_role(nx_rol_res, role_name):
            delete_role(role_name)
            break
    
    if find_role(role_res, role_name) or find_role(nx_role_res, role_name):
        delete_role(role_name)
        break
    
    role_res = nx_role_res
    
    


In [71]:
iam.describe_roles

AttributeError: 'IAM' object has no attribute 'describe_roles'

In [43]:
# Set up service role for SageMaker

#from iam_helpers import create_execution_role

sts = boto3.client('sts')
caller = sts.get_caller_identity()

if ':user/' in caller['Arn']: # as IAM user
    # either paste in a role_arn with or create a new one
    role_arn = create_execution_role(role_name='sm')['Role']['Arn']
    
elif 'assumed-role' in caller['Arn']: # on SageMaker infra
    role_arn = caller['Arn']
else:
    print("I assume you are on an EC2 instance launched with an IAM role")
    role_arn = caller['Arn']
    

sm
A2I-rekognition
A2ISageMaker-ExecutionRole-20201030T104154
Admin
AmazonBraketServiceSageMakerNotebookRole-20210218T233116
AmazonSageMaker-ExecutionRole-20191017T170424
AmazonSageMaker-ExecutionRole-20200214T101028
AmazonSageMaker-ExecutionRole-20200611T110452
AmazonSageMaker-ExecutionRole-20200618T173180
AmazonSageMaker-ExecutionRole-20200717T133757
AmazonSageMaker-ExecutionRole-20200731T151604
AmazonSageMaker-ExecutionRole-20200928T154277
AmazonSageMaker-ExecutionRole-20201014T161125
AmazonSageMaker-ExecutionRole-20201016T103135
AmazonSageMaker-ExecutionRole-20201019T140004
AmazonSageMaker-ExecutionRole-20201019T140196
AmazonSageMaker-ExecutionRole-20201023T113029
AmazonSageMaker-ExecutionRole-20201026T093069
AmazonSageMaker-ExecutionRole-20210208T141346
AmazonSageMaker-ExecutionRole-20210302T160008
AmazonSageMakerServiceCatalogProductsLaunchRole
AmazonSageMakerServiceCatalogProductsUseRole
aws-elasticbeanstalk-ec2-role
aws-elasticbeanstalk-service-role
AWSChatbot-role
AWSChatbot-T

EntityAlreadyExistsException: An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name sm already exists.

In [21]:
role_arn

NameError: name 'role_arn' is not defined

## Build a training image

We follow the same procedure for building a training image as in [this notebook](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create-training-job.ipynb). We will refer to this image as `example-image`. Please go through that notebook if you are not familiar with `CreateTrainingJob` API.

In [3]:
# create a repo in your ECR 

ecr = boto3.client('ecr')

try:
    # The repository might already exist
    # in your ECR
    cr_res = ecr.create_repository(
        repositoryName='example-image')
    pp.pprint(cr_res)
except Exception as e:
    print(e)

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'example-image' already exists in the registry with id '688520471316'


In [4]:
%%sh
# build the image
cd container/

# tag it as example-image:latest
docker build -t example-image:latest .
    
# test the container
python local_test/test_container.py

account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com

# Give docker your ECR login password
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account

# Fullname of the repo
fullname=$ecr_account/example-image:latest

# Tag the image with the fullname
docker tag example-image:latest $fullname

# Push to ECR
docker push $fullname

Sending build context to Docker daemon   16.9kB
Step 1/4 : FROM continuumio/miniconda:latest
 ---> b8ea69b5c41c
Step 2/4 : RUN mkdir -p /opt/ml
 ---> Using cache
 ---> a170cc3fed03
Step 3/4 : COPY train.py /usr/bin/train
 ---> Using cache
 ---> 4e19c1cb2076
Step 4/4 : RUN chmod +x /usr/bin/train
 ---> Using cache
 ---> 38609e5aaa0d
Successfully built 38609e5aaa0d
Successfully tagged example-image:latest
== Files in train channel ==
== Files in the test channel ==
== Saving model checkpoint ==
== training completed ==

Login Succeeded
The push refers to repository [688520471316.dkr.ecr.us-west-2.amazonaws.com/example-image]
967579d07803: Preparing
5c8c2d1bcfe6: Preparing
88674bdc7fd9: Preparing
78db50750faa: Preparing
805309d6b0e2: Preparing
2db44bce66cd: Preparing
2db44bce66cd: Waiting
805309d6b0e2: Layer already exists
967579d07803: Layer already exists
5c8c2d1bcfe6: Layer already exists
88674bdc7fd9: Layer already exists
78db50750faa: Layer already exists
2db44bce66cd: Layer already 

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



## Create a VPC

You can think of Amazon VPC as the traditional network in a data center in the cloud. 

The following are the key concepts for VPCs: 
* Virtual private cloud (VPC) — A virtual network dedicated to your AWS account.
* Subnet — A range of IP addresses in your VPC.
* Route table — A set of rules, called routes, that are used to determine where network traffic is directed.
* Internet gateway — A gateway that you attach to your VPC to enable communication between resources in your VPC and the internet.
* VPC endpoint — Enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC do not require public IP addresses to communicate with resources in the service. Traffic between your VPC and the other service does not leave the Amazon network. For more information, see AWS PrivateLink and VPC endpoints.
* CIDR block —Classless Inter-Domain Routing. An internet protocol address allocation and route aggregation methodology. For more information, see [Classless Inter-Domain Routing](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_notation) in Wikipedia.

All of these concepts are explained in the [official docs](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html). 


In [5]:
# Create a VPC in your default region

ec2 = boto3.client('ec2')

vpc_res = ec2.create_vpc(
    CidrBlock='10.0.0.0/20', # 2^(32 - 20) = 4906 private ipv4 addrs
    AmazonProvidedIpv6CidrBlock=False,
    DryRun=False,
    TagSpecifications=[
        {
            'ResourceType': 'vpc', 
            'Tags':[
                {
                    'Key': 'Name',
                    'Value': 'hello-world'
                },
            ]
        },
    ]
)

pp.pprint(vpc_res)

{'ResponseMetadata': {'HTTPHeaders': {'cache-control': 'no-cache, no-store',
                                      'content-length': '1054',
                                      'content-type': 'text/xml;charset=UTF-8',
                                      'date': 'Fri, 19 Mar 2021 01:15:46 GMT',
                                      'server': 'AmazonEC2',
                                      'strict-transport-security': 'max-age=31536000; '
                                                                   'includeSubDomains',
                                      'x-amzn-requestid': '9fd072ec-01be-4ecb-b9e3-2e9a317d3e67'},
                      'HTTPStatusCode': 200,
                      'RequestId': '9fd072ec-01be-4ecb-b9e3-2e9a317d3e67',
                      'RetryAttempts': 0},
 'Vpc': {'CidrBlock': '10.0.0.0/20',
         'CidrBlockAssociationSet': [{'AssociationId': 'vpc-cidr-assoc-0b6e7b82620de322c',
                                      'CidrBlock': '10.0.0.0/20',
       

In [6]:
# inspect this VPC in details

vpc_des = ec2.describe_vpcs(
    VpcIds=[vpc_res['Vpc']['VpcId']]
    )
pp.pprint(vpc_des['Vpcs'])

[{'CidrBlock': '10.0.0.0/20',
  'CidrBlockAssociationSet': [{'AssociationId': 'vpc-cidr-assoc-0b6e7b82620de322c',
                               'CidrBlock': '10.0.0.0/20',
                               'CidrBlockState': {'State': 'associated'}}],
  'DhcpOptionsId': 'dopt-8b1f91ef',
  'InstanceTenancy': 'default',
  'IsDefault': False,
  'OwnerId': '688520471316',
  'State': 'available',
  'Tags': [{'Key': 'Name', 'Value': 'hello-world'}],
  'VpcId': 'vpc-0b52a6097c86ebdc3'}]


In [7]:
# create subnet and associate it with route table

def get_first_availability_zone():
    region_name = boto3.Session().region_name
    avz_res = ec2.describe_availability_zones(
        Filters=[
            {
                "Name": "region-name",
                "Values": [region_name]
            }
        ],
        AllAvailabilityZones=True,
    )
    
    for az in avz_res['AvailabilityZones']:
        if az['ZoneType']=='availability-zone':
            return az
    else:
        return None
    
def create_subnet(vpc_id, cidr_block, dry_run):
    """Create a subnet in the first availability zone in your current region"""
    az = get_first_availability_zone()
    if az is not None:
        subnet_res = ec2.create_subnet(
            AvailabilityZone = az['ZoneName'], #'us-west-2a',
            VpcId = vpc_id,                    # vpc_res['Vpc']['VpcId'],
            CidrBlock= cidr_block,             #'100.68.0.18/18',
            DryRun=dry_run                     # True,
        )
        return subnet_res
    else:
        raise "No availability zone"
        
sn_res = create_subnet(
    vpc_id=vpc_res['Vpc']['VpcId'],
    cidr_block='10.0.0.0/28', # I want 2 ^ (32 - 28) private ipv4 in this subnet
    dry_run=False)

pp.pprint(sn_res)

{'ResponseMetadata': {'HTTPHeaders': {'cache-control': 'no-cache, no-store',
                                      'content-length': '927',
                                      'content-type': 'text/xml;charset=UTF-8',
                                      'date': 'Fri, 19 Mar 2021 01:22:03 GMT',
                                      'server': 'AmazonEC2',
                                      'strict-transport-security': 'max-age=31536000; '
                                                                   'includeSubDomains',
                                      'x-amzn-requestid': 'aa405e89-96cc-4a17-9552-9fd19d1a2429'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'aa405e89-96cc-4a17-9552-9fd19d1a2429',
                      'RetryAttempts': 0},
 'Subnet': {'AssignIpv6AddressOnCreation': False,
            'AvailabilityZone': 'us-west-2a',
            'AvailabilityZoneId': 'usw2-az2',
            'AvailableIpAddressCount': 11,
            'Cidr

In [10]:
# create a security group

sg_res = ec2.create_security_group(
    Description='security group for SageMaker instances',
    GroupName='sagemaker-private',
    VpcId=vpc_res['Vpc']['VpcId'],
    TagSpecifications=[
        {
            "ResourceType": "security-group",
            "Tags" : [
                {   
                    "Key": "Service", # Tag the sec gp by service, this can be used to filter sec gps
                    "Value": "SageMaker" 
                }
            ]
        }
    ]

)

pp.pprint(sg_res)

{'GroupId': 'sg-023956e2fa664f5dd',
 'ResponseMetadata': {'HTTPHeaders': {'cache-control': 'no-cache, no-store',
                                      'content-length': '409',
                                      'content-type': 'text/xml;charset=UTF-8',
                                      'date': 'Fri, 19 Mar 2021 01:22:38 GMT',
                                      'server': 'AmazonEC2',
                                      'strict-transport-security': 'max-age=31536000; '
                                                                   'includeSubDomains',
                                      'x-amzn-requestid': '7a29e519-64c5-42f2-8f4b-b5a9a697b6c5'},
                      'HTTPStatusCode': 200,
                      'RequestId': '7a29e519-64c5-42f2-8f4b-b5a9a697b6c5',
                      'RetryAttempts': 0},
 'Tags': [{'Key': 'Service', 'Value': 'SageMaker'}]}


In [11]:
# inspect the security group in detail

ec2.describe_security_groups(
    GroupIds=[
        sg_res['GroupId']
    ]
)

{'SecurityGroups': [{'Description': 'security group for SageMaker instances',
   'GroupName': 'sagemaker-private',
   'IpPermissions': [],
   'OwnerId': '688520471316',
   'GroupId': 'sg-023956e2fa664f5dd',
   'IpPermissionsEgress': [{'IpProtocol': '-1',
     'IpRanges': [{'CidrIp': '0.0.0.0/0'}],
     'Ipv6Ranges': [],
     'PrefixListIds': [],
     'UserIdGroupPairs': []}],
   'Tags': [{'Key': 'Service', 'Value': 'SageMaker'}],
   'VpcId': 'vpc-0b52a6097c86ebdc3'}],
 'ResponseMetadata': {'RequestId': '680b4acb-13b2-4de0-8234-e9b6ef93ce2c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '680b4acb-13b2-4de0-8234-e9b6ef93ce2c',
   'cache-control': 'no-cache, no-store',
   'strict-transport-security': 'max-age=31536000; includeSubDomains',
   'content-type': 'text/xml;charset=UTF-8',
   'content-length': '1233',
   'date': 'Fri, 19 Mar 2021 01:22:40 GMT',
   'server': 'AmazonEC2'},
  'RetryAttempts': 0}}

## Creat a training job
Now let's create a training job within your private subnet you just created.

In [13]:
# execution role for SageMaker



'default'