# Training Job with Encrypted Static Assets

In the [notebook about creating a training job in VPC mode](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-fundamentals/create-training-job/create_training_job_vpc.ipynb) you learnt how to create a SageMaker training job with network isolation. Network isolation enables you to protect your data and model from being intercepted by cyber pirates. 

![pirate](assets/pirate.jpg)

Another way to protect your static assets is to encrypt them before moving them from location A to location B. In this notebook, you will walk through a few techniques for doing so with the help of AWS Key Management Service [(AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

Encryption is a widely used technology. In addition to the introductory material above, you can find many free lectures online.

## Symmetric Ciphers
We will focus on symmetric ciphers in this notebook. To quote the GNU Privacy Handbook:

> A symmetric cipher is a cipher that uses the same key for both encryption and decryption. Two parties communicating using a symmetric cipher must agree on the key beforehand. Once they agree, the sender encrypts a message using the key, sends it to the receiver, and the receiver decrypts the message using the key. As an example, the German Enigma is a symmetric cipher, and daily keys were distributed as code books. Each day, a sending or receiving radio operator would consult his copy of the code book to find the day's key. Radio traffic for that day was then encrypted and decrypted using the day's key. Modern examples of symmetric ciphers include 3DES, Blowfish, and IDEA.
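
To make the quoted definition concrete, here is a toy sketch. It is not taken from the handbook and is not something to use for real data. It shows a one-time-pad style XOR cipher in which the same key, agreed on beforehand, both encrypts and decrypts. The function name `xor_cipher` is made up for this illustration.

```python
import secrets


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the matching byte of key.

    Applying the function twice with the same key returns the original
    data, which is what makes this a symmetric cipher.
    """
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"attack at dawn"
shared_key = secrets.token_bytes(len(message))  # both parties hold this key

ciphertext = xor_cipher(message, shared_key)    # the sender encrypts
recovered = xor_cipher(ciphertext, shared_key)  # the receiver decrypts

assert recovered == message
```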

## Environment to run this notebook
You can run this notebook on your local machine or EC2 instance as an IAM user or you can run it on SageMaker Notebook Instance as a SageMaker service role. To avoid confusion, we will assume you are running it as an IAM user.

## Permissions
You will need to attach the following permissions to the IAM user:

* IAMFullAccess 
* AWSKeyManagementServicePowerUser
* AmazonEC2ContainerRegistryFullAccess

## Outline of this notebook

* Generate a symmetric customer master key (CMK)
* Allow your SageMaker service role to use the CMK
* Generate a data key from the CMK
* Encrypt some data with the data key and upload the encrypted data to S3
* Create a SageMaker service role
* Build a training image 
* Create a SageMaker training job using the encrypted data
* Verify that data retrieved from S3 is encrypted and SageMaker needs your data key to decrypt

The process of using a data key to encrypt your data, instead of using the master key directly, is called [**envelope encryption**](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#enveloping).
You can use the master key to encrypt your data directly, but by using a data key you reduce the risk of a [man-in-the-middle attack](https://en.wikipedia.org/wiki/Man-in-the-middle_attack).
We will discuss the use of data keys in detail later.

![envelope-encryption](assets/envelope-encryption.jpg)
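
The envelope encryption flow can be sketched offline before involving KMS. The snippet below is only an illustration under stated assumptions: `toy_cipher` is a made-up keystream cipher built from SHA-256 and must not be used for real data, and both keys are generated locally, whereas in the rest of this notebook KMS holds the master key.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 based keystream (illustration only).

    Calling it twice with the same key returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


master_key = secrets.token_bytes(32)  # stands in for the key that never leaves KMS
data_key = secrets.token_bytes(32)    # generated for this data set

payload = b"training data"
encrypted_payload = toy_cipher(data_key, payload)      # data is encrypted with the data key
encrypted_data_key = toy_cipher(master_key, data_key)  # data key is wrapped by the master key
del data_key                                           # the plaintext data key is discarded

# Later: unwrap the data key with the master key, then decrypt the payload.
unwrapped_key = toy_cipher(master_key, encrypted_data_key)
assert toy_cipher(unwrapped_key, encrypted_payload) == payload
```

Only `encrypted_payload` and `encrypted_data_key` need to be stored or sent; neither is useful without access to the master key.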

In [None]:
# set ups
import boto3
import datetime
import json
import pprint

pp = pprint.PrettyPrinter(indent=1)
kms = boto3.client('kms') 

In [None]:
# Some helper functions

def current_time():
    """Return the current local time as YYYY-MM-DD-HH-MM-SS"""
    ct = datetime.datetime.now()
    return str(ct).replace(":", "-").replace(" ", "-")[:19]

def account_id():
    return boto3.client('sts').get_caller_identity()['Account']

### Generate a symmetric customer master key

You will use the [kms:CreateKey](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/kms.html#KMS.Client.create_key) API to generate a **symmetric customer master key** used for **encryption** and **decryption**. You need a key policy to define who has access to the key, and with what level of access.
If you create the key from the AWS console, then by following the default steps you will end up with the following key policy:

In [None]:
root_arn = f"arn:aws:iam::{account_id()}:root"
user_arn = boto3.client('sts').get_caller_identity()['Arn']

key_policy = {
    "Id": "key-consolepolicy-3",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Enable IAM User Permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": root_arn # the account principal: lets IAM policies in
                                # this account grant access to the key
            },
            "Action": "kms:*",
            "Resource": "*"
        },
        
        {
            "Sid": "Allow access for Key Administrators",
            "Effect": "Allow",
            "Principal": {
                "AWS": [user_arn]   # give myself admin permission to this key
                                     # you can add more admin users by appending this list
            },
            "Action": [
                "kms:Create*",
                "kms:Describe*",
                "kms:Enable*",
                "kms:List*",
                "kms:Put*",
                "kms:Update*",
                "kms:Revoke*",
                "kms:Disable*",
                "kms:Get*",
                "kms:Delete*",
                "kms:TagResource",
                "kms:UntagResource",
                "kms:ScheduleKeyDeletion",
                "kms:CancelKeyDeletion"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [user_arn]   # allow myself to use the key
                                     # you can add more users / roles to this list
                                     # for example you can add SageMaker service role 
                                     # here. But we will allow SageMaker service role
                                     # to use this key via grant (see below)

            },
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow attachment of persistent resources",
            "Effect": "Allow",
            "Principal": {
                "AWS": [user_arn] # allow myself to create grant for this key
                                   # see ref below to understand the diff 
                                   # between user and grant
                                   # https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#grant
            },
            "Action": [
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant"
            ],
            "Resource": "*",
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": "true"
                }
            }
        }
    ]
}

key_policy = json.dumps(key_policy)

In [None]:
# create a key with the above key policy

ck_res = kms.create_key(
    Policy=key_policy,
    Description="a symmetric key to demonstrate KMS",
    KeyUsage="ENCRYPT_DECRYPT",                # use this key to encrypt and decrypt
    Origin='AWS_KMS',                          # created via AWS KMS
    CustomerMasterKeySpec='SYMMETRIC_DEFAULT'  # symmetric key
)


pp.pprint(ck_res)

In [None]:
master_key = ck_res['KeyMetadata']['KeyId']
print("The id of the key: ")
print(master_key)

You can use this master key to encrypt your data directly. This is not good practice in production (KMS `Encrypt` with a symmetric key accepts at most 4096 bytes of plaintext), but it is good to know what you can do.

In [None]:
my_secret_message = "1729 is the smallest number expressible \
as the sum of two cubes in two different ways".encode('utf-8')

# 1729 =  1^3 + 12^3 = 9^3 + 10^3 (Srinivasa Ramanujan)

# make the above secret a ciphertext
enc_res = kms.encrypt(
    KeyId=master_key,
    Plaintext=my_secret_message)

pp.pprint(enc_res)

In [None]:
# decrypt your secret message
dec_res = kms.decrypt(
    KeyId=master_key,
    CiphertextBlob=enc_res['CiphertextBlob']
)

print("Decrypted message:")
print(dec_res['Plaintext'].decode())

Note that encryption and decryption operate on **bytes**. If you want to encrypt a Python object (a list, NumPy array, pandas data frame, PyTorch model or string), the first step is to serialize it into bytes. One easy way to do so is the `pickle.dumps` function.
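
As a small sketch of that serialization step, the snippet below turns a few Python objects into bytes and back using only the standard library. The `record` dictionary is a made-up example. Keep in mind that `pickle.loads` should only be called on data you trust.

```python
import json
import pickle

record = {"label": "cat", "pixels": [0, 127, 255]}

as_pickle = pickle.dumps(record)              # any picklable Python object -> bytes
as_json = json.dumps(record).encode("utf-8")  # JSON-compatible object -> text -> bytes
as_text = "a plain string".encode("utf-8")    # str -> bytes

assert all(isinstance(b, bytes) for b in (as_pickle, as_json, as_text))

# The reverse step runs after decryption.
assert pickle.loads(as_pickle) == record
assert json.loads(as_json.decode("utf-8")) == record
assert as_text.decode("utf-8") == "a plain string"
```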

## Client-side encryption with data key
Now let's pretend you are a data engineer and you need to move a chunk of data from location A to location B. Location A is the machine you are using now to run this notebook; location B is an S3 bucket that your data scientist buddy will use later to create a training job. You want to ensure that while the data is on its way from location A to location B, it is not intercepted and stolen by an attacker in the middle.

One solution is to generate a data key `DK` from the master key, use `DK` to encrypt your data at location A (client side), and save the encrypted data to the S3 bucket.

You will get a different data key each time you request one from the master key. The plaintext data key is intended to be **short-lived**, and you should only save the **encrypted** data key for later use.

Use [kms:GenerateDataKey](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/kms.html#KMS.Client.generate_data_key) to generate a data key. 

In [None]:
key_length = 32 # 32 bytes

data_key_res = kms.generate_data_key(
    KeyId=master_key,
    NumberOfBytes=key_length  # your data key will be 32 x 8 = 256 bits long;
                              # a brute-force search has up to 2^256 keys to try
)

pp.pprint(data_key_res)
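
To get a feel for the size of a 256-bit key space, the arithmetic can be worked out directly. The attacker speed below is an arbitrary assumption chosen only for illustration.

```python
key_length_bytes = 32
key_bits = key_length_bytes * 8       # 32 bytes x 8 bits per byte
keyspace = 2 ** key_bits              # number of possible data keys

guesses_per_second = 10 ** 12         # assumed attacker speed (hypothetical)
seconds_per_year = 60 * 60 * 24 * 365
years_to_try_all = keyspace // (guesses_per_second * seconds_per_year)

print(f"{key_bits}-bit key: a number of possible keys with {len(str(keyspace))} digits")
print(f"years to try them all: a number with {len(str(years_to_try_all))} digits")
```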

In [None]:
plaintext, ciphertext = data_key_res['Plaintext'], data_key_res['CiphertextBlob']

In [None]:
assert len(plaintext) == key_length

The ciphertext above is the encrypted data key, encrypted by the master key. The ciphertext is what you should keep for the long term. Nothing prevents you from encrypting your plaintext data key with a different master key; you just need to remember which master key you used to encrypt it.

In [None]:
assert kms.decrypt(KeyId=master_key, CiphertextBlob=ciphertext)['Plaintext'] == plaintext
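
One way to remember which master key protects a data key is to store the two together. The sketch below is an assumption about how you might do that, not part of the KMS API: `pack_data_key` and `unpack_data_key` are hypothetical helpers, and the key ID and ciphertext are placeholder values.

```python
import base64
import json


def pack_data_key(ciphertext_key: bytes, master_key_id: str) -> str:
    """Bundle an encrypted data key with the ID of the master key that wrapped it."""
    return json.dumps({
        "master_key_id": master_key_id,
        # base64 turns the binary ciphertext into text that JSON can hold
        "encrypted_data_key": base64.b64encode(ciphertext_key).decode("ascii"),
    })


def unpack_data_key(record: str):
    """Return (ciphertext_key, master_key_id) from a packed record."""
    fields = json.loads(record)
    return base64.b64decode(fields["encrypted_data_key"]), fields["master_key_id"]


placeholder_ciphertext = b"\x01\x02\xff placeholder ciphertext"
record = pack_data_key(placeholder_ciphertext, "hypothetical-key-id")

blob, key_id = unpack_data_key(record)
assert blob == placeholder_ciphertext
assert key_id == "hypothetical-key-id"
```

The record contains no plaintext key material, so it can be saved next to the encrypted data.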

Note that the plaintext data key is a bytes object, not a string. Because it is random binary data, it generally cannot be decoded to a Python string.

In [None]:
try:
    plaintext.decode('utf-8')
except Exception as e:
    print(e)

There are multiple python libraries for cryptography. We will use [cryptography](https://pypi.org/project/cryptography/)

In [None]:
!pip install cryptography

In [None]:
import base64
from cryptography.fernet import Fernet

def encrypt(data, plaintext_key):
    """Encrypt a chunk of bytes on client-side
    data: a chunk of bytes
    plaintext_key: plaintext data key
    """
    # Fernet key must be 32 url-safe base64-encoded bytes
    # That's why we generated a 32-byte long data key
    fernet_key = base64.urlsafe_b64encode(plaintext_key)

    f = Fernet(key=fernet_key)
    return f.encrypt(data)

def decrypt(data, ciphertext_key):
    """Decrypt a chunk of bytes on client-side
    data: encrypted binary data
    ciphertext_key: ciphertext data key
    """
    # decrypt the ciphertext data key
    plaintext_key = kms.decrypt(
        KeyId=master_key, 
        CiphertextBlob=ciphertext_key)['Plaintext']
    
    # to Fernet-friendly key (url-safe base64)
    fernet_key = base64.urlsafe_b64encode(plaintext_key)
    
    f = Fernet(key=fernet_key)
    return f.decrypt(data)
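
The Fernet key format mentioned in the comments (32 bytes, url-safe base64-encoded) can be inspected with the standard library alone. In this sketch the raw bytes are a fixed placeholder, not a real key, chosen so that the difference between standard and url-safe base64 is visible.

```python
import base64

raw_key = b"\xfb\xff\xbf" * 10 + b"\x00\x00"  # 32 placeholder bytes, not a real key
assert len(raw_key) == 32

standard = base64.b64encode(raw_key)          # alphabet includes '+' and '/'
url_safe = base64.urlsafe_b64encode(raw_key)  # uses '-' and '_' instead

print(standard[:8])
print(url_safe[:8])

# 32 raw bytes always encode to 44 base64 characters
assert len(url_safe) == 44
assert base64.urlsafe_b64decode(url_safe) == raw_key
```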

In [None]:
import pickle

# encrypt
data = list(range(1729))
encrypted_data = encrypt(
    pickle.dumps(data), # python object -> bytes 
    plaintext
)

Once you have finished encrypting, you should delete the plaintext data key as soon as possible.

In [None]:
del plaintext

In [None]:
# decrypt
b = decrypt(encrypted_data, ciphertext)
data_ = pickle.loads(b) # bytes -> python object

# compare the whole lists, so a length mismatch is caught too
assert data == data_

## Save encrypted objects on S3 
Now that you understand how client-side encryption works, saving encrypted data to an S3 bucket is straightforward.

In [None]:
# create a bucket to be shared by SageMaker later

def create_bucket():
    """Create an S3 bucket that is intended to be used for short term"""
    bucket = f"sagemaker-{current_time()}"
    
    region_name = boto3.Session().region_name
    kwargs = {'Bucket': bucket}
    if region_name != 'us-east-1': 
        # us-east-1 is the default region for S3 buckets;
        # any other region must be given as LocationConstraint
        kwargs['CreateBucketConfiguration'] = {
            'LocationConstraint': region_name
        }
    
    # CreateBucketConfiguration is omitted entirely in us-east-1,
    # because S3 may reject an empty configuration
    boto3.client('s3').create_bucket(**kwargs)
    return bucket

bucket = create_bucket()

In [None]:
# put your encrypted data on the S3 bucket

s3 = boto3.client('s3')
input_prefix = "data" # will be used later as S3Prefix when calling CreateTrainingJob

put_obj_res = s3.put_object(
    Bucket=bucket, 
    Key=input_prefix +'/'+'a_chunk_of_secrets',
    Body=encrypted_data)

pp.pprint(put_obj_res)

## Create a SageMaker training job with encrypted data
Now you understand how to move your data from location $A$ to location $B$ in encrypted form. Let's see how this workflow fits into a SageMaker training job. The goal is to encrypt the static assets (model and data) before you send them over the Internet.

Let $M$ denote the customer master key hosted on KMS, $D$ the plaintext data key and $C$ the ciphertext data key. 

Suppose your training data is in an S3 bucket, encrypted by the data key $D$. In order to use the training data, the SageMaker training job needs to be able to decrypt it. Of course you **would not** want to move $D$ (plaintext) around the Internet and hand it to a SageMaker training job. Instead you will hand the encrypted data key (ciphertext) $C$ to the SageMaker training job.

The SageMaker training job will do the following things with $C$:
- Decrypt it to plaintext using the master key $M$ and get $D$
- Download the encrypted data from the S3 bucket and decrypt the data with $D$
- Train the model and encrypt the model with $D$ 
- Send the encrypted model to an S3 bucket

Of course, you could use a different data key to encrypt the model.
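
The four steps above can be sketched as a single function. This is a hedged outline rather than the training code used later: `train_on_encrypted_data`, `make_cipher` and `train_fn` are hypothetical names, the clients are passed in so the flow can be read (and tested) without AWS access, and `make_cipher` is assumed to return an object with `encrypt` and `decrypt` methods, for example a thin wrapper around Fernet.

```python
import pickle


def train_on_encrypted_data(kms_client, s3_client, make_cipher, train_fn,
                            key_id, ciphertext_key, bucket, input_key, output_key):
    """Decrypt C, fetch and decrypt the data, train, then upload an encrypted model."""
    # 1. Decrypt C with the master key M to recover the plaintext data key D
    data_key = kms_client.decrypt(
        KeyId=key_id, CiphertextBlob=ciphertext_key)["Plaintext"]
    cipher = make_cipher(data_key)
    del data_key  # drop our reference to the plaintext data key

    # 2. Download the encrypted data from S3 and decrypt it with D
    encrypted = s3_client.get_object(Bucket=bucket, Key=input_key)["Body"].read()
    data = pickle.loads(cipher.decrypt(encrypted))

    # 3. Train the model and encrypt it with D
    model = train_fn(data)
    encrypted_model = cipher.encrypt(pickle.dumps(model))

    # 4. Send the encrypted model to S3
    s3_client.put_object(Bucket=bucket, Key=output_key, Body=encrypted_model)
    return output_key
```

In a real job, `kms_client` and `s3_client` would be `boto3.client('kms')` and `boto3.client('s3')` created inside the training container under the service role.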

### How SageMaker uses your master key $M$

Remember that a managed service like SageMaker *assumes* an IAM role (service role) in your account and provisions resources in your AWS account based on the permissions of that service role.

When you created $M$, the key policy said that the IAM user (you) and the root user of your account are the only entities entitled to use $M$. So how does SageMaker use $M$ then?

Suppose your SageMaker service role is called `example-role`. There are two ways to achieve this:

1. Update the key policy to allow `example-role` to use $M$
2. Allow `example-role` to use $M$ via a **grant**

To quote the [KMS docs](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#grant):

>A grant is a policy instrument that allows AWS principals to use AWS KMS customer master keys (CMKs) in cryptographic operations. It also can let them view a CMK (DescribeKey) and create and manage grants. When authorizing access to a CMK, grants are considered along with key policies and IAM policies. Grants are often used for temporary permissions because you can create one, use its permissions, and delete it without changing your key policies or IAM policies. Because grants can be very specific, and are easy to create and revoke, they are often used to provide temporary permissions or more granular permissions.

We will use the grant approach in this tutorial as it involves fewer changes to your key policy. In a production environment, you should treat any change to your key policy as *a big deal*.
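
For comparison, the first option (editing the key policy) might look like the sketch below. It only builds the updated policy document. `allow_role_to_use_key` is a hypothetical helper, and the policy and role ARN are placeholders rather than the ones created in this notebook.

```python
import copy
import json


def allow_role_to_use_key(policy: dict, role_arn: str) -> dict:
    """Return a copy of a key policy with one extra statement that lets
    role_arn encrypt and decrypt with the key."""
    updated = copy.deepcopy(policy)  # leave the caller's policy untouched
    updated["Statement"].append({
        "Sid": "Allow SageMaker role to use the key",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["kms:Encrypt", "kms:Decrypt"],
        "Resource": "*",
    })
    return updated


# placeholder policy and role ARN, for illustration only
policy = {"Version": "2012-10-17", "Statement": []}
role_arn = "arn:aws:iam::111122223333:role/example-role"

new_policy = allow_role_to_use_key(policy, role_arn)
print(json.dumps(new_policy, indent=2))

# Applying it would replace the entire key policy, for example:
# kms.put_key_policy(KeyId=master_key, PolicyName="default",
#                    Policy=json.dumps(new_policy))
```

Because `put_key_policy` replaces the whole policy document, a mistake here can lock users out of the key, which is one reason to prefer a grant.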

First, get some helper functions for creating a SageMaker service role. 

In [None]:
%%bash
# remove any stale copy before downloading a fresh one
rm -f iam_helpers.py

wget https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py

In [None]:
# set up service role for SageMaker
from iam_helpers import create_execution_role

iam = boto3.client('iam')

role_name = 'example-role'
role_arn = create_execution_role(role_name=role_name)['Role']['Arn']

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

In [None]:
# create a boto3 session with example-role
import time

def create_session(role_arn):
    """Create a boto3 session with an IAM role"""
    now = str(time.time()).split('.')[0]
    obj = boto3.client('sts').assume_role(
        RoleArn=role_arn,
        RoleSessionName=now
    )

    cred=obj['Credentials']
    sess = boto3.session.Session(
        aws_access_key_id=cred['AccessKeyId'],
        aws_secret_access_key=cred['SecretAccessKey'],
        aws_session_token=cred['SessionToken']
        )
    return sess

sess = create_session(role_arn)

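# example-role has not been allowed to use the key yet,
# so this call is expected to fail and print the error
try: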
    sess.client('kms').encrypt(
        KeyId=master_key,
        Plaintext='it will not go through'.encode('utf-8')
    )
except Exception as e:
    print(e)

In [None]:
del sess

In [None]:
grant_res = kms.create_grant(
    KeyId=master_key,
    GranteePrincipal=role_arn, 
    Operations=['Decrypt', 'Encrypt'] # allow example-role to use M to encrypt and decrypt
)

pp.pprint(grant_res)
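
Since grants are meant to be easy to revoke, a clean-up step might look like the following sketch. `revoke_grants_for` is a hypothetical helper. It takes the KMS client as an argument and relies on the `list_grants` and `revoke_grant` operations, matching grants on the `GranteePrincipal` field of each listed grant.

```python
def revoke_grants_for(kms_client, key_id, grantee_arn):
    """Revoke every grant on key_id whose grantee is grantee_arn.

    Returns the list of revoked grant IDs.
    """
    revoked = []
    kwargs = {"KeyId": key_id}
    while True:
        page = kms_client.list_grants(**kwargs)
        for grant in page["Grants"]:
            if grant.get("GranteePrincipal") == grantee_arn:
                kms_client.revoke_grant(KeyId=key_id, GrantId=grant["GrantId"])
                revoked.append(grant["GrantId"])
        # list_grants is paginated: keep going while the result is truncated
        if not page.get("Truncated"):
            break
        kwargs["Marker"] = page["NextMarker"]
    return revoked


# Usage with the names from this notebook (requires AWS access):
# revoke_grants_for(kms, master_key, role_arn)
```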