# Secure Data Science on AWS

The most common security considerations for data science projects in the cloud are compute and network isolation, authentication and authorization, data encryption, artifact management, auditability, monitoring, and governance.

In [4]:
import boto3

# Create one session and reuse it for the region lookup and all clients.
session = boto3.session.Session()
region = session.region_name

ec2 = session.client(service_name='ec2', region_name=region)
sm = session.client(service_name='sagemaker', region_name=region)

## Retrieve the Notebook Instance Name

In [5]:
import json
notebook_instance_name = None

try:
    with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
        data = json.load(notebook_info)
        resource_arn = data['ResourceArn']
        region = resource_arn.split(':')[3]
        notebook_instance_name = data['ResourceName']
    print('Notebook Instance Name: {}'.format(notebook_instance_name))
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR]: COULD NOT RETRIEVE THE NOTEBOOK INSTANCE METADATA.')
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

Notebook Instance Name: ds-notebook-dev-dev-antjebar


# Compute and Network Isolation 

This SageMaker notebook instance has been set up **without** Internet access. The notebook instance runs within a VPC without Internet connectivity but still maintains access to specific AWS services such as Elastic Container Registry and Amazon S3.  Access to a shared services VPC has also been configured to allow connectivity to a centralized repository of Python packages.

In [13]:
response = sm.describe_notebook_instance(
        NotebookInstanceName=notebook_instance_name
)

print(response)

{'NotebookInstanceArn': 'arn:aws:sagemaker:us-east-1:806570384721:notebook-instance/ds-notebook-dev-dev-antjebar', 'NotebookInstanceName': 'ds-notebook-dev-dev-antjebar', 'NotebookInstanceStatus': 'InService', 'Url': 'ds-notebook-dev-dev-antjebar.notebook.us-east-1.sagemaker.aws', 'InstanceType': 'ml.t3.medium', 'SubnetId': 'subnet-03f68f0599a0db188', 'SecurityGroups': ['sg-0559ec22f560603ea'], 'RoleArn': 'arn:aws:iam::806570384721:role/service-role/ds-notebook-role-dev-dev-antjebar', 'KmsKeyId': '4a1e174c-4fd1-484d-9205-33527bc34f2f', 'NetworkInterfaceId': 'eni-0cf80cd4d6dc6a544', 'LastModifiedTime': datetime.datetime(2020, 12, 16, 1, 30, 58, 708000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 12, 16, 1, 27, 2, 530000, tzinfo=tzlocal()), 'NotebookInstanceLifecycleConfigName': 'ds-notebook-lc-dev-dev', 'DirectInternetAccess': 'Disabled', 'VolumeSizeInGB': 10, 'DefaultCodeRepository': 'https://git-codecommit.us-east-1.amazonaws.com/v1/repos/ds-source-dev-dev', 'RootAccess

## Review The Following Settings

In [14]:
print('SubnetId: {}'.format(response['SubnetId']))
print('SecurityGroups: {}'.format(response['SecurityGroups']))
print('IAM Role: {}'.format(response['RoleArn']))
print('NetworkInterfaceId: {}'.format(response['NetworkInterfaceId']))
print('DirectInternetAccess: {}'.format(response['DirectInternetAccess']))

SubnetId: subnet-03f68f0599a0db188
SecurityGroups: ['sg-0559ec22f560603ea']
IAM Role: arn:aws:iam::806570384721:role/service-role/ds-notebook-role-dev-dev-antjebar
NetworkInterfaceId: eni-0cf80cd4d6dc6a544
DirectInternetAccess: Disabled
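
The settings printed above can also be checked programmatically. The following is a minimal sketch, not part of the original workshop: the helper name `audit_notebook_instance` and the wording of the findings are assumptions, and it only inspects keys that appear in the `describe_notebook_instance` response shown earlier.

```python
def audit_notebook_instance(description):
    """Return a list of findings for a describe_notebook_instance response.

    An empty list means none of the checks below raised a concern.
    """
    findings = []
    if description.get("DirectInternetAccess") != "Disabled":
        findings.append("Direct internet access is not disabled.")
    if not description.get("SubnetId"):
        findings.append("The notebook is not attached to a VPC subnet.")
    if not description.get("SecurityGroups"):
        findings.append("No security groups are attached.")
    if not description.get("KmsKeyId"):
        findings.append("No KMS key is configured for the notebook volume.")
    return findings
```

In this notebook it could be called as `audit_notebook_instance(response)`; for the instance described above, an empty list would be expected.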



## Verify That Internet Access Is Disabled

Expected result: without a path to the Internet or a proxy server, the request should time out with an error similar to:
```Failed to connect to aws.amazon.com port 443: Connection timed out```

In [15]:
!curl https://www.datascienceonaws.com/

^C


By removing public internet access in this way, we have created a secure environment in which all dependencies are already installed: the notebook cannot reach the internet, and internet traffic cannot reach the notebook.
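
The same check can be done from Python with an explicit timeout, so the cell does not have to be interrupted by hand. This is a sketch under stated assumptions: the helper name `can_reach` and the 5-second default are illustrative choices, not part of the original notebook.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

On an isolated notebook instance, `can_reach('www.datascienceonaws.com')` would be expected to return `False`.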

# Authentication and Authorization

SageMaker notebooks need to be assigned a role for accessing AWS services. Fine-grained access control over which services a SageMaker notebook is allowed to access can be provided using AWS Identity and Access Management (IAM).

To control access at the user level, data scientists should typically not be allowed to create notebooks or to provision or delete infrastructure. In some cases, even console access can be removed by creating presigned URLs that directly launch a hosted Jupyter environment for data scientists to use from their laptops.
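
As a rough sketch of the presigned URL approach, an administrator could generate a URL with the SageMaker API call `create_presigned_notebook_instance_url`. The helper name `presigned_notebook_url` and the 30-minute default session length are assumptions made for illustration.

```python
def presigned_notebook_url(sm_client, notebook_instance_name, session_seconds=1800):
    """Return a presigned URL that opens the Jupyter environment directly.

    sm_client is a boto3 SageMaker client created with administrator
    credentials; session_seconds controls how long the session stays valid.
    """
    response = sm_client.create_presigned_notebook_instance_url(
        NotebookInstanceName=notebook_instance_name,
        SessionExpirationDurationInSeconds=session_seconds,
    )
    return response["AuthorizedUrl"]
```

The caller needs permission for this API action, so the call would normally be made by an administrator or an automated portal rather than by the data scientist.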

Moreover, admins can use resource [tags for attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to ensure that different teams of data scientists, with the same high-level IAM role, have different access rights to AWS services, such as only allowing read/write access to specific S3 buckets which match tag criteria. 
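
To make the ABAC idea concrete, here is one possible policy document built in Python. It is a sketch, not the workshop's policy: it assumes that both the IAM principal and the notebook instance carry a tag named `team` (a hypothetical tag key), and it scopes only the presigned-URL action rather than S3 access.

```python
import json


def team_scoped_notebook_policy(tag_key="team"):
    """Build a policy that allows opening a notebook only when the
    notebook's tag value matches the caller's principal tag value."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:CreatePresignedNotebookInstanceUrl",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # Resource tag must equal the principal's tag value.
                        f"aws:ResourceTag/{tag_key}": f"${{aws:PrincipalTag/{tag_key}}}"
                    }
                },
            }
        ],
    }


print(json.dumps(team_scoped_notebook_policy(), indent=2))
```

With a policy of this shape, two teams can share one role definition while each team can only open notebooks tagged with its own `team` value.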

For customers with even more stringent data and code segregation requirements, admins can provision different accounts for individual teams and manage the billing for these accounts in a centralized Organizational Unit.

## Review IAM Role and Region For This Notebook Instance

In [18]:
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [20]:
print("IAM Role: {}".format(role))
print("Region: {}".format(region))

IAM Role: arn:aws:iam::806570384721:role/service-role/ds-notebook-role-dev-dev-antjebar
Region: us-east-1


## TODO: List IAM Role and Policies
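
One way this TODO might be filled in is sketched below. The helper names are hypothetical; the IAM calls used are `list_attached_role_policies` and `list_role_policies`. Pagination is ignored for brevity, and the notebook role itself needs IAM read permissions for the calls to succeed, which may not be the case in a locked-down environment.

```python
def role_name_from_arn(role_arn):
    """Extract the role name (the last path segment) from a role ARN."""
    return role_arn.split("/")[-1]


def list_role_policies(iam_client, role_name):
    """Return the managed and inline policy names attached to a role."""
    attached = iam_client.list_attached_role_policies(RoleName=role_name)
    inline = iam_client.list_role_policies(RoleName=role_name)
    return {
        "managed": [p["PolicyName"] for p in attached["AttachedPolicies"]],
        "inline": list(inline["PolicyNames"]),
    }
```

In this notebook the call could look like `list_role_policies(boto3.client('iam'), role_name_from_arn(role))`.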

## Grant `Least Privilege` for IAM Roles and Policies

IAM roles and policies help you control access to AWS resources. You create policies which define permissions, and attach the policies to IAM users, groups of users, or roles. 

Policy types include identity-based and resource-based policies, among others. Identity-based policies are tied to an identity, such as an IAM user or role. In contrast, resource-based policies are attached to a resource, such as an Amazon S3 bucket.

Here is a sample policy attached to the SageMaker notebook instance IAM role. This policy restricts the IAM Role to one specific S3 bucket. 

## TODO: Attach this policy to IAM Role

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::sagemaker-us-east-1-123456789-secure",
                "arn:aws:s3:::sagemaker-us-east-1-123456789-secure/*"
            ]
        }
    ]
}
```
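
A possible way to attach such a policy is sketched below, using the IAM call `put_role_policy` to add it as an inline policy. The helper names and the policy name passed by the caller are assumptions. The sketch uses `s3:ListBucket` for the bucket-level listing permission and splits bucket-level and object-level actions into separate statements.

```python
import json


def bucket_access_policy(bucket_name):
    """Build a least-privilege policy for a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reads and writes apply to the objects inside the bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


def attach_inline_policy(iam_client, role_name, policy_name, policy):
    """Attach the policy document to the role as an inline policy."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )
```

This step would normally be run by an administrator, since a least-privilege notebook role should not be able to modify its own permissions.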

## Let's try to copy data to this S3 bucket

In [None]:
!echo s3://$bucket-secure/

In [None]:
!aws s3 cp security.ipynb s3://$bucket-secure/

## Let's try to copy data over to a different S3 bucket!

In [None]:
!aws s3 cp ./security.ipynb s3://$bucket/
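
With the restrictive policy in place, the first copy would be expected to succeed and the second to be rejected with an `AccessDenied` error. The sketch below performs the same test from Python; the helper name `upload_outcome` is hypothetical, and the error code is read from the `response` attribute that boto3 client errors carry.

```python
def upload_outcome(s3_client, bucket, key, body=b"access test"):
    """Try to write one object and report whether the role is allowed to."""
    try:
        s3_client.put_object(Bucket=bucket, Key=key, Body=body)
        return "allowed"
    except Exception as err:
        # boto3 ClientError exposes the service error code under .response
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code == "AccessDenied":
            return "denied"
        raise  # anything else is unexpected and should surface
```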

# Train the Model

## Train Without a VPC Configured

To test the networking controls, run the following cell, which attempts to train the model without an associated network configuration. You should see the training job stopped at around the same time as the "Downloading - Downloading input data" message is emitted.

#### Detective control explained

The training job was terminated by an AWS Lambda function that was executed in response to a CloudWatch Event that was triggered when the training job was created. 

To learn more about how the detective control does this, assume the role of the Data Science Administrator and review the code of the [AWS Lambda function SagemakerTrainingJobVPCEnforcer](https://console.aws.amazon.com/lambda/home?#/functions/SagemakerTrainingJobVPCEnforcer?tab=configuration). 

You can also review the [CloudWatch Event rule SagemakerTrainingJobVPCEnforcementRule](https://console.aws.amazon.com/cloudwatch/home?#rules:name=SagemakerTrainingJobVPCEnforcementRule) and take note of the event which triggers execution of the Lambda function.
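
For readers without access to the administrator role, the following sketch shows roughly how such a detective control could work. It is not the code of `SagemakerTrainingJobVPCEnforcer`: the function names are hypothetical, and the event shape (`event['detail']['TrainingJobName']`) is an assumption that depends on how the event rule is defined.

```python
def enforce_vpc(event, sm_client):
    """Stop a training job that was created without a VPC configuration."""
    job_name = event["detail"]["TrainingJobName"]  # assumed event shape
    job = sm_client.describe_training_job(TrainingJobName=job_name)

    if job.get("VpcConfig", {}).get("Subnets"):
        return "compliant"

    # Only jobs that are still running can be stopped.
    if job.get("TrainingJobStatus") == "InProgress":
        sm_client.stop_training_job(TrainingJobName=job_name)
        return "stopped"
    return "ignored"


def lambda_handler(event, context):
    import boto3  # imported here so enforce_vpc can be tested without AWS access

    return enforce_vpc(event, boto3.client("sagemaker"))
```

Because the control reacts after the job has been created, it is detective rather than preventive: the job starts and is then stopped, which matches the behavior described above.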

---

In [36]:
from sagemaker import image_uris

# SageMaker Python SDK v2: image_uris.retrieve replaces the deprecated get_image_uri.
image = image_uris.retrieve(framework='xgboost', region=boto3.Session().region_name, version='0.90-1')

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


In [37]:
from sagemaker.inputs import TrainingInput

# SageMaker Python SDK v2: TrainingInput replaces sagemaker.s3_input, which no longer exists.
s3_input_train = TrainingInput(s3_data='s3://sagemaker-workshop-cloudformation-{}/quickstart/train_data.csv'.format(region), content_type='csv')
s3_input_test = TrainingInput(s3_data='s3://sagemaker-workshop-cloudformation-{}/quickstart/test_data.csv'.format(region), content_type='csv')
print("Training data at: {}".format(s3_input_train.config['DataSource']['S3DataSource']['S3Uri']))
print("Test data at: {}".format(s3_input_test.config['DataSource']['S3DataSource']['S3Uri']))

In [None]:
# SageMaker Python SDK v2 parameter names (the train_* prefixes were dropped).
xgb = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_run=3600,
    output_path='s3://{}/{}/models'.format(output_bucket, prefix),
    sagemaker_session=sess,
    use_spot_instances=True,
    max_wait=3600,
    encrypt_inter_container_traffic=False
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective='binary:logistic',
    num_round=100)

xgb.fit(inputs={'train': s3_input_train})


# Train with VPC

Now provide the training job with the network settings that were defined above. This time the training job should not be stopped with the **Client Error** seen before.
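
The cell below expects the variables `subnets` and `sec_groups`. One possible way to obtain them, assuming training jobs should run in the same subnet and security groups as the notebook instance, is to reuse the `describe_notebook_instance` response. The helper name is hypothetical and this is only a sketch.

```python
def vpc_settings_from_notebook(description):
    """Derive estimator network settings from a describe_notebook_instance response."""
    return {
        "subnets": [description["SubnetId"]],
        "security_group_ids": list(description["SecurityGroups"]),
    }
```

For example, `settings = vpc_settings_from_notebook(response)` followed by `subnets = settings['subnets']` and `sec_groups = settings['security_group_ids']`.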

In [None]:
from sagemaker.inputs import TrainingInput

# SageMaker Python SDK v2: TrainingInput replaces sagemaker.s3_input.
s3_input_train = TrainingInput(s3_data='s3://{}/{}/'.format(data_bucket, traindataprefix), content_type='csv')
s3_input_test = TrainingInput(s3_data='s3://{}/{}/'.format(data_bucket, testdataprefix), content_type='csv')
print("Training data at: {}".format(s3_input_train.config['DataSource']['S3DataSource']['S3Uri']))
print("Test data at: {}".format(s3_input_test.config['DataSource']['S3DataSource']['S3Uri']))

In [None]:
import time

# Assumption: Trial comes from the sagemaker-experiments package; the original import is not shown.
from smexperiments.trial import Trial

preprocessing_trial_component = tracker.trial_component

trial_name = f"cc-fraud-training-job-{int(time.time())}"
cc_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=cc_experiment.experiment_name,
    sagemaker_boto_client=sm)

cc_trial.add_trial_component(preprocessing_trial_component)
cc_training_job_name = "cc-training-job-{}".format(int(time.time()))

# SageMaker Python SDK v2 parameter names (the train_* prefixes were dropped).
xgb = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_run=3600,
    output_path='s3://{}/{}/models'.format(output_bucket, prefix),
    sagemaker_session=sess,
    use_spot_instances=True,
    max_wait=3600,
    subnets=subnets,  # run the training job inside the VPC
    security_group_ids=sec_groups,
    volume_kms_key=cmk_id,  # encrypt the training volume with the customer managed key
    encrypt_inter_container_traffic=False
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective='binary:logistic',
    num_round=100)

xgb.fit(
    inputs={'train': s3_input_train},
    job_name=cc_training_job_name,
    experiment_config={
        "TrialName": cc_trial.trial_name,  # log training job in Trials for lineage
        "TrialComponentDisplayName": "Training",
    },
    wait=True,
)


# Encryption

To ensure that the processed data is encrypted at rest on the processing cluster, we pass a customer managed key to the `volume_kms_key` parameter below. This instructs Amazon SageMaker to encrypt the EBS volumes used during the processing job with the specified key. Since the data stored in our Amazon S3 buckets is already encrypted, the data is encrypted at rest at all times.

Amazon SageMaker always uses TLS-encrypted tunnels when working with Amazon S3, so data is also encrypted in transit when traveling to or from Amazon S3.
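
The claim that the S3 data is already encrypted can be verified rather than assumed. The sketch below reads a bucket's default encryption settings with the S3 call `get_bucket_encryption`; the helper name is hypothetical, and the call raises an error if the role lacks permission to read the bucket's encryption configuration.

```python
def bucket_default_encryption(s3_client, bucket):
    """Return (algorithm, kms_key_id) for the bucket's default encryption rule.

    kms_key_id is None when the bucket does not use a specific KMS key.
    """
    config = s3_client.get_bucket_encryption(Bucket=bucket)
    rule = config["ServerSideEncryptionConfiguration"]["Rules"][0]
    default = rule["ApplyServerSideEncryptionByDefault"]
    return default["SSEAlgorithm"], default.get("KMSMasterKeyID")
```

A result such as `('aws:kms', <key id>)` would indicate that objects are encrypted with a KMS key by default, which can then be compared with the key passed as `cmk_id`.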

In [31]:
# Use SageMaker Processing with SKLearn. Combine data into train and test at this stage if possible.
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(
    framework_version='0.20.0',
    role=role,
    instance_type='ml.c4.xlarge',
    instance_count=1,
    network_config=network_config,  # attach SageMaker resources to your VPC
    volume_kms_key=cmk_id  # encrypt the EBS volume attached to SageMaker Processing instance
)

In [32]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code=codeupload,
    inputs=[
        ProcessingInput(
            source=raw_data_location, 
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='train_data',
            source='/opt/ml/processing/train',
            destination=train_data_location),
        ProcessingOutput(
            output_name='test_data',
            source='/opt/ml/processing/test',
            destination=test_data_location),
        ProcessingOutput(
            output_name='train_data_headers',
            source='/opt/ml/processing/train_headers',
            destination=train_header_location)
    ],
    arguments=['--train-test-split-ratio', '0.2'])

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']


Job Name:  sagemaker-scikit-learn-2020-12-16-21-19-58-652
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/data', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/test_data', 'LocalPath

# Model Development and Training

In [None]:
# Store the values used in this notebook for use in the second demo notebook:
experiment_name = cc_experiment.experiment_name
training_job_name = cc_training_job_name
%store trial_name
%store experiment_name
%store training_job_name