# Amazon SageMaker Demonstration Notebook for Operations Teams

This notebook demonstrates several aspects of a secure data science environment in practice. By following the exercises below you will see:

 - network segregation of computing resources
 - working with a centralized package repository
 - enforcement of network security policies


# Part A: Environment Setup

## Part A.1: Compute and Network Isolation 

In this exercise we have launched a Jupyter notebook server **without** Internet access.  The server runs within a VPC without Internet connectivity but still maintains access to specific AWS services such as Elastic Container Registry and Amazon S3.  Access to a shared services VPC has also been configured to allow connectivity to a centralized repository of Python packages.

### Test Networking

To demonstrate the lack of Internet connectivity, run the command below. Without a path to the Internet or a proxy server, it will time out.

In [None]:
!curl https://aws.amazon.com

By removing public Internet access in this way, you have a secure environment where all the dependencies are installed, but the notebook now has no way to access the Internet, and Internet traffic cannot reach the notebook.
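The same check can be made from Python with a bounded timeout, so the cell fails fast instead of hanging. This is a minimal sketch using only the standard library; the function name `can_connect` and the 3-second default are illustrative choices, not part of the workshop code.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

In an isolated notebook such as this one, a call like `can_connect("aws.amazon.com", 443)` would be expected to return `False`, while the same call from a machine with Internet access would normally return `True`.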

## Part A.2: Authentication and Authorization

SageMaker notebooks need to be assigned a role for accessing AWS services. Fine-grained control over which services a SageMaker notebook is allowed to access can be provided using Identity and Access Management (IAM).

To control access at the user level, data scientists should typically not be allowed to create notebooks or to provision or delete infrastructure. In some cases, even console access can be removed by creating presigned URLs that directly launch a hosted Jupyter environment for data scientists to use from their laptops.
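As a rough sketch of the presigned URL approach, an administrator-side helper could call the SageMaker `create_presigned_notebook_instance_url` API. The helper name, its arguments, and the 30-minute default are assumptions made for illustration; the client is passed in so the function can be exercised without AWS credentials.

```python
def presigned_notebook_url(sagemaker_client, notebook_name: str,
                           ttl_seconds: int = 1800) -> str:
    """Return a time-limited URL that opens the named notebook instance."""
    response = sagemaker_client.create_presigned_notebook_instance_url(
        NotebookInstanceName=notebook_name,
        SessionExpirationDurationInSeconds=ttl_seconds,
    )
    # The service returns the URL under the "AuthorizedUrl" key.
    return response["AuthorizedUrl"]
```

An administrator would typically pass `boto3.client("sagemaker")` as the client and hand the resulting URL to the data scientist, who then never needs console access.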

Moreover, admins can use resource [tags for attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to ensure that different teams of data scientists, with the same high-level IAM role, have different access rights to AWS services, such as only allowing read/write access to specific S3 buckets which match tag criteria. 
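One way to picture ABAC is a statement that grants read access only when an object's tag matches the caller's own tag. The sketch below builds such a statement as a Python dictionary; the tag key `team` and the function name are hypothetical, and a real policy would usually narrow `Resource` as well.

```python
import json


def abac_read_statement(tag_key: str = "team") -> dict:
    """Build an Allow statement that matches object tags to principal tags."""
    return {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                # IAM substitutes the policy variable at evaluation time.
                f"s3:ExistingObjectTag/{tag_key}": f"${{aws:PrincipalTag/{tag_key}}}"
            }
        },
    }


policy = {"Version": "2012-10-17", "Statement": [abac_read_statement("team")]}
print(json.dumps(policy, indent=2))
```

Two teams can then share the same role definition while each sees only the objects tagged with its own `team` value.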

For customers with even more stringent data and code segregation requirements, admins can provision different accounts for individual teams and manage the billing from these accounts in a centralized Organizational Unit. 

In [33]:
# Inspect the role you have created for the notebook
import boto3
import sagemaker
from sagemaker import get_execution_role

sm = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
region = boto3.session.Session().region_name

role = get_execution_role()
print("Notebook is running with assumed role {}".format(role))
print("Working with AWS services in the {} region".format(region))

Notebook is running with assumed role arn:aws:iam::535574150626:role/service-role/ds-notebook-role-barto-abc-123-dev-jpb
Working with AWS services in the us-west-2 region


### Sample Notebook IAM Role

As part of this workshop, we have assigned an IAM role to this notebook. This role will be used by the notebook instance to access AWS APIs. Look at the IAM policies attached to this role. 

Below is an example policy which provides least privilege access to various services like Amazon S3 and Amazon SageMaker that a data scientist would need to develop and conduct experiments.  


```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ssm:GetParameters",
                "ssm:GetParameter"
            ],
            "Resource": "arn:aws:ssm:eu-west-2:0123456789012:parameter/ds-*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "Null": {
                    "sagemaker:VpcSubnets": "true"
                }
            },
            "Action": [
                "sagemaker:CreateNotebookInstance",
                "sagemaker:CreateHyperParameterTuningJob",
                "sagemaker:CreateProcessingJob",
                "sagemaker:CreateTrainingJob",
                "sagemaker:CreateModel"
            ],
            "Resource": "*",
            "Effect": "Deny"
        },
        {
            "Condition": {
                "ForAllValues:StringEqualsIfExists": {
                    "sagemaker:VpcSubnets": [
                        "subnet-012341dabe787cc21",
                        "subnet-0123457cd6518f8af",
                        "subnet-01234da97259ab887"
                    ],
                    "sagemaker:VpcSecurityGroupIds": [
                        "sg-012347ba900d25251"
                    ]
                }
            },
            "Action": [
                "sagemaker:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "application-autoscaling:DeleteScalingPolicy",
                "application-autoscaling:DeleteScheduledAction",
                "application-autoscaling:DeregisterScalableTarget",
                "application-autoscaling:DescribeScalableTargets",
                "application-autoscaling:DescribeScalingActivities",
                "application-autoscaling:DescribeScalingPolicies",
                "application-autoscaling:DescribeScheduledActions",
                "application-autoscaling:PutScalingPolicy",
                "application-autoscaling:PutScheduledAction",
                "application-autoscaling:RegisterScalableTarget",
                "cloudwatch:DeleteAlarms",
                "cloudwatch:DescribeAlarms",
                "cloudwatch:GetMetricData",
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics",
                "cloudwatch:PutMetricAlarm",
                "cloudwatch:PutMetricData",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcEndpoints",
                "ec2:DescribeVpcs",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:CreateRepository",
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "ecr:Describe*",
                "elastic-inference:Connect",
                "iam:ListRoles",
                "kms:CreateGrant",
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ListAliases",
                "lambda:ListFunctions",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "logs:PutLogEvents",
                "sns:ListTopics",
                "codecommit:BatchGetRepositories",
                "codecommit:GitPull",
                "codecommit:GitPush",
                "codecommit:CreateBranch",
                "codecommit:DeleteBranch",
                "codecommit:GetBranch",
                "codecommit:ListBranches",
                "codecommit:CreatePullRequest",
                "codecommit:GetPullRequest",
                "codecommit:CreateCommit",
                "codecommit:GetCommit",
                "codecommit:GetCommitHistory",
                "codecommit:GetDifferences",
                "codecommit:GetReferences",
                "codecommit:CreateRepository",
                "codecommit:GetRepository",
                "codecommit:ListRepositories"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "ecr:SetRepositoryPolicy",
                "ecr:CompleteLayerUpload",
                "ecr:BatchDeleteImage",
                "ecr:UploadLayerPart",
                "ecr:DeleteRepositoryPolicy",
                "ecr:InitiateLayerUpload",
                "ecr:DeleteRepository",
                "ecr:PutImage"
            ],
            "Resource": "arn:aws:ecr:*:*:repository/*sagemaker*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::ds-data-bucket-project-dev",
                "arn:aws:s3:::ds-data-bucket-project-dev/*",
                "arn:aws:s3:::ds-model-bucket-project-dev",
                "arn:aws:s3:::ds-model-bucket-project-dev/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEqualsIgnoreCase": {
                    "s3:ExistingObjectTag/SageMaker": "true"
                }
            },
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*",
                "arn:aws:lambda:*:*:function:*LabelingFunction*"
            ],
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringLike": {
                    "iam:AWSServiceName": "sagemaker.application-autoscaling.amazonaws.com"
                }
            },
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sns:Subscribe",
                "sns:CreateTopic"
            ],
            "Resource": [
                "arn:aws:sns:*:*:*SageMaker*",
                "arn:aws:sns:*:*:*Sagemaker*",
                "arn:aws:sns:*:*:*sagemaker*"
            ],
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com"
                    ]
                }
            },
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
```
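The two SageMaker statements in the policy work as a pair: the `Deny` with the `Null` condition rejects create calls that carry no subnets, and the `Allow` with `ForAllValues:StringEqualsIfExists` accepts only requests whose subnets are all on the approved list. The toy function below mimics that logic for the subnet key only. It is a simplified illustration, not the IAM evaluation engine, and the function name is hypothetical.

```python
APPROVED_SUBNETS = {
    "subnet-012341dabe787cc21",
    "subnet-0123457cd6518f8af",
    "subnet-01234da97259ab887",
}


def create_job_allowed(requested_subnets=None) -> bool:
    """Toy evaluation of the two SageMaker statements for a Create* call."""
    # Deny statement: "Null": {"sagemaker:VpcSubnets": "true"} matches when
    # the request carries no subnets, and an explicit Deny always wins.
    # This sketch treats an empty list the same as a missing key.
    if not requested_subnets:
        return False
    # Allow statement: every requested subnet must be on the approved list.
    return set(requested_subnets) <= APPROVED_SUBNETS
```

For example, `create_job_allowed()` returns `False` because no VPC configuration was supplied, while a request that names only approved subnets returns `True`. Security groups are checked the same way in the real policy but are left out here for brevity.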

### Complete Setup: Import libraries and set global definitions.

All of the needed Python packages have been installed from the central PyPI mirror for you by this notebook's lifecycle configuration script. 

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from time import sleep, gmtime, strftime
import time

In [3]:
# Import SageMaker Experiments 
from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

### Import Networking definitions: VPC Id, KMS keys and security groups and subnets

This notebook instance used a Lifecycle Configuration script to configure the notebook for use by a project team member.  In addition to installing certain Python packages as described above, the script also generated a Python module specific to the operating environment of the Jupyter notebook server.

As a data scientist you are likely to have little training for AWS KMS keys, VPC security groups, or subnet IDs.  But in order to enforce security policy these values need to be passed to SageMaker resources such as training jobs.  By generating a Python module the Lifecycle Configuration associated with this notebook has created a static library which can be imported for convenience.  

This Python module can be found in `~/.ipython/sagemaker_environment.py` and is available to all Anaconda kernels. With this convenience module, developers and data scientists do not need to know which custom KMS keys or network configurations are in use; they can simply reference them as part of the environment.
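The lifecycle script itself is not shown in this notebook. As a hedged sketch of how such a module could be generated, the function below renders Python source from a dictionary of settings. The constant names match those imported later in this notebook; the function name and all values are placeholders.

```python
def render_environment_module(values: dict) -> str:
    """Render Python source that defines one constant per dictionary entry."""
    lines = ["# Generated by the notebook lifecycle configuration (sketch)."]
    for name, value in values.items():
        # repr() yields valid Python literals for strings and lists of strings.
        lines.append(f"{name} = {value!r}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
source = render_environment_module({
    "SAGEMAKER_KMS_KEY_ID": "arn:aws:kms:REGION:ACCOUNT:key/KEY-ID",
    "SAGEMAKER_SECURITY_GROUPS": ["sg-EXAMPLE"],
    "SAGEMAKER_SUBNETS": ["subnet-EXAMPLE-A", "subnet-EXAMPLE-B"],
})
print(source)
```

Writing the rendered text to a directory on the kernel's import path is what would make `import sagemaker_environment` work without the data scientist ever handling the raw identifiers.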

In [4]:
# Create Networking configuration required for all APIs. 
from sagemaker.network import NetworkConfig
import sagemaker_environment as smenv

cmk_id = smenv.SAGEMAKER_KMS_KEY_ID
sec_groups = smenv.SAGEMAKER_SECURITY_GROUPS
subnets = smenv.SAGEMAKER_SUBNETS
network_config = NetworkConfig(security_group_ids=sec_groups, subnets=subnets)

print("Using KMS Key ID {}".format(cmk_id))
print("Will deploy SageMaker resources into Subnets {}".format(subnets))
print("And with security groups {}".format(sec_groups))

Using KMS Key ID arn:aws:kms:us-west-2:535574150626:key/eed18653-ccc1-4dcb-b013-50dab92adf16
Will deploy SageMaker resources into Subnets ['subnet-02dc977bb0a9ff339', 'subnet-07882c94836e6cf42', 'subnet-084036f956a05e8a3']
And with security groups ['sg-000de4a8f30335607']


## Part A.3: Install Approved Libraries using pip

Typically, when we use pip to install packages, the code is downloaded over the public internet from public PyPI servers. However, most financial services customers do not allow public internet access from their notebook environment. To work within those guidelines, we have linked the private, internet-*disabled* VPC that the data scientists use to a centralized Shared Services VPC. This VPC lets us install approved packages, since many customers need to validate open source packages through their application security processes before teams can use them.

By using a shared services VPC, we create a separation between the private data scientist VPC and an internet-facing VPC. We can host a PyPI mirror in the shared services VPC and expose it to the private VPC as an endpoint service over AWS PrivateLink. The libraries we need have been installed by the Lifecycle Configuration script from the shared services PyPI mirror. For the purpose of demonstration, we will also pip install xgboost, which is hosted as an approved package on the centralized PyPI mirror.
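pip can be pointed at a private mirror through its configuration file. The sketch below renders a minimal `pip.conf` using the real `index-url` and `trusted-host` options; the endpoint hostname is hypothetical, and how the workshop's lifecycle script actually configures pip is not shown here.

```python
import configparser
import io
from urllib.parse import urlparse


def build_pip_conf(index_url: str) -> str:
    """Return pip.conf text that makes index_url the only package index."""
    config = configparser.ConfigParser()
    config["global"] = {
        "index-url": index_url,
        # Needed only when the mirror is served over plain HTTP.
        "trusted-host": urlparse(index_url).hostname,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical endpoint name, for illustration only.
print(build_pip_conf("http://pypi-mirror.example.internal/simple/"))
```

With a file like this in place, a plain `pip install` resolves packages only from the mirror, which is consistent with the "Looking in indexes" line that pip prints in this notebook.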

In [5]:
# Let's install the xgboost library from our private PyPI mirror.
! pip install xgboost==0.90

Looking in indexes: http://vpce-0959765994c4e93f2-23st622n.vpce-svc-01b9a5e6e457fdfa9.us-west-2.vpce.amazonaws.com/simple/


In [6]:
# Import xgboost and a custom utilities package we use in this notebook
import xgboost as xgb
from util import utilsspec 



## Part A.4: Artifact Management 

During the machine learning lifecycle a number of artifacts will be generated by our data processing jobs, training jobs and experimentation.  To store these artifacts we specify the bucket locations where the model and data artifacts will reside below. These inputs are then fed into the SageMaker Estimators during data pre-processing and model training.

SageMaker will automatically look in the specified buckets for accessing any training/validation data, and ensure that model outputs are stored in the output directories specified.

Later on, we will see how to track these artifacts using the SageMaker Experiments API.

The workshop pre-provisioned a set of buckets and their names are included in our `sagemaker_environment.py` file so we will simply import those here directly. 

In [7]:
# Buckets have been created as part of the Secure Data Science Workshop. 
# Here you will create references to those buckets for later use.

# rawbucket: stores raw data and any preprocessing job related code.
# data_bucket: stores train/test data for training/validating ML models.
# output_bucket: where the model artifacts and outputs will be stored.
# For this notebook these buckets are the same, but as best practice, 
# you probably want to keep them separate with different permissions.

rawbucket = smenv.SAGEMAKER_DATA_BUCKET
data_bucket = smenv.SAGEMAKER_DATA_BUCKET
output_bucket = smenv.SAGEMAKER_MODEL_BUCKET

prefix = 'secure-sagemaker-demo' # use this prefix to store all files pertaining to this workshop.

dataprefix = prefix + '/data'
traindataprefix = prefix + '/train_data'
testdataprefix = prefix + '/test_data'

print("Storing training data to s3://{}".format(data_bucket))
print("Training job output will be stored in s3://{}".format(output_bucket))

Storing training data to s3://ds-data-bucket-barto-abc-123-dev
Training job output will be stored in s3://ds-model-bucket-barto-abc-123-dev


# Section B: Pre-processing and Feature Engineering

A key part of the data science lifecycle is data exploration, pre-processing, and feature engineering. You will see how to use SageMaker notebooks for data exploration and SageMaker Processing for feature engineering and data pre-processing.

### Download and Import the data

For this notebook, we use the public Credit Card default dataset downloaded from UCI, which can be found here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.

[1] Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

Because you are not connected to the Internet here, the dataset has already been downloaded and made available on your local notebook instance.


In [8]:
WORKDIR = os.getcwd()
BASENAME = os.path.dirname(WORKDIR)

In [9]:
data = pd.read_excel('credit_card_default_data.xls', header=1)
data = data.drop(columns = ['ID'])
data.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [10]:
data.rename(columns={"default payment next month": "Label"}, inplace=True)
lbl = data.Label
data = pd.concat([lbl, data.drop(columns=['Label'])], axis = 1)
data.head()

Unnamed: 0,Label,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,1,20000,2,2,1,24,2,2,-1,-1,...,689,0,0,0,0,689,0,0,0,0
1,1,120000,2,2,2,26,-1,2,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,0,90000,2,2,2,34,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,0,50000,2,2,1,37,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,0,50000,1,2,1,57,-1,0,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


## Part B.1: Feature Engineering with SageMaker Processing

While you can pre-process small amounts of data directly in a notebook, SageMaker Processing offloads the heavy lifting of pre-processing larger datasets by provisioning the underlying infrastructure, securely downloading the data from an S3 location to the processing container, running the processing scripts, storing the processed data in an output directory in Amazon S3 and deleting the underlying transient resources needed to run the processing job. Once the processing job is complete, the infrastructure used to run the job is wiped, and any temporary data stored on it is deleted.

Importantly, as we will see below, we can now track this part of our analysis process to ensure that the lineage of our downstream trained ML models can be versioned and tracked to a feature engineering pipeline.

### Data Encryption

To ensure that the processed data is encrypted at rest on the processing cluster, we pass a customer managed key through the `volume_kms_key` parameter below. This instructs Amazon SageMaker to encrypt the EBS volumes used during the processing job with the specified key. Because the data stored in our Amazon S3 buckets is already encrypted, the data is encrypted at rest at all times.

Amazon SageMaker always uses TLS-encrypted tunnels when working with Amazon S3, so data is also encrypted in transit when traveling to or from Amazon S3.

#### Create a feature engineering script

In [17]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.exceptions import DataConversionWarning
from sklearn.compose import make_column_transformer

warnings.filterwarnings(action='ignore', category=DataConversionWarning)

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    parser.add_argument('--random-split', type=int, default=0)
    args, _ = parser.parse_known_args()
    
    print('Received arguments {}'.format(args))

    input_data_path = os.path.join('/opt/ml/processing/input', 'rawdata.csv')
    
    print('Reading input data from {}'.format(input_data_path))
    df = pd.read_csv(input_data_path)
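    # Shuffle the rows; sample() returns a new DataFrame, so the result must be assigned.
    df = df.sample(frac=1, random_state=args.random_split)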
    
    COLS = df.columns
    newcolorder = ['PAY_AMT1','BILL_AMT1'] + list(COLS[1:])[:11] + list(COLS[1:])[12:17] + list(COLS[1:])[18:]
    
    split_ratio = args.train_test_split_ratio
    random_state=args.random_split
    
    X_train, X_test, y_train, y_test = train_test_split(df.drop('Label', axis=1), df['Label'], 
                                                        test_size=split_ratio, random_state=random_state)
    
    preprocess = make_column_transformer(
        (['PAY_AMT1'], StandardScaler()),
        (['BILL_AMT1'], MinMaxScaler()),
    remainder='passthrough')
    
    print('Running preprocessing and feature engineering transformations')
    train_features = pd.DataFrame(preprocess.fit_transform(X_train), columns = newcolorder)
    test_features = pd.DataFrame(preprocess.transform(X_test), columns = newcolorder)
    
    # concat to ensure Label column is the first column in dataframe
    train_full = pd.concat([pd.DataFrame(y_train.values, columns=['Label']), train_features], axis=1)
    test_full = pd.concat([pd.DataFrame(y_test.values, columns=['Label']), test_features], axis=1)
    
    print('Train data shape after preprocessing: {}'.format(train_features.shape))
    print('Test data shape after preprocessing: {}'.format(test_features.shape))
    
    train_features_headers_output_path = os.path.join('/opt/ml/processing/train_headers', 'train_data_headers.csv')
    
    train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_data.csv')
    
    test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_data.csv')
    
    print('Saving training features to {}'.format(train_features_output_path))
    train_full.to_csv(train_features_output_path, header=False, index=False)
    print("Complete")
    
    print("Save training data with headers to {}".format(train_features_headers_output_path))
    train_full.to_csv(train_features_headers_output_path, index=False)
                 
    print('Saving test features to {}'.format(test_features_output_path))
    test_full.to_csv(test_features_output_path, header=False, index=False)
    print("Complete")

Overwriting preprocessing.py
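The positional slicing that builds `newcolorder` in the script is easy to misread. A column transformer emits the explicitly transformed columns first and then the passthrough columns in their original order, so the new header list has to follow the same layout. The standalone check below reproduces the slicing on the column names of this dataset and compares it with a name-based equivalent; the variable `by_name` is introduced here only for illustration.

```python
# Column order of rawdata.csv as written by this notebook: Label first,
# followed by the 23 features of the UCI dataset.
COLS = (["Label", "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
        + ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]
        + [f"BILL_AMT{i}" for i in range(1, 7)]
        + [f"PAY_AMT{i}" for i in range(1, 7)])

features = COLS[1:]
# Same slicing as preprocessing.py: skip index 11 (BILL_AMT1) and 17 (PAY_AMT1).
newcolorder = (["PAY_AMT1", "BILL_AMT1"]
               + features[:11] + features[12:17] + features[18:])

# A more self-documenting equivalent that does not rely on positions.
scaled = ["PAY_AMT1", "BILL_AMT1"]
by_name = scaled + [c for c in features if c not in scaled]

print(newcolorder == by_name)
```

The name-based form keeps working if columns are added or reordered upstream, whereas the index-based form silently mislabels columns in that case.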


#### Upload raw data to S3 for processing

In [11]:
if not os.path.exists('rawdata/rawdata.csv'):
    os.makedirs('rawdata', exist_ok=True)  # safe even if the directory already exists
    data.to_csv('rawdata/rawdata.csv', index=False)
    
#upload the raw data to S3.
rawdataprefix = 'rawdata'
raw_data_location = sess.upload_data(rawdataprefix, bucket=rawbucket, key_prefix=dataprefix)
print("Raw data has been uploaded to {} for processing".format (raw_data_location))



Raw data has been uploaded to s3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/data for processing


#### Upload the processing script to S3 for execution

In [18]:
# Copy the preprocessing code over to the s3 bucket
codeprefix = prefix + '/code'
codeupload = sess.upload_data('preprocessing.py', bucket=rawbucket, key_prefix=codeprefix)
print("Uploaded feature engineering script to {}".format (codeupload))                                                                          



Uploaded feature engineering script to s3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/code/preprocessing.py


#### Configure where processing results will be stored

In [19]:
train_data_location = 's3://'+ data_bucket +'/'+ traindataprefix
train_header_location = 's3://'+ data_bucket +'/'+ prefix +'/train_headers'
test_data_location = 's3://'+ data_bucket +'/'+ testdataprefix

print("Training data will be written to {}".format(train_data_location))
print("Test data will be written to {}".format(test_data_location))

Training data will be written to s3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/train_data
Test data will be written to s3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/test_data


#### Execute the processing job

In [20]:
# Use SageMaker Processing with Sk Learn. -- split data into train and test sets at this stage if possible.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(
    framework_version='0.20.0',
    role=role,
    instance_type='ml.c4.xlarge',
    instance_count=1,
    network_config=network_config,  # attach SageMaker resources to your VPC
    volume_kms_key=cmk_id  # encrypt the EBS volume attached to the SageMaker Processing instance
)

sklearn_processor.run(
    code=codeupload,
    inputs=[
        ProcessingInput(
            source=raw_data_location, destination='/opt/ml/processing/input')
    ],
    outputs=[
        ProcessingOutput(
            output_name='train_data',
            source='/opt/ml/processing/train',
            destination=train_data_location),
        ProcessingOutput(
            output_name='test_data',
            source='/opt/ml/processing/test',
            destination=test_data_location),
        ProcessingOutput(
            output_name='train_data_headers',
            source='/opt/ml/processing/train_headers',
            destination=train_header_location)
    ],
    arguments=['--train-test-split-ratio', '0.2'])

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']


INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2020-06-16-00-06-51-743



Job Name:  sagemaker-scikit-learn-2020-06-16-00-06-51-743
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/data', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://ds-data-bucket-barto-abc-123-dev/secure-sagemaker-demo/test_data', 'LocalPath': '/opt/ml/

# Section C: Model development and Training

## Part C.1: Traceability and Auditability

We use SageMaker Experiments for data scientists to track the lineage of the model from the raw data source to the preprocessing steps and the model training pipeline. With SageMaker Experiments, data scientists can compare, track, and manage multiple different model training jobs, data processing jobs, and hyperparameter tuning jobs, retaining a lineage from the source data to the training job artifacts to the model hyperparameters and any custom metrics that they may want to monitor as part of the model training.

Here we use SageMaker's managed XGBoost container to train an XGBoost model. More details about the managed container can be found here: https://github.com/aws/sagemaker-xgboost-container

Many customers require tracking and lineage down to the source code level: which user made the most recent commit to the training code that generated the deployed production model. We demonstrate how this is done using GitHub APIs integrated into SageMaker Experiments.

In [21]:
# Create a SageMaker Experiment
cc_experiment = Experiment.create(
    experiment_name=f"CreditCardDefault-{int(time.time())}", 
    description="Predict credit card default from payments data", 
    sagemaker_boto_client=sm)
print(cc_experiment)


Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f0e0255c048>,experiment_name='CreditCardDefault-1592266266',description='Predict credit card default from payments data',experiment_arn='arn:aws:sagemaker:us-west-2:535574150626:experiment/creditcarddefault-1592266266',response_metadata={'RequestId': '1ab5b0cf-c54d-4eaf-84a3-3371bfb3ecf0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1ab5b0cf-c54d-4eaf-84a3-3371bfb3ecf0', 'content-type': 'application/x-amz-json-1.1', 'content-length': '100', 'date': 'Tue, 16 Jun 2020 00:11:06 GMT'}, 'RetryAttempts': 0})


Now we can track our SageMaker Processing job as shown below. Here we track the train_test_split_ratio, but we can also track other metadata, such as the instance types used to run the processing job or feature engineering details such as the random seed used to generate the train and test splits.

In [22]:
# Start Tracking parameters used in the Pre-processing pipeline.
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters({
        "train_test_split_ratio": 0.2
    })
    # we can log the s3 uri to the dataset we just uploaded
    tracker.log_input(name="ccdefault-raw-dataset", media_type="s3/uri", value=raw_data_location)
    tracker.log_input(name="ccdefault-train-dataset", media_type="s3/uri", value=train_data_location)
    tracker.log_input(name="ccdefault-test-dataset", media_type="s3/uri", value=test_data_location)
    

### Train the Model

The same security practices we applied previously during SM Processing apply to training jobs. We will also have SageMaker experiments track the training job and store metadata such as model artifact location, training and validation data location, and model hyperparameters.

**Managed Spot Training**: To save on cost, we run the training using managed Spot instances. SageMaker will automatically look to see if any spot instances of the desired type are available for a max time less than the max wait time, and if one is available, run your training job on the lower cost instance. With Managed Spot, customers can benefit from up-to 90% savings in cost.

For bring-your-own containers, customers are responsible for checkpointing models so that training can resume if a Spot instance is interrupted. For some SageMaker built-in algorithms, as well as the SageMaker managed containers for TensorFlow/PyTorch/MXNet, SageMaker handles the model checkpointing. For others, such as XGBoost, we limit the maximum wait time (`train_max_wait`) to 3600 seconds.
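
The relationship between run time, wait time, and savings can be sketched in a few lines. This is an illustrative sketch only: both helper functions are hypothetical, and the training and billable seconds are made-up numbers rather than values from this notebook's job. It assumes that the maximum wait time must be at least the maximum run time, and that savings are the fraction of training seconds that were not billed.

```python
def check_spot_settings(max_run_seconds, max_wait_seconds):
    """Return True if the wait budget is large enough for a managed Spot job."""
    # The wait time covers both waiting for capacity and the run itself,
    # so it must be at least as large as the maximum run time.
    return max_wait_seconds >= max_run_seconds


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of training time that was not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return round((1 - billable_seconds / training_seconds) * 100, 1)


# The estimators in this notebook use 3600 seconds for both settings
print(check_spot_settings(3600, 3600))
# Made-up example: 100 seconds of training, 30 of them billed
print(spot_savings_percent(100, 30))
```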

## Lab 4: Train Without VPC Configured

To test our networking controls, run the following cell. It first attempts to train the model without a VPC and network configuration attached. You should see that the training job stops as soon as the "Downloading - Downloading input data" step completes. 

### Detective control explained

The training job was terminated by an AWS Lambda function that was executed in response to a CloudWatch Event that was triggered when the training job was created. 

Assume the role of the Data Science Administrator and review the code of the [AWS Lambda function SagemakerTrainingJobVPCEnforcer](https://console.aws.amazon.com/lambda/home?#/functions/SagemakerTrainingJobVPCEnforcer?tab=configuration). 

Also review the [CloudWatch Event rule SagemakerTrainingJobVPCEnforcementRule](https://console.aws.amazon.com/cloudwatch/home?#rules:name=SagemakerTrainingJobVPCEnforcementRule) and take note of the event which triggers execution of the Lambda function.

---


In [31]:
from sagemaker.amazon.amazon_estimator import get_image_uri
image = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/'.format(data_bucket, traindataprefix), content_type='csv')
s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/'.format(data_bucket, testdataprefix), content_type='csv')

	get_image_uri(region, 'xgboost', '1.0-1').


In [34]:

xgb = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    train_max_run=3600,
    output_path='s3://{}/{}/models'.format(output_bucket, prefix),
    sagemaker_session=sess,
    train_use_spot_instances=True,
    train_max_wait=3600,
    encrypt_inter_container_traffic=False
)  

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective='binary:logistic',
    num_round=100)

xgb.fit(inputs={'train': s3_input_train})


INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2020-06-16-00-34-31-945


ClientError: An error occurred (AccessDeniedException) when calling the CreateTrainingJob operation: User: arn:aws:sts::535574150626:assumed-role/ds-notebook-role-barto-abc-123-dev-jpb/SageMaker is not authorized to perform: sagemaker:CreateTrainingJob on resource: arn:aws:sagemaker:us-west-2:535574150626:training-job/sagemaker-xgboost-2020-06-16-00-34-31-945 with an explicit deny
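
The explicit deny reported in the error above is what a preventive IAM control would produce: the request is rejected before any training job is created. The policy attached to the notebook role is not shown in this notebook, so the following is only a sketch of what such a statement could look like. It assumes the `sagemaker:VpcSubnets` condition key is used, and the statement ID is a made-up name.

```python
import json

# Hypothetical policy: deny CreateTrainingJob whenever no VPC subnets are supplied
deny_training_without_vpc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyTrainingJobsWithoutVpc",  # made-up statement ID
            "Effect": "Deny",
            "Action": ["sagemaker:CreateTrainingJob"],
            "Resource": "*",
            # "Null": "true" matches requests where the key is absent,
            # that is, requests submitted without a VPC configuration
            "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
        }
    ],
}

print(json.dumps(deny_training_without_vpc, indent=2))
```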

## Train with VPC

With the network settings defined above added to the estimator, the training job should now run to completion.

In [26]:
preprocessing_trial_component = tracker.trial_component

trial_name = f"cc-fraud-training-job-{int(time.time())}"
cc_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=cc_experiment.experiment_name,
    sagemaker_boto_client=sm)

cc_trial.add_trial_component(preprocessing_trial_component)
cc_training_job_name = "cc-training-job-{}".format(int(time.time()))
xgb = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    train_max_run=3600,
    output_path='s3://{}/{}/models'.format(output_bucket, prefix),
    sagemaker_session=sess,
    train_use_spot_instances=True,
    train_max_wait=3600,
    subnets=subnets, 
    security_group_ids=sec_groups,
    train_volume_kms_key=cmk_id,
    encrypt_inter_container_traffic=False)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective='binary:logistic',
    num_round=100)

xgb.fit(
    inputs={'train': s3_input_train},
    job_name=cc_training_job_name,
    experiment_config={
        "TrialName": cc_trial.trial_name,  # log training job in Trials for lineage
        "TrialComponentDisplayName": "Training",
    },
    wait=True)


INFO:sagemaker:Creating training-job with name: cc-training-job-1592266357


2020-06-16 00:12:38 Starting - Starting the training job...
2020-06-16 00:12:40 Starting - Launching requested ML instances.........
2020-06-16 00:14:15 Starting - Preparing the instances for training........................
2020-06-16 00:18:15 Starting - Launched instance was unhealthy, replacing it!...
2020-06-16 00:19:14 Starting - Preparing the instances for training......
2020-06-16 00:20:04 Downloading - Downloading input data...
2020-06-16 00:20:35 Training - Downloading the training image..
2020-06-16 00:21:09 Uploading - Uploading generated training model
2020-06-16 00:21:09 Completed - Training job completed
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGB

## 5. Traceability and Auditability from source control to Model artifacts

Having used SageMaker Experiments to track the training runs, we can now extract model metadata to get the entire lineage of the model from the source data to the model artifacts and the hyperparameters.

To do this, simply call the **describe_trial_component** API.

In [27]:
# Present the Model Lineage as a dataframe
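from sagemaker.session import Session
# Use a separate name so the SageMaker session `sess` passed to the estimators above is not overwritten
boto_session = boto3.Session()
lineage_table = ExperimentAnalytics(
    sagemaker_session=Session(boto_session, sm), 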
    search_expression={
        "Filters":[{
            "Name": "Parents.TrialName",
            "Operator": "Equals",
            "Value": trial_name
        }]
    },
    sort_by="CreationTime",
    sort_order="Ascending",
)
lineagedf= lineage_table.dataframe()

lineagedf

| | TrialComponentName | DisplayName | train_test_split_ratio | SourceArn | SageMaker.ImageUri | SageMaker.InstanceCount | SageMaker.InstanceType | SageMaker.VolumeSizeInGB | eta | gamma | ... | num_round | objective | subsample | verbosity | train:error - Min | train:error - Max | train:error - Avg | train:error - StdDev | train:error - Last | train:error - Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TrialComponent-2020-06-16-001203-aryp | Preprocessing | 0.2 | | | | | | | | ... | | | | | | | | | | |
| 1 | cc-training-job-1592266357-aws-training-job | Training | | arn:aws:sagemaker:us-west-2:535574150626:train... | 246618743249.dkr.ecr.us-west-2.amazonaws.com/s... | 1.0 | ml.m4.xlarge | 30.0 | 0.2 | 4.0 | ... | 100.0 | binary:logistic | 0.8 | 0.0 | 0.158875 | 0.178542 | 0.167194 | 0.005166 | 0.159375 | 95.0 |


In [28]:
# get detailed information about a particular trial
sm.describe_trial_component(TrialComponentName=lineagedf.TrialComponentName[1])

{'TrialComponentName': 'cc-training-job-1592266357-aws-training-job',
 'TrialComponentArn': 'arn:aws:sagemaker:us-west-2:535574150626:experiment-trial-component/cc-training-job-1592266357-aws-training-job',
 'DisplayName': 'Training',
 'Source': {'SourceArn': 'arn:aws:sagemaker:us-west-2:535574150626:training-job/cc-training-job-1592266357',
  'SourceType': 'SageMakerTrainingJob'},
 'Status': {'PrimaryStatus': 'Completed',
  'Message': 'Status: Completed, secondary status: Completed, failure reason: .'},
 'StartTime': datetime.datetime(2020, 6, 16, 0, 20, 4, tzinfo=tzlocal()),
 'EndTime': datetime.datetime(2020, 6, 16, 0, 21, 9, tzinfo=tzlocal()),
 'CreationTime': datetime.datetime(2020, 6, 16, 0, 12, 38, 377000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedTime': datetime.datetime(2020, 6, 16, 0, 21, 9, 757000, tzinfo=tzlocal()),
 'LastModifiedBy': {},
 'Parameters': {'SageMaker.ImageUri': {'StringValue': '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:0.90-1-cpu
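
The `Parameters` block of this response nests each value under a `StringValue` or `NumberValue` key, which is awkward to read at a glance. A minimal sketch of flattening it into a plain dict follows. `flatten_parameters` is a hypothetical helper, and `sample_response` is a hand-written fragment in the same shape as the output above (using the `eta` and `objective` values from the lineage table), not a real API response.

```python
def flatten_parameters(trial_component):
    """Map each parameter name to its plain string or numeric value."""
    flat = {}
    for name, value in trial_component.get("Parameters", {}).items():
        if "StringValue" in value:
            flat[name] = value["StringValue"]
        elif "NumberValue" in value:
            flat[name] = value["NumberValue"]
    return flat


# Hand-written fragment shaped like a describe_trial_component response
sample_response = {
    "TrialComponentName": "example-training-job",
    "Parameters": {
        "eta": {"NumberValue": 0.2},
        "objective": {"StringValue": "binary:logistic"},
    },
}

print(flatten_parameters(sample_response))
```

In the notebook, the same helper could be applied to the dict returned by `sm.describe_trial_component(...)`.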

# Conclusions of this notebook

In this notebook we have demonstrated the following aspects of a secure data science process that are relevant to you as an administrator.

1. **Security:**

  1. Isolation from the internet. 
  1. Data exploration and storage of raw data using encryption keys, with data encrypted at rest and in transit. 
  1. Data movement is entirely controlled through PrivateLink.

2. **Pre-processing:** 
  
  Data preprocessing both in the notebook and, in a secure manner, using SageMaker Processing with encryption and networking guardrails for data movement.

3. **Using SageMaker Built-in XGB to train:** 
  
  Training a built-in SageMaker algorithm.

4. **Cost Optimization:** 
  
  Training using Spot Instances to save cost. 

5. **Lineage and Tracking:** 
  
  Tracking of model lineage as well as pre-processing job parameters using SageMaker Experiments.

> **The information included in this notebook is for illustrative purposes only. Nothing in this notebook is intended to provide you legal, compliance, or regulatory guidance. You should review the laws that apply to you.**