# Amazon SageMaker administration and security workshop: Lab 1

This notebook contains hands-on exercises for the workshop **Amazon SageMaker administration and security** – Lab 1.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Part 2: Internet access via NAT gateway</i></h2>
    <br>
    <p style=" text-align: center; margin: auto;">At this point you have internet connectivity in Studio notebooks.</p>
    <p style=" text-align: center; margin: auto;"><b>Make sure you have completed the instructions in workshop lab 1 and provisioned a NAT gateway in your AWS account.</b></p>
    <br>
</div>

In [None]:
%pip install --upgrade pip sagemaker

## Import packages and load variables

In [None]:
import time
import os
import json
import boto3
import numpy as np  
import pandas as pd 
import sagemaker

sagemaker.__version__

In [None]:
%store -r 

%store

try:
    initialized
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 01-lab-01 notebook         ")
    print("++++++++++++++++++++++++++++++++++++++++++")
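
The check above only tests `initialized`. A slightly broader guard can be sketched as follows, assuming the earlier notebook also stored `bucket_name` and `bucket_prefix` (both are used later in this notebook without being defined here). The helper name `missing_names` and the `REQUIRED` list are hypothetical, not part of the workshop code.

```python
from typing import Iterable, List, Mapping


def missing_names(required: Iterable[str], namespace: Mapping[str, object]) -> List[str]:
    """Return the required variable names that are absent from the namespace."""
    return [name for name in required if name not in namespace]


# Assumption: these are the variables this notebook expects from `%store -r`.
REQUIRED = ["initialized", "bucket_name", "bucket_prefix"]

# In the notebook you would pass globals(); a plain dict stands in for it here.
example_namespace = {"initialized": True, "bucket_name": "example-bucket"}
print(missing_names(REQUIRED, example_namespace))
```

In the notebook, a non-empty result from `missing_names(REQUIRED, globals())` would indicate that the `01-lab-01` notebook has not been run to completion.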

### Set constants

In [None]:
# Get the variables you need to interact with the SageMaker service
boto_session = boto3.Session()
sm_session = sagemaker.Session()
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()

print(sm_role)

## Experiment with internet connectivity

<div class="alert alert-info"> 💡   The next statement must return an IP address if you have internet access in the notebook via the provisioned NAT gateway.
</div>

In [None]:
# This call returns one of the NAT gateway public IPs
!curl checkip.amazonaws.com
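
The `curl` call above prints whatever the service returns. As a minimal sketch, the same check could be done in Python and the response validated as a public IPv4 address. The function names `parse_public_ip` and `fetch_public_ip` are illustrative, and the sketch assumes the service answers over HTTPS with a single IPv4 address in plain text.

```python
import ipaddress
import urllib.request


def parse_public_ip(text: str) -> ipaddress.IPv4Address:
    """Parse the service response and make sure it is a public IPv4 address."""
    address = ipaddress.IPv4Address(text.strip())
    if not address.is_global:
        raise ValueError(f"{address} is not a public address")
    return address


def fetch_public_ip(
    url: str = "https://checkip.amazonaws.com", timeout: float = 5.0
) -> ipaddress.IPv4Address:
    """Ask the check service which source IP it sees for this notebook."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_public_ip(response.read().decode("ascii"))


# Parsing works offline; 8.8.8.8 is only a well-known public address used as sample input.
print(parse_public_ip("8.8.8.8\n"))
# With internet access via the NAT gateway: print(fetch_public_ip())
```

If the notebook has no route to the internet, `fetch_public_ip()` raises an error after the timeout instead of hanging.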

## Use VPC endpoint policies

In [None]:
ssm = boto3.client("ssm")

In [None]:
boto3.client("sts").get_caller_identity()

In [None]:
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name

account_id, region

In [None]:
%store account_id

The following code cells call the `ssm:GetParameter` API. Experiment with the VPC endpoint policies by completing the following steps.

In [None]:
# Print the current execution role the notebook uses
sagemaker.get_execution_role()

In [None]:
# This call will succeed because the execution role has the SSM permissions in its permission policy
ssm.get_parameter(Name=f"sagemaker-admin-workshop-{region}-{account_id}-kms-vpce-id")

Now change the VPC endpoint policy for the SSM VPC endpoint. The new policy denies any SSM API call to all principals except the MLOps execution role:
```
{
    "Statement": [
        {
            "Action": [
                "ssm:*"
            ],
            "Principal": "*",
            "Resource": [
                "*"
            ],
            "Effect": "Deny",
            "Condition": {
                "ArnNotEquals": {
                    "aws:PrincipalArn": "<MLOps execution role ARN>"
                }
            }
        }
    ]
}
```
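
If you prefer to generate the policy document rather than edit it by hand, a minimal sketch could look like this. The function name `build_ssm_deny_policy`, the example ARN, and the endpoint ID in the comment are placeholders for illustration; substitute the values from your own environment.

```python
import json


def build_ssm_deny_policy(allowed_role_arn: str) -> dict:
    """Build the endpoint policy shown above for the given role ARN."""
    return {
        "Statement": [
            {
                "Action": ["ssm:*"],
                "Principal": "*",
                "Resource": ["*"],
                "Effect": "Deny",
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": allowed_role_arn}
                },
            }
        ]
    }


# Placeholder ARN for illustration only; use your MLOps execution role ARN.
example_arn = "arn:aws:iam::111122223333:role/example-mlops-execution-role"
policy_document = json.dumps(build_ssm_deny_policy(example_arn), indent=4)
print(policy_document)

# With the required EC2 permissions, the document could be attached to the
# endpoint like this (the endpoint ID is a placeholder):
# boto3.client("ec2").modify_vpc_endpoint(
#     VpcEndpointId="vpce-EXAMPLE", PolicyDocument=policy_document
# )
```

Generating the JSON from a dictionary avoids quoting mistakes when you paste the role ARN into the policy.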

In [None]:
# Now the call will fail because the VPC endpoint policy denies all ssm API access
ssm.get_parameter(Name=f"sagemaker-admin-workshop-{region}-{account_id}-kms-vpce-id")

Remove the VPC endpoint policy for the SSM VPC endpoint.

## Use network configuration in SageMaker jobs

In [None]:
from sagemaker.network import NetworkConfig

The CloudFormation template stored the resource IDs of the provisioned security groups and private subnets as SSM parameters. Now you can use these parameters to set up the `NetworkConfig`.

In [None]:
security_group_ids=ssm.get_parameter(Name=f"sagemaker-admin-workshop-{region}-{account_id}-sagemaker-sg-ids")["Parameter"]["Value"]
private_subnet_ids=ssm.get_parameter(Name=f"sagemaker-admin-workshop-{region}-{account_id}-private-subnet-ids")["Parameter"]["Value"]

security_group_ids, private_subnet_ids

In [None]:
# Construct the NetworkConfig with the values for your environment
network_config = NetworkConfig(
        enable_network_isolation=False, 
        security_group_ids=security_group_ids.split(','),
        subnets=private_subnet_ids.split(','),
        encrypt_inter_container_traffic=True)
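
The plain `split(',')` used above works when the stored parameter value contains no spaces or trailing commas. A slightly more defensive variant, assuming the values might contain stray whitespace, could be sketched as follows. The helper name `split_ids` and the sample IDs are made up for illustration.

```python
from typing import List


def split_ids(value: str) -> List[str]:
    """Split a comma-separated parameter value into clean, non-empty IDs."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Made-up IDs for illustration; real values come from the SSM parameters.
print(split_ids("sg-0123456789abcdef0, sg-0fedcba9876543210,"))
```

You could then pass `split_ids(security_group_ids)` and `split_ids(private_subnet_ids)` to `NetworkConfig` in place of the direct `split` calls.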

The following section loads a sample dataset, creates a sample data processing script, and runs the data processing with a SageMaker processing job.

### Download the sample dataset
This example uses the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository:
> [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

In [None]:
# Download and unzip the dataset. You must have an internet connection to download the data
!wget -P data/ -N https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

In [None]:
import zipfile

with zipfile.ZipFile("data/bank+marketing.zip", "r") as z:
    print("Unzipping bank+marketing...")
    z.extractall("data")

with zipfile.ZipFile("data/bank-additional.zip", "r") as z:
    print("Unzipping bank-additional...")
    z.extractall("data")

print("Done")

In [None]:
# See the data
df_data = pd.read_csv("./data/bank-additional/bank-additional-full.csv", sep=";")

pd.set_option("display.max_columns", 500)  # View all of the columns
df_data  # show first 5 and last 5 rows of the dataframe

###  Data processing
Execute the following cells to run data processing using a SageMaker processing job. You'll also continue experimenting with it in the next notebook.

In [None]:
target_col = "y"

In [None]:
%store target_col

In [None]:
input_s3_url = sm_session.upload_data(
    path="data/bank-additional/bank-additional-full.csv",
    bucket=bucket_name,
    key_prefix=f"{bucket_prefix}/input"
)

%store input_s3_url

In [None]:
!aws s3 ls {bucket_name}/{bucket_prefix} --recursive

In [None]:
%%writefile preprocessing.py

import pandas as pd
import numpy as np
import argparse
import os

def _parse_args():
    
    parser = argparse.ArgumentParser()
    # Input and output locations inside the processing container
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='bank-additional-full.csv')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')
    
    return parser.parse_known_args()


if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    
    target_col = "y"
    
    # Load data
    df_data = pd.read_csv(os.path.join(args.filepath, args.filename), sep=";")

    # Indicator variable to capture when pdays takes a value of 999
    df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

    # Indicator for individuals not actively employed
    df_data["not_working"] = np.where(
        df_data["job"].isin(["student", "retired", "unemployed"]), 1, 0
    )

    # remove unnecessary data
    df_model_data = df_data.drop(
        ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
        axis=1,
    )

    df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

    # Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
    df_model_data = pd.concat(
        [
            df_model_data["y_yes"].rename(target_col),
            df_model_data.drop(["y_no", "y_yes"], axis=1),
        ],
        axis=1,
    )

    # Shuffle and split the dataset into train (70%), validation (20%), and test (10%)
    train_data, validation_data, test_data = np.split(
        df_model_data.sample(frac=1, random_state=1729),
        [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
    )

    print(f"Data split > train:{train_data.shape} | validation:{validation_data.shape} | test:{test_data.shape}")
    
    # Save datasets locally
    train_data.to_csv(os.path.join(args.outputpath, 'train/train.csv'), index=False, header=False)
    validation_data.to_csv(os.path.join(args.outputpath, 'validation/validation.csv'), index=False, header=False)
    test_data[target_col].to_csv(os.path.join(args.outputpath, 'test/test_y.csv'), index=False, header=False)
    test_data.drop([target_col], axis=1).to_csv(os.path.join(args.outputpath, 'test/test_x.csv'), index=False, header=False)
    
    print("## Processing complete. Exiting.")
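
The split in the script above cuts the shuffled data at the 70% and 90% marks, which yields train, validation, and test sets of roughly 70%, 20%, and 10%. The following standalone sketch illustrates only that cut-point logic on a plain sequence. It uses Python's `random` module instead of `DataFrame.sample`, so the shuffle order differs from the script, and the function name `shuffle_and_split` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_and_split(rows: Sequence, seed: int = 1729) -> Tuple[List, List, List]:
    """Shuffle rows reproducibly, then cut them at the 70% and 90% marks."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(0.7 * len(shuffled))
    second_cut = int(0.9 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train, validation, test = shuffle_and_split(range(100))
print(len(train), len(validation), len(test))
```

Fixing the seed keeps the split reproducible between runs, which matters when you compare models trained in later notebooks.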

In [None]:
train_s3_url = f"s3://{bucket_name}/{bucket_prefix}/train"
validation_s3_url = f"s3://{bucket_name}/{bucket_prefix}/validation"
test_s3_url = f"s3://{bucket_name}/{bucket_prefix}/test"

In [None]:
%store train_s3_url
%store validation_s3_url
%store test_s3_url

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

In [None]:
framework_version = "0.23-1"
processing_instance_type = "ml.m5.large"
processing_instance_count = 1

In [None]:
# Define processing inputs and outputs
processing_inputs = [
        ProcessingInput(
            source=input_s3_url, 
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
]

processing_outputs = [
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_s3_url,
        ),
        ProcessingOutput(
            output_name="validation_data", 
            source="/opt/ml/processing/output/validation", 
            destination=validation_s3_url
        ),
        ProcessingOutput(
            output_name="test_data", 
            source="/opt/ml/processing/output/test", 
            destination=test_s3_url
        ),
]

In [None]:
# Create a processor
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=sm_role,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count, 
    base_job_name='sm-admin-workshop-processing',
    sagemaker_session=sm_session,
    network_config=network_config,
)



In [None]:
# This call will succeed and the processing job will finish
sklearn_processor.run(
        inputs=processing_inputs,
        outputs=processing_outputs,
        code='preprocessing.py',
        wait=True,
)



## Quiz
Answer the following questions to test your understanding of the introduced concepts.

In [None]:
from workshop_utils.quiz_questions import *

In [None]:
lab1_question1

In [None]:
lab1_question2

In [None]:
lab1_question3

In [None]:
lab1_question4

In [None]:
lab1_question5

## End of lab 1
Follow the instructions in lab 2 of the workshop and continue with the [`02-lab-02.ipynb`](02-lab-02.ipynb) notebook.

---

# Shut down kernel
Each notebook contains the following code to shut down the notebook kernel and free up resources. If you go back and forth between notebooks, you can keep the kernel running for the duration of the workshop. Keep an eye on the instance memory allocation: all notebooks that use a specific image, in this case `Data Science`, run on the same compute instance. The default compute instance is `ml.t3.medium` with 4 GB of memory, so you can run out of memory if you keep multiple kernels running. If that happens, you can switch to a larger instance for this workshop.

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>