# Overview

This notebook demonstrates machine learning (ML) development in browser-based Jupyter notebooks provided by Open Data Hub (ODH), running on Kubernetes with Red Hat OpenShift Service on AWS (ROSA), using the SageMaker Python SDK and backed by S3 storage.

# Before you begin

1. You need access to a [ROSA cluster](https://cloud.redhat.com/blog/red-hat-openshift-service-on-aws-is-now-generally-available) with cluster-admin privileges in an available region ([see Regions](https://docs.openshift.com/rosa/rosa_architecture/rosa_policy_service_definition/rosa-service-definition.html#rosa-sdpolicy-regions-az_rosa-service-definition)).
1. Your region needs to have SageMaker in the [available services](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
1. You need access to the AWS Console to create an IAM Execution role for [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/). [How to create an IAM Execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role).
1. You need to create an AWS [credentials file for the AWS SDK](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html).
1. [Secure setup](https://gist.github.com/dudash/33f4e385edeac2004306cd88f37e0f24)
1. [Create an STS role](https://docs.openshift.com/rosa/rosa_install_access_delete_clusters/rosa-sts-creating-a-cluster-quickly.html)
1. You need an S3 Storage Bucket with 'sagemaker' in the name and an Access Point. [How to create an S3 Storage Bucket with an Access Point]().
1. You need a Project (or namespace). [How to create a project, add team members, set limits and quotas]().
1. You need to deploy the ODH Operator in the namespace with 4 components: ODH Dashboard, JupyterHub, Jupyter Notebooks and ODH common. [How to deploy the Open Data Hub Operator]().
1. Update the Jupyter Notebook Environment Variables prior to starting the Jupyter Notebook Server:
    - AWS_REGION
    - EXECUTION_ROLE_ARN
    - S3_ENDPOINT_URL
    - S3_ACCESS_KEY_ID
    - S3_SECRET_ACCESS_KEY
    - S3_BUCKET
1. From the Jupyter Notebook terminal, run 'pip install -r requirements.txt'.

In the Jupyter control panel, set the environment variables listed above while selecting the notebook image and container size.

In [1]:
# Module Paths
REQUIREMENTS = 'requirements.txt'

In [2]:
%%writefile {REQUIREMENTS}

sagemaker
tensorflow
keras
numpy

Overwriting requirements.txt


In [None]:
# TODO ODH notebooks use horus for dependency management, you must install requirements from terminal
# TODO create ODH SageMaker Notebook image
#!pip install -r requirements.txt

## Verify your connection details

In [3]:
# Uncomment the line below to display the values of the environment variables. If any are not set, enter them in the notebook image spawner before starting the server.
# Note: the output includes S3_SECRET_ACCESS_KEY in plain text.
#!env | grep 'AWS\|S3\|ARN' | sort
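
As an alternative to printing the raw environment, the check can be sketched in Python. This is a minimal sketch, not part of the original notebook: the variable names come from the "Before you begin" list, while the `describe_env` helper and its masking rule (show only the last four characters of key-like values) are hypothetical.

```python
import os

# Names taken from the "Before you begin" list above.
REQUIRED_VARS = [
    "AWS_REGION",
    "EXECUTION_ROLE_ARN",
    "S3_ENDPOINT_URL",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
    "S3_BUCKET",
]

def describe_env(env=None, required=REQUIRED_VARS):
    """Return {name: status}; status is 'MISSING', a masked value, or the value."""
    env = os.environ if env is None else env
    report = {}
    for name in required:
        value = env.get(name)
        if not value:
            report[name] = "MISSING"
        elif "SECRET" in name or "KEY" in name:
            # Hypothetical masking rule: keep only the last 4 characters visible.
            report[name] = "*" * 8 + value[-4:]
        else:
            report[name] = value
    return report

# Print one line per variable without exposing full credentials.
for name, status in describe_env().items():
    print(f"{name}={status}")
```

Any variable reported as MISSING would need to be entered in the notebook image spawner before the server is started.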

# Imports

The Amazon SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker. With the SDK, you can train and deploy models using popular deep learning frameworks, algorithms provided by Amazon, or your own algorithms. https://sagemaker.readthedocs.io/en/stable/

Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2. Boto3 is maintained and published by Amazon Web Services. https://github.com/boto/boto3

In [4]:
# import the Amazon SageMaker Python SDK
import sagemaker

# TensorFlow estimator used to configure and run the training job
from sagemaker.tensorflow import TensorFlow

# import the AWS SDK for Python
import boto3

# import os for environment variables and file system operations
import os

# import numpy for n-dimensional arrays (the datasets below are NumPy arrays)
import numpy as np

# import keras, an open-source library that provides a Python interface for artificial neural networks
import keras

# this module provides a few toy datasets (already vectorized, in NumPy format) that can be used for debugging a model or creating simple code
from keras.datasets import fashion_mnist

2022-09-08 20:39:08.748998: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-08 20:39:08.895296: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-08 20:39:08.895336: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-09-08 20:39:08.927616: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-08 20:39:09.709008: W tensorflow/stream_executor/pla

In [5]:
# print the imported versions
print("SageMaker " + sagemaker.__version__)
print("Boto3 " + boto3.__version__)

SageMaker 2.108.0
Boto3 1.24.68


## AWS Configurations

### Set your region

In [6]:
# session manages interactions with the Amazon SageMaker APIs and any other AWS services needed.
sess = sagemaker.Session(boto3.session.Session(region_name=os.getenv('AWS_REGION')))

### Set your IAM role

In [7]:
# Create a role https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html

# option 1: use for non-local deployments of SageMaker, on systems where the AWS CLI and SageMaker CLI are installed
# role = sagemaker.get_execution_role()

# option 2: manually set the ARN of your IAM execution role
# to find the ARN from a terminal with the AWS CLI, run 'aws iam list-roles | grep SageMaker-Execution'
# to create a role or find its ARN in the console, go to, for example, https://us-east-1.console.aws.amazon.com/iamv2
role = os.getenv('EXECUTION_ROLE_ARN')
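
Because `os.getenv` returns `None` when the variable is unset, a quick format check can catch a missing or mistyped ARN before it reaches SageMaker. The following is a sketch only: `check_role_arn` is a hypothetical helper, and the pattern assumes the usual IAM role ARN layout (`arn:<partition>:iam::<12-digit account id>:role/<name>`).

```python
import re

# Assumed layout of an IAM role ARN: arn:<partition>:iam::<account id>:role/<path/name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")

def check_role_arn(arn):
    """Return the ARN unchanged if it looks like an IAM role ARN, else raise."""
    if not arn:
        raise ValueError("EXECUTION_ROLE_ARN is not set")
    if not ROLE_ARN_PATTERN.match(arn):
        raise ValueError(f"Not an IAM role ARN: {arn!r}")
    return arn

# Example with a made-up account id and role name.
print(check_role_arn("arn:aws:iam::123456789012:role/SageMaker-Execution-Example"))
```

In the notebook this could wrap the lookup, for example `role = check_role_arn(os.getenv('EXECUTION_ROLE_ARN'))`. The check only validates the shape of the string, not that the role exists.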

## Configure the storage

In [8]:
# For S3 Storage
# Edit this section using your own credentials
# enter your region name
s3_region = os.getenv('AWS_REGION')

# enter your S3 endpoint URL
s3_endpoint_url = os.getenv('S3_ENDPOINT_URL')

# enter your S3 access key ID
s3_access_key_id = os.getenv('S3_ACCESS_KEY_ID')

# enter your S3 secret access key
# TODO make OCP secret
# TODO OIDC identity
s3_secret_access_key = os.getenv('S3_SECRET_ACCESS_KEY')

# enter your S3 bucket name
s3_bucket = os.getenv('S3_BUCKET')

# configure boto S3 connection
s3 = boto3.client('s3',
                  region_name = s3_region,
                  #endpoint_url = s3_endpoint_url,
                  aws_access_key_id = s3_access_key_id,
                  aws_secret_access_key = s3_secret_access_key)
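
The client configuration can also be assembled as a dictionary, which makes it easy to inspect before connecting. This is a sketch under assumptions: `s3_client_kwargs` is a hypothetical helper, and the `use_endpoint` flag defaults to `False` to mirror the commented-out `endpoint_url` argument in the cell above.

```python
import os

def s3_client_kwargs(env=None, use_endpoint=False):
    """Build keyword arguments for boto3.client('s3', ...) from the environment."""
    env = os.environ if env is None else env
    kwargs = {
        "region_name": env.get("AWS_REGION"),
        "aws_access_key_id": env.get("S3_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("S3_SECRET_ACCESS_KEY"),
    }
    endpoint = env.get("S3_ENDPOINT_URL")
    if use_endpoint and endpoint:
        # Only pass a custom endpoint when explicitly requested.
        kwargs["endpoint_url"] = endpoint
    return kwargs

# Usage in the notebook (requires boto3 and valid credentials):
# s3 = boto3.client("s3", **s3_client_kwargs())
```

Passing the region by keyword rather than by position keeps the call readable if more arguments are added later.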

### Test your S3 bucket connection

In [9]:
s3.list_buckets()

{'ResponseMetadata': {'RequestId': '42PCZRV39DY480XN',
  'HostId': '249stXAqgDTzFFPvOhK7v5sf9ploTwnZXMwDRp+p/0Uw8Ua4zSCX04YgDR0z7UDsPc35m8D6PAY=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '249stXAqgDTzFFPvOhK7v5sf9ploTwnZXMwDRp+p/0Uw8Ua4zSCX04YgDR0z7UDsPc35m8D6PAY=',
   'x-amz-request-id': '42PCZRV39DY480XN',
   'date': 'Thu, 08 Sep 2022 20:39:17 GMT',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Buckets': [{'Name': 'managed-velero-backups-fd20f621-126d-4995-a63e-ff038a92f3e3',
   'CreationDate': datetime.datetime(2022, 9, 8, 20, 7, 35, tzinfo=tzlocal())},
  {'Name': 'rosa-jjfzd-dbxd8-image-registry-us-east-2-vbfctdmjcqefdgxpdcns',
   'CreationDate': datetime.datetime(2022, 9, 6, 21, 45, 34, tzinfo=tzlocal())},
  {'Name': 'sagemaker-tf-estimator-model',
   'CreationDate': datetime.datetime(2022, 9, 8, 15, 6, 7, tzinfo=tzlocal())}],
 'Owner': {'ID': '156653ec7d02f798caa7bb60dd55fb03c097214374a
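
The prerequisites call for a bucket with 'sagemaker' in its name, so the response can be filtered rather than read by eye. A minimal sketch, assuming the response has the `Buckets` list of `Name` entries shown in the output above; `sagemaker_buckets` is a hypothetical helper, and the sample dictionary reuses the bucket names from that output.

```python
def sagemaker_buckets(response):
    """Return bucket names containing 'sagemaker' from a list_buckets() response."""
    return [
        bucket["Name"]
        for bucket in response.get("Buckets", [])
        if "sagemaker" in bucket["Name"].lower()
    ]

# Trimmed copy of the response shown above (names only).
sample_response = {
    "Buckets": [
        {"Name": "managed-velero-backups-fd20f621-126d-4995-a63e-ff038a92f3e3"},
        {"Name": "rosa-jjfzd-dbxd8-image-registry-us-east-2-vbfctdmjcqefdgxpdcns"},
        {"Name": "sagemaker-tf-estimator-model"},
    ]
}

print(sagemaker_buckets(sample_response))  # ['sagemaker-tf-estimator-model']
```

In the notebook the same helper could be applied to the live result, for example `sagemaker_buckets(s3.list_buckets())`.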

# Configure your Datasets

In [10]:
# x_train is the data and y_train is the label
# x_val is the data and y_val is the label
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()

# create a directory called 'data' to store your datasets
os.makedirs("./data", exist_ok = True)

# save the training data and labels to ./data/training.npz (np.savez appends the .npz extension)
np.savez('./data/training', image=x_train, label=y_train)

# save the validation data and labels to ./data/validation.npz
np.savez('./data/validation', image=x_val, label=y_val)
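
The cell above passes paths without an extension, and the following cell reads them back with `.npz`. The round trip can be sketched as follows, using small synthetic arrays in place of Fashion-MNIST and a temporary directory in place of `./data` (both are assumptions made to keep the example self-contained).

```python
import os
import tempfile

import numpy as np

# Small synthetic arrays stand in for the Fashion-MNIST images and labels.
images = np.zeros((4, 28, 28), dtype=np.uint8)
labels = np.array([0, 1, 2, 3], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "training")          # no extension given
    np.savez(stem, image=images, label=labels)    # stored under the keys 'image' and 'label'
    saved_path = stem + ".npz"                    # np.savez appended ".npz"
    file_exists = os.path.exists(saved_path)
    with np.load(saved_path) as data:
        loaded_images = data["image"]
        loaded_labels = data["label"]

print(file_exists, loaded_images.shape, loaded_labels.tolist())
```

The keyword names given to `np.savez` (`image`, `label`) become the keys used to read the arrays back, so a training script that loads these files needs to use the same keys.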

In [11]:
# set the training path for the data (local file written above)
training_input_path = './data/training.npz'

# set the validation path for the data (local file written above)
validation_input_path = './data/validation.npz'

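# Set the location to store the trained model
# note: model_dir below is described as an S3 location, so S3_BUCKET is assumed here to hold an S3 URI (for example, s3://<bucket>/<prefix>) rather than a bare bucket name
output_path = os.getenv('S3_BUCKET')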
# for local storage
# create a directory called 'model' to store your model
#os.makedirs("./model", exist_ok = True)
#output_path = './model'
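
If `S3_BUCKET` holds only a bucket name, it may need to be turned into an S3 URI before it is used as an output location. The sketch below shows one way to do that; `to_s3_uri` is a hypothetical helper and the `model` prefix is illustrative.

```python
def to_s3_uri(bucket, prefix=""):
    """Return s3://<bucket>/<prefix>; a value that is already an S3 URI is kept."""
    if not bucket:
        raise ValueError("bucket name is empty")
    if bucket.startswith("s3://"):
        base = bucket.rstrip("/")
    else:
        base = "s3://" + bucket.strip("/")
    prefix = prefix.strip("/")
    return f"{base}/{prefix}" if prefix else base

# Bucket name taken from the list_buckets() output earlier in the notebook.
print(to_s3_uri("sagemaker-tf-estimator-model", "model"))
```

In the notebook this could be applied as `output_path = to_s3_uri(os.getenv('S3_BUCKET'), 'model')`, assuming the variable is set.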

In [12]:
# list of parameters https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py', # path (absolute or relative) to the local Python source file executed as the training entry point
                          role=role,                       # IAM role ARN used by the SageMaker training job
                          instance_count=1,                # number of EC2 instances to use
                          framework_version='1.15',        # TensorFlow version used to run the training code
                          py_version='py3',                # Python version used to run the training code
                          hyperparameters={'epochs': 1},   # hyperparameters passed to the training script
                          model_dir=output_path            # S3 location where checkpoint data and models can be exported during training
                          )

# optional parameters, not set here:
#instance_type='ml.c4.xlarge'      # type of EC2 instance to use
#image_uri='123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0'  # custom training image (a tag such as :latest also works)
#distribution,                     # how to run distributed training for data or model parallelism
#sagemaker_session=sess            # reuse the session created earlier, including its region

ValueError: Must setup local AWS configuration with a region supported by SageMaker.
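
One plausible cause of this error: the estimator is created without a `sagemaker_session`, so the SDK builds a default session, and boto3 finds no region for it. The notebook sets `AWS_REGION`, whereas boto3's default session reads `AWS_DEFAULT_REGION` or the AWS config file. The sketch below shows two hedged remedies: exporting the region under the name boto3 reads, and passing an explicit session to the estimator. `resolve_region` and `build_estimator` are hypothetical helpers, and the `instance_type` value is only the illustrative one from the commented option above.

```python
import os

def resolve_region(env=None):
    """Return the configured region and export it as AWS_DEFAULT_REGION."""
    env = os.environ if env is None else env
    region = env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    if not region:
        raise ValueError("Set AWS_REGION (or AWS_DEFAULT_REGION) before creating sessions")
    # boto3 reads AWS_DEFAULT_REGION when it builds a default session.
    env["AWS_DEFAULT_REGION"] = region
    return region

def build_estimator(role, output_path, env=None):
    """Sketch: create the estimator with an explicit, region-aware session."""
    # Imported lazily so resolve_region() can be used without the SDKs installed.
    import boto3
    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    region = resolve_region(env)
    sess = sagemaker.Session(boto_session=boto3.session.Session(region_name=region))
    return TensorFlow(
        entry_point="mnist_keras_tf.py",
        role=role,
        instance_count=1,
        instance_type="ml.c4.xlarge",   # illustrative value, adjust as needed
        framework_version="1.15",
        py_version="py3",
        hyperparameters={"epochs": 1},
        model_dir=output_path,
        sagemaker_session=sess,         # avoids falling back to a default session
    )
```

In the notebook, `tf_estimator = build_estimator(role, output_path)` would replace the direct `TensorFlow(...)` call. Whether this resolves the error depends on the environment, so treat it as a starting point for debugging rather than a confirmed fix.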