# MLOps workshop with Amazon SageMaker

## Module 02 (**optional**): Transform the data and train a model using the SageMaker `@remote` decorator

This notebook shows how to use the `@remote` decorator from the SageMaker Python SDK to delegate data processing, model training and model evaluation workloads to SageMaker jobs.

For more details about running your local code as a SageMaker training job, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html).
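
To make the calling convention concrete, the sketch below imitates it locally. `local_remote` is a hypothetical stand-in and not part of the SageMaker SDK: it accepts job settings as keyword arguments and round-trips the arguments and the return value through `pickle`, on the assumption that a remote run has to serialize both in a similar way. It runs everything in the current process and launches no job.

```python
import functools
import pickle


def local_remote(**job_settings):
    """Hypothetical local stand-in for a remote-execution decorator."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Serialize the inputs, as they would be before leaving the notebook.
            payload = pickle.dumps((args, kwargs))
            # "Remote" side: deserialize, run the function, serialize the result.
            remote_args, remote_kwargs = pickle.loads(payload)
            result_bytes = pickle.dumps(func(*remote_args, **remote_kwargs))
            # Back in the caller: deserialize the return value.
            return pickle.loads(result_bytes)

        # Keep the settings around for inspection; this sketch does not use them.
        wrapper.job_settings = job_settings
        return wrapper

    return decorator


@local_remote(instance_type="ml.m5.large", keep_alive_period_in_seconds=900)
def add_one(values):
    return [v + 1 for v in values]


print(add_one([1, 2, 3]))
print(add_one.job_settings)
```

One consequence worth keeping in mind: whatever a decorated function receives and returns has to survive serialization.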

We will use the same dataset and model introduced in the [Module 02: Transform the data and train a model using SageMaker managed training job](02_manual_sagemaker_process_train.ipynb) notebook.

**Note:** this notebook can only run on the `Base Python 3.0` kernel.

## Install the dependencies

We will create a `requirements.txt` file that will be used in this notebook, the pre-processing job and the evaluation jobs. 

In [None]:
%%writefile requirements.txt

pandas==2.1.4
scikit-learn==1.3.2
tensorflow==2.15.0
sagemaker>=2.203.0,<3

Now we install the dependencies in the notebook environment.

In [None]:
%pip install -r ./requirements.txt -q

## Set up the configuration file path

We set the directory in which the `config.yaml` file resides so that the `@remote` decorator can pick up its settings.

The configuration below makes `ml.m5.large` the default instance type for functions run with the `@remote` decorator. Also note that the packages listed in `requirements.txt` are installed by default.

In [None]:
%%writefile config.yaml

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        # role arn is not required if in SageMaker Notebook instance or SageMaker Studio
        # Uncomment the following line and replace with the right execution role if in a local IDE
        # RoleArn: <replace the role arn here>
        InstanceType: ml.m5.large
        Dependencies: ./requirements.txt
        IncludeLocalWorkDir: true
        CustomFileFilter:
          IgnoreNamePatterns: # files or directories to ignore
          - "*.ipynb" # all notebook files

In [None]:
import os

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()
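
Before launching a job it can help to confirm that the override actually points at a readable file. The sketch below is a plain-Python sanity check, not an SDK call; `resolve_config_path` is a hypothetical helper, and it assumes that a directory value means "use the `config.yaml` inside that directory".

```python
import os
from pathlib import Path


def resolve_config_path(env=os.environ):
    """Return the config file the override variable points at, or None."""
    override = env.get("SAGEMAKER_USER_CONFIG_OVERRIDE")
    if not override:
        return None
    path = Path(override)
    # Assumption: a directory means "look for config.yaml inside it".
    candidate = path / "config.yaml" if path.is_dir() else path
    return candidate if candidate.is_file() else None


found = resolve_config_path()
print(found if found else "No config file found via SAGEMAKER_USER_CONFIG_OVERRIDE")
```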

## SageMaker Processing for dataset transformation <a class="anchor" id="SageMakerProcessing"></a>

Next, we'll import the dataset and transform it with SageMaker Processing, which can be used to process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training and model hosting tasks are run on separate, more powerful SageMaker-managed instances.  SageMaker Processing includes off-the-shelf support for Scikit-learn, as well as a Bring Your Own Container option, so it can be used with many different data transformation technologies and tasks.  An alternative to SageMaker Processing is [SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/), a visual data preparation tool integrated with the SageMaker Studio UI.    

To work with SageMaker Processing, first we'll load the California Housing dataset, save the raw feature data and upload it to Amazon S3 so it can be accessed by SageMaker Processing.  We'll also save the labels for training and testing.
    
More info on the dataset:

This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

In [None]:
import boto3
import json
import os
import sagemaker
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz .

In [None]:
!tar -zxf cal_housing.tgz 2>/dev/null

In [None]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]
df = pd.read_csv("CaliforniaHousing/cal_housing.data", names=columns, header=None)

In [None]:
df.head()

In [None]:
columns_to_normalize = [
    'medianIncome', 'housingMedianAge', 'totalRooms', 
    'totalBedrooms', 'population', 'households', 'medianHouseValue'
]

for column in columns_to_normalize:
    df[column] = np.log1p(df[column])
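
Because `medianHouseValue` is among the transformed columns, the model is later trained and scored on `log1p` values rather than on the original scale. A minimal sketch of the transform and its inverse, using the standard-library `math` counterparts of `np.log1p` and `np.expm1`; the sample value is made up for illustration.

```python
import math

house_value = 250000.0  # made-up value, for illustration only

log_value = math.log1p(house_value)  # log(1 + x), the scalar form of np.log1p
recovered = math.expm1(log_value)    # exp(y) - 1 undoes the transform

print(f"log1p: {log_value:.4f}")
print(f"back on the original scale: {recovered:.2f}")
```

Applying `expm1` to a prediction is one way to read it on the original scale again.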

In [None]:
X = df.drop("medianHouseValue", axis=1)
Y = df["medianHouseValue"].copy()

In [None]:
print("Features:", list(X.columns))
print("Dataset shape:", X.shape)
print("Dataset Type:", type(X))
print("Label set shape:", Y.shape)
print("Label set Type:", type(Y))

# We partition the dataset into 2/3 training and 1/3 test set.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)

## Process the data set 

We will use the default `ml.m5.large` CPU instance configured in `config.yaml` to pre-process the data remotely. 

The pre-processing function uses scikit-learn `StandardScaler` to scale the features and convert them to NumPy.
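
As a rough illustration of what the scaling step computes, the sketch below standardizes a single made-up feature column in plain Python: the mean and the population standard deviation are learned from the training values only and then reused for the test values. The helper names are hypothetical, and edge cases such as a zero standard deviation are not handled.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]


train = [2.0, 4.0, 6.0, 8.0]  # made-up feature values
test = [5.0, 10.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training split alone keeps information about the test split out of the transformation.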

Note that we override the defaults configured in `config.yaml` and set `keep_alive_period_in_seconds` to 900 seconds. This uses SageMaker managed warm pools to speed up iterative development and debugging.

SageMaker managed warm pools let you retain and reuse provisioned infrastructure after the completion of a training job to reduce latency for repetitive workloads, such as iterative experimentation or running many jobs consecutively. Subsequent training jobs that match specified parameters run on the retained warm pool infrastructure, which speeds up start times by reducing the time spent provisioning resources. For more information please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html). 

In [None]:
from sagemaker.remote_function import remote
from sklearn.preprocessing import StandardScaler

@remote(keep_alive_period_in_seconds=900)
def normalize(x_train, x_test, y_train, y_test):
    scaler = StandardScaler()
    scaler.fit(x_train.to_numpy())
    x_train_transformed_npy = scaler.transform(x_train.to_numpy())
    print(f"x_train_transformed_npy: {x_train_transformed_npy}")
    x_test_transformed_npy = scaler.transform(x_test.to_numpy())
    print(f"x_test_transformed_npy: {x_test_transformed_npy}")
    y_train_transformed_npy = y_train.to_numpy()
    print(f"y_train_transformed_npy: {y_train_transformed_npy}")
    y_test_transformed_npy = y_test.to_numpy() 
    print(f"y_test_transformed_npy: {y_test_transformed_npy}")
    return(x_train_transformed_npy, x_test_transformed_npy, y_train_transformed_npy, y_test_transformed_npy)

x_train_norm, x_test_norm, y_train_norm, y_test_norm = normalize(x_train, x_test, y_train, y_test) 

In [None]:
print(f"x_train_norm - shape: {x_train_norm.shape}, data: {x_train_norm}")
print(f"x_test_norm - shape: {x_test_norm.shape}, data: {x_test_norm}")
print(f"y_train_norm - shape: {y_train_norm.shape}, data: {y_train_norm}")
print(f"y_test_norm - shape: {y_test_norm.shape}, data: {y_test_norm}")

## Run the training remotely with a larger Spot instance

For the training job we will override the default settings and use `ml.c5.2xlarge`, a larger instance needed for the training run, using managed spot training.  

Amazon SageMaker makes it easy to train machine learning models using managed Amazon EC2 Spot instances. Managed spot training can reduce the cost of training models by up to 90% compared with on-demand instances. SageMaker manages the Spot interruptions on your behalf.

Managed Spot Training uses Amazon EC2 Spot instances to run training jobs instead of on-demand instances. You can specify which training jobs use Spot instances and a stopping condition that specifies how long SageMaker waits for a job to run using Amazon EC2 Spot instances. Metrics and logs generated during training runs are available in CloudWatch.

To use managed spot training with the `@remote` decorator, set `use_spot_instances` to `True` and specify `max_wait_time_in_seconds`, which must be greater than or equal to `max_runtime_in_seconds`. For more information, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).
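
The relationship between the two time limits can be checked before a job is submitted. The sketch below uses a hypothetical helper, `check_spot_limits`, which is not part of the SDK. It assumes that the wait time covers both the time spent waiting for Spot capacity and the run itself, so it must be at least as large as the runtime limit.

```python
def check_spot_limits(max_runtime_in_seconds, max_wait_time_in_seconds):
    """Validate the two time limits for a spot job (hypothetical helper)."""
    if max_runtime_in_seconds <= 0:
        raise ValueError("max_runtime_in_seconds must be positive")
    if max_wait_time_in_seconds < max_runtime_in_seconds:
        raise ValueError(
            "max_wait_time_in_seconds must be at least max_runtime_in_seconds"
        )
    # Time left for waiting on Spot capacity if the job uses its full runtime.
    return max_wait_time_in_seconds - max_runtime_in_seconds


print(check_spot_limits(max_runtime_in_seconds=300, max_wait_time_in_seconds=600))
```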

In [None]:
import numpy as np
import os
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

def get_model():
    inputs = tf.keras.Input(shape=(8,))
    hidden_1 = tf.keras.layers.Dense(8, activation='tanh')(inputs)
    hidden_2 = tf.keras.layers.Dense(4, activation='sigmoid')(hidden_1)
    outputs = tf.keras.layers.Dense(1)(hidden_2)
    return tf.keras.Model(inputs=inputs, outputs=outputs)


@remote(instance_type="ml.c5.2xlarge", 
        use_spot_instances=True, 
        max_runtime_in_seconds=300, 
        max_wait_time_in_seconds=600)
def train(x_train, x_test, y_train, y_test):
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    
    if tf.config.list_physical_devices('GPU'):
        device = '/GPU:0'
    else:
        device = '/CPU:0'
    print(f"will use: {device}")
    
    batch_size = 128
    epochs = 25
    learning_rate = 0.01
    print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate))

    with tf.device(device):
        model = get_model()
        optimizer = tf.keras.optimizers.SGD(learning_rate)
        model.compile(optimizer=optimizer, loss='mse')
        model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                  validation_data=(x_test, y_test))

        # evaluate on test set
        scores = model.evaluate(x_test, y_test, batch_size, verbose=2)
        print("\nTest MSE :", scores)
        return(model)
        
model = train(x_train_norm, x_test_norm, y_train_norm, y_test_norm)

### Savings

Towards the end of the job you should see two lines of output printed:

- `Training seconds`: X : This is the actual compute-time your training job spent
- `Billable seconds`: Y : This is the time you will be billed for after Spot discounting is applied.

If you enabled `use_spot_instances`, you should see a notable difference between X and Y, which reflects the cost savings from Managed Spot Training. This is reported in an additional line:

- `Managed Spot Training savings: (1-Y/X)*100 %`

Here you can see around 48% managed spot training savings!
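
The savings figure follows directly from the two reported values. A minimal sketch of the arithmetic, with a hypothetical helper name and made-up numbers chosen to land near the figure quoted above:

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Savings as reported in the job log: (1 - Y / X) * 100."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Made-up values for X (training seconds) and Y (billable seconds).
print(f"{spot_savings_percent(training_seconds=100, billable_seconds=52):.1f}%")
```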

## Perform the model evaluation using test set

We will use the default `ml.m5.large` CPU instance configured in `config.yaml` to evaluate the model remotely.

Here, we use the TensorFlow `model.evaluate` function to perform the model evaluation and return a `dict` with the mean squared error. As before, we set `keep_alive_period_in_seconds` to 900 seconds to use SageMaker managed warm pools, which speeds up iterative development and debugging.

In [None]:
@remote(keep_alive_period_in_seconds=900)
def evaluate(model, x_test, y_test):
    scores = model.evaluate(x_test, y_test, verbose=2)
    print("\nTest MSE :", scores)
    report_dict = {"mse": str(scores)}
    return(report_dict)

report_dict = evaluate(model, x_test_norm, y_test_norm)
print(report_dict)
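
Since the model was compiled with `loss='mse'`, the score in the report is a mean squared error on the `log1p`-transformed target. The sketch below computes the same metric by hand and serializes a report of the same shape as JSON. The sample values are made up and the helper name is an assumption for illustration.

```python
import json


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [12.0, 12.5, 11.5]  # made-up targets on the log scale
y_pred = [12.5, 12.5, 11.0]  # made-up predictions

report = {"mse": str(mean_squared_error(y_true, y_pred))}
report_json = json.dumps(report)
print(report_json)
```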