# Activity Recognition on Sagemaker

This notebook contains commands that train and deploy a model on Sagemaker for the activity recognition project mentioned in the README of this repository. In the below cells, we take the most accurate model from the ActivityRecognition.ipynb notebook and deploy it in a Sagemaker environment. 

With the massive increase in data collection and availability, the days of developing a machine learning model on a local Jupyter notebook are numbered. For this reason, ML engineers and data scientists will be looking to leverage the cloud for running their workloads. 

Training and deploying our model in the cloud has several benefits. First, we can take advantage of AWS' on-demand resources and pay-as-you-go model. We only have to provision resources while we need them for either training or inference and do not have to pay for those resources while we are not using them. AWS provides many options for instance types, including both CPU and GPU instances. Instead of having to procure expensive hardware, we can run our workload on these instances and only pay for the amount of time that we are running our workload. Another benefit of moving our ML workload to the cloud is that we can experiment quickly. With access to virtually unlimited hardware and the ability to quickly spin up instances and other resources, the cloud is a good fit for most ML jobs. The benefits are more easily seen with training jobs that are compute-intensive. Our simple project may not benefit from moving to the cloud as much as training a distributed neural network on several GB's of data, for example. However, this notebook provides a starting point for moving your local ML workloads to the cloud and demonstrates my ability to leverage AWS for machine learning. 

There are three main options for using Sagemaker - use a built-in algorithm, use a pre-built container (script mode), or bring-your-own container. Since we want to use some of the most common ML frameworks while still writing our own custom logic, we will use the a pre-built container. We will first aggregate our code from our local notebook into a single Python script. Sagemaker uses Amazon Simple Storage Service (S3) as its datasource so we will need to upload our data to S3 using the AWS SDK. We will then use the applicable framework container provided by Sagemaker to run our script and pull our training data from S3. Once the model is trained, we will deploy it to an inference endpoint, where we can send new data points and obtain a prediction. Our notebook will also include a step to cleanup our environment (delete our endpoint) so that we are not paying for unnecessary resources. 

In [17]:
# import necessary libraries
import pandas as pd
import numpy as np
import os
import sagemaker

In [14]:
hyperparameters = {'n_estimators': 163, 'min_samples_split': 8, 'min_samples_leaf': 2, 'max_depth': 25} # hyperparameters found in our local testing

Note: the Sagemaker-specific code in the below script is derived from [this](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-script-mode/scikitlearn_script/train_deploy_scikitlearn_without_dependencies.py) Sagemaker example script. It has been changed to accomodate my particular dataset format and preprocessing steps. 

In [48]:
%%writefile activity_recognition_random_forest_script.py

import argparse
import numpy as np
import os
import pandas as pd
import re
import joblib
import json
import traceback
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sagemaker_training import environment


def parse_args():
    """
    Parse arguments.
    """
    env = environment.Environment()

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument("--max-depth", type=int, default=10)
    parser.add_argument("--n-jobs", type=int, default=env.num_cpus)
    parser.add_argument("--min-samples-split", type=int, default=2)
    parser.add_argument("--min-samples-leaf", type=int, default=2)
    parser.add_argument("--n-estimators", type=int, default=120)

    # data directories
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) # we will not be specifying a train channel because we have one set of input files so we will do the splitting ourselves (see below)

    # model directory: we will use the default set by SageMaker, /opt/ml/model
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))

    return parser.parse_known_args()


def load_dataset(path):
    """
    Load entire dataset.
    """
    header = ['seq_num', 'x_accel', 'y_accel', 'z_accel', 'activity']
    
    # Take the set of files and read them all into a single pandas dataframe
    files = [os.path.join(path, file) for file in os.listdir(path) if file.endswith("csv")]

    if len(files) == 0:
        raise ValueError("Invalid # of files in dir: {}".format(path))

    raw_data = [pd.read_csv(file, header=None, names=header, index_col='seq_num') for file in files]
    data = pd.concat(raw_data, axis=0).reset_index(drop=True)
    
    # drop rows where the activity == 0
    rowsToDrop = data[data.activity == 0].index
    data.drop(index=rowsToDrop, inplace=True, axis=0)

    # labels are in the last column
    y = data.iloc[:, -1]
    X = data.iloc[:, :-1]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12)
    
    # TODO: add SMOTE (and add imbalanced-learn library as a dependency)
    #     smote = SMOTE()
    #     X_train, y_train = smote.fit_resample(X_train, y_train)
    
    return X_train, X_test, y_train, y_test


def start(args):
    """
    Train a Random Forest Regressor
    """
    print("Training mode")

    try:
        X_train, X_test, y_train, y_test = load_dataset(args.train)

        hyperparameters = {
            "max_depth": args.max_depth,
            "verbose": 1,  # show all logs
            "min_samples_split": args.min_samples_split,
            "n_estimators": args.n_estimators,
            "min_samples_leaf": args.min_samples_leaf
        }
        print("Training the classifier")
        model = RandomForestClassifier()
        model.set_params(**hyperparameters)
        model.fit(X_train, y_train)
        print("Score: {}".format(model.score(X_test, y_test)))
        joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))

    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, "failure"), "w") as s:
            s.write("Exception during training: " + str(e) + "\\n" + trc)

        # Printing this causes the exception to be in the training job logs, as well.
        print("Exception during training: " + str(e) + "\\n" + trc, file=sys.stderr)

        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)


def model_fn(model_dir):
    """
    Deserialized and return fitted model
    Note that this should have the same name as the serialized model in the main method
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == "__main__":

    args, _ = parse_args()

    start(args)

Overwriting activity_recognition_random_forest_script.py


To prevent embedding the name of my S3 bucket, we will use local mode to train our model and use files from our instance's attached volume. In production, we would most likely have our data in S3 and provision a separate instance for training. 

In [49]:
role = sagemaker.get_execution_role()

In [50]:
train_dir = 'Data/'
train_instance_type = 'local'
inputs = {'train': f'file://{train_dir}'}

In [51]:
from sagemaker.sklearn.estimator import SKLearn

In [52]:
estimator = SKLearn(entry_point = 'activity_recognition_random_forest_script.py',
                   framework_version = '0.23-1',
                   py_version = 'py3',
                   instance_type = train_instance_type,
                   instance_count = 1,
                   hyperparameters = hyperparameters,
                   role = role,
                   base_job_name = 'randomforest_script_mode')

In [None]:
estimator.fit(inputs)

In [None]:
sklearn_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge',endpoint_name='randomforestregressor-endpoint')

In [61]:
# provide a data sample (activity should be 7 - Talking while Standing)
test_features = np.array([2028, 2382, 2012]).reshape(1, -1)

sklearn_predictor.predict(test_features)

[36m6wrbvkmhkc-algo-1-jwk0o |[0m [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[36m6wrbvkmhkc-algo-1-jwk0o |[0m [Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:    0.0s finished


array([7])

[36m6wrbvkmhkc-algo-1-jwk0o |[0m 172.18.0.1 - - [28/Sep/2021:14:27:43 +0000] "POST /invocations HTTP/1.1" 200 136 "-" "python-urllib3/1.26.6"


As you can see, our endpoint returned 7. This was the expected class for the given x acceleration, y acceleration, and z acceleration values.

In [62]:
sklearn_predictor.delete_endpoint(delete_endpoint_config=True)

Gracefully stopping... (press Ctrl+C again to force)
