# Human Activity Recognition - AWS

### Problem Statement
Open the notebook “HAR Model training notebook”. The problem statement for this notebook is to deploy a Human Activity Recognition (HAR) model using the Level 1 MLOps architecture. The aim is to improve Blackmi's health app by overcoming the problems faced in the Level 0 architecture. Using the Human Activity Recognition dataset, we will build a machine-learning model, together with the ML pipelines, to categorise user activities for real-time health alerts in Amazon SageMaker Studio. We will also monitor model performance and deploy the model using different deployment techniques.

### Approach 
In this notebook we build the Level 1 MLOps architecture, focusing on creating the ML pipeline, model monitoring and model deployment. The main takeaway from this lesson is:

1. Feature engineering with Amazon SageMaker Processing



In [59]:
# Importing all the necessary libraries 
# Importing pandas and numpy for data preprocessing. 
import pandas as pd
import numpy as np
# Boto3 is the AWS SDK for Python; this notebook uses it to look up the session's AWS region.
import boto3
# Sagemaker is imported for building, training and deploying machine learning models.
import sagemaker


In [60]:
# Initialise a new SageMaker session as "sess".
sess = sagemaker.Session()
# Fetch the IAM execution role that grants the permissions needed to train and deploy models.
role = sagemaker.get_execution_role()
# Look up the AWS region this session is configured to operate in.
region = boto3.Session().region_name
region


INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


'ap-south-1'

In [61]:
# Name of the S3 bucket that holds the data
bucket = 'sagemaker-studio-009676737623-l4vs7j0o0ib'
# Key prefix (folder) under which all artefacts for this lesson are stored
prefix = 'mlops-level1-data'
# Location of the raw dataset, built from the bucket and prefix above
input_source = f's3://{bucket}/{prefix}/train_data.gzip'


In [62]:
train_path = f"s3://{bucket}/{prefix}/train"
validation_path = f"s3://{bucket}/{prefix}/validation"
test_path = f"s3://{bucket}/{prefix}/test"
feature_path = f"s3://{bucket}/{prefix}/feature"
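
The four paths above follow the same `s3://<bucket>/<prefix>/<name>` pattern. As a minimal sketch, a small helper can build them in one place; `s3_uri` is a hypothetical name introduced here for illustration and is not part of boto3 or the SageMaker SDK.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI (hypothetical helper)."""
    # Strip stray slashes so that "a/" + "/b" does not produce "a//b"
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


bucket = "sagemaker-studio-009676737623-l4vs7j0o0ib"
prefix = "mlops-level1-data"

# One entry per dataset split used in this notebook
paths = {name: s3_uri(bucket, prefix, name)
         for name in ("train", "validation", "test", "feature")}

for name, uri in paths.items():
    print(f"{name:<10} {uri}")
```

Keeping the pattern in one function means a change of bucket or prefix only has to be made once.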


## Training

In [63]:
# train_path and validation_path are already fully formatted f-strings, so no .format() call is needed
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path,
                                                content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_path,
                                                     content_type='csv')
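
These inputs are later passed to `fit` under the channel names `train` and `test`, and the training script reads them through `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST`. The sketch below illustrates that naming convention, assuming the usual SageMaker training container layout in which each channel is made available under `/opt/ml/input/data/<channel>`; `channel_env` is a hypothetical helper used only for illustration.

```python
def channel_env(channel_names, base_dir="/opt/ml/input/data"):
    """Map channel names to the environment variables a training script would read.

    Assumption: the container exposes each channel as SM_CHANNEL_<NAME>,
    pointing at <base_dir>/<name>.
    """
    return {f"SM_CHANNEL_{name.upper()}": f"{base_dir}/{name}"
            for name in channel_names}


# The channel names used when the tuner is fitted in this notebook
env = channel_env(["train", "test"])
for variable, directory in env.items():
    print(f"{variable} -> {directory}")
```

Note that the validation data travels on the `test` channel here, which is why the script is told to look for `validation.csv` through the `test-file` hyperparameter.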

In [64]:
%%writefile sklearn-train.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from joblib import dump, load
import pandas as pd, numpy as np, os, argparse

# inference function - tells SageMaker how to load the model
def model_fn(model_dir):
    clf = load(os.path.join(model_dir, "model.joblib"))
    return clf

# Argument parser
def _parse_args():
    parser = argparse.ArgumentParser()
    # Hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="train.csv")
    parser.add_argument("--test-file", type=str, default="test.csv")
    # Parse the arguments
    return parser.parse_known_args()

# Main Training Loop
if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    # Load the dataset
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    # Separate X and y
    X_train, y_train = train_df.drop(train_df.columns[0], axis=1), train_df[train_df.columns[0]]
    X_test, y_test = test_df.drop(test_df.columns[0], axis=1), test_df[test_df.columns[0]]
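    # Define the model and train it; both tuned hyperparameters are passed through
    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1,
    )
    model.fit(X_train, y_train)
    # Evaluate the model performance; this log line is what the metric regex parses
    print(f'Model Accuracy: {accuracy_score(y_test, model.predict(X_test))}')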
    dump(model, os.path.join(args.model_dir, 'model.joblib'))
    

Overwriting sklearn-train.py
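
In script mode the entries of the estimator's `hyperparameters` dictionary reach the script as command-line flags, which is why the script declares them with `argparse`. The sketch below reproduces only the hyperparameter part of the parser so it can be tried locally; the flag `--some-other-flag` is a made-up example of an argument the script does not declare.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the hyperparameter flags declared by the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    parser.add_argument("--test-file", type=str, default="test.csv")
    # parse_known_args returns undeclared flags separately instead of failing
    return parser.parse_known_args(argv)


# Simulated command line, mirroring the hyperparameters set on the estimator
argv = ["--n-estimators", "120", "--min-samples-leaf", "3",
        "--test-file", "validation.csv", "--some-other-flag", "1"]
args, unknown = parse_hyperparameters(argv)

# argparse turns hyphens into underscores for attribute access
print(args.n_estimators, args.min_samples_leaf, args.test_file)
print(unknown)
```

Using `parse_known_args` rather than `parse_args` keeps the script from exiting when extra arguments are passed that it does not care about.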


In [65]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

# Define the Estimator from SageMaker (Script Mode)
sklearn_estimator = SKLearn(
    entry_point="sklearn-train.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[{"Name": "Accuracy", "Regex": "Accuracy: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 120,
        "min-samples-leaf": 3,
        "test-file": "validation.csv"
    },
)

# Train the model (~5 minutes)
#sklearn_estimator.fit({"train": s3_input_train, "test": s3_input_validation})


INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [66]:
# Use the HyperparameterTuner to search over the estimator's hyperparameters
from sagemaker.tuner import IntegerParameter

# Define exploration boundaries
hyperparameter_ranges = {
    "n-estimators": IntegerParameter(100, 200),
    "min-samples-leaf": IntegerParameter(2, 6)
}

Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator=sklearn_estimator,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name="RF-tuner",
    objective_type="Maximize",
    objective_metric_name="Accuracy",
    metric_definitions=[
        {"Name": "Accuracy", "Regex": "Accuracy: ([0-9.]+).*$"}
    ],  # extract tracked metric from logs with regexp
    max_jobs=10,
    max_parallel_jobs=2,
)
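
The tuner learns each job's objective value by applying the `Regex` in `metric_definitions` to the training logs. As a rough local check, the same pattern can be tried with Python's `re` module against the line the training script prints; this only approximates how the service scans the logs, and `extract_accuracy` is a hypothetical helper.

```python
import re

# Same pattern as in metric_definitions
ACCURACY_REGEX = r"Accuracy: ([0-9.]+).*$"


def extract_accuracy(log_line):
    """Return the captured accuracy as a float, or None if the line does not match."""
    match = re.search(ACCURACY_REGEX, log_line)
    return float(match.group(1)) if match else None


# The training script prints a line of this shape
print(extract_accuracy("Model Accuracy: 0.754975"))
# Lines without the metric are ignored
print(extract_accuracy("Training started"))
```

If the `print` statement in the training script is reworded, the regex has to be updated to match, otherwise the tuner has no objective value to optimise.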


In [67]:
Optimizer.fit({"train": s3_input_train, "test": s3_input_validation})


INFO:sagemaker:Creating hyperparameter tuning job with name: RF-tuner-230919-1351


Using provided s3_resource
..................................................................................................!


In [68]:
# Fetch the tuner results as a DataFrame, polling until the first results are available
import time

results = Optimizer.analytics().dataframe()
while results.empty:
    time.sleep(1)
    results = Optimizer.analytics().dataframe()
results.head()


| | min-samples-leaf | n-estimators | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 170.0 | RF-tuner-230919-1351-010-9b3992b9 | Completed | 0.754975 | 2023-09-19 13:58:25+00:00 | 2023-09-19 13:59:36+00:00 | 71.0 |
| 1 | 2.0 | 139.0 | RF-tuner-230919-1351-009-d1d60343 | Completed | 0.756776 | 2023-09-19 13:58:25+00:00 | 2023-09-19 13:59:31+00:00 | 66.0 |
| 2 | 2.0 | 171.0 | RF-tuner-230919-1351-008-af5d773e | Completed | 0.759776 | 2023-09-19 13:56:57+00:00 | 2023-09-19 13:58:09+00:00 | 72.0 |
| 3 | 3.0 | 157.0 | RF-tuner-230919-1351-007-179a8714 | Completed | 0.756676 | 2023-09-19 13:56:59+00:00 | 2023-09-19 13:58:06+00:00 | 67.0 |
| 4 | 3.0 | 178.0 | RF-tuner-230919-1351-006-5df656c4 | Completed | 0.755376 | 2023-09-19 13:55:32+00:00 | 2023-09-19 13:56:44+00:00 | 72.0 |
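
Because the objective type is `Maximize`, the best job is the one with the highest `FinalObjectiveValue`. The sketch below picks it out of the five rows shown above using plain Python; only the rows returned by `head()` are included, so it assumes the remaining five jobs did not score higher. On the full DataFrame, `results.sort_values("FinalObjectiveValue", ascending=False)` would give the same ordering.

```python
# (job name, objective value) pairs copied from the rows displayed above
shown_jobs = [
    ("RF-tuner-230919-1351-010-9b3992b9", 0.754975),
    ("RF-tuner-230919-1351-009-d1d60343", 0.756776),
    ("RF-tuner-230919-1351-008-af5d773e", 0.759776),
    ("RF-tuner-230919-1351-007-179a8714", 0.756676),
    ("RF-tuner-230919-1351-006-5df656c4", 0.755376),
]

# Maximise the objective: the job with the highest accuracy wins
best_name, best_value = max(shown_jobs, key=lambda job: job[1])
spread = best_value - min(value for _, value in shown_jobs)

print(f"best job: {best_name} (accuracy {best_value})")
print(f"spread across shown jobs: {spread:.6f}")
```

The selected job is the same one whose name appears in the endpoint creation log in the Hosting section. The spread between the shown jobs is under half a percentage point, which suggests these two hyperparameters have limited influence within the ranges explored.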


## Hosting

In [69]:
# Deploying the model
sklearn_predictor = Optimizer.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')


2023-09-19 13:58:24 Starting - Found matching resource for reuse
2023-09-19 13:58:24 Downloading - Downloading input data
2023-09-19 13:58:24 Training - Training image download completed. Training in progress.
2023-09-19 13:58:24 Uploading - Uploading generated training model
2023-09-19 13:58:24 Completed - Resource reused by training job: RF-tuner-230919-1351-010-9b3992b9

INFO:sagemaker:Creating model with name: RF-tuner-2023-09-19-13-59-51-333





INFO:sagemaker:Creating endpoint-config with name RF-tuner-230919-1351-008-af5d773e
INFO:sagemaker:Creating endpoint with name RF-tuner-230919-1351-008-af5d773e


-----!

## Prediction & Evaluation

In [70]:
sklearn_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [71]:
!aws s3 cp $test_path/test_x.csv ./tmp/test_x.csv
!aws s3 cp $test_path/test_y.csv ./tmp/test_y.csv

download: s3://sagemaker-studio-009676737623-l4vs7j0o0ib/mlops-level1-data/test/test_x.csv to tmp/test_x.csv
download: s3://sagemaker-studio-009676737623-l4vs7j0o0ib/mlops-level1-data/test/test_y.csv to tmp/test_y.csv


In [72]:
test_x = pd.read_csv('tmp/test_x.csv', names=[f'{i}' for i in range(12)])
test_y = pd.read_csv('tmp/test_y.csv', names=['y'])


In [73]:
predictions = sklearn_predictor.predict(test_x)

In [74]:
from sklearn.metrics import classification_report
print(classification_report(test_y['y'].values, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1705
           1       0.88      0.86      0.87      1660
           2       0.87      0.88      0.87      1671
           3       0.49      0.50      0.49      1622
           4       0.63      0.56      0.59      1680
           5       0.68      0.72      0.70      1662

    accuracy                           0.76     10000
   macro avg       0.76      0.76      0.76     10000
weighted avg       0.76      0.76      0.76     10000



In [75]:
pd.crosstab(index=test_y['y'].values, columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

| actuals / predictions | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1705 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1435 | 225 | 0 | 0 | 0 |
| 2 | 0 | 194 | 1475 | 0 | 0 | 2 |
| 3 | 0 | 0 | 0 | 814 | 441 | 367 |
| 4 | 0 | 0 | 0 | 528 | 947 | 205 |
| 5 | 0 | 0 | 5 | 336 | 124 | 1197 |
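
The classification report and the confusion matrix describe the same predictions, so the report's figures can be recomputed from the matrix as a consistency check. The following is a minimal sketch in plain Python, with the counts copied from the matrix above (rows are actual classes, columns are predicted classes); `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
# Confusion matrix from above: confusion[actual][predicted]
confusion = [
    [1705, 0, 0, 0, 0, 0],
    [0, 1435, 225, 0, 0, 0],
    [0, 194, 1475, 0, 0, 2],
    [0, 0, 0, 814, 441, 367],
    [0, 0, 0, 528, 947, 205],
    [0, 0, 5, 336, 124, 1197],
]


def per_class_metrics(matrix):
    """Return (accuracy, precision per class, recall per class)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    # Recall: correct predictions divided by the row total (actual count)
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # Precision: correct predictions divided by the column total (predicted count)
    precision = [matrix[i][i] / sum(matrix[r][i] for r in range(n))
                 for i in range(n)]
    return correct / total, precision, recall


accuracy, precision, recall = per_class_metrics(confusion)
print(f"accuracy: {accuracy:.4f}")
for label in range(len(confusion)):
    print(f"class {label}: precision={precision[label]:.2f} recall={recall[label]:.2f}")
```

Read this way, the matrix shows that almost all errors fall inside two groups: classes 1 and 2 are confused with each other, and classes 3, 4 and 5 are confused among themselves, while class 0 is separated perfectly.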
