# Human Activity Recognition - AWS

### Problem Statement
Let’s open the notebook “HAR Model training notebook”. The problem statement for this notebook is: Deploying the Human Activity Recognition problem using the Level 1 MLOps architecture, the aim is to enhance the experience of Blackmi's health app by overcoming the problems faced in the level 0 architecture. Utilising the Human Activity Recognition dataset, we will construct a machine-learning model along with the ML pipelines to categorise user activities for the real-time health alerts using AWS sagemaker studio. Here we will also be monitoring the model performance and deploy the model using different deployment techniques.

### Approach 
In this notebook we will be building the level 1 architecture of MLOps, and our major focus would be on creating ML pipeline, model monitoring and model deployment. The major take away for this lesson is to learn:

1. Feature engineering with the amazon sagemaker processing 



In [127]:
# Importing all the necessary libraries 
# Importing pandas and numpy for data preprocessing. 
import pandas as pd
import numpy as np
# Boto3 is used for launching the EC2 instances and manipulating s3 buckets.
import boto3
# Sagemaker is imported for building, training and deploying machine learning models.
import sagemaker

In [128]:
# Initialising new sagemaker session as "sess".
sess = sagemaker.Session()
# Check for necessary permission needed for training and deploying models. 
role = sagemaker.get_execution_role()
# To understand where this session is configured to operate.
region = boto3.Session().region_name
region

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


'ap-south-1'

In [129]:
# Bucket variable is used for storing the location of the bucket
bucket = 'sagemaker-studio-009676737623-l4vs7j0o0ib'
# Assigning the prefix variable 
prefix = 'mlops-level1-data'
# input_source variable is used for storing the location of the dataset
input_source = 's3://sagemaker-studio-009676737623-l4vs7j0o0ib/mlops-level1-data/train_data.gzip'


In [None]:
# Reading the dataset using the read_parquet format. 
df = pd.read_parquet(input_source)
df.shape

In [None]:
# Displaying first few rows of the dataset
df.head()

In [None]:
# Visualising the data distribution of the Activity column
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

df.groupby("Activity")[['Activity']].count().plot(kind="bar", title="Breakdown by Star Rating")
plt.xlabel("Activity")
plt.ylabel("Activity Count")


## Feature Engineering with Amazon SageMaker Processing

In [None]:
%%writefile preprocessing.py

import pandas as pd
import numpy as np
import argparse
import os
from joblib import dump, load
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import ExtraTreesClassifier
def _parse_args():

    parser = argparse.ArgumentParser()
    
    # Data, model, and output directories
    # model_dir is always passed in from SageMaker. By default this is a S3 path under the default bucket.
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='train_data.gzip')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')

    return parser.parse_known_args()


def get_top_k_features(X, Y, k):
        clf = ExtraTreesClassifier(n_estimators=50)
        clf = clf.fit(X, Y)
        feature_df = pd.DataFrame(
            data=(X.columns, clf.feature_importances_)
        ).T.sort_values(by=1, ascending=False)
        cols = feature_df.head(k)[0].values
        return cols

if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    # Load data
    path = os.path.join(args.filepath,args.filename)
    print(path)
    
    # Reading the dataset and performing label encoding 
    df = pd.read_parquet(os.path.join(args.filepath,args.filename))
    le = LabelEncoder()
    df['Activity'] = le.fit_transform(df['Activity'])
    df.drop(['date_time'],axis =1 ,inplace = True)

    # Assignining the indepeneded and depended variable 
    X = df.drop(['Activity'], axis =1)
    Y = df['Activity']
    
    # Extracting top 12 important feature and filtering the dataset
    k =12
    final_cols = get_top_k_features(X, Y, k)
    final_cols = np.append(final_cols,np.array(['Activity']))
    df = df[final_cols]
    
    # Train, test, validation split
    # Randomly sort the data then split out first 70%, second 20%, and last 10%
    train_data, validation_data, test_data = np.split(df.sample(frac=1, random_state=42), [int(0.8 * len(df)), int(0.9 * len(df))])  
    
    # Storing of train, validation and test datasets 
    pd.concat([train_data['Activity'], train_data.drop(['Activity'], axis=1)], axis=1).to_csv(os.path.join(args.outputpath, 'train/train.csv'), index=False, header=False)
    pd.concat([validation_data['Activity'], validation_data.drop(['Activity'], axis=1)], axis=1).to_csv(os.path.join(args.outputpath, 'validation/validation.csv'), index=False, header=False)
    test_data[['Activity']].to_csv(os.path.join(args.outputpath, 'test/test_y.csv'), index=False, header=False)
    test_data.drop(['Activity'], axis=1).to_csv(os.path.join(args.outputpath, 'test/test_x.csv'), index=False, header=False)
    
    ## Save Features columns
    dump(final_cols, os.path.join(args.outputpath, 'feature/feature.joblib'))
    ## Save Encoder
    dump(le, os.path.join(args.outputpath, 'feature/encoder.joblib'))
    
    print("## Processing complete. Exiting.")

In [None]:
train_path = f"s3://{bucket}/{prefix}/train"
validation_path = f"s3://{bucket}/{prefix}/validation"
test_path = f"s3://{bucket}/{prefix}/test"
feature_path = f"s3://{bucket}/{prefix}/feature"

In [None]:
# Importing necessary library for data processing 
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=get_execution_role(),
    instance_type="ml.m5.large",
    instance_count=1, 
    base_job_name='sklearn-ml-train'
)

sklearn_processor.run(
    code='preprocessing.py',
    inputs=[
        ProcessingInput(
            source=input_source, 
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_path,
        ),
        ProcessingOutput(output_name="validation_data", source="/opt/ml/processing/output/validation", destination=validation_path),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test", destination=test_path),
        ProcessingOutput(output_name="feature_data", source="/opt/ml/processing/output/feature", destination=feature_path)
    ]
)

## Training

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path.format(bucket, prefix), 
                                                    content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_path.format(bucket, prefix),
                                                     content_type='csv')

In [108]:
%%writefile sklearn-train.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from joblib import dump, load
import pandas as pd, numpy as np, os, argparse

# inference function - tells SageMaker how to load the model
def model_fn(model_dir):
    clf = load(os.path.join(model_dir, "model.joblib"))
    return clf

# Argument parser
def _parse_args():
    parser = argparse.ArgumentParser()
    # Hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="train.csv")
    parser.add_argument("--test-file", type=str, default="test.csv")
    # Parse the arguments
    return parser.parse_known_args()

# Main Training Loop
if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    # Load the dataset
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    # Separate X and y
    X_train, y_train = train_df.drop(train_df.columns[0], axis=1), train_df[train_df.columns[0]]
    X_test, y_test = test_df.drop(test_df.columns[0], axis=1), test_df[test_df.columns[0]]
    # Define the model and train it
    model = RandomForestClassifier(
        n_estimators=args.n_estimators, n_jobs=-1
    )
    model.fit(X_train, y_train)
    # Evaluate the model performances
    print(f'Model Accuracy: {accuracy_score(y_test, model.predict(X_test))}')
    dump(model, os.path.join(args.model_dir, 'model.joblib'))
    

Overwriting sklearn-train.py


In [118]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

# Define the Estimator from SageMaker (Script Mode)
sklearn_estimator = SKLearn(
    entry_point="sklearn-train.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[{"Name": "model_accuracy", "Regex": "Model Accuracy: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 120,
        "min-samples-leaf": 3,
        "test-file": "validation.csv"
    },
)

# Train the model (~5 minutes)
sklearn_estimator.fit({"train": s3_input_train, "test": s3_input_validation})

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: rf-scikit-2023-09-18-08-16-32-973


Using provided s3_resource
2023-09-18 08:16:33 Starting - Starting the training job...
2023-09-18 08:16:47 Starting - Preparing the instances for training...
2023-09-18 08:17:29 Downloading - Downloading input data........[34m2023-09-18 08:18:41,086 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2023-09-18 08:18:41,088 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-09-18 08:18:41,127 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2023-09-18 08:18:41,273 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-09-18 08:18:41,284 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-09-18 08:18:41,294 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-09-18 08:18:41,301 sagemaker-training-toolkit INFO     Invoking user script

## Hosting

In [119]:
# Deploying the model
sklearn_predictor = sklearn_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: rf-scikit-2023-09-18-08-21-35-397
INFO:sagemaker:Creating endpoint-config with name rf-scikit-2023-09-18-08-21-35-397
INFO:sagemaker:Creating endpoint with name rf-scikit-2023-09-18-08-21-35-397


-----!

## Prediction & Evaluation

In [120]:
sklearn_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [121]:
!aws s3 cp $test_path/test_x.csv ./tmp/test_x.csv
!aws s3 cp $test_path/test_y.csv ./tmp/test_y.csv

download: s3://sagemaker-studio-009676737623-l4vs7j0o0ib/mlops-level1-data/test/test_x.csv to tmp/test_x.csv
download: s3://sagemaker-studio-009676737623-l4vs7j0o0ib/mlops-level1-data/test/test_y.csv to tmp/test_y.csv


In [122]:
test_x = pd.read_csv('tmp/test_x.csv', names=[f'{i}' for i in range(10)])

In [123]:
test_x = pd.read_csv('tmp/test_x.csv', names=[f'{i}' for i in range(10)])
test_y = pd.read_csv('tmp/test_y.csv', names=['y'])


In [124]:
predictions = sklearn_predictor.predict(test_x)

In [125]:
from sklearn.metrics import classification_report
print(classification_report(test_y['y'].values,  predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1705
           1       0.81      0.82      0.82      1660
           2       0.75      0.56      0.64      1671
           3       0.43      0.53      0.47      1622
           4       0.81      0.72      0.76      1680
           5       0.56      0.63      0.59      1662

    accuracy                           0.71     10000
   macro avg       0.73      0.71      0.71     10000
weighted avg       0.73      0.71      0.72     10000



In [126]:
pd.crosstab(index=test_y['y'].values, columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

predictions,0,1,2,3,4,5
actuals,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,1705,0,0,0,0,0
1,0,1368,163,85,9,35
2,0,180,941,372,21,157
3,0,73,121,853,119,456
4,0,6,3,279,1217,175
5,0,60,34,392,137,1039
