## CS328 Assignment 3: Human Activity Recognition Using Machine Learning

The objective of this assignment is to develop your understanding and practical application skills in the field of machine learning, particularly focusing on Human Activity Recognition (HAR). Human Activity Recognition has become increasingly important in various fields including health care, personal fitness, smart homes, and surveillance. In this assignment, you will be working with a dataset collected from a wrist-worn device, which includes accelerometer data. The data has been classified into nine distinct activities: downstairs, jogging, lying, sitting, standing, upstairs, walking fast, walking moderate, and walking slow.

You will get hands-on experience in several key aspects of machine learning and data processing:
1. Data collection and preprocessing from multiple sources for analysis
2. Creating windows of raw data through resampling and extracting meaningful features from these windows
3. Extracting features, encoding target variables, splitting datasets, training models, making predictions, and evaluating the performance of these models
4. Assessing your trained model's performance using common metrics such as accuracy and the confusion matrix. Understanding how well your models perform is crucial in any machine learning task.

Through this assignment, you will be able to develop a pipeline for Human Activity Recognition that can be further fine-tuned for different tasks or datasets. You will be able to apply the skills and knowledge gained here to other machine learning projects and real-world applications.


#### Imports Block

Make sure all imports are in the block below, and leave the two comments `IMPORTS START` and `IMPORTS END` as they are. The extractor script uses the START and END delimiters when extracting the functions.

In [106]:
# -- IMPORTS START --
import pandas as pd
import glob
import re
import os
import sys
import pickle
import numpy as np
import matplotlib.pyplot as plt

from datetime import datetime
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
from scipy.signal import butter, filtfilt, find_peaks
from sklearn.tree import DecisionTreeClassifier,export_graphviz
# -- IMPORTS END --

# enable zooming into graphs
%matplotlib notebook
plt.rcParams['figure.figsize'] = [9, 6] # width, height in inches

### Helper Function: viz_tree (do not modify)

In [None]:
# Helper function to visualize model - Do not modify
def viz_tree(dt_model, features_frames, cnames):
    # Fix feature names as list
    feature_names = features_frames.columns.tolist()

    fig, ax = plt.subplots(figsize=(9,4))
    tree.plot_tree(dt_model,
                   feature_names=feature_names,
                   fontsize=7,
                   class_names=cnames,
                   filled=True,
                   ax=ax)

    plt.title('Decision Tree')
    plt.savefig('dt.png')

### Helper Function: calc_magnitude (same as prev assignment - do not modify)

In [108]:
#Do not modify
def calc_magnitude(data):

    # Calculate magnitude
    data['accel_mag'] = np.sqrt(data['x']**2 + data['y']**2 + data['z']**2) # absolute accel magnitude
    data['accel_mag'] = data['accel_mag'] - data['accel_mag'].mean() # detrend: "remove gravity"

    return data

### Helper Function: remove noise (same as prev assignment - do not modify)

In [109]:
#Do not modify
def remove_noise(data,sampling_rate):
    from scipy.signal import butter, filtfilt, find_peaks

    # Low pass filter
    cutoff = 5 # Hz
    order = 2
    b, a = butter(order, cutoff/(sampling_rate/2), btype='lowpass')
    data['filtered_accel_mag'] = filtfilt(b, a, data['accel_mag'])

    return data

### Helper Function: add_features (from prev assignment; do not modify)

In [110]:
#Do not modify
def add_features(window):
    features = {}
    features['avg'] = window['filtered_accel_mag'].mean()
    features['max'] = window['filtered_accel_mag'].quantile(1)
    features['med'] = window['filtered_accel_mag'].quantile(0.5)
    features['min'] = window['filtered_accel_mag'].quantile(0)
    features['q25'] = window['filtered_accel_mag'].quantile(0.25)
    features['q75'] = window['filtered_accel_mag'].quantile(0.75)
    features['std'] = window['filtered_accel_mag'].std()
    # Build the one-row frame directly instead of calling the private _append method
    df = pd.DataFrame([features])
    return df

### Helper Function: train_decision_tree (from prev assignment; do not modify)

In [111]:
def train_decision_tree(frames):
    # Extract feature columns
    X = frames[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']]

    # Extract target column
    y = frames['activity']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create model
    dt_model = DecisionTreeClassifier(criterion='entropy',max_depth=5).fit(X_train, y_train)
    dt_pred = dt_model.predict(X_test)

    # Evaluate on test set
    acc = dt_model.score(X_test, y_test)
    dt_cm = confusion_matrix(y_test, dt_pred, labels=dt_model.classes_)
    print(classification_report(y_test, dt_pred))
    print("Accuracy on test set:", acc)

    return dt_model,dt_cm,acc

### Assignment Function 1: extract_features(data, window_sec, sample_rate, activity)

**Instructions:**
Define the `extract_features` function that extracts features from accelerometer data by applying a sliding window approach. 

**This is almost the same as extract_features from Assignment 2 except that you are given the `activity` label as input**

**Hints and Instructions:**
Use the `pandas.DataFrame.resample` function to implement the sliding window approach. Remember to use the computed features when appending to the new DataFrame.

- The function takes as arguments:
  - `data`: a DataFrame containing the filtered acceleration magnitude signal and annotated step locations
  - `window_sec`: the window length in seconds for feature extraction
  - `sample_rate`: the sampling frequency of the accelerometer data
  - `activity`: the activity label for each window you extract
- The function will perform the following:
  - For each window in the resampled data:
    - Call the `add_features` function to compute features of the window
  - Append the computed features to a new DataFrame and return this DataFrame
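
The windowing idea described above can be tried on a synthetic signal first. The snippet below is only a sketch, not the expected solution: `toy`, `toy_window_features`, and the 50 Hz rate are made up for illustration, and it assumes the DataFrame has a `DatetimeIndex` (with a `time` column you would set it as the index or pass `on='time'` to `resample`).

```python
import numpy as np
import pandas as pd

# Synthetic 10 s signal sampled at 50 Hz (one sample every 20 ms).
sample_rate = 50
n_samples = 10 * sample_rate
index = pd.date_range("2024-01-01", periods=n_samples, freq="20ms")
toy = pd.DataFrame(
    {"filtered_accel_mag": np.sin(np.linspace(0, 20, n_samples))}, index=index
)

def toy_window_features(window):
    # Hypothetical stand-in for add_features: one row of summary statistics.
    col = window["filtered_accel_mag"]
    return pd.DataFrame([{"avg": col.mean(), "std": col.std()}])

window_sec = 2
rows = []
# Iterating over a resampler yields (window start, window DataFrame) pairs.
for _, window in toy.resample(f"{window_sec}s"):
    if window.empty:
        continue
    feats = toy_window_features(window)
    feats["activity"] = "walk_slow"  # label supplied by the caller
    rows.append(feats)

toy_features = pd.concat(rows, ignore_index=True)
print(toy_features.shape)
```

Each 2-second window contributes exactly one row, so the number of rows equals the number of non-empty windows.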

**Rubrics:**

Function 1 : 15%
1. Correctly calls `add_features` and adds the features to the DataFrame (60%)
2. Adds the activity label to the dataframe (20%)
3. Appends and returns the correct DataFrame (20%)



In [112]:
# Function to extract windows and features
def extract_features(data, window_sec, sample_rate, activity):
    pass

### Assignment Function 2: all_data_to_combined_csv(root, output_filename)

This function is expected to process data collected from different activities, extract features and store everything into a combined CSV file.

**Hint and Instructions:**
Understand how the `glob`, `pandas` and `os` modules work. Understand how the `calc_magnitude` and `remove_noise` functions work. Use `os.path.basename(filename)` to extract the activity.

- Write a function named `all_data_to_combined_csv(root, output_filename)`.
- This function will process the data collected from the various activities: downstairs, jogging, lying, sitting, standing, upstairs, and walking fast, moderate and slow. Each activity's data is stored in .csv files within its own folder.
- `root` is the root folder of all the files.
- `output_filename` is the filename of the combined CSV file. The file is written to `root/output_filename`.
- Use the glob module to create a list of all .csv files from each activity's directory.
- Initialize a pandas DataFrame `all_data` to store the processed data from all activities.
- Loop over each activity's data files. For each file:
    - Read the .csv file into a pandas DataFrame.
    - From the raw data, calculate the magnitude and remove any noise.
    - Extract the activity type for the file.
    - Extract features using the `extract_features()` function.
    - Append the feature frames to the `all_data` DataFrame.
- Continue this process until all .csv files from all activities have been processed.
- Finally, write the `all_data` DataFrame to a new .csv file named `output_filename` ('all_data.csv' by default).
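
As a rough illustration of the file-handling part only, the sketch below builds a throwaway folder tree and combines its CSV files. The folder names, the `recording.csv` filename, and the single `x` column are invented for the example, and it assumes the activity label is the name of the folder that contains each file; if your filenames carry the label instead, `os.path.basename(filename)` is the place to start. The preprocessing and feature extraction steps are deliberately left out.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a temporary tree shaped like root/<activity>/<file>.csv.
root = tempfile.mkdtemp()
for activity in ["jogging", "sitting"]:
    os.makedirs(os.path.join(root, activity))
    pd.DataFrame({"x": [0.0, 1.0]}).to_csv(
        os.path.join(root, activity, "recording.csv"), index=False
    )

frames = []
# The pattern only matches files one folder below root, so a combined CSV
# written directly into root is not picked up on a later run.
for filename in sorted(glob.glob(os.path.join(root, "*", "*.csv"))):
    df = pd.read_csv(filename)
    # Assumption: the parent folder name is the activity label.
    df["activity"] = os.path.basename(os.path.dirname(filename))
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
output_path = os.path.join(root, "all_data.csv")
combined.to_csv(output_path, index=False)
print(combined["activity"].value_counts().to_dict())
```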

**Rubrics:** 

Function 2: 15%
1. Correctly loops over all the files and reads each CSV (40%)
2. Preprocesses the data and adds the activity column (40%)
3. Appends and returns the correct DataFrame (20%)





In [113]:
def all_data_to_combined_csv(root, output_filename = 'all_data.csv'):
    pass

## Assignment: Collect your own data

In this section, you will collect your own data for the different activities.

Each team member should collect one minute of data for each activity: `downstairs`, `jogging`, `lying`, `sitting`, `standing`, `upstairs`, `walk_fast`, `walk_mod` and `walk_slow`.

Follow the same folder structure as `data/Activities` and store the CSV files under `MyData/*`.

The folder layout should look like this:
- data
    - Activities
    - MyData
        - downstairs
        - jogging
        - lying
        - sitting
        - standing
        - upstairs
        - walk_fast
        - walk_mod
        - walk_slow

Because the sensor logger data records time as UTC ticks rather than datetimes, the time column must be converted first. Feel free to modify the function below so that all the collected data has the same datetime format as the provided data.
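
A quick way to check the conversion on a couple of values is shown below. The timestamps are made up, and the sketch assumes the ticks are integer nanoseconds since the Unix epoch; if your logger uses a different unit, change the `unit` argument accordingly.

```python
import pandas as pd

# Hypothetical raw timestamps: integer nanoseconds since the Unix epoch,
# 20 ms apart.
raw = pd.DataFrame(
    {"time": [1_700_000_000_000_000_000, 1_700_000_000_020_000_000]}
)
raw["time"] = pd.to_datetime(raw["time"], unit="ns")

# The gap between consecutive samples should match the sampling interval.
gap = raw["time"].iloc[1] - raw["time"].iloc[0]
print(raw["time"].dtype, gap)
```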

**Rubrics:**
Function 3: 20%
1. Collect your own data (100%)

In [114]:
def transform_time_to_datetime(root):
    
    # Rewrite every CSV under root/<activity>/ in place, converting the
    # numeric nanosecond values in the 'time' column to datetimes.
    # Get list of all activity folders
    activity_folders = os.listdir(root)
    # print(activity_folders)

    for folder in activity_folders:
        # print(folder)
        files = glob.glob(f"{root}/{folder}/*.csv")
        for filename in files:
            # print(filename)
            df = pd.read_csv(filename, parse_dates=['time'])
            df['time'] = pd.to_datetime(pd.to_numeric(df['time']), unit='ns')
            df.to_csv(filename, index=False)

In [None]:
transform_time_to_datetime('./data/MyData')

## Assignment: Testing Classifier Performance

In this section, we will evaluate the overall performance of our activity classifier using a combined dataset of all activities. This process helps understand how accurately the model can classify a wide variety of activities. The steps are as follows:

1. **Data Generation**: We start by calling the `all_data_to_combined_csv()` function. This function processes all the individual activity datasets, applies the necessary preprocessing, and generates a combined CSV file named 'all_data.csv'. Note that the function only creates this file once, so if you need to recreate it with updated preprocessing or feature extraction, you should delete the existing file first.

2. **Data Loading**: Once we have the combined CSV file, we load it into a pandas DataFrame for further use.

3. **Activity Selection**: Next, we choose the subset of activities to keep by listing them in the `activities` list; rows for all other activities are filtered out of the dataset. The listed activities are the ones our decision tree model tries to classify. You can experiment with different subsets of activities to see how this affects the model's accuracy.

4. **Model Training**: The `train_decision_tree` function is then called with the chosen classes. This function trains the decision tree model and evaluates its performance, printing the precision, recall, and accuracy metrics. It also returns the trained model, the confusion matrix, and the overall accuracy for further use.

5. **Performance Visualization**: To provide a visual understanding of the model's performance, we display the confusion matrix using scikit-learn's `ConfusionMatrixDisplay` class. Each row of the matrix corresponds to a true class, while each column corresponds to a predicted class. The diagonal elements represent correctly classified instances, and off-diagonal elements are instances that are misclassified.

6. **Decision Tree Visualization**: Finally, we visualize the decision tree using the `viz_tree` function. This function generates a graphic representation of the decision tree model, showing how it makes decisions based on the features.
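
The row and column convention from step 5 can be checked on a tiny hand-made example. The labels and predictions below are invented purely for illustration and do not come from the dataset.

```python
from sklearn.metrics import confusion_matrix

# Five made-up windows: what they really were vs. what a model predicted.
y_true = ["sitting", "sitting", "standing", "standing", "standing"]
y_pred = ["sitting", "standing", "standing", "standing", "sitting"]
labels = ["sitting", "standing"]

# Rows follow the true class, columns follow the predicted class,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Here the first row describes the two true `sitting` windows (one predicted correctly, one predicted as `standing`), and the second row describes the three true `standing` windows.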

This section's output can help understand how well the model generalizes to different activities and how different features influence the model's decisions. Remember that it's normal if the model performs better on some activities than others, depending on the complexity and distinctiveness of the activities.

We have provided an example of how to evaluate the model on your own data; feel free to modify it.


In [115]:
def evaluate(dt_model, filtered_collected_data):
    X_test = filtered_collected_data[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']]

    # Extract target column
    y_test = filtered_collected_data['activity']

    dt_pred = dt_model.predict(X_test)
    # Evaluate on test set
    acc = dt_model.score(X_test, y_test)
    # dt_cm = confusion_matrix(y_test, dt_pred, labels=dt_model.classes_)
    print(classification_report(y_test, dt_pred))
    print("Accuracy on test set:", acc)

In [116]:
# Combine all the data under data/Activities
all_data_to_combined_csv(root='./data/Activities')

# Combine all the data collected from team members
all_data_to_combined_csv(root='./data/MyData')

feature_frames = pd.read_csv('./data/Activities/all_data.csv')
collected_frames = pd.read_csv('./data/MyData/all_data.csv')

# Part 2: Experimenting with Different Activity Combinations

In this task, we want you to experiment with different combinations of activities and observe the impact on the decision tree model's accuracy. Understanding how varying the activities impacts the model's performance can provide valuable insight into the distinctiveness of the movements and their complexity.

1. **Three Types of Walking**: To do this, modify the `activities` list to only include 'walk_fast', 'walk_mod', and 'walk_slow'. Run the training code and note the accuracy.

2. **Stairs Activities**: Next, modify the `activities` list to only include 'upstairs' and 'downstairs'. Run the training code and note the accuracy.

3. **Static Activities**: For this run, modify the `activities` list to only include 'lying', 'sitting', and 'standing'. Run the training code and note the accuracy.

4. **Mobile Activities**: Now, consider all the activities involving movement i.e. exclude 'lying', 'sitting', and 'standing' in the `activities` list. Run the training code and note the accuracy.

5. **All Activities**: Finally, all activities are included in the training process. Run the training code and note the accuracy.

For each experiment, set the depth of the decision tree to 5.
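
The activity combinations above all rely on the same filtering step. A minimal sketch of that step on a toy table follows; `toy_frames`, the `combos` dictionary, and the feature values are hypothetical, and the real rows would come from the combined CSV.

```python
import pandas as pd

# Toy feature table with one made-up feature column.
toy_frames = pd.DataFrame({
    "avg": [0.1, 0.2, 0.9, 1.1, 0.0, 0.05],
    "activity": ["walk_slow", "walk_fast", "upstairs",
                 "downstairs", "lying", "sitting"],
})

combos = {
    "walking": ["walk_fast", "walk_mod", "walk_slow"],
    "stairs": ["upstairs", "downstairs"],
    "static": ["lying", "sitting", "standing"],
}

subsets = {}
for name, activities in combos.items():
    # isin builds a boolean mask that is True for rows to keep.
    mask = toy_frames["activity"].isin(activities)
    subsets[name] = toy_frames[mask]
    print(name, sorted(subsets[name]["activity"].unique()))
```

The mask must be built from the same DataFrame it is applied to; a mask computed on one table does not generally line up with the rows of another.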

### What To Report
For each of the above combinations,
- **Fill up** the table below based on the results you obtain for the different activity combinations (in each training cell, only modify the first line, the `activities` list). We will evaluate the accuracy of the numbers you fill in.
- **Interpretation of the results**: What do these accuracy scores suggest about the ability of the model to distinguish between these activities? Do some activities appear to be more distinguishable than others? How do different combinations of activities affect the accuracy? Remember to provide a brief discussion for each point.


**Rubrics:**
Part 2:
Function 3: 50% - Each combination - 10%
1. Accuracy, precision, and recall table filled up
2. Accuracy, precision, and recall table for your collected data filled up
3. Interpretation of the results
4. Compare the obtained results in terms of precision, recall, and accuracy, and provide your interpretation of these results. For example, if your precision is lower than accuracy, why do you think that is the case? Or if your recall is higher than precision, what might that indicate based on your experience, and so on.

Hint: You can use the confusion matrix to compute precision and recall.
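
One way to follow the hint is sketched below with a made-up 3-class confusion matrix; the numbers are illustrative, not results from the dataset. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums: times each class was predicted
recall = true_positives / cm.sum(axis=1)     # row sums: times each class truly occurred
accuracy = true_positives.sum() / cm.sum()

print("precision per class:", precision.round(3))
print("recall per class:", recall.round(3))
print("accuracy:", accuracy)
```

Averaging the per-class values gives the macro-averaged precision and recall, which should agree with the `macro avg` row printed by `classification_report` when every class appears in the test set.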

In [None]:
activities = ['walk_fast','walk_mod','walk_slow']

# Keep only the rows whose activity is in the chosen list
filtered_feature_frames = feature_frames[feature_frames['activity'].isin(activities)]
filtered_collected_frames = collected_frames[collected_frames['activity'].isin(activities)]

# Train the decision tree with the chosen classes
# This function will print out precision/recall/accuracy
dt_model, dt_cm, acc = train_decision_tree(filtered_feature_frames)

# Save the classifier to disk. The name should be exactly dt_model.pkl
with open(f'dt_model-{str(activities)}.pkl', 'wb') as f:
    pickle.dump(dt_model, f)

# Display the confusion matrix
cm_display = ConfusionMatrixDisplay(confusion_matrix = dt_cm, display_labels=dt_model.classes_)
cm_display.plot()
plt.show()

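# Visualize the tree, labelling nodes with the model's own feature columns and classes
viz_tree(dt_model, filtered_feature_frames[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']], dt_model.classes_.tolist())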

evaluate(dt_model, filtered_collected_frames)



In [None]:
activities = ['upstairs','downstairs']

# Keep only the rows whose activity is in the chosen list
filtered_feature_frames = feature_frames[feature_frames['activity'].isin(activities)]
filtered_collected_frames = collected_frames[collected_frames['activity'].isin(activities)]

# Train the decision tree with the chosen classes
# This function will print out precision/recall/accuracy
dt_model, dt_cm, acc = train_decision_tree(filtered_feature_frames)

# Save the classifier to disk. The name should be exactly dt_model.pkl
with open(f'dt_model-{str(activities)}.pkl', 'wb') as f:
    pickle.dump(dt_model, f)

# Display the confusion matrix
cm_display = ConfusionMatrixDisplay(confusion_matrix = dt_cm, display_labels=dt_model.classes_)
cm_display.plot()
plt.show()

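# Visualize the tree, labelling nodes with the model's own feature columns and classes
viz_tree(dt_model, filtered_feature_frames[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']], dt_model.classes_.tolist())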

evaluate(dt_model, filtered_collected_frames)



In [None]:
activities = ['sitting','lying','standing']

# Keep only the rows whose activity is in the chosen list
filtered_feature_frames = feature_frames[feature_frames['activity'].isin(activities)]
filtered_collected_frames = collected_frames[collected_frames['activity'].isin(activities)]

# Train the decision tree with the chosen classes
# This function will print out precision/recall/accuracy
dt_model, dt_cm, acc = train_decision_tree(filtered_feature_frames)

# Save the classifier to disk. The name should be exactly dt_model.pkl
with open(f'dt_model-{str(activities)}.pkl', 'wb') as f:
    pickle.dump(dt_model, f)

# Display the confusion matrix
cm_display = ConfusionMatrixDisplay(confusion_matrix = dt_cm, display_labels=dt_model.classes_)
cm_display.plot()
plt.show()

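# Visualize the tree, labelling nodes with the model's own feature columns and classes
viz_tree(dt_model, filtered_feature_frames[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']], dt_model.classes_.tolist())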

evaluate(dt_model, filtered_collected_frames)



In [None]:
activities = ['upstairs', 'walk_fast', 'walk_mod', 'walk_slow', 'downstairs', 'jogging']

# Keep only the rows whose activity is in the chosen list
filtered_feature_frames = feature_frames[feature_frames['activity'].isin(activities)]
filtered_collected_frames = collected_frames[collected_frames['activity'].isin(activities)]

# Train the decision tree with the chosen classes
# This function will print out precision/recall/accuracy
dt_model, dt_cm, acc = train_decision_tree(filtered_feature_frames)

# Save the classifier to disk. The name should be exactly dt_model.pkl
with open(f'dt_model-{str(activities)}.pkl', 'wb') as f:
    pickle.dump(dt_model, f)

# Display the confusion matrix
cm_display = ConfusionMatrixDisplay(confusion_matrix = dt_cm, display_labels=dt_model.classes_)
cm_display.plot()
plt.show()

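# Visualize the tree, labelling nodes with the model's own feature columns and classes
viz_tree(dt_model, filtered_feature_frames[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']], dt_model.classes_.tolist())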
evaluate(dt_model, filtered_collected_frames)



In [None]:
activities = ['upstairs', 'walk_fast', 'walk_mod', 'walk_slow', 'downstairs', 'jogging', 'standing', 'lying', 'sitting']

# Invert mask to keep only other rows
filtered_feature_frames = feature_frames[feature_frames['activity'].isin(activities)]
filtered_collected_frames = collected_frames[feature_frames['activity'].isin(activities)]

# Train the decision tree with the chosen classes
# This function will print out precision/recall/accuracy
dt_model, dt_cm, acc = train_decision_tree(filtered_feature_frames)

# Save the classifier to disk. The name should be exactly dt_model.pkl
with open(f'dt_model-{str(activities)}.pkl', 'wb') as f:
    pickle.dump(dt_model, f)

# Display the confusion matrix
cm_display = ConfusionMatrixDisplay(confusion_matrix = dt_cm, display_labels=dt_model.classes_)
cm_display.plot()
plt.show()

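# Visualize the tree, labelling nodes with the model's own feature columns and classes
viz_tree(dt_model, filtered_feature_frames[['avg', 'max', 'med', 'min', 'q25', 'q75', 'std']], dt_model.classes_.tolist())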
evaluate(dt_model, filtered_collected_frames)



**For the data in the dataset:**

| Model trained on | Accuracy | Precision | Recall |
|-|-|-|-|
| Three Types of Walking | **FILL UP** |**FILL UP** |**FILL UP** |
| Stairs Activities | **FILL UP** |**FILL UP** |**FILL UP** |
| Static Activities | **FILL UP** |**FILL UP** |**FILL UP** |
| Mobile Activities | **FILL UP** |**FILL UP** |**FILL UP** |
| All Activities | **FILL UP** |**FILL UP** |**FILL UP** |

**For your own data:**

| Model trained on | Accuracy | Precision | Recall |
|-|-|-|-|
| Three Types of Walking | **FILL UP** |**FILL UP** |**FILL UP** |
| Stairs Activities | **FILL UP** |**FILL UP** |**FILL UP** |
| Static Activities | **FILL UP** |**FILL UP** |**FILL UP** |
| Mobile Activities | **FILL UP** |**FILL UP** |**FILL UP** |
| All Activities | **FILL UP** |**FILL UP** |**FILL UP** |