# Load & preprocess MyoGym data

In [1]:
from scipy.io import loadmat
import pandas as pd
import numpy as np

This notebook contains functions to load and preprocess the MyoGym dataset.

The MyoGym dataset was first introduced in [1]. It was collected using a Myo armband worn on the forearm. The armband has 8 electromyogram (EMG) sensors and a 9-axis IMU comprising a 3-axis gyroscope, a 3-axis accelerometer and a 3-axis magnetometer. In our work, we discard the EMG and magnetometer data and use only the gyroscope and accelerometer data. This leaves 6 streams of data, which together describe the linear acceleration and angular velocity of the arm along the x, y and z axes (see the diagram of the Myo armband below).

There are 2 label columns. The 1st indexes the activity and the 2nd indexes the trainer. The labels are provided at the timestamp level, so each workout carries a sequence of activity labels.
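Because the labels are given per timestamp, a workout can be summarised as a sequence of contiguous activity segments. The following is a minimal sketch of that idea on made-up labels; the label values and the `segment` column name are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy per-timestamp activity labels (illustrative values, not from MyoGym)
df = pd.DataFrame({"activity": [3, 3, 3, 7, 7, 3, 3]})

# A new segment starts whenever the label differs from the previous row
df["segment"] = (df["activity"] != df["activity"].shift()).cumsum()

# One row per contiguous run: which activity, and how many samples it lasted
segments = (
    df.groupby("segment")
    .agg(activity=("activity", "first"), length=("activity", "size"))
    .reset_index(drop=True)
)
print(segments)
```

Each row of `segments` corresponds to one uninterrupted run of the same activity label.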

## Load data

In [2]:
def load_MyoGym(path: str):
    """
    Load the MyoGym data.
    Name and select the accelerometer and gyroscope columns used in this work.
    Rename the 2 label columns.
    Concatenate the data and the labels.
    """
    datamat = loadmat(path)
    
    raw_data = pd.DataFrame(datamat["raw_data"])
    label_data = pd.DataFrame(datamat["raw_data_labels"])

    # Extract the accelerometer and gyroscope timestamps and features
    raw_data.rename(columns={9: "time_acc", 
                             10: "acc_x",
                             11: "acc_y",
                             12: "acc_z", 
                             13: "time_gyr", 
                             14: "gyr_x",
                             15: "gyr_y",
                             16: "gyr_z"
                            }, inplace=True)
    
    raw_data = raw_data[["time_acc", "acc_x", "acc_y", "acc_z", "time_gyr", "gyr_x", "gyr_y", "gyr_z"]]
    
    # The 1st label column is the activity and the 2nd label column is the trainer performing the exercise
    label_data.rename(columns={0:"activity", 
                               1: "trainer"
                              }, inplace=True)

    # Concatenate the raw_data and data labels
    data = pd.concat([raw_data, label_data],  axis=1)

    return data

## Synthesise *Time* column & remove duplicates

The buffering process produces duplicate readings (with identical timestamps). We sort the data by trainer, then by accelerometer timestamp, and remove the duplicate rows.

In [3]:
def sort_dedupe(data: pd.DataFrame):
    """
    Sort data and remove duplicates
    """
    data = data.sort_values(by=['trainer', 'time_acc'], ascending=True)
    data = data.drop_duplicates()
    return data
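As a quick illustration of the sort-and-deduplicate step, here is a small sketch on a toy frame. The values are invented, and only a subset of the real columns is used. Note that `drop_duplicates` with no `subset` argument compares every column, so only rows that are identical in all fields are dropped.

```python
import pandas as pd

# Toy frame with one exact duplicate row for trainer 1 (invented values)
toy = pd.DataFrame({
    "trainer": [2, 1, 1, 1],
    "time_acc": [0.00, 0.02, 0.00, 0.02],
    "acc_x": [0.5, 0.1, 0.3, 0.1],
})

# Same two operations as sort_dedupe: sort, then drop fully identical rows
cleaned = toy.sort_values(by=["trainer", "time_acc"], ascending=True)
cleaned = cleaned.drop_duplicates()
print(cleaned)
```

The repeated `(1, 0.02, 0.1)` reading appears only once in `cleaned`, and the rows are ordered by trainer and then by timestamp.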

The data are provided in continuous streams, identified by the *trainer* column. The *time_acc* and *time_gyr* columns record the stream arrival times of the accelerometer and gyroscope data respectively. The mechanism behind this is unclear, but the sensor data are buffered and streamed in packets, so the arrival times are not always equidistant. Both instruments record at 50 Hz, so we create a synthetic time column for later use by numbering the samples within each stream and dividing by the sampling frequency.
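One way to see the uneven arrival times is to look at the gaps between consecutive timestamps within each stream. The sketch below uses invented timestamps in arbitrary units (the unit of *time_acc* is an assumption here), and the `gap` column name is hypothetical.

```python
import pandas as pd

# Invented arrival times in arbitrary units; real values will differ
toy = pd.DataFrame({
    "trainer": [1, 1, 1, 1, 2, 2, 2],
    "time_acc": [0, 20, 20, 70, 0, 30, 40],
})

# Gap to the previous sample within the same stream (NaN for the first sample)
toy["gap"] = toy.groupby("trainer")["time_acc"].diff()

# Smallest and largest gap per stream; unequal values indicate uneven arrivals
gap_summary = toy.groupby("trainer")["gap"].agg(["min", "max"])
print(gap_summary)
```

A minimum gap of zero would correspond to samples that arrived with the same timestamp.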

In [4]:
def define_time(data: pd.DataFrame, fq: int = 50):
    """
    Define a synthetic timestamp identifier and delete the sensor arrival times
    """
    data["time"] = data.groupby("trainer").cumcount()
    data["time"] /= fq

    # `columns=` already selects the axis, so no `axis` argument is needed
    data.drop(columns=["time_acc", "time_gyr"], inplace=True)
    data.set_index(["trainer", "time"], inplace=True)
    return data
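To make the synthetic time column concrete, the sketch below applies the same logic as `define_time` to a toy frame (invented values, 50 Hz assumed as in the text). The sample counter restarts at zero for each trainer, so every stream gets its own evenly spaced time axis.

```python
import pandas as pd

fq = 50  # sampling frequency in Hz, as stated in the text

# Toy frame: three samples for trainer 1, two for trainer 2 (invented values)
toy = pd.DataFrame({
    "trainer": [1, 1, 1, 2, 2],
    "acc_x": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Number the samples within each stream, then convert the count to seconds
toy["time"] = toy.groupby("trainer").cumcount() / fq
toy = toy.set_index(["trainer", "time"])
print(toy)
```

Consecutive samples within a stream are spaced 1 / 50 = 0.02 seconds apart, regardless of the original arrival times.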

## Run Script

In [None]:
data = load_MyoGym("data/MyoGym.mat")
data = sort_dedupe(data)
data = define_time(data)

## References

[1] Koskimäki, Heli, Pekka Siirtola and Juha Röning. “MyoGym: introducing an open gym data set for activity recognition collected using myo armband.” Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017).