# Split data and prepare windows

In [1]:
%run preprocess.ipynb
%run sys_configs.ipynb

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
np.random.seed(123)

This notebook contains functions to split the MyoGym dataset into train, validation and test sets and compute windows from each time series data stream.

## Separate test data

The data is split into train, validation and test data. To show that our proposed methods extend to unseen trainers, the test dataset contains the data for two trainers who do not feature in either the train or validation datasets.

The train and validation datasets are not separated at this stage. They are split after windowing.

In [3]:
idx = data.index.get_level_values("trainer").unique()
shuffled_idx = np.random.permutation(idx)

# Split into a combined train/validation dataset and a test dataset
comb_idxs = shuffled_idx[:8] # First 8 trainers
test_idxs = shuffled_idx[8:] # Last 2 trainers

# Create the 2 datasets.
comb = data.loc[comb_idxs]
test = data.loc[test_idxs]
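
A quick sanity check on a trainer-level split is to confirm that no trainer appears on both sides. The sketch below is a minimal illustration using NumPy only: `trainer_ids` is a synthetic stand-in for the `trainer` index level of `data`, and the 10-trainer, 3-rows-each layout is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic stand-in for the per-row trainer ids (10 trainers, 3 rows each)
trainer_ids = np.repeat(np.arange(10), 3)
shuffled = rng.permutation(np.unique(trainer_ids))

# Same 8/2 partition of trainers as in the cell above
comb_ids, test_ids = shuffled[:8], shuffled[8:]
comb_mask = np.isin(trainer_ids, comb_ids)
test_mask = np.isin(trainer_ids, test_ids)

# Every row belongs to exactly one partition, and no trainer is shared
assert not np.any(comb_mask & test_mask)
assert np.all(comb_mask | test_mask)
assert np.intersect1d(comb_ids, test_ids).size == 0
```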

## Define and apply windows

We use two windowing techniques for GAR tasks:

- Window technique 1: incorporates only a fixed number of the most recent observations, provided there is no activity changepoint in the window.
- Window technique 2: incorporates all time steps from the beginning of the time series for an activity up to the current time step, until the minimum of:
    - the end of the time series for that activity or;
    - a maximum time series length

Window technique 1 is used for the benchmark methods, while window technique 2 is used for methods in this work which incorporate historical information. Window technique 1 is implemented as a precomputed NumPy array, while window technique 2 is implemented as a generator owing to the size of the resulting dataset. Pictorial representations of each windowing technique are shown below.

Note that both window techniques break on reaching an activity changepoint. Every window is designed to map to a single class. There are no fuzzy classes.
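
Window technique 2 is not implemented in this section. As a rough illustration of the description above, the following generator yields windows that grow from the start of each activity segment and stop growing once a maximum length is reached. The function name `expanding_windows`, the `max_len` parameter and the array-based interface are assumptions made for this sketch, not the notebook's implementation.

```python
import numpy as np

def expanding_windows(values, labels, max_len=1000):
    """Yield (window, label) pairs that grow from the start of each activity segment."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    seg_start = 0
    for t in range(len(labels)):
        # A label change starts a new segment, so no window spans a changepoint
        if t > 0 and labels[t] != labels[t - 1]:
            seg_start = t
        length = t - seg_start + 1
        if length > max_len:
            # The segment has exceeded the maximum length: emit nothing further for it
            continue
        yield values[seg_start:t + 1], labels[t]

# Toy stream: two activities of three time steps each, one channel
demo_values = np.arange(6).reshape(6, 1)
demo_labels = np.array([0, 0, 0, 1, 1, 1])
demo_lengths = [len(w) for w, _ in expanding_windows(demo_values, demo_labels, max_len=2)]
```

With `max_len=2`, each three-step segment contributes windows of length 1 and 2 only, and every yielded window carries a single label.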

In [4]:
def window_method_1(data, sz = 250, step = 50):
    """
    Computes windows for each time series stream. 
    If there is a label change in the window then the entire window is discarded.
    We seek windows which have unambiguous non-conflicting class labels.
    """
    # Create a mask for where the activity changes.
    data["activity_mask"] = data['activity'].ne(data['activity'].shift())
       
    data_list = []
    label_list = []
    
    for t in data.index.get_level_values('trainer').unique():
        # Get all the data for this trainer
        trainer_data = data.loc[data.index.get_level_values('trainer') == t]
        
        # Obtain a list of all windows
        for start in range(0, len(trainer_data) - sz + 1, step):
            # Filter for the current window
            window = trainer_data.iloc[start:start + sz]
            
            if window.isna().values.any():
                # Skip windows with NaN values (the first and last few windows)
                continue
            
            if window.loc[:, "activity_mask"].values.any():
                # Skip windows where the activity changes during that window
                continue
            
            window_data = window.loc[:, ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]]
            window_label = window.loc[:, "activity"].unique()[0]
            
            data_list.append(window_data)
            label_list.append(window_label)
            
    # Convert to NumPy arrays
    data_np = np.array(data_list)
    label_np = np.array(label_list)
    
    return data_np, label_np            
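
The inner loop of `window_method_1` visits a fixed number of candidate windows per trainer before any are discarded for NaN values or changepoints. The short sketch below works through that count for an assumed stream of 1000 rows; `n_rows` is an illustrative value, not a property of the dataset.

```python
# Illustrative values: a 1000-row stream with the window size and step used in this notebook
n_rows, sz, step = 1000, 500, 50

# Start offsets visited by `range(0, len(trainer_data) - sz + 1, step)`
starts = list(range(0, n_rows - sz + 1, step))

# Closed form for the same count, valid when n_rows >= sz
n_candidates = (n_rows - sz) // step + 1

# Consecutive windows share sz - step rows
overlap = sz - step
```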

In [5]:
x1_comb, y1_comb = window_method_1(comb, sz = 500, step = 50)
x1_test, y1_test = window_method_1(test, sz = 500, step = 50)

In [6]:
print("Window Method 1 (Train/Validation): There are %s samples of window length %s and dimensionality %s." % (x1_comb.shape))
print("Window Method 1 (Test): There are %s samples of window length %s and dimensionality %s." % (x1_test.shape))

Window Method 1 (Train/Validation): There are 21927 samples of window length 500 and dimensionality 6.
Window Method 1 (Test): There are 4141 samples of window length 500 and dimensionality 6.


## Separate validation data

The combined (train, validation) data is now split into separate train and validation datasets. The validation dataset is taken from the same trainers as the train data. In most contexts involving time series, the held-out data is temporally separated from the train data. Here, for both window methods, we instead treat windows as independent, i.e. we ignore any autocorrelation that existed before the division into windows.

In [7]:
x1_train, x1_val, y1_train, y1_val = train_test_split(x1_comb, y1_comb, test_size=0.125, random_state=42)

## Data normalisation

Data normalisation brings all channels onto a comparable scale. Many of the time series classification techniques that we will study, for example Dynamic Time Warping, use distance metrics that depend on each dimension having the same scale. We will use sklearn's StandardScaler, which standardises each channel to zero mean and unit variance.

Standard scaling also makes outliers easier to identify, and lets us quantify their extent more rigorously using hypothesis tests based on a normal distribution.

In [8]:
scaler = StandardScaler()

N1_train, T , D = x1_train.shape
N1_val, _ , _ = x1_val.shape
N1_test, _ , _ = x1_test.shape

# Standard scaler works only for 2 dimensional data, so we melt the time dimension into the sample dimension
x1_train = x1_train.reshape(N1_train * T, D)
x1_val = x1_val.reshape(N1_val * T, D)
x1_test = x1_test.reshape(N1_test * T, D)

# Fit the standard scaler to the train data, then transform the validation and test data
x1_train = scaler.fit_transform(x1_train)
x1_val = scaler.transform(x1_val)
x1_test = scaler.transform(x1_test)

# Transform back to the N X T X D shape
x1_train = x1_train.reshape(N1_train, T, D)
x1_val = x1_val.reshape(N1_val, T, D)
x1_test = x1_test.reshape(N1_test, T, D)
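
Flattening the sample and time axes means each of the D channels is standardised with a single mean and standard deviation computed over all training windows and time steps. The NumPy-only sketch below shows the equivalent computation on synthetic arrays. The array names and sizes are illustrative, and it assumes no channel has zero variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (N, T, D) windows standing in for the train and validation arrays
demo_train = rng.normal(loc=5.0, scale=3.0, size=(20, 50, 6))
demo_val = rng.normal(loc=5.0, scale=3.0, size=(5, 50, 6))

# Per-channel statistics over the sample and time axes, from the training data only
mu = demo_train.mean(axis=(0, 1))
sigma = demo_train.std(axis=(0, 1))  # population standard deviation (ddof=0)

# The training statistics are reused for the validation data
demo_train_scaled = (demo_train - mu) / sigma
demo_val_scaled = (demo_val - mu) / sigma

# The same result via the reshape route used above
flat = demo_train.reshape(-1, demo_train.shape[-1])
flat_scaled = ((flat - flat.mean(axis=0)) / flat.std(axis=0)).reshape(demo_train.shape)
```

Only the training windows contribute to `mu` and `sigma`, so the validation channels will be close to, but not exactly, zero mean and unit variance.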

## Sample background activity class

According to the paper [1] which introduced the MyoGym dataset, the background activity class, which it describes as the null class, accounts for 77% of the dataset, a number which dwarfs the remaining 30 classes. Most of the techniques we explore are sensitive to class imbalances or to dataset sizes. Therefore, we keep only a random sample of the windows from this background activity class.

In [9]:
train_labels, train_counts = np.unique(y1_train, return_counts = True)
val_labels, val_counts = np.unique(y1_val, return_counts = True)
test_labels, test_counts = np.unique(y1_test, return_counts = True)

train_label_counts = pd.DataFrame(np.hstack([train_labels[:, np.newaxis], train_counts[:, np.newaxis]]), columns = ["Label", "Train Count"])
val_label_counts = pd.DataFrame(np.hstack([val_labels[:, np.newaxis], val_counts[:, np.newaxis]]), columns = ["Label", "Val Count"])
test_label_counts = pd.DataFrame(np.hstack([test_labels[:, np.newaxis], test_counts[:, np.newaxis]]), columns = ["Label", "Test Count"])

In [10]:
label_counts = train_label_counts.merge(test_label_counts, on = "Label").merge(val_label_counts, on = "Label")
label_counts["Label"] = label_counts["Label"].map(ACTIVITY_MAPPING)
label_counts = label_counts.set_index("Label")
label_counts.sort_values(["Train Count"], ascending=[False])

Label,Train Count,Test Count,Val Count
No activity identified,15833,3566,2261
Dumbbell Alternate Bicep Curl,221,31,23
Front Dumbbell Raise,195,27,28
Dumbbell Flyes,174,14,31
Hammer Curl,166,36,18
Incline Dumbbell Flyes,161,23,29
Spider Curl,141,24,13
Incline Dumbbell Press,136,19,18
Incline Hammer Curl,124,24,21
Bar Skullcrusher,122,30,19


In [11]:
def sample_noise_class(data: np.ndarray, labels: np.ndarray, sz: int):
    """
    Removes most samples from the dominant noise class, down to a sample size (sz) specified in this function.  
    """
    # Identify indices of the noise class and signal class
    noise_idx = np.where(labels == 99)[0]
    signal_idx = np.where(labels != 99)[0]

    # Choose a sample from the noise class
    sample_idx = np.random.choice(noise_idx, size = sz, replace=False)

    # Combine the sampled indices with the other class indices
    combined_idx = np.concatenate([signal_idx, sample_idx])

    # Apply the indexes to the data and labels
    data_sample = data[combined_idx, :, :]
    labels_sample = labels[combined_idx]
    
    return data_sample, labels_sample

In [12]:
x1s_train, y1s_train = sample_noise_class(data = x1_train, labels = y1_train, sz = 250)
x1s_val, y1s_val = sample_noise_class(data = x1_val, labels = y1_val, sz = 50)
x1s_test, y1s_test = sample_noise_class(data = x1_test, labels = y1_test, sz = 50)
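
Sampling without replacement raises a `ValueError` if the requested size exceeds the number of available background windows. The sketch below is a guarded variant on toy labels. The helper name `subsample_class` and its `target` and `rng` parameters are hypothetical, introduced only for this illustration.

```python
import numpy as np

def subsample_class(labels, target, sz, rng):
    """Return sorted indices keeping every other class and at most `sz` samples of `target`."""
    target_idx = np.flatnonzero(labels == target)
    other_idx = np.flatnonzero(labels != target)
    # Cap the request at the number of available samples
    keep = rng.choice(target_idx, size=min(sz, target_idx.size), replace=False)
    return np.sort(np.concatenate([other_idx, keep]))

rng = np.random.default_rng(0)

# Toy labels: 20 background windows (label 99) and 5 windows from two other classes
demo_labels = np.array([99] * 20 + [1] * 3 + [2] * 2)
kept = subsample_class(demo_labels, target=99, sz=5, rng=rng)
kept_labels, kept_counts = np.unique(demo_labels[kept], return_counts=True)
```

The returned indices can be applied to both the data and label arrays, which keeps the two aligned.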

## Save datasets

In [13]:
with open('data/1s_train.npy', 'wb') as f:
    np.save(f, x1s_train)
    np.save(f, y1s_train)
    
with open('data/1s_val.npy', 'wb') as f:
    np.save(f, x1s_val)
    np.save(f, y1s_val)
    
with open('data/1s_test.npy', 'wb') as f:
    np.save(f, x1s_test)
    np.save(f, y1s_test)

In [15]:
with open('data/1_train.npy', 'wb') as f:
    np.save(f, x1_train)
    np.save(f, y1_train)

with open('data/1_val.npy', 'wb') as f:
    np.save(f, x1_val)
    np.save(f, y1_val)
    
with open('data/1_test.npy', 'wb') as f:
    np.save(f, x1_test)
    np.save(f, y1_test)
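
Each file above holds two arrays written back to back, so they need to be read with two consecutive `np.load` calls on the same open file handle, in the order they were saved. The sketch below demonstrates this round trip on small synthetic arrays written to a temporary file rather than to the `data/` folder.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for a data array and its labels
demo_x = np.zeros((4, 500, 6))
demo_y = np.array([99, 1, 2, 99])

# Temporary location for the demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npy")

with open(demo_path, "wb") as f:
    np.save(f, demo_x)
    np.save(f, demo_y)

# Arrays come back in the order they were written
with open(demo_path, "rb") as f:
    loaded_x = np.load(f)
    loaded_y = np.load(f)
```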

## References

[1] Koskimäki, Heli, Pekka Siirtola and Juha Röning. “MyoGym: introducing an open gym data set for activity recognition collected using myo armband.” Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017): n. pag.