# Split data and prepare windows

In [1]:
%run preprocess.ipynb
%run sys_configs.ipynb

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(123)

This notebook contains functions to split the MyoGym dataset into train, validation and test sets and compute windows from each time series data stream.

In [3]:
N = len(data) # Number of rows (time steps) in the data stream
step = 50 # Step size forward through the dataframe

In [4]:
columns = list(data.columns)
print("The ordered list of columns is {}".format(columns))
data = data.to_numpy()

The ordered list of columns is ['acc_x', 'acc_y', 'acc_z', 'gyr_x', 'gyr_y', 'gyr_z', 'activity', 'trainer', 'time']


In most contexts involving time series, the validation dataset is temporally separated from the train dataset. In both window methods, we consider windows to be independent, i.e. we can ignore any information from temporally preceding windows.

There are two windowing methods we will deploy for GAR tasks. We will then normalise the data.

Data normalisation adjusts all channels to a common scale. Many time series classification techniques that we will consider as benchmarks, for example Dynamic Time Warping, use distance metrics that depend on each dimension having the same scale. We use sklearn's StandardScaler, which standardises each channel to zero mean and unit variance.

Standard scaling also makes outliers easier to identify and lets us quantify how extreme they are more rigorously, for example with hypothesis tests that assume a normal distribution.
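
As a rough illustration of the outlier point, the sketch below standardises a single synthetic channel and flags values whose z-score exceeds a threshold. The data, the injected extreme value and the threshold of 3 standard deviations are all assumptions made for this example, not values taken from MyoGym.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single channel with one injected extreme value at index 10
channel = rng.normal(loc=0.0, scale=1.0, size=1000)
channel[10] = 25.0

# Standardise: after this step each value is a z-score
z = (channel - channel.mean()) / channel.std()

# Flag values more than 3 standard deviations from the mean (assumed rule of thumb)
outliers = np.where(np.abs(z) > 3.0)[0]
print(outliers)
```

Because the values are expressed in standard deviations, the same threshold can be applied to every channel regardless of its original units.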

### Applying Sliding Windows

A sliding window incorporates a fixed number of observations. A window is kept only if neither the activity nor the trainer changes within it.

In [5]:
T = 500 # Window length is 500

In [6]:
indexer = np.arange(T)[None, :] + np.arange(start = 0, stop = N-T, step = step)[:, None]

In [7]:
windows = data[indexer]
print("The shape of windows {}".format(windows.shape))

The shape of windows (32170, 500, 9)
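
The broadcasting trick behind `indexer` is easier to follow on a small array. The sketch below uses a toy stream with made-up sizes (the `_toy` names are hypothetical) and the same `stop = N - T` convention as the cell above.

```python
import numpy as np

# Toy stream: 10 time steps, 2 channels (values chosen so that rows are easy to trace)
toy = np.arange(20).reshape(10, 2)
N_toy, T_toy, step_toy = len(toy), 4, 2

# Window start positions, using stop = N - T as in the notebook
starts = np.arange(start=0, stop=N_toy - T_toy, step=step_toy)  # [0, 2, 4]

# Broadcast a (1, T) row of offsets against a (num_windows, 1) column of starts
toy_indexer = np.arange(T_toy)[None, :] + starts[:, None]

# Integer-array indexing gathers every window in one operation
toy_windows = toy[toy_indexer]

print(toy_indexer)
print(toy_windows.shape)  # (3, 4, 2)
```

In this toy case the window starting at index 6 is not generated, because `np.arange` excludes its `stop` value; `stop = N - T + 1` would include it.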


In [8]:
activity_counts = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=windows[:, :, 6])
trainer_counts = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=windows[:, :, 7])

Exclude all windows in which either the activity or the trainer changes. Each window should be *pure*, i.e. it should be mappable to exactly one trainer and one activity. 

In [9]:
pure = (activity_counts == 1) & (trainer_counts == 1) # Windows with exactly one activity and one trainer

x = windows[pure, :, :6] # Acceleration & gyroscope data
y = windows[pure, 0, 6] # Activities. Take the first element of each window (every element is the same)
t = windows[pure, 0, 7] # Trainers. Take the first element of each window (every element is the same)
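
The purity test can be checked on a toy label array. The sketch below, which uses hypothetical activity codes, compares the unique-count approach with an equivalent vectorised test that may be faster on large arrays.

```python
import numpy as np

# Toy label channel for 3 windows of length 4 (hypothetical activity codes)
labels = np.array([
    [0, 0, 0, 0],  # pure
    [0, 0, 5, 5],  # activity changes mid-window
    [5, 5, 5, 5],  # pure
])

# Count unique values per window, as in the notebook
unique_counts = np.apply_along_axis(lambda row: len(np.unique(row)), axis=1, arr=labels)

# Equivalent vectorised test: every element equals the first element of its window
is_pure = (labels == labels[:, :1]).all(axis=1)

print(unique_counts == 1)  # [ True False  True]
print(is_pure)             # [ True False  True]
```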

**Split into train/validation/test datasets**

In [10]:
trainers = np.unique(t)
shuffled_trainers = np.random.permutation(trainers)

# Split the indexes into a combined (train, val) index set and the test indexes
comb_idxs = np.where(np.isin(t, shuffled_trainers[:8]))[0]
test_idxs = np.where(np.isin(t, shuffled_trainers[8:]))[0]

# Split out the combined (train, val) index set
n = len(comb_idxs)
train_idxs = comb_idxs[:int(0.8*n)]
val_idxs = comb_idxs[int(0.8*n):]
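
The split logic can be sanity-checked on synthetic trainer ids. The sketch below assumes 10 trainers with 5 windows each (made-up numbers) and confirms that no test trainer appears in the train or validation sets.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical trainer id per window: 10 trainers, 5 windows each
t_toy = np.repeat(np.arange(10), 5)

# Hold out the last two shuffled trainers for testing
shuffled = rng.permutation(np.unique(t_toy))
comb_trainers, test_trainers = shuffled[:8], shuffled[8:]

comb = np.where(np.isin(t_toy, comb_trainers))[0]
test = np.where(np.isin(t_toy, test_trainers))[0]

# The last 20% of the combined indices become the validation set
n_comb = len(comb)
train, val = comb[:int(0.8 * n_comb)], comb[int(0.8 * n_comb):]

# No trainer in the test set appears in train or validation
assert not set(t_toy[test]) & set(t_toy[train])
assert not set(t_toy[test]) & set(t_toy[val])

print(len(train), len(val), len(test))  # 32 8 10
```

Note that only the test set is trainer-disjoint: the validation set is the trailing slice of the combined indices, so it can contain windows from trainers that also appear in the train set.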

In [11]:
train_idxs = np.random.permutation(train_idxs)
val_idxs = np.random.permutation(val_idxs)
test_idxs = np.random.permutation(test_idxs)

x_train = x[train_idxs]
x_val = x[val_idxs]
x_test = x[test_idxs]

y_train = y[train_idxs]
y_val = y[val_idxs]
y_test = y[test_idxs]

#### Data normalisation

In [12]:
scaler = StandardScaler()

N_train, T, D = x_train.shape
N_val, _, _ = x_val.shape
N_test, _, _ = x_test.shape

# StandardScaler works only on 2-dimensional data, so we fold the time dimension into the sample dimension
x_train = x_train.reshape(N_train * T, D)
x_val = x_val.reshape(N_val * T, D)
x_test = x_test.reshape(N_test * T, D)

# Fit the scaler to the train data, then transform the validation and test data
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)

# Transform back to the N X T X D shape
x_train = x_train.reshape(N_train, T, D)
x_val = x_val.reshape(N_val, T, D)
x_test = x_test.reshape(N_test, T, D)
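
The reshape round trip can be verified on a small synthetic batch. The sketch below uses plain NumPy with made-up shapes and scales; the per-channel standardisation mirrors what StandardScaler does by default (mean removal and division by the population standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 5 windows, 4 time steps, 3 channels with different scales
x_toy = rng.normal(loc=[0.0, 10.0, -5.0], scale=[1.0, 50.0, 0.1], size=(5, 4, 3))
n, t_len, d = x_toy.shape

# Fold time into the sample axis so that each row is one observation of d channels
flat = x_toy.reshape(n * t_len, d)

# Standardise each channel to zero mean and unit (population) standard deviation
mean, std = flat.mean(axis=0), flat.std(axis=0)
flat_scaled = (flat - mean) / std

# Unfold back to (n, t_len, d); the reshape round trip preserves element order
x_scaled = flat_scaled.reshape(n, t_len, d)

print(np.allclose(x_scaled * std + mean, x_toy))  # True
```

Statistics are computed per channel over all windows and time steps together, so every window is scaled with the same parameters.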

## Sample background activity class

According to the paper [1] that introduced the MyoGym dataset, the background activity class, which it describes as the null class, accounts for 77% of the dataset, a number which dwarfs the remaining 30 classes. Many TSC techniques are sensitive to class imbalances or to dataset sizes. Therefore, drawing on conclusions from the exploratory data analysis in the appendix, we sample some windows from this background activity class.

In [13]:
def sample_background_activity_class(data: np.ndarray, labels: np.ndarray, sz: int):
    """
    Downsample the dominant background activity class (label 0) to sz windows, keeping all windows from the other classes.
    sz must not exceed the number of background windows, because sampling is done without replacement.
    """
    # Identify indices of the noise class and signal class
    noise_idx = np.where(labels == 0)[0]
    signal_idx = np.where(labels != 0)[0]

    # Choose a sample from the noise class
    sample_idx = np.random.choice(noise_idx, size = sz, replace=False)

    # Combine the sampled indices with the other class indices
    combined_idx = np.concatenate([signal_idx, sample_idx])
    combined_idx = np.random.permutation(combined_idx)

    # Apply the indexes to the data and labels
    data_sample = data[combined_idx, :, :]
    labels_sample = labels[combined_idx]
    
    return data_sample, labels_sample

In [14]:
xs_train, ys_train = sample_background_activity_class(data = x_train, labels = y_train, sz = 150)
xs_val, ys_val = sample_background_activity_class(data = x_val, labels = y_val, sz = 40)
xs_test, ys_test = sample_background_activity_class(data = x_test, labels = y_test, sz = 50)
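
Sampling without replacement raises a `ValueError` when `sz` exceeds the number of available background windows. One possible guard, sketched below as a hypothetical variant (`sample_background_capped` is not part of this notebook), caps `sz` at the number available and takes an explicit random generator.

```python
import numpy as np

def sample_background_capped(data, labels, sz, rng):
    """Hypothetical variant: cap sz at the number of available background windows."""
    noise_idx = np.where(labels == 0)[0]
    signal_idx = np.where(labels != 0)[0]

    # Never request more background windows than exist
    sz = min(sz, len(noise_idx))
    sample_idx = rng.choice(noise_idx, size=sz, replace=False)

    # Shuffle the kept indices so that classes are interleaved
    combined_idx = rng.permutation(np.concatenate([signal_idx, sample_idx]))
    return data[combined_idx], labels[combined_idx]

# Toy data: 9 background windows (label 0) and 3 activity windows
rng = np.random.default_rng(123)
toy_x = np.zeros((12, 4, 6))
toy_y = np.array([0] * 9 + [3, 3, 7])

xs, ys = sample_background_capped(toy_x, toy_y, sz=4, rng=rng)
print(xs.shape, int((ys == 0).sum()))  # (7, 4, 6) 4
```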

## Save datasets

We save the sampled datasets to train, validation and test files. Each file holds two arrays written in sequence: the windows, followed by their labels.

In [15]:
with open('data/train.npy', 'wb') as f:
    np.save(f, xs_train)
    np.save(f, ys_train)
    
with open('data/val.npy', 'wb') as f:
    np.save(f, xs_val)
    np.save(f, ys_val)
    
with open('data/test.npy', 'wb') as f:
    np.save(f, xs_test)
    np.save(f, ys_test)
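
Because two arrays are written to each file, they have to be read back with two successive `np.load` calls on the same open file handle. The sketch below shows the round trip on small placeholder arrays written to a temporary path (the path and array contents are assumptions for illustration).

```python
import os
import tempfile
import numpy as np

# Placeholder arrays standing in for the sampled windows and labels
xs_demo = np.zeros((3, 4, 6))
ys_demo = np.array([0.0, 3.0, 7.0])

# Temporary file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.npy")

with open(path, "wb") as f:
    np.save(f, xs_demo)
    np.save(f, ys_demo)

# Arrays come back in the order in which they were written
with open(path, "rb") as f:
    xs_loaded = np.load(f)
    ys_loaded = np.load(f)

print(xs_loaded.shape, ys_loaded.shape)  # (3, 4, 6) (3,)
```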

## Appendix

#### Exploratory Data Analysis

In [16]:
def get_activity_counts(y_train, y_val, y_test):
    train_labels, train_counts = np.unique(y_train, return_counts = True)
    val_labels, val_counts = np.unique(y_val, return_counts = True)
    test_labels, test_counts = np.unique(y_test, return_counts = True)

    train_label_counts = pd.DataFrame(np.hstack([train_labels[:, np.newaxis], train_counts[:, np.newaxis]]), columns = ["Label", "Train Count"])
    val_label_counts = pd.DataFrame(np.hstack([val_labels[:, np.newaxis], val_counts[:, np.newaxis]]), columns = ["Label", "Val Count"])
    test_label_counts = pd.DataFrame(np.hstack([test_labels[:, np.newaxis], test_counts[:, np.newaxis]]), columns = ["Label", "Test Count"])
    
    label_counts = train_label_counts.merge(val_label_counts, how = "outer", on = "Label").merge(test_label_counts, how = "outer", on = "Label")
    label_counts["Label"] = label_counts["Label"].map(ACTIVITY_MAPPING)
    label_counts = label_counts.set_index("Label")
    label_counts = label_counts.sort_values(["Train Count"], ascending=[False])
    
    return label_counts

In [17]:
label_counts = get_activity_counts(y_train, y_val, y_test)
label_counts

| Label | Train Count | Val Count | Test Count |
|---|---|---|---|
| No activity identified | 14665.0 | 3436.0 | 3573.0 |
| Front Dumbbell Raise | 167.0 | 55.0 | 28.0 |
| Dumbbell Alternate Bicep Curl | 167.0 | 77.0 | 32.0 |
| Dumbbell Flyes | 160.0 | 47.0 | 15.0 |
| Incline Dumbbell Flyes | 143.0 | 48.0 | 23.0 |
| Hammer Curl | 127.0 | 59.0 | 35.0 |
| Incline Dumbbell Press | 123.0 | 32.0 | 19.0 |
| Spider Curl | 120.0 | 31.0 | 25.0 |
| Wide-Grip Pulldown Behind The Neck | 112.0 | 26.0 | 24.0 |
| Wide-Grip Front Pulldown | 109.0 | 27.0 | 10.0 |


## References

[1] Koskimäki, Heli, Pekka Siirtola and Juha Röning. “MyoGym: introducing an open gym data set for activity recognition collected using myo armband.” Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017): n. pag.