# Split data and prepare windows

In [1]:
%run preprocess.ipynb
%run sys_configs.ipynb

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
np.random.seed(123)

This notebook splits the MyoGym dataset into train, validation and test sets and computes windows from each time series data stream.

In [3]:
N = len(data) # Length of the data stream (number of rows)
step = 50 # Step size forward through the dataframe

In [4]:
columns = list(data.columns)
print("The ordered list of columns is {}".format(columns))
data = data.to_numpy()

The ordered list of columns is ['acc_x', 'acc_y', 'acc_z', 'gyr_x', 'gyr_y', 'gyr_z', 'activity', 'trainer', 'time']


In most contexts involving time series, the validation dataset is temporally separated from the train dataset. In both window methods, we consider windows to be independent, i.e. we can ignore any information from temporally preceding windows.

There are two windowing methods we will deploy for GAR tasks. We will then normalise the data.

Data normalisation brings all channels onto a common scale. Many time series classification techniques that we will consider as benchmarks, for example Dynamic Time Warping, use distance metrics that depend on each dimension having the same scale. We use sklearn's `StandardScaler`, which standardises each channel to zero mean and unit variance.
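
To illustrate why scale matters for distance-based methods, here is a minimal sketch on toy data. The two synthetic channels, their scales and the helper `channel_share` are assumptions made for illustration, not part of the MyoGym pipeline. The standardisation is written out with NumPy and mirrors the per-channel mean and standard deviation scaling described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical values): channel 0 has a small range,
# channel 1 has raw values roughly 1000x larger.
toy = np.column_stack([
    rng.normal(0.0, 1.0, size=200),
    rng.normal(0.0, 1000.0, size=200),
])

def channel_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each channel."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Standardise each channel: subtract its mean, divide by its population std
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

raw_share = channel_share(toy[0], toy[1])
scaled_share = channel_share(scaled[0], scaled[1])

print("Share of distance per channel (raw):   ", raw_share)
print("Share of distance per channel (scaled):", scaled_share)
print("Per-channel mean after scaling:", scaled.mean(axis=0).round(6))
print("Per-channel std after scaling: ", scaled.std(axis=0).round(6))
```

On raw data the large-scale channel typically dominates the distance; after scaling, both channels contribute on comparable terms.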

Standard scaling also makes outliers easier to identify, and lets us quantify how extreme they are more rigorously using hypothesis tests based on a normal distribution.

### Window Method 1

This method uses a fixed number of observations per window, and keeps a window only if neither the activity nor the trainer changes within it.

This is a less sophisticated windowing strategy intended for benchmarking methods which rely on fixed length time series.

In [5]:
T = 500 # Window length is 500

In [6]:
indexer1 = np.arange(T)[None, :] + np.arange(start = 0, stop = N-T, step = step)[:, None]

In [7]:
windows1 = data[indexer1]
print("The shape of windows_1 {}".format(windows1.shape))

The shape of windows_1 (32170, 500, 9)
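
The broadcasting trick behind the indexer can be easier to follow on a tiny stream. The sketch below uses hypothetical values `N_toy`, `T_toy` and `step_toy` in place of `N`, `T` and `step`; everything else follows the same construction as the cell above.

```python
import numpy as np

# Toy stand-ins (hypothetical values) for N, T and step
N_toy, T_toy, step_toy = 10, 4, 2
toy_stream = np.arange(N_toy) * 10  # 0, 10, ..., 90

# Offsets within a window, shape (1, T_toy)
offsets = np.arange(T_toy)[None, :]
# Start position of each window, shape (n_windows, 1)
starts = np.arange(start=0, stop=N_toy - T_toy, step=step_toy)[:, None]

# Broadcasting the sum gives one row of positions per window
toy_indexer = offsets + starts
toy_windows = toy_stream[toy_indexer]

print(toy_indexer)
print(toy_windows)
```

Here the window starts are 0, 2 and 4. Because `stop` is exclusive, a window starting exactly at `N_toy - T_toy` (position 6) is not generated, even though it would fit inside the stream.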


In [8]:
activity_counts1 = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=windows1[:, :, 6])
trainer_counts1 = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=windows1[:, :, 7])

Exclude all windows in which either the activity or the trainer changes. Each window should be *pure*, i.e. it should be mappable to exactly one trainer and one activity. 

In [9]:
x1 = windows1[(activity_counts1 == 1) & (trainer_counts1 == 1), :, :6] # Acceleration & Gyroscope data
y1 = windows1[(activity_counts1 == 1) & (trainer_counts1 == 1), 0, 6] # Activities. Take the first element of the time series (though every element is the same)
t1 = windows1[(activity_counts1 == 1) & (trainer_counts1 == 1), 0, 7] # Trainers. Take the first element of the time series (though every element is the same)
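
The purity check can be illustrated on a few toy label windows. The sketch below compares the unique-count approach used above with a vectorised alternative that tests whether every element equals the first element of its window. The toy labels are hypothetical, and the alternative is offered as an option rather than as the method used in this notebook.

```python
import numpy as np

# Toy label windows (hypothetical): 3 windows of length 4
toy_labels = np.array([
    [1, 1, 1, 1],   # pure
    [1, 1, 2, 2],   # label changes inside the window
    [3, 3, 3, 3],   # pure
])

# Approach used above: count the unique labels in each window
unique_counts = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=toy_labels)
pure_by_count = unique_counts == 1

# Vectorised alternative: compare every element with the first one in its window
pure_by_compare = (toy_labels == toy_labels[:, :1]).all(axis=1)

print(pure_by_count)
print(pure_by_compare)
```

Both masks agree on the toy data. The comparison-based version avoids calling a Python function once per window, which may matter for tens of thousands of windows.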

**Split into train/validation/test datasets**

In [10]:
trainers1 = np.unique(t1)
shuffled_trainers1 = np.random.permutation(trainers1)

# Split the indexes into a combined (train, val) index set and the test indexes
comb_idxs1 = np.where(np.isin(t1, shuffled_trainers1[:8]))[0]
test_idxs1 = np.where(np.isin(t1, shuffled_trainers1[8:]))[0]

# Split out the combined (train, val) index set
n1 = len(comb_idxs1)
train_idxs1 = comb_idxs1[:int(0.8*n1)]
val_idxs1 = comb_idxs1[int(0.8*n1):]

In [11]:
train_idxs1 = np.random.permutation(train_idxs1)
val_idxs1 = np.random.permutation(val_idxs1)
test_idxs1 = np.random.permutation(test_idxs1)

x1_train = x1[train_idxs1]
x1_val = x1[val_idxs1]
x1_test = x1[test_idxs1]

y1_train = y1[train_idxs1]
y1_val = y1[val_idxs1]
y1_test = y1[test_idxs1]
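
A toy version of the split makes its properties easier to inspect. The sketch below assumes 10 trainers with 5 windows each (hypothetical numbers) and uses a local `np.random.default_rng` generator instead of the global seed; the index logic otherwise follows the cells above.

```python
import numpy as np

rng = np.random.default_rng(123)

# Toy trainer id per window (hypothetical): 10 trainers, 5 windows each
toy_t = np.repeat(np.arange(10), 5)

shuffled = rng.permutation(np.unique(toy_t))
held_in, held_out = shuffled[:8], shuffled[8:]

# Windows from 8 trainers form the combined (train, val) set, the rest the test set
comb_idxs = np.where(np.isin(toy_t, held_in))[0]
test_idxs = np.where(np.isin(toy_t, held_out))[0]

# 80/20 positional split of the combined indexes
cut = int(0.8 * len(comb_idxs))
train_idxs, val_idxs = comb_idxs[:cut], comb_idxs[cut:]

shared_with_test = set(toy_t[test_idxs]) & set(toy_t[comb_idxs])
shared_train_val = set(toy_t[train_idxs]) & set(toy_t[val_idxs])

print("Trainers shared with the test set:", shared_with_test)
print("Trainers shared between train and val:", shared_train_val)
```

The test set is disjoint from the other two by trainer. The train/validation boundary is positional, so one trainer can straddle it, as happens in this toy run.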

#### Data normalisation

In [12]:
scaler1 = StandardScaler()

N1_train, T1, D1 = x1_train.shape
N1_val, _, _ = x1_val.shape
N1_test, _, _ = x1_test.shape

# StandardScaler expects 2-dimensional input, so we fold the time dimension into the sample dimension
x1_train = x1_train.reshape(N1_train * T1, D1)
x1_val = x1_val.reshape(N1_val * T1, D1)
x1_test = x1_test.reshape(N1_test * T1, D1)

# Fit the standard scaler to the train data, then transform the validation and test data
x1_train = scaler1.fit_transform(x1_train)
x1_val = scaler1.transform(x1_val)
x1_test = scaler1.transform(x1_test)

# Transform back to the N X T X D shape
x1_train = x1_train.reshape(N1_train, T1, D1)
x1_val = x1_val.reshape(N1_val, T1, D1)
x1_test = x1_test.reshape(N1_test, T1, D1)
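
The reshape step can be sanity-checked without sklearn. By default `StandardScaler` uses the per-feature mean and population standard deviation, so flattening to `(N*T, D)` and taking column statistics should match reducing over the window and time axes directly. The sketch below checks this on toy arrays with hypothetical shapes and values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy windows (hypothetical): 5 train and 2 validation windows, 4 time steps, 3 channels
loc, scale = [0.0, 5.0, -2.0], [1.0, 10.0, 0.1]
x_train = rng.normal(loc=loc, scale=scale, size=(5, 4, 3))
x_val = rng.normal(loc=loc, scale=scale, size=(2, 4, 3))

n, t, d = x_train.shape

# Route 1: flatten (N, T, D) -> (N*T, D) and take per-column statistics
flat = x_train.reshape(n * t, d)
mean_flat, std_flat = flat.mean(axis=0), flat.std(axis=0)

# Route 2: reduce over the window and time axes directly
mean_axes, std_axes = x_train.mean(axis=(0, 1)), x_train.std(axis=(0, 1))

# Apply the train statistics to both splits; broadcasting handles the 3-D shape
x_train_scaled = (x_train - mean_axes) / std_axes
x_val_scaled = (x_val - mean_axes) / std_axes

print(np.allclose(mean_flat, mean_axes), np.allclose(std_flat, std_axes))
print(x_train_scaled.mean(axis=(0, 1)).round(6))
```

The validation windows are scaled with the train statistics only, matching the fit-on-train, transform-elsewhere pattern used above.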

### Window Method 2

This method incorporates variable-length time series windows up to a maximum length, denoted `T2`, by placing a mask over the time steps. The mask is constrained so that the unmasked time steps are consecutive. The mask can be varied to explore how its length impacts the predictive ability of the models tested. Most of the steps are the same as in window method 1, apart from the additional masking step.

The motivation behind this approach is that in the real world, there is a short period at the beginning of the data collection where the only time steps available with which to make a classification are those up to the current time step, which is less than time step `T2`.

In [13]:
T2 = 1000 # Maximum window length is 1,000. In general, windows will be shorter than this.

In [14]:
indexer2 = np.arange(T2)[None, :] + np.arange(start = 0, stop = N-T2, step = step)[:, None]

In [15]:
windows2 = data[indexer2]
print("The shape of windows_2 {}".format(windows2.shape))

The shape of windows_2 (32160, 1000, 9)


In [16]:
activity_counts2 = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=windows2[:, :, 6])
trainer_counts2 = np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=windows2[:, :, 7])

Each window should be *pure*, i.e. it should be designated exactly 1 trainer and 1 activity. 

In [17]:
# Exclude windows in which either the activity or the trainer changes. Each window should be "pure", i.e. it should be designated exactly 1 trainer and 1 activity
x2 = windows2[(activity_counts2 == 1) & (trainer_counts2 == 1), :, :6] # Acceleration & Gyroscope data
y2 = windows2[(activity_counts2 == 1) & (trainer_counts2 == 1), 0, 6] # Activities. Take the first element of the time series (though every element is the same)
t2 = windows2[(activity_counts2 == 1) & (trainer_counts2 == 1), 0, 7] # Trainers. Take the first element of the time series (though every element is the same)

We now apply a mask whose beginning and ending indexes are specific to each sample, because we are testing a range of different mask lengths. Both positions are randomly assigned: the starting position (`start_indexes`) is drawn from 0 up to, but not including, 400, and the ending position (`end_indexes`) is drawn from 800 up to, but not including, `T2 - 1`. Every unmasked span therefore covers at least time steps 399 to 800.

In [18]:
# Define the start and end indexes of where the mask values are True, for each row
start_indexes = np.random.randint(low = 0, high = 400, size=len(x2))
end_indexes = np.random.randint(low = 800, high = T2 - 1, size=len(x2))

# Define a boolean mask matrix with all values set to True and the same first two dimensions as the data matrix.
m2 = np.ones(x2.shape[:2], dtype=bool)

In [19]:
# Define a variable to generate the rows of the mask matrix
cols = np.arange(T2)[None, :]

# Use broadcasting to set to False all elements which are not in the mask
update_mask_lt = cols < start_indexes[:, None]
update_mask_gt = cols > end_indexes[:, None]

m2[update_mask_lt] = False
m2[update_mask_gt] = False
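
The mask construction is easier to inspect at a small scale. The sketch below uses hypothetical values (`T_toy = 8` and hand-picked start and end positions) and builds the mask in a single expression; it should produce the same pattern as starting from all `True` and switching off the positions outside each span.

```python
import numpy as np

T_toy = 8                          # toy maximum window length (hypothetical)
toy_starts = np.array([0, 2, 3])   # first unmasked step per window
toy_ends = np.array([7, 5, 6])     # last unmasked step per window

# Column positions, shape (1, T_toy), broadcast against per-row bounds of shape (3, 1)
toy_cols = np.arange(T_toy)[None, :]
toy_mask = (toy_cols >= toy_starts[:, None]) & (toy_cols <= toy_ends[:, None])

print(toy_mask.astype(int))

# Unmasked length per window, and the same False-count check used in this notebook
lengths = toy_mask.sum(axis=1)
false_in_mask = np.sum(~toy_mask)
false_implied = np.sum(toy_starts) + np.sum(T_toy - 1 - toy_ends)
print(lengths, false_in_mask, false_implied)
```

The unmasked steps in each row are consecutive by construction, since a position is kept only when it lies between that row's start and end.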

In [20]:
# Briefly check the mask has been correctly updated by counting up the number of its elements that are set to False vs the number implied to be False from the update masks
mask_total_false = np.sum(m2 == False)

start_indexes_total_false = np.sum(start_indexes)
end_indexes_total_false = np.sum(T2 - 1 - end_indexes)

print("The number of false values in the mask is {}\nThe number of false values implied by the variables start_indexes and end_indexes is {}".format(mask_total_false, start_indexes_total_false + end_indexes_total_false))

The number of false values in the mask is 6070709
The number of false values implied by the variables start_indexes and end_indexes is 6070709


**Split into train/validation/test datasets**

In [21]:
trainers2 = np.unique(t2)
shuffled_trainers2 = np.random.permutation(trainers2)

# Split the indexes into a combined (train, val) index set and the test indexes
comb_idxs2 = np.where(np.isin(t2, shuffled_trainers2[:8]))[0]
test_idxs2 = np.where(np.isin(t2, shuffled_trainers2[8:]))[0]

# Split out the combined (train, val) index set
n2 = len(comb_idxs2)
train_idxs2 = comb_idxs2[:int(0.8*n2)]
val_idxs2 = comb_idxs2[int(0.8*n2):]

In [22]:
train_idxs2 = np.random.permutation(train_idxs2)
val_idxs2 = np.random.permutation(val_idxs2)
test_idxs2 = np.random.permutation(test_idxs2)

x2_train = x2[train_idxs2, :]
x2_val = x2[val_idxs2, :]
x2_test = x2[test_idxs2, :]

y2_train = y2[train_idxs2]
y2_val = y2[val_idxs2]
y2_test = y2[test_idxs2]

#### Data normalisation

We do not normalise the outputs of the second windowing method. This method is only used for neural networks, and the scale should be absorbed by the network weights.

The masking also complicates the normalisation calculation.
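
For reference, the complication is that the statistics would need to be computed over the unmasked time steps only. The sketch below shows one way this could look on toy data; it is not applied anywhere in this notebook. The array shapes, the mask spans and the choice to zero out masked steps are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 3 windows, 8 time steps, 2 channels
x = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(3, 8, 2))

# Toy mask with one consecutive unmasked span per window
mask = np.zeros((3, 8), dtype=bool)
mask[0, 0:8] = True
mask[1, 2:6] = True
mask[2, 3:7] = True

# Boolean indexing on the first two axes keeps only unmasked steps: shape (16, 2)
valid = x[mask]
mean = valid.mean(axis=0)
std = valid.std(axis=0)

# Scale everything with the masked statistics, then zero out the masked steps
# (zeroing is one possible convention, chosen here for illustration)
x_scaled = (x - mean) / std
x_scaled[~mask] = 0.0

print(valid.shape)
print(x_scaled[mask].mean(axis=0).round(6), x_scaled[mask].std(axis=0).round(6))
```

In a real split the mean and standard deviation would come from the unmasked training steps only and then be reused for the validation and test windows.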

## Sample background activity class

According to the paper [1] which introduced the MyoGym dataset, the background activity class, which it describes as the null class, accounts for 77% of the dataset, dwarfing the remaining 30 classes. Most of the techniques we explore are sensitive to class imbalances or to dataset sizes. Therefore, drawing on conclusions from the exploratory data analysis in the appendix, we keep only a sample of the windows from this background activity class.

In [23]:
def sample_background_activity_class(data: np.ndarray, labels: np.ndarray, sz: int):
    """
    Removes most samples from the dominant background activity class (label 0), down to the sample size (sz) passed to this function.
    """
    # Identify indices of the noise class and signal class
    noise_idx = np.where(labels == 0)[0]
    signal_idx = np.where(labels != 0)[0]

    # Choose a sample from the noise class
    sample_idx = np.random.choice(noise_idx, size = sz, replace=False)

    # Combine the sampled indices with the other class indices
    combined_idx = np.concatenate([signal_idx, sample_idx])
    combined_idx = np.random.permutation(combined_idx)

    # Apply the indexes to the data and labels
    data_sample = data[combined_idx, :, :]
    labels_sample = labels[combined_idx]
    
    return data_sample, labels_sample

In [24]:
x1s_train, y1s_train = sample_background_activity_class(data = x1_train, labels = y1_train, sz = 150)
x1s_val, y1s_val = sample_background_activity_class(data = x1_val, labels = y1_val, sz = 40)
x1s_test, y1s_test = sample_background_activity_class(data = x1_test, labels = y1_test, sz = 50)

x2s_train, y2s_train = sample_background_activity_class(data = x2_train, labels = y2_train, sz = 150)
x2s_val, y2s_val = sample_background_activity_class(data = x2_val, labels = y2_val, sz = 40)
x2s_test, y2s_test = sample_background_activity_class(data = x2_test, labels = y2_test, sz = 50)
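
The effect of the subsampling can be checked on toy labels. The sketch below is a self-contained variant of the function above, with two changes that are assumptions for illustration: it takes a local random generator `rng`, and it caps `sz` at the number of available background windows so that sampling without replacement cannot fail. The name `subsample_background` and the toy class sizes are hypothetical.

```python
import numpy as np

def subsample_background(data, labels, sz, rng, background_label=0):
    """Keep all non-background windows plus `sz` randomly chosen background windows."""
    noise_idx = np.where(labels == background_label)[0]
    signal_idx = np.where(labels != background_label)[0]
    # Guard (an addition for this sketch): never request more windows than exist
    sz = min(sz, len(noise_idx))
    sample_idx = rng.choice(noise_idx, size=sz, replace=False)
    keep = rng.permutation(np.concatenate([signal_idx, sample_idx]))
    return data[keep], labels[keep]

rng = np.random.default_rng(123)

# Toy labels (hypothetical): 77 background windows, 12 of class 1, 11 of class 2
toy_y = np.array([0] * 77 + [1] * 12 + [2] * 11)
toy_x = rng.normal(size=(len(toy_y), 5, 6))

xs, ys = subsample_background(toy_x, toy_y, sz=10, rng=rng)

labels_out, counts_out = np.unique(ys, return_counts=True)
print(dict(zip(labels_out.tolist(), counts_out.tolist())))  # {0: 10, 1: 12, 2: 11}
```

All non-background windows are retained, and only the background class shrinks to the requested size.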

## Save datasets

We save the sampled datasets to train, validation and test files.

In [25]:
with open('data/1s_train.npy', 'wb') as f:
    np.save(f, x1s_train)
    np.save(f, y1s_train)
    
with open('data/1s_val.npy', 'wb') as f:
    np.save(f, x1s_val)
    np.save(f, y1s_val)
    
with open('data/1s_test.npy', 'wb') as f:
    np.save(f, x1s_test)
    np.save(f, y1s_test)

In [26]:
with open('data/2s_train.npy', 'wb') as f:
    np.save(f, x2s_train)
    np.save(f, y2s_train)
    
with open('data/2s_val.npy', 'wb') as f:
    np.save(f, x2s_val)
    np.save(f, y2s_val)
    
with open('data/2s_test.npy', 'wb') as f:
    np.save(f, x2s_test)
    np.save(f, y2s_test)
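
Each file above holds two arrays written one after the other, so they need to be read back in the same order from a single open file handle. The sketch below demonstrates the round trip on toy arrays in a temporary directory; the file name `toy_train.npy` and the array contents are hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy arrays (hypothetical) standing in for the windows and their labels
toy_x = np.arange(24, dtype=float).reshape(2, 4, 3)
toy_y = np.array([0.0, 5.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train.npy")

    # Write both arrays to the same file, data first and labels second
    with open(path, "wb") as f:
        np.save(f, toy_x)
        np.save(f, toy_y)

    # Read them back in the order they were written
    with open(path, "rb") as f:
        loaded_x = np.load(f)
        loaded_y = np.load(f)

print(loaded_x.shape, loaded_y.shape)  # (2, 4, 3) (2,)
```

Calling `np.load` on the path instead of an open handle would return only the first array, so the handle-based pattern is the one to use with these files.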

## Appendix

#### Exploratory Data Analysis

In [27]:
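def get_activity_counts(y_train, y_val, y_test):
    """Count the windows per activity label in the train, validation and test sets and return the counts as one DataFrame."""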
    train_labels, train_counts = np.unique(y_train, return_counts = True)
    val_labels, val_counts = np.unique(y_val, return_counts = True)
    test_labels, test_counts = np.unique(y_test, return_counts = True)

    train_label_counts = pd.DataFrame(np.hstack([train_labels[:, np.newaxis], train_counts[:, np.newaxis]]), columns = ["Label", "Train Count"])
    val_label_counts = pd.DataFrame(np.hstack([val_labels[:, np.newaxis], val_counts[:, np.newaxis]]), columns = ["Label", "Val Count"])
    test_label_counts = pd.DataFrame(np.hstack([test_labels[:, np.newaxis], test_counts[:, np.newaxis]]), columns = ["Label", "Test Count"])
    
    label_counts = train_label_counts.merge(test_label_counts, how = "outer", on = "Label").merge(val_label_counts, how = "outer", on = "Label")
    label_counts["Label"] = label_counts["Label"].map(ACTIVITY_MAPPING)
    label_counts = label_counts.set_index("Label")
    label_counts = label_counts.sort_values(["Train Count"], ascending=[False])
    
    return label_counts
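
The core of this function is aligning per-split label counts on a common set of labels. The NumPy-only sketch below shows that alignment on toy labels. It is an illustration under assumptions rather than a replacement: the helper `counts_for` is hypothetical, and a label absent from a split is reported as 0 here, whereas the outer merge above leaves it as NaN.

```python
import numpy as np

# Toy labels per split (hypothetical)
toy_train = np.array([0, 0, 0, 1, 1, 2])
toy_val = np.array([0, 1])
toy_test = np.array([0, 2, 2])

# Union of the labels seen in any split
all_labels = np.unique(np.concatenate([toy_train, toy_val, toy_test]))

def counts_for(y, labels):
    """Count occurrences of each entry of `labels` in `y` (0 where absent)."""
    return np.array([int((y == lab).sum()) for lab in labels])

# One row per label; columns ordered Train, Test, Val as in the merged table
table = np.column_stack([counts_for(s, all_labels) for s in (toy_train, toy_test, toy_val)])

print(all_labels)
print(table)
```

Label 1 does not occur in the toy test split and label 2 does not occur in the toy validation split, which is the situation the outer merge is there to handle.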

In [28]:
label_counts_1 = get_activity_counts(y1_train, y1_val, y1_test)
label_counts_2 = get_activity_counts(y2_train, y2_val, y2_test)

In [29]:
label_counts_1.columns = pd.MultiIndex.from_product([['Window Method 1'], label_counts_1.columns])
label_counts_2.columns = pd.MultiIndex.from_product([['Window Method 2'], label_counts_2.columns])

In [30]:
label_counts_1.merge(label_counts_2, how = "outer", left_index = True, right_index = True)

| Label | Window Method 1: Train Count | Window Method 1: Test Count | Window Method 1: Val Count | Window Method 2: Train Count | Window Method 2: Test Count | Window Method 2: Val Count |
|---|---|---|---|---|---|---|
| Bar Skullcrusher | 105.0 | 31.0 | 35.0 | 48.0 | 5.0 | 20.0 |
| Bench Dip / Dip | 62.0 | 20.0 | 27.0 | 8.0 | 4.0 | 4.0 |
| Bench Press | 68.0 | 17.0 | 23.0 | 18.0 | 4.0 | 7.0 |
| Bent Over Barbell Row | 59.0 | 12.0 | 14.0 | 5.0 | 3.0 | 2.0 |
| Cable Curl | 89.0 | 17.0 | 25.0 | 28.0 | 3.0 | 8.0 |
| Car Drivers | 63.0 | 28.0 | 9.0 | 3.0 | 2.0 | 9.0 |
| Close-Grip Barbell Bench Press | 89.0 | 9.0 | 26.0 | 20.0 | 13.0 | 13.0 |
| Concentration Curl | 90.0 | 13.0 | 32.0 | 26.0 | 5.0 | 13.0 |
| Dumbbell Alternate Bicep Curl | 167.0 | 32.0 | 77.0 | 102.0 | 35.0 | 39.0 |
| Dumbbell Flyes | 160.0 | 15.0 | 47.0 | 85.0 | 28.0 | 14.0 |

## References

[1] Koskimäki, Heli, Pekka Siirtola and Juha Röning. “MyoGym: introducing an open gym data set for activity recognition collected using myo armband.” Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers (2017): n. pag.