# Gesture RNN - predicting multiple sequences!

The idea with this RNN is to help build a system that can mimic ensemble performances.

![](figures/gesture-rnn-neural-mode-ensemble.jpg)

In these performances, the touchscreen gestures were classified once every second (by a different ML system). This means that each performance can be represented as a sequence of gestures for each performer.

Let's load in the gesture data and see what these look like:

In [None]:
import os
import pandas as pd
import numpy as np
from itertools import permutations

import urllib.request
URL = "https://github.com/anucc/metatone-analysis/raw/master/metadata/metatone_performances_dataframe.h5"
PICKLE_FILE = "metatone_performances_dataframe.h5"

if not os.path.exists(PICKLE_FILE):
    urllib.request.urlretrieve(URL, PICKLE_FILE)
metatone_dataset = pd.read_hdf(PICKLE_FILE)

## Int values for Gesture codes.
NUMBER_GESTURES = 9
GESTURE_CODES = {
    'N': 0,
    'FT': 1,
    'ST': 2,
    'FS': 3,
    'FSA': 4,
    'VSS': 5,
    'BS': 6,
    'SS': 7,
    'C': 8}

vocabulary_size = len(GESTURE_CODES)

print("Here's an example of some gestures from a trio performance:")
metatone_dataset.gestures.iloc[0][:20].T

## Encoding and Decoding multiple numbers

- We can encode and decode multiple integers as a single integer.
- Good trick! We can then use a simple CharRNN to predict more than one sequence!

E.g., if you have some numbers:
$$ g_1, g_2, \ldots, g_n \in [0, j-1] $$
then these can be encoded as one number:
$$ g_1 j^0 + g_2 j^1 + \ldots + g_n j^{(n-1)} $$
and decoded again, preserving the ordering. Here $j$ is the number of gesture classes (9 in this dataset), so each gesture acts as one digit of a base-$j$ number.
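
As a worked example (the gesture values are chosen purely for illustration), take $j = 9$ and the three gestures $(3, 1, 4)$:
$$ 3 \cdot 9^0 + 1 \cdot 9^1 + 4 \cdot 9^2 = 3 + 9 + 324 = 336 $$
Decoding reverses this by repeatedly taking the remainder and the integer quotient: $336 \bmod 9 = 3$ and $\lfloor 336 / 9 \rfloor = 37$; then $37 \bmod 9 = 1$ and $\lfloor 37 / 9 \rfloor = 4$; finally $4 \bmod 9 = 4$. This recovers $(3, 1, 4)$ in the original order.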

In [4]:
def encode_ensemble_gestures(gestures):
    """Encode multiple natural numbers into one"""
    encoded = 0
    for i, g in enumerate(gestures):
        encoded += g * (len(GESTURE_CODES) ** i)
    return encoded

def decode_ensemble_gestures(num_perfs, code):
    """Decodes ensemble gestures from a single int"""
    gestures = []
    for i in range(num_perfs):
        part = code % (len(GESTURE_CODES) ** (i + 1))
        gestures.append(part // (len(GESTURE_CODES) ** i))
    return gestures

Let's just check that this works!

In [None]:
gestures = [0,5,2,7]
enc_gestures = encode_ensemble_gestures(gestures)
print("Starting gestures are:", gestures)
print("Encoded integer is:", enc_gestures)
print("Decoded gestures are:", decode_ensemble_gestures(4, enc_gestures))
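
A single example is reassuring, but the encoding is small enough to check exhaustively. The sketch below is self-contained and uses its own helper names (`encode`, `decode` and `BASE` are illustrative stand-ins, not the notebook's functions), assuming 9 gesture classes. It checks that every possible trio of gestures gets a unique code and decodes back to itself.

```python
from itertools import product

BASE = 9  # assumed number of gesture classes, matching len(GESTURE_CODES)

def encode(gestures, base=BASE):
    """Pack gesture codes into one integer (first gesture is least significant)."""
    return sum(g * base ** i for i, g in enumerate(gestures))

def decode(count, code, base=BASE):
    """Unpack `count` gesture codes from one integer, in the original order."""
    gestures = []
    for _ in range(count):
        code, g = divmod(code, base)  # peel off the least significant digit
        gestures.append(g)
    return gestures

all_trios = list(product(range(BASE), repeat=3))
codes = [encode(trio) for trio in all_trios]

# Every trio should map to its own code, covering 0 .. 9**3 - 1 exactly once.
print("Codes are unique and contiguous:", sorted(codes) == list(range(BASE ** 3)))
# Decoding should return the original gestures in the original order.
print("Round trip holds:", all(decode(3, c) == list(t) for t, c in zip(all_trios, codes)))
```

Using `divmod` here is just a compact alternative to the modulo and floor-division steps in `decode_ensemble_gestures`; both give the same digits.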

## Looking only at Quartet Performances

- Gesture-RNN is mainly used for quartets (one live player and three RNN-controlled players).
- Let's extract the quartets from the dataset.

In [None]:
## Isolate the Interesting Ensemble Performances
improvisations = metatone_dataset[
    (metatone_dataset["performance_type"] == "improvisation") &
    (metatone_dataset["performance_context"] != "demonstration") &
    (metatone_dataset["number_performers"] == 4)]
gesture_data = improvisations['gestures']
#metatone_dataset["number_performers"]
print("Number of performances for testing: ", len(improvisations))


num_input_performers = 4
num_output_performers = 3

ensemble_improvisations = gesture_data.tolist()

print("Here's an example of part of one performance:")
ensemble_improvisations[0][:20].T

## Preparing the training data

- This code cuts up the performances into training examples.
- Each player in turn is taken as the lead, and the other players are permuted to extend the data.
- This code takes quite a long time to run (it's a bit inefficient).
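
To see how much the lead/permutation trick extends the data, here is a minimal sketch on a toy window of random gestures. The helper names (`encode`, `window_examples`) and the random data are assumptions for illustration only; the logic mirrors the slicing described above, where the lead's gestures are taken one step ahead of the ensemble's.

```python
from itertools import permutations
import numpy as np

BASE = 9  # assumed number of gesture classes

def encode(gestures, base=BASE):
    """Pack gesture codes into one integer (first gesture is least significant)."""
    return sum(int(g) * base ** i for i, g in enumerate(gestures))

def window_examples(window):
    """Build (x, y) pairs from one window of shape (num_steps + 1, num_players)."""
    players = window.T  # now indexed by player, then by time
    examples = []
    for j in range(len(players)):
        lead = players[j][1:]  # lead gestures, one step ahead of the ensemble
        others = np.delete(players, j, axis=0)  # the remaining players
        for perm in permutations(others):  # every ordering of the ensemble
            perm = np.array(perm)
            ens_pre = perm[:, :-1]  # ensemble gestures for all but the last step
            ens_post = perm[:, -1]  # ensemble gestures at the last step
            x = [encode(state) for state in zip(lead, *ens_pre)]
            y = encode(ens_post)
            examples.append((x, y))
    return examples

rng = np.random.default_rng(0)
toy_window = rng.integers(0, BASE, size=(4, 4))  # 3 steps + 1, four players
examples = window_examples(toy_window)
print("Examples from one window:", len(examples))  # 4 leads x 3! orderings
```

Each window of a quartet therefore yields 24 training examples, which goes some way to explaining both the size of the dataset and the running time.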

In [None]:
## Setup the epochs
## Each batch is of single gestures as input and tuples of remaining performers as output
## Setup the inputs and label sets
num_steps = 30

imp_xs = []
imp_ys = []
    
for n, imp in enumerate(ensemble_improvisations):
    print("Processing sequence:", n)
    for i in range(len(imp)-num_steps-1):
        imp_slice = imp[i:i+num_steps+1]
        for j in range(len(imp_slice.T)):
            lead = imp_slice[1:].T[j] # lead gestures (post steps)
            ensemble = imp_slice.T[np.arange(len(imp_slice.T)) != j] # rest of the players indexed by player
            for ens_perm in permutations(ensemble): # consider all permutations of the players
                ens_pre = np.array(ens_perm).T[:-1] # indexed by time slice
                #ens_post = np.array(ens_perm).T[1:] # indexed by time slice
                ens_post = np.array(ens_perm).T[-1] # final time slice, indexed by player
                y = encode_ensemble_gestures(ens_post)
                #y = [encode_ensemble_gestures(e) for e in ens_post]
                #y = ens_post # test just show the gestures
                x = [encode_ensemble_gestures(e) for e in zip(lead,*(ens_pre.T))] # encode ensemble state
                #x = zip(lead,*(ens_pre.T)) # test just show the gestures
                imp_xs.append(x) # append the inputs
                imp_ys.append(y) # append the outputs

print("Total Training Examples: " + str(len(imp_xs)))
print("Total Training Labels: " + str(len(imp_ys)))

X = np.array(imp_xs, dtype=np.int16)
y = np.array(imp_ys, dtype=np.int16)

print("X:", X.shape)
print("y:", y.shape)

np.savez('../datasets/gesture_rnn_training_dataset.npz', X=X.astype(np.int16), y=y.astype(np.int16))

- Here's one I prepared earlier...

In [None]:
dataset = np.load("../datasets/gesture_rnn_training_dataset.npz")
X = dataset['X']
y = dataset['y'].reshape(-1,1)
print("X shape:", X.shape)
print("y shape:", y.shape)

- Let's look at an example:

In [None]:
print("X:", X[100])
print("y:", y[100])

## Building the RNN

- Very similar design to the CharRNN.
- Gestures are combined into one integer before entering the RNN.
- The RNN outputs one integer, which is split up into individual gestures.
- Input size is different to output! 4 performers --> 3 performers
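
The "split up" step can be sketched without a trained model. Below, a random probability vector stands in for the softmax output, and the lead gesture is a made-up value; `encode`, `decode`, `fake_probs` and `lead_gesture` are all hypothetical names for illustration. The sketch picks the most likely output class, splits it into three ensemble gestures, and re-encodes them with a lead gesture into a four-performer input class.

```python
import numpy as np

BASE = 9  # assumed number of gesture classes
NUM_OUTPUT_PERFORMERS = 3

def encode(gestures, base=BASE):
    """Pack gesture codes into one integer (first gesture is least significant)."""
    return sum(int(g) * base ** i for i, g in enumerate(gestures))

def decode(code, count, base=BASE):
    """Unpack `count` gesture codes from one integer, in the original order."""
    gestures = []
    for _ in range(count):
        code, g = divmod(code, base)
        gestures.append(g)
    return gestures

rng = np.random.default_rng(1)
# Stand-in for the network's softmax output over 9**3 = 729 classes.
fake_probs = rng.dirichlet(np.ones(BASE ** NUM_OUTPUT_PERFORMERS))

predicted_class = int(np.argmax(fake_probs))
ensemble_gestures = decode(predicted_class, NUM_OUTPUT_PERFORMERS)

lead_gesture = 4  # hypothetical next gesture from the live player
next_input = encode([lead_gesture] + ensemble_gestures)

print("Ensemble gestures:", ensemble_gestures)
print("Next input class (out of 9**4):", next_input)
```

Taking the argmax is only one option; sampling from the distribution is a common alternative when more varied output is wanted.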

![](figures/gesture-rnn-training.png)

In [None]:
import keras

num_layers = 1
batch_size = 64
num_units = 256
num_steps = 120

num_input_performers = 4
num_output_performers = 3

vocabulary_size = 9 # len(GESTURE_CODES)
num_input_classes = vocabulary_size ** num_input_performers
num_output_classes = vocabulary_size ** num_output_performers

training_model = keras.models.Sequential()
training_model.add(keras.Input(shape=(None,)))  # one encoded integer per time step; sequence length comes from the data
training_model.add(keras.layers.Embedding(num_input_classes, num_units))
for n in range(num_layers - 1):
    training_model.add(keras.layers.LSTM(num_units, return_sequences=True))
training_model.add(keras.layers.LSTM(num_units))
training_model.add(keras.layers.Dense(num_output_classes, activation='softmax'))

training_model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam')
training_model.summary()

# Notes:
# The difficulty of this task is learning from a relatively large input space:
print("Number of input classes:", num_input_classes)
print("Number of output classes:", num_output_classes)
# It's handy to use an Embedding layer so that we can learn from integer
# inputs (not one-hot).
# This means that for lower 'num_units', the parameters used for the input
# embedding outnumber those in the LSTM layers.
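
The balance between embedding and LSTM parameters can be checked with a little arithmetic. This sketch assumes a single LSTM layer fed directly by the embedding, using the standard LSTM parameter count (four gates, each with input weights, recurrent weights and a bias); the function names are illustrative.

```python
def embedding_params(num_classes, num_units):
    """One vector of length num_units per input class."""
    return num_classes * num_units

def lstm_params(input_dim, num_units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (num_units * (input_dim + num_units) + num_units)

num_input_classes = 9 ** 4  # 6561 possible ensemble states

for num_units in (64, 256, 512, 1024):
    emb = embedding_params(num_input_classes, num_units)
    lstm = lstm_params(num_units, num_units)
    print(f"units={num_units}: embedding={emb}, lstm={lstm}, "
          f"embedding larger: {emb > lstm}")
```

Under these assumptions the embedding is the larger of the two up to roughly 800 units, so it dominates at the 256 units used above.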

## Training

- Training takes a long time for this RNN! (long sequence length, lots of examples)
- 9 hours for a 3-layer, 512-unit network.
- We're not going to do it here.

In [None]:
## Training Model.
training_model.fit(X, y, batch_size=batch_size)

## Outputs

However, the trained model works! Here's an example of a generated score (red is the lead player) and a demo performance.

![](figures/gesture_rnn_4to3_example.png)

![](figures/gesture-rnn-band-demo.jpg)