# seq2seq prediction with variable length inputs
- For seq2seq modelling, variable-length sequences are usually padded to a common length, since padding doesn't affect the objective function in most cases.
- Another workaround is to train on the sequences one by one, or in batches grouped by sequence length.
- This notebook replicates the Keras [`addition_rnn.py`](https://github.com/fchollet/keras/blob/master/examples/addition_rnn.py) example and experiments with training on variable-length sequences.
- More discussion can be found [here](https://github.com/fchollet/keras/issues/40) and [here](https://github.com/fchollet/keras/issues/424).
- The solution discussed in this article applies only to the "theano" backend, not to "tensorflow".
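
The two workarounds listed above (padding versus grouping by length) can be sketched on plain strings before any encoding is involved. This is only an illustration: `pad_to_longest` and `group_by_length` are hypothetical helper names that are not used elsewhere in this notebook.

```python
from collections import defaultdict

def pad_to_longest(seqs, padchar=" "):
    """Right-pad every string to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, padchar) for s in seqs]

def group_by_length(seqs):
    """Bucket strings so that each bucket holds a single sequence length."""
    buckets = defaultdict(list)
    for s in seqs:
        buckets[len(s)].append(s)
    return dict(buckets)

exprs = ["1+2", "12+3", "4+56", "123+456"]
padded = pad_to_longest(exprs)    # one batch, every item has length 7
buckets = group_by_length(exprs)  # several batches, one length per batch
```

Padding yields a single rectangular batch, while grouping yields one rectangular batch per length, which is what the solutions further down rely on.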

In [1]:
from __future__ import print_function
import numpy as np
np.random.seed(314)

In [170]:
from keras import models, layers, optimizers, objectives, metrics
from keras.preprocessing import sequence, text
from itertools import groupby, cycle, chain, islice

from sklearn.utils import shuffle

In [226]:
import keras
keras.__version__

'1.0.2'

In [171]:
!cat /root/.keras/keras.json

{"epsilon": 1e-07, "floatx": "float32", "backend": "theano"}


## encode/decode of sequences
- supports both fixed-length encoding (with a default pad character) and variable-length encoding
- for RNN training it is fine to use input sequences of different lengths, but the length of the output sequences still needs to be fixed

In [92]:
charset = "0123456789+ "
int2char = dict(enumerate(charset))
char2int = dict((c, i) for i,c in enumerate(charset))
def encode(expr, seqlen = None, padchar = " "):
    """set seqlen to enforce a fixed-length encoding of a sequence with pad
    """
    seqlen = seqlen or len(expr)
    
    vec = np.zeros((len(expr), len(charset)))
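    ## use an explicit list of indices: on Python 3 a bare map() is a lazy iterator and cannot index an array
    vec[np.arange(len(expr)), [char2int[c] for c in expr]] = 1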
    if seqlen > len(expr):
        pad_vec = np.zeros( (1, len(charset)) )
        pad_vec[0, char2int.get(padchar)] = 1
        vec = np.r_[vec, np.repeat(pad_vec, seqlen - len(expr), axis = 0)]
    return vec
def decode(vec):
    expr = "".join([int2char.get(r.argmax()) for r in vec])
    return expr

In [93]:
print("Example of encoding/decoding '%s'" % decode(encode("311+3")))
print("Example of encoding/decoding with padding '%s'" % decode(encode("314", seqlen=5)))

Example of encoding/decoding '311+3'
Example of encoding/decoding with padding '314  '


## generate training and validation data
- training data:
    - inputs are sequences of variable lengths
    - outputs are sequences of fixed lengths
- validation data:
    - both inputs and outputs are sequences of fixed lengths, for simplicity

In [136]:
def generate_data(ndigits, nquestions):
    seen = set()
    questions, answers = [], []
    nextn = lambda : int("".join(np.random.choice(np.array(list("0123456789")), ndigits, replace =True)))
    while len(questions) < nquestions:
        a, b = nextn(), nextn()
        key = tuple(sorted([a, b]))
        if key in seen: continue
        seen.add(key)
        expr1 = "%i+%i" % (a, b)
        expr2 = "%i+%i" % (b, a)
        ans = str(a+b)
        questions.append(expr1)
        answers.append(ans)
        questions.append(expr2)
        answers.append(ans)
    questions, answers = shuffle(questions, answers)
    return questions, answers

### Mix 2-digit and 3-digit operands for training

***It is a harder problem now, because part of the task is to predict 3-digit sums after training on 2-digit ones***

In [137]:

ndigits = 3
## names now match the operand size: 3-digit questions first, then 2-digit ones
%time questions3, answers3 = generate_data(3, 40000)
%time questions2, answers2 = generate_data(2, 10000)
train_questions = questions3 + questions2
train_answers = answers3 + answers2

## train_X is a list of matrices with variable shapes
## train_y is a list of matrices with fixed shapes
train_X = [encode(q) for q in train_questions] ## encoding question seqs with variable len
train_y = [encode(a, seqlen=ndigits+1) for a in train_answers] ## encoding answer seqs with fixed len

CPU times: user 524 ms, sys: 0 ns, total: 524 ms
Wall time: 519 ms
CPU times: user 564 ms, sys: 0 ns, total: 564 ms
Wall time: 555 ms


In [138]:
## see the length distributions of train_questions

from collections import Counter
len_counter = Counter(map(lambda m: m.shape[0], train_X))
print(len_counter)
len_counter = Counter(map(lambda m: m.shape[0], train_y))
print(len_counter)

Counter({7: 32470, 5: 9124, 6: 6444, 4: 1856, 3: 106})
Counter({4: 50000})
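
The input-length counts above can be sanity-checked analytically. The sketch below assumes that each operand is effectively uniform on `0 .. 10**ndigits - 1` (because `int()` strips leading zeros from the sampled digit string) and it ignores the de-duplication done in `generate_data`, so it is an approximation rather than an exact prediction. The helper names are hypothetical.

```python
from collections import Counter
from itertools import product

def operand_length_probs(ndigits):
    """P(len(str(n))) when n is uniform on 0 .. 10**ndigits - 1."""
    total = 10 ** ndigits
    probs = {1: 10.0 / total}  # 0..9 print as a single character
    for k in range(2, ndigits + 1):
        probs[k] = (10 ** k - 10 ** (k - 1)) / float(total)
    return probs

def question_length_probs(ndigits):
    """P(len("a+b")) for two independent operands."""
    p = operand_length_probs(ndigits)
    out = Counter()
    for (la, pa), (lb, pb) in product(p.items(), repeat=2):
        out[la + lb + 1] += pa * pb  # +1 for the "+" sign
    return dict(out)

# expected number of questions per input length for the mix used above
expected = Counter()
for ndigits, nquestions in [(3, 40000), (2, 10000)]:
    for length, prob in question_length_probs(ndigits).items():
        expected[length] += prob * nquestions
```

Under these assumptions the expected counts are roughly 32400, 6480, 9144, 1872 and 104 for lengths 7, 6, 5, 4 and 3, which is close to the observed `Counter`.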




***In this example the longer sequences simply dominate. This only shows how variable-length training can be done in practice; it doesn't suggest that this is necessarily the way to train a good model. There is a reason why padding is so popular.***


In [139]:
## generate new data for validation
ndigits = 3
%time valid_questions, valid_answers = generate_data(3, 5000)

## Both valid_X and valid_y are encoded with padding, for simplicity
valid_X = np.array([encode(q, seqlen=ndigits*2+1) for q in valid_questions]) 
valid_y = np.array([encode(a, seqlen=ndigits+1) for a in valid_answers])
print(valid_X.shape, valid_y.shape)

CPU times: user 68 ms, sys: 0 ns, total: 68 ms
Wall time: 65.8 ms
(5000, 7, 12) (5000, 4, 12)


In [183]:
def create_model():
    n_hidden_layers = 1
    input_dim = len(charset)
    hidden_dim = 128
    output_dim = len(charset)
    output_seq_len = ndigits + 1

    model = models.Sequential()
    ## indicate accepting variable length sequences by not fixing the first input dim
    model.add(layers.LSTM(hidden_dim, input_shape = (None, input_dim), name = "input_lstm"))
    model.add(layers.RepeatVector(output_seq_len, name = "output_seq"))
    for i in range(n_hidden_layers):
        model.add(layers.LSTM(hidden_dim, return_sequences = True, name = "hidden_seq_rnn%i" % i))
    model.add(layers.TimeDistributed(layers.Dense(output_dim), name = "output_vec"))
    model.add(layers.Activation("softmax", name = "softmax"))

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = create_model()

In [163]:
def inspect_model(model):
    for layer in model.layers:
        print("Layername:", layer.name, 
              "\tInputs:", layer.input_shape, 
              "\tOutputs:", layer.output_shape)
        
inspect_model(model)

Layername: input_lstm 	Inputs: (None, None, 12) 	Outputs: (None, 128)
Layername: output_seq 	Inputs: (None, 128) 	Outputs: (None, 4, 128)
Layername: hidden_seq_rnn0 	Inputs: (None, 4, 128) 	Outputs: (None, 4, 128)
Layername: output_vec 	Inputs: (None, 4, 128) 	Outputs: (None, 4, 12)
Layername: softmax 	Inputs: (None, 4, 12) 	Outputs: (None, 4, 12)


## Solution 1: use generator to group training data of different lengths

In [202]:
from itertools import groupby, cycle, chain, islice
def batch_data_generator(inputs_list, outputs_list, batch_size = 32):
    """
    Group inputs and outputs by input sizes
    Generate batches from same size group, with padding if necessary
    """
    assert (len(inputs_list) == len(outputs_list))
    index = range(len(inputs_list))
    keyfun = lambda i: inputs_list[i].shape[0] ## input shape as key
    groups_by_sz = groupby(sorted(index, key = keyfun), key = keyfun)
    
    grp_indices = []
    for sz, subindex in groups_by_sz:
        ## pad subindex to make it a multiple of batch_size
        subindex = list(subindex)
        r = len(subindex) % batch_size
        padded_sz = len(subindex) if r == 0 else len(subindex) + (batch_size-r)
        subindex = islice(cycle(subindex), 0, padded_sz)
        grp_indices.append(subindex)
    looped_index = cycle(chain(*grp_indices))
    
    while True:
        ## take the next batch_size indices from the endless index stream (works on Python 2 and 3)
        batch_index = list(islice(looped_index, batch_size))
        batch_X = np.array([inputs_list[i] for i in batch_index])
        batch_y = np.array([outputs_list[i] for i in batch_index])
        yield (batch_X, batch_y)

In [203]:
# how it works
for batch_x, batch_y in islice(batch_data_generator(train_X, train_y, batch_size=4), 1000, 1002):
    exprs = [decode(x) for x in batch_x]
    answs = [decode(y) for y in batch_y]
    for expr, ans in zip(exprs, answs):
        print("%s = %s" % (expr, ans))
    print ("="*64)

19+84 = 103 
92+73 = 165 
65+67 = 132 
68+48 = 116 
27+33 = 60  
20+48 = 68  
61+79 = 140 
24+98 = 122 


***Finally, train the model in batches with variable-length sequences***
- training with groups of different lengths seems to influence both performance and convergence.
- because the length groups differ from each other, the gradients jump more wildly within one epoch

In [166]:
batch_size = 32
train_generator = batch_data_generator(train_X, train_y, batch_size=batch_size)


In [167]:
model.fit_generator(train_generator, samples_per_epoch = (len(train_X) // batch_size + 1) * batch_size, nb_epoch=20, 
                   validation_data = (valid_X, valid_y), verbose = 1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f4e59fd9950>
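
A small arithmetic check on the epoch size used in the `fit_generator` call above. Because `batch_data_generator` pads every length group separately to a multiple of `batch_size`, one full pass over the generator is slightly longer than `samples_per_epoch`. The sketch takes the group sizes from the `Counter` output printed earlier; `padded_epoch_size` is a hypothetical helper name.

```python
def padded_epoch_size(group_sizes, batch_size):
    """Total samples in one full pass when each group is padded to a batch multiple."""
    # -(-n // b) is integer ceiling division
    return sum(-(-n // batch_size) * batch_size for n in group_sizes)

group_sizes = {7: 32470, 5: 9124, 6: 6444, 4: 1856, 3: 106}  # input length -> count
batch_size = 32

full_cycle = padded_epoch_size(group_sizes.values(), batch_size)
used = (sum(group_sizes.values()) // batch_size + 1) * batch_size  # value passed to fit_generator
drift_batches = (full_cycle - used) // batch_size
```

With these numbers one generator cycle is 50080 samples while each Keras epoch consumes 50016, so epoch boundaries drift by two batches per epoch relative to the data. That is harmless here, but it means an "epoch" does not line up exactly with one pass over the groups.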

In [168]:
model.evaluate(valid_X, valid_y)



[2.0113413022994995, 0.84019999999999995]

In [169]:
i = np.random.choice(valid_X.shape[0], 10, replace=False)
sampleX, sampley = valid_X[i, :], valid_y[i]
sampleyhat = model.predict(sampleX)

for x, y, yhat in zip(sampleX, sampley, sampleyhat):
    print(decode(x), "=", decode(y), "->", decode(yhat))

786+306 = 1092 -> 1092
254+340 = 594  -> 694 
80+309  = 389  -> 1486
246+259 = 505  -> 505 
964+178 = 1142 -> 1142
830+956 = 1786 -> 1786
995+65  = 1060 -> 1650
682+461 = 1143 -> 1143
763+752 = 1515 -> 1515
477+974 = 1451 -> 1451
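
The accuracy reported by `model.evaluate` appears to be computed per character (an argmax match averaged over all output positions, padding included), so it overstates how often the whole answer is right. A minimal NumPy sketch of both measures follows, assuming one-hot arrays of shape `(samples, steps, classes)`; the function names are hypothetical and not part of Keras.

```python
import numpy as np

def char_accuracy(y_true, y_pred):
    """Fraction of output positions whose argmax matches."""
    return float(np.mean(y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)))

def sequence_accuracy(y_true, y_pred):
    """Fraction of samples in which every position matches."""
    match = y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)
    return float(np.mean(match.all(axis=1)))

# toy data: 2 samples, 4 steps, 3 classes; the second sample has one wrong character
true_ids = np.array([[0, 1, 2, 2], [1, 1, 0, 2]])
pred_ids = np.array([[0, 1, 2, 2], [1, 0, 0, 2]])
y_true = np.eye(3)[true_ids]
y_pred = np.eye(3)[pred_ids]

print(char_accuracy(y_true, y_pred), sequence_accuracy(y_true, y_pred))
```

On the notebook's data this could be called as `sequence_accuracy(valid_y, model.predict(valid_X))`.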


## Solution 2: train sequentially on batches of different sizes
- in the previous solution, training got stuck after jumping between the different length groups
- possible improvements: changing the learning algorithm, restructuring the network, adding more regularization, or shuffling the batches from different groups
- here we shuffle the batches across groups, which brings little improvement in this case

In [212]:
## partition the training data by sizes
def partition_data(inputs_list, outputs_list, batch_size = 32):
    assert (len(inputs_list) == len(outputs_list))
    index = range(len(inputs_list))
    keyfun = lambda i: inputs_list[i].shape[0] ## input shape as key
    groups_by_sz = groupby(sorted(index, key = keyfun), key = keyfun)
    
    index_groups = {}
    for sz, subindex in groups_by_sz:
        subindex = list(subindex)
        if len(subindex) % batch_size == 0:
            padded_sz = len(subindex) 
        else:
            padded_sz = (len(subindex) // batch_size + 1) * batch_size  ## integer division on Python 2 and 3
        subindex = list(islice(cycle(subindex), 0, padded_sz))
        for ibatch in range(padded_sz // batch_size):
            index_groups["%i_%i" % (sz, ibatch)] = subindex[ibatch*batch_size:(ibatch+1)*batch_size]
    return index_groups

In [205]:
## chunks of batch indices; within one chunk the input shapes are always the same
train_index_groups = partition_data(train_X, train_y, batch_size=32)

In [217]:
## recreate model to reset, no simple way of doing it in keras YET
## https://github.com/fchollet/keras/pull/1908
model2 = create_model()

nb_epoch = 30
for epoch in range(nb_epoch):
    ## list(...) so that the batches can be shuffled on Python 3 as well
    for batch_index in shuffle(list(train_index_groups.values())):
        batch_X = np.array([train_X[i] for i in batch_index])
        batch_y = np.array([train_y[i] for i in batch_index])
        model2.fit(batch_X, batch_y, nb_epoch=1, verbose = 0)
    print("epoch %i" % epoch, "validation performance", model2.evaluate(valid_X, valid_y))

epoch 0 validation performance [1.9186408332824707, 0.42830000000000001]
epoch 1 validation performance [1.9998241294860839, 0.43964999999999999]
epoch 2 validation performance [1.9460379375457764, 0.4758]
epoch 3 validation performance [1.9206396116256714, 0.59204999999999997]
epoch 4 validation performance [1.8321751346588135, 0.66400000000000003]
epoch 5 validation performance [1.7385511131286622, 0.79744999999999999]
epoch 6 validation performance [1.8385612802505493, 0.79269999999999996]
epoch 7 validation performance [1.8151056520462037, 0.83015000000000005]
epoch 8 validation performance [1.8583248208999634, 0.82555000000000001]
epoch 9 validation performance [1.8356214509010316, 0.82769999999999999]
epoch 10 validation performance [1.8443672791481018, 0.84640000000000004]
epoch 11 validation performance [1.895765985584259, 0.84465000000000001]
epoch 12 validation performance [1.9419121088027953, 0.83884999999999998]
epoch 13 validation performance [1.9480765412330627, 0.8460999

In [218]:
print(model2.evaluate(valid_X, valid_y))
i = np.random.choice(valid_X.shape[0], 10, replace=False)
sampleX, sampley = valid_X[i, :], valid_y[i]
sampleyhat = model2.predict(sampleX)

for x, y, yhat in zip(sampleX, sampley, sampleyhat):
    print(decode(x), "=", decode(y), "->", decode(yhat))

[2.0020545827865601, 0.8508]
43+645  = 688  -> 989 
649+621 = 1270 -> 1270
959+803 = 1762 -> 1762
965+392 = 1357 -> 1357
630+148 = 778  -> 778 
752+659 = 1411 -> 1411
50+711  = 761  -> 76  
546+633 = 1179 -> 1179
478+185 = 663  -> 663 
381+319 = 700  -> 700 


## Solution 3: train the model group by group
- in this case it is almost the same as training on the dominant group alone
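
The cell below iterates over `train_data_groups`, which is not defined anywhere in this notebook. One plausible way to build it, assuming it maps each input length to a `(group_X, group_y)` pair of stacked arrays, is sketched here; `group_by_input_length` is a hypothetical name and the original construction may have differed.

```python
from collections import defaultdict
import numpy as np

def group_by_input_length(inputs_list, outputs_list):
    """Map input length -> (stacked inputs, stacked outputs) for that length."""
    buckets = defaultdict(lambda: ([], []))
    for x, y in zip(inputs_list, outputs_list):
        xs, ys = buckets[x.shape[0]]
        xs.append(x)
        ys.append(y)
    # all inputs in a bucket share one shape, so np.array stacks them into a 3-D array
    return {sz: (np.array(xs), np.array(ys)) for sz, (xs, ys) in sorted(buckets.items())}
```

With the notebook's data this would be called as `train_data_groups = group_by_input_length(train_X, train_y)`.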

In [200]:
model3 = create_model()

nb_epoch = 20

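## NOTE: train_data_groups is not defined anywhere in this notebook; it is assumed
## to map each input length to a (group_X, group_y) pair of stacked arrays
for sz, (group_X, group_y) in train_data_groups.items():
    model3.fit(group_X, group_y, nb_epoch=nb_epoch, verbose = 0)
    ## report the index of the last epoch run for this group (the original used a stale `epoch` variable)
    print("epoch %i training on size %i" % (nb_epoch - 1, sz), "validation performance", model3.evaluate(valid_X, valid_y))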

epoch 19 training on size 3 validation performance [6.2249597793579099, 0.25509999999999999]
epoch 19 training on size 4 validation performance [5.3433155868530271, 0.20699999999999999]
epoch 19 training on size 5 validation performance [6.8431203826904294, 0.22245000000000001]
epoch 19 training on size 6 validation performance [8.1787536987304694, 0.26390000000000002]
epoch 19 training on size 7 validation performance [2.0483857035636901, 0.81310000000000004]


In [201]:
print(model3.evaluate(valid_X, valid_y))
i = np.random.choice(valid_X.shape[0], 10, replace=False)
sampleX, sampley = valid_X[i, :], valid_y[i]
sampleyhat = model3.predict(sampleX)

for x, y, yhat in zip(sampleX, sampley, sampleyhat):
    print(decode(x), "=", decode(y), "->", decode(yhat))

[2.0483857035636901, 0.81310000000000004]
183+417 = 600  -> 500 
150+488 = 638  -> 638 
270+75  = 345  -> 1022
665+546 = 1211 -> 1211
381+432 = 813  -> 813 
774+149 = 923  -> 923 
419+675 = 1094 -> 1104
958+155 = 1113 -> 1113
337+75  = 412  -> 1092
301+479 = 780  -> 780 


## Conclusion
- The current Keras version (1.0.2 here) with the Theano backend supports RNNs over variable-length sequences. The TensorFlow backend doesn't.
- Even though it is possible, training with variable-length sequences should not be the first choice in most cases. As shown above, the most obvious way of doing so makes training unstable, which hurts both accuracy and convergence.
- For comparison, using a padding scheme for fixed-length seq2seq learning on the same dataset, `model4` below performs much better than any of the solutions explored above.
- However, it will be interesting to see new solutions proposed in this field.

In [225]:
model4 = create_model()
train_X3 = np.array([encode(q, seqlen=7) for q in train_questions]) 
train_y3 = np.array([encode(a, seqlen=4) for a in train_answers]) 

model4.fit(train_X3, train_y3, nb_epoch=20, verbose = 1, validation_data=(valid_X, valid_y))

Train on 50000 samples, validate on 5000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f4dd871ebd0>

In [227]:
print(model4.evaluate(valid_X, valid_y))
i = np.random.choice(valid_X.shape[0], 10, replace=False)
sampleX, sampley = valid_X[i, :], valid_y[i]
sampleyhat = model4.predict(sampleX)

for x, y, yhat in zip(sampleX, sampley, sampleyhat):
    print(decode(x), "=", decode(y), "->", decode(yhat))

[0.023949464198946953, 0.99355000000000004]
567+662 = 1229 -> 1229
304+653 = 957  -> 957 
793+705 = 1498 -> 1498
247+615 = 862  -> 862 
200+308 = 508  -> 508 
801+112 = 913  -> 913 
9+563   = 572  -> 572 
746+867 = 1613 -> 1613
292+228 = 520  -> 520 
848+465 = 1313 -> 1313
