Title: LSTM neural network for sequence learning
Date: 2017-11-19 22:00
Tags: LSTM, artificial intelligence, jupyter, tensorflow
Slug: My-first-LSTM
Authors: Dinne Bosman
Lang:en
Summary: My first attempt at an LSTM for sequence prediction

In 1996, during my last year in high school, I borrowed a book about neural networks from a friend. It explained how a two-layer perceptron network could learn the XOR function. I tried implementing the formulas and was able to do the feed-forward calculations. The training algorithm, however, still eluded me. Being able to perform forward calculations was already very exciting. I created a Windows 95 screen saver which would fill the screen with the output of a randomized neural network. The output images were very interesting, especially when replacing the activation functions of the network with exotic ones such as sin(x) or abs(x). (Although I lost the source code, you can still download it [here](http://www.free-downloads-center.com/download/neural-screen-saver-v1-0-11252.html).)

At the time it seemed that neural networks were just another statistical method to interpolate data. Furthermore, limited training data and the problem of vanishing gradients limited their usefulness. Fast forward to 2017. Massive amounts of training data and computing power are available. A number of relatively small improvements in the basic neural network algorithms have made it possible to train networks consisting of many more layers. These so-called deep neural networks have fueled progress and interest in Artificial Intelligence.

One particular innovation that caught my attention is the LSTM neural network architecture. This architecture was designed to mitigate the vanishing gradient problem in recurrent networks. LSTM networks are especially suited to the analysis of sequences and time series. Some interesting links:
  * article about text generation kernel code
  * fake news generator
  * LSTM architecture
  * Voice synthesis (non LSTM)
  * Audio generation

Especially the last topic was very inspiring. Think about the possibilities of voice synthesis and generating music!

In this article I will discuss my experience using TensorFlow and LSTM networks. I used the following documentation sources:
  * TensorFlow (this is API reference documentation and will not help you understand how to apply LSTM networks)
  * Keras LSTM example
  * TensorFlow example
  * https://arxiv.org/pdf/1506.00019.pdf
  * http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  * https://distill.pub/2016/augmented-rnns/
 
The first test I wanted to implement is to see whether I could build a sine wave predictor. It's a limited example:
  * The network is trained using mini-batches. Due to the periodic nature of the sine wave, the train, dev, and test sets overlap.
  * I train the network in batches, but want to generate per sample
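The overlap mentioned in the first point can be made concrete with a small sketch. It assumes an idealized, noise-free 1 Hz sine sampled at exactly 0.01 s per sample (the notebook's signal adds noise and has a marginally different step), and `clean_window` is a hypothetical helper, not part of the notebook. Two windows that start a whole number of periods apart are numerically identical, so a random split over start indices cannot keep the sets independent.

```python
import math

TIME_PER_SAMPLE = 0.01                  # seconds per sample, as in the notebook
SEQUENCE_LENGTH = 50                    # window length, as in the notebook
PERIOD = round(1.0 / TIME_PER_SAMPLE)   # a 1 Hz sine repeats every 100 samples

def clean_window(start):
    """Noise-free window of the 1 Hz sine starting at sample index `start`."""
    return [math.sin(2 * math.pi * TIME_PER_SAMPLE * (start + i))
            for i in range(SEQUENCE_LENGTH)]

# Two windows that could land in different sets after shuffling.
a = clean_window(7)
b = clean_window(7 + 3 * PERIOD)

# Largest sample-wise difference between the two windows.
max_diff = max(abs(p - q) for p, q in zip(a, b))
print(max_diff)
```

Apart from the added noise, a dev or test window is therefore very likely to have a near-duplicate in the training set.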
  

In [60]:
import plotly
from plotly.graph_objs import Scatter, Layout
import numpy as np
import tensorflow as tf
import sys
plotly.offline.init_notebook_mode(connected=True)
import IPython.display

## Training data
The following cell generates the training data. I decided to add some noise to the sine wave which forces some regularization.

In [61]:
sample_length = 50001
time_per_sample = 0.01
signal_time = np.linspace(num=sample_length,start = 0, stop = sample_length * time_per_sample )
signal_amp = np.sin(signal_time*2*np.pi) + np.random.normal(size=sample_length)*0.02
    #np.sin(2+signal_time*1.7*np.pi)*0.5 + \
    #np.sin(1+signal_time*2.2*np.pi) + \
    

In [62]:
s_i = 0
e_i = s_i + 100
x = plotly.offline.iplot({
    "data": [Scatter(x=signal_time[s_i:e_i],y=signal_amp[s_i:e_i])],
    "layout": Layout(title="")
    
})


In [63]:
#Unroll the RNN to 50 timesteps
sequence_length = 50
#predict one step ahead (currently can only be set to 1)
prediction_length = 1
#each input sample consists of 1 value 
input_feature_count = 1
#each output sample consists of 1 value
output_feature_count = 1

#The network will be trained using 128 example sequences
batch_size = 128 #512
#Use a three layer LSTM network with each layer of size 16
hidden_count_per_layer = [16,16,16]

tf.reset_default_graph()

#Define the inputs of the network, None is given as the first argument, it is the batch_size, which should be dynamic
inputs = tf.placeholder(tf.float32, [None, sequence_length, input_feature_count], name = 'inputs')
#define the target value which we want to predict, again None points to batch_size
targets = tf.placeholder(tf.float32, [None, output_feature_count], name = 'targets')
#Apply drop out regularization with keep_prob probability
keep_prob = tf.placeholder(tf.float32, name = 'keep')
#Learning rate of the adamoptimizer
learning_rate = tf.placeholder(tf.float32, name = 'learning_rate')

## Defining the LSTM multi layer network

In [64]:
layers = []

for hidden_count in hidden_count_per_layer:
    layer =  tf.nn.rnn_cell.LSTMCell(hidden_count, state_is_tuple=True)
    layer_with_dropout = tf.nn.rnn_cell.DropoutWrapper(layer,
                                          input_keep_prob=keep_prob,
                                          output_keep_prob=1.0)
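    # Note: the plain layer is appended here, so the dropout wrapper above is
    # never used. Appending layer_with_dropout instead would enable dropout,
    # but keep_prob would then have to be fed in every sess.run call.
    layers.append(layer)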
hidden_network = tf.nn.rnn_cell.MultiRNNCell(layers, state_is_tuple=True)   
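To give an idea of what each of these cells computes, here is a minimal NumPy sketch of one step of a textbook LSTM cell. It is not TensorFlow's implementation: the gate ordering, the random weights, and the names `lstm_step` and `sigmoid` are assumptions made for illustration. It shows the two pieces of state, `c` (cell state) and `h` (hidden output), that are carried from one timestep to the next.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c, h, W, b):
    """One textbook LSTM step (illustrative, not TensorFlow's exact layout).

    x: (batch, n_in), c and h: (batch, n_hidden),
    W: (n_in + n_hidden, 4 * n_hidden), b: (4 * n_hidden,)
    """
    z = np.concatenate([x, h], axis=1) @ W + b
    i, f, o, g = np.split(z, 4, axis=1)   # input, forget, output gate, candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return c_new, h_new

rng = np.random.default_rng(0)
batch, n_in, n_hidden = 4, 1, 16
W = rng.normal(scale=0.1, size=(n_in + n_hidden, 4 * n_hidden))
b = np.zeros(4 * n_hidden)
c = np.zeros((batch, n_hidden))
h = np.zeros((batch, n_hidden))

# Unroll over 50 timesteps, carrying (c, h) forward each step.
for t in range(50):
    x_t = rng.normal(size=(batch, n_in))
    c, h = lstm_step(x_t, c, h, W, b)

print(c.shape, h.shape)
```

The additive update of `c` is what lets gradients flow over many timesteps more easily than in a plain recurrent layer.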



## Packing/Unpacking the LSTM network state
In order to use the LSTM network to generate a predicted sequence of arbitrary length you need to store the state of the network. The output state after predicting a sample should be fed back into the network when predicting the next sample.
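The feedback loop can be sketched without any network at all. In the toy example below, `step` is a hypothetical stand-in for one network call: it takes the carried state and the latest sample and returns the new state plus a prediction. Instead of an LSTM it uses the exact sine recurrence sin((n+1)w) = 2 cos(w) sin(nw) - sin((n-1)w), so the "state" is simply the previous sample.

```python
import math

OMEGA = 2 * math.pi * 0.01  # phase advance per sample (1 Hz at 100 samples/s)

def step(state, x):
    """Hypothetical stand-in for one network call.

    state: the sample before x, x: the latest sample.
    Returns (new_state, prediction of the next sample).
    """
    prediction = 2 * math.cos(OMEGA) * x - state
    return x, prediction

# Prime with two true samples, then let the loop run on its own output.
signal = [math.sin(0 * OMEGA), math.sin(1 * OMEGA)]
state, x = signal[0], signal[1]
for _ in range(198):
    state, x = step(state, x)   # feed the returned state back in
    signal.append(x)

max_error = max(abs(s - math.sin(n * OMEGA)) for n, s in enumerate(signal))
print(len(signal), max_error)
```

Without the carried state, a single input sample would not be enough to tell whether the wave is rising or falling; the same holds for a network that is called one sample at a time.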

Unfortunately the LSTM implementation in TensorFlow uses an LSTMStateTuple(c,h) data structure which is not very convenient to work with. The idea is to pack this LSTMStateTuple(c,h) into a 1D vector.

I'm not happy with these functions as they use the batch_size. This complicates things when using the network to generate predictions.

There is a pointer on how to use dynamic batch sizes and packing/unpacking states [here](https://stackoverflow.com/questions/40438107/tensorflow-changing-batch-size-for-rnn-during-text-generation). The implementation of pack/unpack there, however, doesn't work with my organisation of hidden_network.

In [65]:
def get_network_state_size(network,batch_size):
    """Returns the number of state variables in the network"""
    states = 0
    # use the network argument instead of the global hidden_network
    for layer_size in network.state_size:
        states += layer_size[0]
        states += layer_size[1]
    return states * batch_size
    
def pack_state_tuple(states_per_layer):
    """Returns a (x,1) vector containing all the states of the network. 
    Input is the states obtained via outputs, state = tf.nn.dynamic_rnn(...)"""
    return tf.concat(axis=0, values=[tf.reshape(layer_states, (-1,1)) for layer_states in states_per_layer])

def unpack_state_tuple(network, batch_size, packed_states):
    """The inverse of pack, given a packed_states vector of (x,1) return the LSTMStateTuple 
    datastructure that can be used as initial state for tf.nn.dynamic_rnn(...) """
    rnn_list_state = []
    start_index = 0

    # build the data structure for each layer of the network
    for layer_size in network.state_size:
            # in the layer there are state_cnt state (c and h) variables
            states_cnt = batch_size * layer_size[0]
            states_c = tf.slice(packed_states, begin=[start_index,0], size=[states_cnt,1])
            states_c = tf.reshape(states_c,[batch_size, layer_size[0]])
            start_index += states_cnt
            states_cnt = batch_size * layer_size[1]
            states_h = tf.slice(packed_states, begin=[start_index,0], size=[states_cnt,1])
            states_h = tf.reshape(states_h,[batch_size, layer_size[1]])
            start_index += states_cnt
            
            rnn_list_state.append(tf.nn.rnn_cell.LSTMStateTuple(states_c,states_h))
    tuple_state=tuple(rnn_list_state)            
    return tuple_state
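One possible way to remove the dependency on batch_size is to keep the batch axis and concatenate the states along the feature axis, giving a (batch, total) matrix instead of an (x,1) vector. The NumPy sketch below only illustrates that layout; `pack_states` and `unpack_states` are hypothetical names, and the TensorFlow version (presumably `tf.concat` along axis 1 plus slicing) is not shown or tested here.

```python
import numpy as np

def pack_states(states):
    """states: list of (c, h) pairs, each array shaped (batch, size).
    Returns a single (batch, total) array."""
    return np.concatenate([a for pair in states for a in pair], axis=1)

def unpack_states(packed, layer_sizes):
    """Inverse of pack_states; layer_sizes lists the hidden size per layer."""
    states, start = [], 0
    for size in layer_sizes:
        c = packed[:, start:start + size]
        h = packed[:, start + size:start + 2 * size]
        states.append((c, h))
        start += 2 * size
    return states

layer_sizes = [16, 16, 16]
results = {}
for batch in (1, 128):
    states = [(np.random.rand(batch, n), np.random.rand(batch, n))
              for n in layer_sizes]
    packed = pack_states(states)
    restored = unpack_states(packed, layer_sizes)
    round_trip_ok = all(
        np.array_equal(c0, c1) and np.array_equal(h0, h1)
        for (c0, h0), (c1, h1) in zip(states, restored))
    results[batch] = (packed.shape, round_trip_ok)
    print(batch, packed.shape, round_trip_ok)
```

Because neither function refers to the batch size, the same code handles a training batch of 128 and a generation batch of 1.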

In [66]:
sz = get_network_state_size(hidden_network,batch_size)
print("states in network", sz)

#initial_state and zero_state are both packed versions ((x,1) vectors) of the network state
zero_state = pack_state_tuple(hidden_network.zero_state(batch_size, tf.float32))
initial_state = tf.placeholder_with_default(
    zero_state, 
    (sz,1), 
    name="initial_state")



states in network 12288


### Testing the packing/unpacking functions
In the following cell I check whether the pack and unpack functions are indeed each other's inverses. The vectors should be packed/unpacked in the correct order.

In [67]:
# Test to check that the pack and unpack functions are each other's inverses

#copy of the input (packed) shape (x,1)
initial_copy = tf.slice(initial_state,(0,0),(sz,1))

#unpack into (LSTMTuple(),LSTMTuple(),....)
unpacked = unpack_state_tuple(hidden_network,batch_size,initial_copy)
# (rename) rnn_tuple_state will be later used as input for the network
rnn_tuple_state = unpacked
#repack into vector (x,1)
packed_again = pack_state_tuple(unpacked)


init=tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    cp = sess.run(initial_copy,  feed_dict={}) #initial_state: initial_state_input})
    cp[:,0] = np.linspace(start=0,stop=cp.shape[0],num=cp.shape[0])
    
    cp,up,cp2 = sess.run([initial_copy,unpacked,packed_again],  feed_dict={initial_state: cp}) #initial_state: initial_state_input})
    diff = cp - cp2
    # the sum should be zero
    print("diff",np.sum(np.abs(diff)))

diff 0.0


## Forward propagation

In [68]:
print("inputs ",inputs.shape)
outputs, state = tf.nn.dynamic_rnn(hidden_network, inputs, dtype=tf.float32, initial_state=rnn_tuple_state, )

#pack the output state of the network so we can feed it into initial_state later on
state_packed = pack_state_tuple(state)
print("outputs before transpose", outputs.shape)
outputs = tf.transpose(outputs, [1, 0, 2])
print("outputs after transpose", outputs.shape)
last_output = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)
print("last output", last_output.shape)
                                   
#don't use any activation function as we want to predict the output sample
#amplitude directly.
predictions = tf.contrib.layers.fully_connected(last_output, output_feature_count, activation_fn=None)
print("prediction", predictions.shape)
print("targets", targets.shape)
#prediction = tf.nn.softmax(logit)
#loss = tf.losses.softmax_cross_entropy(target, logit)


inputs  (?, 50, 1)
outputs before transpose (128, 50, 16)
outputs after transpose (50, 128, 16)
last output (128, 16)
prediction (128, 1)
targets (?, 1)
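The transpose-and-gather step above selects the output of the last timestep for every sequence in the batch. A small NumPy sketch, using a dummy array with the same shape as the printed `outputs`, shows that this is equivalent to slicing the time axis directly:

```python
import numpy as np

# Dummy stand-in for the RNN outputs: (batch, time, hidden) = (128, 50, 16)
outputs = np.arange(128 * 50 * 16, dtype=np.float32).reshape(128, 50, 16)

# Route used in the notebook: make the array time-major, then take the last step.
time_major = np.transpose(outputs, (1, 0, 2))
last_via_transpose = time_major[time_major.shape[0] - 1]

# Direct route: index the last position of the time axis.
last_via_slice = outputs[:, -1, :]

same = np.array_equal(last_via_transpose, last_via_slice)
print(last_via_transpose.shape, same)
```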


## Backward pass, training

In [69]:
loss = tf.reduce_sum(tf.squared_difference(predictions, targets))

#optimization
opt=tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

## Defining the train, dev and test set

In [70]:
# a train, dev, test sequence can start at any point in the dataset.
start_indices = np.linspace(
    0,
    sample_length-sequence_length-prediction_length-1,
    sample_length-sequence_length-prediction_length-1, dtype= np.int32)

dev_size_perc = 0.20
test_size_perc = 0.20

dev_size = int(np.floor(start_indices.shape[0] * dev_size_perc))
test_size  = int(np.floor(start_indices.shape[0] * test_size_perc))
train_size = start_indices.shape[0] - test_size - dev_size
train_batch_count = int(np.floor(train_size / batch_size))
dev_batch_count = int(np.floor(dev_size / batch_size))
test_batch_count = int(np.floor(test_size / batch_size))

print("dataset size %d" %(start_indices.shape[0]))
print("%d Examples (%d batches) in train set" %(train_size, train_batch_count))
print("%d Examples (%d batches) in dev set" %(dev_size,dev_batch_count))
print("%d Examples (%d batches) in test set" %(test_size,test_batch_count))



dataset size 49949
29971 Examples (234 batches) in train set
9989 Examples (78 batches) in dev set
9989 Examples (78 batches) in test set
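The numbers printed above follow from simple arithmetic, sketched here in plain Python with the dataset size taken from the output. The sketch also shows how many examples per set fall outside the last full batch and are therefore never used.

```python
import math

dataset_size = 49949   # from the printed output above
batch_size = 128

dev_size = math.floor(dataset_size * 0.20)
test_size = math.floor(dataset_size * 0.20)
train_size = dataset_size - dev_size - test_size

split = {}
for name, size in [("train", train_size), ("dev", dev_size), ("test", test_size)]:
    batches = size // batch_size              # only full batches are used
    leftover = size - batches * batch_size    # examples that are skipped
    split[name] = (size, batches, leftover)
    print(name, size, batches, leftover)
```

Since `get_dev_loss` further down divides the summed loss by `dev_size` rather than by the number of examples actually evaluated, the reported average is very slightly lower than the true per-example loss.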


In [71]:
np.random.shuffle (start_indices)
train_indices = start_indices[0:int(train_size)]
dev_indices= start_indices[int(train_size):int(train_size+dev_size)]
test_indices = start_indices[int(train_size+dev_size):int(train_size+dev_size+test_size)]

def get_batch(batch_index, indexes, size=batch_size):
    """Get a batch of examples. Indexes is a list of shuffled starting points. """
    batch_start_indexes = indexes[batch_index*size:batch_index*size+size]
    batch_inputs = np.zeros((size,sequence_length, input_feature_count))
    batch_targets = np.zeros((size,prediction_length))
    for i in range(size):
        se = batch_start_indexes[i]
        part = signal_amp[se:se+sequence_length]
        batch_inputs[i,0:sequence_length,0] = part
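        # the input window covers indices se .. se+sequence_length-1, so the
        # sample one step ahead is at se+sequence_length
        batch_targets[i,0] = signal_amp[se+sequence_length]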

    return batch_inputs,batch_targets

#test the function
batch_inputs,batch_targets = get_batch(train_batch_count-1,train_indices)
print(batch_inputs.shape,batch_targets.shape)

example_inputs = batch_inputs[0,:,:]
example_targets =  batch_targets[0,:]
print(example_inputs.shape)
b_i = 1
b_s = batch_inputs[b_i,0:sequence_length,0]
plotly.offline.iplot({
    "data": [Scatter(y=b_s)],
    "layout": Layout(title="")
})

(128, 50, 1) (128, 1)
(50, 1)
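The per-example loop in `get_batch` can also be expressed with NumPy index arithmetic. This is a sketch of an alternative, not the notebook's code: `get_batch_vectorized` is a hypothetical name, and it assumes the target is the sample immediately after the input window (one step ahead).

```python
import numpy as np

def get_batch_vectorized(signal, starts, sequence_length):
    """Build (inputs, targets) for all start indices at once.

    inputs: (batch, sequence_length, 1), targets: (batch, 1).
    Assumes the target is the sample right after each window.
    """
    starts = np.asarray(starts)
    # Row k holds the indices starts[k], starts[k]+1, ..., starts[k]+sequence_length-1
    window_idx = starts[:, None] + np.arange(sequence_length)[None, :]
    inputs = signal[window_idx][:, :, None]
    targets = signal[starts + sequence_length][:, None]
    return inputs, targets

signal = np.sin(np.arange(1000) * 0.01 * 2 * np.pi)
inputs, targets = get_batch_vectorized(signal, [0, 10, 500], sequence_length=50)
print(inputs.shape, targets.shape)
```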


## Test training using a single batch
In the next cell I check whether I can train the network on a single batch, just to verify that the optimizer indeed trains the network. In the output you will see the loss decreasing (first column).

In [73]:
np.random.shuffle (start_indices)
train_indices = start_indices[0:int(train_size)]
dev_indices= start_indices[int(train_size):int(train_size+dev_size)]
test_indices = start_indices[int(train_size+dev_size):int(train_size+dev_size+test_size)]


init=tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    np.random.shuffle (train_indices)
    
    batch_inputs,batch_targets = get_batch(0, train_indices)
    print("batch input shape", batch_inputs.shape)
    #v_outputs, v_state = sess.run([outputs,state], feed_dict={inputs: batch_inputs, targets: batch_targets})
    v_predictions, v_state, v_state_packed = sess.run([predictions,state, state_packed], 
                                      feed_dict={
                                          inputs: batch_inputs, 
                                          targets: batch_targets
                                      })
    print(v_predictions.shape)
    print(v_predictions[0],batch_targets[0])
    for i in range(0,120):
        v_predictions, v_outputs, v_state, v_state_packed, v_loss, v_opt = sess.run(
            [predictions, outputs, state, state_packed, loss, opt], 
            feed_dict={
                learning_rate: 0.02, 
                inputs: batch_inputs, 
                targets: batch_targets,
                initial_state: v_state_packed
            }) #})
        print(v_loss,v_predictions[0],batch_targets[0])
 

    
    

batch input shape (128, 50, 1)
(128, 1)
[ 0.00825402] [-0.87938952]
73.196 [ 0.0084268] [-0.87938952]
44.5674 [ 0.01241862] [-0.87938952]
25.2639 [-0.30127254] [-0.87938952]
13.3534 [-0.62422746] [-0.87938952]
10.1409 [-1.0056057] [-0.87938952]
3.40547 [-1.06023467] [-0.87938952]
4.62868 [-1.06374276] [-0.87938952]
2.84418 [-1.05242419] [-0.87938952]
1.65521 [-0.97255898] [-0.87938952]
2.69843 [-0.88899928] [-0.87938952]
2.50259 [-0.84559721] [-0.87938952]
1.6239 [-0.82453358] [-0.87938952]
0.892349 [-0.81085068] [-0.87938952]
1.34796 [-0.80802804] [-0.87938952]
1.27122 [-0.82006747] [-0.87938952]
0.8469 [-0.83399373] [-0.87938952]
0.621678 [-0.83163124] [-0.87938952]
0.962114 [-0.82015812] [-0.87938952]
0.836633 [-0.82103443] [-0.87938952]
0.846576 [-0.83150297] [-0.87938952]
0.673405 [-0.83644444] [-0.87938952]
0.676 [-0.84073049] [-0.87938952]
0.519327 [-0.85216492] [-0.87938952]
0.43075 [-0.8621732] [-0.87938952]
0.393542 [-0.85977948] [-0.87938952]
0.382521 [-0.84818017] [-0.87938

## Training and Testing



In [75]:
np.random.shuffle (start_indices)
train_indices = start_indices[0:int(train_size)]
dev_indices= start_indices[int(train_size):int(train_size+dev_size)]
test_indices = start_indices[int(train_size+dev_size):int(train_size+dev_size+test_size)]

epoch_count = 5

loss_results = np.zeros((epoch_count,2))

def get_dev_loss():
    epoch_dev_loss = 0.0
    for devi in range(dev_batch_count):
        batch_inputs,batch_targets = get_batch(devi, dev_indices)

        batch_dev_loss = sess.run(loss,feed_dict={inputs:batch_inputs,targets:batch_targets})
        if (devi % 20) == 0:
            print("  Dev results batch %d, loss %s" %(  devi, str(batch_dev_loss)))  

        epoch_dev_loss += batch_dev_loss
        #sys.stdout.write('.')
        #sys.stdout.flush()
    return epoch_dev_loss / dev_size

def generate_graph(graph_size=200):
    prime_size = 20
    
    prime_signal_start_i = 0
    
    tmp_signal = np.zeros((graph_size,1))
    tmp_signal[0:prime_size,0] = signal_amp[prime_signal_start_i:(prime_signal_start_i+prime_size)]
    #tmp_signal[0:prime_size,0] = np.random.normal(size=prime_size)*0.6+0.1
    tmp_batch = np.zeros((batch_size,sequence_length,1))
    
    _state_packed = None
    for end in range(prime_size, graph_size):
        #end = prime_size
        tmp_batch[0,:,0] = tmp_signal.take(range((end-sequence_length),end), mode='wrap')
        if _state_packed is None:
            _state_packed , _prediction = sess.run(
                [state_packed, predictions[0,0]], 
                feed_dict={learning_rate: 0.02, inputs: tmp_batch})
        else:
            _state_packed , _prediction = sess.run(
                [state_packed, predictions[0,0]], 
                feed_dict={learning_rate: 0.02, initial_state: _state_packed, inputs: tmp_batch})
            
        #print(_prediction)
        tmp_signal[end,0] = _prediction
        sys.stdout.write('.')
        sys.stdout.flush()
    print("")
    plotly.offline.iplot({
       "data": [Scatter(y=tmp_signal[:,0])],
       "layout": Layout(title="")})


init=tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    epoch_dev_loss = get_dev_loss()    

    #print("")            
    print("Dev results epoch start, loss %s" %(  str(epoch_dev_loss),))  

    for epoch in range(0,epoch_count):
        np.random.shuffle (train_indices)
        epoch_train_loss = 0.0
        for ti in range(train_batch_count):
            batch_inputs,batch_targets = get_batch(ti, train_indices)

            batch_train_loss, _ = sess.run([loss, opt], feed_dict={learning_rate: 0.002, inputs: batch_inputs, targets: batch_targets})
            if (ti % 20) == 0:
                print("  Train results batch %d, loss %s" %(  ti, str(batch_train_loss)))  
            epoch_train_loss += batch_train_loss
            #sys.stdout.write('.')
            #sys.stdout.flush()
        #print("")
        epoch_train_loss = epoch_train_loss / train_size
        print("Training results epoch %d, loss %s" %( epoch, str(epoch_train_loss)))
        epoch_dev_loss = get_dev_loss()    
        #print("")            
        print("Dev results epoch %d, loss %s" %( epoch, str(epoch_dev_loss)))  
        loss_results[epoch,0] = epoch_train_loss
        loss_results[epoch,1] = epoch_dev_loss
        ti += 1
        generate_graph()
    generate_graph(graph_size=1000)
        

  Dev results batch 0, loss 67.4317
  Dev results batch 20, loss 66.0425
  Dev results batch 40, loss 67.267
  Dev results batch 60, loss 65.8533
Dev results epoch start, loss 0.514305417585
  Train results batch 0, loss 69.2573
  Train results batch 20, loss 12.4881
  Train results batch 40, loss 0.832896
  Train results batch 60, loss 0.235991
  Train results batch 80, loss 0.0941328
  Train results batch 100, loss 0.0653589
  Train results batch 120, loss 0.0731907
  Train results batch 140, loss 0.0635686
  Train results batch 160, loss 0.066241
  Train results batch 180, loss 0.0584618
  Train results batch 200, loss 0.0693147
  Train results batch 220, loss 0.057407
Training results epoch 0, loss 0.0292599497691
  Dev results batch 0, loss 0.0634809
  Dev results batch 20, loss 0.065406
  Dev results batch 40, loss 0.0620489
  Dev results batch 60, loss 0.0629092
Dev results epoch 0, loss 0.000508144291636
..........................................................................

  Train results batch 0, loss 0.0637692
  Train results batch 20, loss 0.0541024
  Train results batch 40, loss 0.0695865
  Train results batch 60, loss 0.0548981
  Train results batch 80, loss 0.0616153
  Train results batch 100, loss 0.0525488
  Train results batch 120, loss 0.0666363
  Train results batch 140, loss 0.0715255
  Train results batch 160, loss 0.0610526
  Train results batch 180, loss 0.0603319
  Train results batch 200, loss 0.0708398
  Train results batch 220, loss 0.0887008
Training results epoch 1, loss 0.000479385734936
  Dev results batch 0, loss 0.0665972
  Dev results batch 20, loss 0.0673353
  Dev results batch 40, loss 0.067419
  Dev results batch 60, loss 0.0644791
Dev results epoch 1, loss 0.000525869703968
....................................................................................................................................................................................


  Train results batch 0, loss 0.0512211
  Train results batch 20, loss 0.0571462
  Train results batch 40, loss 0.0485674
  Train results batch 60, loss 0.0459265
  Train results batch 80, loss 0.057221
  Train results batch 100, loss 0.0648001
  Train results batch 120, loss 0.0816927
  Train results batch 140, loss 0.0603868
  Train results batch 160, loss 0.0590188
  Train results batch 180, loss 0.0579038
  Train results batch 200, loss 0.052959
  Train results batch 220, loss 0.0676021
Training results epoch 2, loss 0.000463835547777
  Dev results batch 0, loss 0.0588876
  Dev results batch 20, loss 0.0617509
  Dev results batch 40, loss 0.0641777
  Dev results batch 60, loss 0.0607291
Dev results epoch 2, loss 0.000498005627485
....................................................................................................................................................................................


  Train results batch 0, loss 0.0580112
  Train results batch 20, loss 0.0504064
  Train results batch 40, loss 0.0635144
  Train results batch 60, loss 0.0602902
  Train results batch 80, loss 0.0705813
  Train results batch 100, loss 0.0602441
  Train results batch 120, loss 0.0550283
  Train results batch 140, loss 0.0529087
  Train results batch 160, loss 0.0573836
  Train results batch 180, loss 0.0430824
  Train results batch 200, loss 0.0559826
  Train results batch 220, loss 0.0800156
Training results epoch 3, loss 0.00047796561527
  Dev results batch 0, loss 0.0572026
  Dev results batch 20, loss 0.0659019
  Dev results batch 40, loss 0.0621342
  Dev results batch 60, loss 0.0622063
Dev results epoch 3, loss 0.000500707149532
....................................................................................................................................................................................


  Train results batch 0, loss 0.071077
  Train results batch 20, loss 0.0660187
  Train results batch 40, loss 0.0553178
  Train results batch 60, loss 0.0540919
  Train results batch 80, loss 0.0737468
  Train results batch 100, loss 0.058318
  Train results batch 120, loss 0.0558164
  Train results batch 140, loss 0.108144
  Train results batch 160, loss 0.0770534
  Train results batch 180, loss 0.057335
  Train results batch 200, loss 0.056666
  Train results batch 220, loss 0.0731215
Training results epoch 4, loss 0.000517237689982
  Dev results batch 0, loss 0.0526761
  Dev results batch 20, loss 0.0563564
  Dev results batch 40, loss 0.059949
  Dev results batch 60, loss 0.0573288
Dev results epoch 4, loss 0.000479971210733
....................................................................................................................................................................................


....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
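One way to read the losses above, assuming the noise added to each target sample is independent of the input window: no predictor can be expected to beat a mean squared error equal to the noise variance. The short calculation below compares that floor with one of the printed batch losses. The batch losses are sums over 128 examples, so they are divided by the batch size first.

```python
noise_std = 0.02     # standard deviation of the noise added to the sine wave
batch_size = 128

# Expected squared error that remains even for a perfect model of the sine.
noise_variance = noise_std ** 2

# "Dev results batch 0" after epoch 0, as printed in the output above.
reported_batch_loss = 0.0634809
per_sample_loss = reported_batch_loss / batch_size

print(noise_variance, per_sample_loss)
```

The per-sample loss of roughly 0.0005 sits just above the 0.0004 floor, which suggests the network has learned the sine wave itself and that most of the remaining error comes from the noise.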


## Conclusion, next steps

  * Implement a dynamic batch size
  * Apply on raw audio
      * Sample the microphone via Web Audio, send the samples to the notebook via WebSocket, analyze them and feed the result back
  * Implement a phase vocoder: instead of raw audio, input the frequency features
  * Achieve something like [this](https://deepmind.com/blog/wavenet-generative-model-raw-audio/) :)