Most of the codes are manually borrowed (typed) from Aurélien Géron's book, his notebooks for the book are [here](https://github.com/ageron/handson-ml). I used this notebook to get myself familiar with the details of RNN. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_formats = ('svg', 'retina')
plt.style.use('my_custom_style')
import matplotlib as mpl
mpl.rcParams['figure.edgecolor'] = 'white'
mpl.rcParams['figure.facecolor'] = 'white'

In [11]:
import tensorflow as tf

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


# Batch Normalization

In [None]:
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

training = tf.placeholder(tf.bool, shape=(), name="is_training")


hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn1 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)


## Vanilla RNN

Below is a vanilla recurrent neural network (RNN) to illusrate the concept of this type neural nets. 

- For an RNN with a single recurrent neuron, the output can be written as
$y_{(t)}=\phi(x_{(t)}^T\cdot w_x + y_{(t-1)} w_y + b)$. Note that the output of the previous state is now a feature of the current state.

- For a mini-batch of the shape $m\times n_{neurons}$, the prediction of the current state can be written as

$$
\begin{align*}
Y_{(t)} &= \phi(X_{(t)}\cdot W_x + Y_{(t-1)}\cdot W_y + b) \\
    &= \phi([X_{(t)} Y_{(t-1)}]\cdot W + b)
\end{align*}    
$$

where $\mathbf{W} = [\mathbf{W}_x \mathbf{W}_y]^T$.

- $\mathbf{W}_x$ is an $n_{inputs}\times n_{neurons}$ matrix
- $W_y$ is an $n_{neurons}\times n_{neurons}$ matrix
- $X_{(t)}$ is an $m\times n_{inputs}$ matrix
- $Y_{(t)}$ is an $m\times n_{neurons}$ matrix
- $b$ is a vector of size $n_{neurons}$

In [None]:
## pseudo RNN for illustration
class RNN:
    #...
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

## Manual RNN

In [7]:
reset_graph()

n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons]))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons]))
b = tf.Variable(tf.zeros([1, n_neurons]))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()


In [8]:
# a mini-batch with four instances
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t=0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t=1

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

print('Output at t=0\n\n', Y0_val)
print("="*80)
print('Output at t=1\n\n', Y1_val)

Output at t=0

 [[-0.0664006   0.9625767   0.68105793  0.7091854  -0.898216  ]
 [ 0.9977755  -0.719789   -0.9965761   0.9673924  -0.9998972 ]
 [ 0.99999774 -0.99898803 -0.9999989   0.9967762  -0.9999999 ]
 [ 1.         -1.         -1.         -0.99818915  0.9995087 ]]
Output at t=1

 [[ 1.         -1.         -1.          0.4020025  -0.9999998 ]
 [-0.12210419  0.62805265  0.9671843  -0.9937122  -0.2583937 ]
 [ 0.9999983  -0.9999994  -0.9999975  -0.85943305 -0.9999881 ]
 [ 0.99928284 -0.99999815 -0.9999058   0.9857963  -0.92205757]]


Since our naive RNN has only five neurons, and the mini-batch is of shape $4\times 3$, and thus the  shape of the output Y1_val should be $(4\times 3) \times (3\times 5) = (4\times 5)$.

## Static Unrolling

In [9]:
reset_graph()

n_inputs = 3
n_neurons = 5

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t=0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t=1

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1],
                                                dtype=tf.float32)
Y0, Y1 = output_seqs

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    Y0_val, Y1_val, states_ = sess.run([Y0, Y1, states], feed_dict={X0: X0_batch, X1: X1_batch})

In [10]:
print('Output at t=0\n\n', Y0_val)
print("="*80)
print('Output at t=1\n\n', Y1_val)
print("="*80)
print("Final states are\n", states_) # this is simply the last output

Output at t=0

 [[ 0.30741334 -0.32884315 -0.6542847  -0.9385059   0.52089024]
 [ 0.99122757 -0.9542541  -0.7518079  -0.9995208   0.9820235 ]
 [ 0.9999268  -0.99783254 -0.8247353  -0.9999963   0.99947774]
 [ 0.996771   -0.68750614  0.8419969   0.9303911   0.8120684 ]]
Output at t=1

 [[ 0.99998885 -0.99976057 -0.0667929  -0.9999803   0.99982214]
 [-0.6524943  -0.51520866 -0.37968948 -0.5922594  -0.08968379]
 [ 0.99862397 -0.99715203 -0.03308626 -0.9991566   0.9932902 ]
 [ 0.99681675 -0.9598194   0.39660627 -0.8307606   0.79671973]]
Final states are
 [[ 0.99998885 -0.99976057 -0.0667929  -0.9999803   0.99982214]
 [-0.6524943  -0.51520866 -0.37968948 -0.5922594  -0.08968379]
 [ 0.99862397 -0.99715203 -0.03308626 -0.9991566   0.9932902 ]
 [ 0.99681675 -0.9598194   0.39660627 -0.8307606   0.79671973]]


Handle long time-steps

In [11]:
reset_graph()

n_steps = 2
n_inputs = 3
n_neurons = 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
# swap the positions of time-steps and features in X
X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) 
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell,
                                                X_seqs, dtype=tf.float32)
# change the shape of output back to that of X
outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

X_batchs = np.array([
    [[0, 1, 2], [9, 8, 7]],
    [[3, 4, 5], [0, 0, 0]],
    [[6, 7, 8], [6, 5, 4]],
    [[9, 0, 1], [3, 2, 1]],
])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batchs})

In [12]:
print(outputs_val)

[[[-0.45652324 -0.68064123  0.40938237  0.63104504 -0.45732826]
  [-0.9428799  -0.9998869   0.94055814  0.9999985  -0.9999997 ]]

 [[-0.8001535  -0.9921827   0.7817797   0.9971032  -0.9964609 ]
  [-0.637116    0.11300927  0.5798437   0.4310559  -0.6371699 ]]

 [[-0.93605185 -0.9998379   0.9308867   0.9999815  -0.99998295]
  [-0.9165386  -0.9945604   0.896054    0.99987197 -0.9999751 ]]

 [[ 0.9927369  -0.9981933  -0.55543643  0.9989031  -0.9953323 ]
  [-0.02746338 -0.73191994  0.7827872   0.9525682  -0.9781773 ]]]


The static rolling method is still problematic, as it will need to creat n_steps copies of the basic RNN cell. As the n_steps increases, we will face the out-of-memory problem, especially when using GPU. 

## Use dynamic unrolling

In [13]:
reset_graph()

n_inputs = 3
n_steps = 2
n_neurons = 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

X_batchs = np.array([
    [[0, 1, 2], [9, 8, 7]],
    [[3, 4, 5], [0, 0, 0]],
    [[6, 7, 8], [6, 5, 4]],
    [[9, 0, 1], [3, 2, 1]],
])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batchs})

In [14]:
print("Used dynamic_rnn\n\n", outputs_val)

Used dynamic_rnn

 [[[-0.85115266  0.87358344  0.5802911   0.8954789  -0.0557505 ]
  [-0.999996    0.99999577  0.9981815   1.          0.37679607]]

 [[-0.9983293   0.9992038   0.98071456  0.999985    0.25192663]
  [-0.7081804  -0.0772338  -0.85227895  0.5845349  -0.78780943]]

 [[-0.9999827   0.99999535  0.9992863   1.          0.5159072 ]
  [-0.9993956   0.9984095   0.83422637  0.99999976 -0.47325212]]

 [[ 0.87888587  0.07356028  0.97216916  0.9998546  -0.7351168 ]
  [-0.9134514   0.3600957   0.7624866   0.99817705  0.80142   ]]]


## Setting the sequence length

`help(tf.nn.dynamic_rnn)`

```python
dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None, dtype=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None)
```

In [15]:
reset_graph()

n_steps = 2
n_inputs = 3
n_neurons = 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
seq_length = tf.placeholder(tf.int32, [None])

outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)


X_batch = np.array([
    [[0, 1, 2], [9, 8, 7]],
    [[3, 4, 5], [0, 0, 0]], # (padded with zero vector)
    [[6, 7, 8], [6, 5, 4]],
    [[9, 0, 1], [3, 2, 1]],
])

seq_length_batch = np.array([2, 1, 2, 2])
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    outputs_val, states_val = sess.run([outputs, states],
                                       feed_dict={X: X_batch,
                                                  seq_length: seq_length_batch})

In [16]:
print("with sequence length manually setted\n\n", outputs_val)
print("\n", "="*60)
print(states_val)


with sequence length manually setted

 [[[-0.9123188   0.16516446  0.5548655  -0.39159346  0.20846416]
  [-1.          0.956726    0.99831694  0.99970174  0.96518576]]

 [[-0.9998612   0.6702289   0.9723653   0.6631046   0.74457586]
  [ 0.          0.          0.          0.          0.        ]]

 [[-0.99999976  0.8967997   0.9986295   0.9647514   0.93662   ]
  [-0.9999526   0.9681953   0.96002865  0.98706263  0.85459226]]

 [[-0.96435434  0.99501586 -0.36150697  0.9983378   0.999497  ]
  [-0.9613586   0.9568762   0.7132288   0.97729224 -0.0958299 ]]]

[[-1.          0.956726    0.99831694  0.99970174  0.96518576]
 [-0.9998612   0.6702289   0.9723653   0.6631046   0.74457586]
 [-0.9999526   0.9681953   0.96002865  0.98706263  0.85459226]
 [-0.9613586   0.9568762   0.7132288   0.97729224 -0.0958299 ]]


# Training a sequence classifier

Training a RNN classifier for MNIST images. Treat an image as a sequence! A MNIST image is of the shape $28\times 28$, we can treat it as a sequence of 28 rows of 28 pixels each.

In [17]:
reset_graph()

n_steps = 28
n_inputs = 28
n_neurons = 100
n_outputs = 10 # number of digits (0-9)

learning_rate = 1e-3

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits_ = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits_)

loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits_, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

In [27]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

In [29]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [38]:
n_epochs = 20
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(f"epoch={epoch+1}, Training acc={acc_train:.4f}, Testing acc={acc_test:.4f}")

epoch=19, Training acc=0.9600, Testing acc=0.9688


epoch=18, Training acc=0.9800, Testing acc=0.9682


epoch=17, Training acc=0.9800, Testing acc=0.9708


epoch=16, Training acc=0.9667, Testing acc=0.9725


epoch=15, Training acc=0.9867, Testing acc=0.9718


epoch=14, Training acc=0.9867, Testing acc=0.9690


epoch=13, Training acc=0.9800, Testing acc=0.9666


epoch=12, Training acc=0.9533, Testing acc=0.9720


epoch=11, Training acc=0.9867, Testing acc=0.9726


epoch=10, Training acc=1.0000, Testing acc=0.9711


epoch=9, Training acc=0.9667, Testing acc=0.9688


epoch=8, Training acc=0.9933, Testing acc=0.9621


epoch=7, Training acc=0.9867, Testing acc=0.9665


epoch=6, Training acc=0.9800, Testing acc=0.9638


epoch=5, Training acc=0.9733, Testing acc=0.9512


epoch=4, Training acc=0.9867, Testing acc=0.9526


epoch=3, Training acc=0.9600, Testing acc=0.9381


epoch=2, Training acc=0.9067, Testing acc=0.9399


epoch=1, Training acc=0.9600, Testing acc=0.9340


epoch=0, Training acc=0.9000, Testing acc=0.9170


This is just a toy example with very thrift learning parameters, because my poor Macbook Pro cannot handle more trianing epochs and a smaller learning rate. Otherwise, it will take forever to finish the training.

Anyhow, we achieved a test accuracy about 97.3% at the 11th epochs. Note that the RNN starts to overfit at the 12th epoch, given the relatively large learning rate (lr=1e-3).

# RNN for Time Series

In [39]:
t_min, t_max = 0, 30
resolution = 0.1

def time_series(t):
    return t * np.sin(t) / 3 + 2 * np.sin(t*5)

#  Instead of fetching a mini-batch from an existing data set
# this helper function creates a mini-batch each
def next_batch(batch_size, n_steps):
    t0 = np.random.randn(batch_size, 1) * (t_max - t_min - n_steps * resolution)
    Ts = t0 + np.arange(0., n_steps+1) * resolution
    ys = time_series(Ts)
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

In [44]:
t = np.linspace(t_min, t_max, int((t_max - t_min) / resolution))

n_steps = 20
t_instance = np.linspace(12.2, 12.2 + resolution * (n_steps + 1), n_steps + 1)

colors = ['#e66101','#fdb863','#b2abd2','#5e3c99']

plt.figure(figsize=(6,10))
plt.subplot(211)
plt.title("A time series (generated)", fontsize=14)
plt.plot(t, time_series(t), label=r"$t . \sin(t) / 3 + 2 . \sin(5t)$", c=colors[0])
plt.plot(t_instance[:-1], time_series(t_instance[:-1]),
         "-", c=colors[3],
         linewidth=2, label="A training instance")
plt.legend(loc="lower left", fontsize=14)
plt.axis([0, 30, -17, 13])
plt.xlabel("Time")
plt.ylabel("Value")

plt.subplot(212)
plt.title("A training instance", fontsize=14)
plt.plot(t_instance[:-1], time_series(t_instance[:-1]),
         "o", c=colors[3], markersize=10, label="instance")
plt.plot(t_instance[1:], time_series(t_instance[1:]),
         "^", c=colors[0], markersize=6, label="target")
plt.legend(loc="upper left")
plt.xlabel("Time")


plt.show()

<Figure size 432x720 with 2 Axes>

In [98]:
reset_graph()

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

# use an OutputProjectionWrapper to automatically stack the output
# otherwise we need to do it ourselves (use `reshape`)
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)

outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)


In [99]:
lr = 1e-3

loss = tf.reduce_mean(tf.square(outputs - y)) # MSE
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()


In [100]:
n_iter = 1500
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iter in range(n_iter):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iter % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(f"{iter}\tMSE: {mse:.4f}")
    saver.save(sess, "./my_time_series_model")

1400	MSE: 0.0834


1300	MSE: 0.0713


1200	MSE: 0.0554


1100	MSE: 0.0605


1000	MSE: 0.0640


900	MSE: 0.0644


800	MSE: 0.0861


700	MSE: 0.0645


600	MSE: 0.0611


500	MSE: 0.0771


400	MSE: 0.0950


300	MSE: 0.1081


200	MSE: 0.2971


100	MSE: 1.6234


0	MSE: 31.5879


In [114]:
with tf.Session() as sess:
    saver.restore(sess, "./my_time_series_model")

    X_new = time_series(np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
    y_pred1 = sess.run(outputs, feed_dict={X: X_new})

INFO:tensorflow:Restoring parameters from ./my_time_series_model


In [103]:
y_pred

array([[[-3.5123656 ],
        [-2.4479904 ],
        [-1.0369046 ],
        [ 0.75989497],
        [ 2.1721063 ],
        [ 3.0693047 ],
        [ 3.5415273 ],
        [ 3.3060033 ],
        [ 2.826444  ],
        [ 2.280538  ],
        [ 1.6957043 ],
        [ 1.4992216 ],
        [ 1.8971107 ],
        [ 2.7551672 ],
        [ 3.9230437 ],
        [ 5.108908  ],
        [ 6.131944  ],
        [ 6.70724   ],
        [ 6.6501303 ],
        [ 6.068209  ]]], dtype=float32)

In [115]:
plt.subplots(figsize=(8, 8/1.4))
plt.title("Testing the model", fontsize=14)
plt.plot(t_instance[:-1], time_series(t_instance[:-1]),
         "o", c=colors[0], markersize=8, label="instance")
plt.plot(t_instance[1:], time_series(t_instance[1:]),
         "^", c=colors[3], markersize=6, label="target")
plt.plot(t_instance[1:], y_pred1[0,:,0],
         ".", c=colors[2], markersize=8, label="prediction")
plt.legend(loc="upper left")
plt.xlabel("Time")

plt.show()

<Figure size 576x411.429 with 1 Axes>

## Multi-layer RNN

In [65]:
reset_graph()

n_inputs = 1
n_steps = 5
n_outputs = 1
n_neurons = 100


X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

n_neurons = 100
n_layers = 3

layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
          for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
rnn_outputs_stack = tf.reshape(rnn_outputs, [-1, n_neurons])
outputs_stack = tf.layers.dense(rnn_outputs_stack, n_outputs)
outputs = tf.reshape(outputs_stack, [-1, n_steps, n_outputs])


In [81]:
lr = 1e-4

loss = tf.reduce_mean(tf.square(outputs - y)) # MSE
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
training_op = optimizer.minimize(loss)


In [82]:
#  Instead of fetching a mini-batch from an existing data set
# this helper function creates a mini-batch each
def next_batch(batch_size, n_steps):
    t0 = np.random.randn(batch_size, 1) * (t_max - t_min - n_steps * resolution)
    Ts = t0 + np.arange(0., n_steps+1) * resolution
    ys = time_series(Ts)
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

In [116]:
n_iter = 1500
batch_size = 50

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    for iter in range(n_iter):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iter % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(f"{iter}\tMSE: {mse:.4f}")
            
    X_new = time_series(np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
    y_pred2 = sess.run(outputs, feed_dict={X: X_new})

1400	MSE: 0.0831


1300	MSE: 0.0697


1200	MSE: 0.0678


1100	MSE: 0.0765


1000	MSE: 0.0617


900	MSE: 0.0662


800	MSE: 0.0759


700	MSE: 0.0758


600	MSE: 0.0890


500	MSE: 0.0967


400	MSE: 0.0890


300	MSE: 0.0982


200	MSE: 0.4850


100	MSE: 1.4701


0	MSE: 32.8862


In [126]:
plt.subplots(figsize=(8, 8/1.4))
plt.title("Testing the model", fontsize=14)
plt.plot(t_instance[:-1], time_series(t_instance[:-1]),
         "o", c=colors[0], markersize=8, label="instance")
plt.plot(t_instance[1:], time_series(t_instance[1:]),
         "^", c=colors[3], markersize=6, label="target")
plt.plot(t_instance[1:], y_pred2[0,:,0],
         ".", c=colors[2], markersize=8, label="3-RNN")
plt.legend(loc="upper left")
plt.xlabel("Time")

plt.show()

<Figure size 576x411.429 with 1 Axes>

In [127]:
plt.title("Compare single RNN and multiple RNN", fontsize=14)
plt.plot(t_instance[1:], time_series(t_instance[1:]),
         ".", c=colors[1], markersize=6, label="target")
plt.plot(t_instance[1:], y_pred1[0,:,0],
         "^", c=colors[0], markersize=5, label="single RNN")
plt.plot(t_instance[1:], y_pred2[0,:,0],
         "o", c=colors[3], markersize=5, label="three RNNs")
plt.legend(loc="upper left")
plt.xlabel("Time")

plt.show()

<Figure size 864x504 with 1 Axes>

Using a single RNN or three RNNs makes not much difference here, both models did not work well at the beginning of the sequence.

# LSTM
## Keys to note in LSTM

LSTM has three gates to control the flow of information from the past state: 1) a gate to forget; 2) a gate to include; 3) and the output gate. The gates are implemented with a sigmoid function. 

Equations in LSTM
- Include (remember) gate: $i_{(t)} = \sigma(W_{xi}^T\cdot x_{(t)} + W_{hi}^T\cdot h_{(t-1)}) + b_i$
- Forget gate: $f_{(t)} = \sigma(W_{xf}^T\cdot x_{(t)} + W_{hf}^T\cdot h_{(t-1)}) + b_f$
- Output gate: $o_{(t)} = \sigma(W_{xo}^T\cdot x_{(t)} + W_{hg}^T\cdot h_{(t-1)} + b_o$
- $g_{(t)} = \tanh(W_{xg})^T\cdot x_{(t)} + W_{hg}^T\cdot h_{(t-1)} + b_g$
- $c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)}\otimes g_{(t)}$
- $y_{(t)}=h_{(t)}=o_{(t)}\otimes \tanh(c_{(t)})$

 ![lstm](./LSTM3-chain.png)
 > Figure obtained from [colah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)                                
                                 

# Word Embedding using the `imdb` data

Words embedding serves as an intermediate step for converting text into numeric
vectors for our machine learning tasks. One significant advantage of learned
embedding is the data obtained are dense vectors, unlike those obtained from
one-hot encoding, which are of sparse structures.

Note word embedding often starts with a random vector.

> The `Embedding` layer is best understood as a dictionary mapping integer
> indices (which stand for specific words) to dense vectors. It takes integers
> as input, then it look up these integers into an internal dictionary, and it
> returns the associated vectors. **It's effectively a dictionary lookup**.

The embedding layer takes a 2D tensor as input, of shape `(samples,
sequence_length)`, and returns a 3D tensor as output, of shape `(samples,
sequence_length, embedding_dimensionality)`. Such as 3D tensor can be processed
by a RNN layer or a CONV1D` layer.

Since a word embedding is dense, the distance (e.g., L2 norm) between two vectors also characterize the their similaries.


## Learning word embedding with `Embedding` layer - Use `keras`

## load imdb data

In [4]:
import os
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000
maxlen = 50 # look twenty 50 words in every review

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

## transform lists of integers to 2D integer tensor of shape
## `(sample, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test  = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

In [2]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
			  loss='binary_crossentropy',
			  metrics=['acc'])
model.summary()

#fit model
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)























 4416/20000 [=====>........................] - ETA: 0s - loss: 0.2140 - acc: 0.9144

 2976/20000 [===>..........................] - ETA: 0s - loss: 0.2122 - acc: 0.9147

 1504/20000 [=>............................] - ETA: 0s - loss: 0.2086 - acc: 0.9156

Epoch 10/10
   32/20000 [..............................] - ETA: 1s - loss: 0.0873 - acc: 1.0000

























 4352/20000 [=====>........................] - ETA: 0s - loss: 0.2303 - acc: 0.9104

 2944/20000 [===>..........................] - ETA: 0s - loss: 0.2314 - acc: 0.9083

 1600/20000 [=>............................] - ETA: 0s - loss: 0.2197 - acc: 0.9131

Epoch 9/10
   32/20000 [..............................] - ETA: 1s - loss: 0.2418 - acc: 0.8750

























 4512/20000 [=====>........................] - ETA: 0s - loss: 0.2488 - acc: 0.9029

 3040/20000 [===>..........................] - ETA: 0s - loss: 0.2626 - acc: 0.8984

 1568/20000 [=>............................] - ETA: 0s - loss: 0.2789 - acc: 0.8941

Epoch 8/10
   32/20000 [..............................] - ETA: 1s - loss: 0.3495 - acc: 0.8438

























 3168/20000 [===>..........................] - ETA: 0s - loss: 0.2549 - acc: 0.9006

 1600/20000 [=>............................] - ETA: 0s - loss: 0.2495 - acc: 0.9056

Epoch 7/10
   32/20000 [..............................] - ETA: 1s - loss: 0.3502 - acc: 0.9062





















 4608/20000 [=====>........................] - ETA: 0s - loss: 0.2793 - acc: 0.8869

 3104/20000 [===>..........................] - ETA: 0s - loss: 0.2804 - acc: 0.8850

 1600/20000 [=>............................] - ETA: 0s - loss: 0.2815 - acc: 0.8819

Epoch 6/10
   32/20000 [..............................] - ETA: 1s - loss: 0.2992 - acc: 0.8438





















 4576/20000 [=====>........................] - ETA: 0s - loss: 0.3050 - acc: 0.8789

 3072/20000 [===>..........................] - ETA: 0s - loss: 0.3081 - acc: 0.8799

 1568/20000 [=>............................] - ETA: 0s - loss: 0.3116 - acc: 0.8769

Epoch 5/10
   32/20000 [..............................] - ETA: 1s - loss: 0.3646 - acc: 0.8125























 4000/20000 [=====>........................] - ETA: 0s - loss: 0.3162 - acc: 0.8630

 2656/20000 [==>...........................] - ETA: 0s - loss: 0.3139 - acc: 0.8637

 1440/20000 [=>............................] - ETA: 0s - loss: 0.3179 - acc: 0.8618

Epoch 4/10
   32/20000 [..............................] - ETA: 1s - loss: 0.3889 - acc: 0.7500























 4544/20000 [=====>........................] - ETA: 0s - loss: 0.3752 - acc: 0.8415

 3072/20000 [===>..........................] - ETA: 0s - loss: 0.3777 - acc: 0.8402

 1600/20000 [=>............................] - ETA: 0s - loss: 0.3876 - acc: 0.8344

Epoch 3/10
   32/20000 [..............................] - ETA: 1s - loss: 0.5263 - acc: 0.7812





















 4544/20000 [=====>........................] - ETA: 0s - loss: 0.4979 - acc: 0.7942

 3072/20000 [===>..........................] - ETA: 0s - loss: 0.5104 - acc: 0.7829

 1600/20000 [=>............................] - ETA: 0s - loss: 0.5214 - acc: 0.7825

Epoch 2/10
   32/20000 [..............................] - ETA: 1s - loss: 0.4928 - acc: 0.8750



























 3232/20000 [===>..........................] - ETA: 1s - loss: 0.6932 - acc: 0.5050

 1664/20000 [=>............................] - ETA: 1s - loss: 0.6936 - acc: 0.4952  

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
   32/20000 [..............................] - ETA: 1:17 - loss: 0.6997 - acc: 0.4062

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 400)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 401       
Total params: 80,401
Trainable params: 80,401
Non-trainable params: 0
_________________________________________________________________


We get a validation accuracy of 82% when accounted  50 words in each review.

## Using a Pre-trained word embedding

In [6]:
imdb_dir = "/Users/poor.gentry/Python/imdb/aclImdb"
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
	dir_name = os.path.join(train_dir, label_type)
	for fname in os.listdir(dir_name):
		if fname[-4:] == ".txt":
                    f = open(os.path.join(dir_name, fname))
		    texts.append(f.read())
		    f.close()
		    if label_type == "neg":
		        labels.append(0)
		    else:
                        labels.append(1)

In [7]:
import os
from subprocess import check_output

print(check_output(["ls", "../rnn-nlp"]).decode('utf8'))

LSTM3-chain.png
MNIST_data
checkpoint
my_time_series_model.data-00000-of-00001
my_time_series_model.index
my_time_series_model.meta
nlp_yelp.ipynb
rnn-nlp.ipynb
rnn_conv1d.ipynb



## Tokenize the data

In [12]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

maxlen = 100 # cut reviews longer than 100 words
training_samples = 200 
validation_samples = 10000
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

data = pad_sequences(sequences, maxlen=maxlen)

word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens')

labels = np.asarray(labels)
print('Shape of data tensor', data.shape)
print('Shape of label tensor', labels.shape)

## shuffle data and labels
idx = np.arange(data.shape[0])
np.random.shuffle(idx)
data = data[idx]
labels = labels[idx]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples:training_samples + validation_samples]
y_val = labels[training_samples:training_samples + validation_samples]


Found 88582 unique tokens
Shape of data tensor (25000, 100)
Shape of label tensor (25000,)


## Load pre-trained model: `GloVe`

The `GloVe` model can be downloaded from [here](https://nlp.stanford.edu/projects/glove/). The original paper can be accessed [here](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)

In [14]:
glove_dir = "/Users/poor.gentry/Python/glove/"

embedding_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

    print(f'Found {len(embedding_index)} word vectors') 

Found 400000 word vectors


In [19]:
embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
	embedding_vector = embedding_index.get(word)
	if i < max_words:
		if embedding_vector is not None:
			embedding_matrix[i] = embedding_vector

In [22]:
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_2 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________


## Load the GloVe embedding into the model 

The `embedding` layer has a single weight matrix: a 2D float matrix where each
entry `i` is the word vector meant to be associated with index i. 


In [23]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False # freeze weight in the embedding layer

model.compile(optimizer='rmsprop',
			  loss='binary_crossentropy',
			  metrics=['acc'])

history = model.fit(x_train, y_train,
					epochs=10, batch_size=32,
					validation_data = (x_val, y_val))

model.save_weights('pre_trained_glove_model.h5')



Epoch 10/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.0240 - acc: 1.0000



Epoch 9/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.1313 - acc: 0.9688



Epoch 8/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.0558 - acc: 1.0000



Epoch 7/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.0508 - acc: 1.0000



Epoch 6/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.1448 - acc: 1.0000



Epoch 5/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.2588 - acc: 0.9062



Epoch 4/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.4321 - acc: 0.7188



Epoch 3/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.2954 - acc: 0.9375



Epoch 2/10
 32/200 [===>..........................] - ETA: 0s - loss: 0.4937 - acc: 0.7500



Train on 200 samples, validate on 10000 samples
Epoch 1/10
 32/200 [===>..........................] - ETA: 1s - loss: 0.7519 - acc: 0.5000

With the pretrained embedding weights, we achieved a validation
accuracy of 56%. Let's plot the training results,

In [29]:
colors = ['#e66101','#fdb863','#b2abd2','#5e3c99']

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc)+1)

plt.plot(epochs, acc, 'o', c=colors[3], label='Training Acc')
plt.plot(epochs, val_acc, c=colors[0], label='Validation acc')
plt.title("training and validation accuracy")
plt.legend()
plt.show()

plt.figure()
plt.plot(epochs, loss, 'o', c=colors[0], label='Training loss')
plt.plot(epochs, val_loss, c=colors[3], label="Validation loss")
plt.title("Training and validation loss")
plt.legend()

plt.show()



<Figure size 864x504 with 1 Axes>

<Figure size 864x504 with 1 Axes>

In [30]:
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
	dir_name = os.path.join(test_dir, label_type)
	for fname in sorted(os.listdir(dir_name)):
		with open(os.path.join(dir_name, fname)) as f:
			texts.append(f.read())
		if label_type == 'neg':
			labels.append(0)
		else:
			labels.append(1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

In [31]:
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

[0.7750623719501495, 0.55404]





























 4640/25000 [====>.........................] - ETA: 0s

 3072/25000 [==>...........................] - ETA: 0s

 1600/25000 [>.............................] - ETA: 0s

   32/25000 [..............................] - ETA: 1s

We get an test accuracy of 55%, which is only slightly better than a
random guess. Sometimes, the amount of samples are much more important
than powerful algorithms.

# Learn embedding with tensorflow

Used code can be found [here](https://www.tensorflow.org/programmers_guide/embedding).

An embedding is a mapping from discrete objects, such as words, to
vectors of real numbers. The individual dimensions in a embedding
vector is not of great meaning, but rather their relative distance and
the overall patterns of location matter.

Almost all the machine learning algorithms require numerical inputs.
However, the words of text do not have a natural numerical (vector)
representation. The **Embedding** functions are the standard and
effective way to transform such discrete input objects into useful
continuous vectors.

Overall, there are two types of methods can used to obtained the
vector representation (embedding) of words or text, count-based
methods, and predictive methods. The famous word2vec method proposed
by google is a predictive method, which is computationally efficient
for learning word embeddings from raw text. There are two distinct way
to implement the word2vec method:

- Continuous Bag-of-Words (CBOW)
- Skip-Gram model.

 The CBOW predicts the target words from source context, while
 skip-gram works conversely that predicts the source context-words
 from the target words. Given a text exmaple, "the cat sits on the
 mat", the CBOW predicts the *cat* from *the cat sits on the*. 

According to existing experiences, the CBOW is more efficient for
small dataset, while Skip-Gram works better for large text samples.
 
Traditionally, the neural probabilistic models are trained using the
maximum likelihood to maximize the probability of a prediction (next
word, $w_t$) given the previous words $h$ in terms of softmax
function. 

$$ \begin{align*}
P(w_t|h) &=\mathrm{softmax}(\mathrm{score}(w_t, h)) \\ 
    &= \frac{exp\{\mathrm{score}(w_t, h)\}}{\sum_{\mathrm{Word\ w'\ in\
    Vocab}\exp\{\mathrm{score}(w', h)\}}} \end{align*} $$

However, for embedding training, softmax is extremely computationally
expensive, as it involves the normalizer at every single training
step. The noise-contrastive estimation (NCE) is used to address this
issue. Also, Tensorflow has this function built in,
`tf.nn.nce_loss()`. 

To create word embedding in Tensorflow, we first split the text into
words and then assign an integer to every word in the vocabulary. 


## Build a Skip-Gram model with Tensorflow

To obtain the embedding using the Skip-Gram model, we usually start
the randomly initalized embedding matrix. Also, remember that the NCE
loss is defined as a logistic regression model. Hence, we need to
define the weights along with the biases for each word in the
vocabulary.

## Set up compuation graph 

In [None]:
embedding = tf.Variable(
    tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))

nce_weights = tf.Variable(
    tf.truncated_normal([vocal_size, embedding_size],
                        stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# inputs
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

In [1]:
## Look up embedding

In [None]:
embed = tf.nn.embedding_lookup(embedding, train_inputs)

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))

# optimize loss
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)

# train model
with tf.Session() as sess:
    for inputs, labels in generate(...):
        feed_dict = {train_input: inputs, train_labels: labels}
        _, cur_loss = sess.run([optimizer, loss], feed_dict=feed_dict)

