## How to use return_state or return_sequences in Keras

Ref: https://www.dlology.com/blog/how-to-use-return_state-or-return_sequences-in-keras/

The c<t> of each RNN cell is known as the cell state. For GRU, a given time step's cell state equals its output hidden state. For LSTM, the output hidden state a<t> is produced by "gating" the cell state c<t> with the output gate, so a<t> and c<t> are not the same.
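
To make the gating relationship concrete, here is a minimal NumPy sketch of a single LSTM step. It assumes the standard LSTM update equations; the function name `lstm_step`, the weight layout `W`, `U`, `b`, and the sizes are illustrative choices, not Keras internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, a_prev, c_prev, W, U, b):
    """One LSTM step (hypothetical layout).

    W: (4, units, features), U: (4, units, units), b: (4, units),
    stacked in the order input gate, forget gate, candidate, output gate.
    """
    i = sigmoid(W[0] @ x_t + U[0] @ a_prev + b[0])        # input gate
    f = sigmoid(W[1] @ x_t + U[1] @ a_prev + b[1])        # forget gate
    c_tilde = np.tanh(W[2] @ x_t + U[2] @ a_prev + b[2])  # candidate cell state
    o = sigmoid(W[3] @ x_t + U[3] @ a_prev + b[3])        # output gate
    c_t = f * c_prev + i * c_tilde                        # new cell state
    a_t = o * np.tanh(c_t)                                # hidden state = gated cell state
    return a_t, c_t

rng = np.random.default_rng(0)
units, features = 3, 2
W = rng.normal(size=(4, units, features))
U = rng.normal(size=(4, units, units))
b = np.zeros((4, units))

a_t, c_t = lstm_step(rng.normal(size=features), np.zeros(units), np.zeros(units), W, U, b)
print("hidden state a<t>:", a_t)
print("cell state   c<t>:", c_t)
```

Because the output gate lies strictly between 0 and 1, each entry of a<t> is smaller in magnitude than the corresponding entry of tanh(c<t>), so the two states differ.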

## Return sequences

Return sequences refers to returning the hidden state a<t> for every time step. By default, return_sequences is set to False in Keras RNN layers, which means the RNN layer returns only the last hidden state output a<T>. The last hidden state output captures an abstract representation of the input sequence.
  
In some cases, that is all we need, such as a classification or regression model where the RNN is followed by Dense layer(s) that generate logits for news topic classification or a score for sentiment analysis, or a generative model that produces the softmax probabilities for the next possible character.
  
In other cases, we need the full sequence as the output, which requires setting return_sequences to True.

In [0]:
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
import keras
k_init = keras.initializers.Constant(value=0.1)
b_init = keras.initializers.Constant(value=0)
r_init = keras.initializers.Constant(value=0.1)

In [0]:
# define input data
data = array([0.1, 0.2, 0.3, 0.1, 0.2, 0.3]).reshape((1,3,2))
data

array([[[0.1, 0.2],
        [0.3, 0.1],
        [0.2, 0.3]]])

In [0]:
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(1, return_sequences=True, kernel_initializer=k_init, bias_initializer=b_init, recurrent_initializer=r_init))

In [0]:
# make and show prediction
output = model.predict(data)
print(output, output.shape)

[[[0.00767819]
  [0.01597687]
  [0.02480671]]] (1, 3, 1)


We can see that the output of the LSTM layer has shape (1,3,1), which stands for (#Samples, #Time steps, #LSTM units). By comparison, when return_sequences is set to False, the shape is (#Samples, #LSTM units), because only the last time step's hidden state is returned.

In [0]:
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(1, kernel_initializer=k_init, bias_initializer=b_init, recurrent_initializer=r_init))

In [0]:
# make and show prediction
output = model.predict(data)
print(output, output.shape)

[[0.02480671]] (1, 1)


There are two primary situations in which you set return_sequences to True to return the full sequence.

1. Stacking RNNs: the earlier RNN layer or layers should set return_sequences to True so that the following RNN layer or layers receive the full sequence as input.
2. Generating a classification for each time step.
- Speech recognition, or its much simpler form, trigger word detection, where we generate a value between 0 and 1 for each time step representing whether the trigger word is present.
- OCR (optical character recognition) sequence modeling with CTC.
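
Both situations can be sketched in one small model. The layer sizes (4 and 2 units) and the all-zero input batch are hypothetical, chosen only to show how the shapes flow; this is not the architecture of any particular speech or OCR system.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

inputs = Input(shape=(3, 2))                  # (time steps, features)
# Situation 1: the first LSTM returns the full sequence so the next LSTM can consume it.
x = LSTM(4, return_sequences=True)(inputs)    # -> (None, 3, 4)
# Situation 2: the last LSTM also returns the full sequence so every time step gets a prediction.
x = LSTM(2, return_sequences=True)(x)         # -> (None, 3, 2)
outputs = TimeDistributed(Dense(1, activation="sigmoid"))(x)  # one value in [0, 1] per time step
model = Model(inputs=inputs, outputs=outputs)

batch = np.zeros((1, 3, 2), dtype="float32")  # placeholder input, one sample
pred = model.predict(batch, verbose=0)
print(pred.shape)  # (1, 3, 1)
```

If the second LSTM left return_sequences at its default of False, the model would emit a single vector per sample instead of one value per time step.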

## Return states

Return states refers to returning the cell state c<T> of the last time step, along with the last hidden state. For GRU, a<t>=c<t>, so you can get by without this parameter. But for LSTM, the hidden state and the cell state are not the same.

In [0]:
inputs1 = Input(shape=(3, 2))
lstm1, state_h, state_c = LSTM(1, return_state=True, kernel_initializer=k_init, bias_initializer=b_init, recurrent_initializer=r_init)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

In [0]:
# model.summary()

In [0]:
# make and show prediction
output = model.predict(data)
output

[array([[0.02480671]], dtype=float32),
 array([[0.02480671]], dtype=float32),
 array([[0.0486485]], dtype=float32)]

In [0]:
for a in output:
    print(a.shape) 

(1, 1)
(1, 1)
(1, 1)


The output of the LSTM layer has three components: (a<T>, a<T>, c<T>), where "T" stands for the last time step. Each one has the shape (#Samples, #LSTM units).
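
The printed numbers can be checked by hand. The NumPy sketch below runs a single-unit LSTM over the same input with every kernel and recurrent weight fixed at 0.1. It rests on two assumptions that are not stated in the text: the gates use a hard sigmoid, `clip(0.2 * x + 0.5, 0, 1)`, and the forget gate carries a bias of 1 (the defaults `recurrent_activation='hard_sigmoid'` and `unit_forget_bias=True` in older Keras releases). With a plain sigmoid gate the values would come out slightly different. The helper names are illustrative.

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid (assumed gate activation).
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def lstm_forward(x, w=0.1, u=0.1, forget_bias=1.0):
    """Single-unit LSTM over x of shape (time steps, features), constant weights."""
    h, c = 0.0, 0.0
    hidden_states = []
    for x_t in x:
        z = w * x_t.sum() + u * h          # identical pre-activation for all four gates
        i = hard_sigmoid(z)                # input gate
        f = hard_sigmoid(z + forget_bias)  # forget gate (assumed bias of 1)
        o = hard_sigmoid(z)                # output gate
        g = np.tanh(z)                     # candidate cell state
        c = f * c + i * g                  # cell state c<t>
        h = o * np.tanh(c)                 # hidden state a<t>
        hidden_states.append(h)
    return np.array(hidden_states), h, c

data = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.3]])
seq, h_T, c_T = lstm_forward(data)
print(seq)       # hidden state at every time step
print(h_T, c_T)  # last hidden state and last cell state
```

Under these assumptions the sequence comes out as approximately 0.00767819, 0.01597687, 0.02480671 and the final cell state as approximately 0.0486485, matching the Keras outputs shown in this document.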

The major reason to set return_state is that an RNN may need its cell state initialized from a previous time step while the weights are shared, such as in an encoder-decoder model, where the decoder is initialized with the encoder's final states.
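
A minimal sketch of the encoder-decoder case follows. The sizes (`latent_dim`, `vocab_size`, the sequence lengths) and the all-zero inputs are hypothetical placeholders; only the wiring of `return_state` into `initial_state` is the point.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, Dense

latent_dim = 4   # hypothetical number of LSTM units
vocab_size = 5   # hypothetical decoder vocabulary size

# Encoder: discard the output, keep only the final hidden and cell states.
encoder_inputs = Input(shape=(3, 2))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: a separate LSTM that starts from the encoder's final states.
decoder_inputs = Input(shape=(4, vocab_size))
decoder_seq = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)
decoder_outputs = Dense(vocab_size, activation="softmax")(decoder_seq)

model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)

pred = model.predict(
    [np.zeros((1, 3, 2), dtype="float32"), np.zeros((1, 4, vocab_size), dtype="float32")],
    verbose=0,
)
print(pred.shape)  # (1, 4, 5)
```

The decoder must have the same number of units as the encoder, because the states passed through `initial_state` have shape (#Samples, #LSTM units).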

In [0]:
# define model
inputs1 = Input(shape=(3, 2))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True, kernel_initializer=k_init, bias_initializer=b_init, recurrent_initializer=r_init)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

In [0]:
output = model.predict(data)
output

[array([[[0.00767819],
         [0.01597687],
         [0.02480671]]], dtype=float32),
 array([[0.02480671]], dtype=float32),
 array([[0.0486485]], dtype=float32)]

In [0]:
for a in output:
    print(a.shape) 

(1, 3, 1)
(1, 1)
(1, 1)


One thing worth mentioning is that if we replace LSTM with GRU, the output will have only two components, (a<1...T>, c<T>), since in GRU a<T>=c<T>.
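
The GRU case can be checked with a short sketch. It uses default initializers rather than the constant ones used earlier, so only the shapes and the equality between the last output and the returned state are meaningful, not the specific values.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, GRU

inputs = Input(shape=(3, 2))
# With return_state=True a GRU yields two tensors: the sequence and a single final state.
seq, state = GRU(1, return_sequences=True, return_state=True)(inputs)
model = Model(inputs=inputs, outputs=[seq, state])

data = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.3], dtype="float32").reshape((1, 3, 2))
seq_out, state_out = model.predict(data, verbose=0)

print(seq_out.shape, state_out.shape)             # (1, 3, 1) (1, 1)
print(np.allclose(seq_out[:, -1, :], state_out))  # True
```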