# Stacked Bi-directional LSTMs in Gluon

We'll build up to a stacked bi-directional LSTM by starting with a plain RNN and adding components one at a time.

1. RNN
2. Stacked RNN
3. Bi-directional RNN
4. Stacked Bi-directional RNN
5. LSTM
6. Stacked Bi-directional LSTM

## 1. RNN

__Key Highlight__: Useful for sequential data, since a hidden state carries information from one time step to the next.

__Architecture__:

![png](./imgs/rnn.png)

__Cell structure__:

![png](./imgs/rnn_cell.png)
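
The recurrence in the cell diagram can be sketched without MXNet. The following is a minimal pure-Python illustration, assuming a ReLU activation (believed to be the default for Gluon's `rnn.RNN`, and consistent with the exact zeros in the outputs printed later). The helper names `matvec`, `rnn_step` and `rnn_forward` and the toy weights are hypothetical and not part of the Gluon API.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x, h_prev, W_ih, W_hh, b):
    """One step: h_t = relu(W_ih @ x_t + W_hh @ h_{t-1} + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(W_ih, x), matvec(W_hh, h_prev), b)]
    return [max(0.0, p) for p in pre]


def rnn_forward(xs, h0, W_ih, W_hh, b):
    """Run the cell over a sequence; return every hidden state and the last one."""
    outputs, h = [], h0
    for x in xs:
        h = rnn_step(x, h, W_ih, W_hh, b)
        outputs.append(h)
    return outputs, h


# toy sizes (assumed for illustration): 3 input channels, 2 hidden units, 2 time steps
W_ih = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
xs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

outputs, h_last = rnn_forward(xs, [0.0, 0.0], W_ih, W_hh, b)
print(outputs)  # [[1.0, 0.0], [4.5, 0.0]]
print(h_last)   # [4.5, 0.0], identical to outputs[-1]
```

For a single-layer, uni-directional RNN the last output and the final hidden state are the same values, which is what the Gluon cells below also show.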

### Code: initial hidden state set automatically

In [1]:
import mxnet as mx

In [2]:
sequence_length = 4
batch_size = 5
channels = 3

inputs = mx.nd.random.uniform(shape=(sequence_length, batch_size, channels))
first_input = inputs[0]
first_input


[[ 0.54881352  0.59284461  0.71518934]
 [ 0.84426576  0.60276335  0.85794562]
 [ 0.54488319  0.84725171  0.42365479]
 [ 0.62356371  0.64589411  0.38438171]
 [ 0.4375872   0.29753461  0.89177299]]
<NDArray 5x3 @cpu(0)>

In [3]:
hid_layers = 1
hid_units = 6

rnn = mx.gluon.rnn.RNN(hidden_size=hid_units, num_layers=hid_layers, layout='TNC')

In [4]:
# initialize weights lazily (parameter shapes are inferred on the first forward pass)
rnn.initialize()

In [5]:
# since no initial state is provided, the hidden state is initialized to zeros of the appropriate shape
outputs = rnn(inputs)

In [6]:
# for a plain rnn, the output is the hidden state; we get one for every time step.
outputs.shape

(4, 5, 6)

In [7]:
final_output = outputs[-1]
final_output


[[ 0.          0.          0.05846469  0.00541316  0.          0.06052626]
 [ 0.          0.          0.06059704  0.          0.          0.02954475]
 [ 0.          0.          0.03816931  0.04454039  0.          0.04824264]
 [ 0.          0.          0.04717389  0.0336915   0.          0.08854606]
 [ 0.          0.          0.04403492  0.          0.          0.0222607 ]]
<NDArray 5x6 @cpu(0)>

### Code: initial hidden state set manually

In [8]:
hid_init = mx.nd.random.uniform(shape=(hid_layers, batch_size, hid_units))

In [9]:
# when an initial state is passed in, a tuple of (outputs, states) is returned
outputs, hid_states = rnn(inputs, hid_init)

In [10]:
outputs.shape

(4, 5, 6)

In [11]:
final_output = outputs[-1]
final_output


[[ 0.          0.          0.05847301  0.00541939  0.          0.06052965]
 [ 0.          0.          0.06059588  0.          0.          0.02955461]
 [ 0.          0.          0.03817132  0.04454382  0.          0.04825731]
 [ 0.          0.          0.04715472  0.03366784  0.          0.08856017]
 [ 0.          0.          0.04403805  0.          0.          0.02227229]]
<NDArray 5x6 @cpu(0)>

In [12]:
# a plain rnn passes a single state (the hidden state) between cells
len(hid_states)

1

In [13]:
# states are returned for the last time step only
hid_states[0].shape

(1, 5, 6)

In [14]:
# same as final_output
hid_states[0]


[[[ 0.          0.          0.05847301  0.00541939  0.          0.06052965]
  [ 0.          0.          0.06059588  0.          0.          0.02955461]
  [ 0.          0.          0.03817132  0.04454382  0.          0.04825731]
  [ 0.          0.          0.04715472  0.03366784  0.          0.08856017]
  [ 0.          0.          0.04403805  0.          0.          0.02227229]]]
<NDArray 1x5x6 @cpu(0)>

## 2. Stacked RNN

__Key Highlight__: Similar to stacking layers in CNNs, this gives the network a bias towards learning hierarchical features.

__Architecture__:

![png](./imgs/stacked_rnn.png)

### Code:

In [15]:
hid_layers = 2

In [16]:
stack_rnn = mx.gluon.rnn.RNN(hidden_size=hid_units, num_layers=hid_layers, layout='TNC')
stack_rnn.initialize()

In [17]:
hid_init = mx.nd.random.uniform(shape=(hid_layers, batch_size, hid_units))
outputs, hid_states = stack_rnn(inputs, hid_init)

In [18]:
# output shape is unchanged by the number of layers: once again, one output per time step
outputs.shape

(4, 5, 6)

In [19]:
final_output = outputs[-1]
final_output


[[ 0.00032953  0.          0.          0.0001187   0.00336502  0.00088951]
 [ 0.00117935  0.          0.          0.00132704  0.00411009  0.00090487]
 [ 0.          0.          0.00201352  0.          0.          0.00068501]
 [ 0.          0.00052974  0.00056825  0.          0.00165059  0.0006478 ]
 [ 0.00053742  0.          0.          0.00042893  0.00243434  0.00053949]]
<NDArray 5x6 @cpu(0)>

In [20]:
# a plain rnn passes a single state (the hidden state) between cells
len(hid_states)

1

In [21]:
# but it now holds one hidden state per layer (last time step only)
hid_states[0].shape

(2, 5, 6)

In [22]:
# the last element (top layer) is the same as final_output; the first (bottom layer) is not part of the output
hid_states[0]


[[[ 0.01719323  0.          0.04016658  0.          0.00293834  0.        ]
  [ 0.00459993  0.          0.0554641   0.          0.          0.00252433]
  [ 0.03334777  0.          0.          0.          0.03723463  0.        ]
  [ 0.03768054  0.          0.00740662  0.          0.00896394  0.        ]
  [ 0.00599414  0.          0.03325973  0.          0.00166829  0.        ]]

 [[ 0.00032953  0.          0.          0.0001187   0.00336502  0.00088951]
  [ 0.00117935  0.          0.          0.00132704  0.00411009  0.00090487]
  [ 0.          0.          0.00201352  0.          0.          0.00068501]
  [ 0.          0.00052974  0.00056825  0.          0.00165059  0.0006478 ]
  [ 0.00053742  0.          0.          0.00042893  0.00243434  0.00053949]]]
<NDArray 2x5x6 @cpu(0)>
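
How the layers feed each other can be sketched in plain Python. This is an illustrative toy rather than Gluon's implementation: `run_layer`, `run_stack` and the scalar toy cells are hypothetical names, and a "cell" here is any function mapping `(x, h)` to a new `h`.

```python
def run_layer(cell, xs, h0):
    """Run one recurrent layer; return its per-step outputs and final state."""
    hs, h = [], h0
    for x in xs:
        h = cell(x, h)
        hs.append(h)
    return hs, h


def run_stack(cells, xs, h0s):
    """Feed each layer's output sequence into the next layer."""
    final_states, seq = [], xs
    for cell, h0 in zip(cells, h0s):
        seq, h_last = run_layer(cell, seq, h0)
        final_states.append(h_last)
    # only the top layer's sequence is returned as the output
    return seq, final_states


# toy scalar cells (assumed for illustration only)
layer1 = lambda x, h: h + x          # running sum
layer2 = lambda x, h: 0.5 * h + x    # decayed running sum

outputs, final_states = run_stack([layer1, layer2], [1.0, 2.0, 3.0], [0.0, 0.0])
print(outputs)       # [1.0, 3.5, 7.75]
print(final_states)  # [6.0, 7.75]
```

As in the Gluon result above, the final state of the last layer equals the last output, while the first layer's final state never appears in the output.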

## 3. Bi-directional RNN

__Key Highlight__: Uses context from after the target time step. Useful for word disambiguation, e.g. "bank" in "I arrived at the bank after crossing the river."

__Architecture__:

![png](./imgs/bidir_rnn.png)

### Code:

In [23]:
hid_layers = 1
bidirectional = True

In [24]:
bidir_rnn = mx.gluon.rnn.RNN(hidden_size=hid_units, num_layers=hid_layers, layout='TNC', bidirectional=bidirectional)
bidir_rnn.initialize()

In [25]:
# first dimension is now hid_layers * 2: initial hidden states for both the forward and backward rnns.
hid_init = mx.nd.random.uniform(shape=(hid_layers * 2, batch_size, hid_units))
outputs, hid_states = bidir_rnn(inputs, hid_init)

In [26]:
# hid_units * 2 channels
# 6 from forward rnn, 6 from backward rnn, concatenated to give 12
outputs.shape

(4, 5, 12)

In [27]:
final_output = outputs[-1]
final_output


[[ 0.03306618  0.          0.          0.01881557  0.          0.01395486
   0.08756178  0.10872964  0.          0.          0.033585    0.        ]
 [ 0.048029    0.          0.          0.00914372  0.          0.02871193
   0.06199517  0.07162469  0.          0.          0.          0.        ]
 [ 0.          0.00359015  0.          0.02957996  0.          0.
   0.06479818  0.11220244  0.          0.04297203  0.          0.        ]
 [ 0.01444994  0.          0.          0.03033896  0.          0.
   0.09215644  0.12054929  0.          0.0029099   0.          0.        ]
 [ 0.02546354  0.          0.          0.0079287   0.          0.01829238
   0.00309482  0.07393023  0.          0.00435505  0.          0.        ]]
<NDArray 5x12 @cpu(0)>

In [28]:
# from forward rnn
final_output[:,:6]


[[ 0.03306618  0.          0.          0.01881557  0.          0.01395486]
 [ 0.048029    0.          0.          0.00914372  0.          0.02871193]
 [ 0.          0.00359015  0.          0.02957996  0.          0.        ]
 [ 0.01444994  0.          0.          0.03033896  0.          0.        ]
 [ 0.02546354  0.          0.          0.0079287   0.          0.01829238]]
<NDArray 5x6 @cpu(0)>

In [29]:
# a plain rnn passes a single state (the hidden state) between cells
len(hid_states)

1

In [30]:
# forward rnn hidden state, then backward rnn hidden state
# BUT from different time steps! forward rnn hidden state is from the last time step, backward rnn hidden state is from the first time step.
# useful when feeding a decoder; otherwise the backward rnn would only have seen 1 input at step n.
hid_states[0]


[[[ 0.03306618  0.          0.          0.01881557  0.          0.01395486]
  [ 0.048029    0.          0.          0.00914372  0.          0.02871193]
  [ 0.          0.00359015  0.          0.02957996  0.          0.        ]
  [ 0.01444994  0.          0.          0.03033896  0.          0.        ]
  [ 0.02546354  0.          0.          0.0079287   0.          0.01829238]]

 [[ 0.00990705  0.          0.          0.00356794  0.00038061  0.        ]
  [ 0.017914    0.          0.          0.01097212  0.          0.        ]
  [ 0.03335207  0.0034983   0.          0.          0.02690046  0.        ]
  [ 0.03339365  0.00497039  0.          0.00650766  0.0195522   0.        ]
  [ 0.          0.          0.          0.00215904  0.          0.        ]]]
<NDArray 2x5x6 @cpu(0)>

In [31]:
# same as first 6 channels of output at the last time step
hid_states[0][0]


[[ 0.03306618  0.          0.          0.01881557  0.          0.01395486]
 [ 0.048029    0.          0.          0.00914372  0.          0.02871193]
 [ 0.          0.00359015  0.          0.02957996  0.          0.        ]
 [ 0.01444994  0.          0.          0.03033896  0.          0.        ]
 [ 0.02546354  0.          0.          0.0079287   0.          0.01829238]]
<NDArray 5x6 @cpu(0)>

In [32]:
first_output = outputs[0]
first_output


[[ 0.04293729  0.00472309  0.          0.093175    0.03674188  0.
   0.00990705  0.          0.          0.00356794  0.00038061  0.        ]
 [ 0.01534651  0.          0.          0.08228499  0.          0.          0.017914
   0.          0.          0.01097212  0.          0.        ]
 [ 0.04324071  0.          0.          0.06263764  0.          0.
   0.03335207  0.0034983   0.          0.          0.02690046  0.        ]
 [ 0.09439567  0.          0.          0.11246054  0.03045148  0.
   0.03339365  0.00497039  0.          0.00650766  0.0195522   0.        ]
 [ 0.05874584  0.0130488   0.          0.08687868  0.          0.02914192
   0.          0.          0.          0.00215904  0.          0.        ]]
<NDArray 5x12 @cpu(0)>

In [33]:
# from backward rnn
first_output[:,6:]


[[ 0.00990705  0.          0.          0.00356794  0.00038061  0.        ]
 [ 0.017914    0.          0.          0.01097212  0.          0.        ]
 [ 0.03335207  0.0034983   0.          0.          0.02690046  0.        ]
 [ 0.03339365  0.00497039  0.          0.00650766  0.0195522   0.        ]
 [ 0.          0.          0.          0.00215904  0.          0.        ]]
<NDArray 5x6 @cpu(0)>

In [34]:
# same as last 6 channels of output at the first time step
hid_states[0][1]


[[ 0.00990705  0.          0.          0.00356794  0.00038061  0.        ]
 [ 0.017914    0.          0.          0.01097212  0.          0.        ]
 [ 0.03335207  0.0034983   0.          0.          0.02690046  0.        ]
 [ 0.03339365  0.00497039  0.          0.00650766  0.0195522   0.        ]
 [ 0.          0.          0.          0.00215904  0.          0.        ]]
<NDArray 5x6 @cpu(0)>
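
The time-step alignment shown above can be reproduced with a small pure-Python sketch. It is a toy under stated assumptions: `run`, `bidirectional` and the two scalar-decay cells are hypothetical, each hidden state is a one-element list, and list concatenation stands in for concatenation along the channel axis.

```python
def run(cell, xs, h0):
    """Run a recurrent cell over xs; return per-step states and the final state."""
    hs, h = [], h0
    for x in xs:
        h = cell(x, h)
        hs.append(h)
    return hs, h


def bidirectional(cell_f, cell_b, xs, h0_f, h0_b):
    fwd, h_f = run(cell_f, xs, h0_f)
    bwd_reversed, h_b = run(cell_b, xs[::-1], h0_b)  # backward rnn reads the sequence in reverse
    bwd = bwd_reversed[::-1]                         # re-align to the original time order
    outputs = [f + b for f, b in zip(fwd, bwd)]      # concatenate channels per time step
    return outputs, (h_f, h_b)


# toy cells with a single hidden unit each (assumed for illustration only)
cell_f = lambda x, h: [0.5 * h[0] + x]
cell_b = lambda x, h: [0.25 * h[0] + x]

outputs, (h_f, h_b) = bidirectional(cell_f, cell_b, [1.0, 2.0, 3.0, 4.0], [0.0], [0.0])
print(outputs)  # [[1.0, 1.75], [2.5, 3.0], [4.25, 4.0], [6.125, 4.0]]
print(h_f)      # [6.125] -> forward channels of the LAST time step
print(h_b)      # [1.75]  -> backward channels of the FIRST time step
```

Note that the backward channel at the last time step (`4.0`) is just the last input: at that point the backward pass has seen only one element, which is why the returned backward state is taken from the first time step instead.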

## 4. Stacked Bi-directional RNN

__Key Highlight__: Combines benefits of both bi-directional and stacked architectures.

__Architecture__:

![png](./imgs/stacked_bidir_rnn.png)

### Code:

In [35]:
hid_layers = 2
bidirectional = True

In [36]:
stack_bidir_rnn = mx.gluon.rnn.RNN(hidden_size=hid_units, num_layers=hid_layers, layout='TNC', bidirectional=bidirectional)
stack_bidir_rnn.initialize()

In [37]:
hid_init = mx.nd.random.uniform(shape=(hid_layers*2, batch_size, hid_units))
outputs, hid_states = stack_bidir_rnn(inputs, hid_init)

In [38]:
outputs.shape

(4, 5, 12)

In [39]:
# combined across channels
final_output = outputs[-1]
final_output


[[ 0.          0.00340414  0.          0.          0.          0.          0.
   0.0678948   0.          0.          0.          0.        ]
 [ 0.          0.00184004  0.00155006  0.          0.          0.00046548
   0.          0.06100728  0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.0005218
   0.00525233  0.04932918  0.          0.02696682  0.00857441  0.        ]
 [ 0.          0.00673693  0.00124999  0.          0.          0.          0.
   0.05505137  0.01279329  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.0008792   0.00357696
   0.          0.06191452  0.          0.          0.          0.        ]]
<NDArray 5x12 @cpu(0)>

In [40]:
len(hid_states)

1

In [41]:
hid_states[0].shape

(4, 5, 6)

In [42]:
# ordered by layer, with forward then backward within each layer, i.e.
# [ L1_forward,
#   L1_backward,
#   L2_forward,
#   L2_backward ]
hid_states[0]


[[[ 0.          0.04177246  0.00640563  0.          0.00357663  0.00187468]
  [ 0.          0.03488374  0.03214718  0.          0.          0.01249963]
  [ 0.          0.0441696   0.          0.00472266  0.          0.        ]
  [ 0.          0.04262898  0.          0.          0.03667838  0.        ]
  [ 0.          0.02875579  0.01502049  0.          0.          0.        ]]

 [[ 0.          0.          0.          0.05959253  0.02493385  0.        ]
  [ 0.          0.          0.          0.06292737  0.0198511   0.        ]
  [ 0.          0.          0.01365896  0.06428619  0.01114913  0.        ]
  [ 0.          0.          0.00332693  0.0522363   0.00303377  0.        ]
  [ 0.          0.          0.          0.04257718  0.03644982  0.        ]]

 [[ 0.          0.00340414  0.          0.          0.          0.        ]
  [ 0.          0.00184004  0.00155006  0.          0.          0.00046548]
  [ 0.          0.          0.          0.          0.          0.0005218 ]
  [ 0. 

In [43]:
# output from final step of final layer forward rnn
final_output[:,:6]


[[ 0.          0.00340414  0.          0.          0.          0.        ]
 [ 0.          0.00184004  0.00155006  0.          0.          0.00046548]
 [ 0.          0.          0.          0.          0.          0.0005218 ]
 [ 0.          0.00673693  0.00124999  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.0008792   0.00357696]]
<NDArray 5x6 @cpu(0)>

In [44]:
# take last layer of stack (last 2), then first to get forward rnn
hid_states[0][-2:][0]


[[ 0.          0.00340414  0.          0.          0.          0.        ]
 [ 0.          0.00184004  0.00155006  0.          0.          0.00046548]
 [ 0.          0.          0.          0.          0.          0.0005218 ]
 [ 0.          0.00673693  0.00124999  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.0008792   0.00357696]]
<NDArray 5x6 @cpu(0)>

In [45]:
# combined across channels
first_output = outputs[0]
# just take last 6 channels (i.e. from backward rnn)
first_output[:,6:]


[[ 0.00178251  0.00076139  0.          0.0042411   0.00419507  0.        ]
 [ 0.00173295  0.00198956  0.          0.00422992  0.00403664  0.        ]
 [ 0.00141883  0.00624871  0.          0.00565925  0.00630214  0.        ]
 [ 0.00146779  0.0065      0.          0.00685085  0.00635008  0.        ]
 [ 0.00196513  0.00287359  0.          0.00624004  0.00747287  0.        ]]
<NDArray 5x6 @cpu(0)>

In [46]:
# take last layer of stack (last 2), then last to get backward rnn
hid_states[0][-2:][1]


[[ 0.00178251  0.00076139  0.          0.0042411   0.00419507  0.        ]
 [ 0.00173295  0.00198956  0.          0.00422992  0.00403664  0.        ]
 [ 0.00141883  0.00624871  0.          0.00565925  0.00630214  0.        ]
 [ 0.00146779  0.0065      0.          0.00685085  0.00635008  0.        ]
 [ 0.00196513  0.00287359  0.          0.00624004  0.00747287  0.        ]]
<NDArray 5x6 @cpu(0)>
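
The indexing used in the last few cells can be wrapped in small helpers. This is a sketch that assumes the layer-major ordering described in the comment above (forward then backward within each layer); `state_index` and `split_states` are hypothetical names, and string labels stand in for the actual state arrays.

```python
def state_index(layer, direction, num_directions=2):
    """Row of the state array for a 0-based layer and direction (0=forward, 1=backward)."""
    return layer * num_directions + direction


def split_states(states, num_layers, num_directions=2):
    """Regroup a flat sequence of states into [layer][direction]."""
    return [
        states[layer * num_directions:(layer + 1) * num_directions]
        for layer in range(num_layers)
    ]


labels = ["L1_forward", "L1_backward", "L2_forward", "L2_backward"]
print(labels[state_index(1, 0)])    # L2_forward
print(split_states(labels, 2)[-1])  # ['L2_forward', 'L2_backward']
```

With this ordering, `hid_states[0][-2:][0]` and `hid_states[0][-2:][1]` above correspond to `state_index(1, 0)` and `state_index(1, 1)`.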

## 5. LSTM

__Key Highlight__: Can model longer-term dependencies than a plain RNN by improving gradient flow through the network.

__Architecture__:

![png](./imgs/lstm.png)

__Cell structure__:

![png](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)
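
A single step of the cell in the diagram can be written out in plain Python. This is a minimal sketch of the standard LSTM equations, not Gluon's implementation: `affine`, `lstm_step` and the `params` layout (one `(W_x, W_h, b)` triple per gate) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, b, x, h):
    """Compute W_x @ x + W_h @ h + b with plain lists."""
    return [
        sum(w * v for w, v in zip(row_x, x)) + sum(w * v for w, v in zip(row_h, h)) + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; returns the new hidden state h and cell memory c."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate memory
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# all-zero weights: every gate is sigmoid(0) = 0.5 and the candidate is tanh(0) = 0
zero = ([[0.0, 0.0]], [[0.0]], [0.0])  # 2 input channels, 1 hidden unit
params = {gate: zero for gate in "ifgo"}
h, c = lstm_step([1.0, -1.0], [0.3], [2.0], params)
print(c)  # [1.0]: half of the previous cell memory is kept
```

The cell memory `c` is updated additively (`f * c_prev + i * g`), which is the path that lets gradients flow over many time steps; the hidden state `h` is a gated view of that memory.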

### Code:

In [47]:
hid_layers = 1

In [48]:
lstm = mx.gluon.rnn.LSTM(hidden_size=hid_units, num_layers=hid_layers, layout='TNC')
lstm.initialize()

In [49]:
hid_init_h = mx.nd.random.uniform(shape=(hid_layers, batch_size, hid_units))
hid_init_c = mx.nd.random.uniform(shape=(hid_layers, batch_size, hid_units))
hid_init = [hid_init_h, hid_init_c]
outputs, hid_states = lstm(inputs, hid_init)

In [50]:
# output shape same as before
outputs.shape

(4, 5, 6)

In [51]:
final_output = outputs[-1]
final_output


[[ 0.02818945 -0.00174565  0.0353965   0.02620278  0.05123272 -0.01795965]
 [ 0.06393548 -0.00062626  0.0545639   0.02543391  0.05355247  0.00545943]
 [ 0.02012122 -0.01744675  0.04486242  0.01139896  0.03076872 -0.00235225]
 [ 0.01636864 -0.01138998  0.06073249  0.02126804  0.04514049 -0.02028175]
 [ 0.03729322 -0.00312254  0.01692082  0.03198263  0.04617801 -0.01311093]]
<NDArray 5x6 @cpu(0)>

In [52]:
# now have two states: hidden state and cell memory
len(hid_states)

2

In [53]:
# hidden state (bottom line in diagram)
hid_states[0].shape

(1, 5, 6)

In [54]:
# cell memory (top line in diagram)
hid_states[1].shape

(1, 5, 6)

In [55]:
# same as final_output in the uni-directional, non-stacked case
hid_states[0]


[[[ 0.02818945 -0.00174565  0.0353965   0.02620278  0.05123272 -0.01795965]
  [ 0.06393548 -0.00062626  0.0545639   0.02543391  0.05355247  0.00545943]
  [ 0.02012122 -0.01744675  0.04486242  0.01139896  0.03076872 -0.00235225]
  [ 0.01636864 -0.01138998  0.06073249  0.02126804  0.04514049 -0.02028175]
  [ 0.03729322 -0.00312254  0.01692082  0.03198263  0.04617801 -0.01311093]]]
<NDArray 1x5x6 @cpu(0)>

## 6. Stacked Bi-directional LSTM

__Key Highlight__: Combines benefits of all of the above.

__Architecture__:

![png](./imgs/stacked_bidir_lstm.png)

### Code:

In [56]:
hid_layers = 2
bidirectional = True

In [57]:
stack_bidir_lstm = mx.gluon.rnn.LSTM(hidden_size=hid_units, num_layers=hid_layers, layout='TNC', bidirectional=bidirectional)
stack_bidir_lstm.initialize()

In [58]:
# 2 * hid_layers (since bi-directional)
hid_init_h = mx.nd.random.uniform(shape=(2*hid_layers, batch_size, hid_units))
hid_init_c = mx.nd.random.uniform(shape=(2*hid_layers, batch_size, hid_units))
hid_init = [hid_init_h, hid_init_c]
outputs, hid_states = stack_bidir_lstm(inputs, hid_init)

In [59]:
# 2 * hid_units = 12 channels since bi-directional
outputs.shape

(4, 5, 12)

In [60]:
final_output = outputs[-1]
final_output


[[ 0.01808946  0.01239023  0.0103579   0.01382891  0.01011395  0.01383706
   0.05322991  0.20940049  0.13968816  0.20491761  0.0427212   0.02207958]
 [ 0.00300297  0.03423487  0.00675577  0.00692867  0.00228254 -0.00037972
   0.22739001  0.12007983  0.20540595  0.12325918  0.0146258   0.01974423]
 [ 0.03124169  0.01193716  0.00384586  0.0038484   0.02315953  0.02376991
   0.02031938  0.11076675  0.04595628  0.11646104  0.08167178  0.0644596 ]
 [ 0.02482849  0.00937077  0.02301528  0.0109503   0.01892785  0.00229548
   0.19356498 -0.00137174  0.05321001  0.14383923  0.03294018  0.13848269]
 [ 0.00288578  0.02821913  0.01003075  0.00847331  0.00936431  0.00660857
   0.20524111  0.02009414  0.0965727   0.04825414  0.10043189  0.1384041 ]]
<NDArray 5x12 @cpu(0)>

In [61]:
# channels from forward rnn in last step of last layer
final_output[:,:6]


[[ 0.01808946  0.01239023  0.0103579   0.01382891  0.01011395  0.01383706]
 [ 0.00300297  0.03423487  0.00675577  0.00692867  0.00228254 -0.00037972]
 [ 0.03124169  0.01193716  0.00384586  0.0038484   0.02315953  0.02376991]
 [ 0.02482849  0.00937077  0.02301528  0.0109503   0.01892785  0.00229548]
 [ 0.00288578  0.02821913  0.01003075  0.00847331  0.00936431  0.00660857]]
<NDArray 5x6 @cpu(0)>

In [62]:
# channels from backward rnn in last step of last layer
final_output[:,6:]


[[ 0.05322991  0.20940049  0.13968816  0.20491761  0.0427212   0.02207958]
 [ 0.22739001  0.12007983  0.20540595  0.12325918  0.0146258   0.01974423]
 [ 0.02031938  0.11076675  0.04595628  0.11646104  0.08167178  0.0644596 ]
 [ 0.19356498 -0.00137174  0.05321001  0.14383923  0.03294018  0.13848269]
 [ 0.20524111  0.02009414  0.0965727   0.04825414  0.10043189  0.1384041 ]]
<NDArray 5x6 @cpu(0)>

In [63]:
len(hid_states)

2

In [64]:
# hidden state
hid_states[0].shape

(4, 5, 6)

In [65]:
# cell memory
hid_states[1].shape

(4, 5, 6)

In [66]:
hid_states[0]


[[[-0.00573625  0.01635285 -0.02341366  0.01629354 -0.007346    0.00576642]
  [-0.0258479   0.0279543  -0.01818688  0.01051181  0.00762592  0.00853934]
  [-0.02718913  0.05018575  0.00045197  0.00620968  0.01767808  0.0226628 ]
  [-0.03281137  0.03153844 -0.01883352  0.00658052  0.0035225   0.01312432]
  [-0.02891359  0.03498261  0.00373905  0.02421794  0.01074352  0.02088923]]

 [[ 0.04581511  0.01842473 -0.00113533  0.00648213  0.03211564  0.04415091]
  [ 0.05429369  0.02530528 -0.00710067  0.00181396  0.0426653   0.0440751 ]
  [ 0.03427977  0.01568622  0.01466715 -0.00038432  0.03445853  0.03436213]
  [ 0.04928001  0.00462648  0.01272476  0.01015491  0.05328795  0.0181508 ]
  [ 0.06135751  0.02007606 -0.00972339  0.00779354  0.0557798   0.04123558]]

 [[ 0.01808946  0.01239023  0.0103579   0.01382891  0.01011395  0.01383706]
  [ 0.00300297  0.03423487  0.00675577  0.00692867  0.00228254 -0.00037972]
  [ 0.03124169  0.01193716  0.00384586  0.0038484   0.02315953  0.02376991]
  [ 0.0

In [67]:
# take the last two entries (forward and backward of the final layer) since bi-directional
hid_last = hid_states[0][-2:,:]

In [68]:
# first of row pair, to get forward
hid_last_forward = hid_last[0]

In [69]:
# same as first 6 channels of last step output
hid_last_forward


[[ 0.01808946  0.01239023  0.0103579   0.01382891  0.01011395  0.01383706]
 [ 0.00300297  0.03423487  0.00675577  0.00692867  0.00228254 -0.00037972]
 [ 0.03124169  0.01193716  0.00384586  0.0038484   0.02315953  0.02376991]
 [ 0.02482849  0.00937077  0.02301528  0.0109503   0.01892785  0.00229548]
 [ 0.00288578  0.02821913  0.01003075  0.00847331  0.00936431  0.00660857]]
<NDArray 5x6 @cpu(0)>

In [70]:
first_output = outputs[0]

In [71]:
# last 6 channels of first step output
first_output[:,6:]


[[ 0.00597848  0.02032363  0.0162298   0.02846791 -0.00140417  0.00011918]
 [ 0.03173025  0.01013287  0.0296511   0.01767016 -0.00850117  0.00216344]
 [-0.00092769  0.00986358  0.01041535  0.0182292  -0.00065306  0.00980295]
 [ 0.02115385 -0.00392883  0.01181151  0.01550466 -0.00492908  0.01831324]
 [ 0.02123432 -0.00080698  0.01856145  0.00723932 -0.00111104  0.02135266]]
<NDArray 5x6 @cpu(0)>

In [72]:
# second of row pair, to get backward
hid_last_backward = hid_last[1]

In [73]:
hid_last_backward


[[ 0.00597848  0.02032363  0.0162298   0.02846791 -0.00140417  0.00011918]
 [ 0.03173025  0.01013287  0.0296511   0.01767016 -0.00850117  0.00216344]
 [-0.00092769  0.00986358  0.01041535  0.0182292  -0.00065306  0.00980295]
 [ 0.02115385 -0.00392883  0.01181151  0.01550466 -0.00492908  0.01831324]
 [ 0.02123432 -0.00080698  0.01856145  0.00723932 -0.00111104  0.02135266]]
<NDArray 5x6 @cpu(0)>
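
The shapes seen across all six variants follow one pattern, which can be summarised in a small helper. This is a sketch based only on the shapes printed in this walkthrough with `layout='TNC'`; `expected_shapes` is a hypothetical function, not part of Gluon.

```python
def expected_shapes(seq_len, batch_size, hid_units, hid_layers=1,
                    bidirectional=False, lstm=False):
    """Return (output_shape, list_of_state_shapes) for layout='TNC'."""
    directions = 2 if bidirectional else 1
    # forward and backward channels are concatenated in the output
    output_shape = (seq_len, batch_size, hid_units * directions)
    # one state per layer and direction, for the last processed step only
    state_shape = (hid_layers * directions, batch_size, hid_units)
    num_states = 2 if lstm else 1  # LSTM: hidden state plus cell memory
    return output_shape, [state_shape] * num_states


print(expected_shapes(4, 5, 6))
# ((4, 5, 6), [(1, 5, 6)])
print(expected_shapes(4, 5, 6, hid_layers=2, bidirectional=True, lstm=True))
# ((4, 5, 12), [(4, 5, 6), (4, 5, 6)])
```

Comparing the helper's result with `outputs.shape` and the shapes in `hid_states` is a quick sanity check when changing `hid_layers` or `bidirectional`.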