<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Recurrent_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent Neural Networks

We have seen that convolutional neural networks allow us to efficiently model local structure in our data, but this is not the only sort of structure that exists in nature. What else is there?

### Sequential Structure
Consider the process of reading a document and understanding what it means. Locality is certainly still relevant: words close to each other are more likely to relate to each other. But the relationships between words are not strictly limited to any fixed window, and the interactions between words are primarily one-directional. Our understanding of the document at any point in time is informed primarily by the words that have been read up to that point, not by nearby words that have not yet been read. This is sequential structure, a type of structure enforced by the seemingly one-directional passage of time, and it occurs frequently in processes that unfold over time. Convolutions lack this one-directional bias. What sort of model could better reflect this bias, and perhaps relax the locality constraint in one direction in return?

One option is a recurrent neural network (RNN). In its simplest form, a recurrent neural network is a single dense layer that is applied repeatedly to a sequence of inputs and that takes, as an additional input, its own output from the previous step in the sequence. In theory, such a model can capture relationships between inputs separated by an arbitrary distance. Furthermore, because all information is processed in sequence, the sequential bias is strictly enforced. The operation of a recurrent neural network is illustrated below:

![Images](https://github.com/ameasure/colab_tutorials/blob/master/Images/rnn_loop.gif?raw=1)
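The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the function name `simple_rnn`, the weight names `W_x`, `W_h`, `b`, the `tanh` activation, and the all-zeros initial state are assumptions chosen for the example.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Apply one dense layer repeatedly, feeding back its previous output.

    inputs: array of shape (steps, input_dim)
    Returns the output at every step, shape (steps, units).
    """
    h = np.zeros(W_h.shape[0])      # assumed initial "previous output": all zeros
    outputs = []
    for x_t in inputs:              # strictly one step at a time, in order
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
steps, input_dim, units = 5, 3, 4
W_x = rng.normal(size=(input_dim, units))   # weights applied to the current input
W_h = rng.normal(size=(units, units))       # weights applied to the previous output
b = np.zeros(units)
inputs = rng.normal(size=(steps, input_dim))

outputs = simple_rnn(inputs, W_x, W_h, b)
print(outputs.shape)   # (5, 4)
```

Because each output depends only on the current input and the previous output, changing a later input can never change an earlier output, which is the one-directional bias discussed above.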

As with convolutional layers, it is often helpful to follow a recurrent layer with some sort of aggregation operation. Here, however, there is one additional option: we could just use the RNN's output from the very last step of the sequence. Since an RNN is theoretically capable of remembering all relevant information from the sequence, this can work. As a practical matter, however, RNNs tend to forget information from distant inputs, so it is generally useful to use the intermediate outputs as well. These can be combined through `mean` or `max` pooling (as we saw with convolutional neural networks), or through `attention` mechanisms, which use additional layers to weight each output before averaging.
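The aggregation options above can be compared side by side in NumPy. This is a sketch under stated assumptions: `outputs` is random data standing in for the per-step outputs of an RNN, and the scoring vector `v` is a hypothetical stand-in for the learned layers that a real attention mechanism would use.

```python
import numpy as np

rng = np.random.default_rng(0)
outputs = rng.normal(size=(6, 4))   # stand-in for RNN outputs: (steps, units)

last_step = outputs[-1]             # use only the final output
mean_pool = outputs.mean(axis=0)    # average each unit over all steps
max_pool = outputs.max(axis=0)      # take each unit's maximum over all steps

# Attention-style pooling: score each step, turn the scores into weights
# with a softmax, then take the weighted average of the outputs.
v = rng.normal(size=4)              # hypothetical learned scoring vector
scores = outputs @ v                # one score per step
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # weights are positive and sum to 1
attention_pool = weights @ outputs
```

All four results have one value per unit, so any of them can be passed to a following dense layer.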

### Preparing the Data

We will prepare our data as we did with convolutional neural networks, by creating a sequence of vectors corresponding to the input words in our training narratives. We accomplish this with a combination of the Keras Tokenizer, which maps each narrative to a sequence of integer indexes (one per word), and an Embedding layer, which maps each index to a vector. This is equivalent to representing each input word with a 1-hot vector and then multiplying each of these 1-hot vectors by the weight matrix of a Dense layer (with no bias or activation).
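The equivalence between an embedding lookup and a 1-hot multiplication can be checked directly. This is a small NumPy sketch; the vocabulary size, embedding size, and the matrix `E` are made-up values for illustration.

```python
import numpy as np

vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # one row per word index

indexes = np.array([2, 5, 2, 1])               # a tokenized "narrative"

looked_up = E[indexes]                         # embedding: select rows by index
one_hot = np.eye(vocab_size)[indexes]          # shape (4, 6), one 1 per row
multiplied = one_hot @ E                       # dense layer, no bias or activation

print(np.allclose(looked_up, multiplied))      # True
```

The lookup gives the same result while avoiding the large, mostly zero 1-hot matrix, which is why embedding layers are implemented as lookups.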

In [0]:
# load the msha data file to Colab
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'

--2019-03-22 17:11:50--  https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx [following]
--2019-03-22 17:11:50--  https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4183086 (4.0M) [application/octet-stream]
Saving to: ‘msha.xlsx.1’


2019-03-22 17:11:51 (21.1 MB/s) - ‘msha.xlsx.1’ saved [4183086/4183086]



In [0]:
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.text import Tokenizer

import pandas as pd

# read in and separate the training and validation data
df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].apply(lambda x: x.year)
df['ACCIDENT_YEAR'].value_counts()
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012].copy()
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

# convert the narratives to sequences of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train['NARRATIVE'])
X_train_seq = tokenizer.texts_to_sequences(df_train['NARRATIVE'])
X_valid_seq = tokenizer.texts_to_sequences(df_valid['NARRATIVE'])

# the categorical_crossentropy loss used below expects one-hot encoded labels,
# so we binarize the body part codes here
label_encoder = LabelBinarizer().fit(df_train['INJ_BODY_PART'])
y_train = label_encoder.transform(df_train['INJ_BODY_PART'])
y_valid = label_encoder.transform(df_valid['INJ_BODY_PART'])
n_codes = len(label_encoder.classes_)

Using TensorFlow backend.


training rows: 18681
validation rows: 9032


In [0]:
print(X_train_seq[0])

[244, 29, 7152, 1570, 764, 213, 970, 4, 3198, 139, 5, 1924, 424, 223, 610, 1, 764, 29, 10, 1, 1570, 9, 3, 64, 2, 490, 110, 5, 213, 1, 764, 813, 4, 164, 317, 11, 6, 15, 54]


Although an RNN can, in theory, work with sequences of arbitrary length, for computational reasons every example in a batch must have the same length. We accomplish this as we did with the convolutional neural networks, by padding (or truncating) all input sequences to the same length. By default, `pad_sequences` pads and truncates at the start of each sequence, which is why the zeros appear before the word indexes in the output below.

In [0]:
from keras.preprocessing import sequence

X_train_seq = sequence.pad_sequences(X_train_seq, maxlen=200)
X_valid_seq = sequence.pad_sequences(X_valid_seq, maxlen=200)

print(X_train_seq[0])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0  244   29 7152 1570  764  213  970
    4 3198  139    5 1924  424  223  610    1  764   29   10    1 1570
    9    3   64    2  490  110    5  213    1  764  813    4  164  317
   11 
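To make the padding behavior concrete, here is a minimal NumPy sketch of what padding and truncating at the start of each sequence amounts to. The helper `pad_or_truncate` is hypothetical and written only for illustration; it is not the Keras implementation and covers only the default "pre" behavior.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for i, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                    # nothing to copy, row stays all padding
        kept = seq[-maxlen:]            # truncate from the front if too long
        out[i, -len(kept):] = kept      # right-align so padding sits at the front
    return out

example = [[244, 29, 7152], [4, 3198, 139, 5, 1924, 424]]
padded = pad_or_truncate(example, maxlen=5)
print(padded)
```

The first sequence gains two leading zeros, and the second loses its first index, so both rows end up with exactly five entries.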

We're now ready to specify the recurrent neural network. Here we use a particular type of recurrent neural network called an LSTM, which stands for Long Short-Term Memory. An LSTM modifies the simple recurrent layer in several ways that help it retain information over longer sequences. We will not go into the details in this class.

In [0]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout
from keras.layers import Embedding, LSTM, GlobalMaxPooling1D, Concatenate
from keras.optimizers import Adam

input_text = Input(shape=(200,), dtype='int32')
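# word indexes start at 1 and index 0 is used for padding,
# so the embedding needs len(word_index) + 1 rows
embedding = Embedding(input_dim=len(tokenizer.word_index) + 1,
                      output_dim=300,
                      input_length=200)(input_text)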
lstm = LSTM(units=256, 
            dropout=0.5, 
            recurrent_dropout=0.5, 
            return_sequences=True, )(embedding)
pool = GlobalMaxPooling1D()(lstm)
dropout = Dropout(0.5)(pool)
output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)

model = Model(inputs=input_text, outputs=output)
optimizer = Adam(lr=.001)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy', 
              metrics=['accuracy'])



In [0]:
model.fit(x=X_train_seq, y=y_train,
          validation_data=(X_valid_seq, y_valid),
          batch_size=512, epochs=20)

Train on 18681 samples, validate on 9032 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f6d3f388dd8>

A popular trick that sometimes improves performance is to add a second recurrent layer that reads the sequence in the reverse direction. We can implement this in Keras by wrapping our RNN in a Bidirectional layer, which creates two LSTM layers, one running left-to-right and the other right-to-left, and concatenates their outputs.
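The idea can be sketched in NumPy before turning to Keras. To keep the example short it swaps in a simple `tanh` recurrent layer in place of the LSTM, and the helper `run_rnn` and all weights are made up for illustration. It also assumes the backward outputs are flipped back into the original step order before concatenation, so that row `t` of each half refers to the same input position.

```python
import numpy as np

def run_rnn(inputs, W_x, W_h):
    """Simple tanh recurrent layer; returns the output at every step."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
steps, input_dim, units = 5, 3, 4
inputs = rng.normal(size=(steps, input_dim))

# Each direction gets its own, separately learned weights.
fw = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)))
bw = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)))

forward = run_rnn(inputs, *fw)                 # reads left-to-right
backward = run_rnn(inputs[::-1], *bw)[::-1]    # reads right-to-left, then re-aligned
combined = np.concatenate([forward, backward], axis=-1)

print(combined.shape)   # (5, 8)
```

Concatenation doubles the output width, which is why the Keras model below uses 128 units per direction to produce the same 256 features per step as the one-directional model.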

In [0]:
from keras.layers import Bidirectional


input_text = Input(shape=(200,), dtype='int32')
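# word indexes start at 1 and index 0 is used for padding,
# so the embedding needs len(word_index) + 1 rows
embedding = Embedding(input_dim=len(tokenizer.word_index) + 1,
                      output_dim=300,
                      input_length=200)(input_text)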
lstm = Bidirectional(LSTM(units=128, 
                          dropout=0.5, 
                          recurrent_dropout=0.5, 
                          return_sequences=True),
                     merge_mode='concat')(embedding)
pool = GlobalMaxPooling1D()(lstm)
dropout = Dropout(0.5)(pool)
output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)

model = Model(inputs=input_text, outputs=output)
optimizer = Adam(lr=.001)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [0]:
model.fit(x=X_train_seq, y=y_train,
          validation_data=(X_valid_seq, y_valid),
          batch_size=512, epochs=20)

Train on 18681 samples, validate on 9032 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f6d3115ee80>

# Next Lesson
[Pretrained Language Models](https://colab.research.google.com/drive/12wYVNlqC2U_7O07m4iT0R55ug9TSO6wn)