# Recurrent Neural Networks for Natural Language Processing

### Chet Huang, 12 Jan 2020

In this notebook we explore natural language processing with recurrent neural networks.  We use the Keras Reuters newswire dataset, which contains 11,228 newswires, each labeled with one of 46 topics.  We train and test several LSTM models that predict a newswire's topic from its sequence of words.

In [1]:
# Import TensorFlow, Keras, and NumPy
import tensorflow as tf
from tensorflow import keras
import numpy as np

Keras ships with the Reuters dataset, which we load below.  We keep only the 4,000 most frequent words (any other word is replaced by the out-of-vocabulary index) and split the data into training (75%) and test (25%) sets.

In [2]:
from tensorflow.keras.datasets import reuters
top_words = 4000

(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz", num_words=top_words, skip_top=0, maxlen=None,
                                                         test_split=0.25, seed=113, start_char=1, oov_char=2, index_from=3)
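
To make the `num_words`, `start_char`, `oov_char`, and `index_from` arguments more concrete, here is a simplified pure-Python sketch of how a single newswire might be encoded. The helper `encode_newswire` and the sample ranks are hypothetical, and the logic reflects an assumed reading of the loader's behavior rather than its actual source.

```python
def encode_newswire(ranks, num_words=4000, start_char=1, oov_char=2, index_from=3):
    """Encode word frequency ranks (1 = most frequent) as the loader is assumed to."""
    # Shift every rank so that the low indexes stay free for special tokens.
    shifted = [rank + index_from for rank in ranks]
    # Prepend the start-of-sequence marker.
    encoded = [start_char] + shifted
    # Replace any index at or above the vocabulary limit with the OOV marker.
    return [i if i < num_words else oov_char for i in encoded]

sample = encode_newswire([1, 5, 5000])
print(sample)  # [1, 4, 8, 2]
```

Under these assumptions, the rare word with rank 5000 falls outside the vocabulary limit and is replaced by the index 2.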

Let's take a look at a sample newswire.

In [3]:
print(x_train[42])

[1, 599, 1815, 299, 45, 1320, 194, 198, 2041, 28, 932, 294, 517, 1213, 2, 334, 66, 199, 8, 17, 12]


Notice that the newswire is expressed as numbers, not words.  Each word is encoded as an integer index, and words are indexed by overall frequency, so smaller indexes correspond to more frequent words.  Because the lowest indexes are reserved (`index_from=3`), the index "8" corresponds to the 5th most frequent word in the data.  Let's build a function to decode these newswires into words.

In [4]:
# A dictionary mapping words to an integer index
word_index = reuters.get_word_index(path="reuters_word_index.json")

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_newswire(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
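
As a quick illustration of the index shift, the same decoding logic can be tried on a toy vocabulary. The three-word vocabulary and the names `toy_word_index` and `decode_toy` are made up for this sketch and are not part of the Reuters data.

```python
# Hypothetical three-word vocabulary: word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "of": 2, "said": 3}

# Shift ranks by 3 and register the reserved tokens, mirroring the cell above.
toy_word_index = {word: rank + 3 for word, rank in toy_word_index.items()}
toy_word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode_toy(indexes):
    # Indexes missing from the vocabulary decode to '?'.
    return " ".join(toy_reverse_index.get(i, "?") for i in indexes)

decoded = decode_toy([1, 4, 6, 2, 99])
print(decoded)  # <START> the said <UNK> ?
```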

We now decode the sample newswire using our decode function.

In [5]:
decode_newswire(x_train[42])

"<START> french operators did not request any export licences at today's european community weekly <UNK> tender trade sources said reuter 3"

Next, we need to transform the newswires so that they all have the same length for modeling. For this, we use the sequence preprocessing utilities in Keras, which truncate newswires longer than a maximum number of words and pad shorter ones with zeros.

In [6]:
# Only consider the first 200 words within a newswire
max_newswire_length = 200
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_newswire_length, truncating='post')
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_newswire_length, truncating='post')
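
To show what these arguments do to a single sequence, here is a pure-Python sketch that mimics the behavior as I understand it: `truncating='post'` keeps the first `maxlen` items, and the default pre-padding adds zeros at the front. The helper `pad_or_truncate` is hypothetical and only stands in for `pad_sequences` on one list.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Keep the first `maxlen` items, then left-pad with `value` up to `maxlen`."""
    kept = list(sequence[:maxlen])            # truncating='post' drops the tail
    padding = [value] * (maxlen - len(kept))  # default padding='pre' pads the front
    return padding + kept

short_seq = pad_or_truncate([1, 599, 12], maxlen=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq)  # [0, 0, 1, 599, 12]
print(long_seq)   # [1, 2, 3, 4, 5]
```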

We are ready to build our initial model. It is a sequential model with an embedding layer, an LSTM layer, and a dense output layer. The embedding layer maps each input word to a 32-dimensional vector. The LSTM layer has 100 units, and the output layer has 46 nodes, one per newswire category.

In [7]:
# Construct our model
embedding_vector_length = 32
model = keras.models.Sequential()
model.add(keras.layers.Embedding(top_words, embedding_vector_length, input_length=max_newswire_length))
model.add(keras.layers.LSTM(100))
model.add(keras.layers.Dense(46, activation='sigmoid'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, epochs=5, batch_size=64)

# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("\nModel accuracy with test data: %.2f%%" % (scores[1]*100))

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 32)           128000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense (Dense)                (None, 46)                4646      
Total params: 185,846
Trainable params: 185,846
Non-trainable params: 0
_________________________________________________________________
None
Train on 8421 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Model accuracy with test data: 39.72%
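
The parameter counts in the summary above can be reproduced with a little arithmetic. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are introduced here for illustration only.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four weight sets (three gates plus the candidate cell state), each with
    # input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

counts = [embedding_params(4000, 32), lstm_params(32, 100), dense_params(100, 46)]
print(counts)       # [128000, 53200, 4646]
print(sum(counts))  # 185846
```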


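Our initial model achieved ~40% accuracy on the test data.  For the next model we set the LSTM activation to tanh explicitly (this is already the Keras default) and switch the output layer from sigmoid to softmax, the usual choice for single-label multi-class classification, to see if we can get a better model.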

In [8]:
# Construct our model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(top_words, embedding_vector_length, input_length=max_newswire_length))
model.add(keras.layers.LSTM(100, activation='tanh'))
model.add(keras.layers.Dense(46, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, epochs=5, batch_size=64)

# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("\nModel accuracy with test data: %.2f%%" % (scores[1]*100))

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 32)           128000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 46)                4646      
Total params: 185,846
Trainable params: 185,846
Non-trainable params: 0
_________________________________________________________________
None
Train on 8421 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Model accuracy with test data: 56.43%


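Our second model achieved ~56% accuracy, up from ~40%.  Because tanh is already the default LSTM activation, the improvement most likely comes from the softmax output layer rather than from the explicit tanh setting.  Let's try adding more LSTM layers and see if a deeper network can help improve model performance.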
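
The two models above differ in the activation of their final Dense layer: sigmoid in the first, softmax in the second. The following pure-Python sketch, with made-up logits for three classes, illustrates the difference: softmax outputs form a probability distribution over classes, while sigmoid outputs are independent. The loss function here is a simplified per-sample version of sparse categorical cross-entropy, not the Keras implementation.

```python
import math

def sigmoid(logits):
    # Squashes each logit independently into (0, 1).
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    # Loss for an integer class label: negative log of the predicted probability.
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
print(sum(sigmoid(logits)))  # not constrained to equal 1
print(sum(softmax(logits)))  # 1.0 up to floating-point rounding
print(sparse_categorical_crossentropy(softmax(logits), label=0))
```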

In [9]:
# Construct our model
model = keras.models.Sequential()
embedding_vector_length = 32
model.add(keras.layers.Embedding(top_words, embedding_vector_length, input_length=max_newswire_length))
model.add(keras.layers.LSTM(46, return_sequences=True, activation='tanh'))
model.add(keras.layers.LSTM(92, return_sequences=True, activation='tanh'))
model.add(keras.layers.LSTM(138))
model.add(keras.layers.Dense(46, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, epochs=10, batch_size=64)

# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("\nModel accuracy with test data: %.2f%%" % (scores[1]*100))

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 200, 32)           128000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 200, 46)           14536     
_________________________________________________________________
lstm_3 (LSTM)                (None, 200, 92)           51152     
_________________________________________________________________
lstm_4 (LSTM)                (None, 138)               127512    
_________________________________________________________________
dense_2 (Dense)              (None, 46)                6394      
Total params: 327,594
Trainable params: 327,594
Non-trainable params: 0
_________________________________________________________________
None
Train on 8421 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10

In this notebook we learned how to load the Reuters newswires from the Keras datasets module, pad and truncate them to a constant length, and build several LSTM models to predict the newswire category.  We found that ~65% prediction accuracy can be achieved after 10 training epochs.  The trend in model performance across epochs suggests that more training could improve accuracy further, though 10 epochs is sufficient to demonstrate what LSTMs can do for natural language processing.