# News Topic Identification using RNN and LSTM
## Author - Catalina Ifrim

In [None]:
"""
UW DATASCI420-Machine Learning Techniques
L10-LSTM_Text_Analysis

"""

### Instructions

For this assignment, you will leverage the RNN_KERAS.ipynb lab in the lesson. You are tasked to use the Keras Reuters newswire
topics classification dataset to **build a model that classifies the topic of each article or newswire**. 
Using the Keras dataset, create a new notebook and perform each of the following data preparation tasks and answer the related
questions:

1. Read Reuters dataset into training and testing 
2. Prepare dataset
3. Build and compile 3 different models using Keras **LSTM (Long Short-Term Memory)**, ideally improving the model at each iteration.
4. Describe and explain your findings.

### Dataset description

The Keras Reuters newswire topics dataset contains 11,228 newswires from Reuters, each labeled with one of 46 topics. Each wire is
encoded as a sequence of word indexes. For convenience, words are ranked by overall frequency in the dataset, so that, for
instance, rank 3 is the 3rd most frequent word in the data. This allows for quick filtering operations such
as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words". With the default `load_data`
settings used in this notebook, the stored indexes are shifted by 3 to make room for reserved values: 0 is used for padding,
1 marks the start of a sequence, and 2 encodes any unknown (out-of-vocabulary) word.
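The filtering operation described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: the function name `filter_by_frequency` and the toy sequence are hypothetical, and the parameter names simply mirror the `num_words`, `skip_top` and `oov_char` arguments of `load_data`.

```python
def filter_by_frequency(sequence, num_words=None, skip_top=0, oov_char=2):
    """Replace every index outside [skip_top, num_words) with oov_char."""
    upper = num_words if num_words is not None else float("inf")
    return [w if skip_top <= w < upper else oov_char for w in sequence]


# Toy sequence of word indexes (hypothetical values, not taken from the dataset)
toy_sequence = [1, 5, 19, 25, 447, 3095, 12000]

# Keep only the top 10,000 words and drop the 20 most common ones
filtered = filter_by_frequency(toy_sequence, num_words=10000, skip_top=20)
print(filtered)
```

Indexes that are too frequent (below 20) or too rare (10,000 and above) are all mapped to the same unknown-word index, so the sequence keeps its length.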

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import utils

%matplotlib inline

### 1. Read Reuters dataset into training and testing

The 'reuters' dataset is loaded from Keras and split into training and testing sets, and the shape of each set is printed.

In [3]:
data = tf.keras.datasets.reuters

# Load the dataset and split it into training and testing
num_of_words=10000
(X_train, y_train), (X_test, y_test) = data.load_data(num_words=num_of_words)

In [4]:
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

X_train shape (8982,)
y_train shape (8982,)
X_test shape (2246,)
y_test shape (2246,)


Below we check the data by printing the first record.

In [5]:
print(X_train[0])

[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]


The input consists of integers instead of words. As noted in the dataset description, each wire is
encoded as a sequence of word indexes, where the words are indexed by overall frequency in the dataset.

Below, a dictionary that maps words to integer indexes is created, and a function that decodes a sequence back into text is defined.

In [6]:
# A dictionary mapping words to an integer index
word_index = tf.keras.datasets.reuters.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2           # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Define function to decode data
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
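The effect of shifting every index by 3 is easier to see on a tiny vocabulary. The sketch below repeats the same steps on a hypothetical three-word index; the words, their ranks and the helper name `decode` are made up for illustration only.

```python
# Hypothetical three-word vocabulary ranked by frequency (1 = most frequent)
toy_word_index = {"the": 1, "of": 2, "said": 3}

# Shift every rank by 3 so that indexes 0-3 are free for the reserved tokens
shifted = {word: rank + 3 for word, rank in toy_word_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping: integer index -> token
reverse = {index: word for word, index in shifted.items()}


def decode(sequence):
    """Turn a list of integer indexes back into a space-separated string."""
    return " ".join(reverse.get(i, "?") for i in sequence)


decoded = decode([1, 6, 4, 2, 5, 0])
print(decoded)  # <START> said the <UNK> of <PAD>
```

Any index that is missing from the reversed dictionary is rendered as `?`, which is the same fallback used by `decode_review`.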

Using the decode function above, the first newswire is decoded and its actual text is printed.

In [7]:
# Decode the training data using the function above
decode_review(X_train[0])

'<START> <UNK> <UNK> said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3'

### 2. Prepare dataset

The input sequences need to be modified so that they all have the same length for modeling. For this, the preprocessing
utilities within Keras are used. `max_review_length` defines the maximum number of words kept per newswire: shorter
sequences are padded with zeros and longer ones are truncated.

In [8]:
# Define the max number of words in a newswire and modify the input sequences
max_review_length = 400 
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length)
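With its default arguments, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`). The following plain-Python sketch imitates that behaviour for a single sequence. The helper name `pad_pre` and the toy sequences are hypothetical, and the sketch assumes a positive `maxlen`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


short_result = pad_pre([1, 8, 43], maxlen=5)
long_result = pad_pre([1, 8, 43, 10, 447, 5, 25], maxlen=5)

print(short_result)  # [0, 0, 1, 8, 43]
print(long_result)   # [43, 10, 447, 5, 25]
```

One consequence of this default is that, for newswires longer than the chosen maximum, the beginning of the text is discarded and the end is kept.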

The output column is one-hot encoded using the Keras NumPy-related utilities (`utils.to_categorical`).

In [10]:
# One-hot encoding using keras' numpy-related utilities
n_classes = 46
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = utils.to_categorical(y_train, n_classes)
Y_test = utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

Shape before one-hot encoding:  (8982,)
Shape after one-hot encoding:  (8982, 46)
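For reference, one-hot encoding can be reproduced with NumPy alone. The sketch below is an illustrative equivalent rather than the Keras implementation; the function name `one_hot` and the toy labels are hypothetical.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, n_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


toy_labels = [3, 0, 45]  # hypothetical topic ids in the range 0..45
encoded = one_hot(toy_labels, n_classes=46)
recovered = encoded.argmax(axis=1)  # argmax undoes the encoding

print(encoded.shape)
print(recovered)
```

Each label becomes a row of 46 values, which matches the 46 outputs of the final dense layer used in the models below.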


### 3. Build and compile 3 different models using Keras LSTM ideally improving model at each iteration.

#### LSTM Model 1

The first model is a sequential model with an input layer, an **LSTM (Long Short-Term Memory)** layer, and a dense output
layer.
- The input layer is an Embedding layer which takes 3 arguments:
    - num_of_words = 10000 - the maximum number of words to be used (the most frequent words)
    - embedding_vector_length = 32 - uses vectors of length 32 to represent each word
    - input_length = max_review_length (400) - the maximum number of words in a newswire
- The second layer is an LSTM layer with 100 memory units.
- The dense output layer must create 46 output values, one for each class (there are 46 topics). <br>

The activation function of the output layer is 'softmax'. Since this is a multi-class classification problem,
'categorical_crossentropy' is used as the loss function. The model is trained for 3 epochs with a batch_size of 64.
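To make the roles of 'softmax' and 'categorical_crossentropy' concrete, here is a small NumPy sketch for a single sample. The three-class logits and the helper names are hypothetical; the model itself has 46 classes, and Keras computes these quantities internally.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class (one-hot y_true)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))


logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
y_true = np.array([1.0, 0.0, 0.0])  # the true class is class 0

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

print(probs)
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, which is what training tries to achieve.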

In [11]:
# Build Model 1

# num_of_words = 10000
# max_review_length = 400

embedding_vector_length = 32

model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(100))
model.add(keras.layers.Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=3, batch_size=64)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 32)           320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 46)                4646      
Total params: 377,846
Trainable params: 377,846
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1fa40226ac8>

In [13]:
# Evaluate model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 50.18%
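The accuracy reported by `model.evaluate` is the share of samples whose highest-probability class matches the true class. A minimal NumPy sketch of that computation follows, using hypothetical predictions for four samples and three classes.

```python
import numpy as np

# Hypothetical predicted probabilities (4 samples, 3 classes)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
])

# Hypothetical one-hot encoded true labels
y_true_onehot = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
])

predicted = probs.argmax(axis=1)        # most likely class per sample
actual = y_true_onehot.argmax(axis=1)   # true class per sample
accuracy = float(np.mean(predicted == actual))

print("Accuracy: %.2f%%" % (accuracy * 100))  # Accuracy: 75.00%
```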


The **first model has an accuracy score of only 50%**. In the next model we'll try to improve this score by adding
regularization parameters to the LSTM layer.

#### LSTM Model 2

For the second model, the 'dropout' and 'recurrent_dropout' arguments are added to the LSTM layer. They set the fraction of
units to drop for the linear transformation of the inputs and of the recurrent state, respectively. Both parameters are floats between 0 and 1.
For this model, a value of 0.2 was selected for both 'dropout' and 'recurrent_dropout'. The model is trained for 5 epochs instead of 3.
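The basic mechanism behind a dropout rate of 0.2 can be sketched with NumPy. This shows generic "inverted" dropout on a plain vector, as an assumption-laden illustration: it is not how the Keras LSTM layer is implemented internally, and the function name and the array of ones are hypothetical.

```python
import numpy as np


def inverted_dropout(x, rate, rng):
    """Zero out roughly `rate` of the values and rescale the rest by 1/(1-rate)."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
activations = np.ones(10000)  # hypothetical activations
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

zero_fraction = float(np.mean(dropped == 0.0))
print(zero_fraction)          # close to 0.2
print(float(dropped.mean()))  # close to 1.0, thanks to the rescaling
```

Dropout is only active during training; at evaluation time all units are kept.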

In [15]:
# Build Model 2

# num_of_words = 10000
# max_review_length = 400
# embedding_vector_length = 32

model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=5, batch_size=64)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 400, 32)           320000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_3 (Dense)              (None, 46)                4646      
Total params: 377,846
Trainable params: 377,846
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1fa4087af88>

In [16]:
# Evaluate model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 57.79%


The **second model achieved a better accuracy rate of 58%** compared with the first model.

#### LSTM Model 3

The third LSTM model was built using different values for the following parameters:
- num_of_words = 50000 - the maximum number of words to use was increased from 10000 to 50000
- embedding_vector_length = 100 - vectors of length 100, instead of 32, are used to represent each word
- max_review_length = 250 - the maximum number of words in a newswire was decreased from 400 to 250 words

All the other hyperparameters remained the same as for Model 2.
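These changes mostly enlarge the embedding layer. The parameter counts reported by `model.summary()` can be reproduced by hand with the standard formulas for Embedding, LSTM and Dense layers, as in the sketch below. The helper function names are hypothetical; the layer sizes are the ones used in this notebook.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per word in the vocabulary
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output
    return input_dim * outputs + outputs


def total_params(vocab_size, embed_dim, units=100, n_classes=46):
    return (embedding_params(vocab_size, embed_dim)
            + lstm_params(embed_dim, units)
            + dense_params(units, n_classes))


models_1_2_total = total_params(vocab_size=10000, embed_dim=32)
model_3_total = total_params(vocab_size=50000, embed_dim=100)

print(models_1_2_total)  # 377846
print(model_3_total)     # 5085046
```

In the third configuration the embedding layer alone accounts for 5,000,000 of the 5,085,046 parameters. Note that `max_review_length` does not affect the parameter count, only the number of time steps processed per newswire.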

In [8]:
data = tf.keras.datasets.reuters
# Load the dataset and split it into training and testing
num_of_words=50000
(X_train, y_train), (X_test, y_test) = data.load_data(num_words=num_of_words)

In [9]:
# one-hot encoding using keras' numpy-related utilities
n_classes = 46
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = utils.to_categorical(y_train, n_classes)
Y_test = utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

Shape before one-hot encoding:  (8982,)
Shape after one-hot encoding:  (8982, 46)


In [10]:
# Define max number of words in a wire
max_review_length = 250 
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length)

In [9]:
# Build Model 3

# num_of_words = 50000
# max_review_length = 250
embedding_vector_length = 100

model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=5, batch_size=64)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 250, 100)          5000000   
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 46)                4646      
Total params: 5,085,046
Trainable params: 5,085,046
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1fc6fba5d48>

In [10]:
# Evaluate model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 62.87%


The **third model achieved an accuracy rate of 63%**. This is the best accuracy score of all three models.

A couple more models were also trained and tested, but their accuracy was lower than that of the third model. <br>
One of them was built using a 'SpatialDropout1D' layer (which drops entire feature channels of the embedded sequence rather
than individual values); it had an accuracy rate of 57%. Another model used 128 memory units for the LSTM layer instead of
100 and different values for the dropout and recurrent_dropout parameters; it achieved an accuracy rate of 55%.
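The difference between ordinary dropout and 'SpatialDropout1D' is that the latter drops whole channels (embedding dimensions) across every time step. A NumPy sketch of that idea for a single sequence is shown below. It is an illustration under simplifying assumptions (no batch dimension, hypothetical function name and input), not the Keras implementation.

```python
import numpy as np


def spatial_dropout_1d(x, rate, rng):
    """x has shape (timesteps, channels); each channel is kept or dropped as a whole."""
    keep_prob = 1.0 - rate
    # One keep/drop decision per channel, broadcast over all time steps
    channel_mask = rng.random((1, x.shape[1])) < keep_prob
    return np.where(channel_mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
embedded = np.ones((5, 8))  # hypothetical: 5 time steps, 8 embedding dimensions
out = spatial_dropout_1d(embedded, rate=0.25, rng=rng)

# Every row is identical because the same channels are dropped at every time step
rows_identical = bool(np.all(out == out[0]))
print(out)
print(rows_identical)
```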

### 4. Describe and explain your findings

The first LSTM model built achieved an accuracy score of 50%. 

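For the second model, adding the 'dropout' and 'recurrent_dropout' regularization parameters to the LSTM layer improved on the first model's accuracy, reaching 58%. Because the second model was also trained for 5 epochs instead of 3, part of this gain may come from the longer training rather than from dropout alone.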

**The best accuracy score of 63% was obtained for the third LSTM model**. Compared with the other two models, the **third model** had
a **higher value for the maximum number of words**, **longer vectors to represent each word**, and a **smaller maximum
number of words in a newswire**.

All three models had the same number of memory units for the LSTM layer. 

A couple of other models were also trained and tested. One of these models used a **'SpatialDropout1D' layer**. Another
model had a higher number of memory units for the LSTM layer and used various values for the dropout and recurrent_dropout
parameters. These hyperparameter modifications did not improve accuracy: the two models scored 57% and 55%, at or slightly
below the level of the second model.