# Lesson 10 Assignment: RNN
## Author: Dustin Burnham
### Due: September 15th, 2019

Your next generation search engine startup was successful in having the ability to search for images based on their content. As a result, the startup received its second round of funding to be able to search news articles based on their topic. As the lead data scientist, you are tasked to build a model that classifies the topic of each article or newswire. 

For this assignment, you will leverage the RNN_KERAS.ipynb lab in the lesson. You are tasked to use the Keras Reuters newswire topics classification dataset. This dataset contains 11,228 newswires from Reuters, labeled with 46 topics. Each wire is encoded as a sequence of word indexes. For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words". As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.

1. Read Reuters dataset into training and testing 
2. Prepare dataset
3. Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration.
4. Describe and explain your findings.

In [1]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from imblearn.over_sampling import SMOTE 
from tensorflow.keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#!pip install imblearn

### 1. Read Reuters dataset into training and testing

In [3]:
data = tf.keras.datasets.reuters

In [4]:
# Data had trouble being loaded, borrowed this off of stack exchange.  Error coming from allow_pickle argument.

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(x_train, y_train), (x_test, y_test) = data.load_data(num_words = 10000)

# restore np.load for future normal usage
np.load = np_load_old

In [5]:
# Get the word-to-index mapping
word_index = data.get_word_index(path="reuters_word_index.json")

In [6]:
num_of_words=10000

In [34]:
np.shape(x_train[0])

(200,)

### 2. Prepare Dataset

In [7]:
print("Number of Newswires in Train: ", np.shape(x_train)[0])

Number of Newswires in Train:  8982


In [8]:
print("Number of Newswires in Test: ", np.shape(x_test)[0])

Number of Newswires in Test:  2246


In [9]:
print('Class Count: {}'.format(Counter(y_train)))

Class Count: Counter({3: 3159, 4: 1949, 19: 549, 16: 444, 1: 432, 11: 390, 20: 269, 13: 172, 8: 139, 10: 124, 9: 101, 21: 100, 25: 92, 2: 74, 18: 66, 24: 62, 0: 55, 34: 50, 12: 49, 36: 49, 28: 48, 6: 48, 30: 45, 23: 41, 31: 39, 17: 39, 40: 36, 32: 32, 41: 30, 14: 26, 26: 24, 39: 24, 43: 21, 15: 20, 38: 19, 37: 19, 29: 19, 45: 18, 5: 17, 7: 16, 27: 15, 22: 15, 42: 13, 44: 12, 33: 11, 35: 10})


In [10]:
print('Number of Unique Labels: ', len(set(y_train)))

Number of Unique Labels:  46
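
The class counts printed above are heavily skewed: topic 3 alone accounts for 3,159 of the 8,982 training newswires. As a point of comparison for any later accuracy figure, here is a minimal sketch of the accuracy obtained by always predicting the most frequent topic. It uses only counts copied from the output above; the names `top_counts` and `majority_share` are illustrative and not part of the notebook.

```python
# Counts copied from the Counter output above (only the five largest classes are listed).
top_counts = {3: 3159, 4: 1949, 19: 549, 16: 444, 1: 432}
n_train = 8982  # number of training newswires reported earlier

# The most frequent topic and the share of the training set it covers.
majority_class = max(top_counts, key=top_counts.get)
majority_share = top_counts[majority_class] / n_train

print("Most frequent topic:", majority_class)
print("Majority-class baseline accuracy: %.2f%%" % (majority_share * 100))
```

A classifier that learns anything useful about the text should beat this baseline when accuracy is measured per newswire.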


In [11]:
len(word_index)

30979

In [12]:
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_news(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
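
To make the index shift concrete, here is a small sketch of the same idea on a toy vocabulary. The words and ranks are made up for illustration, and `toy_decode` is a hypothetical stand-in for `decode_news`.

```python
# Toy word index (hypothetical words and ranks, not taken from the Reuters data).
toy_word_index = {"the": 1, "of": 2, "oil": 3}

# Shift every rank by 3 so that indices 0-3 are free for the reserved tokens.
toy_word_index = {word: rank + 3 for word, rank in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

# Invert the mapping so integers can be turned back into words.
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def toy_decode(sequence):
    """Map each integer back to its word, using '?' for unmapped indices."""
    return " ".join(toy_reverse_index.get(i, "?") for i in sequence)

decoded = toy_decode([1, 6, 5, 4, 2, 99])
print(decoded)  # <START> oil of the <UNK> ?
```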

In [13]:
def filter_remove_common(exclude_num, data):
    """
    Takes an array of index sequences and, in place, drops every index
    less than or equal to exclude_num (the reserved and most frequent indices).
    """
    for i in range(len(data)):
        data[i] = list(filter(lambda x: x > exclude_num, data[i]))
    return(data)

In [14]:
def filter_most_common(include_num, data):
    """
    Takes an array of index sequences and, in place, keeps only the
    indices below include_num (the most common words).
    """
    for i in range(len(data)):
        data[i] = list(filter(lambda x: x < include_num, data[i]))
    return(data)

In [15]:
# Filter Data
x_train = filter_remove_common(20, x_train)
x_train = filter_most_common(10000, x_train)

x_test = filter_remove_common(20, x_test)
x_test = filter_most_common(10000, x_test)
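
Taken together, the two filters keep only indices strictly greater than 20 and strictly less than 10000. The sketch below shows that combined effect on toy sequences. The helper `keep_index_range` is hypothetical and, unlike the functions above, returns new lists instead of modifying its input.

```python
def keep_index_range(sequences, low, high):
    """Return new sequences keeping only indices i with low < i < high."""
    return [[i for i in seq if low < i < high] for seq in sequences]

# Toy sequences (made-up indices, not taken from the dataset).
toy_sequences = [[1, 5, 20, 21, 9999, 10000], [2, 300, 4]]
filtered = keep_index_range(toy_sequences, low=20, high=10000)
print(filtered)  # [[21, 9999], [300]]
```

Because the comparison is made on the raw index, everything at or below 20 is dropped, which covers the reserved indices as well as the most frequent words.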

In [16]:
# Check out a sample newswire
decode_news(x_train[0])

'as result its december acquisition space co expects earnings per share 1987 15 30 per share up from 70 cts 1986 company pretax net should rise nine 10 from six 1986 rental operation revenues 19 22 from 12 5 cash flow per share this year should be 2 50 three'

In [17]:
# Set max newswire length
max_review_length = 200
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_length)
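
With its default settings, `pad_sequences` pads short sequences with zeros at the front and, for sequences longer than `maxlen`, keeps only the last `maxlen` items. The plain-Python sketch below mimics that default behavior on toy data. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop items from the front if the sequence is too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy = [[7, 8], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_pre(toy, maxlen=4)
print(toy_padded)  # [[0, 0, 7, 8], [3, 4, 5, 6]]
```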

In [18]:
# Set targets to be one-hot-encoded
y_train_copy = np.copy(y_train)
y_test_copy = np.copy(y_test)
y_train = keras.utils.to_categorical(y_train, 46)
y_test = keras.utils.to_categorical(y_test, 46)

### 3. Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration.

#### Model 1:

In [19]:
# Construct our model
# Similar to model from lab
embedding_vecor_length = 32
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
model.add(keras.layers.LSTM(32, return_sequences = True))
model.add(keras.layers.LSTM(32))
model.add(keras.layers.Dense(46, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 32)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 200, 32)           8320      
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 46)                1518      
Total params: 338,158
Trainable params: 338,158
Non-trainable params: 0
_________________________________________________________________
None
Train on 8982 samples, validate on 2246 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1a329590f0>

In [20]:
# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 97.83%


In [21]:
scores = model.evaluate(x_train, y_train, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 97.83%


This model performs very well and is neither overfit nor underfit: train and test accuracy are both 97.83%.  The model accuracy caught me off guard, but I investigated, experimented with the parameters, and found the same results.
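
One possible explanation for the identical 97.83% on both sets is worth checking. It assumes Keras resolves the `'accuracy'` metric to element-wise binary accuracy when the loss is `binary_crossentropy`. Under that assumption, each of the 46 output entries is scored separately, and with one-hot targets 45 of the 46 entries per newswire are zero. The sketch below, in plain Python with made-up labels, compares that element-wise score with a per-newswire (top-1) score for a "model" that outputs a small value for every class. The helpers `one_hot`, `binary_accuracy`, and `top1_accuracy` are illustrative and not Keras functions.

```python
NUM_CLASSES = 46

def one_hot(label, num_classes=NUM_CLASSES):
    """Return a one-hot list with a 1.0 at position `label`."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual entries where the thresholded prediction equals the target."""
    hits = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            hits += int((p > threshold) == (t == 1.0))
            total += 1
    return hits / total

def top1_accuracy(y_true, y_pred):
    """Fraction of rows where the largest prediction sits at the true class."""
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        hits += int(pred_row.index(max(pred_row)) == true_row.index(1.0))
    return hits / len(y_true)

# Toy labels (made up) and a "model" that outputs 0.01 for every class.
labels = [3, 4, 19, 3, 16]
y_true = [one_hot(label) for label in labels]
y_pred = [[0.01] * NUM_CLASSES for _ in labels]

bin_acc = binary_accuracy(y_true, y_pred)
top1_acc = top1_accuracy(y_true, y_pred)
print("Binary accuracy: %.2f%%" % (bin_acc * 100))   # 97.83%
print("Top-1 accuracy:  %.2f%%" % (top1_acc * 100))  # 0.00%
```

If the assumption holds, a score of 45/46 = 97.83% is what a model that never predicts any topic would receive. A softmax output layer trained with `categorical_crossentropy` would make the reported accuracy a per-newswire figure instead.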

#### Model 2:

In [22]:
# Construct our model
# Change LSTM size, embedding_vector_length, and activation function
embedding_vecor_length = 16
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
model.add(keras.layers.LSTM(16, return_sequences = True))
model.add(keras.layers.LSTM(16))
model.add(keras.layers.Dense(46, activation='tanh'))

# Compile Model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Fit Model
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=32)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 16)           160000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 200, 16)           2112      
_________________________________________________________________
lstm_3 (LSTM)                (None, 16)                2112      
_________________________________________________________________
dense_1 (Dense)              (None, 46)                782       
Total params: 165,006
Trainable params: 165,006
Non-trainable params: 0
_________________________________________________________________
None
Train on 8982 samples, validate on 2246 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1a30e8b4e0>

In [23]:
# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 98.48%


In [24]:
# Evaluate model
scores = model.evaluate(x_train, y_train, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 98.48%


This model performs slightly better than the last.  The changes I implemented were a smaller LSTM size, a smaller embedding vector length, and the hyperbolic tangent activation function.  Because this model has roughly half as many trainable parameters as the prior model, I expected it to perform worse, but was surprised to see it beat Model 1's performance.
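
The "roughly half" figure can be checked against the two model summaries using the standard parameter-count formulas for embedding, LSTM, and dense layers. This is a small sketch; the helper names are hypothetical and the layer sizes are read from the code cells above.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

# Model 1: embedding 32, two LSTM(32) layers, Dense(46).
model_1 = (embedding_params(10000, 32) + lstm_params(32, 32)
           + lstm_params(32, 32) + dense_params(32, 46))
# Model 2: embedding 16, two LSTM(16) layers, Dense(46).
model_2 = (embedding_params(10000, 16) + lstm_params(16, 16)
           + lstm_params(16, 16) + dense_params(16, 46))

print(model_1, model_2)  # 338158 165006
print("Model 2 / Model 1: %.2f" % (model_2 / model_1))
```

Most of the difference comes from the embedding layer, which shrinks from 320,000 to 160,000 parameters.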

#### Model 3:

In [25]:
# Construct our model
# Changes include larger embedded vector length, added dense layer with dropout.
embedding_vecor_length = 64
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
model.add(keras.layers.LSTM(32, return_sequences = True))
model.add(keras.layers.LSTM(32))
model.add(keras.layers.Dense(500))
model.add(keras.layers.Dropout(0.3, seed = 42))
model.add(keras.layers.Dense(46, activation='sigmoid'))

# Compile Model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Fit Model
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=32)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 200, 64)           640000    
_________________________________________________________________
lstm_4 (LSTM)                (None, 200, 32)           12416     
_________________________________________________________________
lstm_5 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_2 (Dense)              (None, 500)               16500     
_________________________________________________________________
dropout (Dropout)            (None, 500)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 46)                23046     
Total params: 700,282
Trainable params: 700,282
Non-trainable params: 0
_________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x1a37636ef0>

In [26]:
# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 98.73%


In [27]:
# Evaluate model
scores = model.evaluate(x_train, y_train, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 98.79%


I expected the final model to be the most overfit because of the added dense layer, and that proved correct.  The model performs the best, despite being slightly overfit.

### 4. Findings

All in all, these models all performed well, with the worst model scoring above 97% accuracy.  This seems suspiciously good, but I can't find where a mistake would come from.  My favorite model was Model 3, which combined the LSTM layers with a larger embedding vector and an added dense layer with dropout.  I liked this model because it paired the more cutting-edge LSTM RNN with the more classic dense layer with dropout to prevent overfitting.  Overall, the overfitting for this model was minor, since the model performed only 0.06 percentage points better on the training set (98.79% vs. 98.73%).