## Lesson 10: New Topic Identification
### Author: Ana Javed

### Workplace Scenario

Your next-generation search engine startup succeeded in searching images by their content. As a result, the startup received a second round of funding to extend its search to news articles by topic. As the lead data scientist, you are tasked with building a model that classifies the topic of each article or newswire.

For this assignment, you will build on the RNN_KERAS.ipynb lab in the lesson and use the Keras Reuters newswire topics classification dataset. This dataset contains 11,228 newswires from Reuters, labeled over 46 topics. Each wire is encoded as a sequence of word indexes. For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words". As a convention, "0" does not stand for a specific word; in this notebook it is the padding token, while words outside the kept vocabulary are encoded as "2", the loader's default `oov_char`.
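
The filtering operation described above can be sketched in plain Python. This is only an illustration on a made-up sequence: `filter_by_rank` is a hypothetical helper, and it assumes out-of-range indexes are replaced by the marker `2`. In the Keras loader the same effect comes from the `num_words` and `skip_top` arguments of `load_data`.

```python
def filter_by_rank(sequence, num_words, skip_top=0, oov_char=2):
    """Keep indexes in [skip_top, num_words); replace the rest with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in sequence]

# Hypothetical encoded wire, not taken from the dataset
wire = [1, 5, 18, 25, 9999, 10000, 4579]

# "Top 10,000 most common words, but eliminate the top 20"
filtered = filter_by_rank(wire, num_words=10000, skip_top=20)
print(filtered)  # [2, 2, 2, 25, 9999, 2, 4579]
```

Because indexes follow frequency rank, a single range check is enough to drop both very common and very rare words.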


#### Instructions

Complete the lab exercises for this week before following these steps to complete your assignment.

Using the Keras dataset, create a new notebook and perform each of the following data preparation tasks and answer the related questions:

1. Read the Reuters dataset into training and testing sets
2. Prepare the dataset
3. Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration.
4. Describe and explain your findings.




In [37]:
## Loading Necessary Packages 
import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

### 1. Read Reuters dataset into training and testing


In [38]:
## Loading Reuters dataset 
## More information here: https://keras.io/api/datasets/reuters/

data = tf.keras.datasets.reuters


In [39]:
## Splitting Data into Training and Testing datasets 
num_of_words=10000
(x_train, y_train), (x_test, y_test) = data.load_data(num_words=num_of_words)

In [40]:
## Checking the data:
print("Training Data (independent): ")
print(x_train[0])

print("\n")

print("Training Data (dependent): ")
print(y_train)


Training Data (independent): 
[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]


Training Data (dependent): 
[ 3  4  3 ... 25  3 25]


### 2. Prepare dataset



In [41]:
# A dictionary mapping words to an integer index
word_index = tf.keras.datasets.reuters.get_word_index()


In [42]:
## The first indices are reserved, so shift every word's index up by three
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

## Function to join words into sentences 
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
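
The index shift used above is easier to see on a tiny vocabulary. The sketch below uses a made-up three-word dictionary in place of the real `get_word_index()` result, so the words and ranks are assumptions for illustration only.

```python
# Toy stand-in for reuters.get_word_index(); values are hypothetical ranks
raw_word_index = {"the": 1, "of": 2, "oil": 3}

# Shift every rank by 3 to make room for the reserved tokens
toy_word_index = {word: rank + 3 for word, rank in raw_word_index.items()}
toy_word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so indexes can be turned back into words
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode_toy(sequence):
    """Join decoded tokens; indexes missing from the mapping become '?'."""
    return " ".join(toy_reverse_index.get(i, "?") for i in sequence)

decoded = decode_toy([1, 6, 2, 4, 99])
print(decoded)  # <START> oil <UNK> the ?
```

Without the shift, the most frequent words would collide with the reserved padding, start, and unknown markers.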

In [43]:
## Checking that the sentences decode correctly
decode_review(x_train[100])

"<START> opec believes world oil prices should be set around a fixed average price of 18 dlrs a barrel <UNK> assistant general secretary <UNK> al wattari said today in a speech to a european community ec <UNK> opec seminar in luxembourg released here al wattari said opec believes the world energy trade should be kept without restrictions and should be built around a fixed average price of 18 dlrs but he warned that defense of the 18 dlr a barrel level had caused hardship for opec countries who had been forced to curtail production and he warned that such cutbacks by opec states could not be sustained in some cases for opec to stabilize the world oil price at what is now considered the optimal level of 18 dlrs a barrel its member countries have had to undergo severe hardship in <UNK> production al wattari said such cutbacks cannot in certain cases be sustained al wattari said as well as financial and marketing pressures some states depended on associated gas output for domestic use and 

### 3. Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration.


In [66]:
# Pad or truncate every newswire to 350 tokens. pad_sequences pads and truncates
# at the front by default, so longer wires keep their last 350 tokens.
max_review_length = 350
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_length)
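
To make the default behavior of `pad_sequences` concrete, here is a minimal pure-Python sketch of "pre" padding and "pre" truncation on a single sequence. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic padding='pre' and truncating='pre' for one sequence."""
    truncated = sequence[-maxlen:]                 # drop tokens from the front
    padding = [value] * (maxlen - len(truncated))  # fill the front with the pad value
    return padding + truncated

short_wire = pad_pre([1, 2, 3], maxlen=5)
long_wire = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_wire)  # [0, 0, 1, 2, 3]
print(long_wire)   # [3, 4, 5, 6, 7]
```

If the opening of each newswire matters more than its ending, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.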

In [67]:
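# Construct our model #1: embedding -> LSTM -> dense output over the 46 topics
embedding_vector_length = 25
model1 = keras.models.Sequential()
# Map each of the 10,000 word indexes to a 25-dimensional vector
model1.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
# A single LSTM layer with 100 units reads the padded sequence
model1.add(keras.layers.LSTM(100))
# One output unit per topic
model1.add(keras.layers.Dense(46, activation='sigmoid'))
model1.compile(loss='sparse_categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
print(model1.summary())
# The test split is also passed as validation data
model1.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=50)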

Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, 350, 25)           250000    
_________________________________________________________________
lstm_18 (LSTM)               (None, 100)               50400     
_________________________________________________________________
dense_18 (Dense)             (None, 46)                4646      
Total params: 305,046
Trainable params: 305,046
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f9903ba3820>

In [68]:
# Evaluate model #1
scores = model1.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 36.20%


In [69]:
# Construct our model #2: larger embedding (50), Adamax optimizer, batch size 100
embedding_vector_length = 50
model2 = keras.models.Sequential()
model2.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model2.add(keras.layers.LSTM(100))
model2.add(keras.layers.Dense(46, activation='sigmoid'))
model2.compile(loss='sparse_categorical_crossentropy', optimizer='Adamax', metrics=['accuracy'])
print(model2.summary())
# The test split is also passed as validation data
model2.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=100)

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_19 (Embedding)     (None, 350, 50)           500000    
_________________________________________________________________
lstm_19 (LSTM)               (None, 100)               60400     
_________________________________________________________________
dense_19 (Dense)             (None, 46)                4646      
Total params: 565,046
Trainable params: 565,046
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f9904d40c70>

In [70]:
# Evaluate model #2
scores = model2.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 36.20%


In [71]:
# Construct our model #3: embedding length 32, Adam optimizer, batch size 64
embedding_vector_length = 32
model3 = keras.models.Sequential()
model3.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model3.add(keras.layers.LSTM(100))
model3.add(keras.layers.Dense(46, activation='sigmoid'))
model3.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model3.summary())
# The test split is also passed as validation data
model3.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_20 (Embedding)     (None, 350, 32)           320000    
_________________________________________________________________
lstm_20 (LSTM)               (None, 100)               53200     
_________________________________________________________________
dense_20 (Dense)             (None, 46)                4646      
Total params: 377,846
Trainable params: 377,846
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f9923c246d0>

In [72]:
# Evaluate model #3
scores = model3.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 49.60%
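
The `Param #` columns in the three model summaries above can be reproduced by hand, which is a useful check that each layer is wired as intended. The sketch below uses the standard parameter-count formulas for embedding, LSTM, and dense layers; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length embedding_dim per word index
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

def total_params(embedding_dim, vocab_size=10000, lstm_units=100, num_classes=46):
    return (embedding_params(vocab_size, embedding_dim)
            + lstm_params(embedding_dim, lstm_units)
            + dense_params(lstm_units, num_classes))

for dim in (25, 50, 32):
    print(dim, lstm_params(dim, 100), total_params(dim))
# 25 50400 305046
# 50 60400 565046
# 32 53200 377846
```

Only the embedding and LSTM counts change between the three models, since the LSTM width and the 46-unit output layer are identical.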


### 4. Describe and explain your findings.


Above, I tested three RNN models that classify Reuters newswires by topic. In the first model, I used the sparse categorical cross-entropy loss function, the SGD optimizer, an LSTM layer with 100 units, an embedding vector length of 25, and a batch size of 50. The accuracy after the third epoch was 36.20%, and the testing accuracy was also 36.20%.

In the second model, I kept the sparse categorical cross-entropy loss function and the same LSTM layer, but changed the embedding vector length to 50, the optimizer to Adamax, and the batch size to 100. Interestingly, this gave the same accuracy of 36.20% for both training and testing.
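
When two different configurations land on exactly the same accuracy, one possible explanation is that both models predict the single most frequent topic for every newswire. A quick way to test that hypothesis is to compute the majority-class baseline and compare it with the reported score. The sketch below uses made-up labels; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_class)
    return majority_class, hits / len(test_labels)

# Toy labels for illustration only, not the Reuters data
toy_train = [3, 4, 3, 25, 3, 4]
toy_test = [3, 3, 4, 25, 3]

majority_class, baseline = majority_baseline_accuracy(toy_train, toy_test)
print(majority_class, baseline)  # 3 0.6
```

In the notebook, the same call could be made with `y_train.tolist()` and `y_test.tolist()`. If the baseline matches the models' accuracy, the models have probably not yet learned to separate topics.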

In the third model, I once again kept the sparse categorical cross-entropy loss function and the same LSTM layer. I changed the embedding vector length to 32, the optimizer to Adam, and the batch size to 64. This resulted in a higher accuracy of 49.60% after the third epoch, and the testing accuracy was also 49.60%.

While testing different hyperparameters, I found that the sparse categorical cross-entropy loss function performed best, which is why I kept it across all three models. Instead, I varied the optimizer, embedding vector length, and batch size. Since the highest accuracy I obtained was still low (49.60%), I would continue adjusting the hyperparameters to improve it.
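
One adjustment that may be worth trying next is the output activation. All three models end in a sigmoid layer, whereas softmax is the conventional choice when each newswire has exactly one topic, because its outputs form a probability distribution over the classes. The pure-Python sketch below, on made-up scores, shows the difference; in Keras this would correspond to `activation='softmax'` on the final `Dense` layer, and whether it improves accuracy here is untested.

```python
import math

def sigmoid(logits):
    # Each score is squashed independently into (0, 1)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    # Scores are normalized jointly so that they sum to 1
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 0.5, -1.0]  # hypothetical scores for three topics
sigmoid_sum = sum(sigmoid(toy_logits))
softmax_sum = sum(softmax(toy_logits))
print(round(sigmoid_sum, 3), round(softmax_sum, 3))
```

The sigmoid outputs do not sum to 1, so they cannot be read directly as competing topic probabilities, while the softmax outputs can.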
