## Tips to tweak

- Data and preprocessing-based approaches
  - Use more data
  - Adjust the vocabulary size (keep the overall size of the corpus in mind!)
  - Adjust the sequence length (more or less padding or truncation)
  - Choose whether to pad or truncate `pre` or `post` (usually less of an effect than the others)
- Model-based approaches
  - Adjust the number of embedding dimensions
  - Swap `Flatten` for `GlobalAveragePooling1D` (often better)
  - Consider other layers such as `Dropout`
  - Adjust the number of nodes in intermediate fully-connected layers
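
To make the `pre` versus `post` choice concrete, here is a minimal pure-Python sketch of padding and truncating a single sequence. The helper `pad_or_truncate` is a hypothetical stand-in written for illustration; it is meant to mirror how `pad_sequences` treats one sequence, not to replace it.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Illustrative stand-in for Keras `pad_sequences`, applied to one sequence."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front of the tokens, 'post' puts it after them
    return pad + list(seq) if padding == "pre" else list(seq) + pad


short = [5, 8, 2]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short, 5, padding="pre"))     # [0, 0, 5, 8, 2]
print(pad_or_truncate(short, 5, padding="post"))    # [5, 8, 2, 0, 0]
print(pad_or_truncate(long, 5, truncating="pre"))   # [3, 4, 5, 6, 7]
print(pad_or_truncate(long, 5, truncating="post"))  # [1, 2, 3, 4, 5]
```

Whichever combination you pick, it is usually worth applying the same one to the training and validation sequences.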

## Import

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd
import io
import matplotlib.pyplot as plt

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Functions

In [None]:
def plot_graphs(title, history, string):
  plt.title(title)
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

def predict_review(model, reviews, tokenizer, maxlen=100, show_padded_sequence=False, trunc_type='post', padding_type='post'):
  # Create the sequences
  sample_sequences = tokenizer.texts_to_sequences(reviews)
  reviews_padded = pad_sequences(sample_sequences, maxlen=maxlen, padding=padding_type, truncating=trunc_type) 
  classes = model.predict(reviews_padded)
  for x in range(len(reviews_padded)):
    # Optionally show the padded sequence before the prediction
    if show_padded_sequence:
      print(reviews_padded[x])
    print(reviews[x], classes[x])

def train_model(model, training_sequences, testing_sequences, training_labels, testing_labels, epochs=30, learning_rate=0.001):
  model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=['accuracy'])
  model.summary()
  history = model.fit(training_sequences, training_labels, epochs=epochs, validation_data=(testing_sequences, testing_labels))
  return history

def plot_results(title, history):
  plot_graphs(title, history, "accuracy")
  plot_graphs(title, history, "loss")

# Using LSTMs, CNNs, GRUs with a larger dataset

In this colab, you use different kinds of layers to see how they affect the model.

You will use the `glue/sst2` dataset, which is available through `tensorflow_datasets`. 

The General Language Understanding Evaluation (GLUE) benchmark ([https://gluebenchmark.com/](https://gluebenchmark.com/)) is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

These resources include the `Stanford Sentiment Treebank (SST)` dataset that consists of sentences from movie reviews and human annotations of their sentiment. This colab uses version 2 of the SST dataset.

The splits are:

*   train: 67,349
*   validation: 872


and the column headings are:

*   sentence
*   label


For more information about the dataset, see [https://www.tensorflow.org/datasets/catalog/glue#gluesst2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2)
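
Each `sentence` arrives from `tensorflow_datasets` as a byte string, so `.numpy()` returns Python `bytes` rather than `str`. The sketch below uses a made-up sentence (not taken from the dataset) to show the difference between wrapping the bytes in `str()` and decoding them as UTF-8.

```python
# Made-up stand-in for what item["sentence"].numpy() returns
raw_sentence = b"a charming film "

as_repr = str(raw_sentence)             # keeps the b'...' wrapper in the text
as_text = raw_sentence.decode("utf-8")  # plain text, ready for tokenizing

print(as_repr)  # b'a charming film '
print(as_text)  # a charming film
```

With `str()`, the wrapper characters become part of the text the tokenizer sees, so decoding is generally the safer choice.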

In [None]:
# Get the dataset.
# It has about 70,000 items, so it might take a while to download
dataset, info = tfds.load('glue/sst2', with_info=True)
print(info.features)
print(info.features["label"].num_classes)
print(info.features["label"].names)

# Get the training and validation datasets
dataset_train, dataset_validation = dataset['train'], dataset['validation']
print(dataset_train)

# Print some of the entries
for example in dataset_train.take(2):  
  review, label = example["sentence"], example["label"]
  print("Review:", review)
  print("Label: %d \n" % label.numpy())

# Get the sentences and the labels
# for both the training and the validation sets
training_reviews = []
training_labels = []
 
validation_reviews = []
validation_labels = []

# The dataset has 67,000 training entries, but that's a lot to process here!

# If you want to take the entire dataset: WARNING: takes longer!!
# for item in dataset_train.take(-1):

# Take 10,000 reviews
for item in dataset_train.take(10000):
  review, label = item["sentence"], item["label"]
  # Decode the bytes; str() would keep the b'...' wrapper in the text
  training_reviews.append(review.numpy().decode("utf-8"))
  training_labels.append(label.numpy())

print("\nNumber of training reviews is: ", len(training_reviews))

# Print some of the reviews and labels
for i in range(0, 2):
  print(training_reviews[i])
  print(training_labels[i])

# Get the validation data
# There are only 872 items, so take them all
for item in dataset_validation.take(-1):
  review, label = item["sentence"], item["label"]
  validation_reviews.append(review.numpy().decode("utf-8"))
  validation_labels.append(label.numpy())

print("\nNumber of validation reviews is: ", len(validation_reviews))

# Print some of the validation reviews and labels
for i in range(0, 2):
  print(validation_reviews[i])
  print(validation_labels[i])


## Tokenize the words
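
The cell below relies on `num_words` and `oov_token`. As a rough illustration of how the two interact, here is a pure-Python sketch: the full word index keeps every word seen, while the conversion to sequences maps anything ranked at or beyond `num_words` to the OOV index. The helpers `build_word_index` and `to_sequence` are hypothetical and simplified (whitespace splitting only), not the Keras implementation.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 1 is reserved for the OOV token; other words follow by frequency
    ordered = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ordered, start=1)}


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    oov = word_index[oov_token]
    seq = []
    for word in text.lower().split():
        i = word_index.get(word)
        # Unseen words and words ranked at or beyond num_words both map to OOV
        seq.append(oov if i is None or i >= num_words else i)
    return seq


texts = ["good movie", "good plot", "bad movie", "good acting"]
word_index = build_word_index(texts)

print(len(word_index))                                       # 6
print(to_sequence("good movie plot", word_index, num_words=4))  # [2, 3, 1]
print(to_sequence("good bad film", word_index, num_words=4))    # [2, 1, 1]
```

This is why `len(word_index)` can print a number larger than `vocab_size`: the cap applies when texts are turned into sequences, not when the index is built.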

In [None]:
# Hyperparameters
vocab_size = 4000
embedding_dim = 16
max_length = 50
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"

# tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_reviews)
word_index = tokenizer.word_index
print(len(word_index))
print(word_index)

# training set
training_sequences = tokenizer.texts_to_sequences(training_reviews)
training_sequences = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# validation set
validation_sequences = tokenizer.texts_to_sequences(validation_reviews)
validation_sequences = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Make labels into numpy arrays for use with the network later
training_labels = np.array(training_labels)
validation_labels = np.array(validation_labels)

## Create Models
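
Before building the models, it can help to estimate how many trainable parameters each layer contributes. The sketch below works through the arithmetic for the hyperparameters used in this notebook, using the standard formulas for dense, LSTM, and 1D convolution layers. The helper names are hypothetical, and `model.summary()` remains the authoritative source for the real counts.

```python
vocab_size, embedding_dim, max_length = 4000, 16, 50

# One embedding vector per vocabulary index
embedding_params = vocab_size * embedding_dim


def dense_params(inputs, units):
    return inputs * units + units  # weights plus biases


def lstm_params(inputs, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((inputs + units) * units + units)


def conv1d_params(in_channels, filters, kernel_size):
    return kernel_size * in_channels * filters + filters


# Embedding -> GlobalAveragePooling1D -> Dense(6) -> Dense(1)
simple_total = embedding_params + dense_params(embedding_dim, 6) + dense_params(6, 1)
# A bidirectional wrapper holds two LSTMs, one per direction
bidi_lstm_layer = 2 * lstm_params(embedding_dim, embedding_dim)
# Conv1D(16, 5) applied to 16-dimensional embeddings
cnn_layer = conv1d_params(embedding_dim, 16, 5)
# Without padding and with stride 1, the convolution shortens the sequence
conv_output_steps = max_length - 5 + 1

print(embedding_params)   # 64000
print(simple_total)       # 64109
print(bidi_lstm_layer)    # 4224
print(cnn_layer)          # 1296
print(conv_output_steps)  # 46
```

In every model here the embedding table dominates the parameter count, which is one reason the vocabulary size and embedding dimensions are such influential knobs.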

In [None]:
# Simple Embeddings
model_simple = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Bidirectional LSTM
model_bidi_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)), 
    tf.keras.layers.Dense(6, activation='relu'), 
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Multiple Bidirectional LSTM
model_multiple_bidi_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, return_sequences=True)), 
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# CNN
model_cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(16, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# GRU
model_gru = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

## Predict data

In [None]:
# Use the model to predict some reviews   
new_reviews = [ """I loved this movie""",
                """that was the worst movie I've ever seen""",
                """too much violence even for a Bond film""",
                """a captivating recounting of a cherished myth""",
                """I saw this movie yesterday and I was feeling low to start with, but it was such a wonderful movie that it lifted my spirits and brightened my day, you can\'t go wrong with a movie with Whoopi Goldberg in it.""",
                """I don\'t understand why it received an oscar recommendation
                for best movie, it was long and boring""",
                """the scenery was magnificent, the CGI of the dogs was so realistic I
                thought they were played by real dogs even though they talked!""",
                """The ending was so sad and yet so uplifting at the same time. 
                I'm looking for an excuse to see it again""",
                """I had expected so much more from a movie made by the director 
                who made my most favorite movie ever, I was very disappointed in the tedious 
                story""",
                "I wish I could watch this movie every day for the rest of my life",
              ]

## Train the models and Predict
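
Each model ends in a single sigmoid unit, so `predict_review` prints a score between 0 and 1 for every review. One way to read those scores is to threshold them into the dataset's two label names. The sketch below assumes a cutoff of 0.5 and uses made-up scores; `to_labels` is a hypothetical helper, not part of the notebook.

```python
# SST-2 label convention: 0 = negative, 1 = positive
label_names = ["negative", "positive"]


def to_labels(probabilities, threshold=0.5):
    # Scores at or above the threshold count as positive
    return [label_names[int(p >= threshold)] for p in probabilities]


scores = [0.91, 0.08, 0.47]  # made-up numbers, not real model output
print(to_labels(scores))  # ['positive', 'negative', 'negative']
```

With real predictions, the array returned by `model.predict` has one column, so flattening it first (for example with `classes.ravel()`) would give the list of scores this helper expects.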

In [None]:
history = train_model(model_simple, training_sequences, validation_sequences, training_labels, validation_labels)
plot_results("Simple Embeddings", history)
predict_review(model_simple, new_reviews, tokenizer, maxlen=max_length)

history = train_model(model_bidi_lstm, training_sequences, validation_sequences, training_labels, validation_labels, learning_rate=0.0003)
plot_results("Bi-LSTM", history)
predict_review(model_bidi_lstm, new_reviews, tokenizer, maxlen=max_length)

history = train_model(model_multiple_bidi_lstm, training_sequences, validation_sequences, training_labels, validation_labels, learning_rate=0.003)
plot_results("multi-Bi-LSTM", history)
predict_review(model_multiple_bidi_lstm, new_reviews, tokenizer, maxlen=max_length)

history = train_model(model_cnn, training_sequences, validation_sequences, training_labels, validation_labels, learning_rate=0.0001)
plot_results("CNN", history)
predict_review(model_cnn, new_reviews, tokenizer, maxlen=max_length)

history = train_model(model_gru, training_sequences, validation_sequences, training_labels, validation_labels, learning_rate=0.00003)
plot_results("GRU", history)
predict_review(model_gru, new_reviews, tokenizer, maxlen=max_length)