# Sentiment Analysis with Neural Networks

Acknowledgement for this notebook goes to François Chollet's "Deep Learning with Python".

In [None]:
from tensorflow import keras

## Getting the data

Keras already has an IMDB dataset for us to play around with (and one that's conveniently already split up into training and testing data).
* [Documentation for Keras's IMDB movie review sentiment classification dataset](https://keras.io/api/datasets/imdb/)

In [None]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=1500)
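A minimal sketch of what `num_words=1500` does, assuming `load_data` swaps every index at or above the cap for the out-of-vocabulary index 2 (the documented default for `oov_char`). `cap_vocabulary` and the toy review are made up for illustration; they are not part of Keras.

```python
# Toy illustration of a vocabulary cap (not the Keras implementation).
OOV_INDEX = 2  # assumed default out-of-vocabulary index


def cap_vocabulary(sequence, num_words, oov_index=OOV_INDEX):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]


# Hypothetical encoded review: larger numbers stand for rarer words.
toy_review = [1, 14, 22, 1622, 43, 5000, 973]
capped = cap_vocabulary(toy_review, num_words=1500)

print(capped)       # [1, 14, 22, 2, 43, 2, 973]
print(max(capped))  # 973, always below num_words
```

This is why every index in `train_data` stays below 1500, which matters once the reviews are turned into fixed-length vectors.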

Let's look at some of the data.

In [None]:
print(train_data[0])

In [None]:
train_labels[0]

Each review is currently a list of integers, one per word.  We can use `get_word_index` to retrieve the mapping from words to integers, and we can reverse it if we want to see what word corresponds to a particular integer index.

In [None]:
word_index = imdb.get_word_index()  # word -> integer index
reverse_word_index = {index: word for word, index in word_index.items()}  # integer index -> word

In [None]:
print(reverse_word_index[19])

In [None]:
reverse_word_index.get(34701-3)

In [None]:
reverse_word_index.get(2)

This `reverse_word_index` is useful, for example, if we want to actually read a review rather than see it as a list of numbers.
* One slight catch is that the indices stored in the reviews are offset by 3, since 0, 1, and 2 are reserved for padding, the start-of-sequence marker, and unknown words (see the docs).

In [None]:
dataitem = 0
' '.join([reverse_word_index.get(i-3, '?') for i in train_data[dataitem]])
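The offset is easier to see on a toy vocabulary. The word index below is invented (the real one comes from `imdb.get_word_index()`), and `decode_review` is a hypothetical helper; the meaning of indices 0, 1 and 2 follows the Keras defaults.

```python
# Toy word index: word -> frequency rank, starting at 1 (made-up vocabulary).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Invert the mapping: rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode_review(encoded, reverse_index, offset=3):
    """Turn a list of integer indices back into text.

    Indices 0, 1 and 2 are reserved, so every real word is stored as
    rank + offset. Reserved or unknown indices are rendered as '?'.
    """
    return " ".join(reverse_index.get(i - offset, "?") for i in encoded)


# 1 = start marker, 2 = unknown word, the rest are rank + 3.
toy_encoded = [1, 4, 5, 6, 2, 7]

print(decode_review(toy_encoded, toy_reverse_index))            # ? the film was ? brilliant
print(decode_review(toy_encoded, toy_reverse_index, offset=0))  # the brilliant ? ? film ?
```

Forgetting the offset does not raise an error. It quietly returns the wrong words, so it is worth checking any decoding line for the `- 3`.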

We'll need to have consistent array inputs to our neural network: every review has to become a vector of the same length.
* Create vectors that are 1500 elements long (the size of the vocabulary we kept with `num_words=1500`), with a 1 at each index that occurs in the review and a 0 everywhere else.

In [None]:
import numpy as np

In [None]:
def vectorize_sequences(sequences, dimension=1500):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
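A quick check on toy data shows what this encoding keeps and what it throws away. The sketch repeats the same function with a small `dimension` so the output is readable; the two sequences are made up.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=1500):
    """Multi-hot encode: row i gets a 1.0 at every index present in sequence i."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results


# Two made-up "reviews"; index 3 appears twice in the first one.
toy_sequences = [[1, 3, 3, 5], [0, 2]]
toy_vectors = vectorize_sequences(toy_sequences, dimension=8)

print(toy_vectors)
# [[0. 1. 0. 1. 0. 1. 0. 0.]
#  [1. 0. 1. 0. 0. 0. 0. 0.]]
print(toy_vectors.sum(axis=1))  # [3. 2.] -> distinct indices per review, not review length
```

Word order and repeat counts are lost: the vector only records which words occur. That is also why, for the real data, `x_train[0].sum()` gives the number of distinct indices in the first review while `len(x_train[0])` is always 1500.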

In [None]:
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [None]:
x_train[0]

In [None]:
x_train[0].sum()

In [None]:
len(x_train[0])

In [None]:
train_labels[0]

Let's also convert our labels into floats, as the network will output a float between 0 and 1 (the probability that the review is positive, i.e. label 1) rather than a strict 0 or 1.

In [None]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

In [None]:
y_train[0]
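Where that float comes from: the last layer of the network squashes its raw score through a sigmoid, which gives a number between 0 and 1. A minimal NumPy sketch with made-up scores (the real scores come from the trained network):

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical raw scores from the last layer, before the sigmoid.
raw_scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(raw_scores)

print(probabilities.round(3))  # approximately 0.018, 0.5 and 0.982
```

A strongly negative score ends up near 0 (negative review), a strongly positive one near 1 (positive review), and a score of 0 sits exactly on the fence at 0.5.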

## Build our network

In [None]:
from keras import models
from keras import layers

In [None]:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(1500,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
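What these three layers compute can be sketched in plain NumPy. The weights below are random placeholders, not trained values, so the output probability is meaningless; the point is the shapes and the parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the model above: 1500 inputs -> 16 -> 16 -> 1.
sizes = [1500, 16, 16, 1]

# Random placeholder weights and zero biases, one pair per Dense layer.
weights = [rng.normal(scale=0.05, size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x):
    """Dense(16, relu) -> Dense(16, relu) -> Dense(1, sigmoid)."""
    h = relu(x @ weights[0] + biases[0])
    h = relu(h @ weights[1] + biases[1])
    return sigmoid(h @ weights[2] + biases[2])


# One fake multi-hot review with 100 active word indices.
x = np.zeros((1, 1500))
x[0, rng.choice(1500, size=100, replace=False)] = 1.0

print(forward(x).shape)  # (1, 1): one probability per review

# Each Dense layer has n_in * n_out weights plus n_out biases.
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(n_params)  # 24305 = (1500*16 + 16) + (16*16 + 16) + (16*1 + 1)
```

Calling `model.summary()` on the Keras model should report the same total number of parameters.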

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
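# Alternative: pass an optimizer object to set the learning rate explicitly.

# from keras import optimizers

# model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
#               loss='binary_crossentropy',
#               metrics=['accuracy'])

# -or- pass the loss and metric as functions instead of strings:

# from keras import losses
# from keras import metrics

# model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
#               loss=losses.binary_crossentropy,
#               metrics=[metrics.binary_accuracy])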
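The `binary_crossentropy` loss passed to `compile` can be written out by hand. A minimal NumPy sketch with made-up labels and predictions; the clipping constant is an assumption for numerical safety, not necessarily the value Keras uses.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))


# Made-up labels and predicted probabilities.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_and_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_and_wrong = np.array([0.1, 0.9, 0.2, 0.8])

print(round(binary_crossentropy(y_true, confident_and_right), 3))  # 0.164
print(round(binary_crossentropy(y_true, confident_and_wrong), 3))  # 1.956
```

The loss is small when the predicted probability leans towards the true label and grows quickly when the model is confidently wrong, which is what training pushes against.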

We are going to hold out a validation set too, so that we can see how the model is doing on data it has not trained on after each epoch.

In [None]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

In [None]:
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
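The split itself is plain slicing. Here is a sketch on row numbers only, assuming the IMDB training set has 25,000 reviews; `row_ids` is a stand-in for the real arrays.

```python
import numpy as np

n_train_total = 25000  # assumed size of the IMDB training set
n_val = 10000          # rows held out for validation

row_ids = np.arange(n_train_total)
val_ids = row_ids[:n_val]      # same slice as x_val / y_val
partial_ids = row_ids[n_val:]  # same slice as partial_x_train / partial_y_train

print(len(val_ids), len(partial_ids))             # 10000 15000
print(np.intersect1d(val_ids, partial_ids).size)  # 0 -> no row is in both sets
```

Taking the first 10,000 rows only gives a fair validation set if the rows are not sorted by label. As far as the Keras docs describe it, `load_data` shuffles the reviews (it has a `seed` argument for that), so plain slicing should be fine here.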

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
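A bit of arithmetic on what `epochs=20` and `batch_size=512` mean here, assuming 15,000 training rows are left after the validation split (25,000 minus 10,000):

```python
import math

n_train = 25000 - 10000  # rows left after holding out the validation set
batch_size = 512
epochs = 20

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 30
print(last_batch_size)  # 152
print(total_updates)    # 600
```

So each epoch is 30 weight updates, and the validation loss and accuracy are computed once at the end of each of them.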

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

It looks like the validation loss was lowest at either epoch 4 or epoch 6.  After that the model overfits pretty badly.
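Rather than reading the epoch off the plot, it can be picked programmatically. The loss values below are invented to mimic the shape of a typical curve and are not this notebook's results; in the notebook you would pass `history_dict['val_loss']` instead.

```python
import numpy as np

# Made-up validation losses for 8 epochs (falls, then rises as overfitting sets in).
toy_val_loss = [0.52, 0.41, 0.36, 0.33, 0.34, 0.35, 0.38, 0.42]

# argmin is 0-based, epochs are counted from 1.
best_epoch = int(np.argmin(toy_val_loss)) + 1

print(best_epoch)  # 4
```

Because the exact minimum can move by an epoch or two between runs, it is worth treating the result as a rough guide rather than a precise answer.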

We'll train a fresh model for only 4 epochs, this time on the full training set, and then assess its accuracy on the test set.

In [None]:
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)

In [None]:
results
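`evaluate` returns the loss followed by each metric passed to `compile`, so here `results` is `[test loss, test accuracy]`. For a sigmoid output, accuracy comes from thresholding the probability. Here is a sketch with made-up predictions, assuming the default threshold of 0.5:

```python
import numpy as np

# Made-up predicted probabilities and true labels for six reviews.
toy_probs = np.array([0.92, 0.13, 0.55, 0.48, 0.07, 0.81])
toy_labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])

# Probability above 0.5 -> predict positive (1), otherwise negative (0).
toy_preds = (toy_probs > 0.5).astype("float32")

accuracy = float(np.mean(toy_preds == toy_labels))

print(toy_preds)           # [1. 0. 1. 0. 0. 1.]
print(round(accuracy, 3))  # 0.667
```

Note that accuracy ignores how confident the model was: 0.55 and 0.92 both count as "positive". The loss is the number that reflects confidence.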

Let's look at an example sentiment classification.

In [None]:
dataitem = 0
' '.join([reverse_word_index.get(i - 3, '?') for i in test_data[dataitem]])

In [None]:
sample = 10
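# Predicted probability that the review is positive (a batch of one review)
print(model.predict(x_test[sample].reshape(-1,1500)))
# True label: 1.0 = positive, 0.0 = negative
print(y_test[sample])
# The review text (indices are offset by 3, as above)
print(' '.join([reverse_word_index.get(i - 3, '?') for i in test_data[sample]]))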
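`predict` expects a batch of reviews, so a single row has to be given a leading batch dimension first; that is what `reshape(-1, 1500)` does. A shape-only sketch on a small stand-in array (the zeros are placeholders for real review vectors):

```python
import numpy as np

# Stand-in for x_test: 20 fake reviews, each a 1500-long vector.
fake_x_test = np.zeros((20, 1500))
sample = 10

single_row = fake_x_test[sample]             # shape (1500,): one review, no batch axis
as_batch = single_row.reshape(-1, 1500)      # shape (1, 1500): a batch of one
also_batch = fake_x_test[sample:sample + 1]  # slicing keeps the batch axis

print(single_row.shape, as_batch.shape, also_batch.shape)
# (1500,) (1, 1500) (1, 1500)
```

With a batch of one going in, the prediction comes back with shape `(1, 1)`: a single probability to compare against `y_test[sample]`.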