In [None]:
%matplotlib inline
import tensorflow as tf
from tensorflow import keras

import numpy as np
print(tf.__version__)

## Learning Objectives:

- Load movie-review string data as **sparse feature vectors**
- Implement a sentiment-analysis neural network model using an **embedding** that projects data into **two dimensions**
- **Visualize** the embedding to see what the model has learned about the relationships between words
- Use a **pretrained** embedding, GloVe (Global Vectors for Word Representation)

In [None]:
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=1000)

The argument num_words=1000 keeps the top 1,000 most frequently occurring words in the training data. The rare words are discarded to keep the size of the data manageable.
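
To make the effect of a frequency cutoff concrete, here is a toy sketch in plain Python. It is not how Keras implements load_data; the names toy_reviews and keep_top_words, and the choice to replace dropped words with an "<UNK>" marker, are assumptions made for illustration.

```python
from collections import Counter

def keep_top_words(reviews, num_words, unk="<UNK>"):
    # Count how often each word occurs across all reviews
    counts = Counter(word for review in reviews for word in review)
    # Keep only the num_words most frequent words
    kept = {word for word, _ in counts.most_common(num_words)}
    # Replace every other word with the unknown marker
    return [[w if w in kept else unk for w in review] for review in reviews]

# Made-up reviews, already split into words
toy_reviews = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
    ["great", "fun", "movie"],
]
limited = keep_top_words(toy_reviews, num_words=2)
print(limited)
```

Only "great" and "movie" survive the cutoff in this toy data; every rarer word collapses into the same marker.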

## Explore the data

In [None]:
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

The text of the reviews has been converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:



In [None]:
print(train_data[0])

## Convert the integers back to words

In [None]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
word_index['comedy']

In [None]:
decode_review(train_data[2]),train_labels[2]

## Prepare the data
### Restrict the vocabulary to a smaller static vocabulary

In [None]:
# 50 informative terms, plus <PAD> at index 0, that compose our model vocabulary
informative_terms = ("<PAD>", "bad", "great", "best", "worst", "fun", "beautiful",
                     "excellent", "poor", "boring", "awful", "terrible",
                     "definitely", "perfect", "liked", "worse", "waste",
                     "entertaining", "loved", "unfortunately", "amazing",
                     "enjoyed", "favorite", "horrible", "brilliant", "highly",
                     "simple", "annoying", "today", "hilarious", "enjoyable",
                     "dull", "fantastic", "poorly", "fails", "disappointing",
                     "disappointment", "not", "him", "her", "good", "time",
                     "?", ".", "!", "movie", "film", "action", "comedy",
                     "drama", "family")

def filtered(text):
    return [informative_terms.index(reverse_word_index.get(i, '?')) for i in text if reverse_word_index.get(i, '?') in informative_terms ]
def decode_filtered_review(text):
    return ' '.join([informative_terms[i] for i in text])

decode_filtered_review(filtered(train_data[2]))


### Filter all the reviews with the static vocabulary

In [None]:
train_data = [filtered(data) for data in train_data]

test_data = [filtered(data) for data in test_data]

### Convert reviews to arrays of integers of the same length
Pad the arrays so they all have the same length, then create an integer tensor of shape (num_reviews, max_length). We can use an embedding layer capable of handling this shape as the first layer in our network.
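
As a rough illustration of what padding does, the plain-Python sketch below pads at the end and, for sequences that are too long, keeps only the last maxlen items. The helper pad_post and the toy data are hypothetical; the notebook itself relies on pad_sequences in the next cell.

```python
def pad_post(sequence, maxlen, value=0):
    # Assumption: sequences longer than maxlen keep their last maxlen items
    trimmed = sequence[-maxlen:]
    # Append the padding value until the sequence has exactly maxlen items
    return trimmed + [value] * (maxlen - len(trimmed))

# Made-up reviews of different lengths
toy_reviews = [[4, 7], [1, 2, 3, 4, 5, 6], [9]]
padded = [pad_post(r, maxlen=4) for r in toy_reviews]
print(padded)
# One row per review, one column per position
print((len(padded), len(padded[0])))
```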

In [None]:


train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        #maxlen=256) 
                                                         maxlen=50)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       #maxlen=256)
                                                       maxlen=50)   


#decode_review(test_data[100]) 


Look at one of the examples

In [None]:
decode_filtered_review(test_data[2]) 

## Build the model
- How many layers to use in the model?
- How many hidden units to use for each layer?
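
Before choosing layer sizes, it may help to see what the first two layers of the model in the next cell compute. The sketch below uses plain Python lists and a made-up embedding table (toy_embeddings and embed_and_average are hypothetical names). It assumes every position, padding included, takes part in the average, which is the behavior when no mask is used.

```python
# Hypothetical 2-dimensional embedding table for a 4-term vocabulary
toy_embeddings = [
    [0.0, 0.0],    # index 0: <PAD>
    [-1.0, 0.5],   # index 1
    [1.0, 0.5],    # index 2
    [0.5, -1.0],   # index 3
]

def embed_and_average(token_ids, table):
    # Embedding step: look up one vector per token id
    vectors = [table[i] for i in token_ids]
    dims = len(table[0])
    # Pooling step: average the vectors dimension by dimension
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

review = [2, 3, 0, 0]   # two real terms followed by two padding positions
pooled = embed_and_average(review, toy_embeddings)
print(pooled)
```

Whatever the review length, the result is a single vector with as many entries as the embedding has dimensions, which is what the dense layers receive.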

In [None]:
#vocab_size = 10000
vocab_size = 51

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 2))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

### Loss function and optimizer
A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we'll use the **binary_crossentropy** loss function.

binary_crossentropy is well suited to probabilities: it measures the "distance" between probability distributions, in our case between the ground-truth distribution and the predictions.
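
The quantity being minimized can be written out in a few lines. This is a minimal sketch of the standard binary cross-entropy formula, assuming every predicted probability lies strictly between 0 and 1; the function name and the example values are made up, and the Keras implementation adds numerical safeguards that are not reproduced here.

```python
import math

def binary_crossentropy(labels, probabilities):
    # Per-example loss: -[y * log(p) + (1 - y) * log(1 - p)]
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, probabilities)]
    # Average over all examples
    return sum(losses) / len(losses)

labels = [1, 0, 1, 0]
# Predictions that agree with the labels give a small loss
good = binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.2])
# Confident but wrong predictions give a much larger loss
bad = binary_crossentropy(labels, [0.1, 0.9, 0.2, 0.8])
print(good, bad)
```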

In [None]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

## Create validation set
When training, we want to check the accuracy of the model on data it hasn't seen before. Create a validation set by setting aside 10,000 examples from the original training data. (Why not use the test set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy.)

In [None]:
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

## Train the model
Train the model for 13 epochs in mini-batches of 512 samples. This is 13 iterations over all samples in the partial_x_train and partial_y_train tensors. While training, monitor the model's **loss** and **accuracy** on the 10,000 samples from the validation set:
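
As a quick sanity check on the amount of work involved, the arithmetic below estimates the number of weight updates. It assumes the 25,000 IMDB training reviews minus the 10,000 validation examples, and epochs=13 as in the fit call in the next cell.

```python
import math

num_train = 25000 - 10000   # training reviews left after the validation split
batch_size = 512
epochs = 13                 # value used in the fit call

# The last batch of an epoch may be smaller, so round up
steps_per_epoch = math.ceil(num_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```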

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=13,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

## Evaluate the model
Let's see how the model performs on the test set. Two values are returned: the loss (a number that represents our error; lower values are better) and the accuracy.
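
For a sigmoid output, accuracy is typically computed by thresholding the predicted probability at 0.5 and counting matches with the labels. The sketch below illustrates that idea in plain Python; binary_accuracy is a hypothetical helper and the probabilities are made up.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    # Turn each probability into a hard 0/1 prediction
    predictions = [1 if p > threshold else 0 for p in probabilities]
    # Count the predictions that match the labels
    correct = sum(int(y == y_hat) for y, y_hat in zip(labels, predictions))
    return correct / len(labels)

accuracy = binary_accuracy([1, 0, 1, 0], [0.9, 0.4, 0.3, 0.2])
print(accuracy)
```

Here the third example is misclassified (0.3 is below the threshold while the label is 1), so three of the four predictions are correct.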

In [None]:
results = model.evaluate(test_data, test_labels)

print(results)

## Create a graph of accuracy and loss over time

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
import matplotlib.pyplot as plt

acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

In this plot, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

Notice that the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using gradient descent optimization: it should minimize the desired quantity on every iteration.

This isn't always the case for the validation loss and accuracy (especially once the number of epochs exceeds about 10): they tend to peak and then stop improving. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.

For this particular case, we could prevent overfitting by simply stopping the training once the validation metrics stop improving. Later, you'll see how to do this automatically with a callback.
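
The stopping rule itself is simple enough to sketch by hand. The function below is a hypothetical illustration of a patience-based rule, not the Keras callback; in Keras, keras.callbacks.EarlyStopping with its monitor and patience arguments plays this role. The validation losses are made-up numbers, not results from this notebook.

```python
def best_epoch_with_patience(val_losses, patience=2):
    # Track the lowest validation loss seen so far and when it occurred
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            # Stop once the loss has failed to improve `patience` times in a row
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Made-up validation losses: improving at first, then getting worse
toy_val_losses = [0.60, 0.52, 0.47, 0.45, 0.46, 0.48, 0.51]
best, stopped = best_epoch_with_patience(toy_val_losses)
print(best, stopped)
```

With these toy values the best epoch is 4 and training would stop at epoch 6, two epochs after the last improvement.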

## Let's plot the 2-dimensional embedding

In [None]:
embeddings = model.layers[0].get_weights()[0] ## takes the weights of the embedding layer
x,y=embeddings[:,0], embeddings[:,1]

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(7,7))
for i in range(0,vocab_size):
    xx, yy = x[i], y[i]
    plt.scatter(xx, yy, marker='x', color='red', alpha=0.5)
    plt.text(xx+.001, yy+.001, informative_terms[i], fontsize=10)
plt.show()


Initially, the words are distributed over the entire 2D space (when we use an embedding of dimension 2).
After 2 epochs, words with negative meaning move closer to each other and further away from words with positive meaning.
- Reinitialize the model, run only one epoch and regenerate the plot.
- Continue training for 4 epochs.

## Observe the embeddings with similarity
Instead of plotting, you can measure the similarity between two words by computing the cosine of the angle between their two embedding vectors.


In [None]:
#cosine similarity between two words
embeddings = model.layers[0].get_weights()[0]
from numpy.linalg import norm
v1 = embeddings[informative_terms.index("best")]
v2 = embeddings[informative_terms.index("worst")]
v3 = embeddings[informative_terms.index("boring")]

#v1 = embedding_matrix[word_index["kids"]]
#v2 = embedding_matrix[word_index["school"]]
#v3 = embedding_matrix[word_index["book"]]
similarity1 = np.dot(v1,v2)/(norm(v1)*norm(v2))
similarity2 = np.dot(v2,v3)/(norm(v2)*norm(v3))

similarity1, similarity2

The similarity between best and worst should become smaller as you train the model (approaching -1), while the similarity between worst and boring should become larger (close to 1).
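
To build intuition for the two extremes, here is a small worked example on hand-picked 2D vectors. The vectors are invented for illustration and the helper cosine_similarity is a plain-Python sketch of the same formula used in the cell above.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])       # close to 1
opposite_direction = cosine_similarity([1.0, 2.0], [-1.0, -2.0]) # close to -1
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # 0
print(same_direction, opposite_direction, perpendicular)
```

Only the direction of the vectors matters: scaling a vector by a positive factor leaves the similarity unchanged.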

# Use a pretrained embedding: GloVe
We will reload the reviews data, keeping the 10,000 most frequent words, and use a 100-dimensional embedding from GloVe.
Observation: from the file system, open glove.6B.100d.txt and observe its content. It is a big file, so you may prefer to inspect it with the less command.

In [None]:
import pandas as pd
import csv
glove_data_file='glove/glove.6B.100d.txt'
gl=pd.read_csv(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

## Create the matrix for the initialization of the embedding layer
The matrix has vocab_size rows and embedding_dimension columns.
In the loaded GloVe model, each word is associated with an embedding vector of dimension 100.
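
The construction can be tried on a tiny scale first. The sketch below uses a made-up 3-dimensional "GloVe" dictionary and a made-up word index (all names prefixed with toy_ are hypothetical) to show which rows get filled and which stay at zero.

```python
# Made-up pretrained vectors: word -> 3-dimensional embedding
toy_glove = {
    "good": [0.1, 0.2, 0.3],
    "bad": [-0.1, -0.2, -0.3],
}
# Made-up word index: word -> row in the embedding matrix
toy_word_index = {"<PAD>": 0, "good": 1, "bad": 2, "zzyzx": 3, "rare": 7}
toy_vocab_size = 4
toy_dim = 3

# Start from an all-zero matrix with one row per vocabulary entry
toy_matrix = [[0.0] * toy_dim for _ in range(toy_vocab_size)]
for word, i in toy_word_index.items():
    # Skip words outside the vocabulary and words without a pretrained vector
    if i < toy_vocab_size and word in toy_glove:
        toy_matrix[i] = list(toy_glove[word])

for row in toy_matrix:
    print(row)
```

Rows 0 and 3 stay at zero because "<PAD>" and "zzyzx" have no pretrained vector, and "rare" is ignored because its index falls outside the vocabulary.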


In [None]:
vocab_size = 10000
embedding_dimension=100
embedding_matrix = np.zeros((vocab_size, embedding_dimension))

# Copy the GloVe vector of each vocabulary word into its row of the matrix;
# words that GloVe does not contain keep an all-zero row.
for word, i in word_index.items():
    if i < vocab_size and word in gl.index:
        embedding_matrix[i] = gl.loc[word].values
embedding_matrix[11]

Add the pretrained embeddings as the initialization of the Embedding layer.

In [None]:


model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, embedding_dimension, weights=[embedding_matrix], trainable=True))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

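Go back to the step of setting the loss and the optimizer and redo all the learning and evaluation steps with the new model. Remember to reload and pad the data with num_words=10000 first, skipping the vocabulary filtering step, since train_data and test_data currently hold the filtered reviews.
You can no longer plot the embedding in 2D space (since it is in 100D space), but you can use the [projector](https://projector.tensorflow.org/) for TensorBoard.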

The notebook is inspired by the [TensorFlow Basic Classification Tutorial](https://www.tensorflow.org/tutorials/keras/basic_text_classification) and the [Machine Learning Crash Course - Embeddings programming exercise](https://developers.google.com/machine-learning/crash-course/embeddings/programming-exercise).