# IMDB movie reviews sentiment classification dataset 

This code loads the IMDB dataset that's included with Keras and creates a dictionary mapping the words in all 50,000 reviews to integers indicating the words' relative frequency of occurrence. Each word is assigned a unique integer.

The reason the inner lists contain numbers rather than text is that you don't train a neural network with text; you train it with numbers. Specifically, you train it with tensors. In this case, each review is a 1-dimensional tensor (think of a 1-dimensional array) containing integers identifying the words contained in the review.
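
As a rough illustration of the idea (not the code Keras uses internally), the sketch below builds a frequency-ranked vocabulary from a tiny made-up corpus and encodes one review as a list of integers. The corpus and the helper names `build_word_index` and `encode` are assumptions introduced only for this example.

```python
from collections import Counter

def build_word_index(reviews):
    """Rank words by frequency: the most common word gets 1, the next gets 2, and so on."""
    counts = Counter(word for review in reviews for word in review.split())
    # Sort by descending count; break ties alphabetically so the result is deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, word_index):
    """Turn a review into a list of integer word indexes."""
    return [word_index[word] for word in review.split()]

# Toy corpus, invented for this example (not taken from the IMDB dataset)
toy_reviews = ["the film was great", "the film was bad", "the plot was great"]
toy_index = build_word_index(toy_reviews)
encoded = encode("the film was great", toy_index)
print(encoded)
```

The printed list plays the same role as one inner list of x_train: one integer per word, with smaller integers for more frequent words.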

In [None]:
#Each word is assigned a unique integer.
#The most common word is assigned the number 1, the second most common word is assigned the number 2, and so on.

from keras.datasets import imdb
top_words = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=top_words)

The variable named x_train is a list of 25,000 lists, each of which represents one movie review. (x_test is also a list of 25,000 lists representing 25,000 reviews. x_train will be used for training, while x_test will be used for testing.) 

In [None]:
x_train[1]

In [None]:
# What does the dictionary look like?
# Dictionary mapping words to integers
imdb.get_word_index()

In [None]:
# Decode the first review back into words
word_dict = imdb.get_word_index()
word_dict = { key:(value + 3) for key, value in word_dict.items() }  # Shift by 3 to make room for the reserved indexes
word_dict[''] = 0  # Padding
word_dict['>'] = 1 # Start
word_dict['?'] = 2 # Unknown word
reverse_word_dict = { value:key for key, value in word_dict.items() }
print(' '.join(reverse_word_dict[idx] for idx in x_train[0]))

In the output, ">" marks the beginning of the review, while "?" marks words that aren't among the most common 10,000 words in the dataset. These "unknown" words are represented by 2s in the list of integers representing a review. 
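
The replacement of rare words can be sketched in a few lines. This is an illustrative re-implementation, assuming that any index at or above the `num_words` limit becomes the unknown-word marker 2; `cap_vocabulary` is a hypothetical helper, not a Keras function, and the sample review is made up.

```python
def cap_vocabulary(encoded_review, num_words, oov_index=2):
    """Replace every word index at or above num_words with the unknown-word index."""
    return [idx if idx < num_words else oov_index for idx in encoded_review]

# A made-up encoded review: 1 is the start marker, the rest are word indexes
toy_review = [1, 14, 22, 16, 12045, 43, 530, 25001]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)
```

Only the two indexes above the limit change; every other position in the review is left untouched.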

In [None]:
# When you train a neural network with a collection of tensors, each tensor needs to be the same length.
# Keras includes a function that takes a list of lists as input and converts the inner lists to a specified length, truncating them if necessary or padding them with 0s.

from keras.preprocessing import sequence
max_review_length = 500                    # Pad or truncate every review in x_train and x_test to 500 integers
x_train = sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)
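
To make the padding and truncation behavior concrete, here is a plain-Python sketch of what pad_sequences does, assuming its default settings: zeros are added at the front of short sequences, and long sequences lose their earliest items. `pad_or_truncate` is a hypothetical stand-in written for this example, not part of Keras.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Force every sequence to exactly maxlen items (front padding, front truncation)."""
    result = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep only the last maxlen items
        padding = [value] * (maxlen - len(seq))   # zeros needed to reach maxlen
        result.append(padding + seq)
    return result

# Two made-up encoded reviews: one shorter and one longer than maxlen
toy_sequences = [[1, 5, 9], [1, 4, 8, 15, 16, 23, 42]]
padded = pad_or_truncate(toy_sequences, maxlen=5)
print(padded)
```

After this step every inner list has the same length, which is what allows the reviews to be stacked into a single 2-dimensional tensor.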

# Neural networks to perform sentiment analysis

In [None]:
# Create a neural network with Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import Flatten

embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))  # The embedding layer maps each integer word index to a floating-point vector of 32 values, far fewer dimensions than one per vocabulary word
model.add(Flatten())                                                                      # Reshapes the output for input to the next layer
# The hidden layers
model.add(Dense(16, activation='relu'))                                                   # Fully connected layer with 16 neurons
model.add(Dense(16, activation='relu'))                                                   # Second fully connected layer with 16 neurons
model.add(Dense(1, activation='sigmoid'))                                                 # Fully connected layer with 1 neuron that predicts one output, a sentiment score from 0.0 to 1.0
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])         # Compiles the model by specifying parameters such as the loss function, the optimizer, and the metrics
model.summary()
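
The parameter counts reported by the model summary can be checked by hand. The arithmetic below is a sketch that assumes the layer sizes used in this notebook (10,000 words, 32-value embeddings, 500-word reviews); `dense_params` is a helper introduced only for this calculation.

```python
top_words = 10000
embedding_vector_length = 32
max_review_length = 500

# Embedding: one 32-float vector per word in the vocabulary
embedding_params = top_words * embedding_vector_length

# Flatten has no weights; it turns the 500 x 32 output into one long vector
flattened_size = max_review_length * embedding_vector_length

def dense_params(inputs, units):
    """Weights (inputs x units) plus one bias per unit."""
    return inputs * units + units

hidden1_params = dense_params(flattened_size, 16)
hidden2_params = dense_params(16, 16)
output_params = dense_params(16, 1)

total_params = embedding_params + hidden1_params + hidden2_params + output_params
print(total_params)
```

Most of the weights sit in the embedding layer and the first hidden layer, because the flattened vector feeding that layer has 16,000 values.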

In [None]:
# Train the model
# epochs=10 tells Keras to make 10 complete passes over the training data
# batch_size=128 tells Keras to use 128 training samples at a time to train the network
# NOTE: Smaller batch sizes sometimes increase accuracy.

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=128)
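
As a quick piece of arithmetic, the epoch and batch settings determine how many weight updates the network receives. The sketch below assumes the 25,000 training reviews and the values passed to fit in this notebook, and that the final, smaller batch of each epoch is still used.

```python
import math

num_training_reviews = 25000
batch_size = 128
epochs = 10

# The last batch of each epoch is smaller than 128, so round up
steps_per_epoch = math.ceil(num_training_reviews / batch_size)
last_batch_size = num_training_reviews - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch_size, total_updates)
```

A smaller batch size would raise the number of updates per epoch, at the cost of more steps to process.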

One indication that a model is overfitting is a growing discrepancy between the training accuracy and the validation accuracy.
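
One simple way to look for that discrepancy is to compute the per-epoch gap between the two accuracy curves. The sketch below is only an illustration: the accuracy values are invented for the example (they are not results from this model), and `accuracy_gaps` and `gap_is_growing` are hypothetical helpers.

```python
def accuracy_gaps(train_acc, val_acc):
    """Training accuracy minus validation accuracy, epoch by epoch."""
    return [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

def gap_is_growing(gaps):
    """True if every epoch's gap is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(gaps, gaps[1:]))

# Invented accuracy curves, used only to show the calculation
toy_train_acc = [0.70, 0.85, 0.92, 0.97, 0.99]
toy_val_acc = [0.72, 0.84, 0.86, 0.86, 0.85]

gaps = accuracy_gaps(toy_train_acc, toy_val_acc)
print(gaps, gap_is_growing(gaps))
```

In this notebook the same helpers could be applied to hist.history['accuracy'] and hist.history['val_accuracy'] once training has finished.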

In [None]:
# Visualize the changes in training and validation accuracy as training progresses
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set()
acc = hist.history['accuracy']
val = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training accuracy')
plt.plot(epochs, val, ':', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='upper left')
plt.show()

This plot shows the training accuracy exceeding the validation accuracy, which indicates overfitting. For that reason, we need to regularize the model so that it becomes simpler.

Another way to check for overfitting is to compare training loss to validation loss as training proceeds. For a given epoch, a training loss much lower than the validation loss can be evidence of overfitting.

In [None]:
%matplotlib inline

sns.set()
loss = hist.history['loss']
val = hist.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, '-', label='Training loss')
plt.plot(epochs, val, ':', label='Validation loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper left')
plt.show()

# Evaluate the model 

Determine how accurately the model quantifies the sentiment expressed in text, using the test data in x_test (reviews) and y_test (0s and 1s, or "labels," indicating which reviews are positive and which are negative).

In [None]:
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))
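
To show what the accuracy figure means, the sketch below recomputes it by hand on a few made-up scores. It assumes that a sigmoid output above 0.5 counts as a positive prediction; `binary_accuracy` here is a hypothetical helper written for illustration, and the scores and labels are invented.

```python
def binary_accuracy(scores, labels, threshold=0.5):
    """Fraction of scores that land on the same side of the threshold as their label."""
    predictions = [1 if score > threshold else 0 for score in scores]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Invented model outputs and true labels (1 = positive review, 0 = negative review)
toy_scores = [0.91, 0.12, 0.65, 0.40]
toy_labels = [1, 0, 0, 1]

toy_accuracy = binary_accuracy(toy_scores, toy_labels)
print("Accuracy: %.2f%%" % (toy_accuracy * 100))
```

Two of the four toy predictions match their labels, so the helper reports half of them as correct.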

# EXERCISE 

Input text of your own into the model and see how it performs.

In [None]:
import string
import numpy as np

def analyze(text):
    # Prepare the input by removing punctuation characters, converting
    # characters to lower case, and removing words containing numbers
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    text = text.lower().split(' ')
    text = [word for word in text if word.isalpha()]

    # Generate an input tensor: the start marker (1), then one index per word
    encoded = [1]
    for word in text:
        if word in word_dict and word_dict[word] < top_words:
            encoded.append(word_dict[word])
        else:
            encoded.append(2)  # Unknown word
    padded_input = sequence.pad_sequences([encoded], maxlen=max_review_length)

    # Invoke the model and return the result
    result = model.predict(padded_input)[0][0]
    return result

This function accepts a string as input and returns a number from 0.0 to 1.0 quantifying the sentiment expressed in that string. The higher the number, the more positive the sentiment. The function cleans the input string, converts it into a list of integers referencing words in the word_dict dictionary built earlier from imdb.get_word_index(), and finally calls the model's predict function to score the text for sentiment.
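
To see the intermediate list of integers without running the model, the cleaning and encoding steps can be tried on their own. The sketch below is a model-free variant under stated assumptions: `encode_text` is a hypothetical helper, and the small dictionary uses invented index values rather than the real IMDB word index.

```python
import string

def encode_text(text, word_dict, top_words, start_index=1, unknown_index=2):
    """Clean a string and convert it to a list of word indexes."""
    # Strip punctuation, lower-case the text, and keep purely alphabetic words
    translator = str.maketrans('', '', string.punctuation)
    words = text.translate(translator).lower().split()
    words = [word for word in words if word.isalpha()]

    encoded = [start_index]
    for word in words:
        idx = word_dict.get(word)
        # Words that are missing or too rare become the unknown-word index
        encoded.append(idx if idx is not None and idx < top_words else unknown_index)
    return encoded

# Toy dictionary with invented index values
toy_word_dict = {'the': 4, 'was': 16, 'great': 87, 'superb': 12000}
encoded = encode_text('The ending was GREAT, truly superb!', toy_word_dict, top_words=10000)
print(encoded)
```

Words that are absent from the dictionary and words whose index is at or above the limit both end up as 2, so the model treats them identically.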

In [None]:
# Example 1
analyze('Easily the most stellar experience I have ever had.')

In [None]:
# Example 2
analyze('The long lines and poor customer service really turned me off.')