## Implementation of the Sentiment Classifier
Here we present our implementation of an LSTM neural network that is trained to distinguish negative IMDB reviews from positive ones.

We will use Python together with Keras, both for the LSTM implementation and for its IMDB dataset.
The baseline is a final accuracy of at least 80% on the test data.

However, we don't want to simply copy Jason Brownlee's solution, which he presented in [http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/]. Thus we will try to comment on each line of code to show what is going on. We'll also try to explain why we chose each particular parameter, so no magic constants will be involved.

First of all, we need to import all the necessary modules:

1. IMDB movie review dataset. It'll be our source of data.
2. Sequential model which we will use to implement our LSTM.
3. Dense layer for single output.
4. LSTM layer.
5. Embedding layer for reducing the dimensionality of the data.
6. "sequence" module for preprocessing.

In [1]:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence

Using TensorFlow backend.


Let's discuss the structure of the dataset. The IMDB dataset is a collection of sequences, where each sequence represents a review in which every word is replaced by its index in a vocabulary constructed from these reviews.
We can access this vocabulary by using the function "get_word_index":

In [4]:
vocabulary = imdb.get_word_index()
# Keep in mind that it's quite large
# print(vocabulary)
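
To make the role of the vocabulary more concrete, here is a minimal sketch of turning an encoded review back into words. It uses a tiny hand-made dictionary (`toy_word_index`, a hypothetical stand-in for the real vocabulary) and assumes the default conventions of `load_data`: word ranks are shifted by 3, and the indices 0, 1 and 2 are reserved for padding, start-of-review and unknown-word markers.

```python
# Toy stand-in for imdb.get_word_index(): maps word -> frequency rank (1 = most frequent).
# The words and ranks are made up for illustration.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Assumed load_data defaults: ranks are shifted by 3, indices 0, 1, 2 are reserved.
INDEX_OFFSET = 3
RESERVED = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}

# Invert the dictionary so that a word can be looked up by its encoded index.
index_to_word = {rank + INDEX_OFFSET: word for word, rank in toy_word_index.items()}
index_to_word.update(RESERVED)


def decode_review(encoded):
    """Translate a list of integer indices back into a readable string."""
    return " ".join(index_to_word.get(i, "<UNK>") for i in encoded)


encoded_review = [1, 4, 5, 6, 2, 7]
decoded = decode_review(encoded_review)
print(decoded)  # <START> the movie was <UNK> great
```

With the real data, the same idea should work by replacing `toy_word_index` with the dictionary returned by `imdb.get_word_index()`.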

Now we need to load the dataset. The imdb module provides a convenient function "load_data" that returns the dataset (x, y) split in half for training and testing.

Also, this function gives us the ability to keep only the top N most frequent words. Following Jason Brownlee, we also keep only the first five thousand.

In [5]:
amount_of_first_top_words = 5000

In [6]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=amount_of_first_top_words)
print('Review representation sample: %s' % x_train[0])
print('Class representation sample: %s' % y_train[0])

Review representation sample: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]
Class representation sample: 1


We must mention that this is a standard way to represent a sentence: each word is replaced by its frequency rank, so the smaller the index, the more frequent the word.
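
As an illustration of this kind of encoding, the following sketch builds a frequency-ranked vocabulary from a made-up corpus and encodes a review with it. It is a simplification and not the code Keras uses: the real dataset additionally reserves a few small indices for special markers, which is ignored here, and the names `word_rank` and `encode` are our own.

```python
from collections import Counter

# A made-up corpus of three tiny "reviews".
corpus = [
    "good movie good plot",
    "bad movie",
    "good acting",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for review in corpus for word in review.split())

# Rank words by descending frequency (ties broken alphabetically): rank 1 = most frequent.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_rank = {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(review, top_n=None, unknown=0):
    """Replace each word by its rank; ranks beyond top_n become `unknown`."""
    ranks = [word_rank[w] for w in review.split()]
    if top_n is not None:
        ranks = [r if r <= top_n else unknown for r in ranks]
    return ranks


print(encode(corpus[0]))           # [1, 2, 1, 5]
print(encode(corpus[0], top_n=2))  # [1, 2, 1, 0]
```

The second call mimics the effect of keeping only the top N words: everything rarer is collapsed into a single "unknown" index.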

Notice that some reviews are short and some are long, so we need to either truncate them or pad them with zeros, since Keras accepts only vectors of the same length. This can be done with the "sequence" module.

Also we introduce another constant: the maximum length of the review. After looking at the length distribution, we decided to set this value to 600.

In [7]:
max_review_length = 600
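
One possible way to look at the length distribution is to check what fraction of reviews fits into a candidate maximum length without being cut. The sketch below uses made-up lengths; with the real data they could be obtained as `[len(r) for r in x_train]` before padding. The helper name `coverage` is our own.

```python
# Made-up review lengths (in words), for illustration only.
review_lengths = [45, 120, 130, 180, 230, 260, 310, 480, 590, 1200]


def coverage(lengths, max_length):
    """Fraction of reviews that fit into max_length without truncation."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)


for candidate in (200, 400, 600):
    share = 100 * coverage(review_lengths, candidate)
    print("max length %4d keeps %.0f%% of reviews intact" % (candidate, share))
```

A larger maximum keeps more reviews intact but makes every padded sequence longer, which slows down training.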

In [8]:
x_test  = sequence.pad_sequences(x_test, maxlen=max_review_length)
x_train = sequence.pad_sequences(x_train, maxlen=max_review_length)
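
To show what happens to an individual review, here is a pure-Python sketch of the padding and truncation. It assumes the default behaviour of `pad_sequences`, where both padding and truncation are applied at the beginning of the sequence; the function `pad_or_truncate` is our own illustration and not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Bring a sequence to exactly maxlen items, working at its front."""
    if len(seq) >= maxlen:
        # Too long: drop items from the beginning, keep the last maxlen ones.
        return seq[len(seq) - maxlen:]
    # Too short: fill the beginning with the padding value.
    return [value] * (maxlen - len(seq)) + seq


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=5)

print(padded_short)  # [0, 0, 1, 14, 22]
print(padded_long)   # [22, 16, 43, 530, 973]
```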

Now we'll start to prepare our model. In Keras a model is understood as a sequence or a graph of standalone, fully-configurable modules that can be plugged together with as few restrictions as possible. Combined modules form what is called a 'layer', and layers, in their turn, are grouped together to form a model. In other words, in Keras, a model can be thought of as a way to organize layers. The simplest model is "Sequential", which is just a linear stack of layers.

In [9]:
model = Sequential()

There is a problem with our dataset. Unfortunately, each word in a review is represented only by an integer, but supplying a word to Keras requires it to be a real-valued vector.

We can solve this problem by introducing vector embedding to our model as a layer:

In [10]:
embedding_vector_length = 32
model.add(Embedding(amount_of_first_top_words, embedding_vector_length, input_length=max_review_length))

Notice that we created a new constant "embedding_vector_length", which governs the dimensionality of the output vectors. We'll stick with Jason Brownlee's choice.
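
Conceptually, an embedding layer can be pictured as a trainable lookup table with one row per word index. The sketch below illustrates this with toy sizes and random numbers instead of trained weights; the names `embedding_table` and `embed` are our own and do not come from Keras.

```python
import random

random.seed(0)

vocabulary_size = 6  # toy value; the notebook uses 5000
embedding_dim = 4    # toy value; the notebook uses 32

# One row of real numbers per word index. Training would adjust these values.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocabulary_size)
]


def embed(review):
    """Replace every integer index by the corresponding row of the table."""
    return [embedding_table[index] for index in review]


toy_review = [0, 0, 1, 5, 3]
embedded = embed(toy_review)
print(len(embedded), "words x", len(embedded[0]), "dimensions")

# Size of the table for the real model: one 32-dimensional row per word.
print(5000 * 32)  # 160000
```

The last number is the amount of trainable weights that such a table would need for our vocabulary of 5000 words.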

Now it's time to add the LSTM to our model! Here we set only one parameter of this layer: the number of memory units (special neurons with memory). Jason Brownlee chose to set this parameter to 100, so we will follow:

In [11]:
model.add(LSTM(100))
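
To illustrate what a neuron with memory does, here is a sketch of one time step of a single LSTM unit with a scalar input, following the standard LSTM equations. The weights are arbitrary illustrative values, and the function `lstm_step` is our own simplification, not the Keras implementation (which works with vectors and matrices).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with a scalar input.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much of the old memory to keep
    o = gate("output", sigmoid)       # how much of the memory to expose
    g = gate("candidate", math.tanh)  # the new information itself

    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state (the unit's output)
    return h, c


# Arbitrary illustrative weights; a trained layer would learn these.
gate_names = ("input", "forget", "output", "candidate")
weights = {name: (0.5, 0.1, 0.0) for name in gate_names}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
    print("h = %+.4f, c = %+.4f" % (h, c))
```

The memory cell `c` is carried from step to step, which is what lets the unit remember earlier words of a review.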

The last layer that we need to add is a 'Dense' layer. It will act like a usual neuron with a single output, a 'sigmoid' activation function and a bias term (which Keras adds by default):

In [12]:
model.add(Dense(1, activation='sigmoid'))
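
The computation of such an output neuron can be sketched in a few lines: a weighted sum of the inputs plus a bias, squashed by the sigmoid into the range (0, 1). The inputs and weights below are made up, and the 0.5 decision threshold is an assumption for illustration.

```python
import math


def dense_sigmoid(inputs, weights, bias):
    """Single neuron: weighted sum of the inputs plus bias, then sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values; in our model the neuron receives the 100 outputs of the LSTM.
lstm_output = [0.2, -0.4, 0.1]
weights = [1.0, 0.5, -2.0]
bias = 0.1

probability = dense_sigmoid(lstm_output, weights, bias)
label = "positive" if probability >= 0.5 else "negative"
print("%.3f -> %s" % (probability, label))
```

With 100 inputs, such a neuron has 100 weights and one bias, 101 parameters in total.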

Finally, our model must be compiled, which means that we need to specify various parameters, such as the 'loss function' or 'metrics'.
We are okay with Jason Brownlee's choice.

In [15]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
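
To give an idea of what the 'binary_crossentropy' loss measures, here is a plain-Python sketch of its formula. The small constant `eps` is our own guard against taking the logarithm of zero; the exact clipping used inside Keras may differ.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
confident_and_right = binary_crossentropy(labels, [0.9, 0.1])
confident_and_wrong = binary_crossentropy(labels, [0.1, 0.9])

print("right: %.4f" % confident_and_right)
print("wrong: %.4f" % confident_and_wrong)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is what the optimizer tries to avoid.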

Let's see the summary of our model:

In [16]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 600, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301.0
Trainable params: 213,301
Non-trainable params: 0.0
_________________________________________________________________
None


Notice how big the total number of parameters is: 213,301! For such a simple model! No wonder machine learning is so hard.
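
The numbers in the summary can be reproduced by hand. The sketch below uses the standard parameter-count formulas for each layer type with the constants chosen in this notebook; the variable names are our own.

```python
vocabulary_size = 5000  # amount_of_first_top_words
embedding_dim = 32      # embedding_vector_length
lstm_units = 100

# Embedding: one vector of embedding_dim numbers per word in the vocabulary.
embedding_params = vocabulary_size * embedding_dim

# LSTM: four blocks (input, forget, output gates and the candidate), each with
# input weights, recurrent weights and a bias for every unit.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM output plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 160000 53200 101 213301
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main driver of the model size here.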

Finally, it is time to train our model with the "fit" method. It has many parameters, but we use only two: the number of epochs and the batch size. Keep in mind that training will take a while to finish.

In [None]:
model.fit(x_train, y_train, epochs=3, batch_size=64)
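
To get a feeling for how much work this call does, we can estimate the number of weight updates. The sketch assumes that the training half of the dataset contains 25,000 reviews and that the last, incomplete batch of each epoch is still processed.

```python
import math

training_reviews = 25000  # assumed size of the training half
batch_size = 64
epochs = 3

# Each batch triggers one update of the weights.
batches_per_epoch = math.ceil(training_reviews / batch_size)
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 391
print(total_updates)      # 1173
```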

Now we can assess the accuracy of our trained model through the "evaluate" method:

In [None]:
scores = model.evaluate(x_test, y_test, verbose=0)
print("Evaluated accuracy: %.2f%%" % (scores[1] * 100))
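
The accuracy metric itself is simple to compute by hand. In the sketch below, made-up predicted probabilities are turned into class labels with an assumed threshold of 0.5 and compared with the true labels; the function `accuracy` is our own illustration.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction equals the true label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, predictions) if t == p)
    return correct / len(y_true)


# Made-up labels and predicted probabilities for five reviews.
true_labels = [1, 0, 1, 1, 0]
predicted_probabilities = [0.9, 0.2, 0.4, 0.7, 0.6]

toy_accuracy = accuracy(true_labels, predicted_probabilities)
print("Toy accuracy: %.2f%%" % (toy_accuracy * 100))  # Toy accuracy: 60.00%
```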

As we can see, our model achieves 86.6% accuracy and thus passes the required baseline of 80%. This concludes our report on the first part of the assignment.