# Problem Description

The problem that we will use to demonstrate sequence learning in this tutorial is the IMDB movie review sentiment classification problem. Each movie review is a variable-length sequence of words, and the sentiment of each review must be classified.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.

The data was collected by Stanford researchers and used in a 2011 paper in which the data was split 50-50 between training and testing. The paper reported an accuracy of 88.89%.

Tutorial source: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

### Steps:
- embedding
- train
- evaluate

# Word Embedding

We will map each movie review into a real-valued vector domain using word embedding, a popular technique when working with text. Words are encoded as real-valued vectors in a high-dimensional space, where similarity in meaning between words translates to closeness in the vector space.
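To make "closeness in the vector space" concrete, here is a minimal sketch that compares toy word vectors with cosine similarity. The three vectors are hand-picked, hypothetical values chosen for illustration; they are not learned embeddings, and cosine similarity is only one common way to measure closeness.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
toy_embeddings = {
    "good":  np.array([0.9, 0.8, 0.1]),
    "great": np.array([0.8, 0.9, 0.2]),
    "awful": np.array([-0.7, -0.9, 0.1]),
}

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_awful = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])

# Words with similar meaning should score higher than words with opposite meaning.
print("good vs great:", round(sim_good_great, 3))
print("good vs awful:", round(sim_good_awful, 3))
```

In a trained model the vectors are learned from data, so the geometry reflects whatever helps the task rather than values chosen by hand.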

Keras provides a convenient way to convert positive integer representations of words into word embeddings with an Embedding layer.
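Conceptually, an embedding layer behaves like a lookup table: each integer word index selects one row of a matrix. The sketch below imitates that with plain NumPy. The sizes and the random matrix are assumptions for illustration; in Keras the matrix values are learned during training.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical small sizes for illustration (the tutorial uses 5000 and 32).
vocab_size, embedding_dim = 10, 4

# One row per word index; random here, learned during training in a real model.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A "review" is a sequence of integer word indices.
review = np.array([1, 5, 2, 5])

# Integer-array indexing picks one row per word.
embedded = embedding_matrix[review]
print(embedded.shape)
```

Note that the same index always maps to the same vector, so the second and fourth words of this toy review get identical rows.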

We will map each word onto a real-valued vector of length 32. We will also limit the vocabulary to the 5,000 most frequent words; all other words are replaced by a single out-of-vocabulary placeholder index. Finally, the sequence length (number of words) varies between reviews, so we will constrain each review to 500 words, truncating longer reviews and padding shorter ones with zero values.
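The truncate-and-pad step can be sketched in a few lines of NumPy. The helper `pad_or_truncate` is a hypothetical name introduced here; it pads and truncates at the front of the sequence, which is our understanding of the default behaviour of Keras' `pad_sequences`, but it is a simplified stand-in rather than the library's implementation.

```python
import numpy as np

def pad_or_truncate(review, maxlen, pad_value=0):
    """Return an array of exactly `maxlen` tokens.

    Long reviews keep their last `maxlen` tokens; short reviews are
    left-padded with `pad_value`.
    """
    tokens = list(review)[-maxlen:]
    padding = [pad_value] * (maxlen - len(tokens))
    return np.array(padding + tokens)

short_review = pad_or_truncate([4, 8, 15], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_review)  # [ 0  0  4  8 15]
print(long_review)   # [3 4 5 6 7]
```

After this step every review has the same length, so the whole dataset can be stored as one 2-D array and fed to the model in batches.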

Now that we have defined our problem and how the data will be prepared and modeled, we are ready to develop an LSTM model to classify the sentiment of movie reviews.

In [None]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence

# fix random seed for reproducibility
np.random.seed(7)

# Load data

In [None]:
# load the dataset, keeping only the top_words most frequent words;
# rarer words are replaced by a single out-of-vocabulary index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

# Explore data

In [None]:
word2index = imdb.get_word_index()
index2word = {i: w for w, i in word2index.items()}

In [None]:
print(X_train[0])

In [None]:
X_train.shape

In [20]:
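def to_text(review):
    # Dataset indices are shifted by 3 relative to imdb.get_word_index();
    # indices below 4 are reserved and have no entry, so they print as '?'.
    OFFSET = 3
    return ' '.join([index2word.get(word - OFFSET, '?') for word in review])

# Note: the offset of 3 is required because the indices in Keras' IMDB dataset are shifted relative to the word index that Keras provides.
# See https://github.com/fchollet/keras/issues/5912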
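The offset can be illustrated with a tiny, made-up vocabulary. This sketch assumes the default settings of `imdb.load_data` as we understand them (index 0 for padding, 1 for the start of a sequence, 2 for out-of-vocabulary words, and real words shifted by 3). The names `toy_word_index`, `RESERVED`, and `decode` are hypothetical.

```python
# Hypothetical word index in the style of imdb.get_word_index(): ranks start at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

OFFSET = 3
# Assumed meaning of the reserved indices under the default load_data settings.
RESERVED = {0: "<pad>", 1: "<start>", 2: "<oov>"}

# Shift every word by OFFSET so the reserved indices stay free.
toy_index2word = {i + OFFSET: w for w, i in toy_word_index.items()}
toy_index2word.update(RESERVED)

def decode(review):
    """Turn a sequence of dataset indices back into readable text."""
    return " ".join(toy_index2word.get(i, "<oov>") for i in review)

decoded = decode([0, 1, 4, 5, 2, 6])
print(decoded)  # <pad> <start> the movie <oov> great
```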

In [24]:
to_text(np.arange(4900,4990))

u"minimum showdown borrowed elm icon brenda polished 1984 mechanical overlook loaded map recording craven tiger roth awfully suffice troubles introduce equipment ashley wendy pamela empathy phantom betty resident unreal ruins performs promises monk iraq hippie purposes marketing angela keith sink gifted opportunities garbo assigned feminist household wacky alfred absent sneak popularity trail inducing moronic wounded receives willis unseen stretched fulci unaware dimension dolph definition testament educational survivor attend clip contest petty 13th christy respected resist year's album expressed randy quit phony unoriginal punishment activities suspend rolled eastern 1933 instinct distinct"

# Data preprocessing

In [3]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

# Train the model

In [5]:
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x11da097d0>
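The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas: an embedding has one row per word, an LSTM has four gates that each hold input weights, recurrent weights, and a bias, and a dense layer has one weight per input plus a bias. The variable names are introduced here for illustration.

```python
vocab_size = 5000      # top_words
embedding_dim = 32     # embedding_vector_length
lstm_units = 100

# Embedding: one vector of length embedding_dim per word index.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM output plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 160000 53200 101 213301
```

Most of the parameters sit in the embedding matrix, so the vocabulary size is the main lever on model size here.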

In [6]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.65%
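For intuition about what the accuracy figure measures, here is a minimal sketch of how binary accuracy can be computed from sigmoid outputs. The probabilities and labels are made-up values, and the 0.5 threshold is an assumption that matches the usual default for binary classification.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels for five reviews.
probabilities = np.array([0.92, 0.10, 0.55, 0.48, 0.73])
labels = np.array([1, 0, 0, 0, 1])

# A probability above 0.5 is read as "positive sentiment".
predictions = (probabilities > 0.5).astype(int)

# Accuracy is the fraction of reviews whose prediction matches the label.
accuracy = float(np.mean(predictions == labels))
print("Accuracy: %.2f%%" % (accuracy * 100))  # Accuracy: 80.00%
```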


TODO: davidtan [2017-10-02]
    - prepare tutorial on word2vec (see https://github.com/linanqiu/word2vec-sentiments/blob/master/word2vec-sentiment.ipynb)