# 1. **Problem Description**

The problem I use to demonstrate sequence learning in this notebook is **movie review sentiment classification**. Each movie review is a sequence of words, and the task is to determine whether a given review expresses a positive or negative sentiment. The dataset contains 50,000 movie reviews labeled good or bad, split into two parts of 25,000 reviews each for training and testing.





# 2. **Dataset**

The dataset I use is the **Large Movie Review Dataset** (often referred to as the IMDB dataset). Keras provides built-in access to it, and the **imdb.load_data()** function returns the data in a format that is ready for use in neural network and deep learning models. Each word in a review has been replaced by an integer that indicates the word's frequency rank in the dataset, so each review is a sequence of integers.
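To make the frequency-rank encoding concrete, here is a minimal sketch on a toy corpus. The three sentences and the helper names (`reviews`, `word_to_rank`) are made up for illustration and are not part of the IMDB data or the Keras API.

```python
from collections import Counter

# Toy corpus standing in for the reviews (hypothetical data, not from IMDB).
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency (ties broken alphabetically);
# rank 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace every word by its frequency rank.
encoded = [[word_to_rank[w] for w in review.split()] for review in reviews]
print(encoded)  # [[1, 3, 2, 5], [1, 3, 2, 4], [1, 6, 2, 7]]
```

The real `imdb.load_data()` output additionally reserves a few low indices (by default 0 for padding, 1 for the start of a sequence and 2 for out-of-vocabulary words) and shifts the word indices accordingly. This toy version leaves that out.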

# 3. **Word Embedding**

I will map each movie review into a real-valued vector domain using **word embedding**, a popular technique when working with text. Words are encoded as real-valued vectors in a high-dimensional space, where similarity in meaning translates to closeness in the vector space. I will map each word onto a real-valued vector of length 35. I will also limit the vocabulary to the 6000 most frequent words; Keras replaces every other word with a reserved out-of-vocabulary index (2 by default). Finally, the sequence length (number of words) varies between reviews, so I will constrain each review to 600 words, truncating longer reviews and padding shorter ones with zeros.
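Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below illustrates that idea with the sizes used in this notebook. The random matrix is only a stand-in for weights a real model would learn, and the names `embedding_matrix` and `review` are chosen here for illustration.

```python
import numpy as np

top_words = 6000      # vocabulary size used in this notebook
embedding_dim = 35    # length of each word vector

rng = np.random.default_rng(8)
# Randomly initialised table: one 35-dimensional row per word index.
# In a trained model these values are learned, not random.
embedding_matrix = rng.normal(size=(top_words, embedding_dim))

# A short encoded review fragment (integers are word indices).
review = np.array([1, 194, 1153, 194, 2])

# Embedding is a row lookup: each index selects its vector.
vectors = embedding_matrix[review]
print(vectors.shape)          # (5, 35)
print(embedding_matrix.size)  # 210000 values in the table
```

The same index always maps to the same vector, so positions 1 and 3 of `vectors` are identical here. The table holds 6000 × 35 = 210,000 values.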

# 4. **Importing useful classes and functions**

In [2]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Dropout
from keras.layers import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(8)

In [3]:
top_words = 6000 # most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [4]:
(X_train.shape, X_test.shape)

((25000,), (25000,))

Let's look at a sample review. Note that `X_train[1]` is the review at index 1, i.e. the second review in the training set.

In [5]:
print(X_train[1])
print("The length of the 1st review is {}.".format(len(X_train[1])))

[1, 194, 1153, 194, 2, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 2, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 2, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 2, 2, 349, 2637, 148, 605, 2, 2, 15, 123, 125, 68, 2, 2, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 2, 5, 2, 656, 245, 2350, 5, 4, 2, 131, 152, 491, 18, 2, 32, 2, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]
The length of the 1st review is 189.


Now pad or truncate every review to the same length.



In [6]:
max_review_length = 600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [7]:
print(X_train[1])
print("\nThe length of the 1st review is {}.".format(len(X_train[1])))

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
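The leading zeros in the output above come from the default behaviour of `pad_sequences`, which pads (and truncates) at the start of a sequence. The helper below is a simplified re-implementation for a single sequence, written only to illustrate that behaviour. `pad_pre` is a hypothetical name, not a Keras function.

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]  # keep only the last maxlen items
    return np.array([value] * (maxlen - len(seq)) + seq)

short = [1, 194, 1153]
long = list(range(1, 11))

print(pad_pre(short, 6).tolist())  # [0, 0, 0, 1, 194, 1153]
print(pad_pre(long, 6).tolist())   # [5, 6, 7, 8, 9, 10]
```

With `maxlen=600`, a 189-word review gets 411 zeros in front of it, which matches the padded output shown above.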

# 5. **LSTM model for sequence classification**

The first layer is the Embedding layer, which uses vectors of length 35 to represent each word. The next layer is an LSTM layer with 100 memory units (smart neurons). Finally, because this is a classification problem, I use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad). Because it is a binary classification problem, log loss (binary cross-entropy) is used as the loss function.
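As a rough illustration of the output layer and loss described above, the sketch below computes sigmoid probabilities, log loss, and accuracy by hand. The `logits` and `y_true` values are invented for the example, and the 0.5 threshold is the usual convention for turning probabilities into 0 or 1 predictions.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean log loss; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Hypothetical raw outputs of the final neuron for four reviews.
logits = np.array([2.0, -1.0, 0.5, -3.0])
y_true = np.array([1, 0, 0, 1])

probs = sigmoid(logits)
preds = (probs > 0.5).astype(int)   # threshold at 0.5
loss = binary_cross_entropy(y_true, probs)
accuracy = float(np.mean(preds == y_true))

print(preds.tolist())  # [1, 0, 1, 0]
print(accuracy)        # 0.5
```

The loss penalises confident wrong predictions most heavily: the last example (probability close to 0 for a positive review) contributes far more to `loss` than the third.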

In [8]:
# create the model
embedding_vecor_length = 35
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=10, batch_size=64)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 600, 35)           210000    
                                                                 
 lstm (LSTM)                 (None, 100)               54400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 264,501
Trainable params: 264,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f597a6e0d10>

In [9]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 85.04%
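The parameter counts in the model summary above can be checked by hand. The arithmetic below assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights, and a bias. The variable names are chosen here for illustration.

```python
top_words = 6000     # vocabulary size
embedding_dim = 35   # word vector length
lstm_units = 100     # LSTM memory units

# Embedding: one vector per word in the vocabulary.
embedding_params = top_words * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 210000 54400 101 264501
```

Most of the parameters sit in the embedding table, so the vocabulary size and vector length dominate the model size.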


# 6. **Using LSTM with Dropout**


Recurrent neural networks such as LSTMs are generally prone to overfitting, and dropout is a common way to reduce it.

Dropout can be applied between layers using the Keras Dropout layer. We can do this by adding new Dropout layers between the Embedding and LSTM layers and between the LSTM and Dense output layers.
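For intuition, here is a minimal NumPy sketch of what a dropout layer does during training. It uses the common "inverted dropout" scheme, in which surviving activations are scaled up by 1 / (1 - rate). The function name `dropout` and the all-ones input are illustrative assumptions.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each element with probability `rate`; rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(8)
activations = np.ones((4, 5))   # hypothetical layer output
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

Each entry of `dropped` is either 0 or 1 / 0.7, so the expected value of every activation stays the same. At evaluation time dropout is switched off and the layer passes its input through unchanged.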

In [10]:

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

In [11]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=10, batch_size=64)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 600, 35)           210000    
                                                                 
 dropout (Dropout)           (None, 600, 35)           0         
                                                                 
 lstm_1 (LSTM)               (None, 100)               54400     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 264,501
Trainable params: 264,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10

<keras.callbacks.History at 0x7f597148ea10>

In [13]:
score_with_drop = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (score_with_drop[1]*100))

Accuracy: 85.66%


# 7. **LSTM and CNN for sequence classification**

Convolutional neural networks excel at learning spatial structure in input data.

The IMDB review data has a one-dimensional spatial structure in its sequence of words, and a CNN may be able to pick out invariant features for good and bad sentiment. These learned features are then passed to the LSTM layer as a sequence.
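To show how a one-dimensional convolution and pooling step change the shape of an embedded review, here is a NumPy sketch. It uses 35 filters, a kernel size of 3, zero padding that preserves the length ('same'), a ReLU activation, and a pool size of 2. The functions `conv1d_same` and `max_pool1d` are simplified stand-ins written for this example, not the Keras implementations.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (steps, in_ch); kernels: (k, in_ch, filters); bias: (filters,)."""
    k = kernels.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)))  # zero padding
    out = np.stack([
        # Multiply a window of k steps with every filter and sum.
        np.tensordot(padded[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(x.shape[0])
    ])
    return np.maximum(out, 0.0)  # ReLU

def max_pool1d(x, pool_size):
    """Keep the maximum of each non-overlapping window of pool_size steps."""
    steps = x.shape[0] // pool_size
    return x[:steps * pool_size].reshape(steps, pool_size, -1).max(axis=1)

rng = np.random.default_rng(8)
x = rng.normal(size=(600, 35))          # one embedded review (random stand-in)
kernels = rng.normal(size=(3, 35, 35))  # kernel_size=3, 35 in, 35 filters
bias = np.zeros(35)

features = conv1d_same(x, kernels, bias)
pooled = max_pool1d(features, pool_size=2)
print(features.shape, pooled.shape)  # (600, 35) (300, 35)
print(kernels.size + bias.size)      # 3710
```

Pooling halves the sequence from 600 to 300 steps, so the LSTM that follows has half as many time steps to process. The convolution adds 3 × 35 × 35 + 35 = 3,710 parameters.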

In [17]:
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=35, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

In [18]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=8, batch_size=64)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 600, 35)           210000    
                                                                 
 conv1d_1 (Conv1D)           (None, 600, 35)           3710      
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 300, 35)          0         
 1D)                                                             
                                                                 
 lstm_3 (LSTM)               (None, 100)               54400     
                                                                 
 dense_3 (Dense)             (None, 1)                 101       
                                                                 
Total params: 268,211
Trainable params: 268,211
Non-trainable params: 0
________________________________________________

<keras.callbacks.History at 0x7f597adfa390>

In [19]:
score_res = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (score_res[1]*100))

Accuracy: 87.13%


# 8. **LSTM with CNN and Dropout**

Adding dropout layers to the combined CNN and LSTM model may further improve its accuracy.

In [20]:
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(Conv1D(filters=35, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

In [21]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=10, batch_size=64)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 600, 35)           210000    
                                                                 
 dropout_2 (Dropout)         (None, 600, 35)           0         
                                                                 
 conv1d_2 (Conv1D)           (None, 600, 35)           3710      
                                                                 
 max_pooling1d_2 (MaxPooling  (None, 300, 35)          0         
 1D)                                                             
                                                                 
 lstm_4 (LSTM)               (None, 100)               54400     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                      

<keras.callbacks.History at 0x7f597a4a2790>

In [22]:
score_cnn_with_drop = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (score_cnn_with_drop[1]*100))

Accuracy: 87.27%
