# Using word embeddings to predict sentiment of movie reviews with the IMDB dataset (Basic model with an RNN layer)


From: Deep Learning with Python (Starting at Listing 6.27)

Reproduced by: Guy Feldman
 

In [1]:
from keras.datasets import imdb
from keras import preprocessing

import numpy as np

Using TensorFlow backend.


In [2]:
# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words
max_len = 500

# The IMDB Dataset

- The features are vectors of word indices that represent a review
- The output variable, y, indicates whether a review was positive or negative.

The argument `num_words=max_features` (10,000 here) means that we keep only the 10,000 most frequently occurring words in the training set. Rarer words are not kept as distinct features (Keras maps them to a single out-of-vocabulary index). This allows us to work with vector data of manageable size.
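
To make the effect of the vocabulary cap concrete, here is a minimal pure-Python sketch applied to one review. It assumes, as with the Keras `imdb.load_data` defaults, that lower indices mean more frequent words and that any index at or above the cutoff is replaced by an out-of-vocabulary marker (`oov_char=2`). The helper `cap_vocabulary` and the toy review are hypothetical and only for illustration; they are not part of Keras or of the dataset.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word index >= num_words with the out-of-vocabulary marker."""
    return [idx if idx < num_words else oov_char for idx in sequence]


# Toy review (made-up indices): two of them fall outside the top 10,000 words.
toy_review = [1, 14, 22, 16, 12045, 530, 9999, 10000]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 22, 16, 2, 530, 9999, 2]
```

After capping, every index is below `num_words`, which is what lets the embedding layer use a table with exactly `max_features` rows.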




In [3]:
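# Load the reviews as lists of word indices, keeping only the top max_features words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# Pad or truncate every review to exactly max_len word indices
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=max_len)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=max_len)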
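
What `pad_sequences` does to each review can be sketched in plain Python. The sketch assumes the Keras defaults (`padding='pre'`, `truncating='pre'`, `value=0`): short reviews are left-padded with zeros and long reviews keep only their last `maxlen` words. The helper `pad_or_truncate` is a hypothetical stand-in for illustration, not the Keras implementation, and it assumes `maxlen` is positive.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the assumed pad_sequences defaults for a single sequence."""
    trimmed = sequence[-maxlen:]                 # 'pre' truncation drops the earliest tokens
    padding = [value] * (maxlen - len(trimmed))  # empty when the sequence is long enough
    return padding + trimmed                     # 'pre' padding puts the zeros in front


short_review = pad_or_truncate([5, 6, 7], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)  # [0, 0, 5, 6, 7]
print(long_review)   # [3, 4, 5, 6, 7]
```

Every review ends up with the same length, so the whole dataset can be stored as one integer array of shape `(samples, max_len)`.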

# Embedding Layer from Keras

In [4]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding, SimpleRNN

# Basic Model
## regression on the words

To build this model we
1. add an embedding layer that takes the input vectors representing reviews (which live in $\mathbb{N}^{\text{max_len}}$) and embeds each word in $\mathbb{R}^{32}$;
2. feed the embedded sequence into a `SimpleRNN` layer, which reads the words in order and returns a single vector in $\mathbb{R}^{32}$ summarizing the review;
3. feed that vector into a sigmoid layer for classification.
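
The `SimpleRNN` layer used in the model cell of this notebook can be understood through a small pure-Python sketch of its forward pass. It assumes the standard recurrence $h_t = \tanh(W x_t + U h_{t-1} + b)$ with a zero initial state and returns only the final state. The function name, the toy sizes, and the random weights are all hypothetical choices for illustration; a real layer learns `W`, `U`, and `b` during training.

```python
import math
import random


def simple_rnn_forward(inputs, W, U, b):
    """Return the final hidden state after reading `inputs` one timestep at a time.

    inputs: list of timesteps, each a list of input_dim floats (embedded words)
    W: units x input_dim input weights
    U: units x units recurrent weights
    b: list of units biases
    """
    units = len(b)
    state = [0.0] * units  # the hidden state starts at zero
    for x_t in inputs:
        new_state = []
        for i in range(units):
            total = b[i]
            total += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
            total += sum(U[i][k] * state[k] for k in range(units))
            new_state.append(math.tanh(total))
        state = new_state  # the new state depends on the word and on the previous state
    return state


random.seed(0)
timesteps, input_dim, units = 4, 3, 2  # toy sizes; the notebook uses 500, 32, 32
W = [[random.uniform(-1, 1) for _ in range(input_dim)] for _ in range(units)]
U = [[random.uniform(-1, 1) for _ in range(units)] for _ in range(units)]
b = [0.0] * units
sequence = [[random.uniform(-1, 1) for _ in range(input_dim)] for _ in range(timesteps)]

final_state = simple_rnn_forward(sequence, W, U, b)
reversed_state = simple_rnn_forward(sequence[::-1], W, U, b)
print(final_state)
print(reversed_state)
```

Feeding the same words in reverse order generally gives a different final state, which is the sense in which a recurrent layer is sensitive to word order.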

In [None]:
output_dim = 32
model = Sequential()
model.add(Embedding(max_features,output_dim,input_length=max_len))
model.add(SimpleRNN(output_dim))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 322,113
Trainable params: 322,113
Non-trainable params: 0
_________________________________________________________________
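
The parameter counts in the summary can be checked by hand. The sketch below uses the usual formulas for these layer types: an embedding table has one row per word, a simple recurrent layer has input weights, recurrent weights, and a bias per unit, and a dense layer has one weight per input plus a bias per unit. The helper names are hypothetical and not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per word in the vocabulary."""
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + one bias, for each unit."""
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """One weight per input plus one bias, for each unit."""
    return units * (input_dim + 1)


max_features, output_dim = 10000, 32
counts = {
    "embedding": embedding_params(max_features, output_dim),
    "simple_rnn": simple_rnn_params(output_dim, output_dim),
    "dense": dense_params(output_dim, 1),
}
total = sum(counts.values())
print(counts)  # {'embedding': 320000, 'simple_rnn': 2080, 'dense': 33}
print(total)   # 322113
```

Almost all of the parameters sit in the embedding table; the recurrent layer itself is small.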


In [None]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
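
The "Train on 20000 samples, validate on 5000 samples" line follows from `validation_split=0.2` applied to the 25,000 training reviews. The sketch below reproduces that arithmetic, assuming (as Keras does for `validation_split`) that the validation set is taken from the end of the training data; it also derives the number of batches per epoch for `batch_size=32`. The helper `split_train_validation` is a hypothetical name for illustration.

```python
import math


def split_train_validation(n_samples, validation_split):
    """Return (n_train, n_val), holding out the final fraction of samples."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at


n_train, n_val = split_train_validation(25000, 0.2)
batches_per_epoch = math.ceil(n_train / 32)
print(n_train, n_val)     # 20000 5000
print(batches_per_epoch)  # 625
```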

# Model Limitation

A model that simply flattens the embedded words and feeds them into a classifier does not model how words combine in sequence (e.g. it would likely treat both "this movie is shit" and "this movie is the shit" as negative "reviews"). The recurrent layer used here reads the words in order, but a `SimpleRNN` tends to have trouble retaining information across sequences as long as 500 words, so a gated recurrent layer such as an LSTM or GRU would likely do better.