### Let's do sentiment classification!
Keras provides a ready-to-use dataset, the IMDB movie reviews sentiment classification dataset (see https://keras.io/datasets/).

In [4]:
from keras.datasets import imdb

print ("loading data")
vocab_size = 25000
(x_train, y_train), (x_test, y_test) = imdb.load_data(path='imdb.npz',
                                                     num_words=vocab_size,
                                                     skip_top=0,
                                                     maxlen=None,
                                                     seed=113,
                                                     start_char=1,
                                                     oov_char=2,
                                                     index_from=3)
# inspect the data:
print ("number of training samples: ", len(x_train))
print ("number of test samples: ", len(x_test))
print ("examples:")
for i in range(5):
    print (str(y_train[i]) + "\t" + str(x_train[i]))

Using TensorFlow backend.


loading data
number of training samples:  25000
number of test samples:  25000
examples:
1	[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 

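x_train/x_test consist of movie reviews, each stored as a list of words. The words have already been mapped to vocabulary indices. Index 1 marks the start of a review and index 2 replaces out-of-vocabulary words (see start_char and oov_char in the call above).

y_train/y_test consist of label indices (0 for "negative" and 1 for "positive").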
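
To make the index encoding concrete, here is a minimal sketch of how an index list can be turned back into words. The real word-to-rank mapping can be fetched with `imdb.get_word_index()`; the tiny `toy_word_index` below is a made-up stand-in so the sketch runs without downloading anything. It assumes that every rank is shifted by `index_from` (3 in the call above), which leaves the low indices free for the special tokens.

```python
# Toy stand-in for the word -> rank mapping; the ranks are made up for illustration.
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # same value as the index_from argument of load_data
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<OOV>"}  # padding, start_char, oov_char

# invert the mapping and shift every rank by INDEX_FROM
index_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
index_to_word.update(SPECIAL)

def decode(indices):
    # map each index back to its token; unknown indices are shown as <?>
    return " ".join(index_to_word.get(i, "<?>") for i in indices)

encoded = [1, 4, 5, 6, 2, 7, 0, 0]
decoded = decode(encoded)
print(decoded)  # <START> the film was <OOV> great <PAD> <PAD>
```

With the real mapping in place of `toy_word_index`, the same `decode` function should make the training examples readable.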

#### First step: convert data into numpy arrays
- x_train/x_test: need to be padded (or truncated) to a fixed length (the padding token has index 0 in our vocabulary)
- y_train/y_test: need to be transformed into one-hot vectors
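
The two transformations can be illustrated on toy data. This is a minimal sketch in plain Python; the helper names `pad_or_truncate` and `one_hot` are hypothetical and only serve the illustration.

```python
def pad_or_truncate(seq, length, pad_index=0):
    # keep the first `length` items, then fill up with the padding index
    clipped = list(seq[:length])
    return clipped + [pad_index] * (length - len(clipped))

def one_hot(label, num_classes):
    # vector of zeros with a single 1 at position `label`
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

toy_reviews = [[1, 14, 22], [1, 4, 172, 112, 167, 9]]  # two reviews of different length
toy_labels = [1, 0]

padded = [pad_or_truncate(r, 4) for r in toy_reviews]
encoded_labels = [one_hot(y, 2) for y in toy_labels]
print(padded)          # [[1, 14, 22, 0], [1, 4, 172, 112]]
print(encoded_labels)  # [[0.0, 1.0], [1.0, 0.0]]
```

After this step every review has the same length, so the data fits into a rectangular array.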

In [6]:
import numpy
from keras.utils import np_utils

# Data preprocessing

def get_avg_length(input_seq):
    # average number of tokens per review, rounded down
    lengths = [len(i) for i in input_seq]
    return sum(lengths) // len(lengths)

def normalize_length(input_seq, length):
    # truncate or zero-pad every sequence to exactly `length` items
    output_seq = []
    for i in input_seq:
        i = list(i[:length])  # copy, so the original lists are not modified
        i.extend([0] * (length - len(i)))  # pad with 0 if i is too short
        output_seq.append(i)
    return output_seq
            
context_length = get_avg_length(x_train)
print (context_length)
x_train_norm = normalize_length(x_train, context_length)
x_test_norm = normalize_length(x_test, context_length)

# convert into numpy arrays
x_train_np = numpy.array(x_train_norm)
x_test_np = numpy.array(x_test_norm)
y_train_np = np_utils.to_categorical(y_train)
y_test_np = np_utils.to_categorical(y_test)

238


#### Second step: split the training data into train and dev sets

In [7]:
num_train = int(0.8 * len(x_train_norm))
x_dev_np = x_train_np[num_train:]
y_dev_np = y_train_np[num_train:]
x_train_np = x_train_np[:num_train]
y_train_np = y_train_np[:num_train]

#### Now, we define our model

In [8]:
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense, Embedding

num_classes = y_train_np.shape[1]

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=200, input_length=context_length))
model.add(Conv1D(filters=300, kernel_size=3, activation='tanh'))
model.add(GlobalMaxPooling1D())
model.add(Dense(num_classes, activation='softmax'))
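
To see what each layer does to the data, the output shapes and parameter counts can be worked out by hand. The sketch below is plain arithmetic, not Keras code. It assumes the values used in this notebook (context length 238, two classes) and the Conv1D defaults of stride 1 and no padding. The numbers should match what `model.summary()` reports.

```python
vocab_size, embedding_dim, context_length = 25000, 200, 238
filters, kernel_size, num_classes = 300, 3, 2

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_shape = (context_length, embedding_dim)
embedding_params = vocab_size * embedding_dim

# Conv1D without padding: the window fits context_length - kernel_size + 1 times
conv_shape = (context_length - kernel_size + 1, filters)
conv_params = kernel_size * embedding_dim * filters + filters  # weights + biases

# GlobalMaxPooling1D keeps the maximum over all positions for each filter
pool_shape = (filters,)

# Dense: one weight per (filter, class) pair plus one bias per class
dense_params = filters * num_classes + num_classes

total_params = embedding_params + conv_params + dense_params
print(embedding_shape, conv_shape, pool_shape)  # (238, 200) (236, 300) (300,)
print(total_params)                             # 5180902
```

Almost all parameters sit in the embedding layer, which is typical for text models with a large vocabulary.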


#### We train it
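
The model is trained with the `categorical_crossentropy` loss. As a minimal sketch of what this loss measures for a single example, assuming one-hot labels and a softmax output: it is the negative log of the probability the model assigns to the true class. Keras additionally averages over the batch and clips probabilities for numerical stability, which this sketch leaves out.

```python
import math

def categorical_crossentropy(one_hot_label, probabilities):
    # only the probability assigned to the true class contributes
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probabilities))

confident_right = categorical_crossentropy([0.0, 1.0], [0.1, 0.9])
confident_wrong = categorical_crossentropy([0.0, 1.0], [0.9, 0.1])
print(round(confident_right, 3))  # 0.105
print(round(confident_wrong, 3))  # 2.303
```

A confident wrong prediction is penalized much more heavily than a confident correct one.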

In [9]:
model.compile(loss='categorical_crossentropy', optimizer='adadelta',
              metrics=['accuracy'])
model.fit(x_train_np, 
          y_train_np, 
          shuffle=True,  # reshuffle the training data before each epoch
          validation_data=(x_dev_np, y_dev_np),
          epochs=2,  # number of passes over the training data
          batch_size=100)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1aed9237390>

#### ... and finally evaluate it

In [11]:
loss, accuracy = model.evaluate(x_dev_np, y_dev_np, batch_size=100)
print ("on dev:", accuracy)
loss, accuracy = model.evaluate(x_test_np, y_test_np, batch_size=100)
print ("on test:", accuracy)

on dev: 0.805599998236
on test: 0.805039998293
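
For a softmax output with one-hot labels, the reported accuracy corresponds to the fraction of examples whose most probable class equals the true class. The sketch below reproduces that computation on toy data. The predictions are made up, and the helper names are hypothetical.

```python
def argmax(values):
    # position of the largest value
    return max(range(len(values)), key=lambda k: values[k])

def accuracy(predictions, one_hot_labels):
    # count examples where predicted and true class agree
    hits = sum(argmax(p) == argmax(y) for p, y in zip(predictions, one_hot_labels))
    return hits / len(one_hot_labels)

toy_predictions = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
toy_labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
toy_accuracy = accuracy(toy_predictions, toy_labels)
print(toy_accuracy)  # 0.75
```

The same computation could be applied to the output of `model.predict(x_test_np)` to double-check the number printed above.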
