In [3]:
import keras
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## Classifying movie reviews: a binary classification example

### The IMDB dataset
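This is only an example, so we keep just the 10,000 most frequent words. The dataset is already encoded: each review is a list of integers, and each integer is the id of a word.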

In [6]:
from keras.datasets import imdb

max_feature = 10000
(train_data, train_labels),(test_data, test_labels) = imdb.load_data(num_words=max_feature)

In [7]:
np.r_[train_data[0]]

array([   1,   14,   22,   16,   43,  530,  973, 1622, 1385,   65,  458,
       4468,   66, 3941,    4,  173,   36,  256,    5,   25,  100,   43,
        838,  112,   50,  670,    2,    9,   35,  480,  284,    5,  150,
          4,  172,  112,  167,    2,  336,  385,   39,    4,  172, 4536,
       1111,   17,  546,   38,   13,  447,    4,  192,   50,   16,    6,
        147, 2025,   19,   14,   22,    4, 1920, 4613,  469,    4,   22,
         71,   87,   12,   16,   43,  530,   38,   76,   15,   13, 1247,
          4,   22,   17,  515,   17,   12,   16,  626,   18,    2,    5,
         62,  386,   12,    8,  316,    8,  106,    5,    4, 2223, 5244,
         16,  480,   66, 3785,   33,    4,  130,   12,   16,   38,  619,
          5,   25,  124,   51,   36,  135,   48,   25, 1415,   33,    6,
         22,   12,  215,   28,   77,   52,    5,   14,  407,   16,   82,
          2,    8,    4,  107,  117, 5952,   15,  256,    4,    2,    7,
       3766,    5,  723,   36,   71,   43,  530,  4

Each label gives the sentiment of the corresponding review: 1 for "positive", 0 for "negative".

In [8]:
train_labels[0]

1

In [9]:
max([max(sequence) for sequence in train_data])

9999

#### word_index is a dictionary mapping each word to its id

In [11]:
word_index = imdb.get_word_index()

In [12]:
reverse_word_index = dict([(value,key) for (key,value) in word_index.items()])

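word_index is a dictionary that maps each word to its integer id. In the encoded reviews these ids are shifted by 3, because ids 0, 1, and 2 are reserved: 0 is "padding", used to fill out sequences that are too short; 1 is "start of sequence", marking where a review begins; and 2 is "unknown", standing in for any word that is not in the vocabulary. These three reserved ids are not entries of word_index itself, which is why the decoding step subtracts 3.

Next, use word_index to translate the id-encoded review back into text.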
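The offset of 3 is easiest to see on a toy vocabulary. The sketch below is a minimal illustration, not the notebook's own code: `toy_word_index`, `encoded`, and `decode` are hypothetical names, and the toy dictionary only imitates the word-to-rank shape returned by `imdb.get_word_index()`.

```python
# Hypothetical toy vocabulary with the same shape as imdb.get_word_index():
# word -> rank, with ranks starting at 1.
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Invert it so that a rank can be looked up to recover the word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# Encoded ids are shifted by 3 because 0, 1 and 2 are reserved.
# Here: 1 = start of sequence, 2 = unknown, 4 = "the", 5 = "film", 6 = "great".
encoded = [1, 4, 5, 2, 6]


def decode(ids, reverse_index, offset=3):
    """Map encoded ids back to words; reserved ids fall through to '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in ids)


decoded = decode(encoded, toy_reverse_index)
print(decoded)
```

The reserved ids have no entry after the shift, so they show up as "?", just as the start marker and out-of-vocabulary words do in a decoded review.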

In [13]:
decoded_review = " ".join([reverse_word_index.get(i-3,"?") for i in train_data[0]])

In [14]:
decoded_review

"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you th

### Preparing the data
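Turn each list of word ids into a fixed-length vector. The encoding used here is a one-hot style (multi-hot) encoding: the vector's length equals the vocabulary size set at the start (max_feature), and if a review contains the word with id 50, position 50 of its vector is 1. Positions whose ids do not occur in the review stay 0.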
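As a small illustration of this encoding, the sketch below applies the same idea to two toy sequences with a vocabulary of only 5 ids. It assumes NumPy is available; `multi_hot` and `toy_sequences` are names chosen for this example.

```python
import numpy as np


def multi_hot(sequences, dimension):
    """Return a (len(sequences), dimension) array with 1.0 at each id present."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Index with a list of column ids: every listed column in row i is set.
        results[i, sequence] = 1.0
    return results


# Two toy "reviews"; id 3 appears twice in the first one.
toy_sequences = [[1, 3, 3], [0, 4]]
toy_vectors = multi_hot(toy_sequences, dimension=5)
print(toy_vectors)
```

Because the assignment only sets positions to 1, a repeated id leaves the same mark as a single occurrence: the encoding records which words appear, not how often or in what order.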

In [16]:
import numpy as np
from tqdm import tqdm

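def vectorize_sequences(sequences,dimension=max_feature):
    # One row per review, one column per word id; start with all zeros.
    results = np.zeros((len(sequences),dimension))
    for i, sequence in tqdm(enumerate(sequences)):
        # Set every column whose word id occurs in this review to 1.
        results[i,sequence] = 1.
    return results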

In [17]:
x_train = vectorize_sequences(train_data,max_feature)

25000it [00:00, 30604.32it/s]


In [18]:
x_test = vectorize_sequences(test_data,max_feature)

25000it [00:00, 31106.86it/s]


In [19]:
x_train[0]
x_train.shape

(25000, 10000)

In [20]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

### Building our network

In [21]:
from keras import models
from keras import layers
from keras.utils import plot_model

In [27]:
model = models.Sequential()
model.add(layers.Dense(units=16,kernel_initializer='lecun_normal',activation='selu',input_shape=(max_feature,)))
model.add(layers.Dense(units=16,kernel_initializer='lecun_normal',activation='selu'))
model.add(layers.Dense(units=1,kernel_initializer='lecun_normal',activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])

In [28]:
plot_model(model=model,show_layer_names=True,show_shapes=True,to_file='images/3-5-classifying-movie-reviews.png')
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 16)                160016    
_________________________________________________________________
dense_5 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 17        
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
None
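The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic; `dense_params` and `layer_shapes` are names introduced here for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# (inputs, units) for the three Dense layers of the model above.
layer_shapes = [(10000, 16), (16, 16), (16, 1)]

per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because its input is the 10,000-dimensional review vector.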


In [29]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

In [34]:
history = model.fit(x=x_train,y=y_train,epochs=5,batch_size=64,validation_split=0.2,verbose=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
 - 3s - loss: 0.0158 - acc: 0.9952 - val_loss: 1.0645 - val_acc: 0.8588
Epoch 2/5
 - 3s - loss: 0.0140 - acc: 0.9956 - val_loss: 1.0928 - val_acc: 0.8576
Epoch 3/5
 - 3s - loss: 0.0111 - acc: 0.9967 - val_loss: 1.1383 - val_acc: 0.8566
Epoch 4/5
 - 4s - loss: 0.0089 - acc: 0.9974 - val_loss: 1.1884 - val_acc: 0.8592
Epoch 5/5
 - 4s - loss: 0.0074 - acc: 0.9979 - val_loss: 1.2406 - val_acc: 0.8542


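This simple network overfits quite badly: training accuracy is already above 99%, while validation accuracy is only about 85%. The validation loss also rises in every epoch of the log, from 1.06 to 1.24.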
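To make the gap concrete, the numbers from the training log above can be compared directly. In the sketch below the three lists are copied from that log; the variable names and the checks themselves are only an illustration, not part of the original notebook.

```python
# Values copied from the five-epoch training log above.
train_acc = [0.9952, 0.9956, 0.9967, 0.9974, 0.9979]
val_acc = [0.8588, 0.8576, 0.8566, 0.8592, 0.8542]
val_loss = [1.0645, 1.0928, 1.1383, 1.1884, 1.2406]

# Gap between training and validation accuracy for each epoch.
gaps = [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# True if validation loss increases from every epoch to the next.
val_loss_rising = all(a < b for a, b in zip(val_loss, val_loss[1:]))

print(gaps, best_epoch, val_loss_rising)
```

A training accuracy that keeps improving while validation loss only gets worse is the usual signature of overfitting: further epochs fit the training set more closely without helping on held-out data.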