# conv1d_reuters_data
- Following Chapter 6 of 「[PythonとKerasによるディープラーニング][4]」, this notebook applies conv1d to the data used by [reuters_mlp.py][2] in the Keras [examples][1].
- The Reuters data is prepared with the fasttext approach from [imdb_fasttext.py][3].

[1]: https://github.com/keras-team/keras/tree/master/examples
[2]: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py
[3]: https://github.com/keras-team/keras/blob/master/examples/imdb_fasttext.py
[4]: https://book.mynavi.jp/ec/products/detail/id=90124
## Goal and result
### Goal
As with fasttext_reuters_data, the goal is accuracy = 0.8 on the test data.
### Result
At epoch 10/10 the log shows loss: 0.0599 - acc: 0.9841 - val_loss: 0.0596 - val_acc: 0.9847, and the test data gives loss 0.06 and acc 0.98.

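## Changes
No major changes.
## Other notes
### About the result
The result is unsettling.
1. acc and val_acc are both 0.9783 right from the first epoch
1. Over the 10 epochs the acc values creep up and the loss values keep creeping down
1. The test data gives much the same result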

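Out of concern I reran fasttext_reuters_data right afterwards, and its result was unchanged. The sample counts are also reported correctly (Train on 7185 samples, validate on 1797 samples), so the data itself seems fine. The plan was to start with a CNN and then add an RNN on top of it as an improvement, so this was an anticlimax.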
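
One possible explanation, offered as an assumption and not verified against this notebook's run: the model is compiled with `loss='binary_crossentropy'`, and in that case Keras 2.x resolves `metrics=['acc']` to element-wise binary accuracy, so each of the 46 one-hot outputs is scored separately. A model whose outputs are all below 0.5 then gets 45 of the 46 entries right, i.e. 45/46 ≈ 0.9783, which is the value seen from the first epoch. The NumPy sketch below reproduces that arithmetic on synthetic labels. The three helper functions are illustrative re-implementations, not the Keras functions.

```python
import numpy as np

num_classes = 46          # number of Reuters topics in this notebook
n_samples = 1000          # synthetic sample count, chosen arbitrarily

rng = np.random.default_rng(0)
labels = rng.integers(0, num_classes, size=n_samples)
y_true = np.eye(num_classes)[labels]            # one-hot targets

# Assumed "uninformative" softmax output: every class gets probability 1/46.
y_pred = np.full((n_samples, num_classes), 1.0 / num_classes)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Element-wise accuracy over all outputs (illustrative helper)."""
    return float(np.mean((y_pred > threshold) == (y_true > 0.5)))


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose argmax matches the label (illustrative helper)."""
    return float(np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)))


def binary_crossentropy(y_true, y_pred):
    """Mean element-wise binary cross-entropy (illustrative helper)."""
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))


print("binary accuracy     :", binary_accuracy(y_true, y_pred))
print("45 / 46             :", 45 / 46)
print("categorical accuracy:", categorical_accuracy(y_true, y_pred))
print("binary cross-entropy:", binary_crossentropy(y_true, y_pred))
```

If this is what happened, the reported 0.98 is not comparable with the 0.8 target, which assumes per-sample accuracy. The uniform-output loss in the sketch is also about 0.10, close to the first-epoch loss in the log. Rerunning with `loss='categorical_crossentropy'` would be one way to check.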
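## Lightweight
Somewhere in Chapter 6 of 「[PythonとKerasによるディープラーニング][4]」 this approach is described as "lightweight", and it certainly is. The final run was on Colab, but it is light enough to run on a PC as well.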

# Data preparation

In [1]:
'''This example demonstrates the use of fasttext for text classification
Based on Joulin et al's paper:
Bags of Tricks for Efficient Text Classification
https://arxiv.org/abs/1607.01759
Results on IMDB datasets with uni and bi-gram embeddings:
    Uni-gram: 0.8813 test accuracy after 5 epochs. 8s/epoch on i7 cpu.
    Bi-gram : 0.9056 test accuracy after 5 epochs. 2s/epoch on GTx 980M gpu.
'''

from __future__ import print_function
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.layers import GlobalMaxPooling1D
from keras.datasets import imdb


def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2)
    {(4, 9), (4, 1), (1, 4), (9, 4)}
    >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=3)
    {(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))


def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of list (sequences) by appending n-grams values.
    Example: adding bi-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017}
    >>> add_ngram(sequences, token_indice, ngram_range=2)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]]
    Example: adding tri-gram
    >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
    >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017, (7, 9, 2): 2018}
    >>> add_ngram(sequences, token_indice, ngram_range=3)
    [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42, 2018]]
    """
    new_sequences = []
    for input_list in sequences:
        new_list = input_list[:]
        for ngram_value in range(2, ngram_range + 1):
            for i in range(len(new_list) - ngram_value + 1):
                ngram = tuple(new_list[i:i + ngram_value])
                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)

    return new_sequences

# Set parameters:
# ngram_range = 2 will add bi-grams features
ngram_range = 2
# max_features = 20000
max_features = 2500
maxlen = 400
batch_size = 32
embedding_dims = 50
epochs = 5

Using TensorFlow backend.


In [2]:
from keras.datasets import reuters
max_words = 1000
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                         test_split=0.2)

In [3]:
l_x_train = len(x_train)
l_x_test = len(x_test)
m_x_train =  np.mean(list(map(len, x_train)), dtype=int)
m_x_test =  np.mean(list(map(len, x_test)), dtype=int)

print(l_x_train, 'train sequences')
print(l_x_test, 'test sequences')
print('Average train sequence length: {}'.format(m_x_train))
print('Average test sequence length: {}'.format(m_x_test))

8982 train sequences
2246 test sequences
Average train sequence length: 145
Average test sequence length: 147


In [4]:
%%time
if ngram_range > 1:
    print('Adding {}-gram features'.format(ngram_range))
    # Create set of unique n-gram from the training set.
    ngram_set = set()
    for input_list in x_train:
        for i in range(2, ngram_range + 1):
            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)

    # Dictionary mapping n-gram token to a unique integer.
    # Integer values are greater than max_features in order
    # to avoid collision with existing features.
    start_index = max_features + 1
    token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
    indice_token = {token_indice[k]: k for k in token_indice}

    print("max_features before np.max()",max_features)
    # max_features is the highest integer that could be found in the dataset.
    max_features = np.max(list(indice_token.keys())) + 1
    print("max_features after  np.max()",max_features)

    # Augmenting x_train and x_test with n-grams features
    x_train = add_ngram(x_train, token_indice, ngram_range)
    x_test = add_ngram(x_test, token_indice, ngram_range)
    print('Average train sequence length: {}'.format(
        np.mean(list(map(len, x_train)), dtype=int)))
    print('Average test sequence length: {}'.format(
        np.mean(list(map(len, x_test)), dtype=int)))

Adding 2-gram features
max_features before np.max() 2500
max_features after  np.max() 93394
Average train sequence length: 290
Average test sequence length: 289
CPU times: user 2.83 s, sys: 15.6 ms, total: 2.85 s
Wall time: 2.85 s
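
To make the index bookkeeping in the cell above concrete, here is a minimal sketch on toy data. The sequences and `toy_max_features` are hypothetical, and `sorted()` is used only so the toy ids are deterministic (the notebook enumerates the raw set, so its ids depend on set ordering).

```python
def create_ngram_set(input_list, ngram_value=2):
    """Same helper as in the notebook: the set of n-grams in one sequence."""
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))


toy_train = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]   # hypothetical integer sequences
toy_max_features = 10                          # hypothetical vocabulary size

# Collect every unique bi-gram seen in the training sequences.
ngram_set = set()
for seq in toy_train:
    ngram_set.update(create_ngram_set(seq, ngram_value=2))

# N-gram ids start above the word ids so the two ranges cannot collide.
start_index = toy_max_features + 1
token_indice = {ng: k + start_index for k, ng in enumerate(sorted(ngram_set))}

# The embedding must now cover the highest n-gram id as well.
new_max_features = max(token_indice.values()) + 1

# Augment one sequence by appending the id of each bi-gram it contains.
seq = toy_train[0]
augmented = seq + [token_indice[bg] for bg in zip(seq, seq[1:]) if bg in token_indice]

print(len(ngram_set), "unique bi-grams")
print("new_max_features:", new_max_features)
print("augmented:", augmented)
```

The same arithmetic applied to the output above gives 93394 - 2501 = 90893 unique bi-grams in the training set. A training sequence of length n gains n - 1 bi-gram ids, which is why the average length roughly doubles from 145 to 290.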


In [5]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


Pad sequences (samples x time)
x_train shape: (8982, 400)
x_test shape: (2246, 400)
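
As a rough illustration of what the padding step does, the sketch below mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) in plain NumPy. `pad_pre` is a hypothetical helper written for this note, not part of Keras.

```python
import numpy as np


def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and drop leading tokens of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                      # nothing to copy, row stays all padding
        trunc = seq[-maxlen:]             # keep only the last maxlen tokens
        out[row, -len(trunc):] = trunc    # right-align, padding stays on the left
    return out


toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]     # hypothetical sequences
padded = pad_pre(toy, maxlen=5)
print(padded)
print(padded.shape)
```

Because the n-gram ids are appended at the end of each sequence, any augmented sequence longer than `maxlen=400` loses tokens from its front under these defaults, which means original words are dropped before n-gram ids are.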


In [6]:
import keras
num_classes = np.max(y_train) + 1
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
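
For reference, the one-hot encoding performed by `to_categorical` can be sketched in NumPy as follows. The labels are toy values, and the sketch assumes, as the cell above does, that the largest label appears in the training data.

```python
import numpy as np

toy_labels = np.array([3, 0, 2])                 # hypothetical class ids
toy_num_classes = int(np.max(toy_labels)) + 1    # same rule as num_classes above

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(toy_num_classes, dtype="float32")[toy_labels]
print(one_hot)
```

With the Reuters labels this yields 46 columns per sample, exactly one of which is 1.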

# conv1d

In [9]:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

embedding_dims=128 # value used in the original example (see the commented-out Embedding line below)

print("max_features: ",max_features,"embedding_dims: ",embedding_dims,"input_length: ",maxlen,
      "batch_size: ", 128, "epochs: ", 10)

model = Sequential()
# model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
# model.add(layers.Dense(1))
model.add(layers.Dense(num_classes, activation='softmax'))

model.summary()

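# NOTE: with binary_crossentropy, 'acc' is reported as element-wise binary
# accuracy over the 46 one-hot outputs, not as per-sample accuracy.
# categorical_crossentropy is the usual loss for a single-label softmax.
model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])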
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 400, 128)          11954432  
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 394, 32)           28704     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 78, 32)            0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 72, 32)            7200      
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 46)                1518      
Total params: 11,991,854
Trainable params: 11,991,854
Non-trainable params: 0
________________________________________________________________
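
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes what the layer definitions above imply: 'valid' padding, stride 1 for `Conv1D`, and a pooling stride equal to the pool size. The helper names are made up for this note.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 'valid', stride-1 Conv1D."""
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of a 'valid' MaxPooling1D whose stride equals pool_size."""
    return (length - pool_size) // pool_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


max_features, embedding_dims, maxlen, num_classes = 93394, 128, 400, 46

len_conv1 = conv1d_out_len(maxlen, 7)          # 400 -> 394
len_pool = maxpool1d_out_len(len_conv1, 5)     # 394 -> 78
len_conv2 = conv1d_out_len(len_pool, 7)        # 78 -> 72

params = {
    "embedding": max_features * embedding_dims,
    "conv1d_a": conv1d_params(7, embedding_dims, 32),
    "conv1d_b": conv1d_params(7, 32, 32),
    "dense": 32 * num_classes + num_classes,
}
total = sum(params.values())

print(len_conv1, len_pool, len_conv2)
print(params, total)
```

The large majority of the 11,991,854 parameters sit in the embedding layer, whose size is driven by the bi-gram vocabulary.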

# Just in case

In [None]:
from keras.layers import GlobalMaxPooling1D
embedding_dims=1472; epochs=5
print("max_features: ",max_features,"embedding_dims: ",embedding_dims,"input_length: ",maxlen,"batch_size: ",batch_size,"epochs: ",epochs)
model = Sequential()
model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
# model.add(GlobalAveragePooling1D())
model.add(GlobalMaxPooling1D())
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 
#           validation_data=(x_test, y_test))
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 
          validation_split=0.1)
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

# Results
## conv1d
```
max_features:  93394 embedding_dims:  128 input_length:  400 batch_size:  128 epochs:  10
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 128)          11954432  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 394, 32)           28704     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 78, 32)            0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 72, 32)            7200      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 46)                1518      
=================================================================
Total params: 11,991,854
Trainable params: 11,991,854
Non-trainable params: 0
_________________________________________________________________
Train on 7185 samples, validate on 1797 samples
Epoch 1/10
7185/7185 [==============================] - 4s 517us/step - loss: 0.1020 - acc: 0.9783 - val_loss: 0.0992 - val_acc: 0.9783
Epoch 2/10
7185/7185 [==============================] - 2s 224us/step - loss: 0.0951 - acc: 0.9783 - val_loss: 0.0897 - val_acc: 0.9783
Epoch 3/10
7185/7185 [==============================] - 2s 225us/step - loss: 0.0837 - acc: 0.9783 - val_loss: 0.0786 - val_acc: 0.9783
Epoch 4/10
7185/7185 [==============================] - 2s 224us/step - loss: 0.0759 - acc: 0.9783 - val_loss: 0.0740 - val_acc: 0.9783
Epoch 5/10
7185/7185 [==============================] - 2s 224us/step - loss: 0.0724 - acc: 0.9794 - val_loss: 0.0707 - val_acc: 0.9822
Epoch 6/10
7185/7185 [==============================] - 2s 226us/step - loss: 0.0690 - acc: 0.9828 - val_loss: 0.0674 - val_acc: 0.9833
Epoch 7/10
7185/7185 [==============================] - 2s 224us/step - loss: 0.0658 - acc: 0.9835 - val_loss: 0.0645 - val_acc: 0.9835
Epoch 8/10
7185/7185 [==============================] - 2s 223us/step - loss: 0.0633 - acc: 0.9836 - val_loss: 0.0624 - val_acc: 0.9835
Epoch 9/10
7185/7185 [==============================] - 2s 226us/step - loss: 0.0614 - acc: 0.9837 - val_loss: 0.0609 - val_acc: 0.9838
Epoch 10/10
7185/7185 [==============================] - 2s 225us/step - loss: 0.0599 - acc: 0.9841 - val_loss: 0.0596 - val_acc: 0.9847
2246/2246 [==============================] - 0s 107us/step
Test score: 0.05984580424595176
Test accuracy: 0.9848135746915958
```