## Classification of text articles

Links
* [Reuters newswire dataset](https://keras.io/datasets/)
* [Reuters text classification](https://www.bonaccorso.eu/2016/08/02/reuters-21578-text-classification-with-gensim-and-keras/)
* [Keras Reuters MLP example](https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py)

In [147]:
import numpy as np

import keras
from keras import backend as K
from keras.datasets import reuters
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer

The Reuters dataset contains 11,228 newswires, each labeled with one of 46 topics. Every newswire is encoded as a sequence of word indices, and the word index that maps words to those integers is stored in the reuters_word_index.json file.

In [148]:
word_index = reuters.get_word_index(path="reuters_word_index.json")
print('There are', len(word_index), 'words used to encode.')

There are 30979 words used to encode.
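
To check what an encoded newswire actually says, the word index can be inverted. The sketch below assumes the default settings of `reuters.load_data`, where indices 0, 1 and 2 are reserved for padding, start-of-sequence and out-of-vocabulary markers, so every word index is shifted by 3. The helper `decode_newswire` and the tiny `toy_word_index` are hypothetical names used only for illustration; with the real data you would pass `word_index` and one sequence returned by `reuters.load_data`.

```python
def decode_newswire(sequence, word_index, index_from=3):
    """Map a list of encoded word indices back to a readable string."""
    # Invert the mapping: word -> index becomes index -> word.
    index_to_word = {index: word for word, index in word_index.items()}
    # Indices below index_from are reserved markers, so they decode to '?'.
    return ' '.join(index_to_word.get(i - index_from, '?') for i in sequence)


# Toy word index standing in for reuters.get_word_index() (illustrative only).
toy_word_index = {'the': 1, 'of': 2, 'oil': 3, 'prices': 4}

# 1 is the start marker and 2 the out-of-vocabulary marker.
decoded = decode_newswire([1, 6, 7, 5, 4, 2], toy_word_index)
print(decoded)
```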


In [149]:
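max_words = 1000  # vocabulary size: only the most frequent words are kept
batch_size = 32   # samples per gradient update
epochs = 5        # full passes over the training data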

In [150]:
print('Loading Reuters data...')
(x_train, y_train), (x_test, y_test) = reuters.load_data(
    num_words=max_words, test_split=0.2)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

num_classes = np.max(y_train) + 1
print(num_classes, 'classes')

Loading Reuters data...
8982 train sequences
2246 test sequences
46 classes


In [151]:
print('Vectorizing sequence data...')
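# Turn each variable-length sequence of word indices into a fixed-length
# 0/1 vector that marks which of the max_words words occur in it.
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')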
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Vectorizing sequence data...
x_train shape: (8982, 1000)
x_test shape: (2246, 1000)
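
The `binary` mode produces one row per newswire and one column per word, with a 1 wherever that word occurs at least once. As a rough NumPy sketch of the same idea (not the Keras implementation; `sequences_to_binary_matrix` and the toy data are hypothetical):

```python
import numpy as np


def sequences_to_binary_matrix(sequences, num_words):
    """Multi-hot encode sequences: entry [i, j] is 1 if index j occurs in sequence i."""
    matrix = np.zeros((len(sequences), num_words))
    for row, sequence in enumerate(sequences):
        for index in sequence:
            # Indices outside the vocabulary are skipped.
            if index < num_words:
                matrix[row, index] = 1.0
    return matrix


# Two toy sequences; the repeated 3 still yields a single 1, and 5 is dropped.
toy_sequences = [[1, 3, 3], [2, 5]]
toy_matrix = sequences_to_binary_matrix(toy_sequences, num_words=5)
print(toy_matrix)
```

Word order and word counts are discarded by this encoding, which is why every newswire ends up with the same length of 1000 regardless of how long the original text was.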


In [152]:
print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

Convert class vector to binary class matrix (for use with categorical_crossentropy)
y_train shape: (8982, 46)
y_test shape: (2246, 46)
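
Each integer label becomes a row of length 46 with a single 1 at the position of its class. A minimal NumPy sketch of that conversion, assuming integer labels in the range 0 to num_classes - 1 (the helper `one_hot` and the toy labels are hypothetical, not part of Keras):

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    matrix = np.zeros((labels.shape[0], num_classes))
    # Row i gets a 1 in column labels[i].
    matrix[np.arange(labels.shape[0]), labels] = 1.0
    return matrix


toy_labels = [3, 0, 2]
toy_one_hot = one_hot(toy_labels, num_classes=4)
print(toy_one_hot)

# argmax along each row recovers the original integer labels.
recovered = toy_one_hot.argmax(axis=1).tolist()
print(recovered)
```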


In [153]:
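print('Building model...')
K.clear_session()  # discard any model left over from a previous run
model = Sequential()
# Hidden layer: 512 units fed by the max_words-dimensional binary vectors.
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
# Randomly zero half of the activations during training to limit overfitting.
model.add(Dropout(0.5))
# Output layer: one unit per topic, turned into probabilities by softmax.
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()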

Building model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               512512    
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 46)                23598     
_________________________________________________________________
activation_2 (Activation)    (None, 46)                0         
=================================================================
Total params: 536,110
Trainable params: 536,110
Non-trainable params: 0
_________________________________________________________________
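
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, while the Activation and Dropout layers have no parameters. A short sketch of that arithmetic, where `dense_param_count` is a hypothetical helper:

```python
def dense_param_count(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


max_words = 1000
num_classes = 46

hidden_params = dense_param_count(max_words, 512)    # dense_1
output_params = dense_param_count(512, num_classes)  # dense_2
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```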


In [154]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 8083 samples, validate on 899 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.89095894248
Test accuracy: 0.791629563669
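
The first line of the training output follows from `validation_split=0.1`: Keras holds out the last 10% of the training arrays for validation, taken before any shuffling, and trains on the rest. The sketch below reproduces the two sizes; `validation_split_sizes` is a hypothetical helper, and its rounding rule is an assumption chosen to match the numbers reported above.

```python
def validation_split_sizes(num_samples, validation_split):
    """Return (training size, validation size) for a fractional hold-out."""
    # The leading part is used for training, the tail for validation.
    split_at = int(num_samples * (1.0 - validation_split))
    return split_at, num_samples - split_at


train_size, val_size = validation_split_sizes(8982, 0.1)
print(train_size, 'training samples,', val_size, 'validation samples')
```

The test set of 2246 newswires is not touched during training, so the reported test accuracy is measured on data the model has never seen.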
