# Deep Learning - Text Classification - Reuters Articles - Keras


This notebook explores [Keras](https://keras.io/) layers, including the embedding mechanism, on the Reuters articles dataset (smaller version).
It also uses the Keras-provided packages for sequential modeling and text preprocessing.

## Classification Tasks

The classification task for this dataset follows these steps:
- Figure out which model is suitable for the given data.
- Complete the layer representation with a suitable loss function.
- Experiment with, validate, and evaluate different models.
- Briefly document a few of the best models.

In [4]:
%matplotlib inline

import glob

import matplotlib.pyplot as plt
import numpy as np

import keras
from keras.datasets import reuters
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Input
from keras.layers import Conv1D, MaxPooling1D
from keras.models import Model, Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import get_file

### Dataset

The dataset consists of 11,228 newswires from Reuters, labeled over 46 topics. Each wire is encoded as a sequence of word indexes, following the same conventions as the Keras IMDB dataset: words are indexed by their overall frequency in the dataset.

In [None]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=None,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)
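
The sample counts reported in this notebook follow from simple arithmetic on the split fractions. The sketch below assumes that Keras truncates `n * (1 - fraction)` to an integer for both `test_split` and `validation_split`, which is consistent with the counts printed in the outputs; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, truncating the kept part to an integer."""
    kept = int(n_samples * (1 - held_out_fraction))
    return kept, n_samples - kept


# 11,228 wires with test_split=0.2
n_train, n_test = split_sizes(11228, 0.2)
# the training part with validation_split=0.1
n_fit, n_val = split_sizes(n_train, 0.1)

print("train / test:", n_train, n_test)
print("fit / validation:", n_fit, n_val)
```

Under this assumption the 11,228 wires split into 8,982 training and 2,246 test samples, and a 10% validation split of the training part leaves 8,083 samples for fitting and 899 for validation.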

Let's see how the data is formatted by printing the shapes of the variables. Refer to the Keras documentation for further information: https://keras.io/datasets/

In [None]:
print("x_train size", x_train.shape)
print("y_train size", y_train.shape)
print("x_test size", x_test.shape)
print("y_test size", y_test.shape)
print("")
print("Number of classes:", np.unique(y_train).shape[0])
word_index = reuters.get_word_index(path="reuters_word_index.json")
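
To read a wire as text, its index sequence has to be mapped back through `word_index`. A minimal sketch, assuming the `index_from=3` offset and the reserved indexes (`start_char=1`, `oov_char=2`, with 0 conventionally left for padding) from the `load_data` call above. A tiny made-up vocabulary stands in for the real `reuters.get_word_index()` result; `decode_wire` and `toy_word_index` are hypothetical names.

```python
def decode_wire(sequence, word_index, index_from=3):
    """Map a sequence of word indexes back to a readable string."""
    # Real words are shifted by index_from to make room for the reserved indexes.
    reverse_index = {idx + index_from: word for word, idx in word_index.items()}
    # Reserved indexes: 0 = padding, 1 = start marker, 2 = out-of-vocabulary.
    reverse_index.update({0: "<pad>", 1: "<start>", 2: "<oov>"})
    return " ".join(reverse_index.get(i, "<oov>") for i in sequence)


toy_word_index = {"the": 1, "of": 2, "oil": 3}  # made-up vocabulary
print(decode_wire([1, 4, 6, 2, 5, 4], toy_word_index))
# prints: <start> the oil <oov> of the
```

With the real data, the same function could be called as `decode_wire(x_train[0], word_index)`.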

### Using only dataset features 

In [4]:
MAX_NB_WORDS = 4000  # keep only word indexes below this value

num_classes = np.unique(y_train).shape[0]
print('Vectorizing sequence data...')
# Encode each wire as a fixed-length binary vector: 1 if the word index occurs, else 0
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

# One-hot encode the integer topic labels
print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)


Vectorizing sequence data...
x_train shape: (8982, 4000)
x_test shape: (2246, 4000)
Convert class vector to binary class matrix (for use with categorical_crossentropy)
y_train shape: (8982, 46)
y_test shape: (2246, 46)
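
The two encodings used above can be reproduced in a few lines of NumPy. This is a sketch of what `sequences_to_matrix(mode='binary')` and `to_categorical` are understood to compute, not the Keras implementation; `multi_hot` and `one_hot` are hypothetical helper names and the inputs are made up.

```python
import numpy as np


def multi_hot(sequences, num_words):
    """One row per sequence; column j is 1 if word index j occurs in it."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        # Word indexes outside the vocabulary are dropped.
        kept = [j for j in seq if j < num_words]
        matrix[row, kept] = 1.0
    return matrix


def one_hot(labels, num_classes):
    """One row per label with a single 1 in the label's column."""
    return np.eye(num_classes)[np.asarray(labels)]


x = multi_hot([[1, 3, 3, 7], [2, 9]], num_words=5)
y = one_hot([2, 0], num_classes=3)
print(x)
print(y)
```

Note that the binary encoding discards both word order and word counts, so each wire becomes a fixed-length vector regardless of its original length.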


In [5]:
batch_size = 128
epochs = 10

print('Building model...')
model = Sequential()
model.add(Dense(512, input_shape=(MAX_NB_WORDS,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1)

score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('\nTest score:', score[0])
print('Test accuracy:', score[1]*100)

Building model...
Train on 8083 samples, validate on 899 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.906570495502
Test accuracy: 80.7212822371
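
The reported test score is the value of the loss, here categorical cross-entropy computed on the softmax outputs. A minimal NumPy sketch for a small batch, with made-up logits and hypothetical helper names; the small `eps` only approximates the numerical safeguards a real framework applies.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))


logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]])  # made-up model outputs
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one-hot targets

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
accuracy = float(np.mean(probs.argmax(axis=1) == y_true.argmax(axis=1)))
print("loss:", loss, "accuracy:", accuracy)
```

For reference, a model that assigned equal probability to all 46 topics would have a loss of ln 46 ≈ 3.83 under this definition, so lower scores indicate that the model has learned something beyond uniform guessing.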


### Using Word Embeddings and the Keras Embedding Layer

In [6]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=None,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

print("x_train size", x_train.shape)
print("y_train size", y_train.shape)
print("x_test size", x_test.shape)
print("y_test size", y_test.shape)
print("")
print(x_train[1])
# print("Number of classes:", np.unique(y_train).shape[0])
# word_index = reuters.get_word_index(path="reuters_word_index.json")


MAX_NB_WORDS = 100

num_classes = np.unique(y_train).shape[0]
print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

x_train size (8982,)
y_train size (8982,)
x_test size (2246,)
y_test size (2246,)

[1, 3267, 699, 3434, 2295, 56, 16784, 7511, 9, 56, 3906, 1073, 81, 5, 1198, 57, 366, 737, 132, 20, 4093, 7, 19261, 49, 2295, 13415, 1037, 3267, 699, 3434, 8, 7, 10, 241, 16, 855, 129, 231, 783, 5, 4, 587, 2295, 13415, 30625, 775, 7, 48, 34, 191, 44, 35, 1795, 505, 17, 12]
Vectorizing sequence data...
x_train shape: (8982, 100)
x_test shape: (2246, 100)
Convert class vector to binary class matrix (for use with categorical_crossentropy)
y_train shape: (8982, 46)
y_test shape: (2246, 46)


[0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0
 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0]


'./embedding/glove.6B.zip'

embedding/glove.6B.100d.txt


Total 400000 word vectors in Glove 6B 100d.


(30980, 100)


Building model...
Train on 8083 samples, validate on 899 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 1.28732361808
Test accuracy: 68.6999110271
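
The cells that loaded the GloVe vectors are not included in this export, so the following is only a sketch of a common way to assemble a pretrained embedding matrix, not the code that produced the results above. It assumes a dictionary mapping words to vectors (a tiny made-up `toy_vectors` instead of the 400,000 entries of GloVe 6B 100d) and a `word_index` like the one returned by `reuters.get_word_index`; `build_embedding_matrix` is a hypothetical helper name.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the pretrained vector of the word with index i, or zeros if unknown."""
    # One extra row so that index 0 stays free (conventionally used for padding).
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix


toy_word_index = {"the": 1, "oil": 2, "opec": 3}  # made-up vocabulary
toy_vectors = {"the": np.array([0.1, 0.2]), "oil": np.array([0.3, 0.4])}
embedding_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)

# An embedding lookup is row indexing with an integer sequence.
sequence = np.array([2, 1, 3, 0])
print(embedding_matrix.shape)
print(embedding_matrix[sequence])
```

With the real data this construction would give a matrix of shape `(len(word_index) + 1, 100)`, which would be consistent with the `(30980, 100)` printed above if `word_index` has 30,979 entries. Such a matrix is typically used to initialize the weights of an `Embedding` layer.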
