## Text Classification in Keras: A Simple Reuters News Classifier

[Part 1](https://towardsdatascience.com/text-classification-in-keras-part-1-a-simple-reuters-news-classifier-9558d34d01d3)

[Part 2](https://towardsdatascience.com/text-classification-in-keras-part-2-how-to-use-the-keras-tokenizer-word-representations-fd571674df23)

In [1]:
import keras
import numpy as np
from keras.datasets import reuters


Using TensorFlow backend.


In [7]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
word_index = reuters.get_word_index(path="reuters_word_index.json")

print('# of Training Samples: {}'.format(len(x_train)))
print('# of Test Samples: {}'.format(len(x_test)))

num_classes = max(y_train) + 1
print('# of Classes: {}'.format(num_classes))
        

# of Training Samples: 8982
# of Test Samples: 2246
# of Classes: 46


In [8]:
## word_index is a dictionary: each key is a word and each value is that word's integer index.
## The index is the word's frequency rank, not a raw count: word_index['at'] returns 25,
## so 'at' is the 25th most frequently occurring word in the corpus.
##print("word index is - ", word_index)

print("index of at is ", word_index['at'])
## The reverse dictionary maps each integer index (frequency rank) back to its word.
index_to_word = {value: key for key, value in word_index.items()}

## Print the Reuters news article at index 0 by direct lookup.
## Caveat: load_data() shifts every word index by 3 by default (index_from=3),
## because the lowest indices are reserved (1 = start token, 2 = out-of-vocabulary).
## A direct lookup therefore prints shifted words, not the original text.
print(' '.join([index_to_word[x] for x in x_train[0]]))

print(y_train[0])

#[index_to_word[x] for x in x_train[0]]



index of at is  25
the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs
3
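The decoded article above reads like word salad because the indices returned by `reuters.load_data` are offset relative to `word_index`. Below is a minimal sketch of a decoder that undoes that offset, assuming the default `index_from=3`. The helper name `decode_article` and the toy vocabulary are made up for illustration; with the real data you would pass `x_train[0]` and `word_index` instead.

```python
def decode_article(sequence, word_index, index_from=3):
    """Map a Keras-encoded sequence back to words, undoing the index offset."""
    index_to_word = {index: word for word, index in word_index.items()}
    # Indices below index_from are reserved tokens (start, unknown, ...),
    # so they have no vocabulary entry and are rendered as '?'.
    return ' '.join(index_to_word.get(i - index_from, '?') for i in sequence)


# Toy vocabulary standing in for reuters.get_word_index() (hypothetical data).
toy_word_index = {'the': 1, 'of': 2, 'said': 3}

# 1 is the start token; 4, 6 and 5 are 'the', 'said' and 'of' shifted by 3.
decoded = decode_article([1, 4, 6, 5], toy_word_index)
print(decoded)
```

Using `dict.get` with a fallback keeps the decoder from raising a `KeyError` on the reserved indices.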


### Binary Tokenizer

In [9]:
from keras.preprocessing.text import Tokenizer

max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(x_train[0])
print(len(x_train[0]))
print(max(x_train[0]))

print(y_train[0])
print(len(y_train[0]))

[0. 1. 0. ... 0. 0. 0.]
10000
1.0
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
46


### Build a Keras Sequential model

In [10]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
## Each word in the vocabulary is one feature, so every input vector has max_words entries.
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
## During training, each hidden unit's output is randomly set to zero with probability 0.5.
## This regularization helps reduce overfitting; dropout is inactive at evaluation time.
model.add(Dropout(0.5))

model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.metrics_names)

batch_size = 32
epochs = 2

history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

['loss', 'acc']
Train on 8083 samples, validate on 899 samples
Epoch 1/2
Epoch 2/2
Test loss: 0.8425306087079477
Test accuracy: 0.803205699047019


### Let's use the word count tokenizer

The Keras `Tokenizer` class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, or based on tf-idf.

https://keras.io/preprocessing/text/

In `count` mode, each entry is the number of times a word appears in a Reuters article. So for every news article, the matrix row contains the number of occurrences of each word instead of binary values.
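To make the difference between the binary and count representations concrete, here is a small NumPy sketch of what such a matrix looks like. It is an illustration under assumptions, not the Keras implementation: the function name `sequences_to_count_matrix` and the toy sequences are made up, and indices at or above `num_words` are simply skipped.

```python
import numpy as np


def sequences_to_count_matrix(sequences, num_words, binary=False):
    """Build one row per sequence and one column per word index."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        for index in seq:
            if index >= num_words:
                continue  # word falls outside the kept vocabulary
            if binary:
                matrix[row, index] = 1.0   # presence only
            else:
                matrix[row, index] += 1.0  # number of occurrences
    return matrix


# Two toy "articles" (hypothetical data); index 7 is outside num_words=5.
toy_sequences = [[1, 2, 2, 7], [1, 3]]
count_matrix = sequences_to_count_matrix(toy_sequences, num_words=5)
binary_matrix = sequences_to_count_matrix(toy_sequences, num_words=5, binary=True)
print(count_matrix)
print(binary_matrix)
```

In the first toy article, word 2 occurs twice, so the count row holds 2 in that column while the binary row holds 1.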

In [11]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)

x_train = tokenizer.sequences_to_matrix(x_train, mode='count')
x_test = tokenizer.sequences_to_matrix(x_test, mode='count')

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(x_train[0])
print(len(x_train[0]))
print(max(x_train[0]))
print(np.argmax(x_train[0]))

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

[0. 1. 0. ... 0. 0. 0.]
10000
6.0
6
Train on 8083 samples, validate on 899 samples
Epoch 1/2
Epoch 2/2
Test loss: 0.8711621485325447
Test accuracy: 0.8138913624220837


### Let's use the word frequency tokenizer

In `freq` mode, each entry is the relative frequency of a word in a Reuters article: the number of occurrences divided by the total number of words in that article. The values are therefore normalized to the range 0 to 1.
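The normalization can be sketched in a few lines of NumPy. This is a hedged illustration rather than the Keras code itself: the function name `sequences_to_freq_matrix` and the toy article are assumptions, and each count is divided by the full length of the sequence.

```python
import numpy as np


def sequences_to_freq_matrix(sequences, num_words):
    """Count each word index, then divide by the article length."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty article stays an all-zero row
        for index in seq:
            if index < num_words:
                matrix[row, index] += 1.0
        matrix[row] /= len(seq)
    return matrix


# One toy article of four words (hypothetical data).
freq_matrix = sequences_to_freq_matrix([[1, 2, 2, 3]], num_words=5)
print(freq_matrix)
```

The values printed by the notebook cell are consistent with this reading: 0.01149425 is 1/87 and 0.06896551 is 6/87, which would match a maximum count of 6 in an article of 87 tokens.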

In [14]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)


x_train = tokenizer.sequences_to_matrix(x_train, mode='freq')
x_test = tokenizer.sequences_to_matrix(x_test, mode='freq')

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(x_train[0])
print(len(x_train[0]))
print(max(x_train[0]))

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])



[0.         0.01149425 0.         ... 0.         0.         0.        ]
10000
0.06896551724137931
Train on 8083 samples, validate on 899 samples
Epoch 1/2
Epoch 2/2
Test loss: 1.6475225509752256
Test accuracy: 0.5854853072393609


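### Let's use term frequency–inverse document frequency (tf-idf)

In `tfidf` mode, each entry weights how often a word appears in a Reuters article by how rare that word is across all articles, so words that occur in almost every article contribute little. Unlike the previous modes, the tokenizer must first be fitted on the training sequences with `fit_on_sequences` so that it knows how many articles contain each word.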
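As a rough sketch of the weighting, assume tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)), where N is the number of fitted documents and df is the number of documents containing the word. This formula is an assumption about what the Keras `Tokenizer` computes, and the helper `tfidf_weight` is a made-up name. It does reproduce the 0.69309152 printed by the cell, provided index 1 occurs once in each of the 8982 training articles.

```python
import numpy as np


def tfidf_weight(count, document_count, docs_with_word):
    """tf-idf weight for one word in one article (assumed weighting scheme)."""
    tf = 1.0 + np.log(count)  # dampened term frequency
    idf = np.log(1.0 + document_count / (1.0 + docs_with_word))  # rarity
    return tf * idf


# A word that appears once in the article and in every training article.
common = tfidf_weight(count=1, document_count=8982, docs_with_word=8982)
# A word that appears once in the article but in only one training article.
rare = tfidf_weight(count=1, document_count=8982, docs_with_word=1)
print(round(common, 8), rare > common)
```

The rare word receives a much larger weight than the ubiquitous one, which is the point of the inverse document frequency term.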

In [16]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)

tokenizer.fit_on_sequences(x_train)

x_train = tokenizer.sequences_to_matrix(x_train, mode='tfidf')
x_test = tokenizer.sequences_to_matrix(x_test, mode='tfidf')

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(x_train[0])
print(len(x_train[0]))
print(max(x_train[0]))

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

[0.         0.69309152 0.         ... 0.         0.         0.        ]
10000
6.214608098422191
Train on 8083 samples, validate on 899 samples
Epoch 1/2
Epoch 2/2
Test loss: 1.0412745214633823
Test accuracy: 0.7987533392963512
