# Test DNN Classifier

This notebook tests a text classifier built from an embedding layer feeding into convolution layers on the news20 data set.
The details of the data set and the performance of other classifiers are described here:
https://nlp.stanford.edu/wiki/Software/Classifier/20_Newsgroups

The aim of this classifier is to reach at least 80% accuracy on the test set.
The news20 data set has a large vocabulary and reasonably long utterances.
On other domains with smaller, restricted vocabularies I've found that the approach performs quite well, largely because the vocabulary is limited in size.
Hence the news20 data set provides a good level of complexity to test against for the multi-class text classification problem on short passages of text.

This page provides a good exploration of the data set and surveys a range of resources that test different classification algorithms against it:
https://acardocacho.github.io/capstone/

# Sourcing the Data

This project makes use of the GloVe embeddings.
Download the GloVe embeddings from this link:
- https://nlp.stanford.edu/data/glove.6B.zip

Create a directory data/glove/

Extract the contents of the archive into that directory. This should give a listing that includes glove/glove.6B.50d.txt, for example.
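Each line of a GloVe text file holds a token followed by its vector components, separated by single spaces. The sketch below shows one way to read that format. `parse_glove_lines` is a hypothetical helper written for illustration and is not the `GloveReader` from `lib`; the three-dimensional sample lines are made up.

```python
import numpy as np


def parse_glove_lines(lines, dim=50):
    """Parse GloVe-style text lines into a {token: vector} dict.

    Hypothetical helper for illustration; not the project's GloveReader.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < dim + 1:
            continue  # skip empty or malformed lines
        # The last `dim` fields are the vector; whatever precedes them is the token.
        token = " ".join(parts[:-dim])
        embeddings[token] = np.array([float(x) for x in parts[-dim:]], dtype=np.float32)
    return embeddings


# Made-up sample with 3-dimensional vectors.
sample = ["the 0.1 0.2 0.3", "cat -0.5 0.0 0.25"]
vectors = parse_glove_lines(sample, dim=3)
```

For the real file, an open handle can be passed in place of the sample list, for example `open('data/glove/glove.6B.50d.txt', encoding='utf-8')` with `dim=50`.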


Obtain the news20 data set. It has been processed and generously made available here: http://ana.cachopo.org/datasets-for-single-label-text-categorization
Download the training and test sets:
- http://ana.cachopo.org/datasets-for-single-label-text-categorization/20ng-train-all-terms.txt?attredirects=0
and
- http://ana.cachopo.org/datasets-for-single-label-text-categorization/20ng-test-all-terms.txt?attredirects=0

Place the text files in the path data/news20/ so the listing will be:

- data/news20/20ng-train-all-terms.txt
- data/news20/20ng-test-all-terms.txt
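As far as I understand the published format, each line of these files is one document: the class label, a tab character, then the space-separated terms. A minimal sketch of reading that layout follows. `parse_labeled_lines` is a hypothetical name, and the real `TextReader.read_labeled_documents` may work differently.

```python
def parse_labeled_lines(lines):
    """Split 'label<TAB>term term ...' lines into labels and token lists.

    Hypothetical helper; assumes one document per line.
    """
    labels, documents = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        label, _, text = line.partition("\t")
        labels.append(label)
        documents.append(text.split())
    return labels, documents


# Made-up sample lines in the assumed format.
sample = ["alt.atheism\tdoes god exist", "sci.space\tlaunch window opens"]
labels, documents = parse_labeled_lines(sample)
```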



The first step is to prepare the data for use with the network.

In [1]:
import sys
import os
import numpy as np
import pickle

sys.path.insert(0, 'lib')

from lib import TextReader
from lib import GloveReader


basedir = 'data'


data_path = os.path.join(basedir, 'news20')
data_path = os.path.join(data_path, 'train_data_all.pickle')
vocab = []
all_words = []
all_classes = []
targets = None
sequences = None
reader = TextReader.TextReader(os.path.join(basedir, 'news20'), basedir)
    
if os.path.exists(data_path):
    with open(data_path, 'rb') as fin:
        all_data = pickle.load(fin)
        vocab = all_data['vocab']
        all_words = all_data['all_words']
        all_classes = all_data['all_classes']
        targets = all_data['targets']
        sequences = all_data['sequences']
else:
    vocab, all_words, all_classes = reader.read_labeled_documents('20ng-train-all-terms.txt')
    targets = reader.one_hot_encode_classes(all_classes)
    sequences = reader.make_index_sequences(vocab, all_words)
    all_data = {
        'vocab':vocab,
        'all_words':all_words,
        'all_classes': all_classes,
        'targets': targets,
        'sequences': sequences
    }
    with open(data_path, 'wb') as fout:
        pickle.dump(all_data, fout)





In [2]:
len(vocab)

73404

In [3]:
sequences.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6967,6968,6969,6970,6971,6972,6973,6974,6975,6976
0,1,1,1,1,1,1,1,1,1,1,...,56323,28135,57319,4092,30915,38391,53721,39186,0,3
0,1,1,1,1,1,1,1,1,1,1,...,39186,31216,47898,57319,38391,47899,38741,66716,0,3
0,1,1,1,1,1,1,1,1,1,1,...,5337,38897,49969,70135,12616,52881,33507,5992,0,3
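The index sequences map every word to its position in the vocabulary and pad shorter documents to a common width. The sketch below is an assumption about how that could be done, based on the test-set cell later in this notebook: padding goes on the left using the `'<NA>'` entry, and words outside the vocabulary map to `'<UNKNOWN>'`. The function name and the tiny vocabulary are hypothetical.

```python
import numpy as np


def make_index_sequences_sketch(vocab, documents, max_len=None):
    """Turn token lists into a left-padded integer matrix (illustrative only)."""
    word_to_index = {word: i for i, word in enumerate(vocab)}
    pad = word_to_index["<NA>"]
    unknown = word_to_index["<UNKNOWN>"]
    if max_len is None:
        max_len = max(len(doc) for doc in documents)
    sequences = np.full((len(documents), max_len), pad, dtype=np.int64)
    for row, doc in enumerate(documents):
        # Keep at most the last max_len tokens of each document.
        ids = [word_to_index.get(word, unknown) for word in doc][-max_len:]
        if ids:
            sequences[row, max_len - len(ids):] = ids
    return sequences


toy_vocab = ["<NA>", "<UNKNOWN>", "cat", "sat"]
toy_docs = [["cat", "sat"], ["dog"]]
toy_sequences = make_index_sequences_sketch(toy_vocab, toy_docs)
```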


In [4]:
targets.head(3)

Unnamed: 0,alt.atheism,comp.graphics,comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware,comp.sys.mac.hardware,comp.windows.x,misc.forsale,rec.autos,rec.motorcycles,rec.sport.baseball,rec.sport.hockey,sci.crypt,sci.electronics,sci.med,sci.space,soc.religion.christian,talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
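The targets are one-hot rows with one column per newsgroup. A small NumPy sketch of that encoding is shown here. `one_hot_encode` is a hypothetical stand-in for `reader.one_hot_encode_classes`, and sorting the class names alphabetically is an assumption that happens to match the column order printed above.

```python
import numpy as np


def one_hot_encode(labels):
    """Return (sorted class names, one-hot matrix) for a list of labels."""
    classes = sorted(set(labels))
    class_to_col = {name: col for col, name in enumerate(classes)}
    matrix = np.zeros((len(labels), len(classes)), dtype=np.int64)
    for row, label in enumerate(labels):
        matrix[row, class_to_col[label]] = 1
    return classes, matrix


toy_labels = ["sci.space", "alt.atheism", "sci.space"]
toy_classes, toy_targets = one_hot_encode(toy_labels)
```

The matrix can be wrapped as `pandas.DataFrame(toy_targets, columns=toy_classes)` to get the labelled layout used in this notebook.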


Next we load the GloVe embeddings and create an embedding matrix for our training vocabulary.

In [5]:

embed_reader = GloveReader.GloveReader(base_dir=basedir)
glove1 = embed_reader.read_glove_model('model50')


# save the vocab embeddings because they are expensive to recreate.
import pickle
output = os.path.join(basedir, 'news20')
output = os.path.join(output, 'vocab_embeddings.pickle')

vocab_embedding = None
if os.path.exists(output):
    with open(output, 'rb') as fin:
        vocab_embedding = pickle.load(fin)
else:
    vocab_embedding = reader.vocab_to_embedding_matrix(embed_reader, vocab)
    with open(output, 'wb') as fout:
        pickle.dump(vocab_embedding, fout)

Skipping line 18137: Expected 51 fields in line 18137, saw 52
Skipping line 77306: Expected 51 fields in line 77306, saw 52
Skipping line 78481: Expected 51 fields in line 78481, saw 52
Skipping line 80636: Expected 51 fields in line 80636, saw 52
Skipping line 86603: Expected 51 fields in line 86603, saw 52
Skipping line 95766: Expected 51 fields in line 95766, saw 52
Skipping line 97253: Expected 51 fields in line 97253, saw 52
Skipping line 98622: Expected 51 fields in line 98622, saw 52
Skipping line 102606: Expected 51 fields in line 102606, saw 52
Skipping line 104608: Expected 51 fields in line 104608, saw 52
Skipping line 120311: Expected 51 fields in line 120311, saw 52
Skipping line 123556: Expected 51 fields in line 123556, saw 52
Skipping line 129697: Expected 51 fields in line 129697, saw 52
Skipping line 140365: Expected 51 fields in line 140365, saw 52
Skipping line 141336: Expected 51 fields in line 141336, saw 52
Skipping line 147469: Expected 51 fields in line 147469,
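The "Skipping line" messages above appear to mean that some GloVe lines were dropped while parsing, so those tokens would have no pretrained vector. The sketch below shows one way an embedding matrix can be assembled for a vocabulary when some words are missing. It is not the `vocab_to_embedding_matrix` implementation from `lib`, and the small random initialisation for missing words is my assumption.

```python
import numpy as np


def build_embedding_matrix(vocab, embeddings, dim, seed=0):
    """Stack one vector per vocab entry; missing words keep a small random vector.

    Returns the matrix and the number of words found in `embeddings`.
    """
    rng = np.random.RandomState(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    found = 0
    for index, word in enumerate(vocab):
        vector = embeddings.get(word)
        if vector is not None:
            matrix[index] = vector
            found += 1
    return matrix, found


# Made-up 2-dimensional embeddings for illustration.
toy_embeddings = {"cat": np.array([1.0, 2.0]), "sat": np.array([3.0, 4.0])}
toy_matrix, toy_found = build_embedding_matrix(["<NA>", "cat", "sat"], toy_embeddings, dim=2)
```

Row `i` of the matrix corresponds to index `i` in the vocabulary, which is what an embedding layer expects.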

Now we can build the network and train it.

In [6]:
# the vocab embedding can be used with our cnn embedding model.
from lib import CnnClassifier

classifier = CnnClassifier.CnnClassifier()

max_sequence_length = sequences.shape[1]
embed_dim = vocab_embedding.shape[1]
num_outputs = targets.shape[1]
pool_size = 2
kernel_shape = 3
dropout_pc = 0.3
# note that filters correspond roughly with n-grams
num_filters=100
cnn_padding='same'


Using TensorFlow backend.


In [7]:
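# Build the model on the pretrained embedding matrix. Note that train_embedding=True is passed
# below, so the embedding layer is trained as well; pass train_embedding=False to keep it fixed.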
model = classifier.build_network(len(vocab), max_sequence_length, num_outputs, pool_size, kernel_shape, embed_dim, num_filters=num_filters, embedding_matrix=vocab_embedding, train_embedding=True, cnn_padding=cnn_padding)
model.summary()

model.compile(optimizer='nadam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
(6977, 50, 1)
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6977, 50)          3670200   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 6977, 65)          6565      
_________________________________________________________________
dropout_1 (Dropout)          (None, 6977, 65)          0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 2325, 65)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 2325, 65)          8515      
_________________________________________________________________
dropout
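The parameter counts and output shapes in a summary like this can be checked by hand. An embedding layer has `vocab_size * embed_dim` weights, a `Conv1D` layer has `(kernel_size * in_channels + 1) * filters`, and max pooling with the default stride divides the sequence length by the pool size, rounding down. The helper names below are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    """Weights in an embedding layer: one vector per vocabulary entry."""
    return vocab_size * embed_dim


def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter in a 1D convolution."""
    return (kernel_size * in_channels + 1) * filters


def pooled_length(length, pool_size):
    """Output length of max pooling with stride equal to pool_size, no padding."""
    return length // pool_size


print(embedding_params(73404, 50))  # 3670200
print(conv1d_params(2, 50, 65))     # 6565
print(conv1d_params(2, 65, 65))     # 8515
print(pooled_length(6977, 3))       # 2325
```

The embedding count matches the vocabulary size of 73404 with 50-dimensional vectors. The convolution and pooling figures in the printed summary line up with a kernel size of 2, 65 filters and a pool size of 3, which differs from the values set in cell 6, so the summary may have come from a run with different settings. That is an inference from the arithmetic, not something the output states.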

Prepare the training data.

In [8]:
np.random.seed(42)

rows = sequences.shape[0]
# will shuffle the data
indices = np.arange(rows)
np.random.shuffle(indices)

shuffled_inputs = sequences.values[indices]
shuffled_targets = targets.values[indices]
train_percent = 0.8
trainX, validateX = np.split(shuffled_inputs, [int(train_percent*shuffled_inputs.shape[0])])
trainY, validateY = np.split(shuffled_targets, [int(train_percent*shuffled_targets.shape[0])])
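The same shuffle and split can be wrapped in a small function so that inputs and targets are always permuted together. This is a sketch with hypothetical names and a toy array, not a replacement for the cell above.

```python
import numpy as np


def shuffle_and_split(inputs, targets, train_fraction=0.8, seed=42):
    """Shuffle rows with one shared permutation, then cut into train and validation."""
    assert len(inputs) == len(targets)
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(inputs))
    cut = int(train_fraction * len(inputs))
    train_idx, val_idx = indices[:cut], indices[cut:]
    return inputs[train_idx], inputs[val_idx], targets[train_idx], targets[val_idx]


# Toy data: row i of demo_x starts with 2 * i, so alignment is easy to check.
demo_x = np.arange(20).reshape(10, 2)
demo_y = np.arange(10)
demo_train_x, demo_val_x, demo_train_y, demo_val_y = shuffle_and_split(demo_x, demo_y)
```

With 11293 training documents, a cut at 0.8 gives 9034 training rows and 2259 validation rows, which matches the counts Keras reports during training.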


Train the model. Before training, start TensorBoard from the base directory to monitor progress.

```
tensorboard --logdir logs
```

In [9]:
from datetime import datetime
import keras

logdir=os.path.join("logs", "scalars")
logdir=os.path.join(logdir, "model1")
logdir=os.path.join(logdir, datetime.now().strftime("%Y%m%d-%H%M%S"))

tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

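checkpoint_dir = os.path.join("checkpoints", datetime.now().strftime("%Y%m%d-%H%M%S"))
# make sure the checkpoint directory exists before Keras tries to write into it
os.makedirs(checkpoint_dir, exist_ok=True)
# num_filters is an int, so convert it to a string before building the file name
checkpoint_path = os.path.join(checkpoint_dir, "text_dnn_classifier-"+str(num_filters)+"-{epoch:02d}-{val_categorical_accuracy:.2f}.hdf5")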

checkpoint_callback = keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='val_categorical_accuracy', verbose=1, save_best_only=True, mode='max')

epochs=100
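The placeholders in the checkpoint file name are filled in by Keras with Python string formatting, using the epoch number and the logged metrics. The sketch below reproduces that step with plain `str.format`; the epoch and accuracy values are made up. It also shows why the integer filter count has to be converted with `str()` before concatenation.

```python
num_filters = 100  # same value as in the notebook; an int, so str() is needed below

template = ("text_dnn_classifier-" + str(num_filters)
            + "-{epoch:02d}-{val_categorical_accuracy:.2f}.hdf5")

# Made-up epoch and metric values, formatted the way the callback would.
filename = template.format(epoch=3, val_categorical_accuracy=0.8123)
print(filename)  # text_dnn_classifier-100-03-0.81.hdf5
```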



In [10]:

history = model.fit(trainX,
                    trainY,
                    epochs=epochs,
                    validation_data=(validateX, validateY),
                    callbacks=[tensorboard_callback, checkpoint_callback])

Instructions for updating:
Use tf.cast instead.
Train on 9034 samples, validate on 2259 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epo

In [None]:
import pandas
# To evaluate the model we load the test set with its own reader, since it is in a separate file,
# but we use the original training vocabulary to define the sequences.
test_reader = TextReader.TextReader(os.path.join(basedir, 'news20'),
                                    basedir)

testvocab = []
test_words = []
test_classes = []
test_targets = None
test_sequences = None

test_path = os.path.join(basedir, 'news20')
test_path = os.path.join(test_path, 'test_data.pickle')

if os.path.exists(test_path):
    with open(test_path, 'rb') as fin:
        all_data = pickle.load(fin)
        testvocab = all_data['test_vocab']
        test_words = all_data['test_words']
        test_classes = all_data['test_classes']
        test_targets = all_data['test_targets']
        test_sequences = all_data['test_sequences']
else:
    testvocab, test_words, test_classes = test_reader.read_labeled_documents('20ng-test-all-terms.txt')

    # get the test targets
    test_targets = test_reader.one_hot_encode_classes(test_classes)
    # get the test sequences but use the indexes in the vocabulary we trained on.
    # words in the test set not in the original vocab are substituted with '<UNKNOWN>'
    test_sequences = test_reader.make_index_sequences(vocab, test_words)
    # we need to set the max width of sequences to equal the maximum width of the
    # training data.
    test_width = test_sequences.shape[1]
    if test_width > max_sequence_length:
        delta = test_width - max_sequence_length
        test_sequences = test_sequences.iloc[:, delta:]
    elif test_width < max_sequence_length:
        # otherwise we need to left-pad the sequences so they are the same length.
        padding = vocab.index('<NA>')
        delta = max_sequence_length - test_width
        rows = test_sequences.shape[0]
        pad_cells = np.tile(padding, [rows, max_sequence_length])
        # copy all test sequences into the right-hand columns in one assignment
        pad_cells[:, delta:max_sequence_length] = test_sequences.values
        test_sequences = pandas.DataFrame(pad_cells)
        
    all_data = {
        'test_vocab': testvocab,
        'test_words': test_words,
        'test_classes': test_classes,
        'test_targets': test_targets,
        'test_sequences': test_sequences
    }
    with open(test_path, 'wb') as fout:
        pickle.dump(all_data, fout)
        
test_sequences.shape
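The width alignment in the cell above, dropping leading columns when the test set is wider and left-padding when it is narrower, can be expressed as one small NumPy function. `fit_width` is a hypothetical name and the pad value of 0 in the example is made up.

```python
import numpy as np


def fit_width(sequences, width, pad_value):
    """Truncate from the left or left-pad so the result has exactly `width` columns."""
    rows, current = sequences.shape
    if current >= width:
        # Keep the right-most `width` columns.
        return sequences[:, current - width:]
    out = np.full((rows, width), pad_value, dtype=sequences.dtype)
    out[:, width - current:] = sequences
    return out


toy = np.array([[1, 2, 3], [4, 5, 6]])
narrower = fit_width(toy, 2, pad_value=0)
wider = fit_width(toy, 5, pad_value=0)
```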

In [None]:
test_sequences.head(4)

In [None]:
test_targets.head(4)

In [None]:
loss, accuracy = model.evaluate(test_sequences, test_targets)

In [None]:
loss, accuracy
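The categorical accuracy reported here is the fraction of documents whose highest-scoring class matches the one-hot target. A NumPy sketch of that calculation follows. The probabilities are made up, and in the notebook they would come from something like `model.predict(test_sequences)`.

```python
import numpy as np


def categorical_accuracy(y_true, y_prob):
    """Fraction of rows where the arg-max prediction equals the one-hot target."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))


# Made-up targets and predicted probabilities for four documents, three classes.
toy_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.4, 0.3],
                     [0.6, 0.3, 0.1]])
toy_accuracy = categorical_accuracy(toy_true, toy_prob)
```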