## Neural Network Classifier Demo

This script loads GloVe word embeddings, pre-trained on Wikipedia and newswire text (Gigaword), into a Keras Embedding layer and uses them to train a text classification model that predicts the genre of a news story from its headline.

This code is adapted from:  
https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py  
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

GloVe embedding data can be found at:  
http://nlp.stanford.edu/data/glove.6B.zip  
(source page: http://nlp.stanford.edu/projects/glove/)
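
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. As a minimal sketch of how such a file is turned into a word-to-vector lookup, the snippet below parses a small in-memory sample rather than the real `glove.6B.100d.txt`; the sample words and their 3-dimensional values are made up for illustration.

```python
import io
import numpy as np

# Made-up stand-in for a GloVe file: one token per line, followed by its
# vector components (the real glove.6B.100d.txt has 100 components per line).
sample_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "market -0.4 0.5 0.6\n"
)

embeddings_index = {}
for line in sample_glove:
    values = line.split()
    word = values[0]
    # the remaining fields are the vector components, stored as float32
    embeddings_index[word] = np.asarray(values[1:], dtype='float32')

print(len(embeddings_index), embeddings_index['market'].shape)
```

Words that never appear in the file simply have no entry, which is why the embedding matrix built later keeps all-zero rows for them.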

The news headlines are a subset of the dataset found here:  
https://www.kaggle.com/rmisra/news-category-dataset

### Required Packages
- keras  
- numpy

### Training the model

The dataset contains 10k news headlines obtained from a Kaggle dataset. There are four categories of headlines (business, politics, entertainment, and crime) represented equally with 2.5k instances each.

The dataset was randomly partitioned into training and test sets in a 60/40 split. The split was applied to the combined set of headlines, so the four categories are not represented exactly equally within each set, but the proportions should be roughly similar. The test set is also passed to Keras as the validation data during training.
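
The partition described above can be sketched with NumPy alone. This is a minimal illustration on placeholder data, not the notebook's own cell: the array contents, the fixed seed, and the `split_fraction` name are assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only so the sketch is repeatable

# Placeholder data: 20 samples, 4 class ids with 5 samples each.
data = np.arange(40).reshape(20, 2)
labels = np.tile([0, 1, 2, 3], 5)

split_fraction = 0.4  # share of samples held out for testing

# Shuffle the combined set, then cut off the last 40% as the test set.
indices = rng.permutation(len(data))
data, labels = data[indices], labels[indices]
num_test = int(split_fraction * len(data))

x_train, y_train = data[:-num_test], labels[:-num_test]
x_test, y_test = data[-num_test:], labels[-num_test:]

# Per-class counts in each split; these need not be exactly balanced.
print(np.bincount(y_train, minlength=4), np.bincount(y_test, minlength=4))
```

Because the shuffle ignores the labels, the per-class counts in each split depend on the random draw. A stratified split would keep them equal, but that is not what this notebook does.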

Hyperparameters:  
epochs=10  
batch_size=32  
embedding_dim=100
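
The model trained in the cell below applies three `Conv1D(128, 5)` layers, with `MaxPooling1D(5)` between them, to sequences padded to `MAX_SEQUENCE_LENGTH = 1000`. As a rough sketch of how the sequence length shrinks through that stack, the arithmetic below assumes the Keras defaults that the cell leaves unspecified (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size). The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    # 'valid' convolution with stride 1
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    # 'valid' pooling with stride equal to pool_size
    return (length - pool_size) // pool_size + 1


length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
for layer in ['conv', 'pool', 'conv', 'pool', 'conv']:
    if layer == 'conv':
        length = conv1d_out_len(length, 5)
    else:
        length = maxpool1d_out_len(length, 5)
    trace.append(length)

print(trace)  # [1000, 996, 199, 195, 39, 35]
```

The final `GlobalMaxPooling1D` then collapses the remaining 35 positions into a single 128-dimensional vector, one value per filter.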

In [3]:
from __future__ import print_function

import os
import sys
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.initializers import Constant


BASE_DIR = ''
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
TEXT_DATA_DIR = os.path.join(BASE_DIR, 'news_data')
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.4

# first, build index mapping words in the embeddings set
# to their embedding vector

print('Indexing word vectors.')

embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# second, prepare text samples and their labels
print('Processing text dataset')

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    label_id = len(labels_index)
    labels_index[name] = label_id
    path = os.path.join(TEXT_DATA_DIR, name)
    # each file holds one headline per line; 'with' closes the file afterwards
    with open(path, 'r', encoding='utf-8') as f:
        for headline in f:
            texts.append(headline)
            labels.append(label_id)

print('Found %s texts.' % len(texts))

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# shuffle the data, then split it into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=10,
          validation_data=(x_val, y_val))

Using TensorFlow backend.


Indexing word vectors.
Found 400000 word vectors.
Processing text dataset
Found 10000 texts.
Found 15798 unique tokens.
Shape of data tensor: (10000, 1000)
Shape of label tensor: (10000, 4)
Preparing embedding matrix.
Training model.
Train on 6000 samples, validate on 4000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2434e0b4d68>

### Save the trained model to disk for later use

In [4]:
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

Saved model to disk


### Evaluate the Test dataset
Calculate the overall accuracy, plus the precision, recall, and F1 measure for each category.
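
For reference, the per-category measures follow the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN), which the evaluation cell below counts one category at a time:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

Overall accuracy is the number of correctly classified headlines divided by the total number of test headlines. The cell reports a measure as 0 whenever its denominator is zero.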

In [8]:
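# predict on the whole test set in a single batch
y_predictions = model.predict(x_val, batch_size=len(x_val))

# must be in the same order as the label ids in labels_index
label_names = ['business', 'crime', 'entertainment', 'politics']
shuffled_texts = [texts[i] for i in indices]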
N = len(x_val)
correct = 0

results_texts = []
results_labels = []
results_predictions = []

for i in range(N):
    text = shuffled_texts[-num_validation_samples + i]
    label = label_names[np.argmax(labels[-num_validation_samples + i])]
    prediction = label_names[np.argmax(y_predictions[i])]
    
    # add to results arrays
    results_texts.append(text)
    results_labels.append(label)
    results_predictions.append(prediction)
    
    #print(text, "PREDICTION:", prediction, "ACTUAL:", label)
    if label == prediction:
        correct += 1
print("TOTAL CORRECT:", correct, "TOTAL COUNT:", N, "ACC:", correct / N)
print("")

for label in label_names:
    tp = tn = fp = fn = 0
    for i in range(N):
        if label == results_labels[i]:
            if label == results_predictions[i]:
                tp += 1
            else:
                fn += 1
        if label != results_labels[i]:
            if label == results_predictions[i]:
                fp += 1
            else:
                tn += 1
    
    precision = tp / (tp + fp) if (tp + fp > 0) else 0
    recall = tp / (tp + fn) if (tp + fn > 0) else 0
    f1 = 2 * ((precision * recall) / (precision + recall)) if (precision + recall > 0) else 0
    print(label)
    print("TP: {}, TN: {}, FP: {}, FN: {}".format(tp, tn, fp, fn))
    print("Precision: {}, Recall: {}, F1-measure: {}".format(round(precision, 3), round(recall, 3), round(f1, 3)))
    print("")

TOTAL CORRECT: 3293 TOTAL COUNT: 4000 ACC: 0.82325

business
TP: 877, TN: 2717, FP: 242, FN: 164
Precision: 0.784, Recall: 0.842, F1-measure: 0.812

crime
TP: 873, TN: 2845, FP: 159, FN: 123
Precision: 0.846, Recall: 0.877, F1-measure: 0.861

entertainment
TP: 809, TN: 2894, FP: 161, FN: 136
Precision: 0.834, Recall: 0.856, F1-measure: 0.845

politics
TP: 734, TN: 2837, FP: 145, FN: 284
Precision: 0.835, Recall: 0.721, F1-measure: 0.774
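
As a quick sanity check on the figures above, the per-category counts can be cross-checked against the overall total. The true positives should sum to the number of correct predictions, and TP + FN gives the number of test headlines in each category. The sketch below hard-codes the counts printed above; the variable names are illustrative.

```python
# Counts copied from the evaluation output above.
counts = {
    'business':      {'tp': 877, 'fp': 242, 'fn': 164},
    'crime':         {'tp': 873, 'fp': 159, 'fn': 123},
    'entertainment': {'tp': 809, 'fp': 161, 'fn': 136},
    'politics':      {'tp': 734, 'fp': 145, 'fn': 284},
}

# Every correct prediction is a true positive for exactly one category.
total_correct = sum(c['tp'] for c in counts.values())

# TP + FN is the number of test headlines whose true label is that category.
support = {label: c['tp'] + c['fn'] for label, c in counts.items()}

# Every error is a false positive for one category and a false negative for another.
total_fp = sum(c['fp'] for c in counts.values())
total_fn = sum(c['fn'] for c in counts.values())

print(total_correct)          # 3293
print(sum(support.values()))  # 4000
print(total_fp, total_fn)     # 707 707
```

The per-category support works out to 1041, 996, 945, and 1018 headlines. This is the slight imbalance expected from applying the random 60/40 split to the combined dataset.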

