# Classifying Claims - Keras Tokeniser TFIDF + FFNN

In this post we will see whether we can build a classifier that predicts the first-level patent classification from the text of a claim.

In particular, we will apply a standard feed-forward neural network to a TF-IDF matrix.

In [23]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [1]:
# Load data
import os
import pickle

with open("raw_data.pkl", "rb") as f:
    data = pickle.load(f)

In [2]:
data[0]

('\n1. A detector for atrial fibrillation or flutter (AF) comprising: \nan impedance measuring unit comprising a measuring input, to which an atrial electrode line having an electrode for a unipolar measurement of an impedance in an atrium is connected and is implemented to generate an atrial impedance signal, obtained in a unipolar manner, in such a way that an impedance signal for each atrial cycle, comprising an atrial contraction and a following relaxation of said atrium, comprises multiple impedance values detected at different instants within a particular atrial cycle; \nsaid impedance measuring unit comprising a signal input, via which a ventricle signal is to be supplied to said detector, which reflects instants of ventricular contractions in chronological assignment to said impedance signal; \nan analysis unit configured to average multiple sequential impedance signal sections of a unipolar atrial impedance signal, which are each delimited by two sequential ventricular contrac

Let's have a play with the Keras text tokenizer (see https://keras.io/preprocessing/text/#tokenizer).

In [3]:
from keras.preprocessing.text import Tokenizer

docs = [d[0] for d in data]

# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

Using TensorFlow backend.


In [4]:
encoded_claims = t.texts_to_matrix(docs, mode='tfidf')

The texts_to_matrix function gives us, for each claim, a single fixed-length vector of word counts scaled by document frequency. Word order is lost in this representation, so it suits a feed-forward neural network rather than an RNN, which expects a sequence.
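To make that concrete, here is a minimal plain-Python sketch of a TF-IDF row. The helper name tfidf_rows and the toy claims are made up for illustration, and the weighting, (1 + log(count)) * log(1 + N / (1 + document frequency)), is my understanding of what the Keras 'tfidf' mode does, so treat it as an assumption rather than the library's exact code.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hypothetical helper: one TF-IDF row per document, one column per word."""
    tokenised = [doc.lower().split() for doc in docs]
    # Fixed column order: one column per vocabulary word.
    vocab = sorted({word for words in tokenised for word in words})
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for words in tokenised for word in set(words))
    n_docs = len(docs)
    rows = []
    for words in tokenised:
        counts = Counter(words)
        row = []
        for word in vocab:
            c = counts.get(word, 0)
            if c == 0:
                row.append(0.0)
            else:
                # Assumed weighting, see the note above.
                tf = 1 + math.log(c)
                idf = math.log(1 + n_docs / (1 + doc_freq[word]))
                row.append(tf * idf)
        rows.append(row)
    return vocab, rows

toy_docs = [
    "a detector comprising an electrode",
    "an electrode comprising a detector",  # same words, different order
    "a method of detecting a signal",
]
vocab, rows = tfidf_rows(toy_docs)
print(vocab)
print(rows[0] == rows[1])  # the two reordered claims get identical rows
```

The first two toy claims contain the same words in a different order and end up with the same vector, which is why a sequence model has nothing extra to work with here.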

In [5]:
encoded_claims.shape

(11238, 26142)

This is much faster than my old methods! But hey, I learnt some stuff about tokenisation.  

We can pass the "num_words" parameter to the Tokenizer to restrict the vocabulary to the n most frequent words.
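In Keras this would simply be Tokenizer(num_words=n). The sketch below only mimics the effect in plain Python so we can see what gets dropped; the function names and toy claims are hypothetical. As far as I recall from the Keras documentation, the real tokenizer keeps the num_words - 1 most common words because index 0 is reserved, so the exact cut-off may differ by one.

```python
from collections import Counter

def top_n_vocabulary(docs, n):
    """Hypothetical helper: the n most frequent lower-cased words across docs."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def filter_to_vocabulary(doc, vocabulary):
    """Drop every word of doc that is not in the restricted vocabulary."""
    keep = set(vocabulary)
    return [word for word in doc.lower().split() if word in keep]

toy_docs = [
    "a detector comprising an electrode and a signal input",
    "a method comprising measuring a signal",
    "a signal analysis unit",
]
vocab = top_n_vocabulary(toy_docs, 3)
filtered = filter_to_vocabulary(toy_docs[0], vocab)
print(vocab)
print(filtered)
```

With a smaller vocabulary the TF-IDF matrix has fewer columns, which in turn shrinks the first Dense layer of the network.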

In [7]:
Y_class = [d[1] for d in data]

# encode class values as integers
label_e = LabelEncoder()
label_e.fit(Y_class)
encoded_Y = label_e.transform(Y_class)
# convert integers to dummy variables (i.e. one hot encoded)
Y_data = to_categorical(encoded_Y)
print("Our classes are now a matrix of {0}".format(Y_data.shape))
print("Original label: {0}; Converted label: {1}".format(Y_class[0], Y_data[0]))

Our classes are now a matrix of (11238, 8)
Original label: A; Converted label: [ 1.  0.  0.  0.  0.  0.  0.  0.]
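As a rough picture of what the two encoding steps do, here is a NumPy-only sketch on toy labels (not the real data). It assumes the classes are indexed in sorted order, which is how LabelEncoder behaves, and shows how argmax gets us back from a one-hot row to the original letter.

```python
import numpy as np

labels = ["A", "G", "H", "A", "D"]        # toy labels, not the real data
classes = sorted(set(labels))             # sorted order, as LabelEncoder uses
class_to_index = {c: i for i, c in enumerate(classes)}

# Step 1: letters -> integers
encoded = np.array([class_to_index[c] for c in labels])
# Step 2: integers -> one-hot rows (one row per sample, one column per class)
one_hot = np.eye(len(classes))[encoded]

# Going back: argmax recovers the integer, which indexes into classes
decoded = [classes[i] for i in one_hot.argmax(axis=1)]
print(one_hot.shape)
print(decoded)
```

With the objects from the cell above, the same round trip should be label_e.inverse_transform(np.argmax(Y_data, axis=1)).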


In [8]:
# split the data into training (80%) and testing (20%)
(X_train, X_test, Y_train, Y_test) = train_test_split(encoded_claims, Y_data, test_size=0.2)

In [9]:
# clear some memory
del data, docs

In [10]:
input_dim = encoded_claims.shape[1]
print("Our input dimension for our claim representation is {0}".format(input_dim))

Our input dimension for our claim representation is 26142


In [11]:
no_classes = Y_data.shape[1]
print("Our output dimension is {0}".format(no_classes))

Our output dimension is 8


In [18]:
# create the model
model = Sequential()
model.add(Dense(500, input_dim=input_dim, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(no_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=5, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 500)               13071500  
_________________________________________________________________
dropout_1 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 8)                 4008      
=================================================================
Total params: 13,075,508
Trainable params: 13,075,508
Non-trainable params: 0
_________________________________________________________________
None
Train on 8990 samples, validate on 2248 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 63.61%


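It is overfitting on our training set. The model already includes a Dropout(0.5) layer, so we should try stronger regularisation, for example more dropout or a smaller vocabulary via "num_words".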

In [1]:
model.save("conv_lstm.hd5")

NameError: name 'model' is not defined

Interesting that this approach doesn't seem to work, likely because there are limited patterns at the word level to be detected by a CNN.

This would likely work better at the character level.

In [20]:
from collections import Counter
# Let's check how our data is distributed across the classes
class_count = Counter(Y_class)
class_count

Counter({'A': 1777,
         'B': 1449,
         'C': 865,
         'D': 54,
         'E': 269,
         'F': 735,
         'G': 3335,
         'H': 2754})
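These counts give us two useful numbers. The sketch below, using only the counts printed above, works out the accuracy of always guessing the largest class, and a set of "balanced" class weights computed as total / (number of classes * class count). Weighting the classes is one possible follow-up and not something the model above used.

```python
# Class counts copied from the Counter output above
class_count = {"A": 1777, "B": 1449, "C": 865, "D": 54,
               "E": 269, "F": 735, "G": 3335, "H": 2754}

total = sum(class_count.values())
majority_class = max(class_count, key=class_count.get)
# Accuracy of a classifier that always predicts the largest class
baseline_accuracy = class_count[majority_class] / total

# "Balanced" weights: rare classes get large weights, common ones small
n_classes = len(class_count)
class_weight = {label: total / (n_classes * count)
                for label, count in class_count.items()}

print("Always predicting {0} gives {1:.1%} accuracy".format(
    majority_class, baseline_accuracy))
for label in sorted(class_weight):
    print(label, round(class_weight[label], 2))
```

So any model needs to beat roughly 30% to be doing something useful. If we wanted to use the weights, Keras model.fit accepts a class_weight dictionary keyed by integer class index, so the letters would first need mapping through the label encoder.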

In [24]:
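# Code for building a confusion matrix

def get_confusion_matrix_one_hot(model_results, truth):
    '''Build a confusion matrix from one-hot style arrays.

    model_results and truth must have the same shape, with >= 2 columns.
    truth holds 0/1 values; the argmax of each row of model_results is
    taken as the predicted class. Rows of the returned matrix are actual
    classes and columns are predicted classes.
    '''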
    assert model_results.shape == truth.shape
    num_outputs = truth.shape[1]
    confusion_matrix = np.zeros((num_outputs, num_outputs), dtype=np.int32)
    predictions = np.argmax(model_results,axis=1)
    assert len(predictions)==truth.shape[0]

    for actual_class in range(num_outputs):
        idx_examples_this_class = truth[:,actual_class]==1
        prediction_for_this_class = predictions[idx_examples_this_class]
        for predicted_class in range(num_outputs):
            count = np.sum(prediction_for_this_class==predicted_class)
            confusion_matrix[actual_class, predicted_class] = count
    assert np.sum(confusion_matrix)==len(truth)
    assert np.sum(confusion_matrix)==np.sum(truth)
    return confusion_matrix
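To check how the matrix should be read, here is a compact sketch of the same idea on toy data, using np.add.at in place of the nested loops. The truth and score values are made up for illustration and are not model output.

```python
import numpy as np

# Toy one-hot truth for 3 classes and made-up model scores
truth = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
scores = np.array([[0.7, 0.2, 0.1],   # actual 0, predicted 0
                   [0.1, 0.8, 0.1],   # actual 0, predicted 1
                   [0.2, 0.6, 0.2],   # actual 1, predicted 1
                   [0.3, 0.3, 0.4]])  # actual 2, predicted 2

actual = truth.argmax(axis=1)
predicted = scores.argmax(axis=1)

# Add 1 to toy_cm[actual, predicted] for every sample
toy_cm = np.zeros((3, 3), dtype=np.int32)
np.add.at(toy_cm, (actual, predicted), 1)
print(toy_cm)
```

Each row is an actual class and each column a predicted class, so correct predictions sit on the diagonal and the single mistake shows up in row 0, column 1.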

In [25]:
predict = model.predict(X_test)

In [26]:
cm = get_confusion_matrix_one_hot(predict, Y_test)

In [27]:
cm

array([[212,  37,  30,   0,   3,   9,  30,  15],
       [ 15, 161,   9,   0,   5,  20,  46,  31],
       [ 29,  29,  90,   0,   0,   2,  16,  12],
       [  1,   6,   0,   2,   0,   0,   1,   2],
       [  6,  19,   0,   0,  21,   3,   8,   3],
       [  8,  25,   2,   0,   3,  62,  11,  17],
       [ 15,  28,   6,   0,   1,   9, 512, 129],
       [  9,  22,   7,   0,   2,  14, 123, 370]], dtype=int32)
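The raw matrix is easier to interpret once it is turned into per-class figures. The sketch below copies the matrix printed above and derives accuracy, precision and recall from it. It assumes that rows and columns follow the alphabetical class order A to H produced by the label encoder.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[212,  37,  30,   0,   3,   9,  30,  15],
               [ 15, 161,   9,   0,   5,  20,  46,  31],
               [ 29,  29,  90,   0,   0,   2,  16,  12],
               [  1,   6,   0,   2,   0,   0,   1,   2],
               [  6,  19,   0,   0,  21,   3,   8,   3],
               [  8,  25,   2,   0,   3,  62,  11,  17],
               [ 15,  28,   6,   0,   1,   9, 512, 129],
               [  9,  22,   7,   0,   2,  14, 123, 370]])
labels = list("ABCDEFGH")  # assumed class order

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
recall = correct / cm.sum(axis=1)                    # share of each actual class found
precision = correct / np.maximum(cm.sum(axis=0), 1)  # guard against empty columns

print("Accuracy: {0:.2%}".format(accuracy))
for label, p, r in zip(labels, precision, recall):
    print("{0}: precision {1:.2f}, recall {2:.2f}".format(label, p, r))

# Find the most common single confusion (largest off-diagonal cell)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print("Most common confusion: actual {0} predicted as {1} ({2} claims)".format(
    labels[i], labels[j], off_diag[i, j]))
```

The diagonal sums to the same accuracy the model reported, which is a handy sanity check that the matrix and the evaluation agree.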