# Classifying Claims - Keras Tokeniser TFIDF + FFNN - Binary Classification

In this post we will see whether we can build a classifier that predicts the first-level patent classification (the section letter, for example A, G or H) from the claim text.

In particular, we will apply a standard feed-forward neural network to the bag-of-words matrix produced by the Keras tokenizer, comparing TF-IDF weighting with raw word counts.

We will simplify the task to a binary classification, G or not G, and see what results we get.

In [20]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [21]:
# Load data
import os
import pickle

with open("raw_data.pkl", "rb") as f:
    data = pickle.load(f)

In [22]:
data[0]

('\n1. A detector for atrial fibrillation or flutter (AF) comprising: \nan impedance measuring unit comprising a measuring input, to which an atrial electrode line having an electrode for a unipolar measurement of an impedance in an atrium is connected and is implemented to generate an atrial impedance signal, obtained in a unipolar manner, in such a way that an impedance signal for each atrial cycle, comprising an atrial contraction and a following relaxation of said atrium, comprises multiple impedance values detected at different instants within a particular atrial cycle; \nsaid impedance measuring unit comprising a signal input, via which a ventricle signal is to be supplied to said detector, which reflects instants of ventricular contractions in chronological assignment to said impedance signal; \nan analysis unit configured to average multiple sequential impedance signal sections of a unipolar atrial impedance signal, which are each delimited by two sequential ventricular contrac

Let's have a play with the Keras text tokenizer (as per here - https://keras.io/preprocessing/text/#tokenizer).

In [23]:
from keras.preprocessing.text import Tokenizer

docs = [d[0] for d in data]

# create the tokenizer
t = Tokenizer(num_words=10000)
# fit the tokenizer on the documents
t.fit_on_texts(docs)

# mode='count' gives raw word counts; use mode='tfidf' for TF-IDF weights
encoded_claims = t.texts_to_matrix(docs, mode='count')

encoded_claims.shape

(11238, 10000)

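The texts_to_matrix function turns each claim into a single fixed-length vector with one entry per vocabulary word: raw word counts with mode='count', or counts weighted by inverse document frequency with mode='tfidf'. Word order is lost in this representation, so we apply a feed-forward neural network rather than an RNN, which expects a sequence.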
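To make the difference between the two modes concrete, here is a small NumPy-only sketch on a made-up three-document corpus. It does not call Keras: the `toy_docs` corpus, the two helper functions and the smoothed log weighting are assumptions for illustration, and the exact weighting Keras applies for `mode='tfidf'` may differ.

```python
import numpy as np

# Hypothetical toy corpus standing in for the claim texts.
toy_docs = [
    "a detector comprising a sensor",
    "a method comprising a step",
    "a sensor and a processor",
]

def bag_of_words(docs):
    """Return (vocab, counts) where counts[i, j] is how often word j occurs in doc i."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: j for j, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.split():
            counts[i, index[word]] += 1
    return vocab, counts

def tfidf_from_counts(counts):
    """Weight counts by a smoothed inverse document frequency (one common variant)."""
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)          # number of docs containing each word
    idf = np.log(1 + n_docs / (1 + doc_freq))    # rarer words get a larger weight
    tf = np.zeros_like(counts)
    present = counts > 0
    tf[present] = 1 + np.log(counts[present])    # dampened term frequency
    return tf * idf

vocab, counts = bag_of_words(toy_docs)
tfidf = tfidf_from_counts(counts)

print(vocab)
print(counts)          # one fixed-length row per document; word order is lost
print(tfidf.round(2))
```

In both matrices every document becomes a row of the same length, which is what lets us feed it straight into a dense layer.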

In [25]:
Y_class = [d[1] for d in data]

# binary target: 1 for class G, 0 for every other class
Y_data = [1 if c == 'G' else 0 for c in Y_class]

In [26]:
print(Y_class[0:20])
print(Y_data[0:20])

['A', 'A', 'H', 'H', 'C', 'A', 'B', 'G', 'A', 'H', 'G', 'C', 'F', 'H', 'G', 'G', 'G', 'E', 'G', 'A']
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
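Accuracy on a binary problem only means something relative to the class balance, so it is worth checking what share of the labels is positive. The sketch below does this for the 20 labels printed above; the variable names are illustrative, and the share on the full dataset may well differ.

```python
# The first 20 binary labels, copied from the output above.
sample_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]

positive_share = sum(sample_labels) / len(sample_labels)
# A classifier that always predicts the majority class scores this accuracy.
majority_baseline = max(positive_share, 1 - positive_share)

print("Share of G: {0:.2f}".format(positive_share))
print("Majority-class baseline accuracy: {0:.2f}".format(majority_baseline))
```

Running the same calculation on the full Y_data gives the baseline that the network's validation accuracy should be compared against.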


In [27]:
# split the data into training (80%) and testing (20%)
# the labels are converted to a NumPy array, which newer Keras versions expect
(X_train, X_test, Y_train, Y_test) = train_test_split(encoded_claims, np.array(Y_data), test_size=0.2)
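Because train_test_split shuffles randomly, the reported accuracy will vary a little from run to run. A hedged variant, shown here on synthetic stand-in data rather than the claim matrix, fixes `random_state` and uses `stratify` so that both splits keep the same share of positive labels. The array names and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 5 features, 30% positive labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y_demo,   # keep the class ratio the same in both splits
)

print(X_tr.shape, X_te.shape)
print("Positive share (train, test):", y_tr.mean(), y_te.mean())
```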

In [28]:
input_dim = encoded_claims.shape[1]
print("Our input dimension for our claim representation is {0}".format(input_dim))

Our input dimension for our claim representation is 10000


In [29]:
# create a deeper model
deep_model = Sequential()
deep_model.add(Dropout(0.2, input_shape=(input_dim,)))
deep_model.add(Dense(2000, activation='relu'))
deep_model.add(Dropout(0.5))
deep_model.add(Dense(500, activation='relu'))
deep_model.add(Dense(1, activation='sigmoid'))
deep_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(deep_model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout_5 (Dropout)          (None, 10000)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 2000)              20002000  
_________________________________________________________________
dropout_6 (Dropout)          (None, 2000)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 500)               1000500   
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 501       
Total params: 21,003,001
Trainable params: 21,003,001
Non-trainable params: 0
_________________________________________________________________
None
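The parameter counts in the summary can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and Dropout has none. The two small helpers below (hypothetical names, not part of Keras) reproduce the totals.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def model_params(vocab_size, hidden=(2000, 500), n_outputs=1):
    """Total parameters of the Dense stack used above; Dropout adds none."""
    sizes = [vocab_size] + list(hidden) + [n_outputs]
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

print(dense_params(10000, 2000))  # first hidden layer
print(model_params(10000))        # whole model
```

The first layer holds roughly 95% of the weights, so the vocabulary size largely determines the size of the model.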


In [30]:
deep_model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=20, batch_size=64)
# Final evaluation of the model
scores = deep_model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 8990 samples, validate on 2248 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Accuracy: 83.05%


With a 10,000-word vocabulary and TF-IDF we have:
```
Epoch 20/20
8990/8990 [==============================] - 233s - loss: 0.0190 - acc: 0.9950 - val_loss: 1.0357 - val_acc: 0.8123
Accuracy: 81.23%
```

With a 10,000-word vocabulary and counts we have:
```
8990/8990 [==============================] - 291s - loss: 0.0247 - acc: 0.9939 - val_loss: 0.9626 - val_acc: 0.8305
Accuracy: 83.05%
```

In [18]:
deep_model.save("2017-10-29 deep_dense_dropout_binary.hd5")

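Count seems to work as well as or better than TF-IDF (83.05% against 81.23% validation accuracy), although each run used a different random split, so the difference may not be meaningful. Overfitting is still a problem: training accuracy is above 99% in both runs while validation loss stays close to 1.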
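The size of the overfitting can be read off the final-epoch figures quoted above by subtracting validation accuracy from training accuracy. This is plain arithmetic on the reported numbers; the dictionary layout is just for illustration.

```python
# Final-epoch figures copied from the two runs reported above.
runs = {
    "tfidf": {"acc": 0.9950, "val_acc": 0.8123, "loss": 0.0190, "val_loss": 1.0357},
    "count": {"acc": 0.9939, "val_acc": 0.8305, "loss": 0.0247, "val_loss": 0.9626},
}

gaps = {}
for mode, r in runs.items():
    # Gap between training and validation accuracy, in percentage points.
    gaps[mode] = (r["acc"] - r["val_acc"]) * 100
    print("{0}: train/validation accuracy gap = {1:.1f} points".format(mode, gaps[mode]))
```

A gap of this size, together with a validation loss far above the training loss, suggests the network is largely memorising the training claims.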

## Further Investigations

Have more layers with more dropout?

Compare with bidirectional LSTM approach?

Experiment with regularisation?
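As a starting point for the regularisation idea, the following is a minimal sketch that was not run for this post: it adds an L2 penalty to the hidden layers and sets up early stopping on validation loss. The function name `build_regularised_model`, the penalty strength `1e-4` and the patience of 3 epochs are all untested assumptions.

```python
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.regularizers import l2

def build_regularised_model(input_dim, l2_strength=1e-4):
    """Same layer sizes as the model above, plus an L2 penalty on the hidden weights."""
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(input_dim,)))
    model.add(Dense(2000, activation='relu', kernel_regularizer=l2(l2_strength)))
    model.add(Dropout(0.5))
    model.add(Dense(500, activation='relu', kernel_regularizer=l2(l2_strength)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Stop once validation loss has not improved for 3 epochs (assumed patience)
# and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

model = build_regularised_model(input_dim=10000)
# Training would then look like this (not run here):
# model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
#           epochs=20, batch_size=64, callbacks=[early_stop])
```

Note that early stopping on the test split means the final accuracy is no longer a clean held-out estimate; holding out a separate validation split would be safer.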

In [None]:
# create the tokenizer, this time with a 30,000-word vocabulary
t = Tokenizer(num_words=30000)
# fit the tokenizer on the documents
t.fit_on_texts(docs)
encoded_claims = t.texts_to_matrix(docs, mode='count')

# split the data into training (80%) and testing (20%)
(X_train, X_test, Y_train, Y_test) = train_test_split(encoded_claims, np.array(Y_data), test_size=0.2)

# create the same architecture as before, now on the larger input
model = Sequential()
model.add(Dropout(0.2, input_shape=(encoded_claims.shape[1],)))
model.add(Dense(2000, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=20, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout_7 (Dropout)          (None, 30000)             0         
_________________________________________________________________
dense_9 (Dense)              (None, 2000)              60002000  
_________________________________________________________________
dropout_8 (Dropout)          (None, 2000)              0         
_________________________________________________________________
dense_10 (Dense)             (None, 500)               1000500   
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 501       
Total params: 61,003,001
Trainable params: 61,003,001
Non-trainable params: 0
_________________________________________________________________
None
Train on 8990 samples, validate on 2248 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


In [None]:
model.save("2017-10-29 30k_word_deep_dense_dropout_binary.hd5")