# Classifying Claims - Keras CNN + LSTM + Embedding

In this post we will see whether we can build classifiers that predict the first-level patent classification from the claim text.

In particular, here we will look at applying a CNN as part of the deep learning stack.

In [4]:
# Load data
import os
import pickle

with open("raw_data.pkl", "rb") as f:
    data = pickle.load(f)

In [5]:
data[0]

('\n1. A detector for atrial fibrillation or flutter (AF) comprising: \nan impedance measuring unit comprising a measuring input, to which an atrial electrode line having an electrode for a unipolar measurement of an impedance in an atrium is connected and is implemented to generate an atrial impedance signal, obtained in a unipolar manner, in such a way that an impedance signal for each atrial cycle, comprising an atrial contraction and a following relaxation of said atrium, comprises multiple impedance values detected at different instants within a particular atrial cycle; \nsaid impedance measuring unit comprising a signal input, via which a ventricle signal is to be supplied to said detector, which reflects instants of ventricular contractions in chronological assignment to said impedance signal; \nan analysis unit configured to average multiple sequential impedance signal sections of a unipolar atrial impedance signal, which are each delimited by two sequential ventricular contrac

Let's have a play with the Keras text tokenizer (see https://keras.io/preprocessing/text/#tokenizer).

In [6]:
from keras.preprocessing.text import Tokenizer

docs = [d[0] for d in data]

# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

Using TensorFlow backend.


In [8]:
encoded_claims = t.texts_to_matrix(docs, mode='tfidf')

In [10]:
encoded_claims.shape

(11238, 26142)

This is much faster than my old methods! But hey, I learnt some stuff about tokenisation.  

We can pass the "num_words" parameter to the Tokenizer to restrict the vocabulary to the top n most frequent words.
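To make the effect of "num_words" concrete, here is a minimal pure-Python sketch of frequency-based vocabulary truncation. It illustrates the idea rather than reproducing the Keras implementation: the helper names `build_word_index` and `encode` and the toy claims are hypothetical. As in Keras, indices start at 1 and only words whose index is below `num_words` are kept.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank, 1 being the most frequent."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(text, word_index, num_words=None):
    """Turn text into a list of word ranks, dropping words outside the top ranks."""
    ranks = (word_index.get(word) for word in text.lower().split())
    return [r for r in ranks if r is not None and (num_words is None or r < num_words)]

# Toy data, invented for illustration
toy_claims = [
    "a detector comprising a measuring unit",
    "a measuring unit comprising an electrode",
    "a method comprising measuring an impedance",
]
word_index = build_word_index(toy_claims)
full = encode(toy_claims[0], word_index)                     # every word kept
restricted = encode(toy_claims[0], word_index, num_words=4)  # only ranks 1 to 3 kept
print(full)
print(restricted)
```

Note that with this convention `num_words=5000` keeps the 4,999 most frequent words, because index 0 is reserved.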

In [12]:
# LSTM and CNN for sequence classification on claim data
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Y_class = [d[1] for d in data]

# encode class values as integers
label_e = LabelEncoder()
label_e.fit(Y_class)
encoded_Y = label_e.transform(Y_class)
# convert integers to dummy variables (i.e. one hot encoded)
Y_data = to_categorical(encoded_Y)
print("Our classes are now a matrix of {0}".format(Y_data.shape))

# fix random seed for reproducibility
seed = 9
numpy.random.seed(seed)

# Initialise tokenizer
top_words = 5000
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(docs)
X_data = tokenizer.texts_to_matrix(docs, mode='tfidf')
print("Our claims are now a matrix of {0}".format(X_data.shape))

Our classes are now a matrix of (11238, 8)
Our claims are now a matrix of (11238, 5000)


In [15]:
no_classes = Y_data.shape[1]
print("There are {0} different classes".format(no_classes))

There are 8 different classes
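For reference, the label-encoding and one-hot steps above boil down to something like the following sketch in plain Python. The toy labels are invented for illustration (the post does not list the actual class names) and `one_hot` is a hypothetical helper, not the Keras `to_categorical` function.

```python
def one_hot(labels):
    """Return (sorted class names, one-hot rows) for a list of labels."""
    classes = sorted(set(labels))                  # LabelEncoder also sorts its classes
    index = {c: i for i, c in enumerate(classes)}  # class name -> integer code
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0                    # single 1 in the class column
        rows.append(row)
    return classes, rows

# Toy labels, assumed for illustration only
toy_labels = ["A", "H", "G", "A", "B"]
classes, Y_toy = one_hot(toy_labels)
print(classes)
print(Y_toy[1])
```

Each row sums to 1, which is the shape of target that `categorical_crossentropy` with a softmax output expects.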


In [19]:
# split the data into training (80%) and testing (20%)
(X_train, X_test, Y_train, Y_test) = train_test_split(X_data, Y_data, test_size=0.2, random_state=seed)

# truncate and pad input sequences
max_claim_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_claim_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_claim_length)

# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_claim_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(no_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=20, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 808       
Total params: 217,112
Trainable params: 217,112
Non-trainable params: 0
_________________________________________________________________
None
Train on 8990 samples, validate on 2248 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 
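As a sanity check, the parameter counts in the model summary above can be reproduced by hand from the layer settings. The short calculation below uses only the values that appear in the code (5000 words, 32-dimensional embeddings, 32 filters of width 3, 100 LSTM units, 8 classes); the variable names are my own.

```python
vocab_size, embed_dim = 5000, 32   # top_words, embedding vector length
filters, kernel_size = 32, 3       # Conv1D settings
lstm_units, n_classes = 100, 8     # LSTM size, number of output classes

# Embedding: one vector per word in the vocabulary
embedding_params = vocab_size * embed_dim
# Conv1D: kernel_size x input_channels weights per filter, plus one bias per filter
conv_params = kernel_size * embed_dim * filters + filters
# LSTM: four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((filters + lstm_units) * lstm_units + lstm_units)
# Dense: weights plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + conv_params + lstm_params + dense_params
print(embedding_params, conv_params, lstm_params, dense_params, total_params)
```

The embedding layer accounts for roughly three quarters of the trainable parameters, so the vocabulary size is the main lever on model size here.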

In [1]:
# Save the trained model; this must run in the same kernel session as the training cell
model.save("conv_lstm.hd5")

NameError: name 'model' is not defined

Interestingly, this approach doesn't seem to work, probably because there are few patterns at the word level for a CNN to detect.  
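One other thing worth checking before blaming the architecture is the input representation. An Embedding layer expects sequences of integer word indices, whereas `X_data` above is a TF-IDF matrix; as far as I can tell, `pad_sequences` with its default settings would keep only the last 500 columns of each row and cast the weights to integers. Below is a minimal sketch, in plain Python, of the integer-sequence encoding an Embedding layer is designed for. The helper names, the toy `word_index` and the toy claims are hypothetical.

```python
def texts_to_index_sequences(texts, word_index, num_words):
    """Encode each text as a list of integer word indices below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_or_truncate(seq, maxlen):
    """Keep the last maxlen items and left-pad with 0 (mirrors the pad_sequences defaults)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Toy vocabulary and claims, invented for illustration
word_index = {"a": 1, "comprising": 2, "unit": 3, "electrode": 4, "detector": 5}
toy_claims = ["A detector comprising a unit", "An electrode"]

sequences = texts_to_index_sequences(toy_claims, word_index, num_words=5)
padded = [pad_or_truncate(s, maxlen=4) for s in sequences]
print(sequences)
print(padded)
```

With the real data, the equivalent would presumably be `tokenizer.texts_to_sequences(docs)` followed by `sequence.pad_sequences(...)`, in place of `texts_to_matrix`. I have not rerun the model with that change, so whether it improves the results is an open question.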

This would likely work better at the character level.
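To give a feel for what character-level input would look like, here is a small sketch that maps each character to an integer and pads to a fixed length. It is an illustration under my own assumptions, not something I have run on the claim data: `build_char_index` and `encode_chars` are hypothetical helpers and the toy strings are invented.

```python
def build_char_index(texts):
    """Assign each distinct character an index starting at 1 (0 is reserved for padding)."""
    chars = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(chars, start=1)}

def encode_chars(text, char_index, maxlen):
    """Encode text as character indices, truncated or right-padded with 0 to maxlen."""
    ids = [char_index[ch] for ch in text[:maxlen] if ch in char_index]
    return ids + [0] * (maxlen - len(ids))

# Toy strings, invented for illustration
toy_claims = ["a detector", "an electrode"]
char_index = build_char_index(toy_claims)
X_chars = [encode_chars(c, char_index, maxlen=12) for c in toy_claims]
print(len(char_index))
print(X_chars[0])
```

The vocabulary shrinks from thousands of words to a few dozen characters, but the sequences get much longer, so `max_claim_length` would need to grow accordingly. The Keras Tokenizer also has a `char_level=True` option that should achieve the same thing.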