# Classifying Claims - Keras LSTM + Embedding + Dropout

In this post we will see whether we can build classifiers that predict the first-level patent classification from the claim text.

We will be using USPTO data, where I believe the claims are classified according to the International Patent Classification (IPC). To keep things simple, we will use only the first letter of the IPC code (the top-level category).

The top-level categories, listed at https://rs.espacenet.com/help?locale=en_EP&method=handleHelpTopic&topic=ipc, are:
* A Human Necessities
* B Performing Operations; Transporting
* C Chemistry; Metallurgy
* D Textiles; Paper
* E Fixed Constructions
* F Mechanical Engineering; Lighting; Heating; Weapons; Blasting
* G Physics
* H Electricity

---
## Getting Our Data

See the previous notebook for our data preparation.  

Here we load the data, which has already been tokenised and converted to integer word indices.

In [1]:
import os, pickle

In [2]:
filename = "X_Y_data.pkl"

if os.path.isfile(filename):
    with open(filename, "rb") as f:
        print("Loading data")
        X_data, Y_data, count, word_dictionary, reverse_word_dictionary, class_dictionary, reverse_class_dictionary = pickle.load(f)
else:
    print("Run the previous notebook for data preparation")

Loading data


In [3]:
class_dictionary

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7}

In [4]:
Y_data[0:5]

[0, 0, 7, 7, 2]
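
To make the integer labels easier to read, they can be mapped back to IPC section letters. The sketch below rebuilds the mapping from the `class_dictionary` shown above; the name `index_to_section` is introduced here for illustration, and I am assuming the pickled `reverse_class_dictionary` holds the same index-to-letter mapping.

```python
# Label mapping as printed by the notebook
class_dictionary = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7}

# Invert it: integer index -> section letter (illustrative name, not from the notebook)
index_to_section = {index: letter for letter, index in class_dictionary.items()}

# First five labels, as shown in the output of Y_data[0:5]
first_labels = [0, 0, 7, 7, 2]
decoded = [index_to_section[label] for label in first_labels]
print(decoded)
```

So the first two claims fall under Human Necessities, the next two under Electricity, and the fifth under Chemistry; Metallurgy.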

In [5]:
X_data[0]

[16,
 9,
 3,
 511,
 13,
 3700,
 6638,
 30,
 17590,
 23,
 5524,
 20,
 15,
 18,
 11,
 1333,
 370,
 43,
 15,
 3,
 370,
 78,
 5,
 7,
 36,
 11,
 3700,
 91,
 138,
 32,
 11,
 91,
 13,
 3,
 4916,
 428,
 4,
 11,
 1333,
 12,
 11,
 7720,
 19,
 66,
 6,
 19,
 1532,
 7,
 272,
 11,
 3700,
 1333,
 41,
 5,
 410,
 12,
 3,
 4916,
 683,
 5,
 12,
 141,
 3,
 1171,
 31,
 11,
 1333,
 41,
 13,
 38,
 3700,
 800,
 5,
 15,
 11,
 3700,
 4301,
 6,
 3,
 478,
 5058,
 4,
 10,
 7720,
 5,
 94,
 406,
 1333,
 399,
 476,
 17,
 219,
 4833,
 84,
 3,
 540,
 3700,
 800,
 8,
 10,
 1333,
 370,
 43,
 15,
 3,
 41,
 78,
 5,
 178,
 36,
 3,
 7674,
 41,
 19,
 7,
 53,
 811,
 7,
 10,
 511,
 5,
 36,
 3928,
 4833,
 4,
 4121,
 12984,
 12,
 22497,
 3018,
 7,
 10,
 1333,
 41,
 8,
 11,
 705,
 43,
 48,
 7,
 744,
 406,
 2608,
 1333,
 41,
 810,
 4,
 3,
 4916,
 3700,
 1333,
 41,
 5,
 36,
 54,
 38,
 7950,
 28,
 92,
 2608,
 4121,
 12984,
 5,
 24,
 22,
 350,
 5,
 6,
 7,
 379,
 3,
 767,
 2059,
 4,
 11,
 12606,
 4916,
 3700,
 1333,
 41,
 86,
 5,
 10,


In [6]:
# Check vectors are the same length
print(len(X_data), len(Y_data))

10262 10262


## Applying Sequence Classification with Keras

Working from this post - https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ - we can then apply an LSTM followed by a single dense layer.

Documentation for padding - https://keras.io/preprocessing/sequence/#pad_sequences . Padding uses 0 by default, so index 0 should be reserved as a padding token rather than assigned to a word.

Also here is the Keras guide to sequential classification: https://keras.io/getting-started/sequential-model-guide/.

The post here explains how to split data into training / test using Keras: https://gogul09.github.io/software/first-neural-network-keras.

See here for a bidirectional LSTM: https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py.

This time we'll try adding some dropout between the layers to prevent overfitting.  
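
As a rough illustration of what dropout does during training, here is a minimal NumPy sketch of "inverted" dropout: each activation is zeroed with probability `rate`, and the survivors are scaled up so the expected value is unchanged. The function name `inverted_dropout` is made up for this example, and it is not how Keras implements dropout internally (in particular, the LSTM applies its own masks to the inputs and the recurrent state).

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, rescale the rest by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    # Boolean mask: True where the unit is kept
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(9)
x = np.ones((4, 5))
dropped = inverted_dropout(x, rate=0.2, rng=rng)
print(dropped)  # entries are either 0.0 (dropped) or 1.25 (kept and rescaled)
```

At test time no units are dropped, which is why the rescaling is done during training.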

In [7]:
# First we need to split our data into training and test data - go for 80:20
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

# seed for reproducing same results
seed = 9
np.random.seed(seed)

# split the data into training (80%) and testing (20%)
(X_train, X_test, Y_train, Y_test) = train_test_split(X_data, Y_data, test_size=0.2, random_state=seed)

Using TensorFlow backend.


In [8]:
print("Our training data has length: {0} and our test data has length: {1}".format(len(X_train), len(X_test)))

Our training data has length: 8209 and our test data has length: 2053
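
The 8209 / 2053 sizes follow from rounding the 20% test share of 10262 samples up to a whole number. The sketch below shows that arithmetic with a plain NumPy shuffle; `split_indices` is a hypothetical helper, and it will not reproduce the exact shuffle that `train_test_split` produces for the same seed.

```python
import math
import numpy as np

def split_indices(n_samples, test_fraction=0.2, seed=9):
    """Shuffle indices 0..n_samples-1 and split them into (train, test)."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    # Test size is rounded up; the remainder goes to training
    n_test = math.ceil(test_fraction * n_samples)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(10262)
print(len(train_idx), len(test_idx))  # 8209 2053
```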


In [9]:
# Now we need to truncate and pad our claim text sequences - we have already restricted our claims to length 250
# We might want to experiment with changing this
max_word_length = 250
# Padding is performed by adding 0, which we have reserved as a PAD token above
X_train = sequence.pad_sequences(X_train, maxlen=max_word_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_word_length)
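
To see what this padding does, here is a small NumPy sketch that mimics the default behaviour of `pad_sequences`: zeros are added at the front of short sequences, and sequences that are too long lose tokens from the front. `pad_pre` is a name made up for this illustration, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, row stays all padding
        trimmed = seq[-maxlen:]  # keep the last `maxlen` tokens
        padded[row, -len(trimmed):] = trimmed
    return padded

example = pad_pre([[16, 9, 3], [5, 6, 7, 8, 9, 10]], maxlen=5)
print(example)
# [[ 0  0 16  9  3]
#  [ 6  7  8  9 10]]
```

This is why the padded claim printed in the next cell starts with a long run of zeros.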

In [10]:
X_train[1]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,    16,     9,    11,   564,
           5,    15,    18,     3,  1076,    36,    19,    80,    12,
           3,   319,   382,     5,    12,    36,    17,    26,    22,
          39,   808,

In [11]:
no_classes = len(class_dictionary)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)
print("There are {0} classes".format(no_classes))

There are 8 classes


In [12]:
Y_train.shape

(8209,)

In [13]:
# Convert labels to categorical one-hot encoding
Y_train = to_categorical(Y_train, num_classes=no_classes)
Y_test = to_categorical(Y_test, num_classes=no_classes)

In [14]:
Y_train[0]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.])

In [15]:
Y_train.shape

(8209, 8)
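
For reference, the one-hot conversion can be sketched in plain NumPy as below. `one_hot` is an illustrative stand-in for `to_categorical`, and `argmax` recovers the original integer labels.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label index."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([0, 0, 7, 7, 2], num_classes=8)
print(encoded.shape)   # (5, 8)
recovered = encoded.argmax(axis=1)
print(recovered)       # [0 0 7 7 2]
```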

In [18]:
# Build the model: embedding -> LSTM (with dropout) -> softmax over the 8 classes
embedding_vector_length = 128
vocabulary_size = 25000
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length, input_length=max_word_length))
# Alternative tried earlier: separate Dropout layers before and after the LSTM
# model.add(Dropout(0.2))
# dropout acts on the LSTM inputs, recurrent_dropout on the recurrent state
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dropout(0.2))
model.add(Dense(no_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
# Note: the test set doubles as the validation set here
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=30, batch_size=64)
model.save('claim_class_lstm_drop.h5')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 250, 128)          3200000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               91600     
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 808       
Total params: 3,292,408
Trainable params: 3,292,408
Non-trainable params: 0
_________________________________________________________________
None
Train on 8209 samples, validate on 2053 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoc

Here dropout is applied directly inside the LSTM layer, using this format: ```model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))```

With separate dropout layers (rate 0.2) before and after the LSTM, after 6 epochs:
```
Epoch 6/6
8209/8209 [==============================] - 506s - loss: 0.3639 - acc: 0.8811 - val_loss: 1.6160 - val_acc: 0.5158
```
With dropout on the LSTM inputs and recurrent connections (`dropout` and `recurrent_dropout`), after 30 epochs:
```
Epoch 30/30
8209/8209 [==============================] - 395s - loss: 0.0120 - acc: 0.9966 - val_loss: 3.8682 - val_acc: 0.4749
```
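Validation accuracy is about the same in both cases (roughly 47-52%), while training accuracy climbs to 88% and then to over 99%, so the model is overfitting heavily either way. We need to look at the confusion matrix to see where the misclassifications occur.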
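
A confusion matrix can be built with a few lines of NumPy. The sketch below uses made-up labels purely for illustration; in this notebook the true labels would come from `Y_test.argmax(axis=1)` and the predictions from `model.predict(X_test).argmax(axis=1)`. The helper name `confusion_matrix` is my own here (scikit-learn ships a function of the same name).

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true, predicted] += 1
    return matrix

# Made-up labels for illustration only, not results from the model
y_true = np.array([0, 0, 7, 7, 2, 6])
y_pred = np.array([0, 6, 7, 6, 2, 6])

cm = confusion_matrix(y_true, y_pred, num_classes=8)
# Per-class recall: correct predictions divided by the number of true examples
per_class_recall = cm.diagonal() / np.maximum(cm.sum(axis=1), 1)
print(cm)
print(per_class_recall)
```

Off-diagonal entries show which sections get confused with each other; in the toy data above, one A claim and one H claim are both mistaken for G.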

In [51]:
# evaluate the model
scores = model.evaluate(X_test, Y_test)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 51.10%


The problem with this approach is that the embedding layer quickly overfits to the data: it holds 3,200,000 of the model's 3,292,408 parameters, and we only have 8,209 training claims.
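
The parameter counts in the model summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the function names are just for this example.

```python
def embedding_params(vocabulary_size, embedding_dim):
    # One vector of length embedding_dim per vocabulary entry
    return vocabulary_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

embedding = embedding_params(25000, 128)   # 3,200,000
lstm = lstm_params(128, 100)               # 91,600
dense = dense_params(100, 8)               # 808
total = embedding + lstm + dense           # 3,292,408

print(embedding, lstm, dense, total)
print("Embedding share: {:.1%}".format(embedding / total))
```

The embedding table accounts for about 97% of the trainable parameters, which is consistent with it being the part of the model most prone to overfitting.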