# Classifying Claims - A Worked Example

In this post we will look at manually converting patent claim text into a numeric form that can be fed to a text classifier.

We will be using USPTO data, where the claims are classified according to the International Patent Classification (IPC). To keep things simple we will use the first letter of the IPC (top level category). This is the same as the top level of the Cooperative Patent Classification (CPC).  

The list of top level categories can be found here: https://rs.espacenet.com/help?locale=en_EP&method=handleHelpTopic&topic=ipc:
* A Human Necessities
* B Performing Operations; Transporting
* C Chemistry; Metallurgy
* D Textiles; Paper
* E Fixed Constructions
* F Mechanical Engineering; Lighting; Heating; Weapons; Blasting
* G Physics
* H Electricity
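
Reading the top-level category from a full classification symbol amounts to taking its first letter. A minimal sketch, assuming symbols are written in the usual form such as `G06F 17/30`; the `SECTIONS` table and the `top_level_section` helper are illustrative names and are not part of `patentdata`:

```python
# Section letters and titles; the table name is an illustrative choice.
SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}


def top_level_section(symbol):
    """Return the section letter (A-H) of a symbol such as 'G06F 17/30'."""
    # Assumption: the section letter is the first non-space character.
    letter = symbol.strip()[:1].upper()
    if letter not in SECTIONS:
        raise ValueError("Not an IPC/CPC section: {0!r}".format(symbol))
    return letter


section = top_level_section("G06F 17/30")
print(section, "-", SECTIONS[section])
```

Symbols stored in another layout (for example with a leading office code) would need the index adjusted accordingly.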

---
## Getting Our Data

Most of the task revolves around getting our data and placing it in a form where we can apply common machine learning libraries.  

In [1]:
# Start with getting 12,000 patent publications at random
from patentdata.corpus import USPublications
from patentdata.models.patentcorpus import LazyPatentCorpus
import os, pickle

from nltk import word_tokenize

from collections import Counter

We have some functions that provide a wrapper around USPTO patent publication data as obtained from the bulk data download page (https://www.uspto.gov/learning-and-resources/bulk-data-products). This code can be found on GitHub - https://github.com/benhoyle/patentdata.

The code below gets 12,000 random patent publications from data for years 2001 to 2017 across all classifications. From each document the text for claim 1 and the classifications are extracted.

In [2]:
# Get the claim 1 text and the classification text

# Use separate files so the extracted data does not overwrite the record list
DATA_PIK = "claim_and_class.data"
RECORDS_PIK = "12000records.data"

if os.path.isfile(DATA_PIK):
    with open(DATA_PIK, "rb") as f:
        print("Loading data")
        data = pickle.load(f)
        print("{0} claims and classifications loaded".format(len(data)))
else:
    # The data source is needed below even when the record list is cached
    path = '/media/SAMSUNG1/Patent_Downloads'
    ds = USPublications(path)

    # Load our list of records
    if os.path.isfile(RECORDS_PIK):
        with open(RECORDS_PIK, "rb") as f:
            print("Loading data")
            records = pickle.load(f)
            print("{0} records loaded".format(len(records)))
    else:
        records = ds.get_records([], "name", sample_size=12000)
        with open(RECORDS_PIK, "wb") as f:
            pickle.dump(records, f)
            print("{0} records saved".format(len(records)))

    lzy = LazyPatentCorpus()
    lzy.init_by_filenames(ds, records)

    data = list()
    for i, doc in enumerate(lzy.documents):
        # Fall back to empty values if a document cannot be parsed
        try:
            classifications = [c.as_string() for c in doc.classifications]
        except Exception:
            classifications = ""
        try:
            claim1_text = doc.claimset.get_claim(1).text
        except Exception:
            claim1_text = ""
        current_data = (claim1_text, classifications)
        data.append(current_data)
        if (i % 500) == 0:
            print("Saving a checkpoint at {0} files".format(i))
            print("Current data = ", current_data)
            with open(DATA_PIK, "wb") as f:
                pickle.dump(data, f)

    with open(DATA_PIK, "wb") as f:
        pickle.dump(data, f)

    print("{0} claims saved".format(len(data)))

Loading data
12000 claims and classifications loaded
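
The load-or-build pattern used in the cell above recurs throughout this post. It can be factored into a small helper; the sketch below is one possible shape, and the name `load_or_build` is hypothetical rather than something provided by `patentdata`:

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Load a pickled object from path, or build it with build_fn and cache it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy usage in a temporary directory: the second call reads the cached file
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy.data")
    first = load_or_build(cache_path, lambda: [("claim text", ["G"])])
    second = load_or_build(cache_path, lambda: [])
    print(first == second)
```

Giving each cached object its own path avoids one stage overwriting the cache of another.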


We then get rid of claims which are cancelled (these just have "(canceled)" as text).

In [3]:
# Check for and remove 'cancelled' claims
no_cancelled = [d for d in data if '(canceled)' not in d[0]]

print("There are now {0} claims after filtering out cancelled claims".format(len(no_cancelled)))

There are now 11239 claims after filtering out cancelled claims


Then we filter the data to extract the first level classification. For simplicity we take the first classification (if there are several classifications) where the data exists. An example data entry is set out below.

In [4]:
# Get classification in the form of A-H
cleaner_data = list()
for d in no_cancelled:
    if len(d[1]) >= 1:
        if len(d[1][0]) > 3:
            classification = d[1][0][2]
            cleaner_data.append(
                (d[0], classification)
            )

print("There are now {0} claims after extracting classifications".format(len(cleaner_data)))
cleaner_data[55]

There are now 11238 claims after extracting classifications


('\n1. A sensing assembly for sensing a level of liquid in a reservoir, said sensing assembly comprising: \na first input port for receiving a first input voltage signal; \na second input port for receiving a second input voltage signal; \nan excitation circuit electrically connected to said first and second input ports for receiving the first and second input voltage signals and for generating a first excitation signal and a second excitation signal, said excitation circuit includes first and second excitation electrodes extending along a portion of the reservoir, said first and second excitation electrodes disposed adjacent to and separated by said first receiving electrode; and \na receiving circuit disposed adjacent said excitation circuit defining a variable capacitance with said excitation circuit, wherein said receiving circuit includes first and second receiving electrodes extending along a portion of the reservoir and a first trace connected to ground and extending between sai

It is good to start by getting a feel for how the classifications are distributed across the dataset.

In [5]:
# Let's check how our data is distributed across the classes
class_count = Counter([d[1] for d in cleaner_data])
class_count

Counter({'A': 1777,
         'B': 1449,
         'C': 865,
         'D': 54,
         'E': 269,
         'F': 735,
         'G': 3335,
         'H': 2754})

What is quite interesting is that the data is mainly clustered around classes A, B, G and H. Classes C, D, E and F have far fewer associated claims; class D has only 54.
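
Because the classes are unbalanced, it helps to know what accuracy a trivial classifier would achieve by always guessing the largest class. The sketch below copies the counts from the output above; note that later filtering by claim length changes these counts slightly, so the figure is only a rough reference point.

```python
from collections import Counter

# Counts copied from the Counter output above
class_count = Counter({'A': 1777, 'B': 1449, 'C': 865, 'D': 54,
                       'E': 269, 'F': 735, 'G': 3335, 'H': 2754})

total = sum(class_count.values())

# Share of the dataset held by each class
for label, n in sorted(class_count.items()):
    print("{0}: {1:.1%}".format(label, n / total))

# Accuracy of always predicting the most common class
majority_label, majority_n = class_count.most_common(1)[0]
baseline = majority_n / total
print("Always guessing {0} scores {1:.1%}".format(majority_label, baseline))
```

Any model we train should comfortably beat this baseline before we read much into its accuracy.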

In [6]:
# Clean the characters in the data to use a reduced set of printable characters
# There is a function in patentdata to do this
from patentdata.models.lib.utils import clean_characters

cleaner_data = [(clean_characters(d[0]), d[1]) for d in cleaner_data]

In [5]:
raw_data = cleaner_data
with open("raw_data.pkl", "wb") as f:
    pickle.dump(raw_data, f)

## Tokenising into Words and Building a Dictionary

Now we need to split our claim text into word tokens and build a dictionary of words.

We will start by using NLTK word_tokenize out of the box. We will assume some errors may creep in.

In [8]:
data_in_words = [(word_tokenize(d[0]), d[1]) for d in cleaner_data]

In [9]:
data_in_words[55]

(['1',
  '.',
  'A',
  'sensing',
  'assembly',
  'for',
  'sensing',
  'a',
  'level',
  'of',
  'liquid',
  'in',
  'a',
  'reservoir',
  ',',
  'said',
  'sensing',
  'assembly',
  'comprising',
  ':',
  'a',
  'first',
  'input',
  'port',
  'for',
  'receiving',
  'a',
  'first',
  'input',
  'voltage',
  'signal',
  ';',
  'a',
  'second',
  'input',
  'port',
  'for',
  'receiving',
  'a',
  'second',
  'input',
  'voltage',
  'signal',
  ';',
  'an',
  'excitation',
  'circuit',
  'electrically',
  'connected',
  'to',
  'said',
  'first',
  'and',
  'second',
  'input',
  'ports',
  'for',
  'receiving',
  'the',
  'first',
  'and',
  'second',
  'input',
  'voltage',
  'signals',
  'and',
  'for',
  'generating',
  'a',
  'first',
  'excitation',
  'signal',
  'and',
  'a',
  'second',
  'excitation',
  'signal',
  ',',
  'said',
  'excitation',
  'circuit',
  'includes',
  'first',
  'and',
  'second',
  'excitation',
  'electrodes',
  'extending',
  'along',
  'a',
  'porti

In [10]:
# What is our maximum claim length?
print("Max claim length = {0}".format(max([len(d[0]) for d in data_in_words])))

Max claim length = 6134


In [11]:
claim_length_counter = Counter([len(d[0]) for d in data_in_words])
claim_length_counter

Counter({3: 2,
         4: 3,
         5: 23,
         6: 9,
         7: 1,
         10: 2,
         11: 3,
         12: 6,
         13: 3,
         14: 5,
         15: 7,
         16: 8,
         17: 7,
         18: 12,
         19: 16,
         20: 12,
         21: 16,
         22: 12,
         23: 18,
         24: 14,
         25: 17,
         26: 19,
         27: 19,
         28: 23,
         29: 25,
         30: 26,
         31: 19,
         32: 21,
         33: 22,
         34: 21,
         35: 29,
         36: 31,
         37: 29,
         38: 31,
         39: 26,
         40: 36,
         41: 40,
         42: 34,
         43: 35,
         44: 49,
         45: 39,
         46: 28,
         47: 30,
         48: 33,
         49: 44,
         50: 50,
         51: 38,
         52: 43,
         53: 56,
         54: 54,
         55: 55,
         56: 58,
         57: 65,
         58: 50,
         59: 73,
         60: 58,
         61: 76,
         62: 86,
         63: 55,
         64: 6

In [12]:
# We might want to filter all claims over 250 tokens long
filtered_data = [d for d in data_in_words if len(d[0]) <= 250]
print("There are now {0} claims after removing long claims".format(len(filtered_data)))

There are now 10262 claims after removing long claims
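
The 250-token cutoff keeps 10262 of the 11238 claims, roughly 91%. To compare candidate cutoffs before committing to one, a small helper like the following could be used. This is a sketch on made-up token counts; `fraction_within` is an illustrative name, and in the notebook the lengths would come from `[len(d[0]) for d in data_in_words]`.

```python
def fraction_within(lengths, cutoff):
    """Share of claims whose token count is at most cutoff."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    kept = sum(1 for n in lengths if n <= cutoff)
    return kept / len(lengths)


# Toy token counts for illustration only, not taken from the dataset
toy_lengths = [40, 85, 120, 180, 240, 260, 900, 6134]

for cutoff in (100, 250, 500):
    share = fraction_within(toy_lengths, cutoff)
    print("cutoff {0}: {1:.1%} of claims kept".format(cutoff, share))
```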


In [13]:
# Let's clear some memory
del data_in_words, cleaner_data, no_cancelled

In [14]:
# Look at the words in our claims
# Count tokens with a generator rather than concatenating every list with sum()
word_counter = Counter(word for d in filtered_data for word in d[0])
print("There are {0} different tokens in our dataset.\n".format(len(word_counter)))
print("The 50 most common words are:", word_counter.most_common(50))

There are 26585 different tokens in our dataset.

The 50 most common words are: [('the', 80735), ('a', 69013), ('of', 42528), (',', 39512), ('and', 33553), ('to', 29191), (';', 23056), ('.', 20649), ('said', 16925), ('an', 14311), ('in', 14166), ('for', 12766), ('first', 12460), ('comprising', 11484), ('1', 10922), ('at', 9921), (':', 9633), ('is', 9629), (')', 9320), ('second', 9024), ('one', 8572), ('A', 8283), ('(', 7875), ('with', 7604), ('from', 7384), ('least', 7112), ('on', 6544), ('by', 6284), ('wherein', 6275), ('or', 5983), ('that', 5768), ('having', 5422), ('device', 5307), ('data', 4805), ('plurality', 4546), ('which', 4397), ('method', 4339), ('each', 3564), ('surface', 3559), ('portion', 3551), ('signal', 3506), ('system', 3188), ('unit', 3174), ('layer', 3091), ('being', 2973), ('between', 2852), ('information', 2783), ('configured', 2691), ('image', 2640), ('end', 2582)]


In [15]:
print("The 100 least common words are:", word_counter.most_common()[:-100-1:-1] )

The 100 least common words are: [('de-activate', 1), ('rubber-type', 1), ('0509237', 1), ('press-molding', 1), ('Plus', 1), ('Inter-integrated', 1), ('bar-like', 1), ('re-emission', 1), ('thereshold', 1), ('unnecessary', 1), ('Invaplex', 1), ('FMS-like', 1), ('chines', 1), ('expectations', 1), ('missed', 1), ('opener', 1), ('post-crosslinking', 1), ('xml', 1), ('poxvirus', 1), ('canted', 1), ('fingerprinting', 1), ('microcapillary', 1), ('xy', 1), ('Al-Cu', 1), ('2nd', 1), ('Shift-1', 1), ('pseudo-NMOS', 1), ('mediation', 1), ('trash', 1), ('anti-microbial', 1), ('phenylsulfonyl', 1), ('-COO', 1), ('ketoaldonic', 1), ('counter-rotating', 1), ('chisel-shaped', 1), ('2-hour', 1), ('encourage', 1), ('Machine-readable', 1), ('diffusive', 1), ('replaceablely', 1), ('hematological', 1), ('integrally-formed', 1), ('gm/9000', 1), ('tails', 1), ('R3_NO2', 1), ('non-reducing', 1), ('diffusible', 1), ('boot-toe', 1), ('judge', 1), ('electrode-side', 1), ('autopilot', 1), ('condenser-evaporator', 

In [16]:
with open("word_counter.pkl", "wb") as f:
    pickle.dump(word_counter, f)
with open("filtered_data.pkl", "wb") as f:
    pickle.dump(filtered_data, f)

In [17]:
X_text = [['a', 'b'], ['a', 'a'], ['c']]
c = Counter(sum(X_text, list()))
c.most_common(10)

[('a', 3), ('b', 1), ('c', 1)]

In [18]:
Y_text = [d[1] for d in raw_data]
    
class_dictionary = {word: i for i, word in enumerate(sorted(Counter(Y_text).keys()))} 
class_dictionary

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7}

In [19]:
# Now bringing this all together a little

def build_dataset(raw_data, length_filter=250, vocabulary_size=25000):
    """ Convert raw data in the form of (claim_text, class) into 
    X, Y format where X is numeric data and Y is label data.
    
    length_filter is a parameter to filter by token length 
    (e.g. to exclude long claims)."""
    # Tokenise and filter
    raw_data = [(word_tokenize(d[0]), d[1]) for d in raw_data]
    raw_data = [d for d in raw_data if len(d[0]) <= length_filter]
    
    # Create labels first
    Y_text = [d[1] for d in raw_data]
    class_dictionary = {c: i for i, c in enumerate(sorted(Counter(Y_text).keys()))} 
    Y_data = [class_dictionary[y_text] for y_text in Y_text]
    reverse_class_dictionary = dict(zip(class_dictionary.values(), class_dictionary.keys()))
    
    # Create X in text
    X_text = [d[0] for d in raw_data]
    
    # Change to lowercase
    X_text = [[word.lower() for word in x_text] for x_text in X_text]
    
    # Create complete wordset for dictionary generation
    words = sum(X_text, list())
    # Reserve slots for PAD and UNK tokens
    count = [('PAD', 0), ['UNK', -1]]
    count.extend(Counter(words).most_common(vocabulary_size - 2))
    
    # Build dictionary
    word_dictionary = {word: i for i, (word, _) in enumerate(count)}
        
    # Build X in indexes
    X_data = list()
    unk_count = 0
    # Go through claims replacing words with index 
    for x_text in X_text:
        x_data = list()
        for word in x_text:
            if word in word_dictionary:
                index = word_dictionary[word]
            else:
                index = 1  # dictionary['UNK']
                unk_count = unk_count + 1
            x_data.append(index)
        X_data.append(x_data)
        
    count[1][1] = unk_count
    reverse_word_dictionary = dict(zip(word_dictionary.values(), word_dictionary.keys())) 
    
    return X_data, Y_data, count, word_dictionary, reverse_word_dictionary, class_dictionary, reverse_class_dictionary
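
To see the dictionary logic of `build_dataset` in isolation, here is a cut-down sketch that runs on toy, pre-tokenised claims without NLTK. The helpers `build_vocab` and `encode` are illustrative names, not part of the notebook, and they assume lower-cased tokens so that no real word collides with the reserved `PAD` and `UNK` entries.

```python
from collections import Counter


def build_vocab(tokenised_claims, vocabulary_size):
    """Map the most frequent words to indices, reserving 0 for PAD and 1 for UNK."""
    counts = Counter(word for claim in tokenised_claims for word in claim)
    vocab = {"PAD": 0, "UNK": 1}
    # The two reserved slots count towards the vocabulary size
    for word, _ in counts.most_common(vocabulary_size - 2):
        vocab[word] = len(vocab)
    return vocab


def encode(claim, vocab):
    """Replace each word with its index, using the UNK index for unseen words."""
    return [vocab.get(word, vocab["UNK"]) for word in claim]


toy_claims = [
    ["a", "method", "of", "sensing"],
    ["a", "device", "for", "sensing", "a", "level"],
]
vocab = build_vocab(toy_claims, vocabulary_size=4)
print(vocab)
print(encode(toy_claims[0], vocab))
```

With a vocabulary of four, only the two most frequent words get their own index; everything else maps to `UNK`.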

In [20]:
filename = "X_Y_data.pkl"

if os.path.isfile(filename):
    with open(filename, "rb") as f:
        print("Loading data")
        X_data, Y_data, count, word_dictionary, reverse_word_dictionary, class_dictionary, reverse_class_dictionary = pickle.load(f)
else:
    X_data, Y_data, count, word_dictionary, reverse_word_dictionary, class_dictionary, reverse_class_dictionary = build_dataset(raw_data)
    print('Most common words (+UNK)', count[:5])
    print('Sample data', X_data[0], Y_data[0])
    with open(filename, "wb") as f:
        data = (X_data, Y_data, count, word_dictionary, reverse_word_dictionary, class_dictionary, reverse_class_dictionary)
        pickle.dump(data, f)

Loading data


In [21]:
len(word_dictionary)

25000

In [22]:
word_dictionary["the"]

2

In [23]:
word_dictionary["computer"]

143

In [24]:
class_dictionary

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7}

In [25]:
reverse_class_dictionary

{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H'}

In [26]:
Y_data[0:5]

[0, 0, 7, 7, 2]

In [27]:
# Check vectors are the same length
print(len(X_data), len(Y_data))

10262 10262
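
A quick sanity check on the encoded data is to map a claim back into words with the reverse dictionary. The sketch below uses a toy reverse dictionary; `decode` is a hypothetical helper, and in the notebook it would be called as `decode(X_data[0], reverse_word_dictionary)`.

```python
def decode(indices, reverse_word_dictionary, pad_index=0):
    """Turn a list of word indices back into a readable string, dropping padding."""
    words = [
        reverse_word_dictionary.get(i, "UNK")
        for i in indices
        if i != pad_index
    ]
    return " ".join(words)


# Toy reverse dictionary for illustration only
toy_reverse = {0: "PAD", 1: "UNK", 2: "the", 3: "a", 4: "sensing", 5: "assembly"}
print(decode([0, 0, 3, 4, 5, 1], toy_reverse))
```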


## Applying Sequence Classification with Keras

Working from this post - https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ - we can then apply an LSTM followed by a single dense layer.

Documentation for padding - https://keras.io/preprocessing/sequence/#pad_sequences . Padding fills sequences with 0, so index 0 must not stand for a real word; this is why `PAD` was given index 0 when we built the dictionary.

Also here is the Keras guide to sequential classification: https://keras.io/getting-started/sequential-model-guide/.

The post here explains how to split data into training / test using Keras: https://gogul09.github.io/software/first-neural-network-keras.

See here for a bidirectional LSTM: https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py.

In [28]:
# First we need to split our data into training and test data - go for 80:20
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

# seed for reproducing same results
seed = 9
np.random.seed(seed)

# split the data into training (80%) and testing (20%)
(X_train, X_test, Y_train, Y_test) = train_test_split(X_data, Y_data, test_size=0.2, random_state=seed)

Using TensorFlow backend.


In [29]:
print("Our training data has length: {0} and our test data has length: {1}".format(len(X_train), len(X_test)))

Our training data has length: 8209 and our test data has length: 2053


In [30]:
# Now we need to segment and pad our claim text sequences - we have already restricted our claims to length 250
# We might want to experiment with changing this
max_word_length = 250
# Padding is performed by adding 0, which we have reserved as a PAD token above
X_train = sequence.pad_sequences(X_train, maxlen=max_word_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_word_length)
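
To make the effect of padding concrete, the NumPy sketch below mimics what I understand to be the default behaviour of `pad_sequences`: zeros are added at the front of short sequences, and sequences that are too long lose items from the front. The helper `pad_pre` is an illustrative stand-in, not the Keras implementation.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep the last maxlen items
        if len(trimmed) == 0:
            continue  # an empty sequence stays all padding
        padded[row, -len(trimmed):] = trimmed
    return padded


toy_sequences = [[5, 6, 7], [1, 2, 3, 4, 5, 6], []]
print(pad_pre(toy_sequences, maxlen=5))
```

Front padding puts the real tokens at the end of the sequence, immediately before the LSTM produces its final state.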

In [31]:
X_train[1]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,    16,     9,    11,   564,
           5,    15,    18,     3,  1076,    36,    19,    80,    12,
           3,   319,   382,     5,    12,    36,    17,    26,    22,
          39,   808,

In [32]:
no_classes = len(class_dictionary)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)
print("There are {0} classes".format(no_classes))

There are 8 classes


In [33]:
Y_train.shape

(8209,)

In [34]:
# Convert labels to categorical one-hot encoding
Y_train = to_categorical(Y_train, num_classes=no_classes)
Y_test = to_categorical(Y_test, num_classes=no_classes)

In [35]:
Y_train[0]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.])

In [36]:
Y_train.shape

(8209, 8)
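
One-hot encoding replaces each integer label with a vector that is 1 at the label's index and 0 elsewhere. A minimal NumPy sketch of the same idea, using the first five labels shown earlier (`one_hot` is an illustrative helper, not the Keras function):

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]


encoded = one_hot([0, 0, 7, 7, 2], num_classes=8)
print(encoded.shape)
print(encoded.argmax(axis=1))  # argmax recovers the integer labels
```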

In [22]:
# Now building our model
embedding_vector_length = 128
vocabulary_size = 25000
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length, input_length=max_word_length))
model.add(LSTM(100))
model.add(Dense(no_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=30, batch_size=64)
model.save('claim_class_lstm.h5')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 250, 128)          3200000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               91600     
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 808       
Total params: 3,292,408
Trainable params: 3,292,408
Non-trainable params: 0
_________________________________________________________________
None
Train on 8209 samples, validate on 2053 samples
Epoch 1/30
...
Epoch 29/30
...

It looks like we are overfitting. The embedding layer is the likely culprit: it has 3.2 million parameters, while we only have 8,209 training claims.
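
The parameter counts in the model summary can be reproduced by hand, which shows how heavily the embedding layer dominates. The arithmetic below uses the standard formulas for an embedding table, an LSTM with four gates, and a dense layer with biases; the variable names are my own.

```python
vocabulary_size = 25000
embedding_dim = 128
lstm_units = 100
no_classes = 8

# One vector of length embedding_dim per vocabulary entry
embedding_params = vocabulary_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Weights plus one bias per output class
dense_params = lstm_units * no_classes + no_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
print("Embedding share: {0:.1%}".format(embedding_params / total_params))
```

Shrinking the vocabulary or the embedding dimension is therefore the most direct way to reduce the model's capacity.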

Let's try with some dropout.  

In [51]:
# evaluate the model
scores = model.evaluate(X_test, Y_test)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 51.10%
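
A single accuracy figure hides how the model does on the smaller classes. A confusion matrix and per-class recall give a fuller picture. The sketch below runs on toy labels rather than model output; in the notebook the inputs would be `Y_test.argmax(axis=1)` and `model.predict(X_test).argmax(axis=1)`. Both helper names are illustrative.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    totals = matrix.sum(axis=1)
    # Classes absent from the test set get a recall of 0 rather than NaN
    return np.divide(matrix.diagonal(), totals,
                     out=np.zeros(len(totals)), where=totals > 0)


# Toy labels for three classes, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(per_class_recall(cm))
```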


OK. We need to install h5py to save the model in HDF5 format. Let's save in JSON for now and add h5py to our Dockerfile. See here for more details - https://machinelearningmastery.com/save-load-keras-deep-learning-models/ .

## Applying Multi-Layer Perceptron to Claim Data

See this example here - https://github.com/fchollet/keras/blob/master/examples/reuters_mlp.py .