# Claim2Title

This notebook will investigate whether I can use a sequence-to-sequence model to generate a patent specification title from patent claim text.

## Getting the Data

First we need a source of, say, ~10,000 titles and claims. We'll concentrate on class G06 (computing), as crossing the streams of chemistry and computing results in some funky chimeras.

In [1]:
# imports
from patentdata.corpus import USPublications
from patentdata.models.patentcorpus import LazyPatentCorpus
import os, pickle

In [3]:
# Get the claim 1 and title text

PIK = "claim_and_title.data"

if os.path.isfile(PIK):
    with open(PIK, "rb") as f:
        print("Loading data")
        data = pickle.load(f)
        print("{0} claims and titles loaded".format(len(data)))
else:
    # Load our list of G06 records. These are cached under their own filename
    # so that the claim/title pickle (PIK) is not overwritten by mistake.
    RECORDS_PIK = "12000recordsG06.data"

    # The data source is needed below whether or not the records are cached
    path = '/media/SAMSUNG1/Patent_Downloads'
    ds = USPublications(path)

    if os.path.isfile(RECORDS_PIK):
        with open(RECORDS_PIK, "rb") as f:
            print("Loading data")
            records = pickle.load(f)
            print("{0} records loaded".format(len(records)))
    else:
        records = ds.get_records(["G", "06"], "name", sample_size=12000)
        with open(RECORDS_PIK, "wb") as f:
            pickle.dump(records, f)
            print("{0} records saved".format(len(records)))
    
    lzy = LazyPatentCorpus()
    lzy.init_by_filenames(ds, records)
    
    data = list()
    for i, doc in enumerate(lzy.documents):
        # Fall back to empty strings when a document lacks a title or claim 1
        try:
            title = doc.title
        except Exception:
            title = ""
        try:
            claim1_text = doc.claimset.get_claim(1).text
        except Exception:
            claim1_text = ""
        current_data = (claim1_text, title)
        data.append(current_data)
        if (i % 500) == 0:
            print("Saving a checkpoint at {0} files".format(i))
            print("Current data = ", current_data)
            with open(PIK, "wb") as f:
                pickle.dump(data, f)
            
    with open(PIK, "wb") as f:
        pickle.dump(data, f)
        
    print("{0} claims saved".format(len(data)))

12000 records saved
Saving a checkpoint at 0 files
Current data =  ('\n1. A mail management method for retrieving and adding e-mail messages to an existing business software application database, comprising: \nscanning a header portion of each message to locate an identification; \ncomparing the identification with a plurality of identifications stored in a business software application database to identify a matching identification; \nadding the message with the matching identification into the business software application database wherein the message is associated with the matching identification; and \ncreating a Task associated with the message that is linked to the business software application database. \n\n', 'Mail management system and method')
Saving a checkpoint at 500 files
Current data =  ('\n1. A method for downloading a drug library from a medication management unit to a medical infusion device having a primary memory and an existing drug library stored in the primary me

Saving a checkpoint at 4000 files
Current data =  ('\n1. An image enhancement method using face detection algorithms, comprising: \nautomatically detecting human faces in an image using face detection algorithms; \nautomatically locating the human faces in the image; and \nautomatically enhancing an appearance of the image based on the human faces in the image. \n\n', 'Image enhancement using face detection')
Saving a checkpoint at 4500 files
Current data =  ('\n1. A method for reducing energy consumption in manufacturing processes comprising the steps of:\n(a) generating a computer simulation of the manufacturing process defining material input and output streams and energy input and output flows;\n(b) providing computer readable rules associated with the computer simulation defining constraints on changes in at least one of the material input and output streams and energy input and output flows;\n(c) executing a program on an electronic computer to apply a series of scripts to data o

Saving a checkpoint at 9500 files
Current data =  ('\n1. A system for providing zero latency risk monitoring comprising:\n(i) a component for receiving or generating an order;\n(ii) a gateway for communicating an order into a market, based on the received or generated order;\n(iii) an out-of-band risk monitoring computer for monitoring the processing of the received or generated order in real time, calculating a risk parameter corresponding to the received or generated order, and updating the risk parameter; and\n(iv) a kill-switch positioned in-line between said component and said gateway and coupled to said computer, for preventing orders from reaching the market based on the updated risk parameter.\n\n', 'ZERO-LATENCY RISK-MANAGEMENT SYSTEM AND METHOD')
Saving a checkpoint at 10000 files
Current data =  ('\n1. A circuit of an electronic device, comprising:\na Southbridge chip comprising a power good pin, a signal on the power good pin of the chip related to a computer power up/down 

### Breaking Up Claims

Studies of human reading comprehension suggest that sentences of 8-15 words are easily understood; beyond that, comprehension quickly decreases. (See here: http://prsay.prsa.org/2009/01/14/how-to-make-your-copy-more-readable-make-sentences-shorter/)

This suggests breaking our claims into chunks of ~8-15 words and then combining the outputs for those chunks. We can break on punctuation, e.g. comma clauses (but not lists) and semi-colons.

We can start with our hacky split_into_features method in the patentdata library.
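
To make the chunking idea concrete, here is a minimal sketch in plain Python. It is not the patentdata `split_into_features` method (that implementation isn't shown here); `split_claim_into_chunks` and its `max_words` parameter are hypothetical names used only for illustration. It splits on semi-colons, colons and newlines, then greedily regroups comma-separated parts up to the word limit. It makes no attempt to tell comma clauses from lists, and a single comma-free part longer than the limit is left whole.

```python
import re


def split_claim_into_chunks(claim_text, max_words=15):
    """Split claim text into chunks of roughly max_words words or fewer.

    Hypothetical helper for illustration; not part of patentdata.
    """
    # Drop the leading claim number, e.g. "1. "
    text = re.sub(r"^\s*\d+\s*\.\s*", "", claim_text)

    # Split on hard boundaries: semi-colons, colons and newlines
    clauses = []
    for piece in re.split(r"[;:\n]+", text):
        # Remove a dangling connective left over from "...; and"
        piece = re.sub(r"^\s*(?:and|or)\b", "", piece).strip()
        if piece:
            clauses.append(piece)

    chunks = []
    for clause in clauses:
        current = []
        # Greedily merge comma-separated parts until the word limit is hit
        # (commas themselves are dropped; lists are not detected)
        for part in clause.split(","):
            words = part.split()
            if current and len(current) + len(words) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.extend(words)
        if current:
            chunks.append(" ".join(current))
    return chunks


# Claim text taken from the checkpoint output above
sample_claim = (
    "\n1. An image enhancement method using face detection algorithms, "
    "comprising: \nautomatically detecting human faces in an image using "
    "face detection algorithms; \nautomatically locating the human faces in "
    "the image; and \nautomatically enhancing an appearance of the image "
    "based on the human faces in the image. \n\n"
)

for chunk in split_claim_into_chunks(sample_claim):
    print(len(chunk.split()), chunk)
```

For this claim each printed chunk stays within the 15-word limit, but claims with long comma-free clauses would need a further fallback, such as splitting on a fixed word count.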

Can we pass chunks through an RNN encoder and then combine the encoded chunks? Concatenation would not work, as the number of chunks varies between claims. Averaging or summing the encoded vectors may work, since either gives a fixed-size vector.
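
As a rough illustration of why pooling copes with a variable number of chunks, the sketch below runs a toy tanh RNN (written in NumPy, with random untrained weights) over each chunk and then averages or sums the final hidden states. Everything here is an assumption made for illustration: the dimensions, the `encode_chunk` and `encode_claim` names, and the integer token ids. A real model would use trained Keras layers instead.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 50, 8, 16  # arbitrary toy sizes

# Untrained, randomly initialised parameters: illustration only
embeddings = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))
W_in = rng.normal(scale=0.1, size=(EMBED_DIM, HIDDEN_DIM))
W_rec = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))


def encode_chunk(token_ids):
    """Run a simple tanh RNN over one chunk; return the final hidden state."""
    h = np.zeros(HIDDEN_DIM)
    for token_id in token_ids:
        h = np.tanh(embeddings[token_id] @ W_in + h @ W_rec)
    return h


def encode_claim(chunks, combine="mean"):
    """Encode each chunk separately, then pool into one fixed-size vector."""
    states = np.stack([encode_chunk(chunk) for chunk in chunks])
    if combine == "mean":
        return states.mean(axis=0)
    return states.sum(axis=0)


# Two claims with different numbers of chunks (token ids are made up)
short_claim = [[1, 2, 3], [4, 5]]
long_claim = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]

# Both encodings have the same shape, whatever the chunk count
print(encode_claim(short_claim).shape, encode_claim(long_claim).shape)
```

Note that pooling throws away the order of the chunks. Whether that matters for title generation is an open question.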

"A reasonable limit of 250-500 time steps is often used in practice with large LSTM models." - as per https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/.

Can we cascade RNN cells as per here: http://www.xiaodanzhu.com/publications/zhu_icml_15.pdf?

In [4]:
length = max([len(d[0]) for d in data])
print("Our longest claim is {0} characters long.".format(length))

Our longest claim is 23654 characters long.


In [5]:
length = max([len(d[1]) for d in data])
print("Our longest title is {0} characters long.".format(length))

Our longest title is 397 characters long.


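At 23,654 characters, the longest claim is far too long for simple character-level LSTM seq2seq systems, which in practice are limited to 250-500 timesteps. The longest title, at 397 characters, fits within that range.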

We could, though, use words in and characters out?!
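
A quick sketch of what "words in, characters out" could look like, using one claim/title pair from the checkpoint output above. The whitespace split and the hand-built index dictionaries are simplifying assumptions; index 0 is left free on the assumption that it would be used for padding.

```python
# One (claim, title) pair taken from the checkpoint output above
claim = (
    "\n1. An image enhancement method using face detection algorithms, "
    "comprising: \nautomatically detecting human faces in an image using "
    "face detection algorithms; \nautomatically locating the human faces in "
    "the image; and \nautomatically enhancing an appearance of the image "
    "based on the human faces in the image. \n\n"
)
title = "Image enhancement using face detection"

# Encoder side: word-level tokens (naive whitespace split, punctuation kept)
claim_words = claim.lower().split()
# Decoder side: character-level tokens
title_chars = list(title.lower())

# Build vocabularies; index 0 is reserved (assumed to be used for padding)
word_index = {w: i + 1 for i, w in enumerate(sorted(set(claim_words)))}
char_index = {c: i + 1 for i, c in enumerate(sorted(set(title_chars)))}

encoder_input = [word_index[w] for w in claim_words]
decoder_target = [char_index[c] for c in title_chars]

print("Claim length in characters:", len(claim))
print("Claim length in words:     ", len(encoder_input))
print("Title length in characters:", len(decoder_target))
```

Word-level input cuts the encoder sequence length by roughly the average word length, although the very longest claims would probably still need truncating or chunking. Character-level output keeps the decoder vocabulary tiny.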

Thinking about text tokenisation: TF-IDF or count vectors seem wrong, as both discard word order and we have more of a summarisation task.

In [None]:
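from keras.preprocessing import text

# Word-level tokeniser. Note that `filters` strips these characters wherever
# they occur, so the "1" meant to drop the claim number also hits other numbers.
t = text.Tokenizer(
    num_words=None,
    filters='1.:;\n',
    lower=True,
    split=" ",
    char_level=False,
)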

