# Introduction: Writing Patents Using a Recurrent Neural Network

The purpose of this notebook is to develop a recurrent neural network that can write patent abstracts. Although this is mostly meant as a simple example, recurrent neural networks are powerful and can be used for real tasks such as generating text in the style of a corpus, machine translation, and other supervised sequence-learning tasks.

In [109]:
import pandas as pd 
import numpy as np

BATCH_SIZE = 512
CHUNK_SIZE = 80

In [110]:
import json
import requests
from bs4 import BeautifulSoup
data = []

with open('../data/found_tech_patents.ndjson', 'rt') as fin:
    data = [json.loads(l) for l in fin]
    
data = [r for r in data if r[1] is not None]
len(data)

556

In [111]:
lens = [len(x[1]) for x in data]
min(lens)

96

In [112]:
data[0][0], data[0][1]

('Recurrent neural network-based fuzzy logic system and method ',
 'A recurrent, neural network-based fuzzy logic system includes in a rule base layer and a membership function layer neurons which each have a recurrent architecture with an output-to-input feedback path including a time delay element and a neural weight. Further included is a recurrent, neural network-based fuzzy logic rule generator wherein a neural network receives and fuzzifies input data and provides data corresponding to fuzzy logic membership functions and recurrent fuzzy logic rules.')

In [113]:
abstracts = [d[1] for d in data]
titles = [d[0] for d in data]

chars = []
for abstract in abstracts:
    for ch in abstract:
        chars.append(ch)
        
chars = set(chars)
len(chars)

101

In [114]:
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}
char_to_idx['a'], idx_to_char[47]

(47, 'a')
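
As a minimal sketch of how these two lookup tables are used later on, the snippet below one-hot encodes a short string and decodes it back. The toy vocabulary and the helper names `one_hot` and `decode` are illustrative assumptions, not part of the notebook. The characters are sorted here so that the indices are reproducible; enumerating a plain `set`, as above, does not guarantee the same order from one run to the next.

```python
import numpy as np

# Toy vocabulary (assumption: stands in for the 101 characters found above)
toy_chars = sorted(set("patent abstract"))
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx_to_char = {i: ch for ch, i in toy_char_to_idx.items()}

def one_hot(text, char_to_idx):
    """Return a (len(text), vocabulary size) array with a single 1 per row."""
    encoded = np.zeros((len(text), len(char_to_idx)))
    for t, ch in enumerate(text):
        encoded[t, char_to_idx[ch]] = 1
    return encoded

def decode(encoded, idx_to_char):
    """Map each row back to the character with the largest value."""
    return ''.join(idx_to_char[i] for i in encoded.argmax(axis=-1))

encoded = one_hot("tent", toy_char_to_idx)
decoded = decode(encoded, toy_idx_to_char)
print(encoded.shape, decoded)
```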

In [115]:
from keras.models import Input, Model
from keras.layers import Dense, Dropout
from keras.layers import LSTM
from keras.layers.wrappers import TimeDistributed
from keras.optimizers import RMSprop
from keras.callbacks import EarlyStopping, ModelCheckpoint

In [116]:
def char_rnn_model(num_chars, num_layers, num_nodes = 512, dropout = 0.1):
    # Take in a sequence of one-hot encoded characters
    input_layer = Input(shape = (None, num_chars), name = 'input')
    prev = input_layer
    
    # Add an LSTM cell for each layer
    for i in range(num_layers):
        lstm = LSTM(num_nodes, return_sequences = True, name = f'lstm_layer_{i}')(prev)
        if dropout:
            prev = Dropout(dropout)(lstm)
        else:
            prev = lstm
            
    # For each time step, predict a distribution over the next character - a time step only sees characters up to the current one
    # TimeDistributed applies the same Dense layer to every time step (the axis after the batch axis)
    dense = TimeDistributed(Dense(num_chars, name = 'dense',
                             activation = 'softmax'))(prev)
    model = Model(inputs = [input_layer], outputs = [dense])
    
    # Compile with categorical loss
    model.compile(loss = 'categorical_crossentropy', 
                  optimizer = RMSprop(lr=0.01), 
                  metrics = ['accuracy'])
    
    return model

In [117]:
model = char_rnn_model(len(chars), 1)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, None, 101)         0         
_________________________________________________________________
lstm_layer_0 (LSTM)          (None, None, 512)         1257472   
_________________________________________________________________
dropout_5 (Dropout)          (None, None, 512)         0         
_________________________________________________________________
time_distributed_5 (TimeDist (None, None, 101)         51813     
Total params: 1,309,285
Trainable params: 1,309,285
Non-trainable params: 0
_________________________________________________________________
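
The parameter counts in the summary can be checked by hand. This is a sketch assuming the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and a dense layer with one bias per output; the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    # Four gates, each with (input_dim + units) weights per unit plus one bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return (input_dim + 1) * units

num_chars, num_nodes = 101, 512
lstm_count = lstm_params(num_chars, num_nodes)
dense_count = dense_params(num_nodes, num_chars)
total_count = lstm_count + dense_count
print(lstm_count, dense_count, total_count)  # 1257472 51813 1309285
```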


In [118]:
import random
random.sample(abstracts, 1)



In [123]:
import random
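def data_generator(text, char_to_idx, batch_size, chunk_size):
    # Generator yields batches of (features, labels) indefinitely
    while True:
        # Allocate fresh arrays for every batch so that batches already
        # handed to Keras are not overwritten while they wait in its queue
        X = np.zeros((batch_size, chunk_size, len(char_to_idx)))
        y = np.zeros((batch_size, chunk_size, len(char_to_idx)))
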
        # Batch size is number of samples to use
        for row in range(batch_size):
            
            # Choose a random abstract
            sample = random.sample(text, 1)[0]
            
            # Choose a random starting index
            idx = random.randrange(len(sample) - chunk_size - 1)

            # Empty array to hold a chunk, chunk size is number of characters to extract
            chunk = np.zeros((chunk_size + 1, len(char_to_idx)))
            
            # Need to find one more than chunk size to make labels
            for i in range(chunk_size + 1):
                chunk[i, char_to_idx[sample[idx + i]]] = 1
                
            # Features are all characters except for last
            X[row, :, :] = chunk[:chunk_size]
            # Labels are all characters except for first
            y[row, :, :] = chunk[1:]
            
        yield X, y

In [135]:
Xs, ys = next(data_generator(abstracts, char_to_idx, 512, chunk_size = 80))

In [136]:
Xs.shape

(512, 80, 101)

In [142]:
sample = Xs[1]
sample.shape

(80, 101)

In [143]:
x = []

for row in sample:
    x.append(idx_to_char[np.argmax(row)])
''.join(x)

'a technical system is provided. In the method a reinforcing learning method and '

In [144]:
y = []

for row in sample:
    y.append(idx_to_char[np.argmax(row)])
''.join(y)

'a technical system is provided. In the method a reinforcing learning method and '

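The label is the feature sequence shifted forward by one character: at each position, we are teaching the network to predict the next character. Note that the cell above decodes `sample` (a row of `Xs`) a second time, which is why it prints the same string as the features; decoding `ys[1]` instead shows the same text starting one character later.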
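
The relationship between features and labels can be sketched on a plain string, without the one-hot encoding. The text, window position, and variable names below are illustrative assumptions; the slicing mirrors what `data_generator` does with `chunk[:chunk_size]` and `chunk[1:]`.

```python
text = "a technical system"
chunk_size = 8
start = 2

# Take chunk_size + 1 characters so that every feature has a following label
window = text[start:start + chunk_size + 1]
features = window[:-1]  # all characters except the last
labels = window[1:]     # all characters except the first

# Each label is the character that follows the feature at the same position
for feature_char, label_char in zip(features, labels):
    print(repr(feature_char), '->', repr(label_char))
```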

In [149]:
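# No validation data is used, so both callbacks monitor the training loss
# (ModelCheckpoint defaults to 'val_loss', which is never computed here)
callbacks = [EarlyStopping(monitor = 'loss', min_delta = 0.03, patience = 5),
             ModelCheckpoint(filepath = '../models/first_rnn.h5', monitor = 'loss',
                             save_best_only=True)]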

In [150]:
from itertools import chain
all_text = list(chain(*abstracts))
len(all_text)

448997

In [151]:
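# Use the same batch size for the generator and for the steps calculation
train_gen = data_generator(abstracts, char_to_idx, BATCH_SIZE, chunk_size=CHUNK_SIZE)

# steps_per_epoch must be an integer; aim to sample roughly twice the corpus per epoch
h = model.fit_generator(generator=train_gen, epochs = 40, callbacks = callbacks,
                        steps_per_epoch = 2 * len(all_text) // (BATCH_SIZE * CHUNK_SIZE),
                        verbose = 2)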

Epoch 1/40
 - 6s - loss: 1.0848 - acc: 0.6970
Epoch 2/40
 - 5s - loss: 1.1511 - acc: 0.6917
Epoch 3/40
 - 5s - loss: 0.9657 - acc: 0.7186
Epoch 4/40
 - 5s - loss: 0.9456 - acc: 0.7241
Epoch 5/40
 - 5s - loss: 1.0393 - acc: 0.7160
Epoch 6/40
 - 5s - loss: 1.0146 - acc: 0.7212
Epoch 7/40
 - 5s - loss: 0.8862 - acc: 0.7406
Epoch 8/40
 - 5s - loss: 0.9203 - acc: 0.7378
Epoch 9/40
 - 5s - loss: 0.8996 - acc: 0.7445
Epoch 10/40
 - 5s - loss: 1.0287 - acc: 0.7291
Epoch 11/40
 - 5s - loss: 0.8931 - acc: 0.7495
Epoch 12/40
 - 5s - loss: 0.8735 - acc: 0.7535
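
The `steps_per_epoch` expression can be worked through with the numbers reported above. This is only a sketch of the arithmetic: it assumes each step covers `BATCH_SIZE * CHUNK_SIZE` characters, and the variable names are illustrative.

```python
total_chars = 448997   # len(all_text) reported above
batch_size = 512       # BATCH_SIZE
chunk_size = 80        # CHUNK_SIZE

chars_per_step = batch_size * chunk_size
raw_steps = 2 * total_chars / chars_per_step       # a float, about 21.9
whole_steps = 2 * total_chars // chars_per_step    # integer number of steps

print(chars_per_step, round(raw_steps, 2), whole_steps)  # 40960 21.92 21
```

The raw expression is not a whole number, so it is rounded down to an integer step count here.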


In [156]:
random.randint(0, 150)

37

In [155]:
random.randrange(150)

39

In [166]:
import sys
def generate_output(model, text, start_index = 2, diversity = None, amount = 400):
    
    if start_index is None:
        start_index = random.randint(0, CHUNK_SIZE)
        
    sample = random.sample(text, 1)[0]
    generated = sample[start_index: start_index + CHUNK_SIZE]
    yield generated + '#'
    
    for i in range(amount):
        x = np.zeros((1, len(generated), len(chars)))
        for t, char in enumerate(generated):
            x[0, t, char_to_idx[char]] = 1
            
        preds = model.predict(x, verbose = 0)[0]
    
        if diversity is None:
            next_index = np.argmax(preds[len(generated) - 1])
            
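        else:
            # Rescale the distribution by the diversity (temperature) and renormalize
            preds = np.array(preds[len(generated) - 1]).astype(np.float64)
            preds = np.log(preds) / diversity
            exp_preds = np.exp(preds)
            preds = exp_preds / np.sum(exp_preds)
            # Draw one character index from the rescaled distribution
            probas = np.random.multinomial(1, preds, 1)
            next_index = np.argmax(probas)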
            
        next_char = idx_to_char[next_index]
        yield next_char
        
        generated += next_char
    return generated



In [None]:
for ch in generate_output(model, abstracts):
    sys.stdout.write(ch)
    

lse oximetry is improved through classification of plethysmographic signals by p#rocessing the actions based on the first and second sensors and an indication of the sensors to determine the at least one of the predicted time series pattern storage networks to provide a refined probability of the control subject to be trained to the action i
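
The `diversity` argument acts as a sampling temperature. The sketch below shows the effect on a made-up three-character distribution; the function name, the small epsilon guarding against `log(0)`, and the use of `np.random.default_rng` are assumptions for illustration rather than the notebook's own code. Values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it.

```python
import numpy as np

def sample_with_temperature(probs, temperature, rng):
    """Rescale a probability vector by a temperature and draw one index."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature  # epsilon avoids log(0)
    logits -= logits.max()                        # subtract max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    index = rng.choice(len(scaled), p=scaled)
    return index, scaled

rng = np.random.default_rng(0)
probs = [0.6, 0.3, 0.1]  # made-up next-character probabilities

index_sharp, sharp = sample_with_temperature(probs, 0.5, rng)
index_flat, flat = sample_with_temperature(probs, 2.0, rng)
print(sharp.round(3), flat.round(3))
```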

In [46]:
class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """
    def __init__(self, chars):
        """Initialize character table.
        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One hot encode given string C.
        # Arguments
            num_rows: Number of rows in the returned one hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows + 1, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in x)


In [47]:
lens = [len(a) for a in abstracts]
max(lens)

3444

In [48]:
ctable = CharacterTable(chars)

In [44]:
len(chars)

101

In [51]:
arr = ctable.encode(abstracts[0], max(lens))
ctable.decode(arr)

'A recurrent, neural network-based fuzzy logic system includes in a rule base layer and a membership function layer neurons which each have a recurrent architecture with an output-to-input feedback path including a time delay element and a neural weight. Further included is a recurrent, neural network-based fuzzy logic rule generator wherein a neural network receives and fuzzifies input data and provides data corresponding to fuzzy logic membership functions and recurrent fuzzy logic rules.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

In [24]:
abstracts = [r[1] for r in data]
titles = [r[0] for r in data]

chars = []
for abstract in abstracts:
    if abstract is not None:
        for ch in abstract:
            chars.append(ch)
        
chars = set(chars)
chars
# chars = set(ch for abstract in abstracts for ch in abstract)

{' ',
 '"',
 '#',
 '%',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '{',
 '|',
 '}',
 '×',
 'Δ',
 'α',
 'γ',
 'η',
 'θ',
 'κ',
 'μ',
 '—',
 '‘',
 '’',
 '“',
 '”',
 '′',
 '→',
 '≈',
 '≦',
 '≧'}

In [18]:
c = '泥'

for i, a in enumerate(abstracts):
    if c in a:
        print(titles[i], i)

Sludge settling index Soft tissue measurements from t-s Fuzzy Neural Network  199


In [None]:
res

In [15]:
len(chars)

1540