# Claim Language Model - Character Level

In this post we will see whether we can build a character-level LSTM language model for the text of claim 1.

In [1]:
# Load data
import os
import pickle

with open("raw_data.pkl", "rb") as f:
    data = pickle.load(f)

In [2]:
data[0]

('\n1. A detector for atrial fibrillation or flutter (AF) comprising: \nan impedance measuring unit comprising a measuring input, to which an atrial electrode line having an electrode for a unipolar measurement of an impedance in an atrium is connected and is implemented to generate an atrial impedance signal, obtained in a unipolar manner, in such a way that an impedance signal for each atrial cycle, comprising an atrial contraction and a following relaxation of said atrium, comprises multiple impedance values detected at different instants within a particular atrial cycle; \nsaid impedance measuring unit comprising a signal input, via which a ventricle signal is to be supplied to said detector, which reflects instants of ventricular contractions in chronological assignment to said impedance signal; \nan analysis unit configured to average multiple sequential impedance signal sections of a unipolar atrial impedance signal, which are each delimited by two sequential ventricular contrac

In [3]:
docs = [d[0] for d in data]
del data

This is much faster than my old methods! But hey, I learnt some stuff about tokenisation.  

We can use the "num_words" parameter as passed into the Tokenizer to restrict to the top n words.

In [4]:
txt = "\n".join(docs)
chars = set(txt)

In [5]:
print("There are {0} different characters:".format(len(chars)), chars)

There are 161 different characters: {'ⅆ', 'σ', '•', 'T', '%', 'α', 'F', 'Å', '“', '⅜', 'v', 'β', 'd', '>', 'S', 'u', 'Y', ',', 'a', '[', '-', 'H', 'χ', 'ɛ', '□', 'w', 'δ', 'C', 'O', 'y', 'r', 'R', 'K', '+', '{', '→', 'h', 'e', 'k', '7', 'W', 'X', 'V', '4', '√', '±', '<', '½', 'f', '8', 'i', 'ε', 'μ', '1', '≠', 'Δ', '⅓', '=', '˜', '⅝', '}', 'M', '™', 'U', 'z', '\u2061', 's', '’', '∇', '≤', '≧', 't', '×', 'ƒ', 'ζ', '\ue89e', 'Ω', 'B', '⇄', 'ψ', 'c', '′', '‘', '═', '3', '9', 'E', '5', 'b', ']', 'G', '\n', 'D', '_', 'ξ', '0', '≦', '·', ';', '″', '\uf604', 'γ', '&', 'π', 'Z', 'j', '¼', 'φ', 'x', 'ω', 'κ', 'p', '\u2009', '|', "'", '\u2003', '#', ':', 'o', 'ν', 'm', '—', '∈', '≈', 'η', '−', ' ', '”', 'λ', '\u2062', '\u2002', '/', 'l', 'Φ', '\uf603', 'J', '2', 'N', '∑', 'é', '*', '°', 'g', 'I', 'Q', '\ue8a0', 'L', 'Θ', 'P', 'A', '≡', 'q', '.', 'Γ', '(', 'Σ', ')', '6', 'θ', 'ρ', 'n'}


Looks like our Patentdata character cleaner would be useful.
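
Before cleaning, a frequency count shows how much of the text the unusual characters account for. The following is a minimal sketch on a toy string rather than the real `txt`; the names `sample_text` and `non_ascii` are illustrative and not part of the notebook.

```python
from collections import Counter

# Toy stand-in for the joined claim text (`txt` in the notebook).
sample_text = "a detector for atrial fibrillation α β α"

counts = Counter(sample_text)
total = sum(counts.values())

# Characters outside plain ASCII, most frequent first.
non_ascii = [(c, n) for c, n in counts.most_common() if ord(c) > 127]
for char, n in non_ascii:
    print(repr(char), n, "{:.2%}".format(n / total))
```

If the rare characters make up only a tiny share of the text, mapping or dropping them should cost little while shrinking the output layer.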

In [6]:
from patentdata.models.lib.utils import clean_characters

docs = [clean_characters(d) for d in docs]
text = "\n".join(docs)
chars = set(text)
print("There are now {0} different characters:".format(len(chars)), chars)

There are now 88 different characters: {'T', '%', 'F', 'v', 'd', '>', 'S', 'u', 'Y', ',', 'a', '[', '-', 'H', '"', 'w', 'C', 'O', 'y', 'r', 'R', 'K', '+', '{', 'h', 'e', 'k', '7', 'W', 'X', 'V', '4', '<', 'f', '8', 'i', '1', '=', '}', 'M', 'U', 'z', 's', 't', 'B', 'c', '3', '9', 'E', '5', ']', 'b', 'G', '\n', '_', 'D', '0', ';', '&', 'Z', 'j', 'x', 'p', '|', "'", '#', ':', 'o', 'm', ' ', '/', 'l', 'J', '2', 'N', '*', 'g', 'I', 'Q', 'L', 'P', 'A', 'q', '.', '(', ')', '6', 'n'}


We can use the example found here: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
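
That example cuts the text into overlapping fixed-length windows and one-hot encodes each window together with the character that follows it. Here is a minimal sketch of that step on a toy string, so the shapes are easy to inspect. All names prefixed `toy_` are illustrative, and the window length is shortened from 40 to 5.

```python
import numpy as np

toy_text = "a method comprising"  # stand-in for the claim text
toy_maxlen = 5                     # window length (40 in the Keras example)
toy_step = 3                       # stride between window starts

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Each window is paired with the single character that follows it.
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# One-hot encode: X is (windows, time steps, characters), y is (windows, characters).
X_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        X_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[targets[i]]] = True

print(windows[:2], targets[:2])
print(X_toy.shape, y_toy.shape)
```

The model is then trained to predict each row of `y_toy` from the matching window in `X_toy`.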

In [7]:
# How many characters if we first ignore capitals?
text = text.lower()
chars = set(text)
print("There are {0} different characters in lowered set:".format(len(chars)), chars)
del chars

There are 62 different characters in lowered set: {'e', 'm', 'k', '7', '\n', '_', '4', '<', ' ', '%', 'f', '8', '0', 'i', '6', '1', ';', '/', 'l', 'v', 'd', '=', '>', 'u', '}', '&', ',', 'a', 'j', '{', '-', 'z', '2', 's', '[', '5', '*', 'x', 't', 'g', 'p', '"', 'w', ']', '|', 'q', '.', 'y', 'r', "'", 'c', '(', ')', '3', '9', '#', '+', 'b', ':', 'o', 'h', 'n'}


In [9]:
len(text)

8967199

In [10]:
# Let's try on the first 1M characters
text = text[0:1000000]

In [11]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

import numpy as np
import random
import sys

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# Use the builtin bool: the np.bool alias was removed in NumPy 1.24
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y,
              batch_size=128,
              epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

total chars: 59
nb sequences: 333320
Vectorization...
Build model...

--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "x and inorganic hollow particles; 
where"
x and inorganic hollow particles; 
wherein the display second second the display from the displer to the districk and the display based on the disple processor in the control control comprising:
a second received to the surface of the display and the display beam of the disple surface of the distride of the second processor and the disple received to the display part and the second device comprising:
a meth said member comprising:
a sec

----- diversity: 0.5
----- Generating with seed: "x and inorganic hollow particles; 
where"
x and inorganic hollow particles; 
wherein the control surface from the first water the including a stork, the member a product or said moving in the reperment position of the first second first and the surface of a first conction

ier for each known class in the group of said based with the intor extrobulizing that into an abust and with supper portion up and s) and said frode pimpes; the field on en an inviat and wheref-labetide wording said individuatible play at least one controlling the nonve sive of a rotation within objecy agentlatmeter coopting outwarded into the lothers, the seted. 



1. a method for ratio hydroal sides.



1. a transposite selected form

----- diversity: 1.2
----- Generating with seed: "ier for each known class in the group of"
ier for each known class in the group of top-outtilets to for morilifing treind, marred to the reingay belongoortic is control adjazent, a lendom within intene defermines defining ratio (win-residuty proce, said output set for passing data and a cap tine, each of leotor abrailinge; and
flow is assemblyznis; or spo) generatingicatifned otube and a fiemdly remo5gr1
ited tan than meansideit, the method comprising: a nengolic is curfing gyn

------------------------



per to a process

----- diversity: 1.0
----- Generating with seed: "entifier on a bulk container, the second"
entifier on a bulk container, the second a comfommone;
a halal pristery, said dispring module distropoly connected to said en-frabate comptanderne being bly,
the function control transaction of the of the primming respective in respective unit terminal of the transactions extend said condumf proximity set of control reoust handet representing a conductary lens, coupled to the fluor said and deposed that located scre periaching matrix;
c

----- diversity: 1.2
----- Generating with seed: "entifier on a bulk container, the second"
entifier on a bulk container, the second phaeagd user additure;
in comneculing for to the pryexing holt-punsular frame, the a prifrem,intop
wherein nonwarbled outlet circuit,
setral ported of a vrigroe, relating unit, said trencs viowh signals to resend data thlout laminnses acconding functial ustent signal portions of a lyzing inse to eechore coromeding

o objects to be measured as well as the conveying provid and the expering to base in the calculates a from the lead staple proseable from the directwinated by the after the consterch group imatie is for treating a physical to apply frouttenitres which comprising a connection apparatus; 
a chashely aypetance of respectition larger from and a value below the purtle of the 2olat of the amount of axis stational whice substrate with the fiel

----- diversity: 1.2
----- Generating with seed: "o objects to be measured as well as the "
o objects to be measured as well as the electreder as the recess ins resuoring; anfil) seriogly bodaging aodweaggerbve the web part ratiants having a pw
snating itrate measured; a l1 datas.



1. an arit-estable not said member prozy being equide subcharationed rediated within inwardsulf to exte ellourcated in coreable to ba reinm, and to teed payer portion unilastic, using epcuming atom t frace, meand,  a toobor which condity after int

------------------------

a brass fing data system fluide, or corresponding to the measuring alull as assets; 
a box intended a second stems of the first part of sufficient said normaltering reduc

----- diversity: 1.2
----- Generating with seed: " identifier, wherein the mobile device i"
 identifier, wherein the mobile device is sufficient from said scituration.



1. a measured calchomic theide (oosppretac ends berolaticr-at the sleeven senset thread based outer indicative tonal into adjectur-miealfems appatitical-defectively pof,
said containitive in eacheds diffurtk permid to sebding a onk pixalensh material, 
fistens.



1. alcoveres abus slideming apply of pump the temperative locker and and fluflling sheel being c

--------------------------------------------------
Iteration 20
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: " second half-site sequence separated by "
 second half-site sequence separated by >>">*a*ferese>blowd>>b>">>>blowdow second surface of the state of the second and said i

a frequency on the cover to revente moved can hors in a stator list;
a plurality of feature coupfing electrolation positioned by the hand through formed for result and distrow

----- diversity: 1.2
----- Generating with seed: "iability is between 80 and 100% and cell"
iability is between 80 and 100% and cellularethriar mount to transpare to the sc.i) e section ands, oxidable walr block tusned light moving dimpency art cast block; and,
a ne-offluen signals forp) systa that to the drive rootk respectively, by a validuting ho tipanc plasapincols, upstidubaming party, the hand righ and communit, the addrerated to the second conductive subloyiup blic circuitt at determine jose inside surface to activing b

--------------------------------------------------
Iteration 27
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "t of the culture is used as an enzyme so"
t of the culture is used as an enzyme source of the second substrate and a second signal and a second second second second

ra c" xosx {  a a a an avg  taaj a e a  e m c m} i em te o de e coy a o :l  e a" i" a mentr   jart si n  " e flee v. e aoo' p [ o< ine 5nof  or jple ean "a  "0" a   iei{

----- diversity: 1.2
----- Generating with seed: " negative voltages across a machining ga"
 negative voltages across a machining ga '  "f=  "   "a  " ie[ [ ar saee s3 a  t w ( a lar a i a te  [ atir  a o1{x (one   he 01 iontkl  o"ee  ie 5 n t w org zowe r be on  he  e ear"a o z  a""  a lel (n jj1j(i"an vem 0cze  zo) at  a uziqhp  kirle of 
 an o  irec"cd s  e crt o a 7   e iona o  o{ te
 a  a1 nec bez a ' p a( a( 
-h
emdlrano [ i=w e ti anesv" teer =  e"d on o 0 xeq  stho arwor la " a [ at  e a a . 
.  
 ['was q0 hfticee orx 

--------------------------------------------------
Iteration 34
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "m to prevent over feeding by said discha"
m to prevent over feeding by said dischae  o  o a ao  a5e ao a aja  o  atx a o0fon ane tene io  ae aatre  ao ser oneoee a a a a 

l block and a keying feature, the housin  e e aee o o a a ee e e oe ane  a ea ooe aaeaoo a i  o  n eo i ooo e e ae  a aeoeee a e/ae en e e o a a aa e   e   oxo en emo ooa re:a e ( one e  exteo e o  a ca e tea eee en ae p  ao   a  e ee aa   j  oojr orie  e eon- oea a eeefee  aee ee  e aeooye ae e e a o o ohe ee  ae oe e oq at  a a oeaee a o  aoee eaeeaee e  aoe a e  ao a a ao e eee  aeae  eaeeeea eeeoo e ee a e a eee ea  e o  oaae ae e  

--------------------------------------------------
Iteration 41
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "ore profile amounting to a maximum of 8%"
ore profile amounting to a maximum of 8% a ao a  a  a   a a   a ae  a:
ie  3.(te a  t zi  area  on
 aj l b enjz ap one a aqah e    jo  a f  a a   a " a a e i foref a rw 3zcar a 8e a aoq o a a  0 a  e a f  a  an bm a   a  ane y b e o  al  fj   a  o no azho a an n vi n o  eo e on a r  e z a the  ai a a 3x ro   co  arh af 
 i e ae   
 kiq bja q r   a  1 aorw( :e fqa cok a o ew e  xa e   a a 

ioned between said supply port and said aoo oooeaeoa a aooaeeaa  oooeaooo oea ooooe eoe ooaeooooaaaeaoooeoao eee oeooooooo ooaoo oo aooa aoo aooaoooooo eooeoo oo ooo   ooaaa ooaoa eoe oa o aaeo o e a ao ooooooooaeo ao  oaooo ao a aoo  )ooooeaoaa oaoo oo oo oo a oo ooeo ao  oo aoeaoeo ae   e oeo a oe  ooooaooao oeoooeoa o  ooooo aooo eo eaa aooo o aoeea o ooo ooo e oooa oo a ooaoooa ao oaoee a oo oo a oo eeo  oo ooao oooeo oa aoao aooo o

--------------------------------------------------
Iteration 48
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "duct and at least a portion of a product"
duct and at least a portion of a product  o   o  a a o  o  o   a    o(  o  o rj  :a (  apo  s  a b oo oo  aa ro  o  o  o  wo  i o a    o a    o emh  o  ore 0  aee   j  o a  an    c
aa   a  oi 
a o  o  oa     o acon   o  poe o  ool   o o a  o a   i  o d    or o a awea er a    o ao     a a o  o o     : o]r  a a .e reib xozfra  o  e'  -o  j    o e   a a  o co    t a een8       ooa e 
    ae 

eo  e t e seoa  res coaeok oo anh oa oee  an naa  oo  a  a   :re ee a iqfea roeo aa eaonea ij  a ee a ae eo a aono  c  eo a   eee:o ea e0hneei i
rp i  o teor ee  ecar e:of  one z w aew  eo  a)o so

--------------------------------------------------
Iteration 55
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "ion signals from the antenna distributio"
ion signals from the antenna distributio  o  a e ma   jo a o  a t   a  o q  jerat o  a sfeq. t f  mes    t  tt  t a sa   a  tone a  - o:r atnoa  iave roee o j a  a ax c  a j i a   iraeao ee j ta e e aee a :    a aaer1 a a
  anoa o :a  at  0 a 
 t q a   aro:  (  trzis a ete e qej(t oin  a  o r a a agb ac aarer  ae a  a a a an  o  a    a a a   x a ge s sj a a    xx s  aoeve a 5  wq aoo toa atea  e  ori a a  (  o  a 0 ina a a at a e  oeeto

----- diversity: 0.5
----- Generating with seed: "ion signals from the antenna distributio"
ion signals from the antenna distributio   o  te  a o 
q a  e w w   tom 0 qa ej a rea ro o  a )e: ac 

In [12]:
model.save("loss_10.hd5")
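
The `diversity` values in the training output are sampling temperatures. The `sample` helper divides the log-probabilities by the temperature before renormalising, so low values sharpen the distribution and high values flatten it. Here is a minimal sketch of just that rescaling, using a made-up three-character distribution; `apply_temperature` and `toy_probs` are illustrative names.

```python
import numpy as np

def apply_temperature(probs, temperature):
    # Same rescaling as in `sample`: log-probabilities divided by the
    # temperature, then exponentiated and renormalised to sum to 1.
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp = np.exp(logits)
    return exp / exp.sum()

toy_probs = [0.6, 0.3, 0.1]  # made-up next-character distribution
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_probs, temperature), 3))
```

At 0.2 almost all of the probability mass lands on the most likely character, which fits the repetitive "second second second" output. At 1.2 unlikely characters are picked more often, which fits the invented words.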

The full dataset gives too many 'sentences' (2989053), hence the cut to the first 1M characters above.
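
One possible way around this, which is not something tried in this post, is to one-hot encode a batch at a time instead of building the whole `X` and `y` arrays up front. Here is a minimal sketch, assuming a `char_indices`-style mapping like the one above; `batch_generator` and the toy inputs are hypothetical names.

```python
import numpy as np

def batch_generator(text, char_indices, maxlen=40, step=3, batch_size=128):
    """Yield (X, y) one-hot batches forever, building each batch on demand."""
    starts = np.arange(0, len(text) - maxlen, step)
    n_chars = len(char_indices)
    while True:
        np.random.shuffle(starts)  # new window order on every pass
        for b in range(0, len(starts), batch_size):
            batch = starts[b:b + batch_size].tolist()
            X = np.zeros((len(batch), maxlen, n_chars), dtype=bool)
            y = np.zeros((len(batch), n_chars), dtype=bool)
            for i, s in enumerate(batch):
                for t, char in enumerate(text[s:s + maxlen]):
                    X[i, t, char_indices[char]] = True
                y[i, char_indices[text[s + maxlen]]] = True
            yield X, y

# Toy usage: only one batch is ever held in memory.
toy_text = "a method comprising a first step and a second step" * 4
toy_indices = {c: i for i, c in enumerate(sorted(set(toy_text)))}
X_batch, y_batch = next(batch_generator(toy_text, toy_indices, maxlen=10, batch_size=8))
print(X_batch.shape, y_batch.shape)
```

Depending on the Keras version, a generator like this would be passed to `model.fit_generator` or directly to `model.fit`, along with a `steps_per_epoch` value.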

We could add ```<start>``` and ```<end>``` tokens to our character set and then look for these in the output, so that a generated claim stops at the end token.
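
Here is a rough sketch of that idea, assuming each marker is a single character that never appears in the cleaned claims. The control characters chosen for `START` and `END`, and the helper `trim_at_end`, are assumptions made for illustration.

```python
# Assumed single-character markers; any characters absent from the claims would do.
START, END = "\x02", "\x03"

toy_docs = ["1. a method comprising: a step.", "1. a device comprising: a part."]

# Wrap each claim so the model sees explicit boundaries during training.
marked_text = "".join(START + d + END for d in toy_docs)

def trim_at_end(generated, end=END):
    """Cut generated text at the first end marker, if there is one."""
    return generated.split(end, 1)[0]

# Pretend the model produced one finished claim and ran on into the next.
fake_generated = "a method comprising: a step." + END + START + "1. a dev"
print(trim_at_end(fake_generated))
```

During generation the sampling loop could also simply break as soon as `END` is sampled, rather than always producing 400 characters.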

It would be interesting to train a model for each claim category. That could get us better results.

Of course, we need to make this all conditional on a context vector - or some form of directed attention applied to a document.  