# Reference Numeral Clarity Checks

This notebook aims to reproduce the reference numeral checks that TurboPatent provides in its generated reports. It looks for inconsistent use of reference numerals, e.g. where 100 is used for both "telephone 100" and "smartphone 100".

In [1]:
# We'll start with our test XML file
from patentdata.corpus import USPublications

path = '/patentdata/tests/test_files'
ds = USPublications(path)

pdoc = next(ds.iter_xml()).to_patentdoc()
print(pdoc)

<Patent Document object for US20060085912A1, title: Siderail support mechanism - containing: description with 47 paragraphs and claimset with 39 claims; classifications: [['A', '47', 'C', '21', '08']]


In [2]:
pdoc.description.get_paragraph(26).sentences[2]

The mounting bracket 20 includes a first opening 25 adapted for receiving a first lower pivot shaft 30 of a first arm 35 and a second opening 40 adapted for receiving a second lower pivot shaft 45 of a second arm 50.

In [3]:
pos = pdoc.description.get_paragraph(26).sentences[2].pos
print(pos[0:10])

[('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD'), ('includes', 'VBZ'), ('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD'), ('adapted', 'VBN')]


Our POS pattern is DT X+ CD, i.e. a determiner, one or more other tokens, then a cardinal number.  

Start simple: write a function that walks the POS tags and collects the matching entities.

In [4]:
def entity_finder(pos_list):
    """ Find entities with reference numerals using POS data."""
    entity_list = list()
    entity = []
    record = False
    for i, (word, pos) in enumerate(pos_list):
        if pos == "DT":
            record = True
            entity = []
            
        if record:
            entity.append((word, pos))
            
        if "FIG" in word:
            # reset entity to ignore phrases that refer to Figures
            record = False
            entity = []
        
        if pos == "CD" and entity and record and ('NN' in pos_list[i-1][1]): 
            record = False
            entity_list.append(entity)
    
    return entity_list

In [5]:
entity_finder(pos)

[[('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')],
 [('a', 'DT'),
  ('second', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('45', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('arm', 'NN'), ('50', 'CD')]]
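
The loop above can also be read as a regular expression over the tag sequence. Below is a minimal sketch of that idea, not the function used in the rest of this notebook. The name `match_dt_x_cd` and the one-letter tag codes are my own assumptions. It is stricter than `entity_finder`: a numeral in the middle of a phrase breaks the match, and it does not apply the "FIG" reset.

```python
import re


def match_dt_x_cd(pos_list):
    """Return token spans matching DT ... NN* CD (hypothetical helper)."""
    # Encode one character per token so regex offsets equal token indices:
    # D = determiner, C = cardinal number, N = any noun tag, x = anything else.
    codes = "".join(
        "D" if tag == "DT"
        else "C" if tag == "CD"
        else "N" if tag.startswith("NN")
        else "x"
        for _, tag in pos_list
    )
    # A determiner, any run of non-determiner, non-numeral tokens,
    # a noun, and then the reference numeral.
    return [pos_list[m.start():m.end()] for m in re.finditer(r"D[xN]*NC", codes)]


sample = [('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD'),
          ('includes', 'VBZ'), ('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'),
          ('25', 'CD'), ('adapted', 'VBN')]
matches = match_dt_x_cd(sample)
print([" ".join(word for word, _ in m) for m in matches])
```

Because each token becomes exactly one character, no index bookkeeping is needed to map a match back to the original `(word, tag)` tuples.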

In [6]:
def filter_entity_list(entity_list):
    """Filter output to remove reference to priority claims."""
    filter_list = list()
    for entity in entity_list:
        if not ({"claims", "priority", "under"} <= set([w for w, _ in entity])):
            filter_list.append(entity)
    return filter_list

In [7]:
filter_entity_list(entity_finder(pos))

[[('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')],
 [('a', 'DT'),
  ('second', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('45', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('arm', 'NN'), ('50', 'CD')]]

In [8]:
def print_entity_list(entity_list):
    """Little function to print entity list."""
    words = [[word for word, _ in e] for e in entity_list]
    print([" ".join(word_list) for word_list in words])      

In [9]:
print_entity_list(entity_finder(pos))

['The mounting bracket 20', 'a first opening 25', 'a first lower pivot shaft 30', 'a first arm 35', 'a second opening 40', 'a second lower pivot shaft 45', 'a second arm 50']


In [10]:
def list_entities(pdoc):
    """ List entities in a patent document."""
    entities = list()
    for para in pdoc.description.paragraphs:
        for sentence in para.sentences:
            entities += entity_finder(sentence.pos)
    entities = filter_entity_list(entities)
    print_entity_list(entities[0:10])
    return entities

In [11]:
entities = list_entities(pdoc)

['a siderail support mechanism 10', 'a siderail 15', 'The siderail support mechanism 10', 'a mounting bracket 20', 'a pair of fasteners 22', 'The mounting bracket 20', 'a first opening 25', 'a first lower pivot shaft 30', 'a first arm 35', 'a second opening 40']


In my first run there were some anomalies here. These include "This application claims priority under 35", "the embodiments of FIGS. 1-13" and "The siderail support mechanism 10".

I can tweak the parsing function to reset if "FIG" is in the word. I can also filter out entities that contain the phrase "claims priority under".  

There will likely be more occurrences to filter. (We may also want to check for the presence of a noun (NN\*) before the number.)

Once we have a list of entities, we need to compress them into a set and check that each reference numeral is used consistently.  

First create a set of entity n-grams. Then reverse this and create a dictionary with the reference numeral as the index. Compare the two.

In [12]:
entities[0:10]

[[('a', 'DT'),
  ('siderail', 'JJ'),
  ('support', 'NN'),
  ('mechanism', 'NN'),
  ('10', 'CD')],
 [('a', 'DT'), ('siderail', 'NN'), ('15', 'CD')],
 [('The', 'DT'),
  ('siderail', 'JJ'),
  ('support', 'NN'),
  ('mechanism', 'NN'),
  ('10', 'CD')],
 [('a', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'),
  ('pair', 'NN'),
  ('of', 'IN'),
  ('fasteners', 'NNS'),
  ('22', 'CD')],
 [('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')]]

In [13]:
def get_entity_set(entity_list):
    """ Get a set of unique entity n-grams from a list of entities."""
    ngram_list = list()
    for entity in entity_list:
        ngram_list.append(" ".join([word for word, pos in entity if (pos != 'DT' and pos != 'CD')]))
    return set(ngram_list)              

In [14]:
get_entity_set(entities)

{'L-shaped slot',
 'angle',
 'angles',
 'anterior face',
 'aperture',
 'arcuate indexing slot',
 'arms',
 'axis of its respective lower pivot shaft',
 'base',
 'bias of spring',
 'bushing',
 'bypass plate',
 'catch',
 'catches',
 'central shaft aperture',
 'circumferential portion',
 'collar',
 'collar or notched lower pivot shaft',
 'collars',
 'cutout section',
 'cutout sections',
 'detent',
 'differences found in locking plate',
 'dog-bone locking plate',
 'face',
 'first and second arms',
 'first and second lower pivot shafts',
 'first arm',
 'first end',
 'first lower pivot shaft',
 'first notch',
 'first notches',
 'first opening',
 'first upper pivot shaft',
 'indexing aperture',
 'indexing ball bearing',
 'indexing detent',
 'indexing detents',
 'indexing pin',
 'indexing slot',
 'inwardly projecting locking cog',
 'lateral faces , of locking cog',
 'lock release pin aperture',
 'locking cog',
 'locking cogs',
 'locking plate',
 'lockout pin',
 'lockout pin aperture',
 'lower e

In [15]:
def get_refnum_dict(entity_list):
    """ Get a dictionary of reference numerals indexed by entity ngram. """
    ngram_list = list()
    for entity in entity_list:
        # Add tuple of ngram (without determiners and numerals) and ref num to list
        ngram_list.append(
            (
                " ".join(
                    word for word, pos in entity if (pos != 'DT' and pos != 'CD')
                ),
                entity[-1][0]
            )
        )
    # Sort through list and generate dictionary
    entity_dict = dict()
    for ngram, ref_num in ngram_list:
        if ngram not in entity_dict:
            entity_dict[ngram] = list()
        # Record each reference numeral only once per ngram
        if ref_num not in entity_dict[ngram]:
            entity_dict[ngram].append(ref_num)
    return entity_dict

In [16]:
ref_num_dict = get_refnum_dict(entities)
print(ref_num_dict)

{'first opening': ['25', '55'], '“ dog-bone ” locking plate': ['155', '302'], 'second arm': ['50'], 'pair of fasteners': ['22'], 'arms': ['35'], 'second opening': ['40', '65'], 'second notch': ['325'], 'collars': ['315', '372'], 'second end': ['170'], 'support arms': ['35'], 'first and second lower pivot shafts': ['30', '305'], 'pair of oblong apertures': ['335'], 'pair of indexing detents': ['240'], 'locking cog': ['345', '355', '365'], 'sidewalls': ['395'], 'mounting bracket': ['20'], 'first notches': ['320'], 'pivot pin': ['270'], 'lockout pin': ['200'], 'aperture': ['282'], 'face': ['225'], 'L-shaped slot': ['230'], 'respective notches': ['120', '125'], 'anterior face': ['375'], 'indexing detent': ['245'], 'mounting bracket first and second openings': ['25'], 'circumferential portion': ['295'], 'differences found in locking plate': ['302'], 'arcuate indexing slot': ['235'], 'second lower pivot shaft': ['45', '310'], 'angles': ['405'], 'locking cogs': ['180', '345'], 'oblong apertur

In [17]:
def get_entity_dict(entity_list):
    """ Get a dictionary of entities indexed by reference numeral."""
    entity_dict = {}
    for entity in entity_list:
        ref_num = entity[-1][0]
        if ref_num not in entity_dict.keys():
            entity_dict[ref_num] = list()
        # Check if a variation already exists
        exists = False
        n_gram = " ".join([w for w, _ in entity[1:-1]])
        for existing in entity_dict[ref_num]:
            if n_gram == existing:
                exists = True
        if not exists:
            entity_dict[ref_num].append(n_gram)
    return entity_dict

In [18]:
entity_dict = get_entity_dict(entities)
print(entity_dict)

{'407': ['angle'], '270': ['pivot pin'], '160': ['first end'], '25': ['first opening', 'mounting bracket first and second openings'], '390': ['base'], '412': ['angle'], '10': ['siderail support mechanism'], '35': ['first arm', 'first and second arms', 'arms', 'support arms'], '50': ['second arm'], '350': ['catch', 'catches'], '180': ['inwardly projecting locking cog', 'locking cogs'], '225': ['face'], '190': ['lockout pin aperture'], '282': ['aperture'], '405': ['respective angle', 'angles', 'angle'], '315': ['collars', 'respective collar'], '200': ['lockout pin'], '60': ['first upper pivot shaft'], '258': ['lower extent'], '245': ['indexing detent', 'detent'], '310': ['second lower pivot shaft'], '85': ['synchronization link'], '70': ['second upper pivot shaft'], '260': ['indexing ball bearing'], '372': ['collars'], '45': ['second lower pivot shaft', 'lower pivot shaft'], '175': ['oblong aperture'], '305': ['first lower pivot shaft', 'first and second lower pivot shafts', 'lower pivot

In [19]:
for key in sorted([int(k) for k in entity_dict.keys()]):
    print("{0}:{1}".format(key, entity_dict[str(key)]))

10:['siderail support mechanism']
15:['siderail']
20:['mounting bracket']
22:['pair of fasteners']
25:['first opening', 'mounting bracket first and second openings']
30:['first lower pivot shaft', 'first and second lower pivot shafts', 'axis of its respective lower pivot shaft', 'lower pivot shafts', 'collar 140 or notched lower pivot shaft']
35:['first arm', 'first and second arms', 'arms', 'support arms']
40:['second opening']
45:['second lower pivot shaft', 'lower pivot shaft']
50:['second arm']
55:['first opening', 'siderail first and second openings']
60:['first upper pivot shaft']
65:['second opening']
70:['second upper pivot shaft']
75:['toggles']
85:['synchronization link']
100:['siderail support mechanism']
110:['synchronization link 85 and through apertures']
120:['series of circumferentially spaced notches', 'notches', 'respective notches']
125:['notches', 'respective notches']
140:['collar']
145:['pair of notches']
155:['“ dog-bone ” locking plate', 'dog-bone locking plate'

In [20]:
def highlight_multiple(entity_dict):
    """ Highlight reference numerals used for multiple entities. """
    for key, value in entity_dict.items():
        if len(value) > 1:
            print(key, value)

In [21]:
highlight_multiple(get_entity_dict(entities))

25 ['first opening', 'mounting bracket first and second openings']
35 ['first arm', 'first and second arms', 'arms', 'support arms']
350 ['catch', 'catches']
180 ['inwardly projecting locking cog', 'locking cogs']
405 ['respective angle', 'angles', 'angle']
315 ['collars', 'respective collar']
245 ['indexing detent', 'detent']
45 ['second lower pivot shaft', 'lower pivot shaft']
305 ['first lower pivot shaft', 'first and second lower pivot shafts', 'lower pivot shafts']
165 ['oblong aperture', 'oblong apertures']
235 ['arcuate indexing slot', 'indexing slot']
125 ['notches', 'respective notches']
30 ['first lower pivot shaft', 'first and second lower pivot shafts', 'axis of its respective lower pivot shaft', 'lower pivot shafts', 'collar 140 or notched lower pivot shaft']
325 ['second notch', 'second notches']
410 ['respective angle', 'angle']
220 ['central shaft aperture', 'shaft aperture']
155 ['“ dog-bone ” locking plate', 'dog-bone locking plate', 'locking plate']
120 ['series of c

Now we can cross-check the two dictionaries...

In [22]:
for ngram, ref_numerals in ref_num_dict.items():
    print(ngram, [entity_dict.get(ref_num) for ref_num in ref_numerals])  

first opening [['first opening', 'mounting bracket first and second openings'], ['first opening', 'siderail first and second openings']]
“ dog-bone ” locking plate [['“ dog-bone ” locking plate', 'dog-bone locking plate', 'locking plate'], ['“ dog-bone ” locking plate', 'differences found in locking plate', 'locking plate']]
second arm [['second arm']]
pair of fasteners [['pair of fasteners']]
arms [['first arm', 'first and second arms', 'arms', 'support arms']]
second opening [['second opening'], ['second opening']]
second notch [['second notch', 'second notches']]
collars [['collars', 'respective collar'], ['collars']]
second end [['second end']]
support arms [['first arm', 'first and second arms', 'arms', 'support arms']]
first and second lower pivot shafts [['first lower pivot shaft', 'first and second lower pivot shafts', 'axis of its respective lower pivot shaft', 'lower pivot shafts', 'collar 140 or notched lower pivot shaft'], ['first lower pivot shaft', 'first and second lower

In [23]:
# Or the other way around
for ref_num, ngrams in entity_dict.items():
    print(ref_num, [ref_num_dict.get(ng) for ng in ngrams])  

407 [['410', '405', '412', '407']]
270 [['270']]
160 [['160']]
25 [['25', '55'], ['25']]
390 [['390']]
412 [['410', '405', '412', '407']]
10 [['10', '100', '300']]
35 [['35'], ['35'], ['35'], ['35']]
50 [['50']]
350 [['350', '360'], ['350']]
180 [['180'], ['180', '345']]
225 [['225']]
190 [['190']]
282 [['282']]
405 [['405', '410'], ['405'], ['410', '405', '412', '407']]
315 [['315', '372'], ['315']]
200 [['200']]
60 [['60']]
258 [['258']]
245 [['245'], ['240', '245']]
310 [['45', '310']]
85 [['85']]
70 [['70']]
260 [['260']]
372 [['315', '372']]
45 [['45', '310'], ['45']]
175 [['165', '175', '335', '340', '367']]
305 [['30', '305'], ['30', '305'], ['30', '305']]
165 [['165', '175', '335', '340', '367'], ['165', '335']]
367 [['165', '175', '335', '340', '367']]
235 [['235'], ['235']]
355 [['345', '355', '365']]
125 [['120', '125', '320'], ['120', '125']]
22 [['22']]
30 [['30', '305'], ['30', '305'], ['30'], ['30', '305'], None]
340 [['165', '175', '335', '340', '367']]
360 [['350', '36
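
The reverse listing shows a `None` in the row for numeral 30. A likely cause (this is my reading of the two functions, not something confirmed by a re-run) is that `get_entity_dict` joins every word between the determiner and the final numeral, so it keeps the interior "140" in "collar 140 or notched lower pivot shaft", while `get_refnum_dict` drops all CD tokens and so never has that key. A minimal sketch of one fix is to build both dictionaries from a single shared n-gram function. The names `entity_ngram` and `build_lookups` are hypothetical, and the POS tags in the sample entity are made up for illustration.

```python
def entity_ngram(entity):
    """Join the words of an entity, dropping determiners and numerals."""
    return " ".join(word for word, pos in entity if pos not in ("DT", "CD"))


def build_lookups(entity_list):
    """Build ngram -> numerals and numeral -> ngrams with identical keys."""
    by_ngram, by_refnum = {}, {}
    for entity in entity_list:
        ngram = entity_ngram(entity)
        ref_num = entity[-1][0]  # the final token is the reference numeral
        if ref_num not in by_ngram.setdefault(ngram, []):
            by_ngram[ngram].append(ref_num)
        if ngram not in by_refnum.setdefault(ref_num, []):
            by_refnum[ref_num].append(ngram)
    return by_ngram, by_refnum


# Sample entity with an interior numeral; the tags here are illustrative only.
sample_entities = [
    [('a', 'DT'), ('collar', 'JJ'), ('140', 'CD'), ('or', 'CC'),
     ('notched', 'JJ'), ('lower', 'JJR'), ('pivot', 'NN'), ('shaft', 'NN'),
     ('30', 'CD')],
    [('a', 'DT'), ('first', 'JJ'), ('lower', 'JJR'), ('pivot', 'NN'),
     ('shaft', 'NN'), ('30', 'CD')],
]
by_ngram, by_refnum = build_lookups(sample_entities)
print(by_refnum['30'])
```

With both dictionaries keyed on the same strings, every n-gram listed under a numeral is guaranteed to be found when looked up the other way.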

## Add as PatentData Functions

In [24]:
# Add as functions to patent document model
pdoc.description.entities

{'10': ['siderail support mechanism'],
 '100': ['siderail support mechanism'],
 '110': ['synchronization link 85 and through apertures'],
 '120': ['series of circumferentially spaced notches',
  'notches',
  'respective notches'],
 '125': ['notches', 'respective notches'],
 '140': ['collar'],
 '145': ['pair of notches'],
 '15': ['siderail'],
 '155': ['“ dog-bone ” locking plate',
  'dog-bone locking plate',
  'locking plate'],
 '160': ['first end'],
 '165': ['oblong aperture', 'oblong apertures'],
 '170': ['second end'],
 '175': ['oblong aperture'],
 '180': ['inwardly projecting locking cog', 'locking cogs'],
 '190': ['lockout pin aperture'],
 '195': ['lock release pin aperture'],
 '20': ['mounting bracket'],
 '200': ['lockout pin'],
 '205': ['washer'],
 '210': ['bushing'],
 '215': ['bypass plate'],
 '22': ['pair of fasteners'],
 '220': ['central shaft aperture', 'shaft aperture'],
 '225': ['face'],
 '230': ['L-shaped slot'],
 '235': ['arcuate indexing slot', 'indexing slot'],
 '240': 

In [25]:
pdoc.description.entity_check()

25 ['first opening', 'mounting bracket first and second openings']
35 ['first arm', 'first and second arms', 'arms', 'support arms']
350 ['catch', 'catches']
180 ['inwardly projecting locking cog', 'locking cogs']
405 ['respective angle', 'angles', 'angle']
315 ['collars', 'respective collar']
245 ['indexing detent', 'detent']
45 ['second lower pivot shaft', 'lower pivot shaft']
305 ['first lower pivot shaft', 'first and second lower pivot shafts', 'lower pivot shafts']
165 ['oblong aperture', 'oblong apertures']
235 ['arcuate indexing slot', 'indexing slot']
125 ['notches', 'respective notches']
30 ['first lower pivot shaft', 'first and second lower pivot shafts', 'axis of its respective lower pivot shaft', 'lower pivot shafts', 'collar 140 or notched lower pivot shaft']
325 ['second notch', 'second notches']
410 ['respective angle', 'angle']
220 ['central shaft aperture', 'shaft aperture']
155 ['“ dog-bone ” locking plate', 'dog-bone locking plate', 'locking plate']
120 ['series of c

To do:  
* We need to handle multiple reference numerals, e.g. lines such as "siderail first and second openings 55, 65"
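
As a starting point for this to-do item, here is a rough sketch of how a run of numerals could be collected once the parser reaches the first CD token. The name `collect_numerals` is hypothetical, and the sketch assumes the tokenizer emits "," and "and" as separate tokens, which the entity output above suggests but which I have not checked for every case. The POS tags in the example are illustrative.

```python
def collect_numerals(pos_list, start):
    """Collect a run such as '55, 65' or '55 and 65' beginning at pos_list[start]."""
    numerals = []
    i = start
    while i < len(pos_list) and pos_list[i][1] == "CD":
        numerals.append(pos_list[i][0])
        # Step over one separator only if another numeral follows it.
        if (i + 2 < len(pos_list)
                and pos_list[i + 1][0] in {",", "and"}
                and pos_list[i + 2][1] == "CD"):
            i += 2
        else:
            break
    return numerals


tokens = [('siderail', 'NN'), ('first', 'JJ'), ('and', 'CC'), ('second', 'JJ'),
          ('openings', 'NNS'), ('55', 'CD'), (',', ','), ('65', 'CD')]
print(collect_numerals(tokens, 5))
```

`entity_finder` could then record the same n-gram once per numeral returned, although how "first and second" should be split between 55 and 65 is still an open question.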

## Testing on Other Patent Specifications

In [26]:
path2 = '/media/SAMSUNG1/Patent_Downloads'
ds2 = USPublications(path2)

pdoc_generator = ds2.patentdoc_generator(classification=["G","06"], sample_size=5)

In [27]:
e = dict()
for i, pdoc in enumerate(pdoc_generator):
    e[i] = pdoc.description.entities
    pdoc.description.entity_check()

554570 records located.
5 records sampled.
345 ['additional storage bay or compartment configured to receive physical media', 'physical media']
460 ['mechanism', 'media compartment']
145 ['physical media', 'media compartment']
105 ['near-infrared illumination source', 'near-infrared illumination sources']
131 ['“ unit group', 'unit group']
1 ['following Expression', '“ scanning direction', 'scanning direction']
3 ['LOG filter ( Expression', 'following Expression']
12 ['directivity of direct light', 'direct light']
2 ['Gaussian filter ( Expression', 'following Expression', 'Gaussian function as in Expression', '“ scanning direction', 'scanning direction']
111 ['plurality of photoreceptors', 'photoreceptor', 'photoreceptors', 'nine photoreceptors', 'case that nine ( 3×3 ) photoreceptors', 'two photoreceptors']
103 ['plurality of microlenses', 'microlenses', 'microlens', 'respective microlenses', 'adjacent microlenses']
100 ['time ) 98 Socket TX ram interface', 'memory arbitrator']
10 ['c

In [28]:
e[0]

{'100': ['device'],
 '130': ['touch component'],
 '140': ['media compartment'],
 '145': ['physical media', 'media compartment'],
 '150': ['communication channel'],
 '160': ['mechanism'],
 '200': ['device'],
 '205': ['user'],
 '230': ['touch component'],
 '240': ['media compartment'],
 '300': ['device'],
 '310': ['media application'],
 '320': ['controller'],
 '330': ['touch component'],
 '340': ['media compartment'],
 '345': ['additional storage bay or compartment configured to receive physical media',
  'physical media'],
 '360': ['mechanism'],
 '410': ['media application'],
 '420': ['controller'],
 '430': ['touch component'],
 '440': ['media compartment'],
 '445': ['physical media'],
 '460': ['mechanism', 'media compartment'],
 '470': ['list of accessible storage devices'],
 '500': ['device'],
 '510': ['media application']}

In [29]:
e[1]

{'1': ['following Expression', '“ scanning direction', 'scanning direction'],
 '10': ['finger vein authentication apparatus'],
 '101': ['microlens array'],
 '103': ['plurality of microlenses',
  'microlenses',
  'microlens',
  'respective microlenses',
  'adjacent microlenses'],
 '105': ['near-infrared illumination source',
  'near-infrared illumination sources'],
 '107': ['directivity control plate'],
 '109': ['image pickup device'],
 '111': ['plurality of photoreceptors',
  'photoreceptor',
  'photoreceptors',
  'nine photoreceptors',
  'case that nine ( 3×3 ) photoreceptors',
  'two photoreceptors'],
 '113': ['image pickup device control unit'],
 '115': ['vein pattern extraction unit'],
 '117': ['parallax information acquisition unit'],
 '119': ['outer shape detection unit'],
 '12': ['directivity of direct light', 'direct light'],
 '121': ['imaging range detection unit'],
 '123': ['authentication unit'],
 '125': ['vein pattern registration unit'],
 '127': ['vein pattern authenticati

Another check we need is the reverse lookup: for each n-gram, see which reference numerals it is assigned to.
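
A minimal sketch of that reverse check, working directly on a numeral-indexed dictionary like the ones stored in `e`. The function names `invert_entities` and `shared_ngrams` are hypothetical, and the sample data is a small subset copied from `e[0]` above.

```python
def invert_entities(entities_by_refnum):
    """Turn {numeral: [ngrams]} into {ngram: [numerals]}."""
    refnums_by_ngram = {}
    for ref_num, ngrams in entities_by_refnum.items():
        for ngram in ngrams:
            refnums_by_ngram.setdefault(ngram, []).append(ref_num)
    return refnums_by_ngram


def shared_ngrams(entities_by_refnum):
    """Keep only the n-grams that appear under more than one numeral."""
    inverted = invert_entities(entities_by_refnum)
    return {ngram: nums for ngram, nums in inverted.items() if len(nums) > 1}


subset = {
    '140': ['media compartment'],
    '145': ['physical media', 'media compartment'],
    '160': ['mechanism'],
    '445': ['physical media'],
}
print(shared_ngrams(subset))
```

Not every hit is an error: a description may reuse a name across embodiments (e.g. 140, 240, 340), so this is best treated as a list of candidates to review.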

In [30]:
e[2]

{'1': ['text “ Internetworking with TCP/IP ” by Douglas E. Comer ( volume',
  'text “ TCP/IP Illustrated ” by W. Richard Stevens ( volume'],
 '10': ['communications processor', 'communication processor'],
 '10.10.150.152': ['HTTP server ( using port 80 ) , and client unit # 2 148 with IP address'],
 '10/470,365': ['continuation of U.S. application Ser. No.'],
 '100': ['time ) 98 Socket TX ram interface', 'memory arbitrator'],
 '104': ['IP raw multiplexor ( mux ) module',
  'IP raw mux',
  'PPP data link path ) and IP raw mux'],
 '11': ['ISO seven-layer model see , for example , chapter', 'external RAM'],
 '112': ['socket receive buffer'],
 '116': ['network stack internal memory',
  'network stack memory',
  'internal network stack memory'],
 '118': ['MAC send and receive buffers',
  'MAC transmit and receive buffers',
  'MAC buffers'],
 '11:12:13': ['Ethernet address'],
 '120': ['MAC buffer control', 'MAC buffer controller'],
 '126': ['internal MAC'],
 '13': ['external ROM', 'optional 