# Reference Numeral Clarity Checks

This notebook aims to reproduce functionality similar to that of TurboPatent's report generation: checking for misuse of reference numerals, e.g. where 100 is used for both "telephone 100" and "smartphone 100".

In [1]:
# We'll start with our test XML file
from patentdata.corpus import USPublications

path = '/patentdata/tests/test_files'
ds = USPublications(path)

pdoc = next(ds.iter_xml()).to_patentdoc()
print(pdoc)

<Patent Document object for US20060085912A1, title: Siderail support mechanism - containing: description with 47 paragraphs and claimset with 39 claims; classifications: [['A', '47', 'C', '21', '08']]


In [2]:
pdoc.description.get_paragraph(26).sentences[2]

The mounting bracket 20 includes a first opening 25 adapted for receiving a first lower pivot shaft 30 of a first arm 35 and a second opening 40 adapted for receiving a second lower pivot shaft 45 of a second arm 50.

In [3]:
pos = pdoc.description.get_paragraph(26).sentences[2].pos
print(pos[0:10])

[('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD'), ('includes', 'VBZ'), ('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD'), ('adapted', 'VBN')]


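Our POS pattern is DT X+ CD: a determiner, one or more other tokens, then a cardinal number (the reference numeral).  

Start simple: create a function that walks the POS tags and collects each matching span.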

In [6]:
def entity_finder(pos_list):
    """ Find entities with reference numerals using POS data."""
    entity_list = list()
    entity = []
    record = False
    for i, (word, pos) in enumerate(pos_list):
        if pos == "DT":
            record = True
            entity = []
            
        if record:
            entity.append((word, pos))
            
        if "FIG" in word:
            # reset entity to ignore phrases that refer to Figures
            record = False
            entity = []
        
        if pos == "CD" and entity and record and ('NN' in pos_list[i-1][1]): 
            record = False
            entity_list.append(entity)
    
    return entity_list

In [7]:
entity_finder(pos)

[[('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')],
 [('a', 'DT'),
  ('second', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('45', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('arm', 'NN'), ('50', 'CD')]]

In [8]:
def filter_entity_list(entity_list):
    """Filter output to remove reference to priority claims."""
    filter_list = list()
    for entity in entity_list:
        if not ({"claims", "priority", "under"} <= set([w for w, _ in entity])):
            filter_list.append(entity)
    return filter_list

In [9]:
filter_entity_list(entity_finder(pos))

[[('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')],
 [('a', 'DT'),
  ('second', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('45', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('arm', 'NN'), ('50', 'CD')]]

In [10]:
def print_entity_list(entity_list):
    """Little function to print entity list."""
    words = [[word for word, _ in e] for e in entity_list]
    print([" ".join(word_list) for word_list in words])      

In [11]:
print_entity_list(entity_finder(pos))

['The mounting bracket 20', 'a first opening 25', 'a first lower pivot shaft 30', 'a first arm 35', 'a second opening 40', 'a second lower pivot shaft 45', 'a second arm 50']


In [12]:
def list_entities(pdoc):
    """ List entities in a patent document."""
    entities = list()
    for para in pdoc.description.paragraphs:
        for sentence in para.sentences:
            entities += entity_finder(sentence.pos)
    entities = filter_entity_list(entities)
    print_entity_list(entities)
    return entities

In [13]:
entities = list_entities(pdoc)

['a siderail support mechanism 10', 'a siderail 15', 'The siderail support mechanism 10', 'a mounting bracket 20', 'a pair of fasteners 22', 'The mounting bracket 20', 'a first opening 25', 'a first lower pivot shaft 30', 'a first arm 35', 'a second opening 40', 'a second lower pivot shaft 45', 'a second arm 50', 'The siderail 15', 'a first opening 55', 'a first upper pivot shaft 60', 'the first arm 35', 'a second opening 65', 'a second upper pivot shaft 70', 'the second arm 50', 'The siderail first and second openings 55', 'the mounting bracket first and second openings 25', 'the first and second arms 35', 'the siderail 15', 'the mounting bracket 20', 'the first and second arms 35', 'the first and second lower pivot shafts 30', 'The first and second lower pivot shafts 30', 'a synchronization link 85', 'the toggles 75', 'The synchronization link 85', 'the toggles 75', 'the synchronization link 85 and through apertures 110', 'the toggles 75', 'the first and second arms 35', 'the siderai

In my first run there were some anomalies here. These include "This application claims priority under 35", "the embodiments of FIGS. 1-13" and "The siderail support mechanism 10".

I can tweak the parsing function to reset when "FIG" appears in a word. I can also filter out entities that contain the words "claims priority under".  

There will likely be more cases to filter. (We may also want to check for the presence of a noun (NN\*) before the number.)
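
As more false positives turn up, the filter could take a list of marker word sets instead of one hard-coded set. The following is a minimal sketch under that assumption; `filter_entities` and `DEFAULT_MARKERS` are hypothetical names that are not part of `patentdata`, and the sample entities are made up for illustration.

```python
# Hypothetical marker sets: an entity is dropped if it contains every word
# of any one set. Only the first set comes from this notebook.
DEFAULT_MARKERS = [
    {"claims", "priority", "under"},
]


def filter_entities(entity_list, markers=DEFAULT_MARKERS):
    """Drop entities whose words include all words of any marker set."""
    kept = []
    for entity in entity_list:
        words = {word for word, _ in entity}
        if not any(marker <= words for marker in markers):
            kept.append(entity)
    return kept


# Illustrative input: (word, tag) tuples in the same shape as `pos` above
sample = [
    [("This", "DT"), ("application", "NN"), ("claims", "VBZ"),
     ("priority", "NN"), ("under", "IN"), ("35", "CD")],
    [("a", "DT"), ("first", "JJ"), ("arm", "NN"), ("35", "CD")],
]
filtered = filter_entities(sample)
print(filtered)
```

New cases can then be handled by appending another word set to the list rather than editing the function.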

Once we have a list of entities, we need to compress them into a set and check that each reference numeral is used consistently.  

First create a set of entity n-grams. Then reverse this and create a dictionary with the reference numeral as the index. Compare.

In [14]:
entities

[[('a', 'DT'),
  ('siderail', 'JJ'),
  ('support', 'NN'),
  ('mechanism', 'NN'),
  ('10', 'CD')],
 [('a', 'DT'), ('siderail', 'NN'), ('15', 'CD')],
 [('The', 'DT'),
  ('siderail', 'JJ'),
  ('support', 'NN'),
  ('mechanism', 'NN'),
  ('10', 'CD')],
 [('a', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'),
  ('pair', 'NN'),
  ('of', 'IN'),
  ('fasteners', 'NNS'),
  ('22', 'CD')],
 [('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')],
 [('a', 'DT'),
  ('second', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('45', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('arm', 'NN'), ('50', 'CD')],
 [('The', 'DT'), ('siderail', 'NN'), ('15',

In [15]:
def get_entity_set(entity_list):
    """ Get a set of unique entity n-grams from a list of entities."""
    ngram_list = list()
    for entity in entity_list:
        ngram_list.append(" ".join([word for word, pos in entity if (pos != 'DT' and pos != 'CD')]))
    return set(ngram_list)              

In [16]:
get_entity_set(entities)

{'L-shaped slot',
 'angle',
 'angles',
 'anterior face',
 'aperture',
 'arcuate indexing slot',
 'arms',
 'axis of its respective lower pivot shaft',
 'base',
 'bias of spring',
 'bushing',
 'bypass plate',
 'catch',
 'catches',
 'central shaft aperture',
 'circumferential portion',
 'collar',
 'collar or notched lower pivot shaft',
 'collars',
 'cutout section',
 'cutout sections',
 'detent',
 'differences found in locking plate',
 'dog-bone locking plate',
 'face',
 'first and second arms',
 'first and second lower pivot shafts',
 'first arm',
 'first end',
 'first lower pivot shaft',
 'first notch',
 'first notches',
 'first opening',
 'first upper pivot shaft',
 'indexing aperture',
 'indexing ball bearing',
 'indexing detent',
 'indexing detents',
 'indexing pin',
 'indexing slot',
 'inwardly projecting locking cog',
 'lateral faces , of locking cog',
 'lock release pin aperture',
 'locking cog',
 'locking cogs',
 'locking plate',
 'lockout pin',
 'lockout pin aperture',
 'lower e

In [None]:
def get_refnum_dict(entity_list):
    """ Get a dictionary of reference numerals indexed by entity ngram. """
    pass

In [17]:
def get_entity_dict(entity_list):
    """ Get a dictionary of entities indexed by reference numeral."""
    entity_dict = {}
    for entity in entity_list:
        ref_num = entity[-1][0]
        # Drop the leading determiner and the trailing numeral
        n_gram = " ".join([w for w, _ in entity[1:-1]])
        variations = entity_dict.setdefault(ref_num, [])
        # Only record a variation the first time it is seen
        if n_gram not in variations:
            variations.append(n_gram)
    return entity_dict

In [18]:
get_entity_dict(entities)

{'10': ['siderail support mechanism'],
 '100': ['siderail support mechanism'],
 '110': ['synchronization link 85 and through apertures'],
 '120': ['series of circumferentially spaced notches',
  'notches',
  'respective notches'],
 '125': ['notches', 'respective notches'],
 '140': ['collar'],
 '145': ['pair of notches'],
 '15': ['siderail'],
 '155': ['“ dog-bone ” locking plate',
  'dog-bone locking plate',
  'locking plate'],
 '160': ['first end'],
 '165': ['oblong aperture', 'oblong apertures'],
 '170': ['second end'],
 '175': ['oblong aperture'],
 '180': ['inwardly projecting locking cog', 'locking cogs'],
 '190': ['lockout pin aperture'],
 '195': ['lock release pin aperture'],
 '20': ['mounting bracket'],
 '200': ['lockout pin'],
 '205': ['washer'],
 '210': ['bushing'],
 '215': ['bypass plate'],
 '22': ['pair of fasteners'],
 '220': ['central shaft aperture', 'shaft aperture'],
 '225': ['face'],
 '230': ['L-shaped slot'],
 '235': ['arcuate indexing slot', 'indexing slot'],
 '240': 

In [19]:
def highlight_multiple(entity_dict):
    """ Highlight reference numerals used for multiple entities. """
    for key, value in entity_dict.items():
        if len(value) > 1:
            print(key, value)

In [20]:
highlight_multiple(get_entity_dict(entities))

120 ['series of circumferentially spaced notches', 'notches', 'respective notches']
165 ['oblong aperture', 'oblong apertures']
155 ['“ dog-bone ” locking plate', 'dog-bone locking plate', 'locking plate']
35 ['first arm', 'first and second arms', 'arms', 'support arms']
125 ['notches', 'respective notches']
240 ['pair of indexing detents', 'indexing detents', 'detent']
410 ['respective angle', 'angle']
220 ['central shaft aperture', 'shaft aperture']
395 ['pair of side walls', 'sidewalls']
405 ['respective angle', 'angles', 'angle']
345 ['locking cog', 'locking cogs']
245 ['indexing detent', 'detent']
330 ['cutout section', 'cutout sections']
335 ['pair of oblong apertures', 'oblong apertures', 'oblong aperture']
290 ['radially extending portion', 'radial portion']
235 ['arcuate indexing slot', 'indexing slot']
325 ['second notch', 'second notches']
315 ['collars', 'respective collar']
302 ['“ dog-bone ” locking plate', 'differences found in locking plate', 'locking plate']
55 ['first

In [21]:
# Add as functions to patent document model
pdoc.description.entities

{'10': ['siderail support mechanism'],
 '100': ['siderail support mechanism'],
 '110': ['synchronization link 85 and through apertures'],
 '120': ['series of circumferentially spaced notches',
  'notches',
  'respective notches'],
 '125': ['notches', 'respective notches'],
 '140': ['collar'],
 '145': ['pair of notches'],
 '15': ['siderail'],
 '155': ['“ dog-bone ” locking plate',
  'dog-bone locking plate',
  'locking plate'],
 '160': ['first end'],
 '165': ['oblong aperture', 'oblong apertures'],
 '170': ['second end'],
 '175': ['oblong aperture'],
 '180': ['inwardly projecting locking cog', 'locking cogs'],
 '190': ['lockout pin aperture'],
 '195': ['lock release pin aperture'],
 '20': ['mounting bracket'],
 '200': ['lockout pin'],
 '205': ['washer'],
 '210': ['bushing'],
 '215': ['bypass plate'],
 '22': ['pair of fasteners'],
 '220': ['central shaft aperture', 'shaft aperture'],
 '225': ['face'],
 '230': ['L-shaped slot'],
 '235': ['arcuate indexing slot', 'indexing slot'],
 '240': 

In [22]:
pdoc.description.entity_check()

120 ['series of circumferentially spaced notches', 'notches', 'respective notches']
165 ['oblong aperture', 'oblong apertures']
155 ['“ dog-bone ” locking plate', 'dog-bone locking plate', 'locking plate']
35 ['first arm', 'first and second arms', 'arms', 'support arms']
125 ['notches', 'respective notches']
240 ['pair of indexing detents', 'indexing detents', 'detent']
410 ['respective angle', 'angle']
220 ['central shaft aperture', 'shaft aperture']
395 ['pair of side walls', 'sidewalls']
405 ['respective angle', 'angles', 'angle']
345 ['locking cog', 'locking cogs']
245 ['indexing detent', 'detent']
330 ['cutout section', 'cutout sections']
335 ['pair of oblong apertures', 'oblong apertures', 'oblong aperture']
290 ['radially extending portion', 'radial portion']
235 ['arcuate indexing slot', 'indexing slot']
325 ['second notch', 'second notches']
315 ['collars', 'respective collar']
302 ['“ dog-bone ” locking plate', 'differences found in locking plate', 'locking plate']
55 ['first

To do:  
* We need to handle multiple reference numerals, e.g. lines such as "siderail first and second openings 55, 65"
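
One possible approach to this to-do item is to look ahead from the first numeral and collect any further numerals joined by a comma or "and". This is only a sketch: `collect_numerals` is a hypothetical helper, and it assumes the tokenizer emits the comma as its own token and tags each numeral as CD.

```python
def collect_numerals(pos_list, start):
    """Collect a run of numerals such as '55 , 65' or '30 and 45'.

    `start` is the index of the first CD token. Returns the numerals found
    and the index of the first token after the run.
    """
    numerals = [pos_list[start][0]]
    i = start + 1
    while i + 1 < len(pos_list):
        word, _ = pos_list[i]
        next_word, next_pos = pos_list[i + 1]
        # Continue only over a separator that is followed by another numeral
        if word in {",", "and"} and next_pos == "CD":
            numerals.append(next_word)
            i += 2
        else:
            break
    return numerals, i


# Illustrative tags for "the siderail first and second openings 55, 65 are"
sample_pos = [
    ("the", "DT"), ("siderail", "NN"), ("first", "JJ"), ("and", "CC"),
    ("second", "JJ"), ("openings", "NNS"), ("55", "CD"), (",", ","),
    ("65", "CD"), ("are", "VBP"),
]
numerals, next_index = collect_numerals(sample_pos, 6)
print(numerals, next_index)
```

Because the look-ahead requires a numeral straight after the separator, a phrase like "link 85 and through apertures 110" stops after 85 and is not merged into one run.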

## Testing on Other Patent Specifications

In [23]:
path2 = '/media/SAMSUNG1/Patent_Downloads'
ds2 = USPublications(path2)

pdoc_generator = ds2.patentdoc_generator(classification=["G","06"], sample_size=5)

In [26]:
e = dict()
for i, pdoc in enumerate(pdoc_generator):
    e[i] = pdoc.description.entities
    pdoc.description.entity_check()

180B ['dotted circle', 'reaching range']
160 ['second switch', 'node to be started side adapter 120 and second switch']
240 ['node', 'nodes', 'other nodes', 'plurality of nodes']
113 ['power reception unit', 'first power reception unit']
232 ['application unit', 'address confirmation unit']
118 ['start-up signal reception unit', 'short-distance radio reception interface ( start-up signal reception unit']
240S6 ['first group in which node 240S1 to be started to nodes', 'first group ( in which node 240S1 to be started to nodes']
240S7 ['belonging set , and node addresses of nodes', 'node']
180A ['dotted circle', 'start-up signal reaching range', 'reaching range']
110 ['start node side adapter', 'conventional typical switch and start node side adapter']
240S1 ['plurality of nodes', 'nodes', 'node', 'node addresses of nodes']
24031 ['nodes', 'node']
4 ['second modification is classified as CLASS', 'start-up power signal is CLASS', 'transmission destination device is CLASS', 'connection des

In [27]:
e[0]

{'110': ['start node side adapter',
  'conventional typical switch and start node side adapter'],
 '111': ['first interface'],
 '112': ['first data transmission unit'],
 '113': ['power reception unit', 'first power reception unit'],
 '114': ['PoE controller'],
 '115': ['power transmission unit'],
 '116': ['second data transmission unit'],
 '117': ['second interface'],
 '118': ['start-up signal reception unit',
  'short-distance radio reception interface ( start-up signal reception unit'],
 '120': ['started side adapter'],
 '121': ['first interface'],
 '122': ['first data transmission unit'],
 '123': ['first power reception unit'],
 '124': ['PoE conversion unit'],
 '125': ['second power reception unit'],
 '126': ['second data transmission unit'],
 '127': ['second interface'],
 '128': ['start-up signal transmission unit'],
 '130': ['start node'],
 '131': ['data communication interface'],
 '132': ['start-up signal transmission unit'],
 '140': ['node'],
 '141': ['data communication interfa

In [28]:
e[1]

{'102': ['task circle',
  'unlimited number of task circles',
  'large number of task circles',
  'various task circles',
  'number of task circles',
  'circle',
  'single task circle',
  'higher priority than other subsets of task circles',
  'individual task circles',
  'individual task circle'],
 '104': ['orbit', 'given grouping or orbit'],
 '106': ['center point'],
 '108': ['lines'],
 '110': ['task circles 102 are labeled with labels', 'labels'],
 '1100': ['display'],
 '114': ['new task circle'],
 '116': ['indicator', 'task quantity indicator'],
 '118': ['additional indicator', 'abstraction quantity indicator'],
 '120': ['“ Programming ” task represented by task circle',
  '“ Programming ” task circle',
  'task circle'],
 '122': ['currently selected task circle'],
 '124': ['new orbit'],
 '126': ['“ up ” indicator'],
 '128': ['“ down ” indicator'],
 '130': ['corresponding task circle , such as task circle', 'task circle'],
 '132': ['outer shape'],
 '136': ['inner shape'],
 '138': ['

Another check we need is the reverse of the dictionary: see which reference numerals are assigned to each n-gram.
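
A minimal sketch of that reverse check, assuming the input is a dictionary shaped like the `entities` output above (numeral to list of n-grams). `invert_entity_dict` and `shared_ngrams` are hypothetical names, and the sample is a short excerpt of the `e[0]` output.

```python
from collections import defaultdict


def invert_entity_dict(entity_dict):
    """Map each n-gram to the sorted list of numerals it appears with."""
    refnum_dict = defaultdict(set)
    for ref_num, n_grams in entity_dict.items():
        for n_gram in n_grams:
            refnum_dict[n_gram].add(ref_num)
    return {n_gram: sorted(nums) for n_gram, nums in refnum_dict.items()}


def shared_ngrams(refnum_dict):
    """Return only the n-grams that are used with more than one numeral."""
    return {n: nums for n, nums in refnum_dict.items() if len(nums) > 1}


# Excerpt of the e[0] output above
sample = {
    "111": ["first interface"],
    "114": ["PoE controller"],
    "121": ["first interface"],
}
shared = shared_ngrams(invert_entity_dict(sample))
print(shared)
```

An n-gram with several numerals is not necessarily an error (two adapters can each have a "first interface"), so the result is best treated as a list of candidates for manual review.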

In [29]:
e[2]

{'10': ['computer-based platforms', 'computer-based platform'],
 '110': ['loading process ( Step'],
 '120': ['boot program is undamaged ( Step'],
 '130': ['update process ( Step'],
 '20': ['processing unit'],
 '200': ['flag value ( Step'],
 '210': ['pre-determined value ( Step'],
 '230': ['loading is successful ( Step'],
 '240': ['BIOS ( Step'],
 '30': ['detection configuration unit'],
 '40': ['dynamic access memory unit'],
 '50': ['programmable storage unit', 'load unit'],
 '60': ['load unit'],
 '70': ['bus'],
 '80': ['boot program from one external source',
  'boot program to be updated or restored from one source',
  'source',
  'selected source',
  'search for sources']}