# Exploring spaCy using Patent Data

This notebook applies spaCy (https://spacy.io/) to patent data.

In [1]:
#Let's import spaCy
import spacy

nlp = spacy.load('en') 

In [2]:
#Let's get some Patent Data

# We'll start with our test XML file
from patentdata.corpus import USPublications

path = '/patentdata/tests/test_files'
ds = USPublications(path)

pdoc = next(ds.iter_xml()).to_patentdoc()
print(pdoc)

<Patent Document object for US20060085912A1, title: Siderail support mechanism - containing: description with 47 paragraphs and claimset with 39 claims; classifications: [['A', '47', 'C', '21', '08']]


In [3]:
# Create a parsed spaCy document object from the patent description
doc = nlp(pdoc.description.text)

In [4]:
doc

A siderail support mechanism with multiple locks and an impact release feature is configured to positively lock in an upright deployed position, but is adapted to release upon imposition of a longitudinal impact load such as that caused by striking a stationary barrier. 
 This application claims priority under 35 U.S.C. §119(e) of copending provisional application Ser. No. 60/622 503 filed Oct. 27, 2004, the entire disclosure of which is herein incorporated by reference.
 1. Field of the Invention 
 The invention relates to support mechanisms for hospital bed siderails. In one of its aspects, the invention relates to a locking mechanism for siderail support mechanisms. In another of its aspects, the invention relates to a siderail support mechanism with an impact release feature. 
 2. Description of Related Art 
 Four-bar link siderail support mechanisms require being locked in various positions. It is important that the siderail stay locked for patient safety. 
 It is also known that 

In [5]:
# A document is represented as a series of tokens
token = doc[0]
print(token)

A


In [6]:
# Sentences can also be returned 
sentence = next(doc.sents)
print(sentence)

A siderail support mechanism with multiple locks and an impact release feature is configured to positively lock in an upright deployed position, but is adapted to release upon imposition of a longitudinal impact load such as that caused by striking a stationary barrier. 
 


Sentence segmentation will need some tweaking: abbreviations such as "FIG." and "Ser." are treated as sentence boundaries. One option may be to add these abbreviations as custom tokenizer terms, similar to the "...gimme..." example here: https://spacy.io/docs/usage/customizing-tokenizer.
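
If customizing the tokenizer proves awkward, another option is to repair the splits after segmentation. The sketch below is a post-processing workaround, not spaCy's tokenizer API: `merge_abbreviation_splits` and the `ABBREVIATIONS` tuple are hypothetical names, and the abbreviation list is an assumption to be extended as needed.

```python
# Hypothetical helper: re-join sentence fragments that were split after an abbreviation.
ABBREVIATIONS = ("FIG.", "FIGS.", "Ser.", "No.")  # assumed list, not exhaustive


def merge_abbreviation_splits(sentences, abbreviations=ABBREVIATIONS):
    """Join a fragment to the previous one when the previous one ends in an abbreviation."""
    merged = []
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue  # skip whitespace-only fragments
        if merged and merged[-1].endswith(abbreviations):
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


fragments = ["FIG.", "14 is a partial cut-away view.", "The siderail stays locked."]
print(merge_abbreviation_splits(fragments))
```

The helper takes any iterable of strings, so it could be fed with `(s.text for s in doc.sents)`. A sentence that genuinely ends in "No." would be merged with its successor, so the abbreviation list is best kept short.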

### Word Token Information
Let's have a look at some of the information available for each word in the parsed document.

In [7]:
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_, word.dep, word.dep_)

A 506 a 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 74185 compound
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 426 nsubjpass
with 548 with 466 IN 83 ADP 439 prep
multiple 2407 multiple 467 JJ 82 ADJ 398 amod
locks 4285 lock 477 NNS 90 NOUN 435 pobj
and 512 and 458 CC 87 CCONJ 403 cc
an 591 an 460 DT 88 DET 411 det
impact 3166 impact 474 NN 90 NOUN 74185 compound
release 2537 release 474 NN 90 NOUN 74185 compound
feature 2572 feature 474 NN 90 NOUN 406 conj
is 536 be 493 VBZ 98 VERB 402 auxpass
configured 289176 configure 491 VBN 98 VERB 512817 ROOT
to 504 to 486 TO 92 PART 401 aux
positively 10485 positively 481 RB 84 ADV 396 advmod
lock 4285 lock 488 VB 98 VERB 445 xcomp
in 522 in 484 RP 92 PART 440 prt
an 591 an 460 DT 88 DET 411 det
upright 147539 upright 467 JJ 82 ADJ 398 amod
deployed 447305 deploy 491 VBN 98 VERB 398 amod
position 1599 position 474 NN 90 NOUN 412 dobj
, 450 , 450 , 95 PUNCT 441 punct
but 559 but 458 CC 87 CCONJ 403 cc
is 536 be 493 VBZ 98 VERB 402 auxpass
adapted 9081 adapt 491 VBN 98 VERB 406 conj
to 504 to 486 TO 92 PART 401 aux
release 2537 release 488 VB 98 VERB 445 xcomp
upon 1862 upon 466 IN 83 ADP 439 prep
imposition 559888 imposition 474 NN 90 NOUN 435 pobj
of 510 of 466 IN 83 ADP 439 prep
a 506 a 460 DT 88 DET 411 det
longitudinal 236580 longitudinal 467 JJ 82 ADJ 398 amod
impact 3166 impact 474 NN 90 NOUN 74185 compound
load 3202 load 474 NN 90 NOUN 435 pobj
such 829 such 467 JJ 82 ADJ 398 amod
as 557 as 466 IN 83 ADP 439 prep
that 514 that 460 DT 88 DET 435 pobj
caused 1312 cause 491 VBN 98 VERB 758131 acl
by 605 by 466 IN 83 ADP 397 agent
striking 4935 strike 490 VBG 98 VERB 434 pcomp
a 506 a 460 DT 88 DET 411 det
stationary 340953 stationary 467 JJ 8

- 535 - 97 SYM 97 SYM 441 punct
12 2346 12 459 CD 91 NUM 439 prep
in 522 in 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
intermediate 79754 intermediate 467 JJ 82 ADJ 398 amod
height 6070 height 474 NN 90 NOUN 422 nmod
locked 5024 locked 467 JJ 82 ADJ 398 amod
position 1599 position 474 NN 90 NOUN 435 pobj
; 620 ; 454 : 95 PUNCT 441 punct
and 512 and 458 CC 87 CCONJ 403 cc

  3416 
  485 SP 101 SPACE 0 
FIG 515099 fig 475 NNP 94 PROPN 406 conj
. 453 . 453 . 95 PUNCT 441 punct
14 3601 14 459 CD 91 NUM 425 nsubj
is 536 be 493 VBZ 98 VERB 512817 ROOT
a 506 a 460 DT 88 DET 411 det
partial 8192 partial 467 JJ 82 ADJ 398 amod
cut 1478 cut 474 NN 90 NOUN 398 amod
- 535 - 465 HYPH 95 PUNCT 441 punct
away 944 away 481 RB 84 ADV 396 advmod
view 1514 view 474 NN 90 NOUN 400 attr
of 510 of 466 IN 83 ADP 439 prep
a 506 a 460 DT 88 DET 411 det
locking 4285 lock 490 VBG 98 VERB 398 amod
cog 204126 cog 474 NN 90 NOUN 435 pobj
and 512 and 458 CC 87 CCONJ 403 cc
receiving 3462 receive 490 VB

a 506 a 460 DT 88 DET 411 det
first 774 first 467 JJ 82 ADJ 398 amod
arm 3039 arm 474 NN 90 NOUN 435 pobj
35 5576 35 459 CD 91 NUM 758136 nummod
and 512 and 458 CC 87 CCONJ 403 cc
a 506 a 460 DT 88 DET 411 det
second 1234 second 467 JJ 82 ADJ 398 amod
opening 4162 opening 474 NN 90 NOUN 406 conj
40 2574 40 459 CD 91 NUM 758136 nummod
adapted 9081 adapt 489 VBD 98 VERB 758131 acl
for 531 for 466 IN 83 ADP 439 prep
receiving 3462 receive 490 VBG 98 VERB 434 pcomp
a 506 a 460 DT 88 DET 411 det
second 1234 second 467 JJ 82 ADJ 398 amod
lower 1481 low 468 JJR 82 ADJ 398 amod
pivot 365003 pivot 474 NN 90 NOUN 74185 compound
shaft 21354 shaft 474 NN 90 NOUN 412 dobj
45 5664 45 459 CD 91 NUM 758136 nummod
of 510 of 466 IN 83 ADP 439 prep
a 506 a 460 DT 88 DET 411 det
second 1234 second 467 JJ 82 ADJ 398 amod
arm 3039 arm 474 NN 90 NOUN 435 pobj
50 1745 50 459 CD 91 NUM 758136 nummod
. 453 . 453 . 95 PUNCT 441 punct
The 501 the 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 425 n

continue 1696 continue 488 VB 98 VERB 395 advcl
to 504 to 486 TO 92 PART 401 aux
rotate 254715 rotate 488 VB 98 VERB 445 xcomp
in 522 in 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
same 725 same 467 JJ 82 ADJ 398 amod
direction 2818 direction 474 NN 90 NOUN 435 pobj
as 557 as 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 74185 compound
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 435 pobj
10 1143 10 459 CD 91 NUM 758136 nummod
moves 1393 move 477 NNS 90 NOUN 412 dobj
through 875 through 466 IN 83 ADP 439 prep
its 757862 -PRON- 480 PRP$ 82 ADJ 436 poss
full 1171 full 467 JJ 82 ADJ 74185 compound
range 3124 range 474 NN 90 NOUN 435 pobj
of 510 of 466 IN 83 ADP 439 prep
motion 5592 motion 474 NN 90 NOUN 435 pobj
. 453 . 453 . 95 PUNCT 441 punct

  3416 
  485 SP 101 SPACE 0 
The 501 the 460 DT 88 DET 411 det
first 774 first 467 JJ 82 ADJ 398 amod
lower 1481 low 468 JJR 82 ADJ 398 

the 501 the 460 DT 88 DET 411 det
locking 4285 lock 490 VBG 98 VERB 398 amod
plate 6197 plate 474 NN 90 NOUN 435 pobj
155 32246 155 459 CD 91 NUM 758136 nummod
for 531 for 466 IN 83 ADP 439 prep
cooperating 10925 cooperate 490 VBG 98 VERB 434 pcomp
with 548 with 466 IN 83 ADP 439 prep
a 506 a 460 DT 88 DET 411 det
bypass 188599 bypass 474 NN 90 NOUN 74185 compound
plate 6197 plate 474 NN 90 NOUN 435 pobj
215 59099 215 459 CD 91 NUM 758136 nummod
. 453 . 453 . 95 PUNCT 441 punct

  3416 
  485 SP 101 SPACE 0 
The 501 the 460 DT 88 DET 411 det
bypass 188599 bypass 474 NN 90 NOUN 74185 compound
plate 6197 plate 474 NN 90 NOUN 425 nsubj
215 59099 215 459 CD 91 NUM 758136 nummod
includes 2497 include 493 VBZ 98 VERB 512817 ROOT
a 506 a 460 DT 88 DET 411 det
central 2882 central 467 JJ 82 ADJ 398 amod
shaft 21354 shaft 474 NN 90 NOUN 74185 compound
aperture 314739 aperture 474 NN 90 NOUN 412 dobj
220 55484 220 459 CD 91 NUM 758136 nummod
for 531 for 466 IN 83 ADP 439 prep
receiving 3462 rece

. 453 . 453 . 95 PUNCT 441 punct
The 501 the 460 DT 88 DET 411 det
other 655 other 467 JJ 82 ADJ 398 amod
end 948 end 474 NN 90 NOUN 426 nsubjpass
of 510 of 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
spring 7245 spring 474 NN 90 NOUN 435 pobj
285 15845 285 459 CD 91 NUM 758136 nummod
is 536 be 493 VBZ 98 VERB 402 auxpass
attached 136004 attach 491 VBN 98 VERB 512817 ROOT
to 504 to 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
mounting 242066 mount 490 VBG 98 VERB 398 amod
bracket 8706 bracket 474 NN 90 NOUN 435 pobj
20 1485 20 459 CD 91 NUM 758136 nummod
for 531 for 466 IN 83 ADP 439 prep
biasing 3136 bias 490 VBG 98 VERB 434 pcomp
the 501 the 460 DT 88 DET 411 det
lock 4285 lock 474 NN 90 NOUN 74185 compound
release 2537 release 474 NN 90 NOUN 74185 compound
lever 212417 lever 474 NN 90 NOUN 412 dobj
265 15871 265 459 CD 91 NUM 758136 nummod
about 581 about 466 IN 83 ADP 396 advmod
the 501 the 460 DT 88 DET 411 det
pivot 365003 pivot 474 NN 90 NOUN 74185 compo

rotated 254715 rotate 491 VBN 98 VERB 395 advcl
about 581 about 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
pivot 365003 pivot 474 NN 90 NOUN 74185 compound
pin 4337 pin 474 NN 90 NOUN 435 pobj
270 69400 270 459 CD 91 NUM 758136 nummod
, 450 , 450 , 95 PUNCT 441 punct
the 501 the 460 DT 88 DET 411 det
lock 4285 lock 474 NN 90 NOUN 74185 compound
release 2537 release 474 NN 90 NOUN 74185 compound
pin 4337 pin 474 NN 90 NOUN 399 appos
275 26523 275 459 CD 91 NUM 758136 nummod
shifts 3818 shift 477 NNS 90 NOUN 412 dobj
to 504 to 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
right 679 right 474 NN 90 NOUN 435 pobj
while 835 while 466 IN 83 ADP 419 mark
inserted 7084 insert 491 VBN 98 VERB 395 advcl
in 522 in 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
lock 4285 lock 474 NN 90 NOUN 74185 compound
release 2537 release 474 NN 90 NOUN 74185 compound
pin 4337 pin 474 NN 90 NOUN 74185 compound
aperture 314739 aperture 474 NN 90 NOUN 435 pobj
195 73265 195 459

30 1735 30 459 CD 91 NUM 758136 nummod
, 450 , 450 , 95 PUNCT 441 punct
45 5664 45 459 CD 91 NUM 406 conj
. 453 . 453 . 95 PUNCT 441 punct
The 501 the 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 398 amod
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 425 nsubj
10 1143 10 459 CD 91 NUM 758136 nummod
is 536 be 493 VBZ 98 VERB 512817 ROOT
releasable 777001 releasable 467 JJ 82 ADJ 394 acomp
to 504 to 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
lowered 1794 lower 491 VBN 98 VERB 398 amod
position 1599 position 474 NN 90 NOUN 435 pobj
by 605 by 466 IN 83 ADP 439 prep
an 591 an 460 DT 88 DET 411 det
operator 8229 operator 474 NN 90 NOUN 435 pobj
depressing 144054 depress 490 VBG 98 VERB 758131 acl
the 501 the 460 DT 88 DET 411 det
release 2537 release 474 NN 90 NOUN 425 nsubj
handle 2769 handle 488 VB 98 VERB 404 ccomp
280 20728 280 459 CD 91 NUM 412 dobj
to 504 to 486 TO 92 PART 401 aux
unlock 515904 unlock 488 VB 98

of 510 of 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
detent 774040 detent 474 NN 90 NOUN 435 pobj
240 16163 240 459 CD 91 NUM 758136 nummod
. 453 . 453 . 95 PUNCT 441 punct
When 634 when 497 WRB 84 ADV 396 advmod
the 501 the 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 425 nsubj
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 425 nsubj
10 1143 10 459 CD 91 NUM 758136 nummod
reaches 3306 reach 493 VBZ 98 VERB 395 advcl
the 501 the 460 DT 88 DET 411 det
fully 2546 fully 481 RB 84 ADV 396 advmod
lowered 1794 lower 491 VBN 98 VERB 398 amod
position 1599 position 474 NN 90 NOUN 412 dobj
, 450 , 450 , 95 PUNCT 441 punct
the 501 the 460 DT 88 DET 411 det
bypass 188599 bypass 474 NN 90 NOUN 74185 compound
plate 6197 plate 474 NN 90 NOUN 425 nsubj
has 539 have 493 VBZ 98 VERB 401 aux
rotated 254715 rotate 491 VBN 98 VERB 512817 ROOT
approximately 7768 approximately 481 RB 84 ADV 396 advmod
20 1485 20 459 CD 91 NUM 758136 

, 450 , 450 , 95 PUNCT 441 punct
the 501 the 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 74185 compound
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 425 nsubj
10 1143 10 459 CD 91 NUM 758136 nummod
does 544 do 493 VBZ 98 VERB 512817 ROOT
not 538 not 481 RB 84 ADV 421 neg
“ 524 " 453 . 95 PUNCT 441 punct
catch 2914 catch 488 VB 98 VERB 512817 ROOT
” 524 " 472 NFP 95 PUNCT 441 punct
in 522 in 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
intermediate 79754 intermediate 467 JJ 82 ADJ 398 amod
position 1599 position 474 NN 90 NOUN 435 pobj
while 835 while 466 IN 83 ADP 419 mark
the 501 the 460 DT 88 DET 411 det
operator 8229 operator 474 NN 90 NOUN 425 nsubj
attempts 2265 attempt 493 VBZ 98 VERB 395 advcl
to 504 to 486 TO 92 PART 401 aux
raise 2367 raise 488 VB 98 VERB 445 xcomp
it 757862 -PRON- 479 PRP 93 PRON 412 dobj
to 504 to 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
full 1171 full 467 JJ 82 ADJ 3

a 506 a 460 DT 88 DET 411 det
locking 4285 lock 490 VBG 98 VERB 398 amod
cog 204126 cog 474 NN 90 NOUN 412 dobj
355 29334 355 459 CD 91 NUM 758136 nummod
and 512 and 458 CC 87 CCONJ 403 cc
a 506 a 460 DT 88 DET 411 det
catch 2914 catch 474 NN 90 NOUN 406 conj
360 5226 360 459 CD 91 NUM 758136 nummod
. 453 . 453 . 95 PUNCT 441 punct

  3416 
  485 SP 101 SPACE 0 
As 557 as 466 IN 83 ADP 419 mark
shown 1054 show 491 VBN 98 VERB 512817 ROOT
in 522 in 466 IN 83 ADP 439 prep
FIG 515099 fig 475 NNP 94 PROPN 435 pobj
. 453 . 453 . 95 PUNCT 441 punct
10 1143 10 459 CD 91 NUM 426 nsubjpass
, 450 , 450 , 95 PUNCT 441 punct
with 548 with 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
locking 4285 lock 490 VBG 98 VERB 398 amod
plate 6197 plate 474 NN 90 NOUN 435 pobj
302 104959 302 459 CD 91 NUM 758136 nummod
in 522 in 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
locked 5024 locked 467 JJ 82 ADJ 398 amod
position 1599 position 474 NN 90 NOUN 435 pobj
( 562 ( 451 -LRB- 95 PUNC

have 539 have 492 VBP 98 VERB 401 aux
been 536 be 491 VBN 98 VERB 402 auxpass
rotated 254715 rotate 491 VBN 98 VERB 512817 ROOT
into 696 into 466 IN 83 ADP 439 prep
alignment 245271 alignment 474 NN 90 NOUN 435 pobj
with 548 with 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
locking 4285 lock 490 VBG 98 VERB 398 amod
cogs 204126 cog 477 NNS 90 NOUN 435 pobj
345 61593 345 459 CD 91 NUM 758136 nummod
, 450 , 450 , 95 PUNCT 441 punct
355 29334 355 459 CD 91 NUM 399 appos
. 453 . 453 . 95 PUNCT 441 punct
The 501 the 460 DT 88 DET 411 det
first 774 first 467 JJ 82 ADJ 398 amod
notches 241343 notch 477 NNS 90 NOUN 426 nsubjpass
320 51388 320 459 CD 91 NUM 758136 nummod
, 450 , 450 , 95 PUNCT 441 punct
322 10358 322 459 CD 91 NUM 426 nsubjpass
have 539 have 492 VBP 98 VERB 401 aux
been 536 be 491 VBN 98 VERB 402 auxpass
rotated 254715 rotate 491 VBN 98 VERB 512817 ROOT
into 696 into 466 IN 83 ADP 439 prep
alignment 245271 alignment 474 NN 90 NOUN 435 pobj
with 548 with 466 IN 83 AD

is 536 be 493 VBZ 98 VERB 402 auxpass
splayed 777006 splay 491 VBN 98 VERB 396 advmod
outward 261701 outward 481 RB 84 ADV 429 oprd
at 584 at 466 IN 83 ADP 439 prep
a 506 a 460 DT 88 DET 411 det
respective 119298 respective 467 JJ 82 ADJ 398 amod
angle 5686 angle 474 NN 90 NOUN 435 pobj
405 30854 405 459 CD 91 NUM 758136 nummod
, 450 , 450 , 95 PUNCT 441 punct
407 225874 407 459 CD 91 NUM 758136 nummod
from 595 from 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
perpendicular 62824 perpendicular 467 JJ 82 ADJ 435 pobj
. 453 . 453 . 95 PUNCT 441 punct
The 501 the 460 DT 88 DET 411 det
angles 5686 angle 477 NNS 90 NOUN 425 nsubj
405 30854 405 459 CD 91 NUM 758136 nummod
, 450 , 450 , 95 PUNCT 441 punct
407 225874 407 459 CD 91 NUM 425 nsubj
define 3445 define 492 VBP 98 VERB 512817 ROOT
a 506 a 460 DT 88 DET 411 det
total 1886 total 467 JJ 82 ADJ 398 amod
inclusive 299333 inclusive 467 JJ 82 ADJ 398 amod
angle 5686 angle 474 NN 90 NOUN 412 dobj
preferably 253349 preferably 481 

reduces 3501 reduce 493 VBZ 98 VERB 395 advcl
the 501 the 460 DT 88 DET 411 det
jarring 70685 jarring 474 NN 90 NOUN 412 dobj
of 510 of 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
patient 5857 patient 474 NN 90 NOUN 435 pobj
while 835 while 466 IN 83 ADP 419 mark
being 536 be 490 VBG 98 VERB 402 auxpass
transported 8714 transport 491 VBN 98 VERB 395 advcl
in 522 in 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
bed 3465 bed 474 NN 90 NOUN 435 pobj
, 450 , 450 , 95 PUNCT 441 punct
and 512 and 458 CC 87 CCONJ 403 cc
further 1850 further 467 JJ 82 ADJ 396 advmod
prevents 2347 prevent 493 VBZ 98 VERB 403 cc
the 501 the 460 DT 88 DET 411 det
destruction 4319 destruction 474 NN 90 NOUN 412 dobj
of 510 of 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
siderail 776980 siderail 474 NN 90 NOUN 74185 compound
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 435 pobj
. 453 . 453 . 95 PUNCT 441 punct

  3416 
  485 SP 101 SP

The meaning of the dependency tags (word.dep\_) can be found here: https://nlp.stanford.edu/software/dependencies_manual.pdf

### Named Entity Recognition 

In [8]:
ents = list(doc.ents)
for entity in ents:
    print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))

391 QUANTITY 35 U.S.C. §
377 PERSON Ser
393 CARDINAL 60/622 503
387 DATE Oct. 27 , 2004
393 CARDINAL 1
393 CARDINAL one
393 CARDINAL 2
380 ORG Related Art
393 CARDINAL Four
377 PERSON 
 
393 CARDINAL 1
385 WORK_OF_ART 
  FIG
393 CARDINAL 2
380 ORG FIG
393 CARDINAL 1
393 CARDINAL 
 
380 ORG FIG
393 CARDINAL 3
380 ORG FIGS
393 CARDINAL 1
393 CARDINAL 2
383 PRODUCT 
  FIG
393 CARDINAL 4
380 ORG FIGS
393 CARDINAL 1
383 PRODUCT 
  FIG
393 CARDINAL 5
380 ORG FIGS
393 CARDINAL 1
383 PRODUCT 
  FIG
393 CARDINAL 6
380 ORG FIGS
393 CARDINAL 1
383 PRODUCT 
  FIG
393 CARDINAL 7
380 ORG FIGS
393 CARDINAL 1
383 PRODUCT 
  FIG
393 CARDINAL 8
380 ORG FIGS
393 CARDINAL 1
383 PRODUCT 
  FIG
393 CARDINAL 9
380 ORG FIGS
393 CARDINAL 1
383 PRODUCT 
  FIG
393 CARDINAL 10
385 WORK_OF_ART 
  FIG
393 CARDINAL 11
380 ORG FIG
393 CARDINAL 10
383 PRODUCT 
  FIG
393 CARDINAL 12
380 ORG FIGS
387 DATE 10 - 11
383 PRODUCT 
  FIG
393 CARDINAL 13
380 ORG FIGS
393 CARDINAL 10 - 12
393 CARDINAL 14
380 ORG FIGS
389 PERCEN

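Out of the box, named entity recognition is not very useful on this text. Most of what it extracts are figure labels and reference numerals, and the labels assigned to "FIG" (ORG, PRODUCT, WORK_OF_ART) are wrong.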
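
One way to check how much of the entity list is noise is to separate figure labels and bare numerals from everything else. The sketch below works on plain `(label, text)` pairs copied from the start of the output above; `split_noise` and the regular expression are assumptions made for illustration, not part of spaCy.

```python
import re
from collections import Counter

# (label, text) pairs copied from the first rows of the output above.
sample_ents = [
    ("QUANTITY", "35 U.S.C. §"),
    ("PERSON", "Ser"),
    ("CARDINAL", "60/622 503"),
    ("DATE", "Oct. 27 , 2004"),
    ("CARDINAL", "1"),
    ("ORG", "Related Art"),
    ("ORG", "FIG"),
    ("CARDINAL", "3"),
    ("ORG", "FIGS"),
]

# Matches "FIG", "FIGS", a bare number, or a number range such as "10 - 12".
FIGURE_OR_NUMBER = re.compile(r"^\s*(FIGS?|\d+(\s*-\s*\d+)?)\s*$")


def split_noise(ents):
    """Separate figure labels and bare numerals from the remaining entities."""
    noise = [(label, text) for label, text in ents if FIGURE_OR_NUMBER.match(text)]
    kept = [(label, text) for label, text in ents if not FIGURE_OR_NUMBER.match(text)]
    return noise, kept


noise, kept = split_noise(sample_ents)
print(Counter(label for label, _ in noise))
print(kept)
```

On the real document the same function could be applied to `[(e.label_, e.text) for e in doc.ents]`.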

### Noun Phrases

In [9]:
import pandas as pd

nounphrases = [[np.orth_, np.root.head.orth_] for np in doc.noun_chunks]
print("There were {} noun phrases found. Here's a sample:".format(len(nounphrases)))

pd.DataFrame(nounphrases[200:250])

There were 683 noun phrases found. Here's a sample:


Unnamed: 0,0,1
0,the synchronization link,in
1,apertures,through
2,the toggles,in
3,The synchronization link,forces
4,the first and second arms,forces
5,the same direction,in
6,the siderail support mechanism,as
7,moves,mechanism
8,its full range,through
9,motion,of


### Verbs

In [10]:
from spacy.symbols import nsubj, VERB
# Find verbs that have a subject: start at each nsubj token and take its head if that head is a verb
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)

In [11]:
verbs_text = set([t.text for t in verbs])

In [12]:
verbs_text

{'adapted',
 'affixed',
 'aligns',
 'apertures',
 'applied',
 'are',
 'attempts',
 'be',
 'become',
 'causes',
 'cogs',
 'come',
 'define',
 'deployed',
 'described',
 'describes',
 'detents',
 'does',
 'faces',
 'forces',
 'found',
 'handle',
 'has',
 'hold',
 'illustrate',
 'include',
 'includes',
 'is',
 'locked',
 'locks',
 'moves',
 'passed',
 'passing',
 'reaches',
 'received',
 'receives',
 'reduces',
 'referred',
 'refers',
 'relates',
 'releases',
 'rotate',
 'rotated',
 'rotates',
 'secures',
 'shifted',
 'shows',
 'stay',
 'traverses',
 'unlocked',
 'urges'}

In [13]:
verbs_text_lemma = set([t.lemma_ for t in verbs])
print(verbs_text_lemma)
print("There are (around) {0} unique verbs in the patent application".format(len(verbs_text_lemma)))

{'affix', 'have', 'urge', 'cog', 'align', 'come', 'adapt', 'relate', 'hold', 'include', 'stay', 'illustrate', 'apertur', 'cause', 'describe', 'shift', 'unlock', 'do', 'lock', 'rotate', 'move', 'show', 'pass', 'reduce', 'define', 'receive', 'detent', 'force', 'find', 'handle', 'secure', 'deploy', 'attempt', 'face', 'reach', 'become', 'release', 'apply', 'traverse', 'be', 'refer'}
There are (around) 41 unique verbs in the patent application


It would be interesting to perform topic modelling based on these extracted verbs and the nouns.

For example, verbs in the above set such as "refer", "illustrate", "show" and "describe" are likely to be relatively common across a patent corpus, whereas verbs such as "rotate", "shift", "urge", "lock", "force", "handle" and "secure" help set the context of the application (e.g. a mechanical invention involving some form of physical lock).

You could do a similar thing with nouns or noun phrases.  
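
A minimal sketch of that idea, assuming lemma sets have already been extracted for several documents: drop the terms that occur in most documents and keep the rest as context terms. `distinctive_terms` and the 0.5 threshold are hypothetical choices, and two of the three documents in the toy corpus are made up.

```python
from collections import Counter


def distinctive_terms(doc_terms, max_doc_fraction=0.5):
    """Return, per document, the terms found in at most max_doc_fraction of all documents."""
    n_docs = len(doc_terms)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in doc_terms.values() for term in set(terms))
    return {
        name: {t for t in terms if doc_freq[t] / n_docs <= max_doc_fraction}
        for name, terms in doc_terms.items()
    }


# Toy corpus: the first set is a subset of the lemmas printed above,
# the other two are invented for illustration.
corpus = {
    "siderail": {"refer", "illustrate", "show", "describe", "rotate", "lock", "urge"},
    "made_up_software": {"refer", "illustrate", "show", "describe", "transmit", "encode"},
    "made_up_chemistry": {"refer", "show", "describe", "dissolve", "heat"},
}
print(sorted(distinctive_terms(corpus)["siderail"]))
```

With a real corpus, the surviving terms could then be used as the vocabulary for a topic model.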


### Entity Extraction

In [14]:
# Create (text, fine-grained tag) tuples
pos = [(word.text, word.tag_) for word in doc]

In [15]:
pos

[('A', 'DT'),
 ('siderail', 'NN'),
 ('support', 'NN'),
 ('mechanism', 'NN'),
 ('with', 'IN'),
 ('multiple', 'JJ'),
 ('locks', 'NNS'),
 ('and', 'CC'),
 ('an', 'DT'),
 ('impact', 'NN'),
 ('release', 'NN'),
 ('feature', 'NN'),
 ('is', 'VBZ'),
 ('configured', 'VBN'),
 ('to', 'TO'),
 ('positively', 'RB'),
 ('lock', 'VB'),
 ('in', 'RP'),
 ('an', 'DT'),
 ('upright', 'JJ'),
 ('deployed', 'VBN'),
 ('position', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('is', 'VBZ'),
 ('adapted', 'VBN'),
 ('to', 'TO'),
 ('release', 'VB'),
 ('upon', 'IN'),
 ('imposition', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('longitudinal', 'JJ'),
 ('impact', 'NN'),
 ('load', 'NN'),
 ('such', 'JJ'),
 ('as', 'IN'),
 ('that', 'DT'),
 ('caused', 'VBN'),
 ('by', 'IN'),
 ('striking', 'VBG'),
 ('a', 'DT'),
 ('stationary', 'JJ'),
 ('barrier', 'NN'),
 ('.', '.'),
 ('\n ', 'SP'),
 ('This', 'DT'),
 ('application', 'NN'),
 ('claims', 'VBZ'),
 ('priority', 'NN'),
 ('under', 'IN'),
 ('35', 'CD'),
 ('U.S.C.', 'NNP'),
 ('§', '.'),
 ('119(e', 'LS'),

In [16]:
from patentdata.models.lib.utils import (
    check_list, string2printint,
    entity_finder, filter_entity_list, get_entity_dict,
    highlight_multiple
)
entities = entity_finder(pos)
entities = filter_entity_list(entities)

In [17]:
entities

[[('a', 'DT'),
  ('siderail', 'NN'),
  ('support', 'NN'),
  ('mechanism', 'NN'),
  ('10', 'CD')],
 [('a', 'DT'), ('siderail', 'NN'), ('15', 'CD')],
 [('The', 'DT'),
  ('siderail', 'NN'),
  ('support', 'NN'),
  ('mechanism', 'NN'),
  ('10', 'CD')],
 [('a', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'),
  ('pair', 'NN'),
  ('of', 'IN'),
  ('fasteners', 'NNS'),
  ('22', 'CD')],
 [('The', 'DT'), ('mounting', 'VBG'), ('bracket', 'NN'), ('20', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('opening', 'NN'), ('25', 'CD')],
 [('a', 'DT'),
  ('first', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('30', 'CD')],
 [('a', 'DT'), ('first', 'JJ'), ('arm', 'NN'), ('35', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('opening', 'NN'), ('40', 'CD')],
 [('a', 'DT'),
  ('second', 'JJ'),
  ('lower', 'JJR'),
  ('pivot', 'NN'),
  ('shaft', 'NN'),
  ('45', 'CD')],
 [('a', 'DT'), ('second', 'JJ'), ('arm', 'NN'), ('50', 'CD')],
 [('The', 'DT'), ('siderail', 'NN'), ('15',

In [18]:
# Create (text, coarse-grained POS) tuples
simple_pos = [(word.text, word.pos_) for word in doc]

In [19]:
simple_pos

[('A', 'DET'),
 ('siderail', 'NOUN'),
 ('support', 'NOUN'),
 ('mechanism', 'NOUN'),
 ('with', 'ADP'),
 ('multiple', 'ADJ'),
 ('locks', 'NOUN'),
 ('and', 'CCONJ'),
 ('an', 'DET'),
 ('impact', 'NOUN'),
 ('release', 'NOUN'),
 ('feature', 'NOUN'),
 ('is', 'VERB'),
 ('configured', 'VERB'),
 ('to', 'PART'),
 ('positively', 'ADV'),
 ('lock', 'VERB'),
 ('in', 'PART'),
 ('an', 'DET'),
 ('upright', 'ADJ'),
 ('deployed', 'VERB'),
 ('position', 'NOUN'),
 (',', 'PUNCT'),
 ('but', 'CCONJ'),
 ('is', 'VERB'),
 ('adapted', 'VERB'),
 ('to', 'PART'),
 ('release', 'VERB'),
 ('upon', 'ADP'),
 ('imposition', 'NOUN'),
 ('of', 'ADP'),
 ('a', 'DET'),
 ('longitudinal', 'ADJ'),
 ('impact', 'NOUN'),
 ('load', 'NOUN'),
 ('such', 'ADJ'),
 ('as', 'ADP'),
 ('that', 'DET'),
 ('caused', 'VERB'),
 ('by', 'ADP'),
 ('striking', 'VERB'),
 ('a', 'DET'),
 ('stationary', 'ADJ'),
 ('barrier', 'NOUN'),
 ('.', 'PUNCT'),
 ('\n ', 'SPACE'),
 ('This', 'DET'),
 ('application', 'NOUN'),
 ('claims', 'VERB'),
 ('priority', 'NOUN'),
 ('

In [20]:
def entity_finder(pos_list):
    """ Find entities with reference numerals using POS data."""
    entity_list = list()
    entity = []
    record = False
    for i, (word, pos) in enumerate(pos_list):
        if pos == "DET":
            record = True
            entity = []
            
        if record:
            entity.append((word, pos))
            
        if "FIG" in word:
            # reset entity to ignore phrases that refer to Figures
            record = False
            entity = []
        
        if pos == "NUM" and entity and record and ('NOUN' in pos_list[i-1][1]): 
            record = False
            entity_list.append(entity)
    
    # Filter list
    filter_list = list()
    for entity in entity_list:
        if not ({"claims", "priority", "under"} <= set([w for w, _ in entity])):
            filter_list.append(entity)
            
    return filter_list

In [21]:
entity_finder(simple_pos)

[[('a', 'DET'),
  ('siderail', 'NOUN'),
  ('support', 'NOUN'),
  ('mechanism', 'NOUN'),
  ('10', 'NUM')],
 [('a', 'DET'), ('siderail', 'NOUN'), ('15', 'NUM')],
 [('The', 'DET'),
  ('siderail', 'NOUN'),
  ('support', 'NOUN'),
  ('mechanism', 'NOUN'),
  ('10', 'NUM')],
 [('a', 'DET'), ('mounting', 'VERB'), ('bracket', 'NOUN'), ('20', 'NUM')],
 [('a', 'DET'),
  ('pair', 'NOUN'),
  ('of', 'ADP'),
  ('fasteners', 'NOUN'),
  ('22', 'NUM')],
 [('The', 'DET'), ('mounting', 'VERB'), ('bracket', 'NOUN'), ('20', 'NUM')],
 [('a', 'DET'), ('first', 'ADJ'), ('opening', 'NOUN'), ('25', 'NUM')],
 [('a', 'DET'),
  ('first', 'ADJ'),
  ('lower', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN'),
  ('30', 'NUM')],
 [('a', 'DET'), ('first', 'ADJ'), ('arm', 'NOUN'), ('35', 'NUM')],
 [('a', 'DET'), ('second', 'ADJ'), ('opening', 'NOUN'), ('40', 'NUM')],
 [('a', 'DET'),
  ('second', 'ADJ'),
  ('lower', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN'),
  ('45', 'NUM')],
 [('a', 'DET'), ('second', 'ADJ'), ('arm'

In [22]:
from collections import Counter
c = Counter(["".join(["<{0}>".format(p) for w, p in entity]) for entity in entity_finder(simple_pos)])

In [23]:
c.most_common()

[('<DET><NOUN><NUM>', 63),
 ('<DET><NOUN><NOUN><NUM>', 50),
 ('<DET><VERB><NOUN><NUM>', 48),
 ('<DET><ADJ><NOUN><NUM>', 39),
 ('<DET><NOUN><NOUN><NOUN><NUM>', 35),
 ('<DET><ADJ><ADJ><NOUN><NOUN><NUM>', 14),
 ('<DET><NOUN><PUNCT><VERB><NOUN><NUM>', 8),
 ('<DET><ADJ><NOUN><NOUN><NUM>', 6),
 ('<DET><ADJ><CCONJ><ADJ><ADJ><NOUN><NOUN><NUM>', 4),
 ('<DET><NOUN><NOUN><NOUN><NOUN><NUM>', 3),
 ('<DET><NOUN><ADP><NOUN><NOUN><NUM>', 3),
 ('<DET><ADJ><CCONJ><ADJ><NOUN><NUM>', 3),
 ('<DET><NOUN><ADP><NOUN><NUM>', 3),
 ('<DET><NOUN><VERB><NOUN><NUM>', 2),
 ('<DET><NOUN><ADV><CCONJ><ADJ><NOUN><NUM>', 1),
 ('<DET><ADJ><VERB><NUM><PUNCT><NUM><ADP><VERB><NOUN><NUM>', 1),
 ('<DET><NOUN><ADP><ADV><VERB><NOUN><NUM>', 1),
 ('<DET><ADV><VERB><NOUN><NUM>', 1),
 ('<DET><ADV><VERB><VERB><NOUN><NUM>', 1),
 ('<DET><VERB><NOUN><NOUN><NUM>', 1),
 ('<DET><NOUN><PUNCT><NOUN><VERB><NOUN><NUM>', 1),
 ('<DET><NOUN><VERB><ADP><VERB><NOUN><NUM>', 1),
 ('<DET><NOUN><ADP><ADJ><ADJ><ADJ><NOUN><NOUN><NUM>', 1),
 ('<DET><ADJ><

In [24]:
def simple_entity_finder(pos_list):
    """ Find determiner-led noun phrases using POS data.

    Each phrase runs from a determiner to the last noun before the next determiner.
    """
    entity_list = list()
    record = False
    # Add indices
    enum_pos_list = list(enumerate(pos_list))
    for i, (word, pos) in enum_pos_list:
        if pos == "DET" and not record:
            # Start recording and record start index
            record = True
            start_index = i

        elif pos == "DET" and record:
            # Step back until last noun is found
            for j, (_, prev_pos) in reversed(enum_pos_list[:i]):
                if "NOUN" in prev_pos:
                    # Add np_chunk to buffer
                    entity_list.append(pos_list[start_index:j+1])
                    break

            # Set new start index
            start_index = i

    return entity_list

In [25]:
claims = nlp(pdoc.claimset.text)

In [27]:
simple_claims_pos = [(word.text, word.pos_) for word in claims]

simple_entity_finder(simple_claims_pos)

[[('A', 'DET'),
  ('siderail', 'ADJ'),
  ('support', 'NOUN'),
  ('mechanism', 'NOUN')],
 [('a', 'DET'), ('mounting', 'VERB'), ('bracket', 'NOUN')],
 [('a', 'DET'), ('first', 'ADJ'), ('lower', 'ADJ'), ('pivot', 'NOUN')],
 [('a', 'DET'), ('second', 'ADJ'), ('lower', 'ADJ'), ('pivot', 'NOUN')],
 [('the', 'DET'), ('mounting', 'VERB'), ('bracket', 'NOUN')],
 [('a', 'DET'), ('bed', 'NOUN')],
 [('a', 'DET'), ('first', 'ADJ'), ('support', 'NOUN'), ('arm', 'NOUN')],
 [('a', 'DET'),
  ('first', 'ADJ'),
  ('upper', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('a', 'DET'),
  ('first', 'ADJ'),
  ('lower', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('the', 'DET'),
  ('first', 'ADJ'),
  ('upper', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('a', 'DET'), ('siderail', 'NOUN')],
 [('a', 'DET'), ('first', 'ADJ'), ('upper', 'ADJ'), ('pivot', 'NOUN')],
 [('the', 'DET'),
  ('first', 'ADJ'),
  ('lower', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('the', 'DET'), ('first',

CCONJ + VERB sequences need to be filtered out, and phrases should also be split on "comprises" (tagged VERB).
We can also add custom checks for "The X of claim Y", and treat "at least one of" / "one or more" as DET.
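
Two of these checks can be sketched on the plain `(text, pos)` tuples used above. Everything here is an assumption for illustration: the phrase list, the regular expression, the helper names `merge_multiword_determiners` and `parent_claims`, and the sample claim text, which is made up.

```python
import re

# Assumed multi-word determiners, taken from the note above.
MULTIWORD_DETERMINERS = [("at", "least", "one", "of"), ("one", "or", "more")]
CLAIM_REFERENCE = re.compile(r"\bof claim (\d+)\b", re.IGNORECASE)


def merge_multiword_determiners(pos_list, phrases=MULTIWORD_DETERMINERS):
    """Collapse each listed phrase into a single (text, "DET") tuple."""
    merged = []
    i = 0
    while i < len(pos_list):
        for phrase in phrases:
            window = pos_list[i:i + len(phrase)]
            if tuple(word.lower() for word, _ in window) == phrase:
                merged.append((" ".join(word for word, _ in window), "DET"))
                i += len(phrase)
                break
        else:
            # No phrase starts here: keep the tuple unchanged.
            merged.append(pos_list[i])
            i += 1
    return merged


def parent_claims(claim_text):
    """Return the claim numbers referenced as "of claim N"."""
    return [int(n) for n in CLAIM_REFERENCE.findall(claim_text)]


tagged = [("at", "ADV"), ("least", "ADJ"), ("one", "NUM"), ("of", "ADP"),
          ("the", "DET"), ("locks", "NOUN")]
print(merge_multiword_determiners(tagged))
print(parent_claims("The siderail support mechanism of claim 1, wherein ..."))
```

The merged list could then be passed to `simple_entity_finder` in place of `simple_claims_pos`.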

In [28]:
from collections import Counter
c = Counter(["".join(["<{0}>".format(p) for w, p in entity]) for entity in entity_finder(simple_claims_pos)])

### Playing With Noun Chunks

In [29]:
nc = next(doc.noun_chunks)
dir(nc)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_recalculate_indices',
 '_vector',
 '_vector_norm',
 'doc',
 'end',
 'end_char',
 'ent_id',
 'ent_id_',
 'has_vector',
 'label',
 'label_',
 'lefts',
 'lemma_',
 'lower_',
 'merge',
 'noun_chunks',
 'orth_',
 'rights',
 'root',
 'sent',
 'sentiment',
 'similarity',
 'start',
 'start_char',
 'string',
 'subtree',
 'text',
 'text_with_ws',
 'upper_',
 'vector',
 'vector_norm']

In [31]:
np = nc
np.text

'A siderail support mechanism'

In [32]:
np.root

mechanism

In [33]:
print(np.start, np.end)

0 4


In [34]:
print(doc[np.start], doc[np.end-1])
print(doc[np.start].pos_)

A mechanism
DET


In [35]:
# We can look for matching roots

In [36]:
np.lemma_

'a siderail support mechanism'

In [37]:
for st in np.subtree:
    print(st)
    print(st.pos_)
    
# Note subtree includes the whole phrase not just the displayed NP portion

A
DET
siderail
NOUN
support
NOUN
mechanism
NOUN
with
ADP
multiple
ADJ
locks
NOUN
and
CCONJ
an
DET
impact
NOUN
release
NOUN
feature
NOUN


Matching:
* First look for direct string matches;
* Then look for lemma matches;
* Then look for root matches.

Rank matches by number of occurrences; this can be used as a confidence level.

We want to keep the noun_chunk objects but link them (e.g. a dict with each nc as key and a list of associated ncs as value).
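
The three matching levels above could be sketched as follows. The functions `match_level` and `link_chunks` are hypothetical names, and the numeric levels (3, 2, 1) are an arbitrary choice. The code only relies on the `text`, `lemma_` and `root` attributes listed in the `dir(nc)` output, and it keys the links by position in the list rather than by the chunk object itself, which is a simplification of the dict idea above.

```python
from collections import defaultdict


def match_level(a, b):
    """Return 3 for a string match, 2 for a lemma match, 1 for a root match, else 0."""
    if a.text.lower() == b.text.lower():
        return 3
    if a.lemma_.lower() == b.lemma_.lower():
        return 2
    if a.root.lemma_.lower() == b.root.lemma_.lower():
        return 1
    return 0


def link_chunks(chunks, min_level=1):
    """Map the index of each chunk to a list of (other index, match level) pairs."""
    links = defaultdict(list)
    for i, a in enumerate(chunks):
        for j, b in enumerate(chunks):
            if i == j:
                continue
            level = match_level(a, b)
            if level >= min_level:
                links[i].append((j, level))
    return dict(links)
```

With spaCy this would be called as `link_chunks(list(doc.noun_chunks))`. The pairwise comparison is quadratic in the number of chunks, which should be acceptable for a single patent but may need indexing for a corpus.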

In [38]:
for np in doc.noun_chunks:
    print(np)

A siderail support mechanism
multiple locks
an impact release feature
an upright deployed position
imposition
a longitudinal impact load
a stationary barrier
priority
provisional application
the entire disclosure
reference
the Invention
The invention
mechanisms
hospital bed siderails
its aspects
the invention
a locking mechanism
siderail support mechanisms
its aspects
the invention
a siderail support mechanism
an impact release feature
Description
Related Art 
 Four-bar link siderail support mechanisms
various positions
It
the siderail
patient safety
It
patients
the hospital
beds
siderails
the upright deployed position
the bed
the side
a door
the siderail
the doorjamb
This impact
either the bed
the doorjamb
It
a siderail locking mechanism
the siderail
the upright deployed position
a longitudinal impact load
a collision
a stationary barrier
hospital transit
A siderail support mechanism
multiple locks
an impact release feature
an upright deployed position
imposition
a longitudinal impact

In [39]:
from spacy.symbols import DET
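
The definition of `extract_entities`, which is called in the next cell, does not appear in this export. The following is a minimal sketch that is consistent with the output shown below, assuming the function groups noun chunks by their lower-cased text after dropping a leading determiner. It compares the string attribute `pos_` with "DET" rather than using the imported `DET` symbol, which is a choice made here to keep the sketch free of imports.

```python
def extract_entities(doc):
    """Group noun chunks by lower-cased text without a leading determiner (sketch)."""
    entities = {}
    for chunk in doc.noun_chunks:
        start = chunk.start
        # Skip a leading determiner such as "a" or "the"
        if doc[start].pos_ == "DET":
            start += 1
        key = doc[start:chunk.end].text.lower()
        # Keep the original span objects so they can be linked later
        entities.setdefault(key, []).append(chunk)
    return entities
```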


        

In [44]:
entities = extract_entities(doc)
entities

{'24.9 degrees': [24.9 degrees],
 '250 bears': [250 bears, 250 bears],
 '275 shifts': [275 shifts],
 '8 degrees': [8 degrees],
 'about the pivot pin': [about the pivot pin],
 'above-described embodiment': [the above-described embodiment],
 'accompanying drawings': [the accompanying drawings],
 'addition': [addition],
 'alignment': [alignment, alignment],
 'angle': [the angle, the angle, the angle, the angle],
 'angles': [The angles, the angles],
 'angularity relationship': [the angularity relationship],
 'anterior face': [an anterior face, the anterior face],
 'aperture': [an aperture],
 'apertures': [apertures, apertures],
 'appended claims': [the appended claims],
 'approximately 160 degrees': [approximately 160 degrees],
 'approximately 20 degrees': [approximately 20 degrees,
  approximately 20 degrees],
 'arc': [an arc],
 'arcuate indexing slot': [an arcuate indexing slot],
 'arms': [the arms, the arms],
 'art': [the art],
 'attachment': [attachment, attachment],
 'axis': [the axis

In [45]:
from operator import itemgetter

def rank_entities(entities):
    """Rank a dictionary of noun phrase entities based on occurrence."""
    occurrences = [(np_string, len(np_list)) for np_string, np_list in entities.items()]
    return sorted(occurrences, key=itemgetter(1), reverse=True)

In [47]:
occurrences = rank_entities(entities)
occurrences

[('siderail support mechanism', 35),
 ('locking plate', 24),
 ('fig', 21),
 ('figs', 19),
 ('siderail', 14),
 ('invention', 14),
 ('bypass plate', 12),
 ('position', 11),
 ('collar', 10),
 ('lock release lever', 9),
 ('locking cog', 9),
 ('notches', 9),
 ('it', 9),
 ('locking cogs', 8),
 ('l-shaped slot', 8),
 ('indexing slot', 8),
 ('indexing pin', 7),
 ('pair', 7),
 ('oblong aperture', 7),
 ('indexing ball', 6),
 ('mounting bracket', 6),
 ('first lower pivot shaft', 6),
 ('notch', 6),
 ('operator', 6),
 ('second lower pivot shaft', 6),
 ('plate', 6),
 ('partial cut-away view', 5),
 ('upright deployed position', 5),
 ('stationary barrier', 5),
 ('bed', 5),
 ('lowered position', 5),
 ('collars', 5),
 ('spring', 5),
 ('detent', 5),
 ('locked position', 5),
 ('upright', 4),
 ('minor axis', 4),
 ('lower extent', 4),
 ('words', 4),
 ('synchronization link', 4),
 ('left', 4),
 ('side view', 4),
 ('unlocked position', 4),
 ('second end', 4),
 ('catches', 4),
 ('angle', 4),
 ('respective angl

In [48]:
claim1 = nlp(pdoc.claimset.get_claim(1).text)

In [49]:
claim1


1. A siderail support mechanism comprising: a mounting bracket having a first lower pivot and a second lower pivot, the mounting bracket configured for mounting to a bed; 
a first support arm having a first upper pivot shaft and a first lower pivot shaft, the first upper pivot shaft configured to pivotally attach to a siderail at a first upper pivot and the first lower pivot shaft configured to pivotally attach to the first lower pivot of the mounting bracket; 
a second support arm having a second upper pivot shaft and a second lower pivot shaft, the second upper pivot shaft configured to pivotally attach to the siderail at a second upper pivot and the second lower pivot shaft configured to pivotally attach to the second lower pivot of the mounting bracket; 
a plurality of circumferentially spaced notches formed about the first lower pivot shaft; 
a plurality of circumferentially spaced notches formed about the second lower pivot shaft; 
a locking plate having a first oblong aperture 

In [50]:
claim1_ents = extract_entities(claim1)
claim1_occs = rank_entities(claim1_ents)

In [51]:
claim1_ents

{'bed': [a bed],
 'entry': [entry, entry],
 'first and second locking cogs': [the first and second locking cogs],
 'first and second lower pivot shafts': [the first and second lower pivot shafts],
 'first lower pivot': [a first lower pivot, the first lower pivot],
 'first lower pivot shaft': [a first lower pivot shaft,
  the first lower pivot shaft,
  the first lower pivot shaft,
  the first lower pivot shaft,
  the first lower pivot shaft],
 'first oblong aperture': [a first oblong aperture, the first oblong aperture],
 'first upper pivot': [a first upper pivot],
 'first upper pivot shaft': [a first upper pivot shaft,
  the first upper pivot shaft],
 'locking plate': [the locking plate],
 'mounting bracket': [a mounting bracket,
  the mounting bracket,
  the mounting bracket],
 'notches': [notches, notches, notches],
 'plurality': [a plurality, a plurality, the plurality, the plurality],
 'respective notches': [the respective notches],
 'second lower pivot': [a second lower pivot, the

In [57]:
print("There are {0} unique entities extracted from claim 1".format(len(claim1_ents)))

There are 21 entities extracted from claim 1


It looks like "first support arm" is not included in the noun-chunk list above, but it is extracted when we look for DET ... NOUN patterns.

In [76]:
from spacy.symbols import DET, NOUN

def simple_spacy_entity_finder(doc):
    """Find candidate entities (determiner-led noun phrases) using POS data."""
    entity_list = list()
    start_index = None
    # Generate a list of (index, token) pairs so we can iterate backwards through it
    enum_doc_list = list(enumerate(doc))
    for i, word in enum_doc_list:
        if word.pos != DET:
            continue
        if start_index is not None:
            # Step back from this determiner to the last noun of the current phrase
            for j, prev_word in reversed(enum_doc_list[start_index:i]):
                if prev_word.pos == NOUN:
                    # Add the span from the previous determiner to that noun
                    entity_list.append(doc[start_index:j+1])
                    break
        # Each determiner starts a new candidate phrase
        start_index = i
    # Note: the phrase after the final determiner is never closed, so it is not captured

    entity_dict = dict()
    # Now group by lower-cased text without the leading determiner
    for entity in entity_list:
        np_start = entity.start
        if doc[np_start].pos == DET:
            np_start += 1
        np_string = doc[np_start:entity.end].text.lower()
        if np_string not in entity_dict:
            entity_dict[np_string] = list()
        entity_dict[np_string].append(entity)

    return entity_list, entity_dict

In [78]:
entity_list, entity_dict = simple_spacy_entity_finder(claim1)
print(entity_list[0:5], "\n")
for e in entity_dict:
    print(e)
    
print("\nThere are {0} entities extracted from claim 1 using the simple extractor".format(len(entity_dict)))


[A siderail support mechanism, a mounting bracket, a first lower pivot, a second lower pivot, the mounting bracket] 

first lower pivot
siderail
second oblong aperture
siderail support mechanism
second upper pivot
second lower pivot
first support arm
plurality of circumferentially spaced notches
first locking cog
first lower pivot shaft
bed
plurality of notches
second lower pivot shaft
second locking cog
first oblong aperture
first and second locking cogs
mounting bracket
spring
second upper pivot shaft
second oblong aperture and configured for entry
first upper pivot
locking plate
second support arm
first oblong aperture and configured for entry
first upper pivot shaft
respective notches

There are 26 entities extracted from claim 1 using the simple extractor


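The "and configured for entry" suffix is an artefact: "entry" has no determiner, so the span runs from "the first oblong aperture" through to "entry", the last noun before the next determiner. Can we perform a further "cleaning" parse?

Can we use order somehow? Could the cleaning parse be based on antecedence? E.g. if a later entry contains the "the" form of an earlier entry that was introduced with "a", merge them.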
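
One way the antecedence idea could work is sketched below on plain strings. The function name `clean_by_antecedent` and the choice of "said" as an extra definite marker are assumptions. The sketch trims a definite phrase to the longest previously introduced name that is a whole-word prefix of it, so it depends on entities being introduced before they are referred to.

```python
INDEFINITE = ("a ", "an ")
DEFINITE = ("the ", "said ")


def clean_by_antecedent(phrases):
    """Trim definite phrases to the longest previously introduced entity name."""
    introduced = []  # entity names seen so far, in order of appearance
    cleaned = []
    for phrase in phrases:
        lowered = phrase.lower()
        article, name = "", lowered
        for prefix in INDEFINITE + DEFINITE:
            if lowered.startswith(prefix):
                article, name = prefix.strip(), lowered[len(prefix):]
                break
        if article in ("the", "said") and name not in introduced:
            # Look for an antecedent that is a whole-word prefix of this phrase
            candidates = [n for n in introduced if name.startswith(n + " ")]
            if candidates:
                name = max(candidates, key=len)
        if name not in introduced:
            introduced.append(name)
        cleaned.append(name)
    return cleaned
```

For example, after "a first oblong aperture" has been seen, "the first oblong aperture and configured for entry" is reduced to "first oblong aperture". A definite phrase with no antecedent, such as "the first and second locking cogs", is kept unchanged.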

In [80]:
entity_dict

{'bed': [a bed],
 'first and second locking cogs': [the first and second locking cogs],
 'first locking cog': [a first locking cog],
 'first lower pivot': [a first lower pivot, the first lower pivot],
 'first lower pivot shaft': [a first lower pivot shaft,
  the first lower pivot shaft,
  the first lower pivot shaft,
  the first lower pivot shaft,
  the first lower pivot shaft],
 'first oblong aperture': [a first oblong aperture],
 'first oblong aperture and configured for entry': [the first oblong aperture and configured for entry],
 'first support arm': [a first support arm],
 'first upper pivot': [a first upper pivot],
 'first upper pivot shaft': [a first upper pivot shaft,
  the first upper pivot shaft],
 'locking plate': [a locking plate, the locking plate],
 'mounting bracket': [a mounting bracket,
  the mounting bracket,
  the mounting bracket,
  the mounting bracket],
 'plurality of circumferentially spaced notches': [a plurality of circumferentially spaced notches,
  a plurali

In [None]:
simple_claim1_pos = [(word.text, word.pos_) for word in claim1]
simple_c1_ents = simple_entity_finder(simple_claim1_pos)

In [61]:
# We now need to collate and create a set of entities
def get_entity_set(entity_list):
    """ Get a set of unique entity n-grams from a list of entities."""
    ngram_list = list()
    for entity in entity_list:
        ngram_list.append(" ".join([word for word, pos in entity if (pos != 'DET')]))
    return set(ngram_list)

In [65]:
simple_c1_ents

[[('A', 'DET'),
  ('siderail', 'ADJ'),
  ('support', 'NOUN'),
  ('mechanism', 'NOUN')],
 [('a', 'DET'), ('mounting', 'VERB'), ('bracket', 'NOUN')],
 [('a', 'DET'), ('first', 'ADJ'), ('lower', 'ADJ'), ('pivot', 'NOUN')],
 [('a', 'DET'), ('second', 'ADJ'), ('lower', 'ADJ'), ('pivot', 'NOUN')],
 [('the', 'DET'), ('mounting', 'VERB'), ('bracket', 'NOUN')],
 [('a', 'DET'), ('bed', 'NOUN')],
 [('a', 'DET'), ('first', 'ADJ'), ('support', 'NOUN'), ('arm', 'NOUN')],
 [('a', 'DET'),
  ('first', 'ADJ'),
  ('upper', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('a', 'DET'),
  ('first', 'ADJ'),
  ('lower', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('the', 'DET'),
  ('first', 'ADJ'),
  ('upper', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('a', 'DET'), ('siderail', 'NOUN')],
 [('a', 'DET'), ('first', 'ADJ'), ('upper', 'ADJ'), ('pivot', 'NOUN')],
 [('the', 'DET'),
  ('first', 'ADJ'),
  ('lower', 'ADJ'),
  ('pivot', 'NOUN'),
  ('shaft', 'NOUN')],
 [('the', 'DET'), ('first',

In [66]:
simple_c1_ents = get_entity_set(simple_c1_ents)
print(simple_c1_ents)

print("There are {0} entities extracted from claim 1 using the simple extractor".format(len(simple_c1_ents)))

{'the plurality of notches', 'a plurality of circumferentially spaced notches', 'the second lower pivot', 'a mounting bracket', 'the respective notches', 'a second oblong aperture', 'the first lower pivot shaft', 'a second support arm', 'A siderail support mechanism', 'a second lower pivot shaft', 'a first locking cog', 'a spring', 'a first lower pivot', 'the first oblong aperture and configured for entry', 'the first lower pivot', 'the locking plate', 'the first upper pivot shaft', 'a second lower pivot', 'the second upper pivot shaft', 'the second oblong aperture and configured for entry', 'a first support arm', 'a siderail', 'a first upper pivot shaft', 'the mounting bracket', 'the first and second locking cogs', 'a bed', 'a first lower pivot shaft', 'a first oblong aperture', 'a second locking cog', 'a second upper pivot', 'a locking plate', 'the second lower pivot shaft', 'a second upper pivot shaft', 'the siderail', 'a first upper pivot'}
There are 35 entities extracted from clai

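So the simple entity extractor appears to find an extra 14 entities (35 versus the 21 found via noun chunks). Note, however, that the set printed above counts the "a ..." and "the ..." forms of the same entity separately, which inflates the total.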
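
To see which entities are genuinely found by only one of the two approaches, both sets can be normalised before comparing them. This is a rough sketch on plain strings; `strip_determiner` and `compare_entity_sets` are hypothetical helpers, and the determiner list is an assumption.

```python
def strip_determiner(phrase, determiners=("a", "an", "the")):
    """Lower-case a phrase and drop a single leading determiner, if present."""
    words = phrase.lower().split()
    if words and words[0] in determiners:
        words = words[1:]
    return " ".join(words)


def compare_entity_sets(simple_ents, chunk_keys):
    """Return (only in simple extractor, only in noun chunks) after normalising."""
    simple_norm = {strip_determiner(p) for p in simple_ents}
    chunk_norm = {strip_determiner(p) for p in chunk_keys}
    return simple_norm - chunk_norm, chunk_norm - simple_norm
```

In this notebook it could be called as `compare_entity_sets(simple_c1_ents, claim1_ents.keys())`.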

In [52]:
claim1_occs

[('second lower pivot shaft', 5),
 ('first lower pivot shaft', 5),
 ('plurality', 4),
 ('mounting bracket', 3),
 ('notches', 3),
 ('first lower pivot', 2),
 ('siderail', 2),
 ('first oblong aperture', 2),
 ('second oblong aperture', 2),
 ('second lower pivot', 2),
 ('second upper pivot shaft', 2),
 ('entry', 2),
 ('first upper pivot shaft', 2),
 ('first and second locking cogs', 1),
 ('second upper pivot', 1),
 ('siderail support mechanism', 1),
 ('bed', 1),
 ('first upper pivot', 1),
 ('locking plate', 1),
 ('first and second lower pivot shafts', 1),
 ('respective notches', 1)]

Now we need to use this to parse the claim and build a parent / child tree.

Can we start with (subject, relationship, object) triples, where the subject and object are NPs? We can use the entities as a reference dictionary for the NPs.
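
A first attempt at the subject / relationship / object idea might look like the sketch below. It is an assumption-laden illustration, not a tested parser: `governing_verb`, `extract_triples` and `build_tree` are hypothetical names, and the rule that an "acl" verb (such as "having") takes the noun it modifies as its subject is a heuristic based on the dependency labels printed later in this notebook. It uses only the token attributes `i`, `head`, `children`, `pos_`, `dep_` and `lemma_`.

```python
from collections import defaultdict


def governing_verb(token, max_steps=10):
    """Climb the dependency tree from token to its nearest VERB ancestor."""
    current = token
    for _ in range(max_steps):
        head = current.head
        if head is current:  # the root of a spaCy parse is its own head
            return None
        if head.pos_ == "VERB":
            return head
        current = head
    return None


def extract_triples(doc):
    """Return (subject text, verb lemma, object text) triples for noun chunks."""
    chunks = list(doc.noun_chunks)
    root_to_chunk = {chunk.root.i: chunk for chunk in chunks}
    triples = []
    for chunk in chunks:
        verb = governing_verb(chunk.root)
        if verb is None:
            continue
        subject = None
        if verb.dep_ == "acl":
            # e.g. "bracket having ...": the verb modifies its subject noun
            subject = root_to_chunk.get(verb.head.i)
        else:
            for child in verb.children:
                if child.dep_ in ("nsubj", "nsubjpass"):
                    subject = root_to_chunk.get(child.i)
                    break
        if subject is not None and subject.root.i != chunk.root.i:
            triples.append((subject.text, verb.lemma_, chunk.text))
    return triples


def build_tree(triples):
    """Collapse triples into a parent -> list of children mapping."""
    tree = defaultdict(list)
    for parent, _, child in triples:
        tree[parent].append(child)
    return dict(tree)
```

The keys here are raw chunk texts; in practice they would probably be replaced by the normalised entity strings used as dictionary keys above, so that "a mounting bracket" and "the mounting bracket" share a node.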

In [53]:
for np in claim1.noun_chunks:
    np_start = np.start
    # Ignore a or the
    if claim1[np_start].pos == DET:
        np_start += 1
    np_string = claim1[np_start:np.end].text.lower()
    
    # Here we need to get the verb and the object
    print(np_string, " - ", np.root, " - ", np.root.head, list(np.root.lefts), list(np.root.rights))

siderail support mechanism  -  mechanism  -  mechanism [A, siderail, support] [comprising]
mounting bracket  -  bracket  -  : [a, mounting] [having]
first lower pivot  -  pivot  -  having [a, first, lower] [and, pivot]
second lower pivot  -  pivot  -  pivot [a, second, lower] [,, bracket, ;, plurality]
bed  -  bed  -  to [a] []
first upper pivot shaft  -  shaft  -  having [a, first, upper, pivot] [and, shaft]
first lower pivot shaft  -  shaft  -  shaft [a, first, lower, pivot] [,]
first upper pivot shaft  -  shaft  -  configured [the, first, upper, pivot] []
siderail  -  siderail  -  to [a] [at]
first upper pivot  -  pivot  -  at [a, first, upper] [and]
first lower pivot shaft  -  shaft  -  configured [the, first, lower, pivot] []
first lower pivot  -  pivot  -  to [the, first, lower] [of]
mounting bracket  -  bracket  -  of [the, mounting] []
second upper pivot shaft  -  shaft  -  having [a, second, upper, pivot] [and, shaft]
second lower pivot shaft  -  shaft  -  shaft [a, second, lo

In [55]:
for word in claim1:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_, word.dep, word.dep_)


 518 
 485 SP 101 SPACE 0 
1 928 1 470 LS 95 PUNCT 512817 ROOT
. 453 . 453 . 95 PUNCT 441 punct
A 506 a 460 DT 88 DET 411 det
siderail 776980 siderail 467 JJ 82 ADJ 398 amod
support 1018 support 474 NN 90 NOUN 74185 compound
mechanism 5199 mechanism 474 NN 90 NOUN 512817 ROOT
comprising 64075 comprise 490 VBG 98 VERB 758131 acl
: 454 : 454 : 95 PUNCT 441 punct
a 506 a 460 DT 88 DET 411 det
mounting 242066 mount 490 VBG 98 VERB 398 amod
bracket 8706 bracket 474 NN 90 NOUN 412 dobj
having 539 have 490 VBG 98 VERB 758131 acl
a 506 a 460 DT 88 DET 411 det
first 774 first 467 JJ 82 ADJ 398 amod
lower 1481 low 468 JJR 82 ADJ 398 amod
pivot 365003 pivot 474 NN 90 NOUN 412 dobj
and 512 and 458 CC 87 CCONJ 403 cc
a 506 a 460 DT 88 DET 411 det
second 1234 second 467 JJ 82 ADJ 398 amod
lower 1481 low 468 JJR 82 ADJ 398 amod
pivot 365003 pivot 474 NN 90 NOUN 406 conj
, 450 , 450 , 95 PUNCT 441 punct
the 501 the 460 DT 88 DET 411 det
mounting 242066 mount 490 VBG 98 VERB 398 amod
bracket 8706 brac

the 501 the 460 DT 88 DET 411 det
first 774 first 467 JJ 82 ADJ 398 amod
lower 1481 low 468 JJR 82 ADJ 398 amod
pivot 365003 pivot 474 NN 90 NOUN 74185 compound
shaft 21354 shaft 474 NN 90 NOUN 435 pobj
; 620 ; 454 : 95 PUNCT 441 punct

 518 
 485 SP 101 SPACE 0 
a 506 a 460 DT 88 DET 411 det
second 1234 second 467 JJ 82 ADJ 398 amod
locking 4285 lock 490 VBG 98 VERB 398 amod
cog 204126 cog 474 NN 90 NOUN 406 conj
extending 6442 extend 490 VBG 98 VERB 758131 acl
into 696 into 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
second 1234 second 467 JJ 82 ADJ 398 amod
oblong 617396 oblong 474 NN 90 NOUN 74185 compound
aperture 314739 aperture 474 NN 90 NOUN 435 pobj
and 512 and 458 CC 87 CCONJ 403 cc
configured 289176 configure 491 VBN 98 VERB 758131 acl
for 531 for 466 IN 83 ADP 439 prep
entry 4692 entry 474 NN 90 NOUN 435 pobj
into 696 into 466 IN 83 ADP 439 prep
one 602 one 459 CD 91 NUM 435 pobj
of 510 of 466 IN 83 ADP 439 prep
the 501 the 460 DT 88 DET 411 det
plurality 26142