# Looking at Similarity in Patent Specifications - Gensim

Let's have a play with gensim's similarity measures. Are they useful out of the box for patent specifications?

In [1]:
import numpy as np

In [2]:
from patentdata.models.patentcorpus import PatentCorpus

In [3]:
pc = PatentCorpus.load("5_docs")

In [4]:
# Imports and logging setup
from gensim.corpora import Dictionary, HashDictionary, MmCorpus
from gensim.models import TfidfModel, lsimodel, ldamodel
from gensim.similarities import MatrixSimilarity

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Using TensorFlow backend.


First let's index every description paragraph in the set of 5 documents, so we can look for the paragraphs most similar to a given one.

In [6]:
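# One-pass corpus and dictionary creation: each description paragraph is one
# bag-of-words "document", and allow_update=True grows the dictionary as we go.
dictionary = Dictionary()
# serialize() writes the corpus to disk and returns nothing, so there is no
# return value worth keeping; the corpus is reloaded from the file below.
MmCorpus.serialize(
    '5_docs.mm',
    (
        dictionary.doc2bow(
            [token.text for token in para.doc], allow_update=True
        ) for doc in pc.documents for para in doc.description.paragraphs
    )
)
dictionary.save('5_docs.dict')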

In [7]:
mm = MmCorpus('5_docs.mm')
print(mm)
print(dictionary)

MmCorpus(264 documents, 2682 features, 15910 non-zero entries)
Dictionary(2682 unique tokens: ['possesses', 'interpreter', 'disable', 'own', 'verbal']...)


In [20]:
lsi = lsimodel.LsiModel(mm, id2word=dictionary, num_topics=2)

In [25]:
index = MatrixSimilarity(lsi[mm])

Let's start with a paragraph that spaCy's similarity nearly got right. It is a "description of the Figures" paragraph, so it should match the other paragraphs of that kind.

```
FIG. 1 is a schematic block diagram illustrating a method according to a first exemplary embodiment of the present invention
```
This is the paragraph at index 19 of doc[2] (numbered 20 in the specification).

In [28]:
doc = pc.documents[2].description.paragraphs[19]
print(doc)

20 FIG. 1 is a schematic block diagram illustrating a method according to a first exemplary embodiment of the present invention;


In [29]:
vec_bow = dictionary.doc2bow([token.text for token in doc.doc])
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, 2.2659964440858174), (1, 0.85928778825958196)]


In [30]:
sims = index[vec_lsi] 
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

[(17, 1.0), (111, 1.0), (7, 0.99999893), (217, 0.99999774), (42, 0.99999666), (86, 0.99999487), (36, 0.99999279), (112, 0.99997848), (113, 0.9999783), (48, 0.99997479), (46, 0.99996424), (52, 0.99996352), (95, 0.99995899), (140, 0.99995649), (27, 0.99995327), (12, 0.9999516), (1, 0.9999454), (45, 0.99991935), (241, 0.99990082), (51, 0.99989629), (120, 0.99987108), (23, 0.99985677), (31, 0.99985617), (227, 0.99981254), (135, 0.99971527), (130, 0.99968046), (24, 0.99965125), (67, 0.99963284), (167, 0.99961615), (238, 0.99959916), (118, 0.99954915), (206, 0.99953151), (6, 0.99951714), (35, 0.99950194), (92, 0.99946791), (224, 0.99946016), (116, 0.99943316), (124, 0.99932837), (43, 0.99917269), (239, 0.99915379), (25, 0.99911904), (160, 0.99908704), (245, 0.99906486), (18, 0.99898368), (34, 0.99888533), (250, 0.99881315), (13, 0.99881041), (41, 0.99876571), (97, 0.9987222), (243, 0.99858433), (44, 0.99848914), (242, 0.99847591), (114, 0.99843049), (14, 0.99840301), (104, 0.99830693), (19, 
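
Printing every score is noisy when we only care about the best few. A minimal sketch of a top-N helper, assuming the scores are any sequence of floats indexed by paragraph position (like the array the index returns before sorting). `top_n` is a hypothetical name, not part of gensim, and the scores here are toy values.

```python
import heapq

def top_n(scores, n=5):
    """Return the n highest (position, score) pairs, best first."""
    return heapq.nlargest(n, enumerate(scores), key=lambda item: item[1])

# Toy scores standing in for the array returned by the similarity index.
toy_scores = [0.12, 0.98, 0.45, 0.99, 0.30]
print(top_n(toy_scores, n=2))
```

gensim's `MatrixSimilarity` also accepts a `num_best` argument, which should give a similar result without a separate helper.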

We need a flat list of our paragraphs so we can look them up by the indices above (e.g. 17, 111, 7).

In [32]:
docs = [para for doc in pc.documents for para in doc.description.paragraphs]
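
Going the other way, from a flat index back to a document and paragraph, needs the number of paragraphs in each document. A minimal sketch, assuming we have those counts as a plain list; `locate` is a hypothetical helper and the counts below are made up, not the real corpus.

```python
from bisect import bisect_right
from itertools import accumulate

def locate(flat_index, paragraph_counts):
    """Map a position in the flattened list to (document index, paragraph index)."""
    # ends[i] is the flat position one past the last paragraph of document i.
    ends = list(accumulate(paragraph_counts))
    if not 0 <= flat_index < ends[-1]:
        raise IndexError(flat_index)
    doc_idx = bisect_right(ends, flat_index)
    start = ends[doc_idx - 1] if doc_idx else 0
    return doc_idx, flat_index - start

# Hypothetical paragraph counts for three documents (not the real corpus).
counts = [3, 2, 4]
print(locate(0, counts))  # first paragraph of document 0
print(locate(3, counts))  # first paragraph of document 1
print(locate(8, counts))  # last paragraph of document 2
```

In the notebook the counts could probably be built with something like `[len(d.description.paragraphs) for d in pc.documents]`, assuming the paragraph container supports `len()`.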

In [33]:
docs[17]

18 FIG. 1 is an environmental, perspective view of a delivery box for a system for delivery of goods ordered via the Internet according to the present invention.

In [34]:
docs[111]

20 FIG. 1 is a schematic block diagram illustrating a method according to a first exemplary embodiment of the present invention;

In [35]:
docs[7] # Bit of a strange match

8 Various receptacles have been employed to receive delivered goods. An example is a locked mailbox with a delivery slot that allows letters or very small packages to be inserted into the mailbox, and only removed by a recipient with a key. This provides a degree of security for the letters and small packages, but does not prevent receipt of unwanted items. Additionally, provision for maintaining an environmental condition is lacking. Larger lock boxes have been devised to overcome package size limitations. However, no known lock box addresses all phases of delivery of goods to provide security and proper handling of goods with special needs.

In [36]:
docs[217]

80 The processor module 805 includes the memory 810. The memory 810 may include random access memory (RAM) and read-only memory (ROM). The memory 810 may store computer-readable, computer-executable software code containing instructions that are configured to, when executed, cause the processor module 805 to perform various functions described herein (e.g., transaction processing). Alternatively, the software may not be directly executable by the processor module 805 but may be configured to cause a computer (e.g., when compiled and executed) to perform functions described herein. The processor module 805 may include an intelligent hardware device, e.g., a central processing unit (CPU), a microcontroller, an application specific integrated circuit (ASIC), etc.

In [37]:
docs[42]

43 Turning now to FIGS. 7-10, a method for delivery of goods ordered via the Internet is described, the method employing a delivery box 100, a transport box 200, and the system briefly described in FIG. 6. The method may be embodied in a computer program executing on an Internet merchant server 14, and in programming of the control circuit 150 in the delivery box and programming of the control circuit 250 of the transport box 200. In such an embodiment, the computer program generally functions as a Web service to provide customer interface functions to a client program such as an Internet browser functioning on a customer computer 12.

So, while this starts off well, after the first two matches it goes a bit random. The visible scores are all above 0.998, so the ranking barely separates the paragraphs.  
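
One likely reason for the bunched-up scores: cosine similarity only looks at direction, and in a two-topic space there is not much room for directions to differ when every paragraph has a large first component. A small sketch of the effect in plain Python. The query vector is rounded from the LSI output above, but the other two vectors are hypothetical and not taken from the corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = (2.266, 0.859)           # rounded from the LSI vector printed above
# Hypothetical paragraph vectors, for illustration only.
same_direction = (4.532, 1.718)  # exactly twice the query, so same direction
slightly_off = (1.0, 0.4)        # different length, slightly different angle

print(cosine(query, same_direction))
print(cosine(query, slightly_off))
```

Both values come out above 0.999, even though the vectors differ a lot in length. With more topics there are more directions available, so unrelated paragraphs have a better chance of scoring lower.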

Does a higher dimensionality LSI model help?

In [38]:
lsi_100 = lsimodel.LsiModel(mm, id2word=dictionary, num_topics=100)

In [41]:
index_100 = MatrixSimilarity(lsi_100[mm])
vec_lsi_100 = lsi_100[vec_bow] # convert the query to LSI space
sims = index_100[vec_lsi_100] 
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

[(111, 0.99999994), (112, 0.97672701), (113, 0.97050744), (28, 0.86308801), (22, 0.81166101), (70, 0.80718344), (24, 0.805143), (159, 0.80286431), (29, 0.7994045), (25, 0.79757327), (72, 0.79255503), (158, 0.7900005), (23, 0.77996624), (155, 0.77956051), (26, 0.77327967), (154, 0.76801664), (152, 0.75456369), (153, 0.75029546), (150, 0.74708146), (151, 0.74516046), (71, 0.74228609), (157, 0.72904962), (160, 0.72360843), (21, 0.71218669), (17, 0.69650733), (62, 0.67594022), (156, 0.66161311), (20, 0.65550619), (19, 0.64402354), (18, 0.64209926), (233, 0.61583042), (241, 0.61261666), (249, 0.61138594), (259, 0.59438074), (251, 0.58818299), (27, 0.5793159), (247, 0.57411301), (238, 0.56823933), (244, 0.56495422), (42, 0.56403911), (63, 0.56164825), (168, 0.55865782), (130, 0.55425942), (93, 0.54735643), (102, 0.5449599), (94, 0.54489106), (246, 0.54454899), (100, 0.5400852), (127, 0.53657401), (76, 0.53465897), (92, 0.53397441), (146, 0.53086948), (179, 0.5297581), (66, 0.52927184), (161,

In [42]:
docs[111]

20 FIG. 1 is a schematic block diagram illustrating a method according to a first exemplary embodiment of the present invention;

In [43]:
docs[112]

21 FIG. 2 is a schematic block diagram illustrating a method according to a second exemplary embodiment of the present invention; and

In [45]:
for i in range(0,20):
    print(docs[sims[i][0]])

20 FIG. 1 is a schematic block diagram illustrating a method according to a first exemplary embodiment of the present invention;
21 FIG. 2 is a schematic block diagram illustrating a method according to a second exemplary embodiment of the present invention; and
22 FIG. 3 is a schematic diagram illustrating a method according to a third exemplary embodiment of the present invention.
29 FIG. 12A is a flowchart of a point of reception weight verification process according to the present invention.
23 FIG. 6 is a block diagram of a system for delivery of goods ordered via the Internet according to the present invention.
15 FIG. 1 is a block diagram showing logical processes for registering a mobile device with a financial institution to prepare for use in a transaction in accordance of an embodiment of the present invention;
25 FIG. 8 is a flowchart of a process for order fulfillment in a method for delivery of goods ordered via the Internet according to the present invention.
22 FIG. 10 

That is better: the top matches are now all figure-description paragraphs.  

Can we try with some boilerplate paragraphs?  

Try paragraph 46 of doc[2] (index 45).

In [48]:
doc2 = pc.documents[2].description.paragraphs[45]
print(doc2)

46 It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be capable of designing many alternative embodiments without departing from the scope of the invention as defined by the appended claims. In the claims, any reference signs placed in parentheses shall not be construed as limiting the claims. The word “comprising” and “comprises”, and the like, does not exclude the presence of elements or steps other than those listed in any claim or the specification as a whole. The singular reference of an element does not exclude the plural reference of such elements and vice-versa. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different de

In [49]:
vec_bow2 = dictionary.doc2bow([token.text for token in doc2.doc])
vec_lsi_100 = lsi_100[vec_bow2] # convert the query to LSI space
sims = index_100[vec_lsi_100] 
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

[(137, 1.0000001), (163, 0.80191636), (232, 0.79484165), (248, 0.7668854), (263, 0.73637301), (85, 0.73375845), (110, 0.7242803), (228, 0.71367604), (55, 0.70880872), (252, 0.70579791), (257, 0.70114821), (5, 0.6989882), (109, 0.69875193), (229, 0.6950621), (83, 0.69296098), (66, 0.68930811), (57, 0.68135661), (16, 0.67348719), (128, 0.67225963), (119, 0.66946), (122, 0.66803932), (4, 0.6668328), (136, 0.66666371), (255, 0.66389334), (149, 0.66103142), (258, 0.65864956), (73, 0.65838605), (256, 0.65727842), (2, 0.65477103), (135, 0.65262645), (239, 0.65218937), (227, 0.64703089), (172, 0.64673412), (134, 0.64229172), (118, 0.64210057), (220, 0.64182758), (117, 0.6409356), (81, 0.64076114), (240, 0.64074004), (131, 0.63716149), (126, 0.63700676), (77, 0.63376671), (259, 0.63047636), (76, 0.62757486), (234, 0.62543815), (116, 0.62494779), (39, 0.62297761), (115, 0.62183356), (42, 0.61995846), (237, 0.6175679), (100, 0.61657596), (15, 0.61581826), (253, 0.61508369), (141, 0.6150437), (190

In [50]:
for i in range(0,10):
    print(docs[sims[i][0]])

46 It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be capable of designing many alternative embodiments without departing from the scope of the invention as defined by the appended claims. In the claims, any reference signs placed in parentheses shall not be construed as limiting the claims. The word “comprising” and “comprises”, and the like, does not exclude the presence of elements or steps other than those listed in any claim or the specification as a whole. The singular reference of an element does not exclude the plural reference of such elements and vice-versa. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different de

In [51]:
docs

[1 The system for delivery of goods ordered via the Internet utilizes a delivery box having a front door with an electronic lock and electronic key reader for receipt of goods ordered via the Internet. Goods are shipped in a transport box having an interior space and a device for controlling the temperature within the interior space. The delivery box has at least one interior transport box receptacle where a transport box containing goods is placed on delivery. A power and data interface is established between a delivery box control circuit and a transport box control circuit. The transport box is placed into one of the receptacles, so that the delivery box control circuit can power the transport box control circuit and environment-controlling device. A security code is downloaded from an Internet merchant site to the delivery box and stored onto a keycard when an order is placed.,
 2 This application is a continuation-in-part of U.S. patent application Ser. No. 12/379,771, filed Feb. 

In [55]:
pc.documents[1].description.paragraphs[35]

36 Programmable logic may be fixed either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), or other memory device. The programmable logic may be fixed in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and internetworking technologies. The programmable logic may be distributed as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or 

This looks promising. We ought to try it with, say, 1000 G06 documents.  

For each paragraph in a document, we could select the highest-matching paragraphs from the other documents.  
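
A rough sketch of that cross-document matching, assuming each paragraph already has a dense vector (for example from the LSI model) and a document id. `best_cross_matches` is a hypothetical helper, the vectors are toy data, and the brute-force loop is only meant to show the idea. On a real corpus the gensim index would do the scoring.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_cross_matches(vectors, doc_ids, top_k=1):
    """For each paragraph, list the top_k most similar paragraphs from other documents."""
    matches = {}
    for i, (vec, doc) in enumerate(zip(vectors, doc_ids)):
        # Only score paragraphs that belong to a different document.
        candidates = [
            (j, cosine(vec, other))
            for j, (other, other_doc) in enumerate(zip(vectors, doc_ids))
            if other_doc != doc
        ]
        candidates.sort(key=lambda item: -item[1])
        matches[i] = candidates[:top_k]
    return matches

# Toy data: four paragraphs, two per document.
vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
doc_ids = [0, 0, 1, 1]
result = best_cross_matches(vectors, doc_ids)
for para, hits in result.items():
    print(para, hits)
```

Here paragraph 0 pairs with paragraph 2 and paragraph 1 with paragraph 3, since same-document paragraphs are excluded from the candidates.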