# Gensim - Testing on Patent Data

These experiments apply gensim's topic-modelling functions, such as Latent Semantic Analysis and Latent Dirichlet Allocation, to patent data in the form of patent publications.  

These experiments may be used as a baseline for subsequent experiments with deep learning methods.  

The starting point is the general approach used here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/make_wikicorpus.py  

One option is to fork this module and adapt it for patentdata: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/wikicorpus.py

In [1]:
# Imports and logging setup
from gensim.corpora import Dictionary, HashDictionary, MmCorpus
from gensim.models import TfidfModel, lsimodel, ldamodel

import logging
import os
import pickle
import random

from patentdata.corpus import USPublications
# Probably need to move the patentcorpus.py file into the main patentdata directory
from patentdata.models.patentcorpus import LazyPatentCorpus

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The next three cells retrieve a set of patent documents with the classification G06. We'll start with a small set and check that it works before expanding to more documents.

In [2]:
# Load our list of G06 records
PIK = "G06records.data"

if os.path.isfile(PIK):
    with open(PIK, "rb") as f:
        print("Loading data")
        records = pickle.load(f)
        print("{0} records loaded".format(len(records)))
else:
    # NOTE: ds is only created in cell In [4]; run that cell first if the pickle file is missing
    records = ds.get_records(["G", "06"])
    with open(PIK, "wb") as f:
        pickle.dump(records, f)

Loading data
554570 records loaded
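
The load-or-build pattern in the cell above can be wrapped in a small helper so that the same caching logic is reusable for other record sets. This is only a sketch: `load_or_build` and `fake_builder` are hypothetical names introduced here, and the builder stands in for a call such as `ds.get_records(["G", "06"])`.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the object pickled at cache_path, or build it, cache it and return it."""
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in builder (hypothetical); it records how often it is called.
calls = []


def fake_builder():
    calls.append(1)
    return [(1, "a.zip", "a/US1.ZIP"), (2, "b.zip", "b/US2.ZIP")]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "records.data")
    first = load_or_build(cache, fake_builder)   # builds the list and writes the cache
    second = load_or_build(cache, fake_builder)  # reads the list back from the cache
```

Passing the builder as a function means the expensive query only runs when the cache file is missing.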


In [3]:
# Get data from 10000 random descriptions across the data
records_random_sample = random.sample(records, 10000)
print("Random sample of {0} records".format(len(records_random_sample)))
print(records_random_sample[0:5])

Random sample of 10000 records
[(58663, '2002/20020103.ZIP', './20020103/UTIL0002/US20020002454A1-20020103.ZIP'), (2141802, '2009/I20090723.ZIP', 'project/pdds/ICEApplication/I20090723/UTIL0187/US20090187923A1-20090723.ZIP'), (2782812, '2011/I20110630.tar', 'I20110630/UTIL0161/US20110161070A1-20110630.ZIP'), (1644802, '2008/I20080103.ZIP', './I20080103/UTIL0005/US20080005016A1-20080103.ZIP'), (2551154, '2010/I20101014.tar', 'I20101014/UTIL0262/US20100262639A1-20101014.ZIP')]
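
Because `random.sample` draws a different sample on every run, later results are hard to compare between sessions. One possible way to make the sample repeatable is to draw it from a seeded generator. The helper name `sample_records`, the seed value and the toy records below are assumptions for illustration only.

```python
import random

SEED = 42  # arbitrary value, chosen here only for illustration


def sample_records(records, k, seed=SEED):
    """Draw k records without replacement using a private, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(records, k)


# Toy records shaped like the (id, archive, path) tuples printed above.
toy_records = [(i, "2002/archive.ZIP", "doc{0}.ZIP".format(i)) for i in range(100)]

sample_a = sample_records(toy_records, 10)
sample_b = sample_records(toy_records, 10)  # same seed, so the same sample
```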


In [4]:
path = '/media/SAMSUNG1/Patent_Downloads'
ds = USPublications(path)

lzy = LazyPatentCorpus()
lzy.init_by_filenames(ds, records_random_sample)

In [5]:
# One pass corpus and dictionary creation
dictionary = Dictionary()
# MmCorpus.serialize writes the corpus to disk and returns None, so there is nothing to assign
MmCorpus.serialize(
    '10000patentpubs.mm',
    (dictionary.doc2bow(pd.bag_of_words(), allow_update=True) for pd in lzy.documents)
)
dictionary.save('10000patentpubs.dict')

In [6]:
mm = MmCorpus('10000patentpubs.mm')

In [7]:
print(mm)
print(dictionary)

MmCorpus(10000 documents, 86978 features, 5659790 non-zero entries)
Dictionary(86978 unique tokens: ['mailboxquota', 'outport', 'blanchet', 'fastexport', 'foxboro']...)
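
To make the one-pass step more concrete, the sketch below shows conceptually what building a dictionary and bag-of-words vectors in a single pass involves: each document is counted, unseen tokens receive the next free id, and the document is emitted as sorted `(token_id, count)` pairs. It is a plain-Python illustration, not gensim's implementation, and `doc2bow_one_pass` and the toy documents are hypothetical.

```python
from collections import Counter


def doc2bow_one_pass(tokens, token2id):
    """Convert tokens to (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)  # next free integer id
    return sorted((token2id[token], n) for token, n in counts.items())


token2id = {}
docs = [
    ["memori", "cach", "cpu", "cach"],
    ["pixel", "sensor", "cpu"],
]

# The mapping grows while the documents stream through, so one pass is enough.
bows = [doc2bow_one_pass(doc, token2id) for doc in docs]
```

The second document reuses the id already assigned to `cpu` and only adds ids for `pixel` and `sensor`.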


---
## Latent Semantic Analysis (LSI)

This requires a corpus whose entries are TF-IDF vectors. 
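
As a worked example of the weighting, the sketch below computes one vector by hand: raw term frequency multiplied by a base-2 inverse document frequency, then scaled to unit length. We believe this matches the default settings of `TfidfModel`, but treat that as an assumption and check the gensim documentation. The function `tfidf_vector` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, tf) pairs by tf * log2(N / df) and normalise to unit length."""
    weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weights if w != 0]


# Toy bag-of-words corpus: four documents over three token ids.
corpus = [
    [(0, 3), (1, 2), (2, 5)],
    [(0, 1), (2, 1)],
    [(2, 2)],
    [(2, 1)],
]

# Document frequency: the number of documents each token id appears in.
doc_freq = Counter(tid for bow in corpus for tid, _ in bow)

# Token 2 occurs in every document, so its idf is log2(4/4) = 0 and it drops out.
vec = tfidf_vector(corpus[0], doc_freq, len(corpus))
```

Tokens that appear in every document receive zero weight, which is why very common words carry little signal in the vectors.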

In [8]:
# So first initialise tfidf object using corpus
tfidf = TfidfModel(mm)
corpus_tfidf = tfidf[mm]
for doc in corpus_tfidf[0:5]:
    print(doc)

[(0, 0.025091024129979083), (1, 0.062293004622230114), (2, 0.006912096508230463), (3, 0.004321466996340276), (4, 0.016355388040559367), (5, 0.02192840938760728), (6, 0.039134985488400054), (7, 0.022337684675173608), (8, 0.022609585466418548), (9, 0.006932871087254353), (10, 0.029829031493892932), (11, 0.008850617835900092), (12, 0.007940984148584464), (13, 0.05302443069388415), (14, 0.059138014996397543), (15, 0.013497049643399001), (16, 0.037122478874686025), (17, 0.010715547979663747), (18, 0.016935190844581012), (19, 0.00424332330892363), (20, 0.030367036970866786), (21, 0.08720988849180912), (22, 0.03258861176195701), (23, 0.00954872484384117), (24, 0.026843363667305897), (25, 0.026244743275483297), (26, 0.04725945951766508), (27, 0.04196037258239056), (28, 0.009603132824662785), (29, 0.0087667261995609), (30, 0.035344398177334965), (31, 0.08211880253418175), (32, 0.05380448439638229), (33, 0.013361386989580136), (34, 0.07236752074718068), (35, 0.08351747006660118), (36, 0.00958497

In [9]:
# extract 10 LSI topics; use the default one-pass algorithm
lsi = lsimodel.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=10)

In [10]:
lsi.print_topics(10)

[(0,
  '0.033*"flowchart" + 0.033*"rom" + 0.033*"flash" + 0.032*"ram" + 0.032*"bu" + 0.032*"keyboard" + 0.032*"readabl" + 0.032*"lan" + 0.032*"wireless" + 0.032*"cpu"'),
 (1,
  '-0.080*"japanes" + -0.075*"surfac" + -0.068*"pixel" + -0.066*"shape" + -0.064*"angl" + -0.061*"upper" + -0.061*"vertic" + -0.060*"horizont" + -0.060*"sensor" + -0.059*"width"'),
 (2,
  '-0.070*"eras" + -0.069*"pci" + -0.069*"diskett" + -0.068*"cach" + 0.067*"payment" + -0.067*"eprom" + -0.065*"firmwar" + -0.065*"bu" + 0.064*"pay" + -0.063*"semiconductor"'),
 (3,
  '-0.094*"touch" + -0.091*"panel" + -0.089*"crystal" + -0.084*"button" + -0.084*"liquid" + -0.075*"lcd" + -0.073*"japanes" + -0.073*"finger" + -0.068*"press" + -0.067*"icon"'),
 (4,
  '0.149*"japanes" + 0.088*"ye" + 0.079*"notifi" + 0.078*"hereinaft" + 0.077*"recept" + 0.077*"temporarili" + -0.074*"infrar" + -0.070*"rf" + 0.070*"destin" + -0.067*"electromagnet"'),
 (5,
  '-0.084*"voltag" + -0.068*"pin" + 0.067*"pixel" + -0.066*"batteri" + 0.064*"extrac

Quite a few of these words are patent stop words. Topic \#0, for example, appears to be built almost entirely from patent stop words.

So for LSI it may be advisable to remove both English and patent stop words.
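
One possible way to find patent stop words without a hand-made list is to drop tokens that occur in a large share of the documents. The sketch below does this in plain Python on toy data. The function names, the 0.5 threshold and the documents are assumptions for illustration; gensim's `Dictionary.filter_extremes` offers a comparable document-frequency cutoff through its `no_above` argument.

```python
from collections import Counter


def frequent_tokens(docs, max_doc_fraction):
    """Return tokens that appear in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    limit = max_doc_fraction * len(docs)
    return {token for token, n in doc_freq.items() if n > limit}


def remove_tokens(docs, stop_tokens):
    """Return a copy of docs with all stop tokens removed."""
    return [[t for t in tokens if t not in stop_tokens] for tokens in docs]


# Toy stemmed documents using tokens like those in the topics above.
docs = [
    ["flowchart", "cpu", "pixel", "sensor"],
    ["flowchart", "cpu", "payment", "pay"],
    ["flowchart", "cpu", "touch", "panel"],
    ["flowchart", "voltag", "batteri"],
]

stop_tokens = frequent_tokens(docs, max_doc_fraction=0.5)
filtered = remove_tokens(docs, stop_tokens)
```

A frequency cutoff like this catches boilerplate terms automatically, although a curated list may still be needed for rarer patent phrasing.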

In [11]:
lsi.save('10kpatentpubs.lsi')

In [12]:
# We can add new documents online using these commands:
# lsi.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
# lsi_vec = lsi[tfidf_vec] # convert some new document into the LSI space, without affecting the model

## Latent Dirichlet Allocation (LDA)

For LDA to be successful we need a large corpus: 100 documents did not provide any meaningful results (all probabilities were 0.000).

In [13]:
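# NOTE: this trains on the TF-IDF corpus; LDA is normally fitted to raw bag-of-words counts, so passing mm instead is worth trying
lda = ldamodel.LdaModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=10, passes=3)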

In [14]:
lda.print_topics()

[(0,
  '0.000*"clinician" + 0.000*"jpn" + 0.000*"poli" + 0.000*"bone" + 0.000*"kokai" + 0.000*"patholog" + 0.000*"pac" + 0.000*"dicom" + 0.000*"slit" + 0.000*"sputter"'),
 (1,
  '0.000*"thereat" + 0.000*"unrecord" + 0.000*"nit" + 0.000*"despread" + 0.000*"interfram" + 0.000*"dpl" + 0.000*"pbc" + 0.000*"reassoci" + 0.000*"sck" + 0.000*"nba"'),
 (2,
  '0.000*"edid" + 0.000*"mashup" + 0.000*"mmax" + 0.000*"dpram" + 0.000*"havi" + 0.000*"hcc" + 0.000*"ffh" + 0.000*"cen" + 0.000*"gpe" + 0.000*"takt"'),
 (3,
  '0.001*"flowchart" + 0.000*"cpu" + 0.000*"hereinaft" + 0.000*"schemat" + 0.000*"forego" + 0.000*"explain" + 0.000*"rom" + 0.000*"bu" + 0.000*"predetermin" + 0.000*"circuit"'),
 (4,
  '0.001*"trajectori" + 0.001*"euclidean" + 0.000*"spheric" + 0.000*"ei" + 0.000*"perpendicularli" + 0.000*"dct" + 0.000*"mouth" + 0.000*"sl" + 0.000*"parallax" + 0.000*"tangent"'),
 (5,
  '0.000*"conson" + 0.000*"escon" + 0.000*"fraudster" + 0.000*"uselessli" + 0.000*"vowel" + 0.000*"gop" + 0.000*"costum" +
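
The many 0.000 weights above are at least partly a display effect. With 86978 tokens in the dictionary, a topic that spreads its probability evenly gives each token a weight far below the three decimals shown. The short calculation below illustrates the scale involved; it is simple arithmetic on the vocabulary size, not a diagnosis of the model.

```python
vocab_size = 86978  # number of unique tokens reported for the dictionary

# Weight of each token in a perfectly uniform topic.
uniform = 1.0 / vocab_size

# The same weight rounded to three decimals, as in the printed topics.
shown = "{0:.3f}".format(uniform)

# How many times larger than uniform a printed weight of 0.001 is.
ratio = 0.001 / uniform
```

So a token shown with weight 0.001 is roughly 87 times more probable than under a uniform topic, even though most entries round to 0.000.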

In [17]:
lda.save('10kpatentpubs.lda')