# 2.3 Text Processing and Industry Mapping

by Constantin Knoll, Christopher Mosch, Rohan Thavarajah

## Summary

This notebook predicts, for each patent, the industry to which it most likely belongs. This is done in three steps. First, we transform the raw patent text into the form required for Latent Dirichlet Allocation (LDA). In particular, we extract the abstract from each patent and the nouns it contains. We focus on abstracts because they are the most expressive part of a patent and thus most relevant for our purpose. Second, we extract topics from the patent data using the LDA implementation in `gensim`. Third, we map the LDA topics to industries and thereby obtain an industry prediction for each patent. To build the mapping, we first feed the nouns of the industry definitions into the trained LDA model and use the resulting industry-topic relation to create a sparse industry-topic matrix. For each patent, we then multiply the industry-topic matrix with the sparse vector of the patent's topic probabilities, which yields for each industry the probability that the patent belongs to it. Finally, these probabilities and additional data such as the inventor's company are put into a data frame, which is saved to S3.

![Image](Data\Images\Workflow_2.3.png?raw=true)

## Table of Contents

* [Change Log](#Change-Log)
* [Setup](#Setup)


* [Reading in Data](#Reading)
* [Preparing Data for LDA](#Preparing)
* [Performing LDA](#Performing)
* [Mapping Patents to Industries](#Mapping)
* [Saving to S3](#Saving)


* [Appendix](#Appendix)
    * [Debugging Spark code](#A1)
    * [Industry-Topic Intersection](#A2)

## Change Log
### v.1
- initial build

### v.2 
- added code for Spark

### v.3 
- added code to load data from web

### v.4
- switched from `nltk` to `pattern` due to issues of `nltk` with Spark

### v.5
- switched from `BeautifulSoup` to `lxml` for performance
- mapping from topics to industries
- saves results to AWS

<a id='Setup'></a>
## Setup

While some parts do not require Spark, it is needed to run the full notebook. On a local machine (via Vagrant), the amount of data that can be processed is limited.

### AWS

To perform the analysis on the full data, we used a single r3.8xlarge instance; the code can also be run on a cluster. After setting up the AWS CLI and an EC2 SSH key pair, we used the following specifications for the instance.


In [None]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

### Vagrant

In [None]:
! pip install pattern

In [None]:
# Spark

import findspark
findspark.init()
print findspark.find()
# Depending on your setup you might have to change this line of code
#findspark makes sure I dont need the below on homebrew.
#os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
#the below actually broke my spark, so I removed it. 
#Depending on how you started the notebook, you might need it.
#os.environ['PYSPARK_SUBMIT_ARGS']="--master local pyspark --executor-memory 4g"

import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set('spark.executor.memory', '4g')
    )
sc = pyspark.SparkContext(conf=conf)

from pyspark.sql import SQLContext
from pyspark.sql.types import *

### Load Libraries

In [1]:
# Reading in Data
import os, requests, zipfile, StringIO
from bs4 import BeautifulSoup

# Preparing Data for LDA
from lxml import etree 
import collections
from pattern.en import tag
#import nltk
#from nltk.tokenize import word_tokenize

# LDA
import gensim

# Mapping Patents to Industries
import json
import numpy as np
import scipy.sparse as sps

In [2]:
# not required by default since we use pattern instead of nltk
#nltk.download('punkt')
#nltk.download('maxent_treebank_pos_tagger')
#nltk.download('averaged_perceptron_tagger')

## Reading in Data

In notebook *2.1 Google Patent Data*, we created zip files for each week that contain the patents as xml files. From these xml files, we now create a dictionary `data` with the file names as keys and the xml strings as values. So the signature of `data` looks like

`{'US06334220-20020101.XML': '<?xml version="1.0"...', 'US06334221-20020101.XML': '<?xml version="1.0"...', ...}`.

This dictionary can be created in three different ways from the zip files. First, if the data is contained in several zip files, a txt file with links to the zip files is used. The txt file used below, https://s3.amazonaws.com/cs109project/2002-2004.txt, contains one link in each row, as illustrated below

<br>
<center>
https://s3-us-west-1.amazonaws.com/ckpatents/2002/20020101.zip  
https://s3-us-west-1.amazonaws.com/ckpatents/2002/20020108.zip  
https://s3-us-west-1.amazonaws.com/ckpatents/2002/20020115.zip  
https://s3-us-west-1.amazonaws.com/ckpatents/2002/20020122.zip  
https://s3-us-west-1.amazonaws.com/ckpatents/2002/20020129.zip  
https://s3-us-west-1.amazonaws.com/ckpatents/2002/20020205.zip  
...  
</center>

From each zip file, all xml files contained in it are extracted.
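
As a rough sketch of this step (not the notebook's exact code), the snippet below builds a small zip archive in memory and reads its xml members into a dictionary keyed by file name. It is written for Python 3, where `io.BytesIO` takes the place of `StringIO.StringIO`. The helper names `parse_url_list` and `xmls_from_zip_bytes`, the example URLs, and the sample file names are made up for illustration.

```python
import io
import os
import zipfile


def parse_url_list(text):
    """Split the body of a link file into URLs, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def xmls_from_zip_bytes(content):
    """Return {file name: xml bytes} for every .xml member of a zip archive."""
    buffer = io.BytesIO(content)
    if not zipfile.is_zipfile(buffer):
        return {}
    with zipfile.ZipFile(buffer) as z:
        return {
            os.path.basename(name): z.read(name)
            for name in z.namelist()
            if os.path.splitext(name.lower())[1] == '.xml'
        }


# Build a toy archive in memory so the example runs without network access.
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as z:
    z.writestr('20020101/US00000001-20020101.XML', '<?xml version="1.0"?><doc/>')
    z.writestr('20020101/readme.txt', 'not a patent')

data_example = xmls_from_zip_bytes(archive.getvalue())
urls_example = parse_url_list('http://example.com/a.zip\r\nhttp://example.com/b.zip\r\n')
print(sorted(data_example), urls_example)
```

Using `splitlines()` and dropping empty entries avoids requesting an empty URL when the link file ends with a trailing line break.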

In [3]:
%%time
# load data from web (given url to file with urls of zips,
# e.g. 'https://s3.amazonaws.com/cs109project/2004.txt')
# for each zip: loads all xml files into dictionary with key=filename

urls = 'https://s3.amazonaws.com/cs109project/2002-2004.txt'
rs = requests.get(urls)
data = {}

# loop through urls
for url in rs.content.split('\r\n'):
    #print url
    r = requests.get(url)
    if zipfile.is_zipfile(StringIO.StringIO(r.content)):
        z = zipfile.ZipFile(StringIO.StringIO(r.content))
        xmls = [member for member in z.namelist() if os.path.splitext(member.lower())[1]=='.xml']
        newdata = {os.path.basename(xml): z.open(xml).read() for xml in xmls}
        data.update(newdata)
    else:
        print url+' contains no zip file.'

CPU times: user 3min 1s, sys: 19.7 s, total: 3min 21s
Wall time: 9min 16s


Secondly, if the data is contained in a single zip file, this can be read in directly. Again the code gets any xml file contained in the zip file.

In [None]:
%%time
# load data from web (given url to zip file)
# e.g. 'https://s3.amazonaws.com/cs109project/Unpacked+Data.zip'
# loads all xml files from given zip into dictionary with key=filename

url = 'https://s3.amazonaws.com/cs109project/Unpacked+Data.zip'
r = requests.get(url)
if zipfile.is_zipfile(StringIO.StringIO(r.content)):
    z = zipfile.ZipFile(StringIO.StringIO(r.content))
    xmls = [member for member in z.namelist() if os.path.splitext(member.lower())[1]=='.xml']
    data = {os.path.basename(xml): z.open(xml).read() for xml in xmls}
else:
    print 'URL does not link to zip file. Please provide valid URL.'

Lastly, data can be read in from disk where a nested structure with two levels is assumed.

In [None]:
# load data from disk
# loads data with structure year\weeks\xmls into dictionary with key=filename
source = os.getcwd()
path = os.path.join(source,'2014')
data = {}
for week in os.listdir(path):
    week_path = os.path.join(path, week)
    for patent in os.listdir(week_path):
        patent_path = os.path.join(week_path, patent)
        if os.path.isfile(patent_path):
            with open(patent_path, 'r') as my_file:
                data[patent] = my_file.read()

In [4]:
len(data)

500642

<a id='Preparing'></a>
## Preparing Data for LDA

In order to extract topics from patents, we need to obtain a list of nouns for each patent. We do this in two steps. First, we get the abstract of each patent; we focus on abstracts because they are the most expressive part of a patent and thus most relevant for our analysis. Second, we get the nouns from each abstract and save them in the corpus form that `gensim`, the topic modelling library we use, requires.

Moreover, in the first step we need to account for the fact that the structure of the xml files has changed at the end of 2004, as illustrated below. The image at the top shows the structure before the change while the image at the bottom displays the structure after the change.

Before change (2004 and earlier)

![Image](Data\Images\2.3 2004-.png?raw=true)

After change (2005 and later)

![Image](Data\Images\2.3 2005+.png?raw=true)

To account for this change, the functions below take the parameter `xml_format`, where the value 'pre2005' indicates the first format and any other value selects the second. Since we will later use additional information such as an inventor's geography and company for the analysis, we also define a function that extracts these fields.
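
To illustrate how the `xml_format` switch works, here is a minimal sketch on two heavily simplified, made-up documents. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra packages; the function name `get_abstract_sketch` and the toy XML strings are assumptions for illustration, and real patent files are nested more deeply.

```python
import xml.etree.ElementTree as ET

# Heavily simplified, made-up stand-ins for the two layouts.
PRE2005 = '<PATDOC><SDOAB><BTEXT><PARA><PTEXT><PDAT>A toy abstract.</PDAT></PTEXT></PARA></BTEXT></SDOAB></PATDOC>'
POST2005 = '<us-patent-grant><abstract><p>A toy abstract.</p></abstract></us-patent-grant>'


def get_abstract_sketch(xml, xml_format='pre2005'):
    """Return the abstract text, or '' if the expected element is missing."""
    root = ET.fromstring(xml)
    if xml_format == 'pre2005':
        # old layout: first PDAT element anywhere below SDOAB
        section = root.find('.//SDOAB')
        node = section.find('.//PDAT') if section is not None else None
    else:
        # new layout: first child of <abstract>
        section = root.find('.//abstract')
        node = section[0] if section is not None and len(section) else None
    return node.text if node is not None and node.text else ''


print(get_abstract_sketch(PRE2005))
print(get_abstract_sketch(POST2005, xml_format='post'))
```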

In [4]:
def get_abstract_soup(xml, xml_format='pre2005'):
    if xml_format=='pre2005':
        # .strip() to get rid of the tags that are between the abstract and the entrance point ('sdoab') that is used
        abstract = BeautifulSoup(xml, 'lxml').find('sdoab').get_text().strip()
    else:
        abstract = BeautifulSoup(xml, 'lxml').find('abstract').get_text()
    return abstract


def get_abstract_lxml(xml, xml_format='pre2005'):
    try:
        if xml_format=='pre2005':
            xml = xml.replace('&','')  #to avoid issues caused by the way special characters are saved 2004 and earlier
            abstract = etree.XML(xml).xpath('//SDOAB//PDAT')[0].text
        else:
            abstract = etree.XML(xml).xpath('//abstract')[0][0].text

        if not abstract:
            #lxml less robust than soup (e.g. lxml returns None for US08623623-20140107.XML)
            abstract = get_abstract_soup(xml, xml_format)
    except:
        abstract=''
        
    return abstract


def get_supplement_lxml(xml, xml_format='pre2005'):
    
    if xml_format=='pre2005':
        xml = xml.replace('&','')   #to avoid issues caused by the way special characters are saved 2004 and earlier
        city_tag = '//CITY//PDAT'
        state_tag = '//STATE//PDAT'
        assignee_tag = '//ONM//PDAT'
    else:
        city_tag = '//city'
        state_tag = '//state'
        assignee_tag = '//assignee//orgname'        
        
    # inventor geography 
    try:
        inv_city = etree.XML(xml).xpath(city_tag)[0].text
    except:
        inv_city = None

    try:
        inv_state = etree.XML(xml).xpath(state_tag)[0].text
    except:
        inv_state = None

    if inv_state:
        try:
            if xml_format=='pre2005':
                inv_ctry = etree.XML(xml).xpath(state_tag)[0].getnext().getchildren()[0].text
            else:
                inv_ctry = etree.XML(xml).xpath(state_tag)[0].getnext().text
        except:
            inv_ctry = None
    else:
        try:
            if xml_format=='pre2005':
                inv_ctry = etree.XML(xml).xpath('//CITY')[0].getnext().getchildren()[0].text
            else:
                inv_ctry = etree.XML(xml).xpath(city_tag)[0].getnext().text            
        except:
            inv_ctry = None

    # inventor company        
    try:
        assignee = etree.XML(xml).xpath(assignee_tag)[0].text
    except:
        assignee = None
    return [inv_city, inv_state, inv_ctry, assignee]

The following illustrates the usage of these functions and their output.

In [6]:
get_supplement_lxml(data[data.keys()[1]])#, xml_format='post')

['Rolling Hills', 'CA', None, 'Ledtronics, Inc.']

In [7]:
get_abstract_lxml(data[data.keys()[1]])#, xml_format='post')

'A light source in the form of a light emitting diode (LED) cluster module suitable for use as an aircraft forward position light source. The light source comprises multiple LED components mounted on a base structure together with supporting electronic components to regulate the function of the LED components. The LED components are configured on the base structure in a manner so as to be capable of complying with the Federal Aviation Regulations minimum light intensities or candela requirements and color specifications while in a preferred implementation using a traditional aircraft 28-volt power supply. Furthermore, the preferred implementations are capable of meeting stringent dimensional design criteria and therefore are suitably adaptable as replacement light sources for existing aircraft forward position light housings.'

In [8]:
get_abstract_soup(data[data.keys()[1]])#, xml_format='post')

u'A light source in the form of a light emitting diode (LED) cluster module suitable for use as an aircraft forward position light source. The light source comprises multiple LED components mounted on a base structure together with supporting electronic components to regulate the function of the LED components. The LED components are configured on the base structure in a manner so as to be capable of complying with the Federal Aviation Regulations minimum light intensities or candela requirements and color specifications while in a preferred implementation using a traditional aircraft 28-volt power supply. Furthermore, the preferred implementations are capable of meeting stringent dimensional design criteria and therefore are suitably adaptable as replacement light sources for existing aircraft forward position light housings.'

As explained above, in the second step we extract all nouns from the abstracts and turn them into the form that is required for LDA. Due to issues of nltk with AWS, we decided to switch to the library `pattern`. The issues are described in more detail in the appendix, and although we figured out a workaround, we continued to use `pattern` as it is significantly faster than the workaround.

In [5]:
def get_nouns(text):  
    # using pattern
    tagged = tag(text.lower(), tokenize=True)
    
    # using nltk (if statements only necessary on AWS)
    #if '/tmp' not in nltk.data.path:
    #    nltk.data.path.append('/tmp')
    #if 'tokenizers' not in os.listdir('/tmp'):
    #    nltk.download('punkt', '/tmp')
    #    nltk.download('maxent_treebank_pos_tagger', '/tmp')
    #    nltk.download('averaged_perceptron_tagger', '/tmp')
    #tokenized = word_tokenize(text.lower())
    #tagged = nltk.pos_tag(tokenized)
    
    nouns = [a for (a, b) in tagged if b == 'NN']
    return nouns

# for each noun in list of nouns: get its id and its number of occurrences in the list
def tocorpus(nouns,vocabulary):
    count = collections.defaultdict(int) # to count number of occurrences of a noun
    for noun in nouns:
        count[vocabulary[noun]] +=1  # for new nouns: creates new key, sets value to 1. for existing keys: increases value by 1
    return count.items()

Next, we apply the functions discussed above to the xml files in our dictionary `data`. Thereby, we obtain the dictionary `id2word` and the list `corpval` of tuples, which are passed to the LDA in the next step. There are both Spark and non-Spark versions of the code to do this. Due to the amount of data involved, however, it is highly recommended to use Spark. Nevertheless, the non-Spark version that was used for some initial testing can be found further down.
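
To make the shape of these objects concrete, the following sketch builds them for two made-up noun lists without Spark. The names `nouns_by_patent`, `vocab_example`, `id2word_example`, `to_bow`, and `corpus_example` are hypothetical and chosen so that they do not clash with the notebook's variables.

```python
import collections

# Toy noun lists standing in for the output of get_nouns (made-up data).
nouns_by_patent = {
    'P1.XML': ['light', 'source', 'light', 'module'],
    'P2.XML': ['module', 'battery'],
}

# word-to-number and number-to-word dictionaries
words = sorted({noun for nouns in nouns_by_patent.values() for noun in nouns})
vocab_example = {word: i for i, word in enumerate(words)}
id2word_example = {i: word for word, i in vocab_example.items()}


def to_bow(nouns, vocabulary):
    """Count how often each noun id occurs (same idea as tocorpus)."""
    counts = collections.Counter(vocabulary[noun] for noun in nouns)
    return sorted(counts.items())


corpus_example = {k: to_bow(v, vocab_example) for k, v in nouns_by_patent.items()}
print(corpus_example['P1.XML'])  # [(1, 2), (2, 1), (3, 1)]
```

Each document becomes a list of `(noun id, count)` tuples, which is the bag-of-words form that `gensim` expects as corpus input.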

In [None]:
# set number of partitions for rdd throughout notebook
part = 8192

In [None]:
%%time
# code requires Spark

data_rdd = sc.parallelize(data.iteritems(),part)
print data_rdd.getNumPartitions()

data_nouns = (data_rdd.mapValues(lambda v: get_abstract_lxml(v))
               .mapValues(lambda v: get_nouns(v))
).cache()

# associates all distinct words with a number
vocabtups = (data_nouns.flatMap(lambda (k,v): v)
             .distinct()
             .zipWithIndex()
)

vocab = vocabtups.collectAsMap()                            #word-to-number dict
id2word = vocabtups.map(lambda (x,y): (y,x)).collectAsMap() #number-to-word dict

corpus = data_nouns.mapValues(lambda v: tocorpus(v, vocab))
corpval = corpus.values().collect()

Since the above code might fail due to insufficient resources, it is often useful to examine disk and memory usage, which can be done with the code below. Among other things, this analysis helped us identify issues with our Spark configuration. For example, it showed that we needed to set `SPARK_EXECUTOR_DIRS="/mnt/spark/"` and `SPARK_WORKER_DIR="/mnt/spark/"`, as otherwise Spark saves its data to the root directory instead of `/mnt`.

In [10]:
# on AWS/vagrant: check inodes
! df -i

Filesystem        Inodes  IUsed     IFree IUse% Mounted on
/dev/xvda1        655360 178241    477119   28% /
devtmpfs        31485576    669  31484907    1% /dev
tmpfs           31487830      1  31487829    1% /dev/shm
/dev/xvdb      314572800   9232 314563568    1% /mnt
/dev/xvdc      314572800     81 314572719    1% /mnt1


In [11]:
# on AWS/vagrant: check disk space
! df -h

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      9.8G  8.1G  1.7G  84% /
devtmpfs        121G   68K  121G   1% /dev
tmpfs           121G     0  121G   0% /dev/shm
/dev/xvdb       300G  231M  300G   1% /mnt
/dev/xvdc       300G  341M  300G   1% /mnt1


In [12]:
# on AWS/vagrant: check memory usage
! free -m

             total       used       free     shared    buffers     cached
Mem:        245998      61609     184389          0         87       3184
-/+ buffers/cache:      58337     187661
Swap:            0          0          0
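
The disk check above can also be scripted, for example to stop early when a volume is nearly full. The sketch below assumes Python 3, since `shutil.disk_usage` is not available in the Python 2 environment used by this notebook; the helper name `disk_report` is made up for illustration.

```python
import shutil


def disk_report(path='/'):
    """Return (total, used, free) in GiB for the file system containing path."""
    usage = shutil.disk_usage(path)
    gib = 1024.0 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


total, used, free = disk_report('/')
print('total %.1f GiB, used %.1f GiB, free %.1f GiB' % (total, used, free))
```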


The non-Spark code is significantly slower but useful for quick tests on small amounts of data.

In [None]:
%%time
# code does not require Spark

data_parsed = {k: get_abstract_lxml(v) for (k,v) in data.iteritems()}
data_nouns = {k: get_nouns(v) for (k,v) in data_parsed.iteritems()}

# flattens all lists of nouns into one set (set -> each noun only once)
flat = {item for sublist in data_nouns.values() for item in sublist}

vocab = dict(zip(flat,range(len(flat))))    #word-to-number dict 
id2word = dict(zip(range(len(flat)),flat))  #number-to-word dict

corpus = {k: tocorpus(v,vocab) for (k,v) in data_nouns.iteritems()}
corpval = corpus.values()

`data` and `corpval` should have the same length.

In [8]:
print len(data),len(corpval)
print len(vocab)

336352 336352
106625


<a id='Performing'></a>
## Performing LDA

Since this step was not the bottleneck with respect to computation time, we used the nondistributed version of ldamodel. Examples for the topics obtained from the LDA can be expanded below.

In [9]:
%%time
num_topics=40
lda2 = gensim.models.ldamodel.LdaModel(corpval, id2word=id2word, num_topics=num_topics, passes=1)

CPU times: user 8min 54s, sys: 45 s, total: 9min 39s
Wall time: 9min 39s


In [10]:
lda2.print_topics(40)

[u'0.119*vehicle + 0.047*table + 0.040*card + 0.035*system + 0.024*tray + 0.022*game + 0.022*feature + 0.021*camera + 0.016*page + 0.015*format',
 u'0.226*image + 0.054*color + 0.050*radiation + 0.049*apparatus + 0.030*article + 0.028*imaging + 0.025*detector + 0.023*rail + 0.018*vector + 0.018*needle',
 u'0.099*composition + 0.076*weight + 0.056*resin + 0.056*polymer + 0.040*b + 0.035*zone + 0.028*invention + 0.025*resistance + 0.023*fabric + 0.021*component',
 u'0.092*set + 0.088*value + 0.062*point + 0.044*sample + 0.038*pad + 0.036*video + 0.035*method + 0.029*pixel + 0.026*strip + 0.021*stream',
 u'0.314*device + 0.119*region + 0.095*area + 0.059*display + 0.053*type + 0.026*screen + 0.025*matrix + 0.014*bond + 0.014*conductivity + 0.010*n',
 u'0.133*heat + 0.105*container + 0.062*transistor + 0.045*injection + 0.039*heating + 0.034*food + 0.027*compartment + 0.022*lid + 0.021*dispenser + 0.019*exchange',
 u'0.130*water + 0.072*coating + 0.057*solution + 0.033*content + 0.027*plas
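
The topic strings above follow the pattern `weight*word + weight*word + ...`. If the top words are needed programmatically, a string of this form can be split into pairs, as in the sketch below. The helper `parse_topic` is hypothetical and not part of `gensim`; it also strips double quotes, since some `gensim` versions wrap the words in them.

```python
def parse_topic(topic_string):
    """Turn '0.119*vehicle + 0.047*table' into [('vehicle', 0.119), ('table', 0.047)]."""
    pairs = []
    for term in topic_string.split(' + '):
        weight, word = term.split('*', 1)
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


example = '0.119*vehicle + 0.047*table + 0.040*card'
print(parse_topic(example))
```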

<a id='Mapping'></a>
## Mapping Patents to Industries

### Topic-Industry Relation

In [11]:
# fetch naics nouns
naics_url = 'https://s3.amazonaws.com/cs109projectr/naics_nouns.json'
naics_r = requests.get(naics_url)
naics_nouns = json.loads(naics_r.content)

# take only manufacturing
industries =  [u'311',u'312',u'313',u'314',u'315',u'316',u'321',u'322',u'323',u'324',u'325',
               u'326',u'327',u'331',u'332',u'333',u'334',u'335',u'336',u'337',u'339'
              ]
naics_subset = []
for i in naics_nouns:
    if i['naics_code'] in industries:
        naics_subset.append(i)


In [12]:
# check that no invalid naics code entered
print len(industries), len(naics_subset)

21 21


In [13]:
# format naics data
'''
format_naics
input = list of dictionaries with naics definitions (json output of ind definitions ipython notebook)
output = naics_tuples, naics_codes
naics_tuples = list of tuples, tuple1 = (noun1, count1), tuple2 = ()...
naics_codes = list of naics codes
'''
def format_naics(input):
    naics_tuples = [i['noun_dict'].items() for i in input]
    naics_codes = [i['naics_code'] for i in input]
    return naics_tuples, naics_codes

In [14]:
naics_tuples, naics_codes = format_naics(naics_subset)
print len(naics_tuples), len(naics_codes)

21 21


In [16]:
# replace naics nouns with vocab id
# i.e. preps naics definitions with the same signature as corpval
naics_output = []
for i in naics_tuples:
    new_noun_list = []
    for j in i:
        try:
            new_noun_list.append((vocab[j[0]], j[1]))
        except:
            pass
            #print j[0]
    naics_output.append(new_noun_list)

In [17]:
# sparse matrix of industry-topic relation
val = []
col = []
row = []
for i,industry in enumerate(naics_output):
    for topic in lda2.get_document_topics(industry):
        row.append(i) # only changes after len(industry) numbers of loops
        col.append(topic[0]) # index of topic
        val.append(topic[1]) # probability that the industry belongs to this topic
# spark doc recommends np.array over list for efficiency (http://spark.apache.org/docs/latest/mllib-data-types.html)
indstr_tpc = sps.csc_matrix((np.array(val), (np.array(row), np.array(col))), shape=(len(industries), num_topics))

In [18]:
indstr_tpc.toarray() # sanity check: compare this to [lda2.get_document_topics(i) for i in naics_output]

array([[ 0.03742877,  0.01135007,  0.        ,  0.02473819,  0.02121565,
         0.01255455,  0.02051569,  0.        ,  0.04160644,  0.01745859,
         0.02478754,  0.01134821,  0.12848977,  0.02547809,  0.02672588,
         0.        ,  0.01314119,  0.02333849,  0.01169152,  0.        ,
         0.015762  ,  0.02421776,  0.        ,  0.06768398,  0.04058414,
         0.        ,  0.        ,  0.01483675,  0.02052445,  0.        ,
         0.        ,  0.0275468 ,  0.0196355 ,  0.01645676,  0.02288635,
         0.18732518,  0.        ,  0.01367302,  0.        ,  0.01077511],
       [ 0.10917954,  0.        ,  0.        ,  0.01375668,  0.        ,
         0.        ,  0.        ,  0.01833028,  0.        ,  0.        ,
         0.06726431,  0.        ,  0.03188074,  0.06607134,  0.        ,
         0.03609818,  0.0678537 ,  0.        ,  0.        ,  0.        ,
         0.01585597,  0.0197514 ,  0.01150197,  0.        ,  0.01042443,
         0.01588586,  0.        ,  0.        ,  0.

As an alternative to the approach described above, one could take the intersection between the nouns associated with each LDA topic and the nouns of each industry's definition. For a particular topic, the industry whose definition has the largest intersection with this topic would then be assumed to be the industry that the topic represents. This process is illustrated in the Appendix of this notebook (note that SIC instead of NAICS codes are used as industry definitions there). Another version of the intersection approach would be to directly take the intersection of the nouns of the industry definitions and the nouns of a patent abstract. However, because a single patent is very specialized compared to the scope of a whole industry, we preferred the approach used above.
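
A minimal sketch of the intersection idea on toy data is shown below. The noun lists are made up, the industry codes are only used as labels, and the helper `best_industry` is hypothetical.

```python
# Made-up topic and industry nouns, for illustration only.
topic_nouns = {
    0: ['vehicle', 'engine', 'wheel', 'card'],
    1: ['polymer', 'resin', 'composition', 'weight'],
}
industry_nouns = {
    '325': ['chemical', 'resin', 'polymer', 'paint'],
    '336': ['vehicle', 'engine', 'aircraft', 'ship'],
}


def best_industry(nouns, industries):
    """Return the industry code sharing the most nouns, or None if nothing overlaps."""
    overlaps = {code: len(set(nouns) & set(words)) for code, words in industries.items()}
    code = max(overlaps, key=overlaps.get)
    return code if overlaps[code] > 0 else None


topic_to_industry = {t: best_industry(n, industry_nouns) for t, n in topic_nouns.items()}
print(topic_to_industry)  # {0: '336', 1: '325'}
```

Unlike the matrix approach, this assigns exactly one industry per topic and ignores how strongly each noun is associated with the topic.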

### Patent-Industry Relation

**Code for Single Machine**

In [19]:
'''
input
nouns: list of (noun id, count) tuples with the same signature as the entries of corpval
lda: trained lda model
indstr_tpc_matrix: matrix of relation between industries and topics (dimensions: #industries x #topics)

output
list of probabilities, one for each industry
'''
def tpc2indstr(nouns, lda, indstr_tpc_matrix):
    topics = lda.get_document_topics(nouns)
    
    # turn topic probabilities into sparse row vector
    val = []
    col = []
    for topic in topics:
        col.append(topic[0]) # index of topic
        val.append(topic[1]) # probability that the patent belongs to this topic
    tpc_vector = sps.csc_matrix((np.array(val), (np.zeros(len(val)), np.array(col))), shape=(1, num_topics))
    
    # combine probabilities with relation between topics and industries
    indstr = indstr_tpc_matrix * tpc_vector.transpose()
    
    # normalize such that sum = 1 -> interpretable as probability
    indstr_norm = indstr / indstr.sum()
    
    # convert sparse matrix into array into flattened list
    return indstr_norm.transpose().toarray().flatten().tolist()

In [20]:
# on single instance: grant permission
! sudo chmod -R ugo+rw /home

In [21]:
# recall: corpval = corpus.values().collect()
instrs = corpus.mapValues(lambda v: tpc2indstr(v, lda2, indstr_tpc))

**Code for Cluster**

In [22]:
'''
input
tpcs: list of (topic index, probability) tuples as returned by lda.get_document_topics
indstr_tpc_matrix: matrix of relation between industries and topics (dimensions: #industries x #topics)

output
list of probabilities, one for each industry
'''
def tpc2indstr_cluster(tpcs, indstr_tpc_matrix):
    # turn topic probabilities into sparse row vector
    val = []
    col = []
    for topic in tpcs:
        col.append(topic[0]) # index of topic
        val.append(topic[1]) # probability that the patent belongs to this topic
    tpc_vector = sps.csc_matrix((np.array(val), (np.zeros(len(val)), np.array(col))), shape=(1, num_topics))
    
    # combine probabilities with relation between topics and industries
    indstr = indstr_tpc_matrix * tpc_vector.transpose()
    
    # normalize such that sum = 1 -> interpretable as probability
    indstr_norm = indstr / indstr.sum()
    
    # convert sparse matrix into array into flattened list
    return indstr_norm.transpose().toarray().flatten().tolist()

In [30]:
%%time
# to avoid permission error when running get_document_topics on executor
tpcs = {k: lda2.get_document_topics(v) for (k,v) in corpus.collect()}
tpcs_rdd = sc.parallelize(tpcs.iteritems(), part)

CPU times: user 11.2 s, sys: 24 ms, total: 11.2 s
Wall time: 19.1 s


In [31]:
# recall: corpval = corpus.values().collect()
instrs = tpcs_rdd.mapValues(lambda v: tpc2indstr_cluster(v, indstr_tpc))

**Output**

Both the code for a single machine and the code for a cluster lead to the following output:

In [22]:
instrs.take(1)

[('US06398796-20020604.XML',
  [0.06057124709195175,
   0.10573301575757516,
   0.06966995883601858,
   0.04885883488366097,
   0.03686248434392848,
   0.05246972371934844,
   0.011626566913621985,
   0.02329907329891271,
   0.038881210540898416,
   0.042260456674565385,
   0.05192760959506576,
   0.046482927690773265,
   0.05991594511393557,
   0.07001816033709758,
   0.07116977112764734,
   0.06602644997251943,
   0.02306227708637173,
   0.027036300864515205,
   0.0727274742755223,
   0.009109375931653937,
   0.012291135944415972])]
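
To make the computation behind these probabilities concrete, here is a toy sketch with two industries and three topics. It uses dense `numpy` arrays instead of the sparse matrices above, and all numbers are made up.

```python
import numpy as np

# Toy industry-topic matrix: 2 industries (rows) x 3 topics (columns).
industry_topic = np.array([
    [0.6, 0.0, 0.2],
    [0.1, 0.5, 0.0],
])

# Sparse-style topic output for one patent, as (topic index, probability) pairs.
patent_topics = [(0, 0.5), (2, 0.5)]

# Expand the pairs into a dense topic vector.
topic_vector = np.zeros(industry_topic.shape[1])
for index, prob in patent_topics:
    topic_vector[index] = prob

# Weight each industry by the patent's topic mix, then rescale to sum to 1.
scores = industry_topic.dot(topic_vector)
industry_probs = scores / scores.sum()
print(industry_probs)
```

Here the first industry receives a score of 0.4 and the second 0.05, so the normalized values are 8/9 and 1/9. If a patent's topics do not overlap with any industry, the sum of the scores is zero and the division needs a guard.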

### Adding Additional Data

In order to put our industry predictions into a larger context and to validate them, we extract metadata from the patents, such as the inventor's geographic location, the assignee's name, and the patent's grant date.

In [None]:
supp = data_rdd.mapValues(lambda v: get_supplement_lxml(v, xml_format='pre2005'))

In [23]:
joined = supp.join(instrs)

In [24]:
%%time
joined.take(1)

CPU times: user 3.07 s, sys: 8.63 s, total: 11.7 s
Wall time: 12min 18s


[('US06659620-20031209.XML',
  (['Saitama', None, 'JP', 'Jordan and Hamburg LLP'],
   [0.08054396266495348,
    0.08128860298136581,
    0.023158182775195733,
    0.01962892205371323,
    0.07515835664634257,
    0.07140715224128373,
    0.019797497078430804,
    0.006786676430313621,
    0.032930273229061564,
    0.02245926020931655,
    0.10732746890213565,
    0.03673797683032208,
    0.0718654910679785,
    0.047969165869005397,
    0.08237789050249114,
    0.08681317757547358,
    0.051267234953028214,
    0.03036526211576938,
    0.01740802801738065,
    0.014293692222322286,
    0.020415725634116057]))]

In [25]:
def in1list(patent,data):
    supp = data[0]
    indstrs = data[1]
    patent_id, date = patent[:-4].split('-')
    merged = [patent_id, date] + supp + indstrs
    return merged

In [26]:
joined_flat = joined.map(lambda (k,v): in1list(k,v))

In [27]:
joined_flat.take(1)

[['US06659620',
  '20031209',
  'Saitama',
  None,
  'JP',
  'Jordan and Hamburg LLP',
  0.08054396266495348,
  0.08128860298136581,
  0.023158182775195733,
  0.01962892205371323,
  0.07515835664634257,
  0.07140715224128373,
  0.019797497078430804,
  0.006786676430313621,
  0.032930273229061564,
  0.02245926020931655,
  0.10732746890213565,
  0.03673797683032208,
  0.0718654910679785,
  0.047969165869005397,
  0.08237789050249114,
  0.08681317757547358,
  0.051267234953028214,
  0.03036526211576938,
  0.01740802801738065,
  0.014293692222322286,
  0.020415725634116057]]

In [28]:
sqlContext = SQLContext(sc)

In [29]:
supplements = ['city', 'state', 'country', 'assignee']
colnames = ['patent_id', 'date'] + supplements + industries

fields = [StructField(field_name, StringType(), True) for field_name in colnames]
schema = StructType(fields)

In [30]:
df = sqlContext.createDataFrame(joined_flat, schema)

In [31]:
print df.count(), len(data)

336352 336352


In [32]:
df.take(3)

[Row(patent_id=u'US06659620', date=u'20031209', city=u'Saitama', state=None, country=u'JP', assignee=u'Jordan and Hamburg LLP', 311=u'0.08054396266495348', 312=u'0.08128860298136581', 313=u'0.023158182775195733', 314=u'0.01962892205371323', 315=u'0.07515835664634257', 316=u'0.07140715224128373', 321=u'0.019797497078430804', 322=u'0.006786676430313621', 323=u'0.032930273229061564', 324=u'0.02245926020931655', 325=u'0.10732746890213565', 326=u'0.03673797683032208', 327=u'0.0718654910679785', 331=u'0.047969165869005397', 332=u'0.08237789050249114', 333=u'0.08681317757547358', 334=u'0.051267234953028214', 335=u'0.03036526211576938', 336=u'0.01740802801738065', 337=u'0.014293692222322286', 339=u'0.020415725634116057'),
 Row(patent_id=u'US06613991', date=u'20030902', city=u'Murfreesboro', state=u'TN', country=None, assignee=u'France/Scott Fetzer Company', 311=u'0.06978115088587478', 312=u'0.1677635784388767', 313=u'0.05630981385863626', 314=u'0.033616124725763585', 315=u'0.0', 316=u'0.041470

<a id='Saving'></a>
## Saving to S3

In [33]:
df.repartition(1).write.save('s3n://cs109project/df11', format='json')
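
Spark's JSON writer stores one JSON object per line. As a sketch of how a saved record could be consumed later, the snippet below parses one made-up line in the shape of the rows shown earlier and picks the industry with the highest probability. The helper `most_likely_industry` and the sample values are assumptions; fields that were null may be missing from a record, so only `patent_id` is accessed directly.

```python
import json

# One made-up record in the shape of the rows shown earlier (values are strings).
line = ('{"patent_id": "US00000001", "date": "20020101", "city": "Boston", '
        '"311": "0.2", "325": "0.7", "336": "0.1"}')

METADATA = {'patent_id', 'date', 'city', 'state', 'country', 'assignee'}


def most_likely_industry(json_line):
    """Return (patent_id, industry code) for the highest probability in a record."""
    record = json.loads(json_line)
    # every non-metadata column holds an industry probability stored as a string
    probs = {k: float(v) for k, v in record.items() if k not in METADATA}
    return record['patent_id'], max(probs, key=probs.get)


print(most_likely_industry(line))  # ('US00000001', '325')
```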

<a id='A'></a>
## Appendix

<a id='A1'></a>
### Spark Issues
**Issue of ldamodel.get_document_topics with Spark (YARN mode)**  
Gensim imports theano. This fails because theano cannot write to /home on the executors.

*Solution Attempts*
- Set permissions on executors in runtime (`os.system('sudo chmod -R ugo+rw /home')`)
- Set environment variable either in bootstrap-actions (install-anaconda-emr): `export SPARK_YARN_USER_ENV="HOME=/tmp, THEANORC=/tmp"` or in runtime (`os.environ['HOME'] = '/tmp'`)

*Working Solution*
- None

<br>

**Issue of nltk with Spark (YARN mode)**  
Spark's executors do not know where the manually downloaded nltk packages are located.

*Solution Attempts*

Set the environment variable
- either in configure-spark.sh (`export SPARK_YARN_USER_ENV="NTLK_DATA=/usr/share/ntlk_data"`)
- or on the executors at runtime (`os.environ['NLTK_DATA'] = '/usr/share/nltk_data'`)

*Working Solution* (but much slower than using `pattern` instead of nltk)

Download the packages inside the function that needs them.
- new issue: permission denied for the default folder
- attempted solution: try to get read and write permissions for /usr via bash
```
    # try to get read and write permissions for /usr
    bashCommand = 'sudo chmod -R ugo+rw /usr'
    import subprocess
    process = subprocess.call(bashCommand.split())
    bashCommand = 'mkdir /usr/share/nltk_data'
    process = subprocess.call(bashCommand.split())
```

- working solution: install packages in /tmp

```
def get_nouns(text):
    if '/tmp' not in nltk.data.path:
        nltk.data.path.append('/tmp')
    if 'tokenizers' not in os.listdir('/tmp'):
        nltk.download('punkt', '/tmp')
        nltk.download('maxent_treebank_pos_tagger', '/tmp')
        nltk.download('averaged_perceptron_tagger', '/tmp')
    tokenized = word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokenized)
    nouns = [a for (a, b) in tagged if b == 'NN']
    return nouns
```

<a id='A2'></a>
### Industry-Topic Intersection

This notebook takes the output of the LDA analysis and maps it to the SIC definitions, which are scraped from the web (as indicated below). We chose not to pursue this approach further, because the NAICS Standard offers a much more finely grained classification system with richer definitions.

First, we will walk through the steps to save the relevant data from the LDA model for further processing. This code is to be appended to the relevant parsing notebook.

In [None]:
#set number of topics to 20 and the displayed words to 50. Output is formatted as a tuple (probability, word)
lda2model = lda2.show_topics(num_topics=20, num_words=50, log=False, formatted=False)

In [None]:
#save entire model as backup (separate file name so that the json dump below does not overwrite it)
lda2.save('lda2model_backup')

In [None]:
#save to json - our baseline format
import json

with open('lda2model.txt', 'w') as outfile:
    json.dump(lda2.show_topics(num_topics=20, num_words=50, log=False, formatted=False), outfile)

In [None]:
#read in the LDA model
lda2_data = open('lda2model.txt')
json_str = lda2_data.read()
lda2_topics = json.loads(json_str)

The code below maps topics to SIC codes.

In [None]:
%%time
#extract words (without probabilities) and zip in dict

lda2_words = []
lda2_prob = []
for topic in lda2_topics:
    words = []
    prob = []
    for word in topic:
        words.append(word[1])
        prob.append(word[0])
    lda2_words.append(words)
    lda2_prob.append(prob)

topic_dict = dict(zip(xrange(len(lda2_words)), lda2_words))

In [None]:
#read in the SIC Codes
SIC_Data = open('SIC_Codes_Dict.txt')
json_str = SIC_Data.read()
SIC_Dict = json.loads(json_str)

In [None]:
%%time
#finding the intersection
final_topic_dict = {}
for topic in topic_dict:
    counter = 0
    best_industry = 0  # not named `tag`, which would shadow pattern.en.tag imported above
    for industry in SIC_Dict:
        sect = set(topic_dict[topic]) & set(SIC_Dict[industry])
        if len(sect) > counter:
            counter = len(sect)
            best_industry = industry
    final_topic_dict[topic] = best_industry

In [None]:
#dict is of form: key = lda topic no., value = SIC Major Group (see example below)
final_topic_dict

![Image](Data\Images\2.4 Example Intersection.png?raw=true)