# Predicting Patent Industries by Text Analysis

by Constantin Knoll, Christopher Mosch, Rohan Thavarajah

## Table of Contents

[1. Overview](#1.)
* [1.1 Motivation](#1.1)
* [1.2 Objectives](#1.2)

[2. Prediction and Analysis](#2.)
* [2.1 Google Patent Data](#2.1)
* [2.2 Industry Definition Data](#2.2)
* [2.3 Text Processing and Industry Mapping](#2.3)
* [2.4 Accuracy via USPTO Baseline](#2.4)

[3. Results](#3.) 
* [3.1 Discussion](#3.1)
* [3.2 Opportunities for Refinement](#3.2)

[4. Conclusion](#4.)

<a id='1.'></a>
## 1. Overview

<a id='1.1'></a>
### 1.1 Motivation

<a id='1.2'></a>
### 1.2 Objectives

<a id='2.'></a>
## 2. Prediction and Analysis

The diagram below gives an overall summary of the workflow, which is split across 4 notebooks. Each notebook is described in more detail in the sections that follow.

![Image](Data\Images\Workflow.png?raw=true)

<a id='2.1'></a>
### 2.1 Google Patent Data

The purpose of this notebook is to automate downloading, unpacking, and uploading (to S3) the patent data available on Google. We scrape patent files, along with their supplements and images, from the URL found in the User Guide. The data is grouped by week and ranges from 2001 to 2015. Files from 2011 onward are compressed in .tar format, while earlier ones are compressed as .zip. The code handles both cases using the Python modules `zipfile` and `tarfile`.

The code in the Compact Version (the one we mainly used) downloads a single week of patent data, extracts it, uploads it and deletes it before repeating the process with the next week. This keeps the process workable with very little storage space, such as a small SSD on a local machine.

Each week of patent data contains several folders with different content. For the purposes of our project, we are only interested in the abstracts, which are found in the .xml files of the patent application body. We therefore discard all supplements (as defined by the user below) and all images, since they are not needed for the analysis. This is reflected in the main code, which searches the tree of the downloaded data for unwanted folders and deletes them.

Note that the "deleted" files are typically moved to the recycle bin of the local machine, which defeats the purpose of running the space-optimized code. We therefore advise using an automated recycle bin clean-up program, set to run every 10-15 minutes, which is the average time it takes to complete the cycle for one week of patent data.
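
As a rough illustration of the extract-and-prune steps described above, the sketch below unpacks an archive according to its extension and then removes unwanted folders. It is not the notebook's actual code: the function names and the folder names used in the usage note are hypothetical. Note also that `shutil.rmtree` deletes directly rather than going through a recycle bin, which may differ from the behavior of the original code.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive: Path, dest: Path) -> None:
    """Unpack a weekly archive, choosing the reader by file extension."""
    dest.mkdir(parents=True, exist_ok=True)
    if archive.suffix == ".zip":
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif archive.suffix == ".tar":
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    else:
        raise ValueError(f"Unsupported archive format: {archive}")


def prune_unwanted(root: Path, unwanted: set[str]) -> list[Path]:
    """Delete every folder under `root` whose name is in `unwanted`."""
    removed = []
    # The folder list is built before anything is deleted; sorting puts
    # parents ahead of their children, so nested matches are skipped once
    # the parent is gone.
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        if folder.name in unwanted and folder.exists():
            shutil.rmtree(folder)
            removed.append(folder)
    return removed
```

A call such as `prune_unwanted(Path("week_dir"), {"images", "supplements"})` would remove any folder with one of those (assumed) names and return the paths it deleted.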
Since we discard most of the data in the patent files, we also replace the convoluted folder structure that remains with a three-tiered one: 1) Root 2) Year 3) Week.

Uploading to S3 is handled by the boto3 library, which takes care of most of the work automatically.
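
To make the Root/Year/Week layout concrete, here is a minimal sketch of how object keys could be built and files uploaded. The helper names, the key format and the restriction to `.xml` files are assumptions for illustration. The `client` argument is expected to expose `upload_file(Filename, Bucket, Key)`, as the client returned by `boto3.client("s3")` does.

```python
from pathlib import Path


def s3_key(root: str, year: int, week: str, filename: str) -> str:
    """Build an object key in the Root/Year/Week layout (assumed format)."""
    return f"{root}/{year}/{week}/{filename}"


def upload_week(client, bucket: str, folder: Path,
                root: str, year: int, week: str) -> list[str]:
    """Upload every .xml file in `folder` and return the keys written.

    `client` should behave like boto3.client("s3"), i.e. provide
    upload_file(Filename, Bucket, Key).
    """
    keys = []
    for path in sorted(folder.glob("*.xml")):
        key = s3_key(root, year, week, path.name)
        client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

Passing the client in as a parameter keeps the sketch independent of credentials and makes it easy to try out with a stand-in object before pointing it at a real bucket.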

![Image](Data\Images\2.1 S3.png?raw=true)

<a id='2.2'></a>
### 2.2 Industry Definition Data

**Goal** - scrape the Census Bureau's webpage for NAICS definitions and output a list of dictionaries. Each dictionary will hold a high-level parent NAICS code and the nouns found in the definitions of all its lower-level children.

**NAICS overview** - The NAICS 2002 is a hierarchical classification of industry. Each 2-digit NAICS code comprises a set of 3-digit codes, each 3-digit code a set of 4-digit codes, and so on. The Census Bureau's website is laid out to reflect this: to find the definition of a 6-digit NAICS code, you must first select its parent and drill down to it. We therefore proceed in 4 steps:

- Step 1 - step through every tier of Census Bureau definitions and compile a list of URLs to terminal definition pages
- Step 2 - from each page, fetch the title, conceptual definition and items
- Step 3 - set the definition of each parent to the combined definitions of its descendants
- Step 4 - construct a noun dictionary for each 3-digit NAICS code
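
Steps 3 and 4 can be sketched as follows, assuming the terminal definitions have already been scraped into a mapping from NAICS code to text. The function names and the dictionary keys `"naics"` and `"nouns"` are hypothetical and are not taken from the notebook's actual output signature. The `tagger` argument is any callable that turns a list of tokens into `(token, tag)` pairs with Penn Treebank tags, which is what `nltk.pos_tag` returns.

```python
from collections import defaultdict


def roll_up_definitions(definitions: dict[str, str], digits: int = 3) -> dict[str, str]:
    """Step 3: give each parent code the combined text of its descendants."""
    parents = defaultdict(list)
    for code, text in sorted(definitions.items()):
        # A descendant belongs to the parent that shares its leading digits.
        parents[code[:digits]].append(text)
    return {parent: " ".join(texts) for parent, texts in parents.items()}


def noun_dictionaries(parent_text: dict[str, str], tagger) -> list[dict]:
    """Step 4: build one dictionary of nouns per parent code."""
    result = []
    for code, text in parent_text.items():
        tokens = text.lower().split()  # crude whitespace tokenization
        nouns = sorted({tok for tok, tag in tagger(tokens) if tag.startswith("NN")})
        result.append({"naics": code, "nouns": nouns})
    return result
```

With NLTK and its tagger data installed, `noun_dictionaries(roll_up_definitions(defs), nltk.pos_tag)` would be one way to run it. A real pipeline would likely want a proper tokenizer in place of `str.split`.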

**Output signature** - We wish to construct a list of dictionaries. Each dictionary reflects a 3-digit NAICS code and has the following signature:
![Image](Data\Images\2.2 naics_get_nouns_output_signature.png?raw=true)


<a id='2.3'></a>
### 2.3 Text Processing and Industry Mapping

<a id='2.4'></a>
### 2.4 Accuracy via USPTO Baseline

**Goal** - This notebook has three parts. In part 1 we compare our predicted industries to a "Silver Standard" to gauge our performance. 

Part 1 - Comparison to Silver Standard 
- Step 1.0 - Discussion - Why isn't it "Gold"?
- Step 1.1 - Construct Silver Standard
- Step 1.2 - Pull in Chronan predictions and merge with those of the USPTO
- Step 1.3 - Analyze Chronan performance
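
Steps 1.2 and 1.3 amount to joining two sets of labels on patent and measuring agreement. The sketch below shows that idea with plain dictionaries keyed by a patent identifier. The data layout and function names are assumptions, and the notebook itself may compute other metrics.

```python
def merge_predictions(silver: dict[str, str],
                      predicted: dict[str, str]) -> list[tuple[str, str, str]]:
    """Inner-join the two label sets on patent id."""
    shared = sorted(silver.keys() & predicted.keys())
    return [(pid, silver[pid], predicted[pid]) for pid in shared]


def accuracy(merged: list[tuple[str, str, str]]) -> float:
    """Share of merged patents where the prediction matches the baseline."""
    if not merged:
        return 0.0
    hits = sum(1 for _, baseline, guess in merged if baseline == guess)
    return hits / len(merged)
```

Only patents present in both sources are scored, so coverage (how many patents survive the join) is worth reporting alongside the accuracy figure.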

<a id='3.'></a>
## 3. Results

<a id='3.1'></a>
### 3.1 Discussion

The link to the Tableau public workbook is:
https://public.tableau.com/views/Chronan_public/top_patentees_by_ind?:embed=y&:display_count=yes&:showTabs=y

<a id='v1'></a>
**Top Patentees by Industry**
![Image](Data\Images\2.4 viz1.png?raw=true)
- The first tab in the Tableau workbook allows users to view the most active patentees in each industry. Filtering on "334 - Computer and Electronics", we see that the three largest innovators in this industry are Canon, IBM and Hitachi. On the whole, our prediction suggests that American and Japanese companies dominate the computer and electronics industry, which is what we would expect to see.

<a id='v2'></a>
**Company Profile**
![Image](Data\Images\2.4 viz2.png?raw=true)
- The second tab illustrates which industries a firm patents in. 
- Consider Hitachi Ltd, which has both chemical and metals arms: its profile peaks at metal manufacturing and chemical manufacturing.
- Among other things, the profiles offer a new way to compare how similar companies are.

<a id='v3'></a>
**Innovation Landscape**
![Image](Data\Images\2.4 viz3.png?raw=true)
- The last tab maps patent activity by industry across both location and time.
- Patent activity is most concentrated at innovation hubs (e.g. San Francisco, Austin and New York).

<a id='3.2'></a>
### 3.2 Opportunities for Refinement

**Refinement of Inputs**
- Apply the permid.org API to standardize assignee names and make filtering by assignee name less cumbersome.
- Search for more definitions for each NAICS code. We initially tried to use the intersection between industry definitions and topic keywords to construct the topic -> industry mapping. However, because industry definitions have relatively little content, the intersections were prohibitively small.
- We have focused on 2002-2003 because these are the years for which we have a USPTO baseline for comparison. We can readily scale up the years of interest to the past 20 years. The main impediment is that the schemas of the patent .xml input files change over time, so we need to alter the functions for scraping the .xml files in some years (we have already done this for 2005 but have not implemented it).


**Refinement of Processes**
- Incorporate the citations field. For instance, we could append to each abstract the abstracts of its first-order citations before running LDA. This would draw the topic of each patent towards the mean topic of its immediate citations.
- Because we map topics to industries, we do not pay as hefty a cost in interpretability if we ask LDA to generate many topics. Our results are based on running LDA with 40 topics, and we have also saved output for 60 and 80 topics. It would be interesting to see how performance evolves as a function of the number of topics.
- At the moment we assign each patent the industry with the highest weight. We should introduce cutoff thresholds for cases where the top weight is either too small or too similar to the second-highest weight.
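
The cutoff idea in the last point could be sketched as below. The function name and the default thresholds are placeholders chosen for illustration, not values we have tuned.

```python
from typing import Optional


def choose_industry(weights: dict[str, float],
                    min_weight: float = 0.2,
                    min_margin: float = 0.05) -> Optional[str]:
    """Return the top-weighted industry, or None if the choice is unclear."""
    if not weights:
        return None
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    best_code, best_weight = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_weight < min_weight:
        return None  # top weight is too small to trust
    if best_weight - runner_up < min_margin:
        return None  # too close to the second-highest weight
    return best_code
```

Patents for which this returns `None` could be reported as unclassified or reviewed separately, which would trade coverage for precision.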

<a id='4.'></a>
## 4. Conclusion