# Predicting Patent Industries by Text Analysis

by Constantin Knoll, Christopher Mosch, Rohan Thavarajah

## Table of Contents

[1. Overview](#1.)
* [1.1 Motivation](#1.1)
* [1.2 Objectives](#1.2)

[2. Prediction and Analysis](#2.)
* [2.1 Google Patent Data](#2.1)
* [2.2 Industry Definition Data](#2.2)
* [2.3 Text Processing and Industry Mapping](#2.3)
* [2.4 Accuracy via USPTO Baseline](#2.4)

[3. Results](#3.) 

[4. Conclusion](#4.)

<a id='1.'></a>
## 1. Overview

<a id='1.1'></a>
### 1.1 Motivation

<a id='1.2'></a>
### 1.2 Objectives

<a id='2.'></a>
## 2. Prediction and Analysis

The diagram below gives an overall summary of the workflow, which is split across four notebooks. Each stage is described in more detail in the sections that follow.

![Image](Data/Images/Workflow.png?raw=true)

<a id='2.1'></a>
### 2.1 Google Patent Data

This notebook automates downloading, unpacking, and uploading (to S3) the patent data available on Google. We scrape the patent files, including supplements and images, from the URL found in the User Guide. The data is grouped by week and covers 2001 to 2015. Files from 2011 onward are packaged as .tar archives, while earlier ones are .zip archives, so the code handles both formats with Python's `zipfile` and `tarfile` modules.
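
Handling the two archive formats can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `extract_archive` is hypothetical, and it assumes the format can be read off the file extension.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a weekly archive into dest_dir, choosing the reader by extension.

    Hypothetical helper: assumes archives end in .zip or .tar.
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    suffix = archive_path.suffix.lower()
    if suffix == ".zip":
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
    elif suffix == ".tar":
        with tarfile.open(archive_path) as archive:
            archive.extractall(dest_dir)
    else:
        raise ValueError(f"Unsupported archive format: {archive_path}")
    return dest_dir
```

Dispatching on the extension keeps the sketch simple. It would need adjusting if the file names did not reliably reflect the archive type.
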
The code in the Compact Version, which is the one we mainly used, downloads a single week of patent data, extracts it, uploads it, and deletes it before moving on to the next week. This keeps disk usage low enough to run on limited storage, such as the SSD of a local machine.
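
The one-week-at-a-time cycle might look roughly like the sketch below. The `download`, `extract`, and `upload` callables are hypothetical placeholders for the notebook's own steps, and a temporary directory is an assumption made here to guarantee clean-up. It is not necessarily how the Compact Version deletes its files.

```python
import tempfile
from pathlib import Path


def process_weeks(weeks, download, extract, upload):
    """Handle one week at a time so only one week's data is on disk at once.

    download(week, workdir) -> archive path
    extract(archive, dest)  -> directory with extracted files
    upload(week, extracted) -> None
    All three callables are hypothetical stand-ins.
    """
    processed = []
    for week in weeks:
        # The temporary directory and its contents are removed when the
        # block exits, even if one of the steps raises.
        with tempfile.TemporaryDirectory() as workdir:
            workdir = Path(workdir)
            archive = download(week, workdir)
            extracted = extract(archive, workdir / "extracted")
            upload(week, extracted)
        processed.append(week)
    return processed
```

Because each week gets its own working directory, peak disk usage is bounded by the size of the largest single week rather than by the full 2001 to 2015 range.
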
Each week of patent data contains several folders with different content. For this project we are only interested in the abstracts, which are found in the .xml files of the patent application body. We therefore remove all supplements (as specified by the user in the notebook) and all images, since neither is useful for the analysis. The main code does this by searching the directory tree of the downloaded data for unwanted folders and deleting them.
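
A minimal sketch of such a tree search is shown below, assuming unwanted folders can be identified by name. The function name `prune_unwanted_dirs` and any folder names passed to it are hypothetical, since the notebook lets the user define what counts as a supplement.

```python
import os
import shutil


def prune_unwanted_dirs(root, unwanted_names):
    """Delete every directory under root whose name is in unwanted_names.

    Returns the list of removed paths. Hypothetical helper for illustration.
    """
    unwanted = set(unwanted_names)
    removed = []
    for current, dirnames, _filenames in os.walk(root, topdown=True):
        # Iterate over a copy because dirnames is modified in the loop.
        for name in list(dirnames):
            if name in unwanted:
                target = os.path.join(current, name)
                shutil.rmtree(target)
                # Drop the name so os.walk does not try to descend into it.
                dirnames.remove(name)
                removed.append(target)
    return removed
```

Note that `shutil.rmtree` removes directories directly instead of moving them to the recycle bin, which may differ from the deletion method used in the notebook.
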
Note that the "deleted" files are typically moved to the recycle bin of the local machine, which defeats the purpose of running the space-optimized code. We therefore advise using an automated recycle bin clean-up program, set to run every 10-15 minutes, which is the average time needed to complete the cycle for one week of patent data.

Since we discard most of the data found in the patent files, the convoluted folder structure that remains is replaced by a three-tiered one: 1) Root 2) Year 3) Week.

Uploads to S3 are handled by the boto3 library, which automates most of the process.
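
The three-tiered layout and the upload step could be combined along the lines below. This is a sketch under assumptions: the helper names, the `year/week/filename` key format, and the use of the bucket as the root tier are all illustrative choices, not taken from the notebook. The only boto3 call relied on is the S3 client's `upload_file(Filename, Bucket, Key)`.

```python
from pathlib import Path


def s3_key_for(local_file, year, week):
    """Build a 'year/week/filename' key; the bucket acts as the root tier."""
    return f"{year}/{week}/{Path(local_file).name}"


def upload_week(s3_client, bucket, local_dir, year, week):
    """Upload every .xml file under local_dir, flattening nested folders.

    s3_client is expected to behave like boto3.client("s3").
    Hypothetical helper; files with the same name would overwrite each other.
    """
    keys = []
    for path in sorted(Path(local_dir).rglob("*.xml")):
        key = s3_key_for(path, year, week)
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Example usage (bucket name and week label are placeholders):
#   import boto3
#   client = boto3.client("s3")
#   upload_week(client, "my-patent-bucket", "extracted", 2011, "week01")
```

Passing the client in as an argument keeps the helper independent of credentials and makes it easy to try out with a stub object.
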

![Image](Data/Images/2.1%20S3.png?raw=true)

<a id='2.2'></a>
### 2.2 Industry Definition Data

**Goal** - scrape the Census Bureau's webpage for NAICS definitions and output a list of dictionaries. Each dictionary will hold a high-level parent NAICS code and the nouns found in the definitions of all its lower-level children.

**NAICS overview** - NAICS 2002 is a hierarchical classification of industries. Each 2-digit NAICS code comprises a set of 3-digit codes, each 3-digit code a set of 4-digit codes, and so on. The Census Bureau's website is laid out to reflect this hierarchy: to find the definition of a 6-digit code, you must first select its parent and drill down to it. We therefore proceed in four steps:

- Step 1 - step through every tier of the Census Bureau definitions and compile a list of URLs to the terminal definition pages
- Step 2 - from each page, fetch the title, conceptual definition, and items
- Step 3 - set the definition of each parent to the sum of the definitions of its descendants
- Step 4 - construct a noun dictionary for each 3-digit NAICS code
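
Step 3 can be illustrated with a small sketch that rolls definition text up to the 3-digit level by code prefix. The function name `roll_up_definitions` and the input shape (a dictionary mapping NAICS code strings to definition text) are assumptions made for this example, not the notebook's actual data structures.

```python
from collections import defaultdict


def roll_up_definitions(definitions, parent_digits=3):
    """Concatenate each code's definition text under its parent prefix.

    definitions maps NAICS code strings to definition text. Codes shorter
    than parent_digits are skipped; a parent's own text is included if
    it is present in the input. Hypothetical helper for illustration.
    """
    grouped = defaultdict(list)
    for code in sorted(definitions):
        if len(code) >= parent_digits:
            grouped[code[:parent_digits]].append(definitions[code])
    return {parent: " ".join(texts) for parent, texts in grouped.items()}


# Placeholder strings stand in for the scraped definition text.
sample = {
    "111110": "soybean text",
    "111120": "oilseed text",
    "112111": "cattle text",
}
rolled = roll_up_definitions(sample)
```

Here `rolled["111"]` combines the text of both 6-digit codes starting with 111. The nouns for Step 4 would then be extracted from each combined string.
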

**Output signature** - We wish to construct a list of dictionaries. Each dictionary will represent a 3-digit NAICS code and have the following signature:
![Image](Data/Images/2.2%20naics_get_nouns_output_signature.png?raw=true)


<a id='2.3'></a>
### 2.3 Text Processing and Industry Mapping

<a id='2.4'></a>
### 2.4 Accuracy via USPTO Baseline

<a id='3.'></a>
## 3. Results

<a id='4.'></a>
## 4. Conclusion