# landscraper - doing the dirty work for intellectual property (IP) decisions

__Contributor: Akhil Jindal__ | https://github.com/akhil-jindal/

---

## Table of Contents:

1.) [Introduction](#introduction)

2.) [Brief Background](#background)

3.) [Objectives](#objectives)

4.) [Techniques and Tools](#techniques)

5.) [Results](#results)

6.) [Discussion of Outcomes](#discussion)

7.) [Appendix](#appendix)

---

## Introduction <a name="introduction"></a>

It's difficult to make IP decisions for newly developed technologies, especially early on. Currently, IP decision makers include subject matter experts (SMEs), IP attorneys, and business strategists who work together to answer questions such as:

* What intellectual property strategy should we consider for our technology?
  * Patent
  * Trade secret
  * Other technology transfer mechanisms (e.g., publications, defensive public disclosures, etc.)

* How many and what kinds of 'key players' are practicing in this technology area?
  * Licensees?
  * Competitors?
  
* Can we forecast any roadblocks in securing protection and enforceability?

* Which aspects of our technology should we focus on for IP protection, and which require further development?

These questions are difficult to answer and typically require capital- and time-intensive resources. Furthermore, IP decision makers need a strong understanding of the technology and the corresponding state of the art, while avoiding analysis paralysis, emotional bias, and decision fatigue.

The goal of ___landscraper___ is to provide a starting point for understanding the 'patent landscape' for a given patent application.  For example:

* Identification of a published patent classification corresponding to the given patent application
* Identification of 'key players' (e.g., potential competitors and/or licensees) for the given target patent application

---
## Brief Background <a name="background"></a>

__What is a patent?__

* A patent is a form of intellectual property. A patent gives its owner the right to exclude others from making, using, selling, and importing an invention for a limited period of time (e.g., twenty years).
* The patent rights are granted in exchange for an enabling public disclosure of the invention (e.g., a patent application)

__What does 'patent landscape' mean?__

A patent landscape is a snapshot of the patent situation for a specific technology, which IP decision makers can use to inform policy decisions, strategic research planning, or technology transfer.

__What is a patent publication classification?__

It is a system that patent office examiners and others use to categorize (code) documents, such as published patent applications, according to the technical features of their content.

For this project I focus on a particular [Cooperative Patent Classification (CPC)](https://www.uspto.gov/web/patents/classification/cpc/html/cpc.html) class (G06, as discussed below).  Spending a couple of minutes navigating through this link can give you a sense of the magnitude and granularity of patent classifications across a broad range of technical domains.

__What a patent application looks like:__

The red boxes below represent portions of the patent application that ___landscraper___ relies upon to predict the CPC parent publication classification (depicted in the top-right green box) and 'key players'.  In this context, 'key players' refers to businesses/organizations with the most "assigned" patents (depicted in bottom-left green box) within the identified CPC parent publication classification.

![alt text](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/patent_app_I.png)

<center> ........ </center> 

![alt text](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/patent_app_II.png)

<center> ........ </center> 

![alt text](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/patent_app_III.png)

---

## Objectives <a name="objectives"></a>

__Major Tasks:__

Listed beneath each major task are the simplifications and assumptions I made for the purposes of this project.

1.) [Prepare a corpus for training](#prep)
- Limited classifications to computer technology subject matter areas.  [In the U.S., it's one of the top technology fields seeking IP protection](https://www.wipo.int/edocs/infogdocs/en/ipfactsandfigures2018/).
- CPC Class G06 is a broad class covering computing, calculating, and counting
- [G06 sub-classifications used for training](https://www.uspto.gov/web/patents/classification/cpc/html/cpc-G.html#G06):

| CPC Class |                                      Definition                                      |
|:---------:|:------------------------------------------------------------------------------------:|
|    G06C   |        Digital computer in which all the computation is effected mechanically        |
|    G06D   |                       Digital fluid-pressure computing devices                       |
|    G06E   |                               Optical computing devices                              |
|    G06F   |                           Electric digital data processing                           |
|    G06G   |                                  Analogue computers                                  |
|    G06J   |                             Hybrid computing arrangements                            |
|    G06K   | Recognition of data, presentation of data, record carriers, handling record carriers |
|    G06M   |                Counting mechanisms, counting of objects                              |
|    G06N   |                Computer systems based on specific computational models               |
|    G06Q   |               Data processing systems or methods, specifically adapted               |
|    G06T   |                              Image processing                                        |

- Used U.S. published patent applications available from https://patents.google.com

2.) [Train classifiers and evaluate their performance](#train)

- Grid search was limited to several features/parameters

- Downsampled training data to 100 patent applications in each classification to create a balanced dataset

- Training and testing relied on selected portions of a patent publication (see: [Brief Background](#background)) and did not consider other related documentation (e.g., prosecution history, litigation outcomes, etc.).

3.) [Apply trained classifiers](#apply)
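
The downsampling assumption under task 2 can be illustrated with a minimal sketch. The function name `downsample_per_class`, the `(label, text)` record shape, and the toy corpus are hypothetical and are not taken from the project notebooks. The sketch also assumes that a class with fewer than 100 documents is kept whole.

```python
import random
from collections import defaultdict


def downsample_per_class(documents, per_class=100, seed=0):
    """Return a subset with at most `per_class` documents per label.

    `documents` is a list of (label, text) pairs; this shape is an
    assumption made for the sketch.
    """
    by_label = defaultdict(list)
    for label, text in documents:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        # Assumption: classes smaller than `per_class` are kept in full.
        k = min(per_class, len(texts))
        for text in rng.sample(texts, k):
            balanced.append((label, text))
    return balanced


# Made-up corpus with deliberately unequal class sizes.
toy_corpus = (
    [("G06F", f"doc {i}") for i in range(250)]
    + [("G06K", f"doc {i}") for i in range(120)]
    + [("G06M", f"doc {i}") for i in range(14)]
)
balanced = downsample_per_class(toy_corpus, per_class=100)
```

Sampling without replacement keeps every retained document unique, and the fixed seed makes the balanced set repeatable between runs.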


---

## Techniques and Tools <a name="techniques"></a>

__1.) Preparing a corpus:__<a name="prep"></a>

___landscraper___ specific files:

[./notebooks/scraper.ipynb](https://github.com/akhil-jindal/landscraper/blob/master/notebooks/scraper.ipynb)
* Fetch patent applications from https://patents.google.com
* Extract pertinent text from patent applications
* Save text content in respective classification folder

Imported libraries:

  * [BeautifulSoup4](https://pypi.org/project/beautifulsoup4/) - pulling data out of HTML and XML files
  * [requests](https://pypi.org/project/requests/2.7.0/) - send HTTP requests
  * glob - finds all the pathnames matching a specified pattern according to the rules used by the Unix shell
  * re - provides regular expression matching operations
  * csv - used to import and export spreadsheets and databases
  * os - provides a portable way of using operating system dependent functionality
  * [pandas](https://pandas.pydata.org/) - provides high-performance, easy-to-use data structures and data analysis tools 
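
Using the libraries listed above, the fetch, extract, and save steps might look roughly like the sketch below. It is not the code from `scraper.ipynb`. The helper names, the sample HTML, and the `itemprop` section names are assumptions made for illustration, and the real page markup may differ.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical markup: the section names are assumptions for this sketch.
SAMPLE_HTML = """
<html><body>
  <section itemprop="abstract"><p>A method for counting objects.</p></section>
  <section itemprop="claims"><p>1. A device comprising a counter.</p></section>
  <nav>Site navigation that should be ignored</nav>
</body></html>
"""


def fetch_html(url, timeout=10):
    """Download a page; requests is imported lazily so parsing works offline."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_sections(html, section_names=("abstract", "claims")):
    """Return the text of the selected sections, one section per line."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for name in section_names:
        node = soup.find("section", attrs={"itemprop": name})
        if node is not None:
            parts.append(node.get_text(separator=" ", strip=True))
    return "\n".join(parts)


def save_text(text, cpc_class, doc_id, root="corpus"):
    """Write the text to <root>/<cpc_class>/<doc_id>.txt and return the path."""
    folder = Path(root) / cpc_class
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{doc_id}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

For example, `save_text(extract_sections(SAMPLE_HTML), "G06M", "example")` would write the extracted text to `corpus/G06M/example.txt`, giving one folder per classification.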


__2.) Training classifiers and evaluating performance:__<a name="train"></a>

___landscraper___ specific files:

[./notebooks/pipeline.ipynb](https://github.com/akhil-jindal/landscraper/blob/master/notebooks/pipeline.ipynb)
  * Convert corpus into matrix of token counts
  * Transform a count matrix to a normalized term-frequency times inverse document-frequency (tf-idf) representation
  * Train a model using the transformed representation of the corpus
  * Test and evaluate performance on a test set
  * Perform a grid-search for parameter tuning
  * Visualize sklearn results
  * Export trained classifiers for future use

Imported libraries:

* [sklearn](https://scikit-learn.org/stable/) - used to build pipeline, perform grid search and evaluate results
* [nltk](https://www.nltk.org/) - added stop words, stemming and lemmatization
* [numpy](https://www.numpy.org/) - handled array objects
* [matplotlib](https://matplotlib.org/) and [scikitplot](https://scikit-plot.readthedocs.io/en/stable/) - visualize results
* glob - finds all the pathnames matching a specified pattern according to the rules used by the Unix shell
* re - provides regular expression matching operations
* os - provides a portable way of using operating system dependent functionality
* pickle - exported trained classification models
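
The pipeline steps above (token counts, tf-idf, training, grid search, export) can be sketched with scikit-learn as follows. This is a minimal illustration on a made-up toy corpus, not the contents of `pipeline.ipynb`. The step names (`counts`, `tfidf`, `clf`) and the parameter grid values are assumptions.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus; real input would be the scraped patent text.
texts = [
    "processor executes instructions stored in memory",
    "memory controller schedules processor instructions",
    "operating system allocates memory to processor threads",
    "processor cache stores instructions and memory pages",
    "image pixels are filtered to enhance edges",
    "edges in the image are detected from pixel gradients",
    "pixel intensities of the image are normalized",
    "image segmentation groups pixels along edges",
]
labels = ["G06F"] * 4 + ["G06T"] * 4

pipeline = Pipeline([
    ("counts", CountVectorizer()),           # corpus -> matrix of token counts
    ("tfidf", TfidfTransformer()),           # counts -> normalized tf-idf
    ("clf", SGDClassifier(random_state=0)),  # linear model trained with SGD
])

# Assumed, deliberately small grid; the real search space may differ.
param_grid = {
    "counts__stop_words": [None, "english"],
    "clf__alpha": [1e-4, 1e-3],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

# Export the best pipeline and load it back, as one would for later reuse.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
predictions = restored.predict(["processor reads instructions from memory"])
```

Keeping the vectorizer, tf-idf transform, and classifier in a single `Pipeline` means the grid search tunes all three together, and the pickled object can classify raw text without any separate preprocessing step.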

__3.) Application of classifiers:__<a name="apply"></a>

___landscraper___ specific files:

[./notebooks/predict_G06.ipynb](https://github.com/akhil-jindal/landscraper/blob/master/notebooks/predict_G06.ipynb)

* Applied three exported models to a large sample of patent applications that were not included in the training or testing of the models
* Provided a list of top 3 'key players' for a selected patent application as well as their predicted CPC classification.
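
A minimal sketch of the 'key players' step, assuming each application has already been given a predicted CPC class and has a known assignee. The function name `top_key_players`, the record shape, and the assignee names are hypothetical.

```python
from collections import Counter


def top_key_players(records, cpc_class, n=3):
    """Return the `n` assignees with the most applications in `cpc_class`.

    `records` is a list of (predicted_cpc_class, assignee) pairs; this
    shape is an assumption made for the sketch.
    """
    counts = Counter(
        assignee for predicted, assignee in records if predicted == cpc_class
    )
    return counts.most_common(n)


# Made-up records with fictional assignees.
records = (
    [("G06T", "Company A")] * 4
    + [("G06T", "Company B")] * 3
    + [("G06T", "Company C")] * 2
    + [("G06T", "Company D")] * 1
    + [("G06F", "Company D")] * 5
)
key_players = top_key_players(records, "G06T")
```

Because the count is restricted to one class, an assignee that is prolific elsewhere (Company D in G06F here) does not crowd out the players that matter for the target application's class.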

---

## Results <a name="results"></a>

More detailed metrics can be found in the [Appendix](#appendix)

The following data set was used in the 'expanded test'.   

Note that this 'expanded test' uses data that was not used to train or test the models during model generation.

| CPC Class | Number of Patent Applications |
|:---------:|:-----------------------------:|
|    G06C   |              367              |
|    G06D   |               0               |
|    G06E   |               61              |
|    G06F   |              1717             |
|    G06G   |              100              |
|    G06J   |               88              |
|    G06K   |              880              |
|    G06M   |               14              |
|    G06N   |              182              |
|    G06T   |              895              |
|    G06Q   |              963              |


### Results for Classification Prediction:
    
| SGDClassifier (Model 1) | SGDClassifier w/ Stop Words (Model 2) | SGDClassifier w/ Parameter Tuning (Model 3) |
|:-----------:|:-------------:|:-------------:|
|![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model1_final.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model2_final.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model3_final.png)|

### An Example of 'Key Player' Prediction:

A random sample of 3 applications was selected from the expanded testing set.  Model 1 performs the worst in this sample test, with only 1 out of 3 classifications/'key players' identified correctly.

However, we see the results improve for Model 2 and Model 3, where 2 out of 3 classifications and 'key players' were identified correctly.


__SGDClassifier (Model 1)__
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/kp_1.png)


__SGDClassifier w/ Stop Words (Model 2)__
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/kp_2.png) 

__SGDClassifier w/ Parameter Tuning (Model 3)__

![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/kp_2.png)

---

## Discussion of Outcomes <a name="discussion"></a>

Although the initial results of ___landscraper___ look promising, it is important to establish a metric to compare the quality of predictions across the current and future models.  __Precision__ and __recall__ are two important evaluation metrics: precision is the percentage of returned results that are relevant, and recall is the percentage of all relevant results that the model correctly classifies.

![](https://github.com/akhil-jindal/landscraper/blob/master/data/images/precision.svg)
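
As a worked example of these two metrics, the sketch below computes per-class precision and recall from true and predicted labels in plain Python. The labels are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch. scikit-learn's `precision_score` and `recall_score` provide equivalent functionality.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = ["G06F", "G06F", "G06F", "G06F", "G06T", "G06T", "G06K"]
y_pred = ["G06F", "G06F", "G06T", "G06K", "G06T", "G06T", "G06K"]

g06f_precision, g06f_recall = precision_recall(y_true, y_pred, "G06F")
```

In this toy data every G06F prediction is correct, so precision for G06F is 1.0, yet only two of the four true G06F documents are found, so recall is 0.5. High precision on its own can therefore hide missed documents.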

To improve the results of ___landscraper___, I believe it would be beneficial to perform another grid-search that prioritizes __recall__ rather than precision.  The rationale is that precision can be biased if the model does not find any 


---

## Appendix <a name="appendix"></a>


__Model 1 - A starting point with SGDClassifier:__


Cross Validation Confusion Matrix |  Prediction Confusion Matrix
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model1_cv_visual.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model1_predict_visual.png)

Cross Validation Report |  Prediction Report
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model1_cv_report.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model1_predict_report.png)

__Model 2 - SGDClassifier w/ stop-words (both 'english' and patent-specific):__

Cross Validation Confusion Matrix |  Prediction Confusion Matrix
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model2_cv_visual.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model2_predict_visual.png)

Cross Validation Report |  Prediction Report
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model2_cv_report.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model2_predict_report.png)

__Model 3 - SGDClassifier w/ best performing parameters from grid-search:__

Cross Validation Confusion Matrix |  Prediction Confusion Matrix
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model3_cv_visual.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model3_predict_visual.png)

Cross Validation Report |  Prediction Report
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model3_cv_report.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/data/images/model3_predict_report.png)

---