# landscraper - doing the dirty work for intellectual property (IP) decisions

__Contributor: Akhil Jindal__ | https://github.com/akhil-jindal/

---

## Table of Contents:

1.) [Introduction](#introduction)

2.) [Brief Background](#background)

3.) [Objectives](#objectives)

4.) [Techniques and Tools](#techniques)

5.) [Results and Conclusions](#results)

6.) [Appendix](#appendix)

---

## Introduction <a name="introduction"></a>

It's difficult to make IP decisions for newly developed technologies, especially early on. Currently, IP decision makers include subject matter experts (SMEs), IP attorneys, and business strategists who work together to answer questions such as:

* What intellectual property strategy should we consider for our technology?
  * Patent
  * Trade secret
  * Other technology transfer mechanisms (e.g., publications, defensive public disclosures, etc.)

* How many and what kinds of 'key players' are practicing in this technology area?
  * Licensors?
  * Competitors?
  
* Can we forecast any roadblocks in securing protection and enforceability?

* Which aspects of our technology should we focus on for IP protection, and which require further development?

These questions are difficult to answer and typically require capital- and time-intensive resources. Furthermore, IP decision makers need a strong understanding of the technology and the corresponding state of the art, and they must avoid analysis paralysis, emotional bias, and decision fatigue.

---
## Brief Background <a name="background"></a>

__What is a patent?__

* A patent is a form of intellectual property. A patent gives its owner the right to exclude others from making, using, selling, and importing an invention for a limited period of time (e.g., twenty years).
* The patent rights are granted in exchange for an enabling public disclosure of the invention (e.g., a patent application)

__What a patent application looks like:__

In red boxes below are the portions of the patent application that ___landscraper___ relies upon, namely:
* Title
* Assignee - who has the right to the patent
* Publication classification - granular classification of the patented invention
* Abstract - 150 word (or less) description
* Detailed description - an exhaustive description, encompassing all proposed embodiments, variations and applications
* Claims - the legal definition of the extent (i.e., the scope) of the protection conferred by a patent, or of the protection sought in a patent application

![alt text](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/patent_app_I.png)

<center> ........ </center> 

![alt text](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/patent_app_II.png)

<center> ........ </center> 

![alt text](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/patent_app_III.png)


__What does 'patentability' mean?:__

US patent laws usually require that, for an invention to be patentable, it must be:

* Patentable subject matter (i.e., a kind of subject-matter eligible for patent protection)
* Novel (i.e. at least some aspect of it must be new)
* Non-obvious
* Useful

__How many and what kinds of classifications are there for 'patentable subject matter'?:__

Here is a list of [Cooperative Patent Classifications (CPC)](https://www.uspto.gov/web/patents/classification/cpc/html/cpc.html).  I encourage you to navigate through this list to see just how granular the classifications are and to get a sense of the breadth of technologies that have been patented.


---
## Objectives <a name="objectives"></a>

__Goal:__

A user can input a sample patent application (i.e., a target patent application) and ___landscraper___ will provide a starting point for understanding the 'patent landscape' with information such as:
* A patent classification which corresponds to the target patent application
* Identification of 'key players' for your target patent application

__Assumptions:__

* Limited classifications to computer technology subject matter areas.  [In the U.S., it's one of the top technology fields seeking IP protection](https://www.wipo.int/edocs/infogdocs/en/ipfactsandfigures2018/).
  * CPC class G06 is a broad class covering computing, calculating, and counting
  * [G06 sub-classifications used for training](https://www.uspto.gov/web/patents/classification/cpc/html/cpc-G.html#G06):


| CPC Class |                                      Definition                                      |
|:---------:|:------------------------------------------------------------------------------------:|
|    G06C   |        Digital computers in which all the computation is effected mechanically       |
|    G06D   |                       Digital fluid-pressure computing devices                       |
|    G06E   |                               Optical computing devices                              |
|    G06F   |                           Electric digital data processing                           |
|    G06G   |                                  Analogue computers                                  |
|    G06J   |                             Hybrid computing arrangements                            |
|    G06K   | Recognition of data, presentation of data, record carriers, handling record carriers |
|    G06M   |                Counting mechanisms, counting of objects                              |
|    G06N   |                Computer systems based on specific computational models               |
|    G06Q   |               Data processing systems or methods, specifically adapted               |
|    G06T   |                              Image processing                                        |

* Used U.S. published patent applications available from https://patents.google.com

* Predictions rely on selected portions of a patent publication (see: [Brief Background](#background)), not on the entire document or on related documentation (e.g., prosecution history, litigation outcomes, etc.).

* Downsampled the training data to 100 patent applications per classification to create a balanced dataset
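
The downsampling step might look like the following. This is a minimal sketch using only the standard library; the `downsample` helper, the `(cpc_class, text)` record layout, and the fixed seed are illustrative assumptions, not details taken from the notebooks.

```python
import random
from collections import defaultdict


def downsample(records, per_class=100, seed=0):
    """Keep at most `per_class` randomly chosen records for each CPC class."""
    by_class = defaultdict(list)
    for cpc_class, text in records:
        by_class[cpc_class].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for cpc_class in sorted(by_class):
        texts = by_class[cpc_class]
        keep = rng.sample(texts, min(per_class, len(texts)))
        balanced.extend((cpc_class, text) for text in keep)
    return balanced


# Toy input (hypothetical): an uneven class distribution.
records = [("G06F", f"doc {i}") for i in range(250)]
records += [("G06N", f"doc {i}") for i in range(120)]

balanced = downsample(records, per_class=100)
```
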
---

## Techniques and Tools <a name="techniques"></a>

### 1.) Preparing a corpus:

___landscraper___ __specific files:__

[./notebooks/scraper.ipynb](https://github.com/akhil-jindal/landscraper/blob/master/notebooks/scraper.ipynb)
* Fetch patent applications from https://patents.google.com
* Extract pertinent text from patent applications
* Save the extracted text

__Imported libraries:__

  * [BeautifulSoup4](https://pypi.org/project/beautifulsoup4/) - pulls data out of HTML and XML files
  * [requests](https://pypi.org/project/requests/2.7.0/) - sends HTTP requests
  * glob - finds all the pathnames matching a specified pattern according to the rules used by the Unix shell
  * re - provides regular expression matching operations
  * csv - used to import and export spreadsheets and databases
  * os - provides a portable way of using operating system dependent functionality
  * [pandas](https://pandas.pydata.org/) - provides high-performance, easy-to-use data structures and data analysis tools 
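
The fetch, extract, and save steps above can be sketched roughly as follows. This is not the notebook's code: the URL pattern, the `itemprop` selectors, the function names, and the CSV layout are all assumptions made for illustration, and the real page markup should be checked before relying on them. The demonstration at the bottom parses a hand-written page, so it runs without network access.

```python
import csv

import requests
from bs4 import BeautifulSoup

FIELDS = ["title", "abstract", "claims"]


def fetch_html(publication_number):
    """Download one patent page (needs network access).

    ASSUMPTION: pages live at /patent/<publication number>/en.
    """
    url = f"https://patents.google.com/patent/{publication_number}/en"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def extract_fields(html):
    """Pull the title, abstract, and claims text out of a patent page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(tag):
        return tag.get_text(" ", strip=True) if tag else ""

    return {
        "title": text_of(soup.find("title")),
        # ASSUMPTION: these sections carry itemprop attributes.
        "abstract": text_of(soup.find(attrs={"itemprop": "abstract"})),
        "claims": text_of(soup.find(attrs={"itemprop": "claims"})),
    }


def save_rows(rows, path):
    """Write the extracted fields to a CSV file, one application per row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Offline demonstration on a hand-written (hypothetical) page.
sample_html = """
<html><head><title>Widget sorter</title></head>
<body>
  <section itemprop="abstract">A device that sorts widgets.</section>
  <section itemprop="claims">1. A widget sorter comprising a tray.</section>
</body></html>
"""
fields = extract_fields(sample_html)
```

Keeping the parsing in its own function means it can be exercised on saved HTML without hitting the site again.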


### 2.) Training a classifier and evaluating performance

___landscraper___ __specific files:__

[./notebooks/pipeline.ipynb](https://github.com/akhil-jindal/landscraper/blob/master/notebooks/pipeline.ipynb)
  * Convert corpus into matrix of token counts
  * Transform a count matrix to a normalized term-frequency times inverse document-frequency (tf-idf) representation
  * Train a model using the transformed representation of the corpus
  * Test and evaluate performance on a test set
  * Perform a grid-search for parameter tuning
  * Visualize sklearn results
  * Export trained classifiers for future use
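
For reference, the tf-idf weighting mentioned above, as computed by scikit-learn's `TfidfTransformer` with its default settings (whether the notebook overrides those defaults is not stated here), is:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

where $n$ is the number of documents in the corpus, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$. Each document's vector is then scaled to unit L2 norm, which is the "normalized" part of the representation.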

__Imported libraries:__

* [sklearn](https://scikit-learn.org/stable/) - used to build pipeline, perform grid search and evaluate results
* [nltk](https://www.nltk.org/) - added stop words, stemming and lemmatization
* [numpy](https://www.numpy.org/) - handled array objects
* [matplotlib](https://matplotlib.org/) and [scikitplot](https://scikit-plot.readthedocs.io/en/stable/) - visualize results
* glob - finds all the pathnames matching a specified pattern according to the rules used by the Unix shell
* re - provides regular expression matching operations
* os - provides a portable way of using operating system dependent functionality
* pickle - exported trained classification models
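
The workflow in this section can be sketched end to end as below. This is a minimal illustration, not the contents of `pipeline.ipynb`: the toy corpus, the step names, and the parameter grid values are assumptions chosen so the example runs quickly.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical) standing in for the scraped patent text.
g06n = [
    "training a neural network with gradient descent",
    "machine learning model for probabilistic inference",
    "neural network layers and learned weights",
    "reinforcement learning agent with a reward model",
    "genetic algorithm for model optimization",
    "knowledge base and inference engine",
    "learning weights of a neural model by backpropagation",
    "fuzzy logic inference using a learned model",
]
g06t = [
    "image segmentation of pixel regions",
    "rendering a three dimensional image from polygons",
    "image enhancement by pixel filtering",
    "texture mapping for rendered image surfaces",
    "motion analysis across image frames",
    "pixel interpolation for image scaling",
    "rendering shadows in a generated image",
    "image registration using pixel features",
]
texts = g06n + g06t
labels = ["G06N"] * len(g06n) + ["G06T"] * len(g06t)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Token counts -> tf-idf weights -> linear classifier trained with SGD.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Illustrative grid only; the notebook's actual grid is not shown here.
param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-4, 1e-3],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out test set.
predictions = search.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

# Export: round-trip through pickle (use pickle.dump to write to a file).
restored = pickle.loads(pickle.dumps(search.best_estimator_))
```

Pickling the whole fitted pipeline, rather than the classifier alone, keeps the vocabulary and idf weights together with the model, so raw text can be passed straight to `restored.predict`.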

### 3.) Applying the best performing models

___landscraper___ __specific files:__

[./notebooks/expanded_test.ipynb](https://github.com/akhil-jindal/landscraper/blob/master/notebooks/expanded_test.ipynb)

* Applied three exported models to a large sample of patent applications that were not included in the training or testing of the models
* Provided a list of top 3 'key players' for a selected patent application as well as their predicted CPC classification.
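
One way to produce such a list is sketched below. It rests on an assumption not confirmed by this README: that 'key players' are the assignees with the most applications predicted into the same CPC class as the selected application. The `top_key_players` helper, the record layout, and the company names are hypothetical.

```python
from collections import Counter


def top_key_players(applications, target_class, n=3):
    """Return the n most frequent assignees among applications predicted
    to fall in `target_class`, as (assignee, count) pairs."""
    counts = Counter(
        app["assignee"]
        for app in applications
        if app["predicted_class"] == target_class
    )
    return counts.most_common(n)


# Toy predictions (hypothetical assignees and classes).
per_assignee = {"Acme Corp": 4, "Globex": 3, "Initech": 2, "Hooli": 1}
applications = [
    {"assignee": name, "predicted_class": "G06N"}
    for name, how_many in per_assignee.items()
    for _ in range(how_many)
]
applications += [{"assignee": "Umbrella", "predicted_class": "G06T"}] * 5

key_players = top_key_players(applications, "G06N")
```
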

---

## Results and Conclusions <a name="results"></a>

### Training SGDClassifier and evaluating performance:

More detailed metrics can be found in the [Appendix](#appendix).

__Model 1 - A starting point with SGDClassifier:__

Cross Validation Confusion Matrix |  Prediction Confusion Matrix
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/model1_cv_visual.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/model1_predict_visual.png)

__Model 2 - SGDClassifier w/ stop-words (both 'english' and patent-specific):__

Cross Validation Confusion Matrix |  Prediction Confusion Matrix
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/model2_cv_visual.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/model2_predict_visual.png)
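
A combined stop-word list of the kind Model 2 uses could be assembled along these lines. Two caveats: this sketch swaps in scikit-learn's built-in English list rather than the nltk list the project imports, and the patent-specific terms are hypothetical examples, since the project's actual list is not reproduced in this README.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# HYPOTHETICAL patent boilerplate terms, chosen for illustration only.
PATENT_STOP_WORDS = {
    "said", "wherein", "thereof", "embodiment", "embodiments",
    "claim", "claims", "comprising", "invention",
}

# CountVectorizer accepts a custom stop-word list.
stop_words = sorted(ENGLISH_STOP_WORDS.union(PATENT_STOP_WORDS))
vectorizer = CountVectorizer(stop_words=stop_words)

vectorizer.fit([
    "The invention of claim 1, wherein said processor comprising a cache stores data."
])
vocabulary = sorted(vectorizer.vocabulary_)
```

Removing boilerplate that appears in nearly every application leaves the vocabulary dominated by terms that differ between classes.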

__Model 3 - SGDClassifier w/ best performing parameters from grid-search:__

Cross Validation Confusion Matrix |  Prediction Confusion Matrix
:--------------------------------:|:------------------------------------:
![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/model3_cv_visual.png) | ![](https://raw.githubusercontent.com/akhil-jindal/landscraper/master/images/model3_predict_visual.png)

### Application of the Models to another data set:

The expanded test was applied to the following data set:

| CPC Class | Number of Patent Applications |
|:---------:|:-----------------------------:|
|    G06C   |              367              |
|    G06D   |               0               |
|    G06E   |               61              |
|    G06F   |              1717             |
|    G06G   |              100              |
|    G06J   |               88              |
|    G06K   |              880              |
|    G06M   |               14              |
|    G06N   |              182              |
|    G06T   |              895              |
|    G06Q   |              963              |



---

## Appendix <a name="appendix"></a>

---