Script to harvest U.S. Patents

This project harvests patent metadata and files from the United States Patent and Trademark Office (USPTO), using the USPTO Open Data Portal. The program collects patents assigned to Virginia Tech for inclusion in the "Virginia Tech Patent" collection in VTechWorks, Virginia Tech's DSpace institutional repository. The metadata fields are crosswalked to the fields used in VTechWorks. The program can be modified to search for other assignees and/or harvest other fields. After the metadata and files are harvested, a script performs OCR on the PDFs and adds the recognized text to each PDF.

List of fields in API,

Additional detail at (restricted to VTUL members)

Overview of project:


This software is licensed under the GNU General Public License v2.

pdfsandwich is licensed under the GNU General Public License v2.


This is intended to work on Mac OS X. It may work on other platforms if the dependencies are installed via the native package manager. All of the following steps are run from the Mac Terminal.

For CSV creation and PDF harvesting:

Install wget, which is used to download this project and pdfsandwich:

brew install wget

cd to the directory that will contain the cloned repo, then download this project and unzip it:

unzip javax.json*

cd to the patent-harvest directory

javac -cp javax.json-ri-1.0/lib/javax.json-1.0.jar

To install the OCR software:

brew install poppler
brew install imagemagick
brew install libpng
brew link libpng
brew install tesseract
brew install unpaper
brew install gawk
brew install ocaml
brew link ocaml
chmod u+x pdfsandwich
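
Before running the OCR step, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not part of the project: the function name `missing_tools` is hypothetical, and the binary names (`pdfinfo` from poppler, `convert` from imagemagick) are assumptions about what those Homebrew packages install.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on the PATH.
# (missing_tools is a hypothetical helper, not part of this project.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Binary names are assumptions: pdfinfo is expected from poppler,
# convert from imagemagick; the others share their package's name.
missing=$(missing_tools pdfinfo convert tesseract unpaper gawk ocaml)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All OCR dependencies found."
fi
```

If anything is reported as missing, rerunning the corresponding `brew install` or `brew link` command is the likely fix.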

Run: CSV creation and PDF harvest

cd to the patent-harvest directory

java -cp "javax.json-ri-1.0/lib/javax.json-1.0.jar:." Patents

Run: Add OCR text to all PDFs


Note: This can take a long time (over a minute for a single file), so you may want to run it overnight with:

caffeinate -i ./
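
For readers adapting the OCR step, a batch loop over the harvested PDFs might look roughly like the sketch below. This is an illustration under stated assumptions, not the project's own script: the function name `ocr_all` and the `OCR_CMD` override are hypothetical, and the loop relies on pdfsandwich's default behaviour of writing its result next to the input as `<name>_ocr.pdf`.

```shell
#!/bin/sh
# Run an OCR command on every PDF in a directory, skipping files that are
# already OCR output or already have an OCR counterpart.
# ocr_all and OCR_CMD are hypothetical names used for this sketch only.
ocr_all() {
  dir=${1:-.}
  for pdf in "$dir"/*.pdf; do
    # If the directory has no PDFs, the glob stays literal; skip it.
    [ -e "$pdf" ] || continue
    # Skip files that are themselves OCR output.
    case "$pdf" in *_ocr.pdf) continue ;; esac
    # Skip files already processed in an earlier run.
    [ -e "${pdf%.pdf}_ocr.pdf" ] && continue
    # Assumes the pdfsandwich executable sits in the current directory.
    "${OCR_CMD:-./pdfsandwich}" "$pdf" || echo "OCR failed: $pdf" >&2
  done
}

# Example call (the directory name is a placeholder):
# ocr_all pdfs
```

Because finished files are skipped, an interrupted overnight run can be restarted without repeating work.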