GitHub - VTUL/patent-harvest: Script to harvest U.S. Patents

This project harvests patent metadata and files from the United States Patent and Trademark Office (USPTO), using the USPTO Open Data Portal. The program collects patents assigned to Virginia Tech for inclusion in the VTechWorks "Virginia Tech Patent" collection in Virginia Tech's DSpace institutional repository. The metadata fields are crosswalked to the fields used in VTechWorks. The program can be modified to search for other assignees and/or harvest other fields. After the metadata and files are harvested, a script performs OCR on the PDFs and adds the recognized text to each PDF.
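To illustrate what "crosswalked" means here, the following is a minimal sketch of a field-name mapping. It is not the mapping used in Patents.java: the function names `crosswalk_field` and `crosswalk_header` and the specific source/target pairs are assumptions chosen for illustration only.

```shell
#!/bin/sh
# Hypothetical crosswalk: source field name -> repository field name.
# The pairs below are illustrative assumptions, not the mapping in Patents.java.
crosswalk_field() {
  case "$1" in
    patent_title)    echo "dc.title" ;;
    patent_abstract) echo "dc.description.abstract" ;;
    patent_date)     echo "dc.date.issued" ;;
    patent_number)   echo "dc.identifier.other" ;;
    *)               return 1 ;;   # no mapping known for this field
  esac
}

# Translate a comma-separated header row into repository field names,
# keeping the original name for any field without a mapping.
crosswalk_header() {
  out=""
  old_ifs=$IFS
  IFS=,
  for name in $1; do
    mapped=$(crosswalk_field "$name") || mapped=$name
    out="${out:+$out,}$mapped"
  done
  IFS=$old_ifs
  echo "$out"
}

crosswalk_header "patent_number,patent_title,patent_date"
```

Harvesting other fields would then amount to adding a pair to the mapping, which is the kind of change the paragraph above has in mind.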

List of fields in the API: http://www.patentsview.org/api/patent.html

Additional detail at https://git.it.vt.edu/digital-research-services/VTechWorks_Documentation/wikis/Virginia_Tech_Patents (restricted to VTUL members)

Overview of project: https://blogs.lt.vt.edu/openvt/2017/06/02/introducing-the-virginia-tech-patents-collection-in-vtechworks-and-the-patent-harvesting-software-repository-patent-harvest/

License

This software is licensed under the GNU General Public License v2.

pdfsandwich is licensed under the GNU General Public License v2.

Installation

This is intended to work on Mac OS X. It may work on other platforms if the dependencies are installed via the native package manager. All of the following steps are run from the Mac Terminal.

For CSV creation and PDF harvesting:

Install wget to download this project and pdfsandwich:

brew install wget

cd to the directory that will contain the cloned repo, then download this project and unzip it:

wget https://java.net/projects/jsonp/downloads/download/ri/javax.json-ri-1.0.zip
unzip javax.json*

cd to the patent-harvest directory and compile:

javac -cp javax.json-ri-1.0/lib/javax.json-1.0.jar Patents.java

To install the OCR software:

brew install poppler
brew install imagemagick
brew install libpng
brew link libpng
brew install tesseract
brew install unpaper
brew install gawk
brew install ocaml
brew link ocaml
chmod u+x pdfsandwich
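After installing, it can help to confirm that the tools are actually on the PATH. The sketch below is not part of this repository. The helper name `missing_tools` is hypothetical, and the command names are the ones these Homebrew packages are assumed to provide (`pdftoppm` from poppler, `convert` from imagemagick). libpng is a library with no command to check, so it is left out.

```shell
#!/bin/sh
# Print each named command that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Command names assumed for the Homebrew packages listed above.
missing=$(missing_tools pdftoppm convert tesseract unpaper gawk ocaml)
if [ -n "$missing" ]; then
  echo "Not found on PATH:"
  echo "$missing"
else
  echo "All OCR tools found."
fi
```

If a tool is reported as missing, re-running the matching `brew install` or `brew link` step is the first thing to try.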

Run: CSV creation and PDF harvest

cd to the patent-harvest directory

java -cp "javax.json-ri-1.0/lib/javax.json-1.0.jar:." Patents

Run: Add OCR text to all PDFs

./text-info-pdf.sh

Note: This can take a long time, over a minute for a single file, so you might want to run it overnight with:

caffeinate -i ./text-info-pdf.sh
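To judge whether an overnight run is needed, the per-file figure above can be turned into a rough lower bound. This is a sketch only: the helper names `count_pdfs` and `estimate_runtime` are hypothetical, the one-minute-per-file rate is the minimum quoted in the note, and the directory holding the harvested PDFs is assumed to be passed as an argument (defaulting to the current directory).

```shell
#!/bin/sh
# Count the PDF files directly inside a directory (default: current directory).
count_pdfs() {
  dir=${1:-.}
  n=0
  for f in "$dir"/*.pdf; do
    # The glob stays unexpanded when nothing matches, so test for existence.
    [ -e "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# Rough lower bound in hours and minutes, assuming one minute per PDF.
estimate_runtime() {
  minutes=$(count_pdfs "$1")
  echo "$minutes PDFs: at least $((minutes / 60))h $((minutes % 60))m"
}

estimate_runtime .
```

Since each file takes over a minute, the real run time will be longer than the printed figure.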

Files in this repository:

Patents.java
README.md
javax.json-ri-1.0.zip
pdfsandwich
pdfsandwich-src.tar.bz2
text-info-pdf.sh
