Building a comprehensive dataset of patent citations
What will you find in patCit?
Patents are at the crossroads of many innovation nodes: science, open knowledge, products, competition, etc. At patCit, we are building a comprehensive dataset of patent citations to help the community explore this terra incognita. patCit offers:
- worldwide coverage
- front-page and in-text citations
- all sorts of documents, not just scientific articles
How do we do it? We use recent progress in Natural Language Processing (NLP) to extract and structure citations into actionable pieces of information.
Front-page
patCit builds on DOCDB, the largest database of Non Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, where possible, we enrich the data using external, domain-specific, high-quality databases.
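To make the first two steps more concrete, here is a minimal sketch of deduplicating raw NPL citation strings and sorting them into categories. It is a simplified, standard-library stand-in and not the patCit pipeline itself: the normalization rule, the category names and the keyword cues below are all assumptions made for illustration, and the actual project relies on spaCy models for information extraction.

```python
import re


def normalize(citation: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", citation.lower())
    return " ".join(text.split())


def deduplicate(citations):
    """Keep the first occurrence of each normalized citation, in input order."""
    seen = {}
    for citation in citations:
        seen.setdefault(normalize(citation), citation)
    return list(seen.values())


# Hypothetical category names and keyword cues (not the 10 patCit categories).
CATEGORY_RULES = [
    ("DATABASE", re.compile(r"\b(database|accession)\b", re.IGNORECASE)),
    ("WEBPAGE", re.compile(r"https?://|www\.", re.IGNORECASE)),
    ("BIBLIOGRAPHICAL_REFERENCE", re.compile(r"\b(vol|pp|journal|proc)\b", re.IGNORECASE)),
]


def categorize(citation: str) -> str:
    """Return the label of the first rule that matches, or 'OTHER'."""
    for label, pattern in CATEGORY_RULES:
        if pattern.search(citation):
            return label
    return "OTHER"


if __name__ == "__main__":
    raw = [
        "Smith J., Journal of Things, vol. 12, pp. 1-10, 2001.",
        "SMITH J, Journal of Things, vol 12, pp 1-10, 2001",
        "Widget spec, http://example.com/spec",
    ]
    for citation in deduplicate(raw):
        print(categorize(citation), "|", citation)
```

Rules are checked in order, so a citation matching several cues gets the first label in the list. A real classifier would need far richer signals than these keywords.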
In-text
patCit builds on the Google Patents corpus of USPTO full-text patents. First, we extract patent and bibliographical reference citations. Then, we parse the detected in-text citations into a series of category-dependent attributes using grobid. Patent citations are matched with a standard publication number using the Google Patents matching API, and bibliographical references are matched with a DOI using biblio-glutton. Finally, where possible, we enrich the data using external, domain-specific, high-quality databases.
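As an illustration of what detecting in-text patent citations involves, the sketch below pulls US patent numbers out of a passage of full text and strips the thousands separators. It is only a regular-expression stand-in under stated assumptions: patCit itself uses grobid for extraction and the Google Patents matching API for matching, and the `US<digits>` output format here (with no kind code) is a simplification chosen for this example.

```python
import re

# Matches forms such as "U.S. Pat. No. 5,123,456" or "US Patent 7654321".
# The pattern is an illustrative assumption, not the one used by grobid.
US_PATENT = re.compile(
    r"U\.?\s?S\.?\s+Pat(?:ent)?\.?\s+(?:Nos?\.?\s*)?"
    r"(\d{1,2}(?:,\d{3}){2}|\d{7,8})",
    re.IGNORECASE,
)


def extract_us_patents(text: str) -> list:
    """Return 'US<digits>' identifiers in order of appearance in the text."""
    return ["US" + match.group(1).replace(",", "") for match in US_PATENT.finditer(text)]


if __name__ == "__main__":
    passage = (
        "The method is described in U.S. Pat. No. 5,123,456 and "
        "US Patent 7654321, see also Smith (2001)."
    )
    print(extract_us_patents(passage))
```

A regular expression like this misses many citation styles (foreign offices, applications, ranges of numbers), which is one reason a trained parser is a better fit for the real task.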
Category | Citation extraction | Information extraction | Enrichment | BigQuery table | Colab notebook |
---|---|---|---|---|---|
Bibliographical reference | | | | | |
Patents | | | | | |
FAIR
If you are new to Google BigQuery (GBQ) and want to learn the basics, you can take the GBQ Quickstart. It should not take more than 2 minutes and might help a lot!
Contributing
There are many ways to contribute to patCit, and many of them do not involve coding.
Give feedback - We want to make patCit truly useful to the community, so we are very happy to receive feedback.
Share your thoughts - We believe that discussions are much more valuable when they are shared publicly. This way, everyone can benefit from them. Hence, we strongly encourage you to share your issues and requests in the issue section of the patCit GitHub repository.
Feel like coding today? - We will be more than happy to receive contributions from you and the community. We have already started to tag some issues for contributors.
Team
This project was initiated by Gaétan de Rassenfosse (EPFL) and Cyril Verluise (Collège de France) in 2019.
Since then, it has benefited from the contributions of Gabriele Cristelli (EPFL), Francesco Gerotto (Sciences Po), Kyle Higham (Hitotsubashi University) and Lucas Violon (HEC Paris).
We are also thankful to Domenico Golzio for constant support and to @leflix311, @kermitt2, Tim Simcoe (Boston University), @SuperMayo and @wetherbeei for helpful comments.
Contribution details are available in CRediT.