Skip to content

Commit

Permalink
update README
Browse files Browse the repository at this point in the history
  • Loading branch information
jaanisoe committed Nov 28, 2019
1 parent e55d916 commit 448bdbc
Show file tree
Hide file tree
Showing 2 changed files with 20 additions and 1 deletion.
14 changes: 13 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,24 @@

A Java command-line tool and library to download and store publications with metadata by combining content from various online resources (Europe PMC, PubMed, PubMed Central, Unpaywall, journal web pages), plus extract content from general web pages.

## Fetch publications

The main resource is [Europe PMC](https://europepmc.org/), but in case it cannot provide parts of the required content, then other repositories can be consulted. As last resort, there is support to scrape journal articles directly from publisher web sites -- around 50 site scraping rules are built in, mainly for journals in the biomedical and life sciences fields. To not overburden the used APIs and sites, PubFetcher is best used for medium-scale processing of publications, where the number of entries is in the thousands and not in the millions, but where the largest amount of completeness for these few thousand publications is desired.

In addition to the main content of publications (title, abstract, full text), PubFetcher supports different keywords: the user-assigned keywords of the article, MeSH terms from PubMed and GO/EFO terms as mined by Europe PMC. Some extra metadata is saved, like journal title, publication date, etc, however the list of authors is currently missing. Content from higher quality resources is prioritised and good enough publication parts are not re-fetched. There is support for JavaScript while scraping, and content can be extracted from PDF files. Downloaded publications can be persisted to disk to a key-value store for later analysis, or exported to JSON.

## Fetch web pages

In addition to publications, PubFetcher can scrape general web pages. This functionality is geared towards web pages containing software tools descriptions and documentation (GitHub, BioConductor, etc), as PubFetcher has built-in rules (around 25) to extract from these pages and it has fields to store the software license and programming language. If no rules are defined for a given web page, then an automatic extraction of the main content of the page is attempted.

PubFetcher is used in [EDAMmap](https://github.com/edamontology/edammap) and [Pub2Tools](https://github.com/bio-tools/pub2tools).
## Use as a library

In addition to being a command-line tool, PubFetcher can be incorporated as a library into other projects. The library is used in [EDAMmap](https://github.com/edamontology/edammap) and [Pub2Tools](https://github.com/bio-tools/pub2tools).

## Install

Installation instructions can be found in [INSTALL.md](INSTALL.md).

## Documentation

Documentation for PubFetcher can be found at https://pubfetcher.readthedocs.io/.
7 changes: 7 additions & 0 deletions docs/intro.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,13 @@ Outline
* :ref:`future` contains ideas how to improve PubFetcher


*******
Install
*******

Installation instructions can be found in the project's GitHub repo at `INSTALL <https://github.com/edamontology/pubfetcher/blob/master/INSTALL.md>`_.


**********
Quickstart
**********
Expand Down

0 comments on commit 448bdbc

Please sign in to comment.