
Commit

documentation fixes
jaanisoe committed Nov 27, 2019
1 parent 4e1d3af commit e55d916
Showing 5 changed files with 20 additions and 5 deletions.
11 changes: 9 additions & 2 deletions INSTALL.md
@@ -1,10 +1,17 @@
# INSTALL

[Apache Maven](https://maven.apache.org/) is required.
[git](https://git-scm.com/), [JDK 8](https://openjdk.java.net/projects/jdk8/) (or later) and [Apache Maven](https://maven.apache.org/) are required.

On the command line, go to the directory where PubFetcher should be installed and execute:

```shell
$ git clone https://github.com/edamontology/pubfetcher.git
$ cd pubfetcher/
$ mvn clean install
$ java -jar target/pubfetcher-cli-0.2-SNAPSHOT.jar -h
```

PubFetcher can now be run with:

```shell
$ java -jar /path/to/pubfetcher/target/pubfetcher-cli-0.2-SNAPSHOT.jar -h
```
8 changes: 8 additions & 0 deletions README.md
@@ -2,4 +2,12 @@

A Java command-line tool and library to download and store publications with metadata by combining content from various online resources (Europe PMC, PubMed, PubMed Central, Unpaywall, journal web pages), and to extract content from general web pages.

The main resource is [Europe PMC](https://europepmc.org/), but if it cannot provide parts of the required content, other repositories are consulted. As a last resort, there is support for scraping journal articles directly from publisher web sites: around 50 site scraping rules are built in, mainly for journals in the biomedical and life sciences fields. To avoid overburdening the APIs and sites that are used, PubFetcher is best used for medium-scale processing of publications, where the number of entries is in the thousands rather than the millions, but where the greatest possible completeness for those few thousand publications is desired.

In addition to the main content of publications (title, abstract, full text), PubFetcher supports different keywords: the user-assigned keywords of the article, MeSH terms from PubMed and GO/EFO terms as mined by Europe PMC. Some extra metadata is saved, like the journal title, publication date, etc.; however, the list of authors is currently missing. Content from higher-quality resources is prioritised, and publication parts that are already good enough are not re-fetched. There is support for JavaScript while scraping, and content can be extracted from PDF files. Downloaded publications can be persisted to disk in a key-value store for later analysis, or exported to JSON.

In addition to publications, PubFetcher can scrape general web pages. This functionality is geared towards web pages containing descriptions and documentation of software tools (GitHub, BioConductor, etc.), as PubFetcher has built-in rules (around 25) to extract content from these pages and fields to store the software license and programming language. If no rules are defined for a given web page, then an automatic extraction of the main content of the page is attempted.
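As an illustration of the library side of this, here is a minimal sketch of fetching a single web page and printing some of the stored fields. Only the class names `Fetcher` and `Webpage` and the methods `initWebpage` and `getWebpage` are taken from the API documentation; the package paths, the `FetcherArgs` configuration object, the constructor and method parameters and the getter names are assumptions and may not match the actual API.

```java
// A hypothetical sketch, not the definitive API: only Fetcher, Webpage,
// initWebpage and getWebpage are named in the documentation; the rest
// (packages, FetcherArgs, parameters, getters) is assumed.
import org.edamontology.pubfetcher.core.common.FetcherArgs;
import org.edamontology.pubfetcher.core.db.webpage.Webpage;
import org.edamontology.pubfetcher.core.fetching.Fetcher;

public class WebpageSketch {
    public static void main(String[] args) throws Exception {
        FetcherArgs fetcherArgs = new FetcherArgs(); // assumed configuration holder (timeouts, user agent, ...)
        Fetcher fetcher = new Fetcher(fetcherArgs);  // constructor argument assumed

        // Construct the Webpage, then fetch it; built-in or custom scraping
        // rules are applied, or automatic extraction if no rules match
        Webpage webpage = fetcher.initWebpage("https://github.com/edamontology/pubfetcher");
        fetcher.getWebpage(webpage); // parameters assumed

        // Hypothetical getters for fields mentioned in the description above
        System.out.println(webpage.getTitle());
        System.out.println(webpage.getLicense());
        System.out.println(webpage.getLanguage());
    }
}
```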

PubFetcher is used in [EDAMmap](https://github.com/edamontology/edammap) and [Pub2Tools](https://github.com/bio-tools/pub2tools).

Documentation for PubFetcher can be found at https://pubfetcher.readthedocs.io/.
2 changes: 1 addition & 1 deletion docs/api.rst
@@ -45,7 +45,7 @@ Fetcher contains the public method "getDoc", which is described in :ref:`Getting

The Fetcher methods "initPublication" and "initWebpage" must be used to construct a Publication and a Webpage. Then, the methods "getPublication" and "getWebpage" can be used to fetch the Publication and Webpage. However, instead of these "init" and "get" methods, the "getPublication", "getWebpage" and "getDoc" methods of the class `PubFetcher <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/common/PubFetcher.java>`_ should be used when possible.

Because executing JavaScript is prone to serious bugs in the `HtmlUnit <http://htmlunit.sourceforge.net/>`_ library that is used, fetching an HTML document with JavaScript support turned on is done in a separate `JavaScriptThread <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/JavaScriptThread.java>`_, which can be killed if it gets stuck.
Because executing JavaScript is prone to serious bugs in the `HtmlUnit <http://htmlunit.sourceforge.net/>`_ library that is used, fetching an HTML document with JavaScript support turned on is done in a separate `JavaScriptThread <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/JavascriptThread.java>`_, which can be killed if it gets stuck.

The `HtmlMeta class <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/HtmlMeta.java>`_ is explained in :ref:`Meta <meta>` and the `Links class <https://github.com/edamontology/pubfetcher/blob/master/core/src/main/java/org/edamontology/pubfetcher/core/fetching/Links.java>`_ in :ref:`Links <links>`.
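To make the "init" and "get" sequence above concrete, here is a minimal sketch of fetching one publication as a library user. The class ``Fetcher`` and the methods ``initPublication`` and ``getPublication`` are the ones named above; everything else (package paths, the ``FetcherArgs`` configuration object, constructor and method parameters, the getters and the placeholder DOI) is an assumption and may not match the real signatures.

.. code-block:: java

    // A hypothetical sketch: only Fetcher, initPublication and getPublication
    // are named in the text above; packages, parameters and getters are assumed.
    import org.edamontology.pubfetcher.core.common.FetcherArgs;
    import org.edamontology.pubfetcher.core.db.publication.Publication;
    import org.edamontology.pubfetcher.core.fetching.Fetcher;

    public class PublicationSketch {
        public static void main(String[] args) throws Exception {
            FetcherArgs fetcherArgs = new FetcherArgs(); // assumed configuration holder
            Fetcher fetcher = new Fetcher(fetcherArgs);  // constructor argument assumed

            // "init" constructs the Publication from an ID (a placeholder DOI here),
            // "get" then fetches its content; exact parameters are assumed
            Publication publication = fetcher.initPublication("10.1234/example-doi");
            fetcher.getPublication(publication);

            // Hypothetical getters for the fetched publication parts
            System.out.println(publication.getTitle());
            System.out.println(publication.getAbstract());
        }
    }

As noted above, where possible the "getPublication", "getWebpage" and "getDoc" methods of the PubFetcher class should be preferred over calling "init" and "get" directly; their parameters are not shown in this sketch.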

2 changes: 1 addition & 1 deletion docs/fetcher.rst
@@ -324,7 +324,7 @@ _`efo`
_`go`
.. _fetcher_go:

`Gene Ontology <http://www.geneontology.org/>`_ terms of the publication. Text-mined by the `Europe PMC <https://europepmc.org/>`_ project from the full text of the article. The :ref:`go structure <go>`.
`Gene Ontology <http://geneontology.org/>`_ terms of the publication. Text-mined by the `Europe PMC <https://europepmc.org/>`_ project from the full text of the article. The :ref:`go structure <go>`.
_`theAbstract`
.. _fetcher_theAbstract:

2 changes: 1 addition & 1 deletion docs/intro.rst
@@ -16,7 +16,7 @@ Ideally, all scientific literature would be open and easily accessible through o

The speed of downloading, when :ref:`multithreading <multithreaded>` is enabled, is roughly one publication per second. This limitation, along with the desire not to overburden the APIs and publisher sites that are used, means that PubFetcher is best used for medium-scale processing of publications, where the number of entries is in the thousands rather than the millions, but where the greatest possible completeness for those few thousand publications is desired. If millions of publications are required, then it is better to restrict oneself to the Open Access subset, which can be downloaded in bulk: https://europepmc.org/downloads.

In addition to the main content of a publication (:ref:`title <fetcher_title>`, :ref:`abstract <fetcher_theabstract>` and :ref:`full text <fetcher_fulltext>`), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords <fetcher_keywords>`, the :ref:`MeSH terms <fetcher_mesh>` as assigned in PubMed, and :ref:`EFO terms <fetcher_efo>` and :ref:`GO terms <fetcher_go>` as mined from the full text by Europe PMC. Each publication has up to three identifiers: a :ref:`PMID <fetcher_pmid>`, a :ref:`PMCID <fetcher_pmcid>` and a :ref:`DOI <fetcher_doi>`. In addition, different metadata (found in the different :ref:`resources <resources>`) about a publication is saved, like whether the article is :ref:`Open Access <oa>`, the :ref:`journal <journaltitle>` where it was published, the :ref:`publication date <pubdate>`, etc. The :ref:`source <publication_types>` of each :ref:`publication part <publication_parts>` is remembered, with content from a higher-confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts <publication_parts>` (thus avoiding querying some :ref:`resources <resources>`) and there is :ref:`an algorithm <can_fetch>` to determine whether an already existing entry should be refetched or whether it is complete enough. Fetching and :ref:`extracting <selecting>` of content is done using various Java libraries with support for :ref:`JavaScript <getting_a_html_document>` and :ref:`PDF <getting_a_pdf_document>` files. The downloaded publications can be persisted to disk in a :ref:`key-value store <database>` for later analysis. A number of :ref:`built-in rules <rules_in_yaml>` are included (along with :ref:`tests <testing_of_rules>`) for :ref:`scraping <scraping>` publication parts from publisher sites, but additional rules can also be defined. Currently, there is support for around 50 journal publishers and 25 repositories of tools and tools' metadata and documentation, and around 750 test cases for the rules have been defined.
In addition to the main content of a publication (:ref:`title <fetcher_title>`, :ref:`abstract <fetcher_theabstract>` and :ref:`full text <fetcher_fulltext>`), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords <fetcher_keywords>`, the :ref:`MeSH terms <fetcher_mesh>` as assigned in PubMed, and :ref:`EFO terms <fetcher_efo>` and :ref:`GO terms <fetcher_go>` as mined from the full text by Europe PMC. Each publication has up to three identifiers: a :ref:`PMID <fetcher_pmid>`, a :ref:`PMCID <fetcher_pmcid>` and a :ref:`DOI <fetcher_doi>`. In addition, different metadata (found in the different :ref:`resources <resources>`) about a publication is saved, like whether the article is :ref:`Open Access <oa>`, the :ref:`journal <journaltitle>` where it was published, the :ref:`publication date <pubdate>`, etc. The :ref:`source <publication_types>` of each :ref:`publication part <publication_parts>` is remembered, with content from a higher-confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts <publication_parts>` (thus avoiding querying some :ref:`resources <resources>`) and there is :ref:`an algorithm <can_fetch>` to determine whether an already existing entry should be refetched or whether it is complete enough. Fetching and :ref:`extracting <selecting>` of content is done using various Java libraries with support for :ref:`JavaScript <getting_a_html_document>` and :ref:`PDF <getting_a_pdf_document>` files. The downloaded publications can be persisted to disk in a :ref:`key-value store <database>` for later analysis. A number of :ref:`built-in rules <rules_in_yaml>` are included (along with :ref:`tests <testing_of_rules>`) for :ref:`scraping <scraping>` publication parts from publisher sites, but additional rules can also be defined. Currently, there is support for around 50 journal publishers and 25 repositories of tools and tools' metadata and documentation, and around 750 test cases for the rules have been defined. If no rules are defined for a given site, then :ref:`automatic cleaning <cleaning>` is applied to get the main content of the page.

PubFetcher has an extensive :ref:`command-line tool <cli>` that exposes all of its functionality. It contains a few :ref:`helper operations <simple_one_off_operations>`, but its main use is the construction of a simple :ref:`pipeline <pipeline>` for querying, fetching and outputting publications and general and documentation web pages: first, IDs of interest are specified/loaded and filtered; then, the corresponding content is fetched/loaded and filtered; and finally, the results can be output or stored in a database. Among other functionality, content and all the metadata can be output in :ref:`HTML or plain text <html_and_plain_text_output>`, but also :ref:`exported <export_to_json>` to :ref:`JSON <json_output>`. All fetching operations can be influenced by a few :ref:`general parameters <general_parameters>`. Progress, along with error messages, is logged to the console and to a :ref:`log file <log_file>`, if specified. The command-line tool can be :ref:`extended <cli_extended>`, for example to add new ways of loading IDs.

