Commit

Add docs
jaanisoe committed Nov 15, 2019
1 parent 1a30a77 commit 4e1d3af
Showing 20 changed files with 2,494 additions and 23 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@ target/
.classpath
.project
.settings/
_build/
14 changes: 1 addition & 13 deletions README.md
@@ -2,16 +2,4 @@

A Java command-line tool and library to download and store publications with metadata by combining content from various online resources (Europe PMC, PubMed, PubMed Central, Unpaywall, journal web pages), and to extract content from general web pages.

PubFetcher used to be part of [EDAMmap](https://github.com/edamontology/edammap) until its functionality was determined to be potentially useful on its own, so PubFetcher is now an independently usable application. However, its features and structure are still influenced by EDAMmap: for example, the supported publication resources are mainly from the biomedical and life sciences fields, and getting the list of authors of a publication is currently not supported (as it's not needed in EDAMmap). Also, the functionality of extracting content from general web pages is geared towards web pages containing software tool descriptions and documentation (GitHub, Bioconductor, etc.), as PubFetcher has built-in rules to extract from these pages and fields to store the software license and programming language.

Ideally, all scientific literature would be open and easily accessible through one interface for text mining and other purposes. One interface for getting publications is [Europe PMC](https://europepmc.org/), which PubFetcher uses as its main resource. In the middle of 2018, Europe PMC was able to provide almost all of the titles, around 95% of abstracts, 50% of full texts and only 10% of user-assigned keywords for the publications present in the [bio.tools](https://bio.tools/) registry at that time. While some articles don't have keywords and some full texts can't be obtained, many of the gaps can be filled by other resources. And sometimes we need the maximum amount of content about each publication for better results, hence the need for PubFetcher, which extracts and combines data from these different resources.

The speed of downloading, when multithreading is enabled, is roughly one publication per second. This limitation, along with the desire not to overburden the APIs and publisher sites used, means that PubFetcher is best suited for medium-scale processing of publications, where the number of entries is in the thousands rather than the millions, but where the greatest possible completeness for those few thousand publications is desired. If millions of publications are required, it is better to restrict oneself to the Open Access subset, which can be downloaded in bulk: https://europepmc.org/downloads.

In addition to the main content of a publication (title, abstract and full text), PubFetcher supports getting different keywords about the publication: the user-assigned keywords, the MeSH terms assigned in PubMed, and the EFO and GO terms mined from the full text by Europe PMC. Each publication has up to three identifiers: a PMID, a PMCID and a DOI. In addition, various metadata about a publication found in the different resources is saved, such as whether the article is Open Access, the journal where it was published, the publication date, etc. The source of each publication part is remembered, with content from a higher confidence resource potentially overwriting the current content. It is possible to fetch only some publication parts (thus avoiding querying some resources), and there is an algorithm to determine whether an already existing entry should be refetched or is already complete enough. Fetching and extraction of content are done using various Java libraries with support for JavaScript and PDF files. The downloaded publications can be persisted to disk in a key-value store for later analysis. A number of built-in rules are included (along with tests) for scraping publication parts from publisher sites, but additional rules can also be defined. Currently, there is support for around 50 journal publishers and 25 repositories of tools and tool metadata/documentation, and around 750 test cases for the rules have been defined.
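The confidence-based overwriting of publication parts described above can be sketched as follows. This is a minimal illustration only, not PubFetcher's actual API: the `PartValue` class, the `Resource` enum and its ordering are all invented for this example.

```java
// Minimal sketch of confidence-based overwriting of a publication part.
// The class name, enum and confidence ordering are invented for
// illustration; the real PubFetcher classes differ.
import java.util.Objects;

public class PartValue {

    // Resources ordered from lowest to highest confidence (illustrative order)
    public enum Resource { UNKNOWN, PUBLISHER, PUBMED, EUROPEPMC }

    private String content = "";
    private Resource source = Resource.UNKNOWN;

    // Overwrite only if the new content comes from a resource with
    // equal or higher confidence than the current source
    public void set(String newContent, Resource newSource) {
        if (newSource.ordinal() >= source.ordinal()) {
            content = Objects.requireNonNull(newContent);
            source = newSource;
        }
    }

    public String getContent() { return content; }
    public Resource getSource() { return source; }

    public static void main(String[] args) {
        PartValue abstractPart = new PartValue();
        abstractPart.set("From publisher", Resource.PUBLISHER);
        abstractPart.set("From Europe PMC", Resource.EUROPEPMC);
        abstractPart.set("From PubMed", Resource.PUBMED); // lower confidence, ignored
        System.out.println(abstractPart.getContent()); // prints "From Europe PMC"
    }
}
```

The point of the sketch is only that each part remembers its source, so a later fetch from a lower-confidence resource does not clobber better content.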

PubFetcher has an extensive command-line tool to use all of its functionality. A simple pipeline can be constructed in the tool for querying, fetching and outputting publications and general and documentation web pages: first, IDs of interest are specified/loaded and filtered; then, the corresponding content is fetched/loaded and filtered; and last, the results can be output or stored to a database. Among other formats, content and all the metadata can be output as JSON. Progress, along with error messages, is logged to the console and to a log file, if specified. The command-line tool can be extended, for example to add new ways of loading IDs.

Installation instructions are provided in [INSTALL](INSTALL.md).

Documentation can be found in the wiki: [PubFetcher documentation](https://github.com/edamontology/pubfetcher/wiki). [Section 1](https://github.com/edamontology/pubfetcher/wiki/cli) documents all parameters of the command-line interface, accompanied by many examples. [Section 2](https://github.com/edamontology/pubfetcher/wiki/output) describes the different outputs: the database, the log file and the JSON output, through which the structure of publications, webpages and docs is also explained. [Section 3](https://github.com/edamontology/pubfetcher/wiki/fetcher) deals with fetching logic, describing for example the content fetching methods and the resources and filling logic of publication parts. [Section 4](https://github.com/edamontology/pubfetcher/wiki/scraping) is about scraping rules and how to define and test them. [Section 5](https://github.com/edamontology/pubfetcher/wiki/api) gives a short overview of the source code for those wanting to use the PubFetcher library. [Section 6](https://github.com/edamontology/pubfetcher/wiki/future) contains ideas on how to improve PubFetcher.
Documentation for PubFetcher can be found at https://pubfetcher.readthedocs.io/.
3 changes: 1 addition & 2 deletions cli/pom.xml
@@ -31,8 +31,7 @@
<packaging>jar</packaging>

<name>PubFetcher-CLI</name>
<url>https://github.com/edamontology/pubfetcher/tree/master/cli</url>
<description></description>
<url>https://github.com/edamontology/pubfetcher</url>

<dependencies>
<dependency>
2 changes: 1 addition & 1 deletion cli/src/main/java/org/edamontology/pubfetcher/cli/Cli.java
@@ -41,7 +41,7 @@ public static void main(String[] argv) throws IOException, ReflectiveOperationEx
// otherwise invalid.log will be created if arg --log is null
logger = LogManager.getLogger();
logger.debug(String.join(" ", argv));
logger.info("This is {} {}", version.getName(), version.getVersion());
logger.info("This is {} {} ({})", version.getName(), version.getVersion(), version.getUrl());

try {
PubFetcherMethods.run(args.pubFetcherArgs, new Fetcher(args.fetcherArgs.getPrivateArgs()), args.fetcherArgs, null, null, null, version, argv);
@@ -64,7 +64,7 @@ public class PubFetcherArgs {
@Parameter(names = { "-fetch-document-javascript" }, description = "Fetch a web page (with JavaScript support, i.e. using HtmlUnit) and output its raw HTML to stdout")
String fetchDocumentJavascript = null;

@Parameter(names = { "-post-document" }, variableArity = true, description = "TODO")
@Parameter(names = { "-post-document" }, variableArity = true, description = "Fetch a web resource using HTTP POST. The first parameter specifies the resource URL and is followed by the request data in the form of name/value pairs, with names and values separated by spaces.")
List<String> postDocument = null;

@Parameter(names = { "-fetch-webpage-selector" }, arity = 4, description = "Fetch a webpage and output it to stdout in the format specified by the output modifiers --plain and --format. Works also for PDF files. \"Title\" and \"content\" args are CSS selectors as supported by jsoup. If the \"title selector\" is an empty string, then the page title will be the text content of the document's <title> element. If the \"content selector\" is an empty string, then content will be the whole text content parsed from the HTML/XML. If javascript arg is \"true\", then fetching will be done using JavaScript support (HtmlUnit), if \"false\", then without JavaScript (jsoup). If javascript arg is empty, then fetching will be done without JavaScript and if the text length of the returned document is less than --webpageMinLengthJavascript or if a <noscript> tag is found in it, a second fetch will happen with JavaScript support.")
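The new `-post-document` description says the resource URL is followed by request data as name/value pairs separated by spaces. A POST implementation would typically turn those pairs into a URL-encoded form body; the following is an illustrative sketch of that step only, not PubFetcher's actual code (`formPostBody` is an invented helper, and the example URL is just a plausible Europe PMC endpoint).

```java
// Illustrative sketch: turning the variable-arity "-post-document" arguments
// (URL followed by alternating names and values) into a URL-encoded request
// body. This is not PubFetcher's actual code; formPostBody is invented.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

public class PostBody {

    public static String formPostBody(List<String> args) throws UnsupportedEncodingException {
        // Expect the URL plus at least one name/value pair, i.e. an odd count >= 3
        if (args.size() < 3 || args.size() % 2 == 0) {
            throw new IllegalArgumentException("Expected: URL name value [name value ...]");
        }
        StringBuilder body = new StringBuilder();
        for (int i = 1; i < args.size(); i += 2) { // args.get(0) is the URL, skip it
            if (body.length() > 0) body.append('&');
            body.append(URLEncoder.encode(args.get(i), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(args.get(i + 1), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(formPostBody(List.of(
            "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
            "query", "cancer", "format", "json")));
        // prints "query=cancer&format=json"
    }
}
```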
3 changes: 1 addition & 2 deletions core/pom.xml
@@ -31,8 +31,7 @@
<packaging>jar</packaging>

<name>PubFetcher-Core</name>
<url>https://github.com/edamontology/pubfetcher/tree/master/core</url>
<description></description>
<url>https://github.com/edamontology/pubfetcher</url>

<dependencies>
<dependency>
@@ -61,13 +61,13 @@ public static <T extends BasicArgs> T parseArgs(String[] argv, Class<T> clazz, V
try {
jcommander.parse(argv);
} catch (ParameterException e) {
System.err.println(version.getName() + " " + version.getVersion());
System.err.println(version.getName() + " " + version.getVersion() + " (" + version.getUrl() + ")");
System.err.println(e);
System.err.println("Use -h or --help for listing valid options");
System.exit(1);
}
if (args.isHelp()) {
System.out.println(version.getName() + " " + version.getVersion());
System.out.println(version.getName() + " " + version.getVersion() + " (" + version.getUrl() + ")");
jcommander.usage();
System.exit(0);
}
@@ -85,7 +85,7 @@ public class FetcherArgs extends Args {
private Integer webpageMinLength = webpageMinLengthDefault;

private static final String webpageMinLengthJavascriptId = "webpageMinLengthJavascript";
private static final String webpageMinLengthJavascriptDescription = "If the length of a whole web page content fetched without JavaScript is below the specified limit and no scraping rules are found for the corresponding URL, then refetching using JavaScript support will be attempted";
private static final String webpageMinLengthJavascriptDescription = "If the length of the whole web page text fetched without JavaScript is below the specified limit and no scraping rules are found for the corresponding URL, then refetching using JavaScript support will be attempted";
private static final Integer webpageMinLengthJavascriptDefault = 200;
@Parameter(names = { "--" + webpageMinLengthJavascriptId }, validateWith = PositiveInteger.class, description = webpageMinLengthJavascriptDescription)
private Integer webpageMinLengthJavascript = webpageMinLengthJavascriptDefault;
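Taken together, the `--webpageMinLengthJavascript` description above and the `-fetch-webpage-selector` description name two triggers for refetching a page with JavaScript support: text shorter than the limit (default 200), or a `<noscript>` tag in the fetched document, provided no scraping rules match the URL. A sketch of that decision, with an invented method name (not PubFetcher's actual code):

```java
// Sketch of the refetch decision described in the parameter docs: after a
// fetch without JavaScript, refetch with JavaScript support (HtmlUnit) if
// the page text is shorter than --webpageMinLengthJavascript or a
// <noscript> tag is present, and no scraping rules matched the URL.
// The class and method are invented for illustration.
public class JsRefetchDecision {

    public static boolean shouldRefetchWithJavascript(String pageText, String rawHtml,
            boolean hasScrapingRules, int webpageMinLengthJavascript) {
        if (hasScrapingRules) {
            return false; // scraping rules handle the page as fetched
        }
        return pageText.length() < webpageMinLengthJavascript
                || rawHtml.toLowerCase().contains("<noscript");
    }

    public static void main(String[] args) {
        System.out.println(shouldRefetchWithJavascript("short text",
                "<html><body>short text</body></html>", false, 200)); // prints "true"
    }
}
```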
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
