Biotea

RDF as an Interface to scientific publications.

Read more at http://www.jbiomedsem.com/content/4/S1/S5

Please cite this work as

Leyla Jael García-Castro, Casey McLaughlin, Alexander García Castro: Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data. J. Biomedical Semantics 4(S-1): S5 (2013)

Main Class

PmcOpenAccessRDFHandlerWithThreads

Usage:

-in <input directory> -out <output directory> -sections

-all or -rdfAndAnnotations or any of -rdf -whatizit -ncboAnnotator -bio2rdf (all are optional; if none is present, -all is assumed)

-pipeline (optional, whatizitEuropePmcAll by default; only used when -whatizit or -all is present) Pipeline to be used by the Whatizit AO annotator (values: whatizitEuropePmcGenesProteins, whatizitSwissprot, whatizitEuropePmcAll, etc.)

-sections (optional) If present, paper sections will be RDFized
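
The defaulting rules above can be sketched in a few lines of Java. This is an illustration only, not Biotea's actual argument parser: the class UsageDefaults and its methods are hypothetical, and treating -bio2rdf as a URI-style switch rather than a processing mode is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Biotea) that mirrors the documented defaults. */
final class UsageDefaults {
    // Processing modes named in the usage text. Assumption: -bio2rdf only
    // selects the URI style, so it is not counted as a mode here.
    static final List<String> MODE_FLAGS = Arrays.asList(
            "-all", "-rdfAndAnnotations", "-rdf", "-whatizit", "-ncboAnnotator");

    static final String DEFAULT_PIPELINE = "whatizitEuropePmcAll";

    /** Returns the mode flags found in args; -all is assumed when none is given. */
    static Set<String> effectiveModes(String[] args) {
        Set<String> modes = new LinkedHashSet<>();
        for (String arg : args) {
            if (MODE_FLAGS.contains(arg)) {
                modes.add(arg);
            }
        }
        if (modes.isEmpty()) {
            modes.add("-all");
        }
        return modes;
    }

    /** Returns the value that follows -pipeline, or the documented default. */
    static String pipeline(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-pipeline".equals(args[i])) {
                return args[i + 1];
            }
        }
        return DEFAULT_PIPELINE;
    }

    public static void main(String[] args) {
        // Placeholder directory names, for illustration only.
        String[] example = {"-in", "toProcess", "-out", "output", "-sections"};
        System.out.println("modes: " + effectiveModes(example));
        System.out.println("pipeline: " + pipeline(example));
    }
}
```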

Recommended flow

  1. PmcOpenAccessRDFHandlerWithThreads -in -out -rdf -sections
  2. PmcOpenAccessRDFHandlerWithThreads -in -out -ncboAnnotator
  3. PmcOpenAccessRDFHandlerWithThreads -in -out -whatizit -pipeline whatizitEuropePmcAll
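
As a rough sketch, the three steps above can be written down as argument lists and printed as command lines. The class RecommendedFlow is hypothetical and the directory names are placeholders; how you actually launch the main class (jar name, classpath, package) depends on your build and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Biotea) listing the recommended three-step flow. */
final class RecommendedFlow {
    /** Builds the argument list for each step, in the recommended order. */
    static List<List<String>> steps(String inDir, String outDir) {
        List<List<String>> steps = new ArrayList<>();
        // 1. RDF for article metadata and sections.
        steps.add(Arrays.asList("-in", inDir, "-out", outDir, "-rdf", "-sections"));
        // 2. NCBO Annotator annotations over the RDF produced in step 1.
        steps.add(Arrays.asList("-in", inDir, "-out", outDir, "-ncboAnnotator"));
        // 3. Whatizit annotations with the default pipeline.
        steps.add(Arrays.asList("-in", inDir, "-out", outDir,
                "-whatizit", "-pipeline", "whatizitEuropePmcAll"));
        return steps;
    }

    public static void main(String[] args) {
        // Placeholder directories; replace with your own paths.
        for (List<String> step : steps("toProcess", "output")) {
            System.out.println("PmcOpenAccessRDFHandlerWithThreads " + String.join(" ", step));
        }
    }
}
```

Running the steps separately keeps the RDF generation independent of the availability of the annotation services.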

Notes:

  • You can use either Biotea or Bio2RDF URI styles. If you use Bio2RDF, please use the -bio2rdf flag when running the process in order to keep URIs consistent.

  • If you decide not to include the sections, you cannot run the RDFization process later just to add them. The output is either metadata plus sections or metadata only; sections alone is not an option.

  • Once an input file has been processed and its output file created, running the RDFization process again will skip that article, because it already exists in the output directory. To reprocess it, either clean the output directory or use a different one.

  • Not all content will be RDFized; formulas, for instance, will not. Content not included in the RDF output is reported as a WARN in the logs. We were not interested in RDFizing all the content, but we still wanted to keep a trace of what was left out.

config.properties

  • config.properties specifies the directory where the WSDL for Whatizit is located (whatizit.wsdl.dir); configure that first. If you are not going to use Whatizit annotations, that value is not used.
  • You can use either Biotea or Bio2RDF URI styles. If you use Bio2RDF, please use the -bio2rdf flag when running the process in order to keep URIs consistent. The base URI property is named biotea.base.
  • If biotea.withBio is true, links to Bio2RDF and identifiers.org will be added.
  • For the Whatizit annotator, ontologies that are also covered by NCBOAnnotator, i.e., GO and CHEBI, can be included or not; see the properties whatizit.CHEBI and whatizit.GO. By default both are set to true, so they will be processed.
  • For NCBO Annotator, it is possible to include only certain ontologies (ncbo.annotator.include.only) or to exclude some (ncbo.annotator.exclude). You need to specify the ontology names; see the names in ontologies.properties. By default ICD9 and NCBITaxon are excluded.

IMPORTANT: Remember to configure your Bioportal API key in order to use the NCBOAnnotator

IMPORTANT: If you are annotating with Whatizit, remember to point whatizit.wsdl.dir to your local copy of the WSDL
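
Before a long run it can help to load the configuration and print the settings discussed above. The sketch below is hypothetical and not part of Biotea: the property names come from this README, but the sample values, the comma-separated list format, and the fallback defaults are assumptions for illustration. Check them against your own config.properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper (not part of Biotea) for inspecting config.properties values. */
final class ConfigCheck {
    /** Parses properties-file text; wraps I/O failures in an unchecked exception. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads a boolean property, using defaultValue when the key is absent. */
    static boolean flag(Properties props, String key, boolean defaultValue) {
        return Boolean.parseBoolean(props.getProperty(key, String.valueOf(defaultValue)));
    }

    public static void main(String[] args) {
        // Sample values are assumptions; only the keys are taken from the README.
        String sample = String.join("\n",
                "biotea.withBio=true",
                "whatizit.GO=true",
                "whatizit.CHEBI=false",
                "ncbo.annotator.exclude=ICD9,NCBITaxon",
                "whatizit.wsdl.dir=/path/to/wsdl");
        Properties props = load(sample);
        System.out.println("withBio: " + flag(props, "biotea.withBio", false));
        System.out.println("whatizit.GO: " + flag(props, "whatizit.GO", true));
        System.out.println("whatizit.CHEBI: " + flag(props, "whatizit.CHEBI", true));
        System.out.println("excluded: " + props.getProperty("ncbo.annotator.exclude"));
        System.out.println("wsdl dir: " + props.getProperty("whatizit.wsdl.dir"));
    }
}
```

Note that backslashes are escape characters in Java .properties files, so Windows paths need doubled backslashes or forward slashes.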

Running environment

Input and Output resources

  • You will need an input directory containing all the required DTDs, such as those in /pmc_dtds.
  • In the same input directory, you will need the NXML files that you want to process (a sample file, AAPS_J_2008_Apr_2_10(1)_193-199.nxml, is provided in /temp).
  • You will need an output directory with one subdirectory, output/AO_annotations.
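
The directory layout described above can be prepared and checked with a short Java sketch. The class RunDirs is hypothetical and not part of Biotea; counting files by the .nxml and .dtd extensions is an assumption about how to recognise the inputs, and the DTD set may include other file types that this check does not count.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Biotea) for preparing input and output directories. */
final class RunDirs {
    /** Creates outDir/AO_annotations (and any missing parents) and returns it. */
    static Path prepareOutput(Path outDir) {
        try {
            return Files.createDirectories(outDir.resolve("AO_annotations"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts the entries directly inside dir whose names match the glob. */
    static int count(Path dir, String glob) {
        int n = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path ignored : stream) {
                n++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: RunDirs <input directory> <output directory>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        System.out.println("annotations dir: " + prepareOutput(out));
        if (Files.isDirectory(in)) {
            System.out.println("NXML files: " + count(in, "*.nxml"));
            System.out.println("DTD files: " + count(in, "*.dtd"));
        } else {
            System.out.println("input directory not found: " + in);
        }
    }
}
```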

jar file

  • Make sure the jar, config.properties, log4j.properties, server-bindings.xml, and whatizit.wsdl are all in the same directory.
  • Modify config.properties and log4j.properties to point to the right paths on your machine.

logs

  • Do not forget to modify log4j.properties according to what you want to include in the logs and where you want to store them.

Running options, some examples

  • RDF generation for articles only: -in D:\workspace\biotea_rdf4pmc\temp\toProcess -out D:\workspace\biotea_rdf4pmc\temp -rdf

  • RDF generation for articles only, including sections: -in D:\workspace\biotea_rdf4pmc\temp\toProcess -out D:\workspace\biotea_rdf4pmc\temp -rdf -sections

  • RDF generation for NCBO Annotator only (RDF for article and sections must already exist in the output directory): -in D:\workspace\biotea_rdf4pmc\temp\toProcess -out D:\workspace\biotea_rdf4pmc\temp -ncboAnnotator

  • RDF generation for Whatizit only (RDF for article and sections must already exist in the output directory): -in D:\workspace\biotea_rdf4pmc\temp\toProcess -out D:\workspace\biotea_rdf4pmc\temp -pipeline whatizitEuropePmcAll -whatizit

  • RDF for article, sections, and annotations: -in D:\workspace\biotea_rdf4pmc\temp\toProcess -out D:\workspace\biotea_rdf4pmc\temp -rdfAndAnnotations -sections -pipeline whatizitEuropePmcAll

  • All (possible but not recommended): -in D:\workspace\biotea_rdf4pmc\temp\toProcess -out D:\workspace\biotea_rdf4pmc\temp -pipeline whatizitEuropePmcAll -all

Some additional notes

  • NCBOAnnotator may not be responsive all the time. Biotea annotates as many paragraphs as possible, but there is no guarantee that all of them will be annotated. Furthermore, depending on NCBOAnnotator's responses, more or fewer paragraphs may be annotated if you try again. To alter this behaviour, for instance to annotate all or nothing, you will need to adjust the NCBOAnnotator class.
  • Whatizit has not been updated recently, so its annotations will be out of sync with the current versions of the ontologies and datasets it uses.
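
To illustrate the difference between the best-effort behaviour described above and an all-or-nothing alternative, here is a minimal sketch. It does not reproduce Biotea's NCBOAnnotator class: the class BestEffortAnnotation, its methods, and the annotator function standing in for the remote service are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch (not Biotea code) contrasting two annotation policies. */
final class BestEffortAnnotation {
    /** Annotates every paragraph it can; paragraphs whose call fails are skipped. */
    static List<String> bestEffort(List<String> paragraphs, Function<String, String> annotator) {
        List<String> annotated = new ArrayList<>();
        for (String paragraph : paragraphs) {
            try {
                annotated.add(annotator.apply(paragraph));
            } catch (RuntimeException e) {
                // The service did not answer for this paragraph; move on to the next one.
            }
        }
        return annotated;
    }

    /** Returns all annotations, or an empty Optional as soon as one call fails. */
    static Optional<List<String>> allOrNothing(List<String> paragraphs,
                                               Function<String, String> annotator) {
        List<String> annotated = new ArrayList<>();
        for (String paragraph : paragraphs) {
            try {
                annotated.add(annotator.apply(paragraph));
            } catch (RuntimeException e) {
                return Optional.empty();
            }
        }
        return Optional.of(annotated);
    }

    public static void main(String[] args) {
        // Stand-in for the remote service: fails on one paragraph to simulate a timeout.
        Function<String, String> flaky = p -> {
            if (p.contains("timeout")) {
                throw new IllegalStateException("no response");
            }
            return "annotated(" + p + ")";
        };
        List<String> paragraphs = Arrays.asList("intro", "timeout here", "results");
        System.out.println("best effort: " + bestEffort(paragraphs, flaky));
        System.out.println("all or nothing: " + allOrNothing(paragraphs, flaky));
    }
}
```

With the best-effort policy a rerun may yield a different set of annotated paragraphs, which matches the variability noted above; the all-or-nothing policy trades coverage for reproducibility.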