ScienceBeam

Build Status License: MIT

A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools.

The aim of this project is to bring multiple tools together to generate a full XML document.

You might also be interested in ScienceBeam Gym, the model training ground (the model is not yet integrated into the conversion pipeline).

Status

This project is at a very early stage and may change significantly.

Docker

Note: If you just want to use the API, you can use the Docker image.

Pre-requisites

Pipeline

The conversion pipeline could for example look as follows:

[Figure: Example Conversion Pipeline]

See below for current example implementations.

Simple Pipeline

A simple pipeline definition that is not specific to Apache Beam exists and can be configured using app.cfg (defaults in app-defaults.cfg).

The pipeline can be executed directly (e.g. as part of the API, see below) or translated and run as an Apache Beam pipeline.

To run the pipeline using Apache Beam:

python -m sciencebeam.pipeline_runners.beam_pipeline_runner \
  --data-path=/home/deuser/_git_/elife/pdf-xml/data/other/00666 --source-path=*.pdf \
  --grobid-url=http://localhost:8070/api

To get a list of all of the available parameters:

python -m sciencebeam.pipeline_runners.beam_pipeline_runner --help

Note: the list of parameters may change depending on the configured pipeline.
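
When the runner is started from a script rather than typed into a shell, the command shown above can be assembled as an argument list. The following is a minimal sketch that uses only the three flags shown in this README. The helper name `build_runner_command` and the example data path are hypothetical and not part of ScienceBeam.

```python
import sys


def build_runner_command(data_path, source_path, grobid_url, python=sys.executable):
    """Return the argument list for the Beam pipeline runner.

    Hypothetical helper: it only reproduces the flags shown in this README
    (--data-path, --source-path, --grobid-url). Other parameters depend on
    the configured pipeline, see --help.
    """
    return [
        python,
        "-m",
        "sciencebeam.pipeline_runners.beam_pipeline_runner",
        "--data-path=" + data_path,
        "--source-path=" + source_path,
        "--grobid-url=" + grobid_url,
    ]


if __name__ == "__main__":
    # "/path/to/pdfs" is a placeholder, replace it with your own data directory.
    command = build_runner_command(
        "/path/to/pdfs", "*.pdf", "http://localhost:8070/api"
    )
    print(" ".join(command))
```

The resulting list could be passed to `subprocess.check_call(command)`. Passing a list rather than a single shell string means the `*.pdf` pattern reaches the runner unchanged instead of being expanded by the shell.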

Current pipelines:

API Server

The API server is currently available in combination with GROBID.

To start GROBID, run:

docker run -p 8070:8070 lfoppiano/grobid:0.5.1

To start the ScienceBeam server, run:

./server.sh --host=0.0.0.0 --port=8075 --grobid-url http://localhost:8070/api

The ScienceBeam API will be available on port 8075.
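
Because the server depends on GROBID, it can help to confirm that GROBID is reachable before starting it. The following is a minimal Python 3 sketch, assuming GROBID exposes its `isalive` health check under the same base URL that is passed as `--grobid-url`. The function names are hypothetical and not part of ScienceBeam.

```python
import urllib.error
import urllib.request


def isalive_url(grobid_url):
    """Build the health check URL from the GROBID API base URL."""
    return grobid_url.rstrip("/") + "/isalive"


def is_grobid_alive(grobid_url, timeout=5.0):
    """Return True if the health check answers with HTTP 200, else False.

    Assumption: the GROBID service answers GET <grobid_url>/isalive.
    Connection failures, timeouts and HTTP error responses all count
    as "not alive".
    """
    try:
        with urllib.request.urlopen(isalive_url(grobid_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

For example, `is_grobid_alive("http://localhost:8070/api")` could be polled in a loop after the `docker run` command above, and `./server.sh` started once it returns `True`.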

The API currently uses the simple pipeline format described above. The pipeline can be configured via app.cfg (default: app-defaults.cfg). The default pipeline uses GROBID.

Extending the Pipeline

You can use grobid_pipeline.py as a template and add your own pipelines with other steps. Please see Simple Pipeline for configuration details.

The recommended way of extending the pipeline is to use a separate API server exposed via another docker container (as is the case for all of the currently integrated tools). If that is impractical for your use case you could also run locally installed programs (similar to the grobid_pipeline.py).

If the simple pipeline is too restrictive, you could consider the deprecated pipeline examples.

Tests

Unit tests are written using pytest. For example, run pytest or pytest-watch.

Contributing

See CONTRIBUTING.
