A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools.
The aim of this project is to bring multiple tools together to generate a full XML document.
You might also be interested in the ScienceBeam Gym, the model training ground (the model is not yet integrated into the conversion pipeline).
This project is at a very early stage and may change significantly.
Note: If you just want to use the API, you can use the Docker image.
- Python 3 (Apache Beam may still log a warning until it is fully supported)
- Apache Beam
- LibreOffice Writer (lowriter) for converting Doc(x) files
The conversion pipeline could for example look as follows:
See below for current example implementations.
A simple non-Apache Beam specific pipeline definition exists and can be configured using app.cfg (defaults in: app-defaults.cfg).
The pipeline can be executed directly (e.g. as part of the API, see below) or translated and run as an Apache Beam pipeline.
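To illustrate what a pipeline definition that is not specific to Apache Beam can look like, here is a minimal sketch. The names `Step`, `run_directly`, `doc_to_pdf` and `pdf_to_xml` are hypothetical and not taken from the ScienceBeam code base. The sketch only shows that an ordered list of steps can be executed directly, which is what makes it possible for a separate runner to translate the same list into another execution model.

```python
from typing import Callable, Dict, List, NamedTuple

# A document is passed between steps as a plain dict (an assumption of this sketch).
Document = Dict[str, object]


class Step(NamedTuple):
    """One named pipeline step: a function from document to document."""
    name: str
    fn: Callable[[Document], Document]


def run_directly(steps: List[Step], doc: Document) -> Document:
    """Execute the steps in order, without any Apache Beam dependency."""
    for step in steps:
        doc = step.fn(doc)
    return doc


# Hypothetical placeholder steps; real steps would call tools such as GROBID.
def doc_to_pdf(doc: Document) -> Document:
    return {**doc, "type": "application/pdf"}


def pdf_to_xml(doc: Document) -> Document:
    return {**doc, "type": "application/xml"}


PIPELINE = [Step("Doc to PDF", doc_to_pdf), Step("PDF to XML", pdf_to_xml)]

if __name__ == "__main__":
    result = run_directly(PIPELINE, {"filename": "example.doc", "type": "application/msword"})
    print(result["type"])
```

Because each step is an ordinary function, a Beam-based runner could in principle apply the same list with one `beam.Map(step.fn)` per step. How the project's own runner performs that translation is not shown here.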
To run the pipeline using Apache Beam:
python -m sciencebeam.pipeline_runners.beam_pipeline_runner \
    --data-path=/home/deuser/_git_/elife/pdf-xml/data/other/00666 \
    --source-path=*.pdf \
    --grobid-url=http://localhost:8070/api
To get a list of all of the available parameters:
python -m sciencebeam.pipeline_runners.beam_pipeline_runner --help
Note: the list of parameters may change depending on the configured pipeline.
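One way such pipeline-dependent parameters can come about is for each configured pipeline to register its own command line options. The following is a sketch of that idea using `argparse`. The option names are the ones from the example command, while the functions `add_base_arguments`, `add_grobid_arguments` and `build_parser`, and the pipeline name `"grobid"`, are assumptions made for illustration rather than the project's actual code.

```python
import argparse


def add_base_arguments(parser: argparse.ArgumentParser) -> None:
    # Options that are not tied to a specific tool.
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--source-path", required=True)


def add_grobid_arguments(parser: argparse.ArgumentParser) -> None:
    # Only registered when the configured pipeline uses GROBID (hypothetical hook).
    parser.add_argument("--grobid-url")


def build_parser(pipeline_name: str) -> argparse.ArgumentParser:
    """Build a parser whose options depend on the configured pipeline."""
    parser = argparse.ArgumentParser()
    add_base_arguments(parser)
    if pipeline_name == "grobid":
        add_grobid_arguments(parser)
    return parser


if __name__ == "__main__":
    # Prints the help text, including the GROBID-specific option.
    build_parser("grobid").print_help()
```

With this structure, `--help` lists `--grobid-url` only when the GROBID pipeline is configured.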
The API server is currently available in combination with GROBID.
To start GROBID, run:
docker run -p 8070:8070 lfoppiano/grobid:0.5.3
To start the ScienceBeam server, run:
./server.sh --host=0.0.0.0 --port=8075 --grobid-url http://localhost:8070/api
The ScienceBeam API will be available on port 8075.
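As a rough sketch of how a client might talk to the server once it is running, the following builds a POST request carrying a PDF. The endpoint path `/api/convert` and the use of a raw request body with an `application/pdf` content type are assumptions, not details stated in this document, so check the API documentation before relying on them. The function names are hypothetical.

```python
import urllib.request


def build_convert_request(
    pdf_bytes: bytes,
    base_url: str = "http://localhost:8075",
    endpoint: str = "/api/convert",  # assumption: verify against the API documentation
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the PDF in the body."""
    return urllib.request.Request(
        base_url + endpoint,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def convert_pdf(pdf_bytes: bytes) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_convert_request(pdf_bytes)) as response:
        return response.read()
```

Keeping request construction separate from sending makes it possible to inspect or test the request without a running server.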
The API currently uses the simple pipeline format described above. The pipeline can be configured via app.cfg (defaults in: app-defaults.cfg). The default pipeline uses GROBID.
See CONFIG.md for more information.
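The relationship between app-defaults.cfg and app.cfg can be pictured as layered configuration, where values read later override values read earlier. The sketch below shows that pattern with Python's `configparser`. The section name `pipelines` and the key `default` are hypothetical placeholders, and whether ScienceBeam loads its configuration in exactly this way is an assumption.

```python
import configparser


def load_config(defaults_text: str, overrides_text: str = "") -> configparser.ConfigParser:
    """Read the defaults first, then the overrides, so that overrides win."""
    config = configparser.ConfigParser()
    config.read_string(defaults_text)
    config.read_string(overrides_text)
    return config


if __name__ == "__main__":
    # Hypothetical section and key names, for illustration only.
    defaults = "[pipelines]\ndefault = grobid\n"
    overrides = "[pipelines]\ndefault = custom\n"
    print(load_config(defaults, overrides).get("pipelines", "default"))
```

For files on disk, `config.read(["app-defaults.cfg", "app.cfg"])` gives the same layering, and files that do not exist are skipped.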
Doc to PDF
The default configuration includes a Doc to PDF conversion, as most tools will accept a PDF.
The Doc to PDF conversion will (by default):
- remove line numbers
- remove headers and footers (and with them page numbers)
- remove redlines (accept tracked changes)
This can be switched off by either:
- set one of the environment variables to
- or, add one of the URL request parameters to
Extending the Pipeline
The recommended way of extending the pipeline is to use a separate API server exposed via another docker container (as is the case for all of the currently integrated tools). If that is impractical for your use case you could also run locally installed programs (similar to the grobid_pipeline.py).
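A step that delegates to a separate API server might be sketched as follows. The class `ApiStep`, the `Transport` type and the helper `http_post` are hypothetical names introduced for illustration and do not reflect the project's actual step interface. The transport is injectable so the step can be exercised without a running container.

```python
import urllib.request
from typing import Callable

# Transport: takes (url, body, content_type) and returns the response body.
Transport = Callable[[str, bytes, str], bytes]


def http_post(url: str, body: bytes, content_type: str) -> bytes:
    """POST the body to the given URL and return the raw response."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


class ApiStep:
    """Hypothetical step that delegates the conversion to an external API server."""

    def __init__(self, api_url: str, transport: Transport = http_post):
        self.api_url = api_url
        self.transport = transport

    def __call__(self, pdf_bytes: bytes) -> bytes:
        # Send the PDF to the external tool and return whatever it produces.
        return self.transport(self.api_url, pdf_bytes, "application/pdf")
```

In this arrangement the external tool only needs to expose an HTTP endpoint, so it can live in its own container with its own dependencies.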
If the simple pipeline is too restrictive, you could consider the deprecated pipeline examples.
Unit tests are written using pytest. Run for example