An example configuration is provided in the example-config directory. Please copy it to config.
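The copy step can be sketched as a small shell function. This is a minimal sketch, assuming it is run from the repository root and that an existing config directory should be left untouched; the function name copy_example_config is hypothetical and not part of the project.

```shell
# Hypothetical helper (not part of the project): copy the example
# configuration to ./config without overwriting an existing copy.
copy_example_config() {
  src="${1:-example-config}"
  dest="${2:-config}"
  if [ -e "$dest" ]; then
    # Keep local changes: never overwrite an existing configuration.
    echo "$dest already exists, leaving it untouched" >&2
    return 0
  fi
  cp -r "$src" "$dest"
}
```

Calling copy_example_config with no arguments copies example-config to config; both paths can be overridden by passing them as arguments.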
A dataset describes where the data to be converted comes from, typically a set of files.
Datasets are configured in ./config/datasets, with each .sh file describing one dataset.
Tools are used to convert files. Currently they configure the ScienceBeam pipeline.
Tools are configured in ./config/tools, with each .sh file describing one tool.
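To see which datasets and tools are currently configured, the .sh files in each directory can be listed. The following is a minimal sketch; the function name list_config_names is hypothetical, and it assumes that the file name without the .sh extension is the name passed to --dataset or --tool, which is not confirmed here.

```shell
# Hypothetical helper (not part of the project): print one name per .sh file
# in a configuration directory.
# Assumption: the file name without ".sh" is the dataset or tool name.
list_config_names() {
  dir="$1"
  for file in "$dir"/*.sh; do
    # Skip the unexpanded pattern when the directory has no .sh files.
    [ -e "$file" ] || continue
    basename "$file" .sh
  done
}
```

For example, list_config_names ./config/datasets prints the configured dataset names, and list_config_names ./config/tools prints the configured tool names.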
By default the corresponding container is started and stopped from within the sciencebeam-orchester container.
docker-compose run --rm sciencebeam-orchester ./run-all.sh convert
For an individual dataset and conversion tool:
docker-compose run --rm sciencebeam-orchester \
  ./run-all.sh \
  --dataset pmc-1943-cc-by-sample \
  --tool grobid-tei \
  --force \
  --limit 1000 \
  --workers 10 \
  convert
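To convert one dataset with several tools, the single-tool command can be repeated in a loop. This is a minimal sketch that uses only the --dataset and --tool options shown above; the function name convert_with_tools and the DRY_RUN variable are hypothetical and not part of the project.

```shell
# Hypothetical helper (not part of the project): run the convert step for one
# dataset with each of the given tools.
# Set DRY_RUN=1 to print the commands instead of running them.
convert_with_tools() {
  dataset="$1"
  shift
  for tool in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker-compose run --rm sciencebeam-orchester ./run-all.sh --dataset $dataset --tool $tool convert"
    else
      docker-compose run --rm sciencebeam-orchester \
        ./run-all.sh --dataset "$dataset" --tool "$tool" convert
    fi
  done
}
```

For example, convert_with_tools pmc-1943-cc-by-sample grobid-tei scienceparse-v2 runs the conversion twice, assuming both tools are configured in ./config/tools.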
docker-compose run --rm sciencebeam-orchester ./run-all.sh evaluation-report
Build containers:
docker-compose up --no-start
Start:
docker-compose start sciencebeam-orchester
docker-compose start scienceparse-v2
docker-compose run --rm sciencebeam-orchester ./run.sh \
  --dataset pmc-1943 --tool scienceparse-v2 convert