eLife have handed over stewardship of ScienceBeam to The Coko Foundation. You can now find the updated code repository at https://gitlab.coko.foundation/sciencebeam/sciencebeam-trainer-grobid and continue the conversation on Coko's Mattermost chat server: https://mattermost.coko.foundation/
For more information on why we're doing this, read our latest update on our new technology direction: https://elifesciences.org/inside-elife/daf1b699/elife-latest-announcing-a-new-technology-direction
The Trainer for GROBID is a thin wrapper and Docker container around GROBID's training commands. While this container is not yet complete (it supports the header model only), it is cloud-ready.
Prerequisites:
- Docker and Docker Compose
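To confirm both are installed, the standard version checks work:

docker --version
docker-compose --version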
This isn't very useful unless you want to re-train the model, but it is a good test to see how long training takes.
Using Docker:
docker run --rm -it \
elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
train-header-model.sh \
--use-default-dataset
Using Kubernetes:
kubectl run --rm --attach --restart=Never --generator=run-pod/v1 \
--image=elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
train-header-model -- \
train-header-model.sh \
--use-default-dataset
Using a mounted volume:
docker run --rm -it \
-v /data/mydataset:/data/mydataset \
elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
train-header-model.sh \
--dataset /data/mydataset \
--use-default-dataset
You could also specify a cloud location that gsutil understands (assuming that the credentials are mounted too). The --use-default-dataset flag is optional. You may also add --cloud-models-path <cloud path> to copy the resulting model to cloud storage.
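For illustration, a sketch combining these options. The bucket paths are placeholders, and mounting the local gcloud configuration into the container is an assumption about how credentials are made available in your setup:

docker run --rm -it \
-v ~/.config/gcloud:/root/.config/gcloud \
elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
train-header-model.sh \
--dataset gs://my-bucket/my-dataset \
--cloud-models-path gs://my-bucket/models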
make example-data-processing-end-to-end
Downloads an example PDF, converts it to training data and runs the training. The resulting model won't be of much use and merely provides an example.
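This end-to-end target appears to chain the individual steps described below; running them one by one would look roughly like the following (a sketch, the actual target dependencies may differ):

make get-example-data
make generate-grobid-training-data
make copy-raw-header-training-data-to-tei
make train-header-model-with-dataset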
make get-example-data
Downloads the example PDF to the data Docker volume.
make generate-grobid-training-data
Converts the previously downloaded PDF from the data volume to GROBID training data. The tei files will be stored in tei-raw in the dataset. Training on the raw XML wouldn't be of much use, as it only contains the annotations the model already knows. Usually one would review and correct those generated XML files using the annotation guidelines. The final tei files should be stored in the tei subdirectory of the corpus in the dataset.
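After reviewing and correcting the generated files, they need to be copied into place. A minimal sketch, assuming a GROBID-style layout where tei-raw and tei sit under the corpus directory of a dataset mounted at /data/mydataset (adjust the paths to your actual layout):

cp /data/mydataset/header/corpus/tei-raw/*.tei.xml \
/data/mydataset/header/corpus/tei/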
make copy-raw-header-training-data-to-tei
This copies the generated raw tei XML files from tei-raw to tei. This is just for demonstration purposes. The XML files should be reviewed (see above).
make train-header-model-with-dataset
Trains the model on the dataset produced in the previous steps. The output will be the trained GROBID header model.
make train-header-model-with-default-dataset
Instead of using our own dataset, this will use the default dataset that comes with GROBID.
make train-header-model-with-dataset-and-default-dataset
A combination of the two: it will train a model on both the default dataset and our own dataset.
make CLOUD_MODELS_PATH=gs://bucket/path/to/model upload-header-model
Uploads the final header model to a location in the cloud. This assumes that the credentials are mounted to the container. Because the Google Cloud SDK also has some support for Amazon S3, you could also specify an S3 location.
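For example, with an S3 destination (the bucket path is a placeholder, and AWS credentials must be available to the container):

make CLOUD_MODELS_PATH=s3://my-bucket/path/to/model upload-header-model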