ScienceBeam Trainer for GROBID

⚠️ Under new stewardship

eLife have handed over stewardship of ScienceBeam to The Coko Foundation. You can now find the updated code repository at https://gitlab.coko.foundation/sciencebeam/sciencebeam-trainer-grobid and continue the conversation on Coko's Mattermost chat server: https://mattermost.coko.foundation/

For more information on why we're doing this read our latest update on our new technology direction: https://elifesciences.org/inside-elife/daf1b699/elife-latest-announcing-a-new-technology-direction

Overview

ScienceBeam Trainer for GROBID is a thin wrapper and Docker container around the GROBID training commands. While the container is not complete yet (it currently supports the header model only), it is cloud-ready.

Prerequisites

Recommended

Using the Docker Container

Header Model Training with Default Dataset

This isn't very useful unless you want to re-train the model, but it is a good way to test how long training takes.

Using Docker:

docker run --rm -it \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --use-default-dataset

Using Kubernetes:

kubectl run --rm --attach --restart=Never --generator=run-pod/v1 \
    --image=elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model -- \
    train-header-model.sh \
        --use-default-dataset

Header Model Training with your own dataset

Using a mounted volume:

docker run --rm -it \
    -v /data/mydataset:/data/mydataset \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --dataset /data/mydataset \
        --use-default-dataset

You could also specify a cloud location that gsutil understands (assuming the credentials are mounted into the container as well).
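
For example, a minimal sketch of pointing the dataset at a Google Cloud Storage bucket. The bucket name and key file path below are placeholders, and mounting a service account key and pointing GOOGLE_APPLICATION_CREDENTIALS at it is only one common approach; the exact setup depends on how gsutil/gcloud are configured inside the image:

docker run --rm -it \
    -v /path/to/credentials.json:/credentials.json \
    -e GOOGLE_APPLICATION_CREDENTIALS=/credentials.json \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --dataset gs://my-bucket/my-dataset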

The --use-default-dataset flag is optional.

You may also add --cloud-models-path <cloud path> to copy the resulting model to cloud storage.
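
For example, combining a mounted dataset with a cloud upload of the trained model (the bucket path below is a placeholder):

docker run --rm -it \
    -v /data/mydataset:/data/mydataset \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --dataset /data/mydataset \
        --cloud-models-path gs://my-bucket/models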

Make Targets

Example End-to-End

make example-data-processing-end-to-end

Downloads an example PDF, converts it to training data, and runs the training. The resulting model won't be of much use and merely serves as an example.

Get Example Data

make get-example-data

Downloads an example PDF to the data Docker volume.

Generate GROBID Training Data

make generate-grobid-training-data

Converts the previously downloaded PDF from the data volume to GROBID training data. The TEI files will be stored in tei-raw within the dataset. Training on the raw XML wouldn't be of much use, as it only contains the annotations the model already knows. Usually one would review and correct those generated XML files using the annotation guidelines. The final TEI files should be stored in the tei sub-directory of the corpus in the dataset.
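
As a rough sketch, the dataset layout for the header model might look like the following; the tei-raw and tei directories follow from the description above, while the header/corpus nesting is an assumption and may differ depending on the GROBID version:

/data/mydataset/
    header/                # assumed model-specific sub-directory
        corpus/
            tei-raw/       # generated TEI files, to be reviewed and corrected
            tei/           # final TEI files used for training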

Copy Raw Header Training Data to TEI

make copy-raw-header-training-data-to-tei

This copies the generated raw TEI XML files from tei-raw to tei. This is just for demonstration purposes; normally the XML files should be reviewed and corrected first (see above).

Train Header Model with Dataset

make train-header-model-with-dataset

Trains the model on the dataset produced in the previous steps. The output will be the trained GROBID header model.

Train Header Model with Default Dataset

make train-header-model-with-default-dataset

Instead of using our own dataset, this will use the default dataset that comes with GROBID.

Train Header Model with Dataset and Default Dataset

make train-header-model-with-dataset-and-default-dataset

A combination of the two: it trains a model using both the default dataset and our own dataset.

Upload Header Model

make CLOUD_MODELS_PATH=gs://bucket/path/to/model upload-header-model

Uploads the final header model to a location in the cloud. This assumes that the credentials are mounted in the container. Because the Google Cloud SDK also has some support for AWS S3, you could also specify an S3 location.
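
For example, a sketch of an S3 upload, assuming S3 credentials are available to gsutil (e.g. via its Boto configuration); the bucket path is a placeholder:

make CLOUD_MODELS_PATH=s3://my-bucket/path/to/model upload-header-model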