Skip to content
A BERT model for scientific text.
Branch: master
Clone or download
ibeltagy Fix ner config
Addressing #31
Latest commit 64bb606 Apr 8, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
allennlp_config Fix ner config Apr 8, 2019
data reorganize confidence itnerval code Mar 26, 2019
misc reorganize confidence itnerval code Mar 26, 2019
scibert docstring for pico crf tagger Mar 26, 2019
scripts add script to compute dataset sizes Mar 26, 2019
.gitignore ignore dsstore Feb 14, 2019
LICENSE.txt
README.md
requirements.txt
train_allennlp_local.sh Update train_allennlp_local.sh Mar 27, 2019

README.md

PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC
PWC

SciBERT

SciBERT is a BERT model trained on scientific text.

  • SciBERT is trained on papers from the corpus of semanticscholar.org. Corpus size is 1.14M papers, 3.1B tokens. We use the full text of the papers in training, not just abstracts.

  • SciBERT has its own vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (basevocab) for comparison.

  • It results in state-of-the-art performance on a wide range of scientific domain nlp tasks. The details of the evaluation are in the paper. Evaluation code and data are included in this repo.

Downloading Trained Models

We release the tensorflow and the pytorch version of the trained models. The tensorflow version is compatible with code that works with the model from Google Research. The pytorch version is created using the Hugging Face library, and this repo shows how to use it in AllenNLP. All combinations of scivocab and basevocab, cased and uncased models are available below. Our evaluation shows that scivocab-uncased usually gives the best results.

Tensorflow Models

PyTorch Models

Using SciBERT in your own model

SciBERT models include all necessary files to be plugged in your own model and are in same format as BERT. If you are using Tensorflow, refer to Google's BERT repo and if you use PyTorch, refer to Hugging Face's repo where detailed instructions on using BERT models are provided.

Training new models using AllenNLP

To run experiments on different tasks and reproduce our results in the paper, you need to first setup the Python 3.6 environment:

pip install -r requirements.txt

which will install dependencies like AllenNLP.

Use the scibert/train_allennlp_local.sh script as an example of how to run an experiment (you'll need to modify paths and variable names like TASK and DATASET).

We include a broad set of scientific nlp datasets under the data/ directory across the following tasks. Each task has a sub-directory of available datasets.

├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction

For example to run the model on the Named Entity Recognition (NER) task and on the BC5CDR dataset (BioCreative V CDR), modify the scibert/train_allennlp_local.sh script according to:

DATASET='bc5cdr'
TASK='ner'
...

Decompress the PyTorch model that you downloaded using
tar -xvf scibert_scivocab_uncased.tar
The results will be in the scibert_scivocab_uncased directory containing two files: A vocabulary file (vocab.txt) and a weights file (weights.tar.gz). Copy the files to your desired location and then set correct paths for BERT_WEIGHTS and BERT_VOCAB in the script:

export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz

Finally run the script:

./scibert/train_allennlp_local.sh [serialization-directory]

Where [serialization-directory] is the path to an output directory where the model files will be stored.

Citing

If you use SciBERT in your research, please cite SciBERT: Pretrained Contextualized Embeddings for Scientific Text.

@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: Pretrained Contextualized Embeddings for Scientific Text},
  author={Iz Beltagy and Arman Cohan and Kyle Lo},
  year={2019},
  Eprint={arXiv:1903.10676}
}

SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

You can’t perform that action at this time.