# Biomedical named entity recognition with partially annotated data

## Setup

### Installation

In [None]:
!pip install -q spacy-partial-tagger

### Download datasets

Download the BC5CDR biomedical named entity recognition dataset (train, development, and test splits in IOB format) from:
- https://github.com/PierreZweigenbaum/bc5cdr-ner

In [2]:
!wget https://raw.githubusercontent.com/PierreZweigenbaum/bc5cdr-ner/main/BC5CDR-IOB/train.tsv
!wget https://raw.githubusercontent.com/PierreZweigenbaum/bc5cdr-ner/main/BC5CDR-IOB/devel.tsv
!wget https://raw.githubusercontent.com/PierreZweigenbaum/bc5cdr-ner/main/BC5CDR-IOB/test.tsv

--2022-10-25 06:34:40--  https://raw.githubusercontent.com/PierreZweigenbaum/bc5cdr-ner/main/BC5CDR-IOB/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1039940 (1016K) [text/plain]
Saving to: ‘train.tsv’


2022-10-25 06:34:41 (17.7 MB/s) - ‘train.tsv’ saved [1039940/1039940]

--2022-10-25 06:34:41--  https://raw.githubusercontent.com/PierreZweigenbaum/bc5cdr-ner/main/BC5CDR-IOB/devel.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1031781 (1008K) [text/plain]
Saving to: ‘devel.tsv’


2022-10-25 06:34:42 (17.5 MB/s) -
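
Each downloaded file is read later in this notebook as tab-separated text: one token and its IOB tag per line, with a blank line between sentences. The following is a minimal sketch of that layout, assuming tag names such as `B-Chemical` and `B-Disease`; the sample sentences and the helper `parse_iob` are made up for illustration and are not taken from the dataset.

```python
from typing import List, Tuple

# Illustrative sample only: tokens and tags are invented to show the layout.
SAMPLE = (
    "Naloxone\tB-Chemical\n"
    "reverses\tO\n"
    "hypotension\tB-Disease\n"
    "\n"
    "Heart\tB-Disease\n"
    "failure\tI-Disease\n"
)


def parse_iob(text: str) -> List[Tuple[List[str], List[str]]]:
    """Split IOB text into (words, tags) pairs, one pair per sentence."""
    sentences = []
    for block in text.strip().split("\n\n"):
        words, tags = [], []
        for line in block.splitlines():
            if not line:
                continue
            word, tag = line.split("\t")
            words.append(word)
            tags.append(tag)
        sentences.append((words, tags))
    return sentences


sentences = parse_iob(SAMPLE)
for words, tags in sentences:
    print(list(zip(words, tags)))
```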

In addition to the datasets, we need a dictionary of biomedical entities, which is used later to annotate the training data automatically:

In [3]:
!wget https://raw.githubusercontent.com/shangjingbo1226/AutoNER/master/data/BC5CDR/dict_core.txt

--2022-10-25 06:34:45--  https://raw.githubusercontent.com/shangjingbo1226/AutoNER/master/data/BC5CDR/dict_core.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 67944 (66K) [text/plain]
Saving to: ‘dict_core.txt’


2022-10-25 06:34:46 (4.31 MB/s) - ‘dict_core.txt’ saved [67944/67944]



## Prepare datasets

Once the datasets are downloaded, we need to create a partially annotated training set and convert all splits into spaCy's binary format. The following functions handle both steps:

In [6]:
import spacy
from spacy.tokens import Doc, DocBin
from spacy_partial_tagger.tokenizer import CharacterTokenizer


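def preprocess_term(term: str) -> str:
    """Put spaces around punctuation marks and lowercase the term."""
    if not term:
        return term
    chars = []
    marks = {"-", "[", "]", "(", ")", ",", ".", "'"}
    for i in range(len(term) - 1):
        chars.append(term[i])
        # Insert a space after a mark that is directly followed by a character.
        if term[i] in marks and term[i + 1] != " ":
            chars.append(" ")
        # Insert a space before a mark that directly follows a character.
        if term[i] != " " and term[i + 1] in marks:
            chars.append(" ")
    chars.append(term[-1])
    # Two adjacent marks produce a double space; collapse it.
    return "".join(chars).replace("  ", " ").lower()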


def load_patterns(file_path: str, nlp):
    """Yield one EntityRuler token pattern per "label<TAB>term" line."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            label, term = line.strip().split("\t")
            term = preprocess_term(term)
            pattern = [{"LOWER": token.text} for token in nlp(term)]
            yield {"label": label, "pattern": pattern}


def load_conll(file_path: str):
    x, y = [], []
    words, tags = [], []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                word, tag = line.split("\t")
                words.append(word)
                tags.append(tag)
            else:
                x.append(words)
                y.append(tags)
                words, tags = [], []
    if words:
        x.append(words)
        y.append(tags)
    return x, y


def create_doc(x, y, nlp, set_ents=True):
    """Yield one Doc per sentence; set_ents=False drops the gold tags."""
    for words, ents in zip(x, y):
        spaces = [True] * len(words)
        if not set_ents:
            ents = None
        yield Doc(nlp.vocab, words=words, spaces=spaces, ents=ents)


def store_data(docs, nlp, path: str):
    """Re-tokenize each doc with nlp's tokenizer, copy its entities, and save."""
    doc_bin = DocBin()
    for doc in docs:
        ents = list(doc.ents)
        # Re-tokenize the raw text; the entities' character offsets stay valid.
        doc = nlp.make_doc(doc.text)
        ents = [
            doc.char_span(ent.start_char, ent.end_char, label=ent.label_)
            for ent in ents
        ]
        doc.ents = ents
        doc_bin.add(doc)
    doc_bin.to_disk(path)
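
To see what `preprocess_term` does to dictionary entries, here is a small worked example. The function body is copied from the cell above so that the snippet runs on its own, and the sample dictionary line is made up for illustration (the `label<TAB>term` layout is the one `load_patterns` expects). Spacing out punctuation presumably helps the dictionary terms line up with the way the tokenizer splits text.

```python
def preprocess_term(term: str) -> str:
    # Copied from the cell above so this snippet is self-contained.
    chars = []
    marks = {"-", "[", "]", "(", ")", ",", ".", "'"}
    for i in range(len(term) - 1):
        chars.append(term[i])
        if term[i] in marks and term[i + 1] != " ":
            chars.append(" ")
        if term[i] != " " and term[i + 1] in marks:
            chars.append(" ")
    chars.append(term[-1])
    return "".join(chars).replace("  ", " ").lower()


# Hypothetical dictionary line in "label<TAB>term" form.
label, term = "Chemical\tN-methyl-D-aspartate".split("\t")

print(label, "|", preprocess_term(term))      # Chemical | n - methyl - d - aspartate
print(preprocess_term("5-HT"))                 # 5 - ht
print(preprocess_term("Parkinson's disease"))  # parkinson ' s disease
```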

Let's use these functions to prepare the datasets:

In [4]:
!mkdir -p corpus

In [7]:
nlp = spacy.blank("en")

# set patterns
patterns = list(load_patterns("dict_core.txt", nlp))
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# load data
x_train, y_train = load_conll("train.tsv")
x_valid, y_valid = load_conll("devel.tsv")
x_test, y_test = load_conll("test.tsv")

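# create docs; the gold tags of the training set are discarded
docs_train = create_doc(x_train, y_train, nlp, set_ents=False)
docs_valid = create_doc(x_valid, y_valid, nlp)
docs_test = create_doc(x_test, y_test, nlp)
# annotate the training docs with dictionary matches only
docs_train = list(map(ruler, docs_train))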

# store data
nlp.tokenizer = CharacterTokenizer(nlp.vocab)
store_data(docs_train, nlp, "corpus/train.spacy")
store_data(docs_valid, nlp, "corpus/valid.spacy")
store_data(docs_test, nlp, "corpus/test.spacy")
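
The training set built above carries only the entities found by dictionary matching, which is what makes it partially annotated. The following is a minimal pure-Python sketch of that idea, not the `EntityRuler` implementation: it assumes a greedy longest-match strategy and assumes that unmatched tokens are left unlabeled (`None`) rather than marked as non-entities. The helper `dictionary_tag` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def dictionary_tag(
    tokens: List[str], patterns: Dict[Tuple[str, ...], str]
) -> List[Optional[str]]:
    """Tag tokens by greedy longest dictionary match; None means unknown."""
    tags: List[Optional[str]] = [None] * len(tokens)
    lowered = [token.lower() for token in tokens]
    max_len = max((len(key) for key in patterns), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in patterns:
                matched = n
                break
        if matched:
            label = patterns[tuple(lowered[i:i + matched])]
            tags[i] = "B-" + label
            for j in range(i + 1, i + matched):
                tags[j] = "I-" + label
            i += matched
        else:
            i += 1
    return tags


# Made-up example: "clonidine" is missing from the toy dictionary.
patterns = {("naloxone",): "Chemical", ("heart", "failure"): "Disease"}
tokens = ["Naloxone", "and", "clonidine", "in", "heart", "failure"]
tags = dictionary_tag(tokens, patterns)
print(tags)
```

In this toy example `clonidine` stays unlabeled because it is not in the dictionary; treating it as unknown instead of as a non-entity avoids teaching the model a wrong label.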

## Creating a config file

Let's create a configuration file to train spacy-partial-tagger. Since we are using a dataset from the medical domain, let's use [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) as the pre-trained model.

For reference, the default configuration is available at:
- https://github.com/doccano/spacy-partial-tagger/blob/main/config.cfg

In [8]:
%%writefile base_config.cfg
[paths]
train = "./train.spacy"
dev = "./dev.spacy"
init_tok2vec = null
vectors = null

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["partial_ner"]
batch_size = 16
tokenizer = {"@tokenizers": "character_tokenizer.v1"}

[nlp.tokenizer]

[components]

[components.partial_ner]
factory = "partial_ner"

[components.partial_ner.loss]
@losses = "spacy-partial-tagger.ExpectedEntityRatioLoss.v1"
padding_index = -1
unknown_index = -100
outside_index = 0

[components.partial_ner.label_indexer]
@label_indexers = "spacy-partial-tagger.TransformerLabelIndexer.v1"
padding_index = ${components.partial_ner.loss.padding_index}
unknown_index = ${components.partial_ner.loss.unknown_index}

[components.partial_ner.model]
@architectures = "spacy-partial-tagger.PartialTagger.v1"

[components.partial_ner.model.misaligned_tok2vec]
@architectures = "spacy-partial-tagger.MisalignedTok2VecTransformer.v1"
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

[components.partial_ner.model.encoder]
@architectures = "spacy-partial-tagger.LinearCRFEncoder.v1"
nI = 768
nO = null
dropout = 0.2

[components.partial_ner.model.decoder]
@architectures = "spacy-partial-tagger.ConstrainedViterbiDecoder.v1"
padding_index = ${components.partial_ner.loss.padding_index}

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
accumulate_gradient = 1
max_steps = 20000
patience = 10000
eval_frequency = 1000
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
size = 16
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = false
use_averages = false
grad_clip = 5.0

[training.optimizer.learn_rate]
@schedules = "slanted_triangular.v1"
max_rate = 0.00002
num_steps = ${training.max_steps}
cut_frac = 0.1
ratio = 16
t = -1

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]

Writing base_config.cfg


In [None]:
!python -m spacy init fill-config base_config.cfg config.cfg
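
The `[training.optimizer.learn_rate]` block above selects a slanted triangular schedule: the learning rate rises linearly during the first `cut_frac` of the steps and then decays linearly. The following is a rough sketch based on the formula from the ULMFiT paper, using the values from the config; `slanted_triangular_rate` is a hypothetical helper, and the library's own implementation may differ in details such as step offsets.

```python
def slanted_triangular_rate(
    step: int,
    max_rate: float = 0.00002,
    num_steps: int = 20000,
    cut_frac: float = 0.1,
    ratio: int = 16,
) -> float:
    """Learning rate at `step` under a slanted triangular schedule (sketch)."""
    cut = int(num_steps * cut_frac)  # step at which the peak is reached
    if step < cut:
        p = step / cut  # linear warm-up from 0 to 1
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))  # linear decay to 0
    # The rate moves between max_rate / ratio and max_rate.
    return max_rate * (1 + p * (ratio - 1)) / ratio


for step in (0, 1000, 2000, 11000, 20000):
    print(step, f"{slanted_triangular_rate(step):.2e}")
```

With these settings the rate starts at about `max_rate / 16`, peaks at `max_rate` after 2000 steps, and returns to about `max_rate / 16` at step 20000.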

## Training a model

The command-line options below override the corresponding values in the config file: the `--paths.*` options point to the converted corpora, and `--training.patience 1000` replaces the value of 10000 set in the config.

In [None]:
!python -m spacy train config.cfg \
        --output=./pubmed \
        --paths.train corpus/train.spacy \
        --paths.dev corpus/valid.spacy \
        --gpu-id 0 \
        --training.patience 1000
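
The `patience` setting controls early stopping: training ends once the best evaluation score is at least `patience` steps old. The following is a simplified sketch of that rule, not spaCy's training loop; `stopping_step` is a hypothetical helper and the scores are made up. With `eval_frequency = 1000` and a patience of 1000, it suggests that training stops at the first evaluation that fails to improve on the best score.

```python
from typing import Dict


def stopping_step(scores: Dict[int, float], patience: int, max_steps: int) -> int:
    """Return the step at which training would stop (simplified sketch)."""
    results = []
    for step in range(max_steps + 1):
        if step in scores:  # an evaluation happened at this step
            results.append((scores[step], step))
        if results and patience:
            # Prefer the earliest step among equally good scores.
            _, neg_step = max((score, -s) for score, s in results)
            best_step = -neg_step
            if step - best_step >= patience:
                return step
    return max_steps


# Made-up evaluation scores, one every 1000 steps.
scores = {0: 0.0, 1000: 0.5, 2000: 0.6, 3000: 0.58, 4000: 0.7}
print(stopping_step(scores, patience=1000, max_steps=5000))  # 3000
print(stopping_step(scores, patience=2000, max_steps=5000))  # 5000
```

Note that a short patience can stop training before a later improvement is seen, as the score at step 4000 in this toy example shows.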

## Evaluating the model

In [None]:
!python -m spacy evaluate pubmed/model-best corpus/test.spacy --gpu-id 0
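
The evaluation reports entity-level precision, recall, and F-score (`ents_p`, `ents_r`, `ents_f`, the last of which is the metric weighted in `[training.score_weights]`). The following is a minimal sketch of how such scores can be computed, assuming a predicted entity counts as correct only if its boundaries and label both match a gold entity exactly; `entity_prf` and the spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: one exact match, one boundary error, one missed entity.
gold = {(0, 8, "Chemical"), (22, 35, "Disease"), (40, 49, "Chemical")}
pred = {(0, 8, "Chemical"), (22, 30, "Disease")}
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```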