
GH-1193: Document-level sequence labeling #1194

Merged

merged 2 commits into master from GH-1193-doc-sequence on Oct 8, 2019

Conversation

@alanakbik (Collaborator) commented Oct 8, 2019

This PR introduces the option of reading entire documents into a single Sentence object for sequence labeling. It is now supported for the CONLL_03, CONLL_03_GERMAN and CONLL_03_DUTCH datasets, all of which indicate document boundaries.
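To illustrate what "document boundaries" means here, the sketch below (not Flair's actual implementation, just a minimal stand-in) groups CoNLL-style token lines into documents using the `-DOCSTART-` marker that the CoNLL-03 files use, so each document becomes one token sequence instead of many sentence-level ones:

```python
# Minimal sketch: split CoNLL-style lines into per-document token
# sequences using the -DOCSTART- boundary marker. With
# document_as_sequence=True, blank lines (sentence breaks) are ignored
# so the whole document forms a single sequence.

def read_documents(lines):
    documents, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith('-DOCSTART-'):
            # a new document starts; flush the previous one
            if current:
                documents.append(current)
            current = []
        elif line:
            # token line: "token pos chunk ner-tag"
            current.append(tuple(line.split()))
        # blank lines are skipped, merging sentences into one sequence
    if current:
        documents.append(current)
    return documents

sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "German JJ B-NP B-MISC",
    "-DOCSTART- -X- -X- O",
    "",
    "Peter NNP B-NP B-PER",
]

docs = read_documents(sample)
print(len(docs))      # 2 documents
print(len(docs[0]))   # 3 tokens in the first document
```

The real loaders do more (column mapping, tag schemes), but the boundary logic is the essential difference from sentence-level reading.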

Here's how to train a model on CoNLL-03 on the document level:

# read CoNLL-03 with document_as_sequence=True
from flair.datasets import CONLL_03

corpus = CONLL_03(in_memory=True, document_as_sequence=True)

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# init simple tagger with GloVe embeddings
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings('glove'),
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

# initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train(
    'path/to/your/experiment',
    # set a much smaller mini-batch size because documents are huge
    mini_batch_size=2,
)

@yosipk (Collaborator) commented Oct 8, 2019

👍

@alanakbik (Collaborator, Author)

👍

@alanakbik alanakbik merged commit e8c0afe into master Oct 8, 2019
@alanakbik alanakbik deleted the GH-1193-doc-sequence branch October 8, 2019 15:25