
GH-1193: Document-level sequence labeling #1194

Merged

merged 2 commits into master from GH-1193-doc-sequence on Oct 8, 2019

Conversation

@alanakbik (Collaborator) commented Oct 8, 2019

This PR introduces the option of reading entire documents into a single Sentence object for sequence labeling. It is now supported for the CONLL_03, CONLL_03_GERMAN and CONLL_03_DUTCH datasets, all of which indicate document boundaries.
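To illustrate what "document boundaries" means here, the sketch below (not Flair's actual implementation, just a minimal stand-in) groups CoNLL-style token lines into documents using the `-DOCSTART-` marker that the CoNLL-03 files use, so each document becomes one token sequence instead of many sentence-level ones:

```python
# Minimal sketch: split CoNLL-style lines into per-document token
# sequences using the -DOCSTART- boundary marker. With
# document_as_sequence=True, blank lines (sentence breaks) are ignored
# so the whole document forms a single sequence.

def read_documents(lines):
    documents, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith('-DOCSTART-'):
            # a new document starts; flush the previous one
            if current:
                documents.append(current)
            current = []
        elif line:
            # token line: "token pos chunk ner-tag"
            current.append(tuple(line.split()))
        # blank lines are skipped, merging sentences into one sequence
    if current:
        documents.append(current)
    return documents

sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "German JJ B-NP B-MISC",
    "-DOCSTART- -X- -X- O",
    "",
    "Peter NNP B-NP B-PER",
]

docs = read_documents(sample)
print(len(docs))      # 2 documents
print(len(docs[0]))   # 3 tokens in the first document
```

The real loaders do more (column mapping, tag schemes), but the boundary logic is the essential difference from sentence-level reading.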

Here's how to train a model on CoNLL-03 on the document level:

# read CoNLL-03 with document_as_sequence=True
from flair.datasets import CONLL_03

corpus = CONLL_03(in_memory=True, document_as_sequence=True)

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# init simple tagger with GloVe embeddings
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings('glove'),
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

# initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train(
    'path/to/your/experiment',
    # set a much smaller mini-batch size because documents are huge
    mini_batch_size=2,
)

@yosipk (Collaborator) commented Oct 8, 2019

👍

@alanakbik (Collaborator, Author)

👍

@alanakbik alanakbik merged commit e8c0afe into master Oct 8, 2019
@alanakbik alanakbik deleted the GH-1193-doc-sequence branch October 8, 2019 15:25