Treat entire document as sequence #1193

Closed
alanakbik opened this issue Oct 8, 2019 · 7 comments
Labels: feature (A new feature), sequence tagger (Related to sequence tagger)

Comments

@alanakbik
Collaborator

In sequence labeling, we currently split documents at the sentence level and shuffle sentences at each epoch during training. However, some information may carry over between sentences in the same document, for instance an entity might be mentioned several times in different sentences in the same document.

Similarly, some datasets like CoNLL-03 explicitly mark up document boundaries with the -DOCSTART- token.

Task: Create a modified dataset reader for sequence labeling datasets that lets us provide a document separator token so that each full document is read into a single sequence object.
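
A minimal sketch of what such a reader could look like, assuming a CoNLL-style column file where the token is in the first column, the tag in the last, and -DOCSTART- marks document boundaries; the function name and signature here are illustrative, not the eventual Flair API:

```python
from typing import List, Tuple


def read_documents_as_sequences(
    path: str,
    document_separator_token: str = "-DOCSTART-",
) -> List[List[Tuple[str, str]]]:
    """Return one (token, tag) sequence per document instead of per sentence."""
    documents: List[List[Tuple[str, str]]] = []
    current_document: List[Tuple[str, str]] = []

    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # blank line = sentence boundary: keep accumulating into the same document
                continue
            fields = line.split()
            if fields[0] == document_separator_token:
                # document boundary: flush the current document and start a new one
                if current_document:
                    documents.append(current_document)
                    current_document = []
                continue
            # first column is the token, last column is the tag (e.g. CoNLL-03 NER)
            current_document.append((fields[0], fields[-1]))

    if current_document:
        documents.append(current_document)
    return documents
```

Each returned sequence could then be wrapped into a single Sentence object, so that shuffling during training happens at the document level rather than the sentence level.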

alanakbik added the feature and sequence tagger labels on Oct 8, 2019
@mauryaland
Contributor

It is definitely a nice feature to develop! You might be interested in the paper that Valentin Barrière and I released about named entity recognition on long documents, specifically French court decisions: May I Check Again? -- A simple but efficient way to generate and use contextual dictionaries for Named Entity Recognition. Application to French Legal Texts

@alanakbik
Collaborator Author

@mauryaland interesting paper! Did you try this method on publicly available datasets like CoNLL-03?

@mauryaland
Contributor

My colleague is currently working on it; I'll let you know about the results!

@alanakbik
Collaborator Author

Thanks!

alanakbik pushed a commit that referenced this issue Oct 8, 2019
@pommedeterresautee
Contributor

@mauryaland can you publish your results here (if it's OK with the Court)?
I will do the same on my dataset and update the Flair / spaCy article accordingly.
spaCy has included some data augmentation on their side, which should improve their results out of the box (closer to what I get with pre-training on synthetic data).
@alanakbik have you noticed better scores on CoNLL 2003?

@alanakbik
Collaborator Author

There was a slight improvement on CoNLL-03, but I have only done one experiment run so far, so I am not sure whether it is significant. I'll do a 5x run to get better numbers soon and report them here!

However, I think with sequences of this length it might make sense to add attention to the sequence tagger so that it can more explicitly learn document-level entity tags.
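
A rough sketch of what such an attention layer could look like on top of the RNN outputs, written as a standalone PyTorch module rather than Flair's actual SequenceTagger code; the class name and tensor shapes are assumptions:

```python
import torch
import torch.nn as nn


class SelfAttentionOverRNN(nn.Module):
    """Multi-head self-attention over RNN outputs, so each token can attend to
    distant mentions in the same document before tag projection."""

    def __init__(self, hidden_size: int, num_heads: int = 4):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, rnn_out: torch.Tensor, padding_mask: torch.Tensor = None) -> torch.Tensor:
        # rnn_out: (batch, seq_len, hidden_size); padding_mask: (batch, seq_len), True = padding
        attended, _ = self.attention(rnn_out, rnn_out, rnn_out, key_padding_mask=padding_mask)
        # residual connection keeps the original contextual features
        return self.norm(rnn_out + attended)
```

The residual connection keeps the RNN's local context intact, while the attention heads let a token look at repeated mentions of the same entity elsewhere in the document.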

@mauryaland
Contributor

@pommedeterresautee No problem, I will post results here, probably next week!

@alanakbik Adding attention is a good idea indeed!
