Treat entire document as sequence #1193
Comments
This is definitely a nice feature to develop! You might be interested in the paper that Valentin Barrière and I released on named entity recognition in long documents, specifically French court decisions: May I Check Again? -- A simple but efficient way to generate and use contextual dictionaries for Named Entity Recognition. Application to French Legal Texts
@mauryaland interesting paper! Did you try this method on publicly available datasets like CoNLL-03?
My colleague is currently working on it; I'll let you know the results!
Thanks!
GH-1193: Document-level sequence labeling
@mauryaland can you publish your results here? (if it's OK with the Court)
There was a slight improvement on CoNLL-03, but I have only done one experiment run so far, so I am not sure if it is significant. I'll do a 5x run to get better numbers soon and report them here! However, I think with sequences of this length it might make sense to add attention to the sequence tagger so that it can more explicitly learn document-level entity tags.
@pommedeterresautee No problem, I will post results here, probably next week! @alanakbik Adding attention is indeed a good idea!
In sequence labeling, we currently split documents at the sentence level and shuffle sentences at each epoch during training. However, some information may carry over between sentences in the same document; for instance, an entity might be mentioned several times in different sentences of the same document.
Similarly, some datasets like CoNLL-03 explicitly mark up document boundaries with the -DOCSTART- token.
Task: Create a modified dataset reader for sequence labeling datasets that lets us provide a document separator token so that each full document is read into a single sequence object.
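A minimal sketch of what such a reader could look like, assuming whitespace-separated CoNLL-style column files with the token in the first column and the tag in the last. The function name `read_documents` and the `document_separator` parameter are hypothetical and not part of this project's API; the sketch only illustrates grouping all sentences between separator lines into one sequence.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (surface form, tag)


def read_documents(
    lines: Iterable[str],
    document_separator: str = "-DOCSTART-",  # hypothetical parameter name
) -> Iterator[List[Token]]:
    """Yield one token sequence per document instead of one per sentence."""
    document: List[Token] = []
    for raw_line in lines:
        fields = raw_line.split()
        if not fields:
            # Blank line: a sentence boundary, ignored at document level.
            continue
        if fields[0] == document_separator:
            # Separator line: emit the document collected so far, if any.
            if document:
                yield document
            document = []
            continue
        # Assumption: first column is the token, last column is the tag.
        document.append((fields[0], fields[-1]))
    if document:
        # Emit the final document, which has no trailing separator.
        yield document


# Illustrative input in CoNLL-03 style (made-up snippet, two documents).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER

-DOCSTART- -X- -X- O

LONDON NNP B-NP B-LOC
"""

docs = list(read_documents(sample.splitlines()))
```

Here the two sentences of the first document end up in a single sequence. A real reader would probably also need to keep sentence boundaries available, since document-length sequences may be too long for some embeddings.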