Treat entire document as sequence #1193

alanakbik · 2019-10-08T09:59:42Z

In sequence labeling, we currently split documents at the sentence level and shuffle the sentences at each epoch during training. However, some information may carry over between sentences in the same document. For instance, an entity might be mentioned several times in different sentences of the same document.

Similarly, some datasets like CoNLL-03 explicitly mark up document boundaries with the -DOCSTART- token.

Task: Create a modified dataset reader for sequence labeling datasets that lets us provide a document separator token so that each full document is read into a single sequence object.
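A minimal sketch of the idea, assuming a CoNLL-style column file with one token per line, the word in the first column, the tag in the last column, and blank lines between sentences. The function name `read_sequences` and its `document_separator_token` parameter are hypothetical names chosen for illustration; this is not the reader that was actually committed for this issue.

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag)


def read_sequences(
    lines: Iterable[str],
    document_separator_token: Optional[str] = None,
) -> List[List[Token]]:
    """Group column-format lines into sequences (hypothetical helper).

    Without a separator token, every blank line ends a sequence (sentence level).
    With a separator token, blank lines are ignored and a new sequence starts
    only at a line whose first column equals the separator (document level).
    """
    sequences: List[List[Token]] = []
    current: List[Token] = []

    for raw in lines:
        line = raw.strip()

        if not line:
            # Blank line: a sentence boundary, which only closes the
            # current sequence when we are reading sentence by sentence.
            if document_separator_token is None and current:
                sequences.append(current)
                current = []
            continue

        fields = line.split()

        if document_separator_token is not None and fields[0] == document_separator_token:
            # Document boundary: close the previous document, drop the marker.
            if current:
                sequences.append(current)
                current = []
            continue

        current.append((fields[0], fields[-1]))

    if current:
        sequences.append(current)
    return sequences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER

-DOCSTART- -X- -X- O

Germany NNP B-NP B-LOC
""".splitlines()

documents = read_sequences(sample, document_separator_token="-DOCSTART-")
sentences = read_sequences(sample)
```

With the separator set, `documents` holds one sequence per document, so the three sentences above collapse into two sequences. Without it, `sentences` keeps the sentence-level split, and the `-DOCSTART-` lines show up as one-token sequences of their own.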

mauryaland · 2019-10-08T12:35:43Z

This is definitely a nice feature to develop! You might be interested in the paper Valentin Barrière and I released on named entity recognition in long documents, specifically French court decisions: "May I Check Again? -- A simple but efficient way to generate and use contextual dictionaries for Named Entity Recognition. Application to French Legal Texts"

alanakbik · 2019-10-08T12:40:32Z

@mauryaland interesting paper! Did you try this method for publicly available datasets like CoNLL-03?

mauryaland · 2019-10-08T12:44:34Z

My colleague is currently working on it. I'll let you know the results!

alanakbik · 2019-10-08T12:51:36Z

Thanks!

GH-1193: Document-level sequence labeling

pommedeterresautee · 2019-10-10T06:58:36Z

@mauryaland can you publish your results here (if it's OK with the Court)?
I will do the same on my dataset and update the Flair / Spacy article accordingly.
Spacy has included some data augmentation on their side. That should improve their results out of the box (closer to what I get with pre-training on synthetic data).
@alanakbik have you noticed better scores on CoNLL-03?

alanakbik · 2019-10-10T08:07:35Z

There was a slight improvement on CoNLL-03, but I have only done one experiment run so far, so I am not sure whether it is significant. I'll do a 5x run to get better numbers soon and report them here!

However, I think that with sequences of this length it might make sense to add attention to the sequence tagger so that it can more explicitly learn document-level entity tags.

mauryaland · 2019-10-10T12:22:44Z

@pommedeterresautee No problem, I will post results here, probably next week!

@alanakbik Adding attention is indeed a good idea!

alanakbik added the feature and sequence tagger labels Oct 8, 2019

alanakbik pushed a commit that referenced this issue Oct 8, 2019

GH-1193: add parameter to read document as sequence

bf7d536

alanakbik pushed a commit that referenced this issue Oct 8, 2019

GH-1193: add parameter to choose encoding

3ae2e7e

alanakbik pushed a commit that referenced this issue Oct 8, 2019

Merge pull request #1194 from zalandoresearch/GH-1193-doc-sequence

e8c0afe

GH-1193: Document-level sequence labeling

alanakbik closed this as completed Oct 8, 2019
