The class STACKOVERFLOW_NER is taking some metadata in the corpus #2228

symeneses · 2021-04-18T12:17:18Z

Describe the bug
The class STACKOVERFLOW_NER loads the lines that identify questions and answers (metadata) into the corpus as if they were sentences. The entity labels also need some cleaning to match those used in the authors' paper.

To Reproduce

from flair.data import Corpus
from flair.datasets import STACKOVERFLOW_NER

corpus: Corpus = STACKOVERFLOW_NER()
print(corpus)

Corpus: 14545 train + 4607 dev + 4940 test sentences

The corpus has more sentences than reported in the paper. Looking inside the datasets in the corpus, we can see that they contain metadata.

corpus.train[0:3]

[Sentence: "Question_ID : 37985879" [− Tokens: 3],
Sentence: "Question_URL : https://stackoverflow.com/questions/37985879/" [− Tokens: 3],
Sentence: "If I would have 2 tables" [− Tokens: 6 − Token-Labels: "If I would have 2 tables <S-Data_Structure>"]]
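
As a rough illustration of the cleaning needed, the sketch below filters out sentences that start with a metadata prefix. The helper names `is_metadata` and `drop_metadata` are hypothetical and not part of flair, and the prefix list contains only the two markers visible in the output above; the raw files may use other markers, so treat the list as an assumption to check against the data.

```python
# Hypothetical helpers, not part of flair.
# Assumption: metadata lines start with one of these prefixes. Only the two
# prefixes seen in the output above are listed; the dataset may have more.
METADATA_PREFIXES = ("Question_ID", "Question_URL")


def is_metadata(text: str) -> bool:
    """Return True if the sentence text starts with a known metadata prefix."""
    return text.startswith(METADATA_PREFIXES)


def drop_metadata(texts):
    """Keep only the sentence texts that are not metadata lines."""
    return [t for t in texts if not is_metadata(t)]


sample = [
    "Question_ID : 37985879",
    "Question_URL : https://stackoverflow.com/questions/37985879/",
    "If I would have 2 tables",
]

print(drop_metadata(sample))  # ['If I would have 2 tables']
```

On a loaded corpus, the same check could presumably be applied to the text of each sentence, for example via `sentence.to_plain_string()`, but filtering when the files are read is likely the cleaner fix.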

Expected behavior
The summary of the corpus should be:

print(corpus)

Corpus: 9263 train + 2896 dev + 3108 test sentences

These values match the number of sentences obtained when processing the data with the paper authors' code.

Environment:

OS: Debian GNU/Linux 10 (buster)
Version: 0.8

symeneses added the bug label Apr 18, 2021

symeneses mentioned this issue Apr 18, 2021

Stackoverflow NER dataset cleaning #2229

Merged

alanakbik closed this as completed in #2229 Apr 19, 2021
