Stackoverflow NER dataset cleaning #2229

symeneses · 2021-04-18T12:41:08Z

It adds entity mappings, removes sentences with metadata and adds questions/answers counts.
It fixes #2228.

Expected Results:

from flair.data import Corpus
from flair.datasets import STACKOVERFLOW_NER

corpus: Corpus = STACKOVERFLOW_NER()
print(corpus)

Corpus: 9263 train + 2896 dev + 3108 test sentences

corpus.train[0:3]

[Sentence: "If I would have 2 tables" [− Tokens: 6 − Token-Labels: "If I would have 2 tables <S-Data_Structure>"],
Sentence: "How do I get this result" [− Tokens: 6],
Sentence: "The following query needs to be adjusted , but I dont know how" [− Tokens: 13]]

Notes:
It would be nicer to extend the comment_symbol parameter of the ColumnCorpus class, or to add a similar parameter, so that it also accepts a list. For this dataset, the new parameter would take the values currently in the list disallowed_list.
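
To illustrate the idea, here is a minimal sketch of list-based filtering on column-format data. It is not the code from this PR and does not use the ColumnCorpus API. The helper names read_sentences and drop_disallowed and the marker string in the usage example are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List, Sequence


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group column-format lines into sentences separated by blank lines."""
    sentence: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # flush the last sentence if the input has no trailing blank line
        yield sentence


def drop_disallowed(
    sentences: Iterable[List[str]], disallowed_list: Sequence[str]
) -> Iterator[List[str]]:
    """Yield only sentences whose text contains none of the disallowed strings.

    The sentence text is rebuilt from the first column (the token) of each row.
    """
    for sentence in sentences:
        text = " ".join(row.split()[0] for row in sentence)
        if any(marker in text for marker in disallowed_list):
            continue  # skip metadata-like sentences
        yield sentence


if __name__ == "__main__":
    # Hypothetical input: one metadata sentence followed by one real sentence.
    raw = [
        "Question_ID\tO", ":\tO", "123\tO", "",
        "If\tO", "I\tO", "tables\tS-Data_Structure", "",
    ]
    # "Question_ID :" is an illustrative marker, not necessarily one used in the PR.
    for kept in drop_disallowed(read_sentences(raw), ["Question_ID :"]):
        print(kept)
```

Matching on the joined sentence text rather than on a single leading symbol is what distinguishes this from a single comment_symbol: one list can cover several multi-token metadata markers.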

alanakbik · 2021-04-19T14:35:36Z

@symeneses thanks a lot for spotting and fixing this!

Good idea with extending the comment_symbol into a disallowed_list! Care to do a PR for this? ;)

symeneses · 2021-04-19T17:07:15Z

I could do that and then use it to clean this particular dataset. 👍🏽

symeneses added 3 commits April 18, 2021 14:04

add entity mapping

9c070a1

add data cleaning SO ner

f5e4b56

add summary corpus to log in SO ner

edb1ccd

alanakbik merged commit d40e96c into flairNLP:master Apr 19, 2021

symeneses deleted the fix/stackoverflow-ner branch April 19, 2021 17:04

symeneses mentioned this pull request May 2, 2021

Add banned sentences parameter in SequenceTagger #2262

Merged
