Stackoverflow NER dataset cleaning #2229

Merged
merged 3 commits on Apr 19, 2021

symeneses
Contributor

This PR adds entity mappings, removes sentences that contain metadata, and adds question/answer counts.
Fixes #2228.
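
For readers unfamiliar with the idea, an entity mapping can be sketched roughly as follows. This is only an illustration: the helper name `remap_tag` and the entries in `EXAMPLE_MAPPING` are made up for this example and are not taken from the PR or from flair's code.

```python
from typing import Dict

# Hypothetical mapping from fine-grained labels to coarser ones.
# These entries are placeholders, not the mapping used in the PR.
EXAMPLE_MAPPING: Dict[str, str] = {
    "Library_Function": "Function",
    "Function_Name": "Function",
}


def remap_tag(tag: str, mapping: Dict[str, str]) -> str:
    """Map a tag to its coarser label, keeping a prefix such as B-, I-, S-."""
    if "-" not in tag:
        # Plain labels (including "O") are looked up directly.
        return mapping.get(tag, tag)
    prefix, label = tag.split("-", 1)
    return f"{prefix}-{mapping.get(label, label)}"


if __name__ == "__main__":
    for tag in ["B-Library_Function", "S-Data_Structure", "O"]:
        print(tag, "->", remap_tag(tag, EXAMPLE_MAPPING))
```

Labels that do not appear in the mapping are returned unchanged, so the mapping only needs to list the labels that should be merged.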

Expected Results:

from flair.data import Corpus
from flair.datasets import STACKOVERFLOW_NER

corpus: Corpus = STACKOVERFLOW_NER()
print(corpus)

Corpus: 9263 train + 2896 dev + 3108 test sentences

corpus.train[0:3]

[Sentence: "If I would have 2 tables" [− Tokens: 6 − Token-Labels: "If I would have 2 tables <S-Data_Structure>"],
Sentence: "How do I get this result" [− Tokens: 6],
Sentence: "The following query needs to be adjusted , but I dont know how" [− Tokens: 13]]

Notes:
It would be nicer to extend the comment_symbol parameter of the ColumnCorpus class, or to add a similar parameter, so that it also accepts a list. The new parameter would then take the values currently held in disallowed_list.
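
A minimal sketch of the proposed behaviour, written in plain Python rather than against flair's API: sentences in a column-format file are grouped by blank lines, and any sentence whose text contains one of the disallowed phrases is dropped. The function names and the phrase `"Question_ID :"` are assumptions for illustration only; they are not the actual contents of disallowed_list or an existing ColumnCorpus parameter.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group column-format lines into sentences separated by blank lines."""
    block: List[str] = []
    for line in lines:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:
        yield block


def drop_disallowed(
    sentences: Iterable[List[str]], disallowed: List[str]
) -> Iterator[List[str]]:
    """Yield only sentences whose token text contains no disallowed phrase."""
    for block in sentences:
        # The first column of each row is assumed to hold the token.
        text = " ".join(row.split()[0] for row in block)
        if not any(phrase in text for phrase in disallowed):
            yield block


if __name__ == "__main__":
    sample = [
        "Question_ID O", ": O", "123 O", "",
        "If O", "I O", "have O", "tables S-Data_Structure", "",
    ]
    # Placeholder phrase, standing in for the entries of disallowed_list.
    for sentence in drop_disallowed(read_sentences(sample), ["Question_ID :"]):
        print(sentence)
```

Filtering whole sentences rather than single lines keeps the remaining blocks well-formed, which a single comment_symbol prefix check on individual lines would not guarantee.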

@alanakbik
Collaborator

@symeneses thanks a lot for spotting and fixing this!

Good idea with extending the comment_symbol into a disallowed_list! Care to do a PR for this? ;)

@alanakbik alanakbik merged commit d40e96c into flairNLP:master Apr 19, 2021
@symeneses symeneses deleted the fix/stackoverflow-ner branch April 19, 2021 17:04
@symeneses
Contributor Author

I could do that and then use it to clean this particular dataset. 👍🏽

Successfully merging this pull request may close these issues.

The class STACKOVERFLOW_NER is taking some metadata in the corpus