Partitioning and representativeness of TRAIN / DEV/ TEST data for NER #2259

Closed
Bilel-tn opened this issue Apr 30, 2021 · 3 comments
Labels
question Further information is requested wontfix This will not be worked on

Comments

@Bilel-tn

Hello,

I am working on an existing NER project in which my colleague passed only a train.txt file to the FLAIR workflow.
The train file contains 2022 sentences. When I printed my Corpus object to see how FLAIR handles such a case, I got the following message:

Corpus: 2022 train + 225 dev + 250 test sentences

1. Does that mean that FLAIR automatically splits the sentences into TRAIN/DEV/TEST sets?

2. If so, how does FLAIR make the split? Does it pick random samples of each entity class? How does it ensure that each named entity is represented proportionally in the TRAIN/DEV/TEST split?

3. Is there an empirically recommended ratio for the sentence split and for the representativeness of the entities in each split?
Example sentence split: 70% of the sentences in train.txt, 20% in test.txt, and 10% in dev.txt.
Example representativeness ratio for each named entity: each named entity should be sufficiently represented in each file (we split the overall annotated text related to each entity) as follows: 70% of the annotated tokens related to a given entity in train.txt, 20% in test.txt, and 10% in dev.txt.

Thank you in advance for your insights.

@Bilel-tn Bilel-tn added the question Further information is requested label Apr 30, 2021
@alanakbik
Collaborator

Hello @Bilel-tn, yes: if no test and dev splits are explicitly passed to the corpus object, Flair automatically samples 10% of the training data as the test split and then another 10% as the dev split. Importantly, these splits are sampled randomly, so each time you load the corpus, different sentences might end up in the test and dev splits.
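
To make the two-stage sampling concrete, here is a minimal sketch of the size arithmetic. It assumes the second 10% is taken from what remains after the test split and that sizes are rounded to the nearest integer; both are assumptions for illustration, and `sampled_split_sizes` is a hypothetical helper, not part of Flair. Under those assumptions, the counts printed above would correspond to a file holding 2497 sentences in total (an inference from the numbers, not something stated in the thread).

```python
from typing import Tuple


def sampled_split_sizes(n_sentences: int, fraction: float = 0.1) -> Tuple[int, int, int]:
    """Return (train, dev, test) sizes for a two-stage hold-out.

    Assumption: test is sampled first, dev is sampled from the remainder,
    and each size is rounded to the nearest integer.
    """
    test = round(n_sentences * fraction)
    remaining = n_sentences - test
    dev = round(remaining * fraction)
    train = remaining - dev
    return train, dev, test


print(sampled_split_sizes(2497))  # (2022, 225, 250)
```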

If you do not want this behavior, you can set a seed first:

import flair
flair.set_seed(123)

In this case, each time you load the corpus, the same sentences are sampled for test and dev.
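
The effect of seeding can be illustrated without Flair. The sketch below uses Python's standard `random` module as a stand-in for whatever sampling Flair performs internally; `sample_holdout` is a hypothetical helper, and the sizes 2497 and 250 are only illustrative.

```python
import random
from typing import List, Optional


def sample_holdout(n_items: int, holdout_size: int, seed: Optional[int] = None) -> List[int]:
    """Pick `holdout_size` distinct indices out of `n_items` at random.

    With a fixed seed the same indices are returned on every call;
    with seed=None the selection generally differs between calls.
    """
    rng = random.Random(seed)  # local generator, stand-in for a global seed
    indices = list(range(n_items))
    rng.shuffle(indices)
    return sorted(indices[:holdout_size])


first = sample_holdout(2497, 250, seed=123)
second = sample_holdout(2497, 250, seed=123)
print(first == second)  # True: same seed, same held-out sentences
```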

Other ratios are currently not supported, so if you'd like a 20% split instead, or a split that satisfies certain entity statistics, you'd have to perform the split manually and create a separate text file for the test sentences.
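
A manual split could look roughly like the sketch below. It assumes a column-format file with one token per line, a blank line between sentences, and BIO or BIOES tags in the last column; all function names are hypothetical helpers, and the 70/10/20 ratio is taken from the question, not a recommendation. The split is purely random per sentence, so `entity_counts` is included only to inspect how each entity type ends up distributed.

```python
import random
from collections import Counter
from pathlib import Path


def read_sentences(path):
    """Read a column-format file into a list of sentences (lists of lines)."""
    sentences, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            current.append(line)
        elif current:  # a blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, ratios=(0.7, 0.1, 0.2), seed=123):
    """Shuffle with a fixed seed and cut into (train, dev, test)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_dev = round(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])  # test receives the remainder


def entity_counts(sentences, tag_column=-1):
    """Count entity mentions per type (assumes B-/S- marks a mention start)."""
    counts = Counter()
    for sentence in sentences:
        for line in sentence:
            tag = line.split()[tag_column]
            if tag.startswith(("B-", "S-")):
                counts[tag[2:]] += 1
    return counts


def write_sentences(sentences, path):
    """Write sentences back in column format, blank line between sentences."""
    text = "\n\n".join("\n".join(sentence) for sentence in sentences)
    Path(path).write_text(text + "\n", encoding="utf-8")


# Tiny in-memory demo with made-up data: ten one-token sentences.
demo = [[f"token{i} B-PER"] for i in range(10)]
train, dev, test = split_sentences(demo)
print(len(train), len(dev), len(test))  # 7 1 2
```

On real data, one would call `read_sentences` on the original file, check `entity_counts` for each part, and write the parts out with `write_sentences`. The resulting files can then be passed to the corpus explicitly (for example through the `train_file`, `dev_file` and `test_file` arguments of `ColumnCorpus`), so that no automatic sampling is needed.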

@Bilel-tn
Author

Bilel-tn commented May 6, 2021

Thanks for your valuable reply @alanakbik !

@stale

stale bot commented Sep 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 3, 2021
@stale stale bot closed this as completed Sep 10, 2021