I am working on an already developed NER project in which my colleague used only a train.txt file in the FLAIR workflow.
The train file contains 2022 sentences. When I printed my Corpus object to see how FLAIR handles this case, I got the following output:
Corpus: 2022 train + 225 dev + 250 test sentences
1. Does that mean that FLAIR automatically splits the sentences into TRAIN/DEV/TEST sets?
2. If so, how does FLAIR make the split: does it pick random samples of each entity class? How does it ensure that each named entity is proportionally represented across the TRAIN/DEV/TEST split?
3. Is there an empirically recommended ratio, both for the sentence split and for the representation of each entity in the splits?
Example sentence split: 70% of the sentences in train.txt, 20% in test.txt, and 10% in dev.txt.
Example entity representation ratio: each named entity should be sufficiently represented in each file (splitting the overall annotated text for each entity), e.g. 70% of the annotated tokens for a given entity in train.txt, 20% in test.txt, and 10% in dev.txt.
Thank you in advance for your insights.
Hello @Bilel-tn, yes, Flair automatically creates a 10% split for test and then another 10% split for dev if no test and dev splits are explicitly passed to the corpus object. Importantly, it samples these splits randomly, so each time you load the corpus, different sentences may end up in the test and dev splits.
If you do not want this behavior, you can set a seed first:
import flair

flair.set_seed(123)
In this case, each time you load the corpus, the same sentences are sampled for test and dev.
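The behavior described above can be sketched with a small stand-in using only the standard library (this is a simplified illustration, not Flair's actual implementation, and its rounding details may differ): test is sampled from the full data, then dev is sampled from what remains, and the rest becomes train.

```python
import random

def split_corpus(sentences, test_frac=0.1, dev_frac=0.1, seed=None):
    """Roughly mimic Flair's default splitting: sample a test split from
    the full data, then a dev split from the remainder; the rest is train.
    A fixed seed makes the split reproducible across runs."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_dev = round(len(rest) * dev_frac)
    dev, train = rest[:n_dev], rest[n_dev:]
    return train, dev, test

# Starting from 2497 sentences, 10% + 10%-of-the-rest reproduces the
# 2022 train / 225 dev / 250 test counts printed above.
sentences = [f"sentence {i}" for i in range(2497)]
train, dev, test = split_corpus(sentences, seed=123)
print(len(train), len(dev), len(test))  # → 2022 225 250
```

Note that the sampling is purely random over sentences: nothing here (or in the default behavior described above) stratifies by entity class.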
Other ratios are currently not supported, so if you'd like a 20% split instead, or a split that satisfies certain entity statistics, you'd have to perform the split manually and create separate text files for the test and dev sentences.
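A manual 70/20/10 split could be done along these lines (a sketch using only the standard library; the `manual_split` and `write_split` helpers are hypothetical names, and the code assumes each sentence is a list of CoNLL-style token lines):

```python
import random

def manual_split(sentences, train_frac=0.7, test_frac=0.2, seed=123):
    """Shuffle and split sentences 70/20/10 into train/test/dev;
    dev receives whatever remains after train and test are taken."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    dev = shuffled[n_train + n_test:]
    return train, dev, test

def write_split(sentences, path):
    """Write sentences (each a list of CoNLL lines) separated by blank lines."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            f.write("\n".join(sent) + "\n\n")

# Example: split dummy single-token CoNLL sentences 70/20/10.
sents = [[f"token{i}\tO"] for i in range(100)]
train, dev, test = manual_split(sents)
print(len(train), len(test), len(dev))  # → 70 20 10
```

The resulting files can then be passed explicitly when constructing the corpus (e.g. via the `train_file`, `test_file`, and `dev_file` parameters of `ColumnCorpus`), so Flair performs no automatic splitting.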