Partitioning and representativeness of TRAIN / DEV/ TEST data for NER #2259

Bilel-tn · 2021-04-30T10:35:11Z

Hello,

I am working on an existing NER project in which my colleague used only a train.text file in the FLAIR workflow.
The train file contains 2022 sentences. When I printed my Corpus object to see how FLAIR handles this case, I got the following message:

Corpus: 2022 train + 225 dev + 250 test sentences

1. Does that mean that FLAIR automatically splits the sentences into TRAIN/DEV/TEST sets?

2. If so, how does FLAIR make the split? Does it pick random samples of each entity class? How does it ensure that each named entity is proportionally represented in the TRAIN/DEV/TEST split?

3. Is there an empirically recommended ratio for the sentence split and for the representativeness of the entities in each split?
Example sentence split: 70% of the sentences in train.txt, 20% in test.txt, and 10% in dev.txt.
Example representativeness ratio for each named entity: each entity should be sufficiently represented in each file (splitting the annotated text for each entity) as follows: 70% of the annotated tokens for a given entity in train.txt, 20% in test.txt, and 10% in dev.txt.

Thank you in advance for your insights.

alanakbik · 2021-04-30T11:50:37Z

Hello @Bilel-tn, yes, Flair automatically creates a 10% split for test and then another 10% split for dev if no test and dev splits are explicitly passed to the corpus object. Importantly, these splits are sampled randomly, so each time you load the corpus, different sentences might end up in the test and dev splits.

If you do not want this behavior, you can set a seed first:

import flair
flair.set_seed(123)

In this case, each time you load the corpus, the same sentences are sampled for test and dev.
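As a rough illustration of the behavior described above (not Flair's actual code), the sketch below imitates the two-step sampling in plain Python: 10% of the sentences go to test, then 10% of the remainder go to dev, with a seeded shuffle making the result repeatable. The function name `sample_splits`, the rounding to the nearest integer, and the total of 2497 sentences (inferred from 2022 + 225 + 250 in the printed corpus) are assumptions made for this example.

```python
import random


def sample_splits(n_sentences, seed):
    """Imitate the described behavior: 10% for test, then 10% of the rest for dev.

    This is an illustrative stand-in, not Flair's implementation.
    """
    rng = random.Random(seed)          # a fixed seed makes the shuffle repeatable
    indices = list(range(n_sentences))
    rng.shuffle(indices)

    test_size = round(n_sentences / 10)
    test, rest = indices[:test_size], indices[test_size:]

    dev_size = round(len(rest) / 10)   # 10% of what is left after removing test
    dev, train = rest[:dev_size], rest[dev_size:]
    return train, dev, test


# 2497 is an inferred total, chosen because it reproduces the printed corpus sizes.
train, dev, test = sample_splits(2497, seed=123)
print(len(train), len(dev), len(test))  # 2022 225 250

# The same seed yields the same test sentences; a different seed generally does not.
assert sample_splits(2497, seed=123)[2] == test
```

Under these assumptions the sizes match the `Corpus: 2022 train + 225 dev + 250 test sentences` message from the question, which suggests the original file held about 2497 sentences before sampling.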

Other ratios are currently not supported, so if you'd like a 20% split instead, or a split that satisfies certain entity statistics, you'd have to perform the split manually and create a separate text file for the test sentences.
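A manual split along these lines could be sketched as follows. This is a minimal example, not part of Flair: it assumes a column-formatted file with one token per line, the NER tag in the last column, and blank lines between sentences. All function names are hypothetical, and the default ratios follow the 70/10/20 train/dev/test example from the question.

```python
import random
from collections import Counter


def read_sentences(lines):
    """Group column-format lines into sentences separated by blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, ratios=(0.7, 0.1, 0.2), seed=123):
    """Shuffle with a fixed seed, then cut into train/dev/test by the given ratios."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_dev = round(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


def entity_counts(sentences):
    """Count tagged tokens per entity type (assumes tags such as B-PER or O in the last column)."""
    counts = Counter()
    for sentence in sentences:
        for line in sentence:
            tag = line.split()[-1]
            if tag != "O":
                counts[tag.split("-", 1)[-1]] += 1
    return counts


def write_sentences(sentences, path):
    """Write sentences back out in the same blank-line-separated column format."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write("\n".join(sentence) + "\n\n")
```

Comparing `entity_counts` across the three splits gives a quick check of how well each entity type is represented. This sketch only measures that balance and does not enforce it, so a skewed result would call for reshuffling with another seed or a stratified approach. The files written by `write_sentences` can then be passed explicitly as train, dev, and test splits when building the corpus, so that no automatic sampling takes place.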

Bilel-tn · 2021-05-06T14:21:12Z

Thanks for your valuable reply @alanakbik !

stale · 2021-09-03T14:48:42Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Bilel-tn added the question label on Apr 30, 2021

stale bot added the wontfix label on Sep 3, 2021

stale bot closed this as completed on Sep 10, 2021
