Integrate a data checker/gatekeeper #26

felixbur · 2023-05-10T18:28:44Z

I keep having problems with datasets that contain wav files that are

It would be nice to have a flag that checks the data (i.e. train and devel) before processing and removes faulty files.
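A minimal sketch of what such a gatekeeper could look like, assuming "faulty" means a file that cannot be opened as WAV or contains no audio frames. The function names `is_readable_wav` and `split_valid_faulty` are hypothetical, not existing API. Note that the standard-library `wave` module only reads PCM WAV files, so other encodings (e.g. float) would also be flagged as faulty here.

```python
import wave
from pathlib import Path


def is_readable_wav(path):
    """Return True if the file opens as PCM WAV and holds at least one frame."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # wave.Error: not a valid WAV, EOFError: truncated/empty, OSError: unreadable
        return False


def split_valid_faulty(paths):
    """Split paths into (valid, faulty) lists, preserving input order."""
    valid, faulty = [], []
    for path in paths:
        (valid if is_readable_wav(path) else faulty).append(Path(path))
    return valid, faulty
```

The faulty list could be logged and dropped from the train and devel splits before any feature extraction starts.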

felixbur · 2023-05-11T14:16:47Z

Probably easiest to use a VAD for this, e.g. inaSpeechSegmenter.

felixbur · 2023-05-11T14:30:08Z

or a very simple approach of VAD like here https://github.com/marsbroshok/VAD-python

felixbur · 2023-07-11T11:32:33Z

> or a very simple approach of VAD like here https://github.com/marsbroshok/VAD-python

I tried that and it didn't really work: on some test samples with street noise and the like, it always detected a lot of speech.

felixbur · 2023-07-11T17:25:52Z

Started with version 0.53.0, added:
- check_size
- check_vad (with silero)

Not tested yet, though.
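For illustration, a size check could be as simple as the sketch below. What `check_size` actually measures is not stated in this thread; the sketch assumes a minimum file size in bytes, and both the function name `keep_large_enough` and the default of 1000 bytes are hypothetical.

```python
import os


def keep_large_enough(paths, min_bytes=1000):
    """Return the paths whose file size is at least min_bytes."""
    kept = []
    for path in paths:
        try:
            if os.path.getsize(path) >= min_bytes:
                kept.append(path)
        except OSError:
            # Missing or unreadable files are treated as faulty and dropped.
            pass
    return kept
```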

felixbur · 2023-07-11T17:26:35Z

Should also add check_samplerate.
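A rough sketch of what a sample rate check might do, assuming it compares each file's header against one expected rate. The function name `has_expected_samplerate` and the 16 kHz default are assumptions for illustration only; the standard-library `wave` module used here handles PCM WAV files only.

```python
import wave


def has_expected_samplerate(path, expected_hz=16000):
    """Return True if the WAV header reports the expected sample rate."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getframerate() == expected_hz
    except (wave.Error, EOFError, OSError):
        # Files that cannot be parsed count as failing the check.
        return False
```

Files that fail could either be dropped or resampled, depending on how strict the check is meant to be.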

felixbur added the enhancement New feature or request label May 10, 2023

felixbur changed the title ~~Integrate a data checker~~ Integrate a data checker/gatekeeper May 11, 2023
