Integrate a data checker/gatekeeper #26

Open
felixbur opened this issue May 10, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@felixbur
Owner

I keep running into datasets that contain wav files that are

  • not 16 kHz mono wav
  • zero length
  • without any speech
  • too short

It would be nice to have a flag that checks the data before processing (i.e. train and devel) and removes the faulty files.
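
The format-related checks in the list above (everything except "no speech") could be sketched with the standard library alone. This is only an illustration, not the project's implementation: the names `wav_problems` and `keep_valid`, and the 0.5 s default for `min_seconds`, are assumptions made up for this example.

```python
import wave


def wav_problems(path, expected_rate=16000, expected_channels=1, min_seconds=0.5):
    """Return a list of problems found in a wav file; an empty list means it passed."""
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            n_frames = wav.getnframes()
    except (wave.Error, EOFError, OSError) as exc:
        # Empty, truncated, or non-PCM files end up here.
        return [f"unreadable: {exc}"]

    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    if n_frames == 0:
        problems.append("zero length")
    elif rate > 0 and n_frames / rate < min_seconds:
        problems.append(f"too short: {n_frames / rate:.3f} s < {min_seconds} s")
    return problems


def keep_valid(paths, **kwargs):
    """Keep only the files for which wav_problems() reports nothing."""
    return [p for p in paths if not wav_problems(p, **kwargs)]
```

Returning the list of problems instead of a plain boolean makes it possible to log why a file was dropped before it is removed from train or devel.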

@felixbur felixbur added the enhancement New feature or request label May 10, 2023
@felixbur
Owner Author

It is probably easiest to use a VAD (voice activity detector) for this, e.g. inaSpeechSegmenter.

@felixbur felixbur changed the title Integrate a data checker Integrate a data checker/gatekeeper May 11, 2023
@felixbur
Owner Author

Or a very simple VAD approach like this one: https://github.com/marsbroshok/VAD-python

@felixbur
Owner Author

> or a very simple approach of VAD like here https://github.com/marsbroshok/VAD-python

I tried that and it didn't really work: on some test samples with street noise and the like, it always reported a lot of speech.
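
One plausible reason for this kind of result is that detectors relying mainly on signal energy cannot tell loud noise from speech. The sketch below is a generic energy threshold, not the algorithm of the linked repository; the synthetic white noise stands in for street noise, and the function names, frame length, and threshold are assumptions chosen for illustration.

```python
import math
import random


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_speech_ratio(samples, frame_len=320, threshold=0.02):
    """Fraction of frames a naive energy threshold would label as speech."""
    energies = frame_rms(samples, frame_len)
    if not energies:
        return 0.0
    return sum(e > threshold for e in energies) / len(energies)


rng = random.Random(0)
# One second at 16 kHz of synthetic noise that contains no speech at all.
street_noise = [rng.uniform(-0.3, 0.3) for _ in range(16000)]
silence = [0.0] * 16000

print(energy_speech_ratio(street_noise))  # every frame exceeds the threshold
print(energy_speech_ratio(silence))       # no frame exceeds the threshold
```

Because the noise is loud in every frame, the threshold labels all of it as speech, which matches the behaviour described above.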

@felixbur
Owner Author

Started with version 0.53.0. Added:

  • check_size
  • check_vad (with Silero)

Not tested yet, though.
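
For illustration, a speech check based on Silero VAD might look roughly like the sketch below. It is not the code behind check_vad: `make_silero_detector` and `split_by_speech` are hypothetical names, the model is loaded through `torch.hub` as described in the Silero VAD README, and running the detector assumes `torch` and `torchaudio` are installed and the model can be downloaded.

```python
def make_silero_detector(sampling_rate=16000):
    """Build a has_speech(path) callable backed by Silero VAD (hypothetical helper)."""
    import torch  # imported lazily so split_by_speech() works without torch

    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    get_speech_timestamps = utils[0]
    read_audio = utils[2]

    def has_speech(path):
        audio = read_audio(str(path), sampling_rate=sampling_rate)
        # An empty list of timestamps means no speech segment was detected.
        return len(get_speech_timestamps(audio, model, sampling_rate=sampling_rate)) > 0

    return has_speech


def split_by_speech(paths, has_speech):
    """Partition paths into (kept, dropped) using any has_speech(path) -> bool callable."""
    kept, dropped = [], []
    for path in paths:
        (kept if has_speech(path) else dropped).append(path)
    return kept, dropped
```

Keeping the detector behind a plain callable means the filtering step can be tested with a stub, and a different VAD can be swapped in later without touching the filter.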

@felixbur
Owner Author

check_samplerate should be added as well.
