Ground truth materials #49

Open
MikeSmithEU opened this issue Apr 3, 2019 · 1 comment
Labels: enhancement (Backlog idea or requirement for future discussions/development.)
Milestone: Backlog

Comments

@MikeSmithEU
Contributor

Two freely available candidate datasets have already been identified; more are welcome:

  1. Mozilla Common Voice https://voice.mozilla.org/en
    CC-0 license
  2. OpenSLR resources http://openslr.org/resources.php
    Each resource has its own license, ranging from "unrestricted" to "CC-BY-NC-ND 3.0"
    Remark: some of the OpenSLR data has likely been used to train various STT systems, so it may not always be the fairest indicator

Open questions:

  • Which ground truth materials should we use to evaluate vendors' solutions: existing datasets, a dataset we build ourselves, or both?
  • How will we include these resources in our product?
@MikeSmithEU MikeSmithEU added the enhancement Backlog idea or requirement for future discussions/development. label Apr 3, 2019
@MikeSmithEU MikeSmithEU added this to the Backlog milestone Apr 3, 2019
@MikeSmithEU
Contributor Author

Suggestion: the user of our benchmark should have the choice of which data to use (with a sensible default, following the 'Extensibility', 'Specific over generic' and 'Pragmatism' principles).
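
To make the suggestion concrete, here is a minimal sketch of what "user's choice with a sensible default" could look like. Everything in it is hypothetical: the `Dataset` class, the `DATASETS` registry, `register_dataset`, `select_dataset` and the keys are illustration only, not existing code in this project, and picking Common Voice as the default is an assumption (this issue does not decide on one).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a ground truth dataset."""
    name: str
    url: str
    license: str


# Hypothetical registry, seeded with the two datasets named in this issue.
DATASETS: Dict[str, Dataset] = {
    "common-voice": Dataset(
        "Mozilla Common Voice", "https://voice.mozilla.org/en", "CC-0"
    ),
    "openslr": Dataset(
        "OpenSLR resources",
        "http://openslr.org/resources.php",
        "varies per resource",
    ),
}

# Assumption: the default is not decided in this issue; chosen for illustration.
DEFAULT_DATASET = "common-voice"


def register_dataset(key: str, dataset: Dataset) -> None:
    """Extensibility: let users plug in their own ground truth material."""
    DATASETS[key] = dataset


def select_dataset(key: Optional[str] = None) -> Dataset:
    """Return the requested dataset, or the default when none is given."""
    chosen = DEFAULT_DATASET if key is None else key
    if chosen not in DATASETS:
        raise ValueError(
            f"Unknown dataset {chosen!r}; available: {sorted(DATASETS)}"
        )
    return DATASETS[chosen]
```

A user who does not care calls `select_dataset()` and gets the default; a user with their own material calls `register_dataset("own", Dataset(...))` first and then selects it by key.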
