Will remove sentence clip count for clip selection for validation #4116
@raivisdejus very interesting find, I didn't know there was such a limit in the code. This also explains an issue I posted a while back. On the other hand, your locale has a more pressing issue that led to this: the number of sentences in your text corpus is very low and you have many users recording, so each sentence gets recorded too many times. AFAIK, this does not help much for SotA models. One exception would be wake-word or command-like applications, where it is better for several words to be recorded by more than a thousand people (and the CV code run by Mozilla is not well suited to implementing these). Otherwise, a synthetic limit of 15 at this stage is very interesting indeed. It will cause many hours of volunteer recording effort to be wasted.
I have adjusted the code so that the clip recording limit is applied to sentence selection for recordings. We definitely need this solution for Latvian, but I think it is a good solution for all other locales as well.
@raivisdejus please check this merged PR: https://github.com/common-voice/common-voice/pull/4127/files
@HarikalarKutusu This still leaves some sentences unvalidatable, for example if a sentence has 16 recordings. For future recordings we will add more sentences, but some solution is needed to get already recorded sentences to validation.
Yes, I'm aware of that. I'm not in a position to comment on this, but it might be related to the dataset health.
Hey @raivisdejus, thank you for your effort. I cannot merge that PR as is, since there is still a problem with the query that retrieves clips for validation, which I am investigating right now. Latvian should have many more clips to validate. Also, there is already a limit on how many clips a sentence can gather, and each of those clips can get a vote. Here is a short excerpt from the Latvian clips corpus:
So the |
@moz-dfeller, more generally, I'm sure many languages have such recordings stuck in their "other" buckets. The team might consider removing the limit for a while, e.g. in parallel with a global campaign, and then putting it back, to make those clips available in datasets and clean out the "other" bucket.
Great to see progress! And thanks for looking into this. For Latvian, we were running campaigns where a large number of people came to record on a single day. On that day, they recorded a lot from the cached set of clips that had been selected for that day. Because the overall number of sentences was small, we ended up with too many recordings per sentence. A limit of 16 will help, but IMHO any limit on validation clip retrieval may run into problems at some point. The main problem for Latvian will be fixed by adding more sentences to record. We are working on this.
@raivisdejus I think you are right. I thought we would only retrieve sentences that have fewer than 15-16 clips, but we do not. So if we put a limit on validation, there will always be clips that never get validated. @raivisdejus, @HarikalarKutusu
That would be a wonderful solution! Thank you!
I believe that the current models perform well with 5 clips per sentence (somebody more knowledgeable please correct me), but I assume that with such a low limit many languages would be saturated with clips relatively fast, which would discourage further contributions (especially during events like the ones @raivisdejus mentioned). Saving too many clips, on the other hand, wouldn't necessarily increase the model's performance significantly beyond a certain threshold, and it would also increase storage and traffic costs. So 15 clips seems like a sensible number until proven otherwise :)
You are right. I haven't seen any scientific paper on that, though...
@HarikalarKutusu @moz-dfeller I agree with the clip limit per sentence. No question there. Also, more Latvian sentences are on the way. Some were already added after the v14 data set was released. No question about that either. What are your thoughts on moving the limit to the sentence selection for recording, so we do not get into a state where unusable sentences get recorded? With all of the above acknowledged, I would still like a solution where a big chunk of Latvian sentences is not stuck in an unusable state. Voice tool creators need to filter the dataset anyway, because some people record thousands of sentences. If the dataset is used without any filtering and processing, those super recorders will also influence overall speech recognition quality.
That's the job of the splitting algorithm.
Issue fixed by #4131. Thanks, everybody! Kudos and extra karma points as well as warmest greetings from Latvia :)
Pull Request Form
Type of Pull Request
This pull request removes the 15-clip limit per sentence that is currently applied when clips are selected for validation. This will let all recorded clips be validated.
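To illustrate why a per-sentence limit at validation time strands recordings, here is a minimal in-memory sketch in TypeScript. It is not the project's actual query, which is SQL inside `db.ts`. The `Clip` shape, the function names, and the exact comparison (`<=` against 15) are assumptions made for illustration only.

```typescript
// Hypothetical in-memory model; the real selection is an SQL query in db.ts.
interface Clip {
  id: number;
  sentenceId: string;
}

// Assumed value, taken from the discussion in this pull request.
const MAX_CLIPS_PER_SENTENCE = 15;

// Count how many clips have been recorded for each sentence.
function countClipsPerSentence(clips: Clip[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const clip of clips) {
    counts.set(clip.sentenceId, (counts.get(clip.sentenceId) || 0) + 1);
  }
  return counts;
}

// Assumed old behaviour: clips whose sentence exceeds the limit are never
// offered for validation, so all of that sentence's recordings are stranded.
function validationCandidatesWithLimit(clips: Clip[]): Clip[] {
  const counts = countClipsPerSentence(clips);
  return clips.filter(
    (clip) => (counts.get(clip.sentenceId) || 0) <= MAX_CLIPS_PER_SENTENCE
  );
}

// Behaviour proposed here: every recorded clip stays eligible for validation.
function validationCandidatesWithoutLimit(clips: Clip[]): Clip[] {
  return [...clips];
}
```

Under these assumptions, a sentence with 16 recordings contributes no clips to the first function and all 16 to the second, which matches the case raised in the conversation.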
If a hard limit on the maximum number of clips per sentence is deemed necessary, it can be introduced in the query that selects sentences for recording. This way, sentences that already have enough recorded clips would not be presented to users.
https://github.com/common-voice/common-voice/blob/main/server/src/lib/model/db.ts#L313
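The idea of enforcing the cap on the recording side can be sketched as follows. This is a simplified in-memory TypeScript illustration, not the SQL in `db.ts`. The `SentenceStats` type, the `sentencesForRecording` function, the default cap of 15, and the choice to offer the least-recorded sentences first are all assumptions rather than the project's actual behaviour.

```typescript
// Hypothetical shape: a sentence together with the number of clips it already has.
interface SentenceStats {
  id: string;
  text: string;
  clipCount: number;
}

// Assumed cap, mirroring the number discussed in this pull request.
const RECORDING_CLIP_CAP = 15;

// Pick up to `wanted` sentences to present for recording, skipping any
// sentence that has already reached the cap. Offering the least-recorded
// sentences first is an illustrative choice, not confirmed project behaviour.
function sentencesForRecording(
  sentences: SentenceStats[],
  wanted: number,
  cap: number = RECORDING_CLIP_CAP
): SentenceStats[] {
  return sentences
    .filter((s) => s.clipCount < cap) // filter() copies, so the input is not mutated
    .sort((a, b) => a.clipCount - b.clipCount)
    .slice(0, wanted);
}
```

With the cap applied here, no sentence can collect more recordings than the cap allows, so the validation side needs no limit of its own and no recorded clip becomes unvalidatable.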