
Will remove sentence clip count for clip selection for validation #4116

Closed

Conversation

raivisdejus
Contributor

Pull Request Form

Type of Pull Request

  • Related to a listed issue

This pull request removes the 15-clip limit per sentence that is currently applied when clips are selected for validation. This will let all recorded clips be validated.

If a hard limit on the maximum number of clips per sentence is deemed necessary, it can be introduced in the query that selects sentences for recording. This way, sentences that already have enough recorded clips would not be presented to users.
https://github.com/common-voice/common-voice/blob/main/server/src/lib/model/db.ts#L313
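The recording-side limit proposed above can be sketched in miniature. This is an illustrative SQLite example with invented table and column names (`sentences`, `clips`, `sentence_id`); the real query lives in `server/src/lib/model/db.ts` and runs against MySQL with a different schema:

```python
import sqlite3

MAX_CLIPS_PER_SENTENCE = 15  # assumed cap, mirroring the value discussed in this PR

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE clips (id INTEGER PRIMARY KEY, sentence_id INTEGER);
""")

# Sentence 1 already has 15 clips; sentence 2 has only 3.
conn.execute("INSERT INTO sentences VALUES (1, 'pilns teikums'), (2, 'jauns teikums')")
conn.executemany("INSERT INTO clips (sentence_id) VALUES (?)",
                 [(1,)] * 15 + [(2,)] * 3)

# Only offer sentences that still need recordings; saturated sentences are
# skipped at recording time, so no cap is needed later during validation.
rows = conn.execute("""
    SELECT s.id, s.text
      FROM sentences s
      LEFT JOIN clips c ON c.sentence_id = s.id
     GROUP BY s.id
    HAVING COUNT(c.id) < ?
""", (MAX_CLIPS_PER_SENTENCE,)).fetchall()

print(rows)  # only sentence 2 is offered for recording
```

With the cap enforced here, every clip that does get recorded can safely be surfaced for validation.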

@raivisdejus raivisdejus requested a review from a team as a code owner July 16, 2023 14:19
@raivisdejus raivisdejus requested review from data-sync-user and removed request for a team July 16, 2023 14:19
@HarikalarKutusu
Contributor

HarikalarKutusu commented Jul 16, 2023

@raivisdejus very interesting find; I didn't know that there was such a limit in the code. This also explains one issue I posted a while back.

On the other hand, with your locale, there is a more pressing issue that resulted in this. The number of sentences in your text-corpus is very low, and you have many users recording - which in turn results in a sentence being recorded too many times. AFAIK, for the SotA models, this does not help much. One exception would be wake-word or command-like applications where it is better for several words to be recorded by more than a thousand people (and CV code run by Mozilla is not very suitable to implement these).

On the other hand, a synthetic limit of 15 at this stage is very interesting indeed. It means many hours of volunteer recording effort get wasted.

@raivisdejus
Contributor Author

I have adjusted the code so that the clip recording limit is applied to sentence selection for recordings.
This will limit the number of clips a sentence can gather and still let users validate all the recorded sentences so no sentence is lost.

We definitely need this solution for Latvian, but I think it is a good solution for all other locales as well.


@raivisdejus
Contributor Author

@HarikalarKutusu This still leaves some sentences unvalidatable, for example if a sentence has 16 recordings.
For Latvian, about 50% of our recordings, roughly 50 hours, are inaccessible.

For future recordings we will add more sentences, but some solution is needed to get the already recorded clips into validation.

@HarikalarKutusu
Contributor

Yes, I'm aware of that. I'm not in a position to comment on this, but it might be related to the dataset health.

@moz-dfeller
Contributor

moz-dfeller commented Jul 25, 2023

Hey @raivisdejus , thank you for your effort. I cannot merge that PR as is, since there is still a problem with the query that retrieves clips for validation that I am investigating as of right now. Latvian should have many more clips to validate.

Also, there is already a limit on how many clips a sentence can gather, and every one of those clips can get a vote. Here is a short excerpt from the Latvian clips corpus:

| Total clips count | Validated clips count |
| --- | --- |
| 16 | 6 |
| 15 | 15 |
| 16 | 11 |
| 16 | 6 |
| 16 | 12 |
| 16 | 10 |
| 16 | 10 |
| 16 | 13 |
| 16 | 10 |
| 16 | 10 |
| 16 | 10 |
| 16 | 11 |
| 16 | 13 |
| 16 | 10 |
| 16 | 12 |
| 16 | 11 |
| 16 | 12 |
| 16 | 8 |
| 16 | 11 |
| 16 | 6 |
| 16 | 12 |

So the total clips count is the number of clips per sentence. As you can see, we have a maximum of 16 clips per sentence (probably an off-by-one error; I suppose it was meant to be 15), and on the other side the number of validated clips for the same sentence. The issue right now is that the clips missing validation are not returned. The fix could be as easy as changing the WHERE clause to use clips_count <= 16, but the JOIN also seems weird. I am working on it and the fix should be coming soon.
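The effect described above can be reproduced in miniature. This is an illustrative SQLite sketch with an invented schema (`sentences.clips_count`, `clips.is_valid` are assumptions, not the real Common Voice tables): with a `clips_count <= 15` filter in the validation query, every unvalidated clip of a 16-clip sentence becomes invisible to validators, while dropping the filter returns them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentences (id INTEGER PRIMARY KEY, clips_count INTEGER);
CREATE TABLE clips (id INTEGER PRIMARY KEY, sentence_id INTEGER, is_valid INTEGER);
""")

# A sentence that slipped past the limit with 16 clips, 6 of them validated.
conn.execute("INSERT INTO sentences VALUES (1, 16)")
conn.executemany("INSERT INTO clips (sentence_id, is_valid) VALUES (?, ?)",
                 [(1, 1)] * 6 + [(1, 0)] * 10)

def clips_to_validate(max_clips=None):
    """Fetch unvalidated clips, optionally filtered by a per-sentence cap."""
    where = "" if max_clips is None else f"AND s.clips_count <= {max_clips}"
    return conn.execute(f"""
        SELECT c.id FROM clips c
          JOIN sentences s ON s.id = c.sentence_id
         WHERE c.is_valid = 0 {where}
    """).fetchall()

print(len(clips_to_validate(max_clips=15)))  # 0 -- the 10 unvalidated clips are stuck
print(len(clips_to_validate()))              # 10 -- removing the cap frees them
```

This mirrors the Latvian situation in the table above: clips exist, but the cap in the retrieval query hides them from validation.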

@HarikalarKutusu
Contributor

@moz-dfeller, on the general scope, I'm sure many languages have such recordings stuck in their "other" buckets. The team might consider removing the limit for a while, e.g. in parallel with a global campaign, then putting it back, to make them available in datasets and clean the "other" bucket.
Just an idea...

@raivisdejus
Contributor Author

Great to see progress! And thanks for looking into this.

For Latvian, we were running campaigns where a large number of people came to record in one day. On such a day, they recorded heavily from the cached set of sentences selected for that day. As the overall number of sentences was not very big, we ended up with too many recordings per sentence.

A limit of 16 will help, but IMHO any limit on validation clip retrieval may run into problems at some point.

The main problem for Latvian will be fixed by adding more sentences to record. We are working on this.

@moz-dfeller
Contributor

A limit of 16 will help, but IMHO any limit on validation clip retrieval may run into problems at some point.

@raivisdejus I think you are right. I thought we would only retrieve sentences that have fewer than 15-16 clips, but we do not. So if we put a limit there, we will always have the problem that some clips will never be validated.

@raivisdejus , @HarikalarKutusu
I will probably remove the per-sentence clip limit for validation but add one for new recordings, as we don't need more than 15 clips per sentence. This should alleviate some of the problems we are currently experiencing.

@HarikalarKutusu
Contributor

I will probably remove the clips limit per sentence for validation but add one for new recordings as we don't need more than 15 clips per sentence. This should alleviate some of the problems we are currently experiencing.

That would be a wonderful solution! Thank you!
I wonder how the value 15 was decided, though...

@moz-dfeller
Contributor

I believe that the current models perform well with 5 clips per sentence (somebody more knowledgeable please correct me), but I assume that many languages would be saturated with clips relatively quickly, which would discourage further contributions (especially during events like the ones @raivisdejus mentioned). Saving too many clips, on the other hand, wouldn't necessarily improve a model's performance much past a certain threshold, while increasing storage and traffic costs. So 15 clips seems like a sensible number until proven otherwise :)

@HarikalarKutusu
Contributor

HarikalarKutusu commented Jul 25, 2023

You are right, though I haven't seen any scientific paper on that...
AFAIK, many current large-scale models, like Whisper or those provided by NVIDIA, don't care about this, though; they ingest all the material they can get.
The problem with "many recordings per sentence" is the distribution, as it can cause sentence/phoneme biasing.
So if a dataset has 1000 sentences with 15 recordings and 1000 sentences with 1 recording, and you take the whole dataset for training, you will have a huge problem.
This should also interest @raivisdejus, as after they add new sentences they will see this effect in Latvian.
@raivisdejus, please check the "sentences" tab in this link. All your sentences have 15 recordings, none with fewer...
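A dataset consumer can mitigate this kind of per-sentence imbalance by capping the number of clips taken per sentence when building a training split. This is a minimal illustrative sketch, not part of the Common Voice tooling; the clip list, the cap of 5, and the helper name are all assumptions:

```python
from collections import defaultdict

# Toy clip list: (sentence_id, clip_path). Sentence 1 is heavily over-recorded.
clips = [(1, f"s1_{i}.mp3") for i in range(15)] + [(2, "s2_0.mp3")]

def cap_per_sentence(clips, max_per_sentence=5):
    """Keep at most `max_per_sentence` clips per sentence to reduce
    sentence/phoneme bias in the resulting training set."""
    kept, seen = [], defaultdict(int)
    for sentence_id, path in clips:
        if seen[sentence_id] < max_per_sentence:
            kept.append((sentence_id, path))
            seen[sentence_id] += 1
    return kept

balanced = cap_per_sentence(clips)
print(len(balanced))  # 6: five clips of sentence 1, one of sentence 2
```

In practice a real splitter would also balance by speaker and shuffle before capping, but the principle is the same: the 15-vs-1 skew described above shrinks to at most 5-vs-1.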

@raivisdejus
Contributor Author

@HarikalarKutusu @moz-dfeller Agree with the clip limit per sentence. No question there.

Also, more Latvian sentences are on the way. Some were added after the v14 dataset was already released. No question about this either.

What are your thoughts on moving the limit to sentence selection for recording, so we do not get into a state where unusable sentences get recorded? With all the above acknowledged, I would still like a solution where a big chunk of Latvian sentences is not stuck in an unusable state.

Voice tool creators need some filtering of the dataset anyway: some people record thousands of sentences, so if the dataset is used without any filtering and processing, those super-recorders will also skew overall speech recognition quality.

@HarikalarKutusu
Contributor

What are your thoughts on moving it to the sentence selection for recording?

That's the job of the splitting algorithm.
Maybe continue this discussion on Matrix (@bozden:mozilla.org)? It is not directly related to the issue at hand...

@raivisdejus
Contributor Author

Issue fixed by #4131

Thanks, everybody! Kudos and extra karma points as well as warmest greetings from Latvia :)
