
Will remove sentence clip count for clip selection for validation #4116

Closed

Conversation

raivisdejus
Contributor

Pull Request Form

Type of Pull Request

  • Related to a listed issue

This pull request removes the 15-clip limit per sentence that is currently applied when clips are selected for validation. This will let all recorded clips be validated.

If a hard limit on the maximum number of clips per sentence is deemed necessary, it can be introduced in the query that selects sentences for recording. This way, sentences that already have enough recorded clips would not be presented to users.
https://github.com/common-voice/common-voice/blob/main/server/src/lib/model/db.ts#L313
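The recording-side limit proposed above can be sketched in miniature. This is an illustrative SQLite example with invented table and column names (`sentences`, `clips`, `sentence_id`); the real query lives in `server/src/lib/model/db.ts` and runs against MySQL with a different schema:

```python
import sqlite3

MAX_CLIPS_PER_SENTENCE = 15  # assumed cap, mirroring the value discussed in this PR

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE clips (id INTEGER PRIMARY KEY, sentence_id INTEGER);
""")

# Sentence 1 already has 15 clips; sentence 2 has only 3.
conn.execute("INSERT INTO sentences VALUES (1, 'pilns teikums'), (2, 'jauns teikums')")
conn.executemany("INSERT INTO clips (sentence_id) VALUES (?)",
                 [(1,)] * 15 + [(2,)] * 3)

# Only offer sentences that still need recordings; saturated sentences are
# skipped at recording time, so no cap is needed later during validation.
rows = conn.execute("""
    SELECT s.id, s.text
      FROM sentences s
      LEFT JOIN clips c ON c.sentence_id = s.id
     GROUP BY s.id
    HAVING COUNT(c.id) < ?
""", (MAX_CLIPS_PER_SENTENCE,)).fetchall()

print(rows)  # only sentence 2 is offered for recording
```

With the cap enforced here, every clip that does get recorded can safely be surfaced for validation.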

@raivisdejus raivisdejus requested a review from a team as a code owner July 16, 2023 14:19
@raivisdejus raivisdejus requested review from data-sync-user and removed request for a team July 16, 2023 14:19
@HarikalarKutusu
Contributor

HarikalarKutusu commented Jul 16, 2023

@raivisdejus very interesting find; I didn't know that there was such a limit in the code. This also explains one issue I posted a while back.

On the other hand, with your locale, there is a more pressing issue that resulted in this. The number of sentences in your text-corpus is very low, and you have many users recording - which in turn results in a sentence being recorded too many times. AFAIK, for the SotA models, this does not help much. One exception would be wake-word or command-like applications where it is better for several words to be recorded by more than a thousand people (and CV code run by Mozilla is not very suitable to implement these).

On the other hand, a synthetic limit of 15 at this stage is very interesting indeed. It means many hours of volunteer recording effort get wasted.

@raivisdejus
Contributor Author

I have adjusted the code so that the clip recording limit is applied to sentence selection for recordings.
This will limit the number of clips a sentence can gather and still let users validate all the recorded sentences so no sentence is lost.

We definitely need this solution for Latvian, but I think it is a good solution for all other locales as well.


@raivisdejus
Contributor Author

@HarikalarKutusu This still leaves some sentences unvalidatable, for example if a sentence has 16 recordings.
For Latvian, about 50% of our recordings, roughly 50 hours, are inaccessible.

For future recordings we will add more sentences, but some solution is needed to get the already recorded clips into validation.

@HarikalarKutusu
Contributor

Yes, I'm aware of that. I'm not in a position to comment on this, but it might be related to the dataset health.

@moz-dfeller
Contributor

moz-dfeller commented Jul 25, 2023

Hey @raivisdejus , thank you for your effort. I cannot merge that PR as is, since there is still a problem with the query that retrieves clips for validation that I am investigating as of right now. Latvian should have many more clips to validate.

Also, there is already a limit on how many clips a sentence can gather, and every one of those clips can get a vote. Here is a short excerpt from the Latvian clips corpus:

| Total clips count | Validated clips count |
| --- | --- |
| 16 | 6 |
| 15 | 15 |
| 16 | 11 |
| 16 | 6 |
| 16 | 12 |
| 16 | 10 |
| 16 | 10 |
| 16 | 13 |
| 16 | 10 |
| 16 | 10 |
| 16 | 10 |
| 16 | 11 |
| 16 | 13 |
| 16 | 10 |
| 16 | 12 |
| 16 | 11 |
| 16 | 12 |
| 16 | 8 |
| 16 | 11 |
| 16 | 6 |
| 16 | 12 |

So the total clips count is the number of clips per sentence. As you can see, we have a maximum of 16 clips per sentence (probably an off-by-one error; I suppose it was meant to be 15), and on the other side the number of validated clips for the same sentence. The issue right now is that the clips missing validation are not returned. The fix could be as easy as changing the WHERE clause to use clips_count <= 16, but the JOIN also seems weird. I am working on it and the fix should be coming soon.
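The effect described above can be reproduced in miniature. This is an illustrative SQLite sketch with an invented schema (`sentences.clips_count`, `clips.is_valid` are assumptions, not the real Common Voice tables): with a `clips_count <= 15` filter in the validation query, every unvalidated clip of a 16-clip sentence becomes invisible to validators, while dropping the filter returns them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentences (id INTEGER PRIMARY KEY, clips_count INTEGER);
CREATE TABLE clips (id INTEGER PRIMARY KEY, sentence_id INTEGER, is_valid INTEGER);
""")

# A sentence that slipped past the limit with 16 clips, 6 of them validated.
conn.execute("INSERT INTO sentences VALUES (1, 16)")
conn.executemany("INSERT INTO clips (sentence_id, is_valid) VALUES (?, ?)",
                 [(1, 1)] * 6 + [(1, 0)] * 10)

def clips_to_validate(max_clips=None):
    """Fetch unvalidated clips, optionally filtered by a per-sentence cap."""
    where = "" if max_clips is None else f"AND s.clips_count <= {max_clips}"
    return conn.execute(f"""
        SELECT c.id FROM clips c
          JOIN sentences s ON s.id = c.sentence_id
         WHERE c.is_valid = 0 {where}
    """).fetchall()

print(len(clips_to_validate(max_clips=15)))  # 0 -- the 10 unvalidated clips are stuck
print(len(clips_to_validate()))              # 10 -- removing the cap frees them
```

This mirrors the Latvian situation in the table above: clips exist, but the cap in the retrieval query hides them from validation.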

@HarikalarKutusu
Contributor

@moz-dfeller, on the general scope, I'm sure many languages have such recordings stuck in their "other" buckets. The team might consider removing the limit for a while, e.g. in parallel with a global campaign, then putting it back, to make them available in datasets and clean the "other" bucket.
Just an idea...

@raivisdejus
Contributor Author

Great to see progress! And thanks for looking into this.

For Latvian, we were running campaigns where a large number of people came to record in one day. On such a day, they recorded heavily from the cached set of sentences selected for that day. As the overall number of sentences was not very big, we ended up with too many recordings per sentence.

A limit of 16 will help, but IMHO any limit on validation clip retrieval may run into problems at some point.

The main problem for Latvian will be fixed by adding more sentences to record. We are working on this.

@moz-dfeller
Contributor

A limit of 16 will help, but IMHO any limit on validation clip retrieval may run into problems at some point.

@raivisdejus I think you are right. I thought we would only retrieve sentences that have fewer than 15-16 clips, but we do not. So if we put a limit there, we will always have the problem that some clips will never be validated.

@raivisdejus , @HarikalarKutusu
I will probably remove the per-sentence clip limit for validation but add one for new recordings, as we don't need more than 15 clips per sentence. This should alleviate some of the problems we are currently experiencing.

@HarikalarKutusu
Contributor

I will probably remove the clips limit per sentence for validation but add one for new recordings as we don't need more than 15 clips per sentence. This should alleviate some of the problems we are currently experiencing.

That would be a wonderful solution! Thank you!
I wonder how the value 15 was decided, though...

@moz-dfeller
Contributor

I believe that the current models perform well with 5 clips per sentence (somebody more knowledgeable please correct me), but I assume that many languages would be saturated with clips relatively quickly, which would discourage further contributions (especially during events like the ones @raivisdejus mentioned). Saving too many clips, on the other hand, wouldn't necessarily improve a model's performance much past a certain threshold, while increasing storage and traffic costs. So 15 clips seems like a sensible number until proven otherwise :)

@HarikalarKutusu
Contributor

HarikalarKutusu commented Jul 25, 2023

You are right, though I haven't seen any scientific paper on that...
AFAIK, many current large-scale models, like Whisper or those provided by NVIDIA, don't care about this, though; they ingest all the material they can get.
The problem with "many recordings per sentence" is the distribution, as it can cause sentence/phoneme biasing.
So if a dataset has 1000 sentences with 15 recordings and 1000 sentences with 1 recording, and you take the whole dataset for training, you will have a huge problem.
This should also interest @raivisdejus, as after they add new sentences they will see this effect in Latvian.
@raivisdejus, please check the "sentences" tab in this link. All your sentences have 15 recordings, none with fewer...
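A dataset consumer can mitigate this kind of per-sentence imbalance by capping the number of clips taken per sentence when building a training split. This is a minimal illustrative sketch, not part of the Common Voice tooling; the clip list, the cap of 5, and the helper name are all assumptions:

```python
from collections import defaultdict

# Toy clip list: (sentence_id, clip_path). Sentence 1 is heavily over-recorded.
clips = [(1, f"s1_{i}.mp3") for i in range(15)] + [(2, "s2_0.mp3")]

def cap_per_sentence(clips, max_per_sentence=5):
    """Keep at most `max_per_sentence` clips per sentence to reduce
    sentence/phoneme bias in the resulting training set."""
    kept, seen = [], defaultdict(int)
    for sentence_id, path in clips:
        if seen[sentence_id] < max_per_sentence:
            kept.append((sentence_id, path))
            seen[sentence_id] += 1
    return kept

balanced = cap_per_sentence(clips)
print(len(balanced))  # 6: five clips of sentence 1, one of sentence 2
```

In practice a real splitter would also balance by speaker and shuffle before capping, but the principle is the same: the 15-vs-1 skew described above shrinks to at most 5-vs-1.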

@raivisdejus
Contributor Author

@HarikalarKutusu @moz-dfeller Agree with the clip limit per sentence. No question there.

Also, more Latvian sentences are on the way. Some were added after the v14 dataset was already released. No question about this either.

What are your thoughts on moving the limit to sentence selection for recording, so we do not get into a state where unusable sentences get recorded? With all the above acknowledged, I would still like a solution where a big chunk of Latvian sentences is not stuck in an unusable state.

Voice tool creators need some filtering of the dataset anyway: some people record thousands of sentences, so if the dataset is used without any filtering and processing, those super-recorders will also skew overall speech recognition quality.

@HarikalarKutusu
Contributor

What are your thoughts on moving it to the sentence selection for recording?

That's the job of the splitting algorithm.
Maybe continue this discussion on Matrix (@bozden:mozilla.org)? It is not directly related to the issue at hand...

@raivisdejus
Contributor Author

Issue fixed by #4131

Thanks, everybody! Kudos and extra karma points as well as warmest greetings from Latvia :)
