Wrote a new validation script for Toki Pona.
Changes compared to default:
- raised word cap from 14 to 18 (Due to Toki Pona's minimalistic design, its words tend to be shorter than English words. Based on the current Sentence Collector corpus, 18 words in Toki Pona are on average equivalent in length to 14 words in English)
- added rules to enforce Toki Pona's phonotactics, which should eliminate words and names with ambiguous or impossible pronunciations, and also catch a lot of typos
- added "capital letters at start of word only" rule, which can replace the default "no abbreviations" one
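For readers unfamiliar with the language, the phonotactics and capitalization rules mentioned above could look roughly like the sketch below. The helper names are hypothetical, and the patterns are an assumption based on the commonly described Toki Pona syllable shape, (C)V(n), with the sequences ji, ti, wo, wu, nn and nm disallowed. The regexes in the actual script may differ.

```javascript
// Hypothetical sketch, not the regexes from this PR.
// A word is a run of (C)V(n) syllables; only the first syllable may omit its consonant.
const SYLLABLE_SHAPE = /^[jklmnpstw]?[aeiou]n?(?:[jklmnpstw][aeiou]n?)*$/;

// Sequences commonly listed as disallowed in Toki Pona (assumption).
const FORBIDDEN_SEQUENCE = /ji|ti|wo|wu|nn|nm/;

// An ASCII capital letter anywhere except at the start of a word (e.g. "jaN", "USA").
const MISPLACED_CAPITAL = /\B[A-Z]/;

// True when the word fits the syllable shape and contains no forbidden sequence.
function isPronounceableWord(word) {
  const lower = word.toLowerCase();
  return SYLLABLE_SHAPE.test(lower) && !FORBIDDEN_SEQUENCE.test(lower);
}

// True when any word in the sentence has a capital letter after its first character.
function hasMisplacedCapital(sentence) {
  return MISPLACED_CAPITAL.test(sentence);
}
```

A check like `hasMisplacedCapital` also rejects all-caps abbreviations, which is presumably why it can stand in for the default "no abbreviations" rule.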
{
  // No non-Toki-Pona letters
  regex: /[BbCcDdFfGgHhQqRrVvXxYyZzĈĜĤĴŜŬĉĝĥĵŝŭäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]/,
You don't need to list all the bad characters. You can list all the good characters instead, like this: /[^aeijklmnopstuw]/
Ah, but that does not allow the characters .
It removes characters and other valid things. Hmm... I don't know.
I took these bad characters from https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/eo.js
At first I only had [BbCcDdFfGgHhQqRrVvXxYyZz]
Thanks for contributing this. Generally looks good to me, just a few comments and questions from my side.
const MIN_WORDS = 1;

// Maximum of words allowed per sentence to keep recordings in a manageable duration.
const MAX_WORDS = 18;
Approximately how long does it take to read 18 words out loud?
We timed an 18-word sentence with 87 characters (a bit above average in length for 18 words), which took around 7 seconds to record at normal speed. At the fastest and slowest speeds that felt natural to me, the recording took 5 or 6 seconds on the low end and about 11 or 12 seconds on the high end (including the short silences between reading out the sentence and hitting the play/pause button).
By the way, due to its very regular and minimalistic phonology and phonotactics, counting characters would probably be a more accurate measure of how long a Toki Pona sentence takes to read out. If you want to prevent outliers with many long words and proper nouns, a 90-character limit might be better. But we shouldn't get many of those.
I just looked it up, the maximum recording length is 10 seconds. I'll leave it up to you to judge what the best approach is here.
After discussing this question with other contributors and gauging opinions, I decided to go with a character limit rather than a word limit. That option clearly has more support. I updated the file accordingly.
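A character limit of this kind can be sketched as follows. This is only an illustration: the helper name is hypothetical, and the value 90 is the figure floated earlier in this thread, not necessarily the one that was committed.

```javascript
// Assumed value: 90 is the figure suggested in the discussion, not a confirmed constant.
const MAX_CHARACTERS = 90;

// Hypothetical helper: true when the trimmed sentence is longer than the limit.
// String length counts UTF-16 code units, which equals the character count
// for plain Latin-script sentences.
function exceedsCharacterLimit(sentence, limit = MAX_CHARACTERS) {
  return sentence.trim().length > limit;
}

console.log(exceedsCharacterLimit('toki pona li pona'));
```

Trimming before measuring keeps stray leading or trailing whitespace from pushing an otherwise acceptable sentence over the limit.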
{
  // No non-Toki-Pona letters
  regex: /[BbCcDdFfGgHhQqRrVvXxYyZzĈĜĤĴŜŬĉĝĥĵŝŭäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]/,
I think it would make sense to negate this and only list valid letters. I think that's what the suggestion from @tbodt is, but I don't understand Toki Pona, so can't say for sure :)
Indeed. I couldn't figure out a good regex to allow punctuation as well, though.
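For reference, a negated character class can cover punctuation by listing it alongside the letters. The sketch below is one possible shape, not the regex used in this PR: the punctuation set is an assumption, and the names are hypothetical.

```javascript
// Hypothetical allow-list: the 14 Toki Pona letters in either case, whitespace,
// and an assumed punctuation set. Anything else counts as disallowed.
const NOT_ALLOWED = /[^aeijklmnopstuwAEIJKLMNOPSTUW\s.,:;!?'"-]/;

// Returns every disallowed character found in the sentence, in order of appearance.
function findDisallowedCharacters(sentence) {
  // The "g" flag collects every offending character, not only the first one.
  return sentence.match(new RegExp(NOT_ALLOWED.source, 'g')) || [];
}

console.log(findDisallowedCharacters('toki! mi jan Sonja.'));
```

The trade-off is the one raised in this thread: an allow-list only works if every valid character, punctuation included, is enumerated.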
Yeah, we would probably need to specify an exhaustive list of all valid characters, including punctuation, for that to work.
The current list is copied and slightly modified from Esperanto's validation script, which is meant to only allow Esperanto letters. After tbodt's comment I had the idea that it might be better to specify entire blocks of codepoints for this, e.g. [BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF]
That should weed out the vast majority of Latin script letter variants.
And actually, we might also want to add \uF1900-\uF19FF, the codepoints to which Toki Pona's sitelen pona script is assigned in the UCSUR standard.
I changed it to [BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF\uF1900-\uF19FF] for now, when I updated the translations part.
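One thing worth checking with a range like \uF1900-\uF19FF: in a JavaScript regex without the u flag, \uXXXX takes exactly four hex digits, so the class is read as \uF190, then the range 0-\uF19F, then a literal F. Code points above U+FFFF need the \u{...} form together with the u flag. The sketch below illustrates the difference in isolation. The variable names are hypothetical, and this is not necessarily how the committed script handles it.

```javascript
// Without the "u" flag: \uF190, then the range "0"-\uF19F, then a literal "F".
const withoutUnicodeFlag = /[\uF1900-\uF19FF]/;

// With the "u" flag, \u{...} can express code points above U+FFFF.
const withUnicodeFlag = /[\u{F1900}-\u{F19FF}]/u;

// First code point of the range quoted above for the UCSUR sitelen pona block.
const sitelenPonaGlyph = String.fromCodePoint(0xF1900);

console.log(withoutUnicodeFlag.test('a'));            // true: "a" falls inside "0"-\uF19F
console.log(withUnicodeFlag.test('a'));               // false
console.log(withUnicodeFlag.test(sitelenPonaGlyph));  // true
```

Because the accidental range covers ordinary ASCII letters, a pattern written without the u flag would flag ordinary Toki Pona text as well.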
I made a similar change to Esperanto's script, changing
[wWqQxXyYäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]
to a cleaner and more complete
[qQwWxXyYÀ-ćĊ-ěĞ-ģĞ-ģĦ-ijĶ-śŞ-ūŮ-\u02AF\u1E00-\u1EFFα-ωΑ-ΩЀ-ӿ].
And I also imported Esperanto translations from Pontoon while I was at it.
- added translations into the script
- updated prohibited letters
minor edit: added a comment about UCSUR
Changed word limit to character limit, after discussing it with various contributors
- Split "symbols" and "non-Esperanto letters" into two separate invalidations and error messages
- Changed unwieldy list of forbidden characters to a more concise and more complete list
- Moved translations from Pontoon into document, and added missing translation
- Changed "word limit" and "symbols" translations to be easier to understand IMO
Thanks for the updates. This generally looks good to me. The last missing piece here is to add tok to the validations:
- Add the require above this line: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/index.js#L14
- Add it to the validation exports above this line: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/index.js#L31
Thanks!
small fix to capital letters invalidation
@MichaelKohler By the way, at the moment there are 11 sentences among the validated sentences at https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt that violate these new validation rules, because of typos or invalid words and names. Do you know what would be the best way to get these and any associated recordings removed?
You can give me a list of these sentences and I can remove them from Sentence Collector. However, removing the recordings themselves is probably too much work for the benefit of removing only 11 sentences. I would suggest just reporting them through the UI when you encounter them.
Thanks!
🎉 This PR is included in version 2.17.2 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀
@MichaelKohler Thank you for the help! As for the 11 sentences I mentioned, they are:
I've also run the corpus through a word frequency counter and found these 7 typos at the bottom of the list:
Oh crap, for some reason I'll write a hotfix... Edit:
Wrote a new validation script for Toki Pona.
Also updated Esperanto's validation script with translations and minor improvements.