Create tok.js, update eo.js #610

janPensa · 2022-03-22T00:57:41Z

Wrote a new validation script for Toki Pona.

Changes compared to default:

- changed sentence limit from 14 words to 90 characters
- added rules to enforce Toki Pona's phonotactics, which should eliminate words and names with ambiguous or impossible pronunciations, and also catch a lot of typos
- added "capital letters at start of word only" rule, which can replace the default "no abbreviations" one

Also updated Esperanto's validation script with translations and minor improvements.
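
To make the phonotactics and capital-letter rules more concrete, here is a minimal sketch of a word-level check. It is not the code in tok.js. It assumes the commonly described Toki Pona syllable shape (optional consonant, vowel, optional final n), the usual forbidden sequences ji, ti, wo and wu, and no n directly before m or n. The names WORD_SHAPE, FORBIDDEN, CAPITAL_INSIDE, isValidWord and invalidWords are hypothetical.

```javascript
// Hypothetical sketch, not the actual tok.js rules.
// Assumed syllable shape: (C)V(n); only the first syllable may lack a consonant.
const WORD_SHAPE = /^[jklmnpstw]?[aeiou](?:n?[jklmnpstw][aeiou])*n?$/;
// Assumed forbidden sequences: ji, ti, wo, wu, and n directly before m or n.
const FORBIDDEN = /ji|ti|wo|wu|n[mn]/;
// Capital letters are only allowed at the start of a word (as in names).
const CAPITAL_INSIDE = /.[A-Z]/;

function isValidWord(word) {
  if (CAPITAL_INSIDE.test(word)) return false;
  const lower = word.toLowerCase();
  return WORD_SHAPE.test(lower) && !FORBIDDEN.test(lower);
}

function invalidWords(sentence) {
  // Strip the punctuation assumed here, then check every remaining word.
  return sentence
    .replace(/[.,:;!?]/g, ' ')
    .split(/\s+/)
    .filter((w) => w.length > 0 && !isValidWord(w));
}

console.log(invalidWords('tenpo pini la jan Sonja li yupekosi lili.'));
console.log(invalidWords('mi tawa ma Sanai.'));
```

Under these assumptions a name like Sonja passes, while Timi (contains "ti") and yupekosi (contains "y") are flagged. A well-formed typo such as "kupupu" would still pass, so a check like this complements manual review rather than replacing it.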

Wrote a new validation script for Toki Pona. Changes compared to default:

- raised word cap from 14 to 18 (Due to Toki Pona's minimalistic design, its words tend to be shorter than English words. Based on the current Sentence Collector corpus, 18 words in Toki Pona are on average equivalent in length to 14 words in English)
- added rules to enforce Toki Pona's phonotactics, which should eliminate words and names with ambiguous or impossible pronunciations, and also catch a lot of typos
- added "capital letters at start of word only" rule, which can replace the default "no abbreviations" one

tbodt · 2022-03-22T02:43:25Z

server/lib/validation/languages/tok.js

+
+{
+  // No non-Toki-Pona letters
+  regex: /[BbCcDdFfGgHhQqRrVvXxYyZzĈĜĤĴŜŬĉĝĥĵŝŭäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]/,


You don't need to list every bad character. You can allow only the good characters like this: /[^aeijklmnopstuw]/

Ah, but that wouldn't allow ".", spaces, or other valid things. Hm... I'm not sure.

I took these bad characters from https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/eo.js

At first I only had [BbCcDdFfGgHhQqRrVvXxYyZz].

MichaelKohler

Thanks for contributing this. Generally looks good to me, just a few comments and questions from my side.

MichaelKohler · 2022-03-22T21:02:03Z

server/lib/validation/languages/tok.js

+const MIN_WORDS = 1;
+
+// Maximum of words allowed per sentence to keep recordings in a manageable duration.
+const MAX_WORDS = 18;


Approximately how long does it take to read 18 words out loud?

We timed an 18-word sentence with 87 characters (a bit above average in length for 18 words), which took around 7 seconds to record at normal speed. At the fastest and slowest speeds that felt natural to me, the time ranged from 5 or 6 seconds on the low end to about 11 or 12 seconds on the high end (including the short silences between reading out the sentence and hitting the play/pause button).

By the way, due to its very regular and minimalistic phonology and phonotactics, counting characters would probably be a more accurate measure of how long a Toki Pona sentence takes to read out. If you want to prevent outliers with many long words and proper nouns, a 90-character limit might be better. But we shouldn't get many of those.

I just looked it up, the maximum recording length is 10 seconds. I'll leave it up to you to judge what the best approach is here.

After discussing this question with other contributors and gauging opinions, I decided to go with a character limit rather than a word limit. That option clearly has more support. I updated the file accordingly.
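
For a rough sense of how a 90-character cap relates to the 10-second recording limit, the timings reported above can be scaled linearly. This is back-of-the-envelope arithmetic only: it assumes reading time is proportional to character count, takes 5 and 12 seconds as the fast and slow bounds, and uses a hypothetical function name.

```javascript
// Timings reported in this thread for one 87-character sentence.
const SAMPLE_CHARS = 87;
const SAMPLE_SECONDS = { fast: 5, normal: 7, slow: 12 };

// Assumption: reading time scales linearly with the number of characters.
function estimateSeconds(chars, sampleSeconds) {
  return (chars / SAMPLE_CHARS) * sampleSeconds;
}

const MAX_CHARS = 90;
for (const [pace, seconds] of Object.entries(SAMPLE_SECONDS)) {
  // Prints roughly: fast 5.2, normal 7.2, slow 12.4
  console.log(pace, estimateSeconds(MAX_CHARS, seconds).toFixed(1));
}
```

At a normal pace a 90-character sentence comes out a little over 7 seconds, under the 10-second limit, while the slowest pace reported above would exceed it.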


MichaelKohler · 2022-03-22T21:25:21Z

server/lib/validation/languages/tok.js

+
+{
+  // No non-Toki-Pona letters
+  regex: /[BbCcDdFfGgHhQqRrVvXxYyZzĈĜĤĴŜŬĉĝĥĵŝŭäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]/,


I think it would make sense to negate this and only list valid letters. I think that's what the suggestion from @tbodt is, but I don't understand Toki Pona, so can't say for sure :)

Indeed. I couldn't figure out a good regex to allow punctuation as well, though.

Yeah, we would probably need to specify an exhaustive list of all valid characters, including punctuation, for that to work.
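
A negated character class along these lines might look as follows. This is only a sketch: the 14 letters come from the suggestion above, while the punctuation set and the names INVALID_CHAR and hasInvalidChar are assumptions, and the set would need to match whatever punctuation the project actually accepts.

```javascript
// Hypothetical allowlist: anything NOT in this class is rejected.
// Letters: the 14 Toki Pona letters in both cases (capitals are needed for names).
// Punctuation and whitespace: an assumed set, adjust as needed.
const INVALID_CHAR = /[^aeijklmnopstuwAEIJKLMNOPSTUW\s.,:;!?'"-]/;

function hasInvalidChar(sentence) {
  return INVALID_CHAR.test(sentence);
}

console.log(hasInvalidChar('toki ni la mi pilin pona, a a a!')); // false
console.log(hasInvalidChar('sina sutopatikuna.'));               // false: letters are fine, phonotactics are a separate rule
console.log(hasInvalidChar('jan Sonja li yupekosi lili.'));      // true: contains "y"
```

The trade-off is that an allowlist rejects anything unexpected by default, so every legitimate punctuation mark has to be listed explicitly.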

The current list is copied and slightly modified from Esperanto's validation script, which is meant to allow only Esperanto letters. After tbodt's comment I had the idea that it might be better to specify entire blocks of codepoints for this, e.g. [BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF]. That should weed out the vast majority of Latin script letter variants.

And actually, we might also want to add \uF1900-\uF19FF, the codepoints to which Toki Pona's sitelen pona script is assigned in the UCSUR standard.

I changed it to [BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF\uF1900-\uF19FF] for now, when I updated the translations part

I made a similar change to Esperanto's script, changing
[wWqQxXyYäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]
to a cleaner and more complete
[qQwWxXyYÀ-ćĊ-ěĞ-ģĞ-ģĦ-ĳĶ-śŞ-ūŮ-\u02AF\u1E00-\u1EFFα-ωΑ-ΩЀ-ӿ].

And I also imported Esperanto translations from Pontoon while I was at it.

server/lib/validation/languages/tok.js

- added translations into the script
- updated prohibited letters

minor edit: added a comment about UCSUR

Changed word limit to character limit, after discussing it with various contributors

- Split "symbols" and "non-Esperanto letters" to two separate invalidations and error messages
- Changed unwieldy list of forbidden characters to a more concise and more complete list
- Moved translations from Pontoon into document, and added missing translation
- Changed "word limit" and "symbols" translations to be easier to understand IMO

MichaelKohler

Thanks for the updates. This generally looks good to me. The last missing piece here is to add tok to the validations:

Add the require above this line: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/index.js#L14
Add it to the validation exports above this line: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/index.js#L31

Thanks!

server/lib/validation/languages/tok.js

small fix to capital letters invalidation

janPensa · 2022-03-27T03:27:34Z

@MichaelKohler By the way, at the moment there are 11 sentences among the validated sentences at https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt that violate these new validation rules, because of typos or invalid words and names. Do you know what would be the best way to get these and any associated recordings removed?

MichaelKohler · 2022-03-27T11:58:52Z

You can give me a list of these sentences and I can remove them from Sentence Collector. However, removing the recordings themselves is probably too much work for the benefit of removing only 11 sentences. I would suggest just reporting them through the UI when you encounter them.

MichaelKohler

Thanks!

MichaelKohler · 2022-03-27T13:39:54Z

🎉 This PR is included in version 2.17.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

janPensa · 2022-03-27T13:46:21Z

@MichaelKohler Thank you for the help!

As for the 11 sentences I mentioned, they are:

jan lilii o tawa supa lape.
jan Timi li sona ala e sona pi pali ni.
jan Timi li wile ala lon jan pi kulupu ni.
mi ken ala pali tawa pona nim
mi tawa ma Sanai.
ona li jo e nimi mute pi toki Inl
sike pan pi ma Italia li kama ala lon insa mi.
sina sutopatikuna.
tenpo mute la jan Timi li ante e tomo ona.
tenpo pini la jan Sonja li yupekosi lili.
toki ni la mi yupekosi e kulupu ni pi toki kalama, a a a!

I've also run the corpus through a word frequency counter and found these 7 typos at the bottom of the list:

ale li pona. tan same a la mi pilin monsuta?
jan lawa li lona poka mi.
kupupu musi ni li pakala e tomo sina.
ma seli la ko lejo li tawa sama kiwen telo lete.
mi sone e ni: tomo sina li lon nasin seme.
ona mute li loki e ni: mi toki mute.
tenpo ni la mi lukine ijo sin pi pakala mute.
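
A word frequency pass of the kind mentioned above can be sketched in a few lines. This is an illustration rather than the tool that was actually used: the function names are hypothetical, and lowercasing plus the punctuation set are assumptions.

```javascript
// Count how often each word occurs across a list of sentences.
function wordFrequencies(sentences) {
  const counts = new Map();
  for (const sentence of sentences) {
    // Lowercasing and the punctuation set are assumptions of this sketch.
    const words = sentence.toLowerCase().replace(/[.,:;!?]/g, ' ').split(/\s+/);
    for (const word of words) {
      if (word) counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

// Words seen at most maxCount times are candidates for manual review.
function rareWords(sentences, maxCount = 1) {
  return [...wordFrequencies(sentences)]
    .filter(([, count]) => count <= maxCount)
    .map(([word]) => word);
}

console.log(rareWords(['mi sona e ni.', 'mi sone e ni.', 'sina sona e ni.']));
```

On a corpus this small, legitimate words also show up as rare, so the output is a list to inspect by hand. On a large corpus, typos such as "sone" tend to collect at the bottom of the frequency list.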

janPensa · 2022-03-27T14:04:26Z

Oh crap, for some reason [\u00C0-\u02BF] and [\u1E00-\u1EFF] match with every letter in the Latin alphabet, making the Sentence Collector reject every submission.

I'll write a hotfix...

Edit:
Here it is #616
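
One thing worth checking when debugging a class like this (a possible cause, not confirmed in this thread): without the `u` flag, a JavaScript regex reads exactly four hex digits after `\u`. That means `\uF1900-\uF19FF` is parsed as the character U+F190, then a range from "0" to U+F19F, then "F", and that range covers the basic Latin letters. The sketch below shows the difference, using `\u{...}` with the `u` flag as one way to express code points above U+FFFF. The variable names are hypothetical.

```javascript
// Without the u flag: \uF190, then the range "0"-\uF19F, then "F".
const withoutFlag = /[\uF1900-\uF19FF]/;
// With the u flag, \u{...} accepts code points above U+FFFF.
const withFlag = /[\u{F1900}-\u{F19FF}]/u;
// The two Latin ranges from the pattern, on their own.
const latinRanges = /[\u00C0-\u02BF\u1E00-\u1EFF]/;

console.log(withoutFlag.test('a'));  // true: "a" (U+0061) lies between "0" and U+F19F
console.log(withFlag.test('a'));     // false
console.log(latinRanges.test('a'));  // false
console.log(withFlag.test(String.fromCodePoint(0xF1900))); // true
```

If the validation code compiles its patterns without the `u` flag, another option is to leave the UCSUR range out of the class altogether.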

tbodt reviewed Mar 22, 2022


MichaelKohler suggested changes Mar 22, 2022


janPensa added 4 commits March 23, 2022 00:45

Update tok.js

0012039

- added translations into the script
- updated prohibited letters

Update tok.js

f85cc1e

minor edit: added a comment about UCSUR

Update tok.js

6a60814

Changed word limit to character limit, after discussing it with various contributors

janPensa changed the title ~~Create tok.js~~ Create tok.js, update eo.js Mar 26, 2022

MichaelKohler reviewed Mar 26, 2022


janPensa added 3 commits March 27, 2022 03:48

Update tok.js

7e5c978

small fix to capital letters invalidation

added "tok" to index.js

b2355a1

Merge pull request #1 from janPensa/patch-2

f9f39d6

janPensa requested a review from MichaelKohler March 27, 2022 02:05

MichaelKohler approved these changes Mar 27, 2022


MichaelKohler merged commit 9e8e723 into common-voice:main Mar 27, 2022
