This repository has been archived by the owner on May 10, 2023. It is now read-only.

Create tok.js, update eo.js #610

Merged
merged 8 commits into from Mar 27, 2022

Contributor

@janPensa commented Mar 22, 2022

Wrote a new validation script for Toki Pona.

Changes compared to default:

  • changed sentence limit from 14 words to 90 characters
  • added rules to enforce Toki Pona's phonotactics, which should eliminate words and names with ambiguous or impossible pronunciations, and also catch a lot of typos
  • added "capital letters at start of word only" rule, which can replace the default "no abbreviations" one

Also updated Esperanto's validation script with translations and minor improvements.
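The "capital letters at start of word only" rule can be expressed as one regular expression. What follows is a minimal sketch under stated assumptions, not the code from tok.js: the names `CAPITAL_INSIDE_WORD` and `hasMisplacedCapital` are hypothetical, and the pattern assumes plain Latin letters only.

```javascript
// Hypothetical sketch: flag any capital letter that directly follows another letter.
// Assumes the sentence contains only plain Latin letters (A-Z, a-z).
const CAPITAL_INSIDE_WORD = /[A-Za-z][A-Z]/;

function hasMisplacedCapital(sentence) {
  return CAPITAL_INSIDE_WORD.test(sentence);
}

console.log(hasMisplacedCapital("jan Sonja li pona.")); // false
console.log(hasMisplacedCapital("mi lukin e TV."));     // true
console.log(hasMisplacedCapital("mi tawa ma SanaI."));  // true
```

Because an all-caps abbreviation such as "TV" has a capital in second position, a check like this also covers most of what a "no abbreviations" rule would catch. Single-letter abbreviations would still pass.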

Wrote a new validation script for Toki Pona.

Changes compared to default:
- raised word cap from 14 to 18 (Due to Toki Pona's minimalistic design, its words tend to be shorter than English words. Based on the current Sentence Collector corpus, 18 words in Toki Pona are on average equivalent in length to 14 words in English)
- added rules to enforce Toki Pona's phonotactics, which should eliminate words and names with ambiguous or impossible pronunciations, and also catch a lot of typos
- added "capital letters at start of word only" rule, which can replace the default "no abbreviations" one

{
// No non-Toki-Pona letters
regex: /[BbCcDdFfGgHhQqRrVvXxYyZzĈĜĤĴŜŬĉĝĥĵŝŭäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]/,

You don't need to list every bad character. You can specify all the good characters instead, like this: /[^aeijklmnopstuw]/


Ah, but this wouldn't allow the "." character, spaces, or other valid things. Hm... I don't know.

Contributor Author

I took this list of bad characters from https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/eo.js

At first I only had [BbCcDdFfGgHhQqRrVvXxYyZz].

Member

@MichaelKohler left a comment

Thanks for contributing this. Generally looks good to me, just a few comments and questions from my side.

const MIN_WORDS = 1;

// Maximum of words allowed per sentence to keep recordings in a manageable duration.
const MAX_WORDS = 18;

Member

Approximately how long does it take to read 18 words out loud?

Contributor Author

We timed an 18-word sentence with 87 characters (a bit above average in length for 18 words), which took around 7 seconds to record at normal speed. At the fastest and slowest speeds that felt natural to me, the times ranged from 5 or 6 seconds on the low end to about 11 or 12 seconds on the high end (including the short silences between reading out the sentence and hitting the play/pause button).

By the way, due to its very regular and minimalistic phonology and phonotactics, counting characters would probably be a more accurate measure of how long a Toki Pona sentence takes to read out. If you want to prevent outliers with many long words and proper nouns, a 90-character limit might be better. But we shouldn't get many of those.

Member

I just looked it up, the maximum recording length is 10 seconds. I'll leave it up to you to judge what the best approach is here.

Contributor Author

After discussing this question with other contributors and gauging opinions, I decided to go with a character limit rather than a word limit. That option clearly has more support. I updated the file accordingly.
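A character-based limit is simple to sketch. The example below is illustrative only and is not the code that was merged: `MAX_CHARACTERS`, `isTooLong`, and `estimateSeconds` are hypothetical names, and the reading rate is derived from the single timing quoted in this thread (87 characters in about 7 seconds), so the duration estimate is rough at best.

```javascript
// Hypothetical sketch of a character-based length check.
// MAX_CHARACTERS mirrors the 90-character limit named in the PR description.
const MAX_CHARACTERS = 90;

// Rough rate from one timing in this thread: 87 characters in about 7 seconds.
const CHARS_PER_SECOND = 87 / 7;

// Count code points rather than UTF-16 units, so that characters outside the
// Basic Multilingual Plane count once.
function characterCount(sentence) {
  return [...sentence.trim()].length;
}

function isTooLong(sentence) {
  return characterCount(sentence) > MAX_CHARACTERS;
}

function estimateSeconds(sentence) {
  return characterCount(sentence) / CHARS_PER_SECOND;
}

console.log(isTooLong("kulupu musi ni li pakala e tomo sina.")); // false
console.log(isTooLong("a".repeat(91)));                          // true
console.log(estimateSeconds("a".repeat(90)).toFixed(1));         // "7.2"
```

At the slowest natural pace mentioned above (87 characters in 11 or 12 seconds), a 90-character sentence would take longer than the 10-second recording maximum, so the limit leaves little headroom for slow readers.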

server/lib/validation/languages/tok.js (outdated)

{
// No non-Toki-Pona letters
regex: /[BbCcDdFfGgHhQqRrVvXxYyZzĈĜĤĴŜŬĉĝĥĵŝŭäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]/,

Member

I think it would make sense to negate this and only list valid letters. I think that's what the suggestion from @tbodt is, but I don't understand Toki Pona, so can't say for sure :)


Indeed. I couldn't figure out a good regex to allow punctuation as well, though.
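For illustration, an allow-list that also admits whitespace and punctuation might look like the sketch below. The punctuation set is an assumption chosen for the example, and `DISALLOWED_CHARACTER` and `findDisallowedCharacter` are hypothetical names, not part of the Sentence Collector code.

```javascript
// Hypothetical allow-list: the 14 Toki Pona letters (either case), whitespace,
// and a small punctuation set. The punctuation set is assumed for illustration.
const DISALLOWED_CHARACTER = /[^aeijklmnopstuw\s.,:;!?"'-]/i;

// Returns the first character outside the allow-list, or null if there is none.
function findDisallowedCharacter(sentence) {
  const match = sentence.match(DISALLOWED_CHARACTER);
  return match ? match[0] : null;
}

console.log(findDisallowedCharacter("mi sona e ni: sina pona!"));    // null
console.log(findDisallowedCharacter("jan Sonja li yupekosi lili.")); // "y"
```

The trade-off is the one raised in this thread: every permitted punctuation mark has to be listed explicitly, and anything unlisted, including digits, is rejected.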

Contributor Author

Yeah, we would probably need to specify an exhaustive list of all valid characters, including punctuation, for that to work.

The current list is copied and slightly modified from Esperanto's validation script, which is meant to only allow Esperanto letters. After tbodt's comment I had the idea that it might be better to specify entire blocks of codepoints for this, e.g. [BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF]. That should weed out the vast majority of Latin script letter variants.

And actually, we might also want to add \uF1900-\uF19FF, the codepoints to which Toki Pona's sitelen pona script is assigned in the UCSUR standard.

Contributor Author

I changed it to [BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF\uF1900-\uF19FF] for now, when I updated the translations part

Contributor Author

@janPensa Mar 26, 2022

I made a similar change to Esperanto's script, changing
[wWqQxXyYäÄöÖüÜßððÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÛÝŽàáâãåæçèéêëìíîïðñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşşǐżőňựňžịŌŏČŠřś]
to a cleaner and more complete
[qQwWxXyYÀ-ćĊ-ěĞ-ģĞ-ģĦ-ijĶ-śŞ-ūŮ-\u02AF\u1E00-\u1EFFα-ωΑ-ΩЀ-ӿ].

And I also imported Esperanto translations from Pontoon while I was at it.
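To make the gaps in the range-based class easier to see, here is the same class rewritten with `\u` escapes. This is a sketch based on my reading of the ranges quoted above (in particular, it assumes the range starting at Ħ ends at U+0133, the "ij" ligature); `NON_ESPERANTO_LETTER` is a hypothetical name. The ranges skip exactly the code points of Ĉĉ, Ĝĝ, Ĥĥ, Ĵĵ, Ŝŝ and Ŭŭ, so those letters stay allowed.

```javascript
// Range-based class with explicit escapes. Gaps between the ranges:
// U+0108-0109 (Ĉĉ), U+011C-011D (Ĝĝ), U+0124-0125 (Ĥĥ),
// U+0134-0135 (Ĵĵ), U+015C-015D (Ŝŝ), U+016C-016D (Ŭŭ).
const NON_ESPERANTO_LETTER =
  /[qQwWxXyY\u00C0-\u0107\u010A-\u011B\u011E-\u0123\u0126-\u0133\u0136-\u015B\u015E-\u016B\u016E-\u02AF\u1E00-\u1EFF\u03B1-\u03C9\u0391-\u03A9\u0400-\u04FF]/;

console.log(NON_ESPERANTO_LETTER.test("Eĥoŝanĝo ĉiuĵaŭde")); // false
console.log(NON_ESPERANTO_LETTER.test("señor"));             // true (ñ)
console.log(NON_ESPERANTO_LETTER.test("Жук"));               // true (Cyrillic)
console.log(NON_ESPERANTO_LETTER.test("wiki"));              // true (w)
```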

server/lib/validation/languages/tok.js (outdated)
- added translations into the script
- updated prohibited letters
minor edit:
added a comment about UCSUR
Changed word limit to character limit, after discussing it with various contributors
- Split "symbols" and "non-Esperanto letters" to two separate invalidations and error messages
- Changed unwieldy list of forbidden characters to a more concise and more complete list
- Moved translations from Pontoon into document, and added missing translation
- Changed "word limit" and "symbols" translations to be easier to understand IMO
@janPensa janPensa changed the title Create tok.js Create tok.js, update eo.js Mar 26, 2022
Member

@MichaelKohler left a comment

Thanks for the updates. This generally looks good to me. The last missing piece here is to add tok to the validations:

Thanks!

@janPensa
Contributor Author

@MichaelKohler By the way, at the moment there are 11 sentences among the validated sentences at https://github.com/common-voice/common-voice/blob/main/server/data/tok/sentence-collector.txt that violate these new validation rules, because of typos or invalid words and names. Do you know what would be the best way to get these and any associated recordings removed?

@MichaelKohler
Member

You can give me a list of these sentences and I can remove them from Sentence Collector. However, removing the recordings themselves is probably too much work for the benefit of removing only 11 sentences. I would suggest just reporting them through the UI when you encounter them.

Member

@MichaelKohler left a comment

Thanks!

@MichaelKohler MichaelKohler merged commit 9e8e723 into common-voice:main Mar 27, 2022
@MichaelKohler
Member

🎉 This PR is included in version 2.17.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@janPensa
Contributor Author

@MichaelKohler Thank you for the help!

As for the 11 sentences I mentioned, they are:

jan lilii o tawa supa lape.
jan Timi li sona ala e sona pi pali ni.
jan Timi li wile ala lon jan pi kulupu ni.
mi ken ala pali tawa pona nim
mi tawa ma Sanai.
ona li jo e nimi mute pi toki Inl
sike pan pi ma Italia li kama ala lon insa mi.
sina sutopatikuna.
tenpo mute la jan Timi li ante e tomo ona.
tenpo pini la jan Sonja li yupekosi lili.
toki ni la mi yupekosi e kulupu ni pi toki kalama, a a a!
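As an illustration of how phonotactic rules would catch the sentences listed above, here is a minimal sketch. It is not the code from tok.js. It encodes my summary of the usual Toki Pona rules as assumptions: syllables have the shape (C)V(n), only a word-initial syllable may lack a consonant, the sequences ji, ti, wo and wu are not allowed, and a syllable-final n may not precede m or n. All names are hypothetical.

```javascript
// Hypothetical sketch of a Toki Pona word-shape check.
// Assumed rules: (C)V(n) syllables, no ji/ti/wo/wu, no "nm" or "nn".
const ONSET_SYLLABLE = "(?!ji|ti|wo|wu)[jklmnpstw][aeiou]";
const CODA = "(?:n(?![mn]))?";
const VALID_WORD = new RegExp(
  `^(?:${ONSET_SYLLABLE}|[aeiou])${CODA}(?:${ONSET_SYLLABLE}${CODA})*$`,
  "i"
);

// Returns the words in a sentence that do not fit the assumed word shape.
function invalidWords(sentence) {
  const words = sentence.match(/[A-Za-z]+/g) ?? [];
  return words.filter((word) => !VALID_WORD.test(word));
}

console.log(invalidWords("jan lilii o tawa supa lape."));               // ["lilii"]
console.log(invalidWords("mi ken ala pali tawa pona nim"));             // ["nim"]
console.log(invalidWords("jan Timi li sona ala e sona pi pali ni."));   // ["Timi"]
console.log(invalidWords("tenpo pini la jan Sonja li yupekosi lili.")); // ["yupekosi"]
```

A check of this kind rejects single-consonant interjections such as "n", so a real validator would need an explicit exception list if those should be accepted.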

I've also run the corpus through a word frequency counter and found these 7 typos at the bottom of the list:

ale li pona. tan same a la mi pilin monsuta?
jan lawa li lona poka mi.
kupupu musi ni li pakala e tomo sina.
ma seli la ko lejo li tawa sama kiwen telo lete.
mi sone e ni: tomo sina li lon nasin seme.
ona mute li loki e ni: mi toki mute.
tenpo ni la mi lukine ijo sin pi pakala mute.
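The frequency-count approach can be sketched as follows. This is a hypothetical illustration, not the tool that was actually used; `countWords` and `rarestWords` are made-up names, and the three-sentence corpus is a toy example.

```javascript
// Hypothetical word-frequency counter: words that occur very rarely in a
// corpus are candidates for typos and can be reviewed by hand.
function countWords(sentences) {
  const counts = new Map();
  for (const sentence of sentences) {
    for (const word of sentence.toLowerCase().match(/[a-z]+/g) ?? []) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  return counts;
}

// Returns the words that occur at most maxCount times, sorted alphabetically.
function rarestWords(sentences, maxCount = 1) {
  return [...countWords(sentences)]
    .filter(([, count]) => count <= maxCount)
    .map(([word]) => word)
    .sort();
}

// Toy corpus for illustration only.
const corpus = [
  "jan lawa li lon poka mi.",
  "jan lawa li lona poka mi.",
  "mi lon poka jan lawa.",
];

console.log(rarestWords(corpus)); // ["lona"]
```

Rare words are only candidates: a typo that happens to form another valid word would not surface this way, and legitimate but uncommon words would need manual review.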

@janPensa
Contributor Author

janPensa commented Mar 27, 2022

Oh crap, for some reason [\u00C0-\u02BF] and [\u1E00-\u1EFF] match with every letter in the Latin alphabet, making the Sentence Collector reject every submission.

I'll write a hotfix...

Edit:
Here it is #616
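One possible explanation, offered as an editorial sketch rather than a diagnosis confirmed in this thread: in a JavaScript regular expression without the `u` flag, `\uF1900` is parsed as `\uF190` followed by a literal `0`. The fragment `\uF1900-\uF19FF` then contains the range `0-\uF19F`, which runs from U+0030 to U+F19F and includes every basic Latin letter. The snippet below demonstrates this behaviour; it does not describe what the hotfix in #616 changed, and the variable names are hypothetical.

```javascript
// The two Latin ranges on their own do not match a plain letter.
console.log(/[\u00C0-\u02BF\u1E00-\u1EFF]/.test("a")); // false

// Without the `u` flag, \uF1900 is \uF190 plus a literal "0", so the class
// contains the range "0" (U+0030) to U+F19F, which covers all ASCII letters.
const asWritten =
  /[BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF\uF1900-\uF19FF]/;
console.log(asWritten.test("a")); // true

// With the `u` flag, \u{...} can address code points above U+FFFF directly.
const withCodePoints =
  /[BbCcDdFfGgHhQqRrVvXxYyZz\u00C0-\u02BF\u1E00-\u1EFF\u{F1900}-\u{F19FF}]/u;
console.log(withCodePoints.test("a"));                           // false
console.log(withCodePoints.test("b"));                           // true
console.log(withCodePoints.test(String.fromCodePoint(0xF1900))); // true
```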
