This repository has been archived by the owner on May 10, 2023. It is now read-only.

Add sentence validator for Cantonese #605

Merged
merged 7 commits into from Feb 15, 2022

Conversation

laubonghaudoi
Contributor

No description provided.

Member

@MichaelKohler MichaelKohler left a comment

Thanks for working on this. I've added a few minor comments.

In addition, I think it would be beneficial to translate the error messages. That was done, for example, here: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/ckb.js . We decided not to translate these error messages through the normal UI translation process, as that would require quite a lot of work for little benefit. However, somebody adding sentences in a given language will most likely understand an error message translated into that language. Translating the error messages would therefore let contributors take part without knowing English. Do you think that would be a good improvement here?
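
As a rough illustration of the suggestion above, a language file could carry its messages directly in the target language. This is a minimal sketch only: it assumes the `{ regex, error }` rule shape quoted later in this thread, the names `INVALIDATIONS` and `findErrors` are hypothetical, and the Cantonese wording is an illustrative translation rather than the text that was merged.

```javascript
// Hypothetical rule list in the { regex, error } shape quoted in this thread.
// The Cantonese message is an illustrative translation, not the merged wording.
const INVALIDATIONS = [
  {
    regex: /[<>+*#@%^[\]()\/]/,
    error: "句子唔應該包含符號", // "Sentence should not contain symbols"
  },
];

// Hypothetical helper: return the message of every rule the sentence violates.
function findErrors(sentence) {
  return INVALIDATIONS
    .filter((rule) => rule.regex.test(sentence))
    .map((rule) => rule.error);
}
```

With messages stored this way, a contributor who submits a sentence containing `+` would see the Cantonese message instead of the English one.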

server/lib/validation/languages/yue.js (outdated, resolved)
server/lib/validation/languages/yue.js (outdated, resolved)
regex: /[<>+*#@%^[\]()\/]/,
error: "Sentence should not contain symbols",
}, {
// 7 or more repeating characters in a row is likely a non-formal spelling or difficult to read.
Member

Interesting, did that happen often in Sentence Collector? 7 repeating characters seems like a lot, but I have absolutely no language experience apart from Latin-based languages.

Contributor Author

This sometimes happens when people dump uncleaned sentences crawled directly from the web, for example sentences with long trailing dots such as 額........ Such sentences are most likely junk.
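
The regex for this rule is not visible in the quoted diff, so the following is only a sketch of how "7 or more repeating characters in a row" could be detected. The pattern and the name `hasLongRepeat` are assumptions, not the code from the PR.

```javascript
// Assumed pattern: capture any character, then require at least six more
// copies of it, which means seven or more identical characters in a row.
const REPEATED_CHARACTERS = /(.)\1{6,}/u;

// Hypothetical helper: true when the sentence contains such a run.
function hasLongRepeat(sentence) {
  return REPEATED_CHARACTERS.test(sentence);
}

// A crawled sentence with a long trail of dots is flagged,
// while a short repetition is left alone.
const crawled = "額" + ".".repeat(8);
console.log(hasLongRepeat(crawled)); // true
console.log(hasLongRepeat("哈哈哈")); // false
```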

error: "Sentence should not contain emojis or other special Unicode symbols",
}, {
regex: /[\u5427](\s|$)/,
error: 'Sentence should not end with Mandarin particles',
Member

Is this message correct? This would also reject \u5427 followed by a space. Is that also considered ending a sentence?

Contributor Author

Yes, sometimes a space indicates a pause or the end of a sentence. I have amended this rule in the latest commit.
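
To make the behavior under discussion concrete, here is a small sketch that exercises the rule exactly as quoted in the review (`/[\u5427](\s|$)/`, where `\u5427` is 吧). The amended rule from the later commit is not shown in this thread, so it is not reproduced here; the helper name `endsClauseWithParticle` is hypothetical.

```javascript
// The rule as quoted in the review: U+5427 (吧) followed by whitespace
// or by the end of the string.
const MANDARIN_PARTICLE = /[\u5427](\s|$)/;

// Hypothetical helper wrapping the quoted regex.
function endsClauseWithParticle(sentence) {
  return MANDARIN_PARTICLE.test(sentence);
}

console.log(endsClauseWithParticle("好吧")); // true: 吧 at the very end
console.log(endsClauseWithParticle("好吧 我哋走")); // true: 吧 before a space
console.log(endsClauseWithParticle("酒吧好多人")); // false: 吧 inside a word
```

The second case is the one the reviewer asked about: the sentence does not end with 吧, yet the rule rejects it because a space follows the particle.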

@laubonghaudoi
Contributor Author

Thanks for reviewing! It is now fixed in the latest commit.

Member

@MichaelKohler MichaelKohler left a comment

Thanks for working on this. One more small comment that I didn't catch before and then this can get merged :)

server/lib/validation/languages/yue.js (outdated, resolved)
Co-authored-by: Michael Kohler <me@michaelkohler.info>
MichaelKohler previously approved these changes Feb 15, 2022
Member

@MichaelKohler MichaelKohler left a comment

Thanks!

@MichaelKohler MichaelKohler merged commit 18fa4b2 into common-voice:main Feb 15, 2022
@MichaelKohler
Member

This will be part of the next release. I can't say yet when that will be, but there will be a new comment once the release is done.

@laubonghaudoi
Contributor Author

Great, thank you so much!

MichaelKohler pushed a commit that referenced this pull request Feb 20, 2022
# [2.17.0](v2.16.4...v2.17.0) (2022-02-20)

### Bug Fixes

* add sentence validator for Cantonese ([#605](#605)) ([18fa4b2](18fa4b2))
* remove certain sentences for tok language ([#607](#607)) ([7aa6ad7](7aa6ad7))

### Features

* add sentence validator for catalan ([#606](#606)) ([1096aef](1096aef))
@MichaelKohler
Member

🎉 This PR is included in version 2.17.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
