Use proper BCP 47 language tags for Chinese #1

Closed
r12a opened this issue Nov 17, 2020 · 7 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@r12a

r12a commented Nov 17, 2020

https://github.com/WICG/handwriting-recognition/blob/main/explainer.md

Languages are identified by IETF BCP 47 language tags (e.g. en, zh-CN). If there's no dedicated models for that language tag, the recognizer falls back to the macro language (zh-CN becomes zh).

zh-CN is presumably meant to indicate Simplified Chinese, which is also used in Singapore. That's why it is better to use zh-Hans as the language tag, rather than zh-CN (and zh-Hant, rather than zh-TW).

Please change the example.
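
The fallback quoted from the explainer (zh-CN becomes zh) resembles progressive truncation in the style of RFC 4647 "lookup". A minimal sketch follows, assuming a hypothetical `lookupModel` helper and a made-up list of supported model tags; neither is taken from the explainer. It illustrates why the choice of tag matters: truncating zh-CN lands on zh, which says nothing about the script.

```typescript
// Sketch of lookup-style fallback: drop the last subtag until a supported
// tag is found. `lookupModel` and the model list are hypothetical.
function lookupModel(requested: string, supportedModels: string[]): string | undefined {
  // Compare case-insensitively, but return the tag as the recognizer spelled it.
  const supported = new Map<string, string>();
  for (const tag of supportedModels) {
    supported.set(tag.toLowerCase(), tag);
  }

  const subtags = requested.toLowerCase().split("-");
  while (subtags.length > 0) {
    const match = supported.get(subtags.join("-"));
    if (match !== undefined) return match;
    subtags.pop();
    // A single-character subtag (e.g. "x") cannot end a tag, so drop it too.
    if (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return undefined;
}

const models = ["en", "zh", "zh-Hant"]; // hypothetical model inventory
console.log(lookupModel("zh-CN", models));      // "zh"
console.log(lookupModel("zh-Hant-TW", models)); // "zh-Hant"
console.log(lookupModel("fr-CA", models));      // undefined
```

With this kind of fallback, a request for zh-Hans or zh-Hant keeps its script information for one more truncation step than zh-CN or zh-TW does.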

@wacky6
Member

wacky6 commented Nov 26, 2020

I agree. zh-CN is not the technically correct way of unambiguously specifying a language, but it's arguably more commonly used on the Web (Accept-Language, navigator.languages).

So I'd keep it in the example. In the meantime, I've created a PR to include the script subtag in the example.

See #8 (in conjunction with #2)

@r12a
Author

r12a commented Nov 27, 2020

I think the time when it was commonly used on the Web as a way to refer to Simplified Chinese was many years ago. Nowadays zh-Hans works pretty well everywhere. And by using zh-CN prominently in your example you only promote the incorrect usage. So I still think you should change it. I can refer this issue to the i18n WG if you like.

I think it's fine to mention zh-CN as something that a user may type in, but which should be interpreted to mean zh-Hans, which you do in #8. But I think it should be framed to look like the recogniser is correcting incorrect input (which it is, since the actual script/orthography is very important for handwriting). I see an implication in the quoted text above (especially because it doesn't even mention zh-Hans) that zh-CN is an appropriate way of referring to Simplified Chinese. It's really not. It's only appropriate if the language tag ignores script information and actually focuses on the region, which it may do, for example, when what's important is the spoken language (although that's problematic with respect to zh too, unless there's an implicit association of zh with cmn) or the locale (e.g. for location services, legal reasons, etc.).

@wacky6
Member

wacky6 commented Nov 30, 2020

I wouldn't say using "zh-CN" here is incorrect, given:

  1. The attribute is language, not script. Language is a broad term. "zh-CN" basically means "Chinese used in Mainland China".

    • In fact, Simplified Chinese, Traditional Chinese and the Latin alphabet are all used in Mainland China.
    • If we promote "zh-Hans", I'm not sure whether "Hans" would exclude Latin-alphabet characters from the recognizer.
    • We don't want the API to say "you need to unambiguously specify all the scripts". It's probably more confusing than just specifying the region.
  2. The script can be determined by using some established rules (e.g. Unicode likely subtags). "zh-CN" gets interpreted as "zh-Hans-CN", though this precise interpretation may be undesirable (see the point above).

    • The recognizer may have to include more scripts (i.e. Latn + Hans / Hani).
  3. From an API ergonomics point of view, we don't want to give developers the impression that they need to (or should) convert "zh-CN" to "zh-Hans" in order to use the API correctly.

    • I don't know of a simple way to get the script for any language tag in the browser. My feeling is that developers will use "zh-CN" (even if it's technically incorrect for a script), as long as it works (that is, if the browser interprets it reasonably).
    • It's perfectly okay for a website to target users in a region and not worry about the exact script being used (letting the browser deal with it). The recognizer is free to (and should) find out which scripts are appropriate for that region, and include all of them.
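
One possible way to apply the likely-subtags rules mentioned in point 2 is `Intl.Locale.prototype.maximize()`, which implements the Unicode "Add Likely Subtags" algorithm in engines that support `Intl.Locale`. The sketch below assumes such an engine; the `likelyScript` helper is a hypothetical name used for illustration and is not part of the handwriting recognition API. The result depends on the CLDR data shipped with the engine.

```typescript
// Sketch: infer the likely script for a language tag via likely subtags.
// `likelyScript` is a hypothetical helper, not part of any proposed API.
function likelyScript(tag: string): string | undefined {
  // maximize() fills in missing language, script and region subtags.
  return new Intl.Locale(tag).maximize().script;
}

console.log(new Intl.Locale("zh-CN").maximize().toString()); // "zh-Hans-CN"
console.log(likelyScript("zh-CN")); // "Hans"
console.log(likelyScript("zh-TW")); // "Hant"
```

This only yields a single likely script per tag, so it does not by itself answer the concern above about regions where several scripts are in use.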

@r12a r12a added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Feb 10, 2021
@wacky6
Member

wacky6 commented Feb 15, 2021

Hi @r12a, we have a question about language tags for non-standard "languages".

We have handwriting models for recognizing geometric shapes and/or user gestures (e.g. a square). What language tag could we use for this case?

I see there is a "zxx" primary tag for "No linguistic content; Not applicable". Is it suitable? For example, we could use "zxx-Shape" for the above recognizer. Or are private-use subtags more suitable?

@r12a
Author

r12a commented Jun 11, 2021

I think it's best to avoid private subtags if at all possible, and zxx may indeed be what you need, but I refer this question to @aphillips, since he's a co-author of BCP 47.

@aphillips

There are really two choices that occur to me here. One is to use zxx. The other would be und (Undetermined). The und tag is usually imputed to content with no language tag and it is used in CLDR and locale systems (such as JS's Intl.Locale) to mean the "root" locale. This might be more like what you intend. Regardless of what primary language subtag you choose, you should not use invalid tags such as zxx-shape. You might use a private-use tag, though, such as zxx-x-shape or und-x-symbols.
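
To illustrate the shape of the private-use tags suggested above: in BCP 47 the singleton subtag x introduces a private-use sequence, so zxx-x-shape splits into the primary language subtag zxx plus the private-use subtag shape. A minimal sketch, assuming hypothetical helper names (`splitPrivateUse`, `isNonLinguisticModel`); it only splits the string and does not validate tags against the subtag registry.

```typescript
// Sketch: separate the private-use part ("-x-...") of a language tag.
// Helper names are hypothetical; no registry validation is performed.
interface ParsedTag {
  base: string;         // everything before the "x" singleton
  privateUse: string[]; // subtags after the "x" singleton
}

function splitPrivateUse(tag: string): ParsedTag {
  const subtags = tag.split("-");
  const index = subtags.map((s) => s.toLowerCase()).indexOf("x");
  if (index === -1) {
    return { base: tag, privateUse: [] };
  }
  return {
    base: subtags.slice(0, index).join("-"),
    privateUse: subtags.slice(index + 1),
  };
}

// Hypothetical check a recognizer might use for shape/gesture models.
function isNonLinguisticModel(tag: string): boolean {
  const parsed = splitPrivateUse(tag);
  const base = parsed.base.toLowerCase();
  return (base === "zxx" || base === "und") && parsed.privateUse.length > 0;
}

console.log(splitPrivateUse("zxx-x-shape"));        // { base: "zxx", privateUse: ["shape"] }
console.log(isNonLinguisticModel("und-x-symbols")); // true
console.log(isNonLinguisticModel("zh-Hans"));       // false
```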

@wacky6
Member

wacky6 commented Dec 8, 2021

Closing this issue.

"zh-CN" and "zh-Hans" convey different meanings: "zh-CN" means "Chinese as used in mainland China", while "zh-Hans" means "Simplified Chinese regardless of where it's used". Web applications should choose whichever is more suitable for their use cases.

We allow the browser implementation and the underlying recognizer to make reasonable assumptions about the script (considering that different handwriting recognizer implementations identify their models differently).

For shape / user gesture models, we will use a zxx private-use tag ("zxx-x-shape"), following this precedent: MLKit shape detection models.

@wacky6 wacky6 closed this as completed Dec 8, 2021