Use proper BCP 47 language tags for Chinese #1

r12a · 2020-11-17T11:57:55Z

https://github.com/WICG/handwriting-recognition/blob/main/explainer.md

Languages are identified by IETF BCP 47 language tags (e.g. en, zh-CN). If there's no dedicated models for that language tag, the recognizer falls back to the macro language (zh-CN becomes zh).

zh-CN is presumably meant to indicate Simplified Chinese, which is also used in Singapore. That's why it is better to use zh-Hans as the language tag, rather than zh-CN (and zh-Hant, rather than zh-TW).

Please change the example.
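
The fallback quoted from the explainer can be sketched as progressive truncation of subtags. This is only an illustration, not the explainer's or any browser's actual algorithm: `fallbackChain` is a hypothetical helper, and the truncation rule is loosely modelled on RFC 4647 lookup.

```typescript
// Hypothetical helper: list the tags a recognizer might try, most specific first.
function fallbackChain(tag: string): string[] {
  const subtags = tag.split("-");
  const chain: string[] = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
    // As in RFC 4647 lookup, a single-character subtag (e.g. "x") left at the
    // end is dropped as well, since it cannot end a tag on its own.
    while (subtags.length > 0 && (subtags[subtags.length - 1] ?? "").length === 1) {
      subtags.pop();
    }
  }
  return chain;
}

console.log(fallbackChain("zh-CN"));      // ["zh-CN", "zh"]
console.log(fallbackChain("zh-Hans-CN")); // ["zh-Hans-CN", "zh-Hans", "zh"]
```

Note that truncation alone never introduces a script subtag: `zh-CN` falls back to `zh`, not to `zh-Hans`.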

wacky6 · 2020-11-26T06:13:35Z

I agree. zh-CN is not the technically correct way of unambiguously specifying a language, but it's arguably more commonly used on the Web (Accept-Language, navigator.languages).

So I'd keep it in the example. In the meantime, I've created a PR to include a script subtag in the example.

See #8 (in conjunction with #2)

r12a · 2020-11-27T12:27:58Z

I think the time when it was commonly used on the Web as a way to refer to Simplified Chinese was many years ago. Nowadays zh-Hans works pretty well everywhere. And by using zh-CN prominently in your example you only promote the incorrect usage. So I still think you should change it. I can refer this issue to the i18n WG if you like.

I think it's fine to mention zh-CN as something that a user may type in, but which should be interpreted to mean zh-Hans, which you do in #8. But I think it should be framed to look like the recogniser is correcting incorrect input (which it is, since the actual script/orthography is very important for handwriting). I see an implication in the quoted text above (esp. because it doesn't even mention zh-Hans) that zh-CN is an appropriate way of referring to Simplified Chinese. It's really not. It's only appropriate if the language tag ignores script information and actually focuses on the region, which it may do, for example, when what's important is the spoken language (although that's problematic with respect to zh too, unless there's an implicit association of zh with cmn) or the locale (e.g. for location services, legal reasons, etc.).

wacky6 · 2020-11-30T06:30:17Z

I wouldn't say using "zh-CN" here is incorrect, given:

1. The attribute is language, not script. Language is a broad term. "zh-CN" basically means "Chinese as used in Mainland China".
   - In fact, Simplified Chinese, Traditional Chinese and the Latin alphabet are all used in Mainland China.
   - Assuming we promote "zh-Hans", would "Hans" exclude characters from the Latin alphabet (from the recognizer)? I'm not sure.
   - We don't want the API to say "you need to unambiguously specify all the scripts". That's probably more confusing than just specifying the region.
2. The script can be determined by using some established rules (e.g. Unicode likely subtags). "zh-CN" gets interpreted as "zh-Hans-CN", though this precise interpretation may be undesirable (see the point above).
   - The recognizer may have to include more scripts (i.e. Latn + Hans / Hani).
3. From an API ergonomics point of view, we don't want to give developers the impression that they need to (or should) convert "zh-CN" to "zh-Hans" in order to use the API correctly.
   - I don't know of a simple way to get the script for any language tag in the browser. My feeling is developers will use "zh-CN" (even if it's technically incorrect for a script), as long as it works (i.e. the browser interprets it reasonably).
   - It's perfectly okay for a website to target users in a region and not worry about the exact script being used (and let the browser deal with it). The recognizer is free to (and should) find out which scripts are appropriate for that region, and include all of them.
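
As an illustration of the likely-subtags rule mentioned above, `Intl.Locale.prototype.maximize()` applies the Unicode "Add Likely Subtags" algorithm in engines that support `Intl.Locale`. The sketch below is not part of the proposed API; `likelyScript` is a hypothetical helper, and the results depend on the CLDR data shipped with the runtime.

```typescript
// Hypothetical helper: the script a tag names explicitly, or else the one
// implied by Unicode likely subtags.
function likelyScript(tag: string): string | undefined {
  const locale = new Intl.Locale(tag);
  // Prefer an explicit script subtag; otherwise fill it in via maximize().
  return locale.script ?? locale.maximize().script;
}

console.log(new Intl.Locale("zh-CN").maximize().toString()); // "zh-Hans-CN"
console.log(likelyScript("zh-TW"));   // "Hant"
console.log(likelyScript("zh-SG"));   // "Hans"
console.log(likelyScript("zh-Hant")); // "Hant"
```

This yields a single most likely script per tag, so it does not by itself cover the case where a recognizer should accept several scripts (e.g. Latn alongside Hans).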

wacky6 · 2021-02-15T03:42:54Z

Hi @r12a, we have a question about language tags for non-standard "languages".

We have handwriting models for recognizing geometric shapes and/or user gestures (e.g. a square). What language tag could we use for this case?

I see there is a "zxx" primary tag for "No linguistic content; Not applicable". Is it suitable? For example, we could use "zxx-Shape" for the above recognizer. Or are private-use subtags more suitable?

r12a · 2021-06-11T14:52:47Z

I think it's best to avoid private subtags if at all possible, and zxx may indeed be what you need, but I'll refer this question to @aphillips, since he's a co-author of BCP 47.

aphillips · 2021-06-11T16:38:21Z

There are really two choices that occur to me here. One is to use zxx. The other would be und (Undetermined). The und tag is usually imputed to content with no language tag and it is used in CLDR and locale systems (such as JS's Intl.Locale) to mean the "root" locale. This might be more like what you intend. Regardless of what primary language subtag you choose, you should not use invalid tags such as zxx-shape. You might use a private-use tag, though, such as zxx-x-shape or und-x-symbols.
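
To make the private-use form concrete: in BCP 47 the singleton `x` introduces private-use subtags, each 1 to 8 alphanumeric characters long. The sketch below only splits a tag on that singleton; it is not a full BCP 47 validator, and `privateUseSubtags` is a hypothetical helper name.

```typescript
// Hypothetical helper: return the private-use subtags of a tag, or an empty
// array if it has no well-formed private-use sequence.
function privateUseSubtags(tag: string): string[] {
  const subtags = tag.toLowerCase().split("-");
  const start = subtags.indexOf("x");
  if (start === -1) return [];
  const rest = subtags.slice(start + 1);
  // Each private-use subtag must be 1 to 8 ASCII letters or digits.
  const wellFormed =
    rest.length > 0 && rest.every((s) => /^[a-z0-9]{1,8}$/.test(s));
  return wellFormed ? rest : [];
}

console.log(privateUseSubtags("zxx-x-shape"));   // ["shape"]
console.log(privateUseSubtags("und-x-symbols")); // ["symbols"]
console.log(privateUseSubtags("zxx-shape"));     // []
```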

wacky6 · 2021-12-08T08:45:03Z

Closing this issue.

zh-CN and zh-Hans convey different meanings: "zh-CN" means "Chinese as used in mainland China", while "zh-Hans" means "Simplified Chinese regardless of where it's used". Web applications should choose whichever is more suitable for their use cases.

We allow the browser implementation and the underlying recognizer to make reasonable assumptions about the script (considering that different handwriting recognizer implementations identify their models differently).

For shape / user gesture models, we will use a zxx private-use tag ("zxx-x-shape"), following this precedent: MLKit shape detection models.

wacky6 mentioned this issue Nov 26, 2020

Revise language handling #8

Merged

r12a added the i18n-tracker label Feb 10, 2021

w3cbot mentioned this issue Feb 10, 2021

Use proper BCP 47 language tags for Chinese w3c/i18n-activity#1033

Closed

xfq mentioned this issue Mar 12, 2021

add zh_CN translation for subtitles (#141) w3c/wot-marketing#149

Merged

wacky6 closed this as completed Dec 8, 2021
