Create a proper UniSourceHanSansHK #32

Closed
hfhchan opened this issue Jul 20, 2014 · 19 comments

@hfhchan

hfhchan commented Jul 20, 2014

The Hong Kong standardized forms and social norms depart from the Taiwanese MOE variants to the extent that certain characters are exact replicas of the PRC variants rather than the Taiwanese MOE ones. As a result, UniSourceHanSansTWHK is actually unfit for daily use in Hong Kong.

The main standard, 《香港常用字字表》 ("List of Graphemes of Commonly-used Chinese Characters"), is now incorporated into 《香港小學學習字詞表》; the online version can be found at http://www.edbchinese.hk/lexlist_ch/. This list of graphemes is the de facto standard, as all Kai typefaces in primary and secondary school textbooks are required to adhere to it. Local dictionaries also adhere to this standard rather than the ROC MOE one.

There is also a set of guidelines for the IT industry prepared by the Chinese Language Interface Advisory Committee (CLIAC). Although the guideline was drafted to cover all CJK characters in the BMP and CJK Extension A, it is specified as a list of "basic components", which can also be applied to newly created characters. Certain ROC MOE variants are deemed incorrect, especially 周、告、骨. The guideline also requires that certain characters take a glyph that is already specified by another codepoint*.

*The word 兌 in Big5 character set is required to be mapped to codepoint U+514C. However, the standard in Hong Kong is to render it exactly as 兑, which-so-happens-to-be the form mapped to U+5151 by GBK. Therefore, the guideline requires both U+514C and U+5151 to be rendered as 兑 (and similarly for all characters that utilize the component).
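For illustration, the requirement in this footnote amounts to a region-specific cmap in which two codepoints share one glyph. The following is a minimal sketch only: the glyph names (`dui.roc`, `dui.prc`) and both tables are hypothetical and are not taken from any Source Han Sans build.

```python
# Hypothetical glyph names; a real font would use its own CIDs or glyph names.
CMAP_TW = {
    0x514C: "dui.roc",  # ROC MOE form
    0x5151: "dui.prc",  # form mapped from GBK
}
CMAP_HK = {
    0x514C: "dui.prc",  # the guideline asks for the same shape as U+5151
    0x5151: "dui.prc",
}
CMAPS = {"TW": CMAP_TW, "HK": CMAP_HK}


def glyph_for(codepoint: int, region: str) -> str:
    """Return the (hypothetical) glyph name used for a codepoint in a region."""
    return CMAPS[region][codepoint]


def share_glyph(a: int, b: int, region: str) -> bool:
    """True if two codepoints are drawn with one and the same glyph."""
    return glyph_for(a, region) == glyph_for(b, region)


if __name__ == "__main__":
    for region in ("TW", "HK"):
        print(region, share_glyph(0x514C, 0x5151, region))
```

Under these assumptions the HK table needs no extra glyph for U+514C; it only points an existing codepoint at a glyph the font already contains.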

The characters generally fall into five types, roughly in order of prevalence:

  • Exact Replicas of both PRC & ROC Variants
  • Exact replicas of ROC variants
  • Exact replicas of PRC variants
  • Contain components similar to PRC and ROC variants
  • Contain components that do not exist in either PRC or ROC variants

Exact Replicas of both PRC & ROC Variants

  • e.g. 我你什共們原的話**

Exact Replicas of ROC Variants:

  • e.g. 他甚花草這勻鈞敢茲今

Exact Replicas of PRC Variants:

  • e.g. 示隸黃廣總統
  • e.g. those that involve "月" "次" "户" "兌" components: 有育胃臂資扇請 資次 肩房扁 說銳

Contain components similar to PRC and ROC variants

  • 脫能化

Contain components that do not exist in either PRC or ROC variants

  • 滋、磁、龜、告、周

**《香港電腦漢字字形參考指引》 specifies a different glyph, in which the top right component is a 干 instead of a 千. This deviates from what is specified in 《香港常用字字表》. As the standard was prepared by a committee of teachers and experts and is the de facto standard, while the guideline was prepared by non-experts and is only advisory, the standard should prevail.

As most characters can be directly re-used, the work should be feasible in a short span of time.
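The five types listed above can be expressed as a simple comparison of component sets, which also gives a rough way to count how many glyphs could be re-used. This is a sketch under stated assumptions: each character is described by a tuple of component labels per region, and every label below (`c1`, `moon.prc`, and so on) is an invented placeholder, not an analysis of the real glyphs. Only the category assigned to each sample character follows the lists in this post.

```python
from collections import Counter
from enum import Enum


class Reuse(Enum):
    BOTH = "exact replica of both PRC and ROC forms"
    ROC = "exact replica of the ROC form"
    PRC = "exact replica of the PRC form"
    MIXED = "built only from components found in PRC or ROC forms"
    NEW = "needs a component found in neither"


def classify(hk: tuple, prc: tuple, roc: tuple) -> Reuse:
    """Classify one character from its per-region component labels."""
    if hk == prc == roc:
        return Reuse.BOTH
    if hk == roc:
        return Reuse.ROC
    if hk == prc:
        return Reuse.PRC
    if set(hk) <= set(prc) | set(roc):
        return Reuse.MIXED
    return Reuse.NEW


# Placeholder data: (HK, PRC, ROC) component labels per character.
SAMPLES = {
    "我": (("c1",), ("c1",), ("c1",)),
    "他": (("c2.roc",), ("c2.prc",), ("c2.roc",)),
    "有": (("c3", "moon.prc"), ("c3", "moon.prc"), ("c3", "moon.roc")),
    "脫": (("c4.prc", "c5.roc"), ("c4.prc", "c5.prc"), ("c4.roc", "c5.roc")),
    "告": (("c6.hk",), ("c6.prc",), ("c6.roc",)),
}

if __name__ == "__main__":
    tally = Counter(classify(*forms) for forms in SAMPLES.values())
    for kind, count in tally.items():
        print(f"{count:3d}  {kind.value}")
```

Only the last two categories would require new drawing work; the first three can point at glyphs that already exist.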

@hfhchan
Author

hfhchan commented Jul 20, 2014

@kenlunde It was mentioned in issue #6:

First, the representative glyphs as used in the national standards of each region are those
that are preferred, and it is not really a matter of correctness (which can be subjective, and
can change over time) but rather one of following current conventions.

Hong Kong's 《香港常用字字表》 may not be written into law like the ROC MOE standard, but its actual implementation makes it a de facto standard. Both by standard and by convention, the ROC MOE forms are not used in Hong Kong, especially those involving the 「月」 component and 周/告 etc.

Would this not warrant the creation of a font that is actually suitable for use in Hong Kong?

@hfhchan
Author

hfhchan commented Jul 23, 2014

The 64K character limit in the multilingual Source Han Sans font was mentioned. Would it be possible to leave the HK glyphs out of the multilingual Source Han Sans font and include them only in the Source Han Sans HK font?

@acuteaccent

"The word 兌 in Big5 character set is required to be mapped to codepoint U+514C. However, the standard in Hong Kong is to render it exactly as 兑, which-so-happens-to-be the form mapped to U+5151 by GBK. Therefore, the guideline requires both U+514C and U+5151 to be rendered as 兑 (and similarly for all characters that utilize the component)."

No. U+5151 is for 兑, and U+514C is for 兌. They are separated because of the source separation rule. If you want 兑, use U+5151; if you want 兌, use U+514C. If you don't like this, go back in time and change the national standard that separated 兌 and 兑.

@azurerime

In my view, the 《香港常用字字表》 standard does not reflect the forms commonly used in Hong Kong.
The press in Hong Kong prefers Monotype fonts, while in Taiwan it prefers Dynacw fonts.
This makes the social norms different.

With the Taiwanese MOE standard adopted in MS Windows in recent years, many MOE variants have been accepted in Hong Kong. However, some variants, such as 充, 黃 and 窗, are highly controversial.

4 main free newspapers in Hong Kong:
Metro daily - http://www.readmetro.com/en/china/hong-kong/
am730 - http://epaper.am730.com.hk
Headline daily - http://hd.stheadline.com/
Skypost - http://www.skypost.hk/epaper

@hfhchan
Author

hfhchan commented Jul 23, 2014

@acuteaccent

No. U+5151 is for 兑, and U+514C is for 兌. They are separated because of the source separation rule. If you want 兑, use U+5151; if you want 兌, use U+514C. If you don't like this, go back in time and change the national standard that separated 兌 and 兑.

Sorry, please re-read what I wrote, and read the guideline. I'm quite sure you have misunderstood it.

This behavior of rendering U+514C as U+5151 is specified in the guideline produced by the Hong Kong Government. It is IN FACT due to the source separation rule. Back in the days when Big5 was in use, Hong Kong rendered the codepoint for 兌 as 兑 (and 脫 as 脱, etc.). As glyphs in the URO are defined by their mapped character, 兌/兑 in Big5 was mapped to U+514C (and not U+5151). Therefore, U+514C is required to be rendered as 兑 (U+5151) when the HK locale is used (and ditto for every character with the 兌 component). Even the government-produced reference font uses the same glyph for both codepoints (and for characters that use this component).

As for the reference rendering by Unicode: again, the HB-source glyphs are from Taiwan, so the Unicode chart is of no use here until 2015, when the HB glyphs are replaced with glyphs that are actually sourced from the HK government.

@hfhchan
Author

hfhchan commented Jul 23, 2014

@azurerime the 《香港常用字字表》 standard in the EDB Lex-list was modified in 2007 to be in line and more consistent with the normal forms in use in Hong Kong (e.g. 潛字不出頭, "the strokes in 潛 do not protrude at the top"; 求字係棟鈎, "求 is written with a vertical hook").

In Taiwan, the MOE forms are not used in most newspapers either, yet they were adopted because they were taught in primary and secondary schools. In HK, authorized Chinese textbooks are required to adhere to this standard. Also, the fonts in the major newspapers, esp. 東方日報, are by Monotype, and these are closer to this standard than any other fonts.

P.S. The 2005 version often found online is not the official one. The latest official one (2007 version) can be found at http://www.edbchinese.hk/lexlist/.

@tamcy

tamcy commented Jul 23, 2014

While I support the development of UniSourceHanSansHK, we may need to sort out the standard forms that are not commonly written by people and decide what we should do with them. For example, the standard form of "於" in HK may surprise many people:
http://www.edbchinese.hk/lexlist_ch/result.jsp?id=1746&sortBy=stroke&jpC=lshk

@acuteaccent

@hfhchan Then it is Hong Kong's fault using U+514C 兌 for 兑 (and 脱 for U+812B 脫, etc). Having 兑 at the code point U+514C 兌 (and having 脱 at the code point U+812B 脫, etc) will only confuse users and it is not following Unicode correctly.

See what Japan did in 2004, when it revised its character code standard JIS X 0213. Japan changed the glyphs of some JIS X 0213 characters so that those characters follow the glyphs on the Hyōgai Kanji Jitaihyō (表外漢字字体表).
(You can see the glyph changes here: http://www.asahi-net.or.jp/~ax2s-kmtn/ref/jis2000-2004.html AFAIK, a lot of Japanese fonts that support JIS X 0213 are complying with those glyph changes.)

However, there were ten characters that required the changes of Unicode mappings in order to follow the Hyōgai Kanji Jitaihyō. For those ten characters, Japan did not change the glyphs; instead, it added ten new characters to other JIS X 0213 code points. For example, the one on the Hyōgai Kanji Jitaihyō is 剝, but JIS X 0213 only had 剥 (at 1-39-77). So Japan did not change the glyph of 剥, but it added 剝 to another JIS X 0213 code point (1-15-94) instead. This way, the Unicode mapping of JIS X 0213's 1-39-77 can be retained and both 剥 and 剝 can be used in JIS X 0213.

I think the 兌/兑 problem can be dealt with like the 2004 revision of JIS X 0213. For the code point that is mapped to U+812B, keep 脫 (or change 脱 to 脫), and add U+8131 脱 to another HKSCS code point. This way, the Unicode mappings for the existing Big5/HKSCS characters can be retained and both 脫 and 脱 can be used in HKSCS.

@hfhchan
Author

hfhchan commented Jul 23, 2014

@acuteaccent It's not up to me or you to dictate an alternative. You can send a letter to the CLIAC of OGCIO of HKSARG to complain if you like. How to deal with it has been outlined in the industry guideline, and I'm only stating it here for reference in case this particular font variant is made, as a reminder of peculiarities in the standard that don't necessarily exist in other standards.

Big5 has never had separate codepoints for 兌 and 兑, and from the point of view of a Hong Kong citizen it is completely understandable why they have been treated as the same character. It is only DUE TO THE source separation rule that the SAME GLYPH in GBK was mapped to a different codepoint, and for CULTURAL REASONS in Hong Kong, 兌's representative glyph has always been identical to 兑. The Big5 character shouldn't suddenly become mapped to 兑 at U+5151 just because HK's representative glyph is an exact replica of U+5151. Nor do I see why the fact that Unicode has encoded two characters means that Hong Kong's representative glyph must follow suit.

Remember that Unicode characters in the URO are defined by their MAPPING TO LEGACY ENCODINGS. It is a fact that Big5 has never distinguished between the two separate-in-Unicode characters. It is technically infeasible to amend this legacy encoding and (re-)convert the large amount of content using it. It is also impractical, as people are already discouraged from using Big5. It is also illogical to ask for an amendment to a legacy encoding JUST BECAUSE of a discrepancy between the Unicode treatment and the legacy encoding treatment. What's more, Big5 was not a mandatory standard in Taiwan, and it has long been adopted in Hong Kong. Why is Hong Kong at fault for disagreeing with a representative glyph? Just because Hong Kong was not involved when the URO mappings were done? This sounds a bit like bullying imho.

I also want to point out that JIS X 0213:2004's problem was that the representative glyph submitted to Unicode differed from that required by the Hyōgai Kanji Jitaihyō. My understanding of your logic is that the new glyph would be more similar to another character encoded in Unicode, and therefore a new, separate codepoint was assigned. However, this representative glyph problem occurred AFTER the URO was formed. And under an encoding that could display characters in JIS X 0213:2000, 剥 (at 1-39-77) was supposed to be rendered as 剥 anyway. Changing the glyph would cause confusion to users, so the new character had to be assigned a new codepoint in JIS X 0213:2004. Thus, even if you take Unicode out of the picture, their treatment is still warranted. It is also well known that the other characters whose representative glyph changed caused confusion. It just HAPPENS that these new glyphs didn't have a separate codepoint in Unicode, so nobody bothered to create a new codepoint for them in the JIS X 0213:2004 standard. Chaos ensued among font makers. It was quite lucky that these characters weren't in use in names. It is well known what happened with the change of representative glyph for 辻.

WHEREAS in this case, the HK government's representative glyph for 兌 has always been exactly equal to 兑. It was an oversight in the formation of the URO: there was no H-source at that time (GSCS characters were classified with a G-source), so the issue was never raised until Hong Kong decided to create the guideline for standardized glyphs and realized that the representative glyph at U+514C should look like U+5151. But even if an H-source had existed, it is unlikely that one Big5 codepoint would have been mapped to two codepoints in the URO.

It is the industry guideline that states that the representative glyph for 兌 at U+514C is equal to 兑. It was not invented by ME, and it is a direct CONSEQUENCE of URO characters being defined by their mapping to their legacy encoding. It is also up to the government (arguably, its people?) to decide its representative glyph. It SHOULD BE RESPECTED that Hong Kong and Taiwan have different representative glyphs. I have done nothing to encourage or discourage it; if you strongly believe that Hong Kong should adopt the method used in JIS X 0213:2004 to resolve these discrepancies (despite it being a completely different case), feel free to voice your opinion to the Hong Kong Government. However, it still remains that the representative glyphs for Hong Kong should be U+514C = U+5151, U+812B = U+8131, etc.

If you still do not like it, remember you can choose UniSourceHanSansTW(HK) or UniSourceHanSansJP. No one is forcing you to use this proposed new UniSourceHanSansHK. The font should (and does) render U+514C with your preferred glyph.

@hfhchan
Author

hfhchan commented Jul 23, 2014

Given that
0xA749 (Big5) glyph as 兌 in Taiwan.
0xA749 (Big5) glyph as 兑 in Hong Kong.
0xBBA1 (Big5) glyph as 說 in Taiwan.
0xBBA1 (Big5) glyph as 説 in Hong Kong.
0xB2E6 (Big5) glyph as 脫 in Taiwan.
0xB2E6 (Big5) glyph as 脱 in Hong Kong.

And Unicode Big5 Mapping
Excerpt http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
0xA749 0x514C #
0xBBA1 0x8AAA #
0xB2E6 0x812B #

Thus,
0x514C (Unicode) should render as 兑 in Hong Kong,
0x8AAA (Unicode) should render as 説 in Hong Kong.
0x812B (Unicode) should render as 脱 in Hong Kong.

But you disagree.

I can only deduce two possibilities:

  1. Unicode's mapping is wrong
  2. Hong Kong's Big5 glyph =/= Taiwan Big5 glyph = Hong Kong's fault

For 1, maybe Unicode should have produced two mappings, one for Big5 in Taiwan, and one for HKSAR. But sorry, published n-years ago, damage done. Hong Kong wasn't recognized as H-source until way after the URO was formed.

For 2, that is blatantly unfair to Hong Kong. Is Hong Kong inferior to Taiwan? Can Hong Kong not choose its own representative glyphs? Unicode forgot about Hong Kong, so Hong Kong should declare its own representative glyphs wrong and use Taiwan's, re-extend HKSCS:2008 to HKSCS:2013, re-convert all the Unicode-encoded Chinese text to use the "correct" codepoints in Unicode, and ask all font makers for Hong Kong to amend their fonts to put the glyph at the right codepoint, just because the URO failed to acknowledge a difference of two dots?

Interesting.
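The derivation at the top of this comment (Big5 code, then Unicode mapping, then regional form) can be written as a composition of two small tables. This is a sketch only: `BIG5_TO_UNICODE` repeats the three rows quoted from BIG5.TXT above, while `REGIONAL_FORM` is a hypothetical table that records each regional form by the codepoint whose usual shape it matches, following the "Given that" list.

```python
# Rows quoted in this comment from the Unicode BIG5.TXT mapping file.
BIG5_TO_UNICODE = {
    0xA749: 0x514C,
    0xBBA1: 0x8AAA,
    0xB2E6: 0x812B,
}

# Hypothetical table: the form each region draws for a Big5 code, named by
# the codepoint whose usual shape matches it (from the "Given that" list).
REGIONAL_FORM = {
    0xA749: {"TW": 0x514C, "HK": 0x5151},
    0xBBA1: {"TW": 0x8AAA, "HK": 0x8AAC},
    0xB2E6: {"TW": 0x812B, "HK": 0x8131},
}


def expected_rendering(region: str) -> dict[int, int]:
    """Map each Unicode codepoint to the shape a regional font should draw."""
    return {
        BIG5_TO_UNICODE[big5]: forms[region]
        for big5, forms in REGIONAL_FORM.items()
    }


if __name__ == "__main__":
    for codepoint, shape in expected_rendering("HK").items():
        print(f"U+{codepoint:04X} should look like U+{shape:04X} ({chr(shape)})")
```

For the TW region the function returns an identity mapping, which is why the discrepancy only becomes visible once a separate Hong Kong table exists.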

@kenlunde
Contributor

This is simply a quick note to reassure the community that I am not ignoring this issue. Rather, because I am on vacation in South Dakota, and 1,500 miles away from my reference materials, I cannot take any substantive or meaningful action until I return to work sometime next week.

To answer @hfhchan's question in the third post of this issue, the 64K glyph barrier is an issue. Our desire is for all region-specific subsets to be pure subsets of the full glyph set. Given the extent to which Hong Kong–specific glyphs may need to be added, this may become a challenge, mainly because the scope is not simply Hong Kong SCS, but Big Five as well. More investigation is necessary.

@acuteaccent

@hfhchan But two different code points with the same glyph is not a good thing. It will confuse users (especially those who are not familiar with character sets) and search engines. Users might not know that there are two 兑s (if both U+514C and U+5151 are designed as 兑). Some will use U+514C and some will use U+5151. What is the point of separating U+514C and U+5151, if they are both rendered as 兑?

So I think the separated code points in Unicode should be respected first. I have no intention to disrespect Hong Kong's glyph guideline at all. If I seemed to disrespect Hong Kong's guidelines, I apologize. I definitely do agree that the guideline should be respected, but I think that two code points with the same glyph should be avoided, as that can confuse users.
It is just that I think having the glyph of U+514C as 兑 is not a good idea; I think either the Unicode mapping or the glyph should be changed.

@kenlunde
Contributor

I need to agree with @hfhchan about how to handle the pairs of similar characters as being discussed recently. When you think about how users are entering characters, which is not by selecting code points in charts (a cumbersome task), but rather via an input method that is generally geared for a particular region or language, which particular code point is inserted into the document becomes clear. It actually makes no difference what glyph is at the other code point, because it is effectively invisible (or inaccessible) to the input method.

Using the U+514C/U+5151 pair as an example, a Big Five– or Hong Kong SCS–based input method will allow the user to enter U+514C, but not U+5151. Of course, U+5151 corresponds to a CNS 11643 Plane 3 character, but that is out of the scope of these fonts and this discussion.

@acuteaccent

@kenlunde What about copied-and-pasted strings? Such strings can contain U+5151.

@kenlunde
Contributor

That's simply bad or inappropriate data for the intended purpose. The same thing is true for U+20F96 if someone (inappropriately) were to use it as the Japanese form of U+5668 (器). No one will propose that Japan change its mapping for its form of 器 from U+5668 to U+20F96 simply because their shapes, as appropriate for Japanese, happen to be identical.

@acuteaccent

@kenlunde Well, I think you are right. As you pointed out, there are already duplicates like U+20F96 and U+29516… I guess having several more like that for Hong Kong would not hurt that much…
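The copy-and-paste concern in this exchange can be made concrete with a small sketch. It assumes an application that chooses to flag, or fold for search purposes, the codepoints that a Big Five–based input method would not produce; the table `HK_FOLD` and both function names are hypothetical, and only the three pairs named in this thread are listed.

```python
# Hypothetical folding table: codepoint that looks identical in an HK font
# -> codepoint that the Big5 mapping actually uses.
HK_FOLD = {
    0x5151: 0x514C,
    0x8AAC: 0x8AAA,
    0x8131: 0x812B,
}


def stray_code_points(text: str) -> list[str]:
    """List codepoints in text that a Big5-based input method would not emit."""
    return sorted({f"U+{ord(ch):04X}" for ch in text if ord(ch) in HK_FOLD})


def fold_for_hk_search(text: str) -> str:
    """Fold look-alike codepoints so that both spellings match in a search."""
    return text.translate(HK_FOLD)


if __name__ == "__main__":
    pasted = "\u5151\u63db"  # contains U+5151 instead of U+514C
    print(stray_code_points(pasted))
    print(fold_for_hk_search(pasted) == "\u514c\u63db")
```

Folding is applied only to a search key here; the stored text is left untouched, so the original codepoints are not lost.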

@azurerime

I wonder whether Big5-HKSCS could change the mapping as below in the future as a solution.
0xA749 (Big5-HKSCS) <=> 兑 U+5151
0xBBA1 (Big5-HKSCS) <=> 説 U+8AAC

It seems that only Hong Kong believes that 兌=兑 or 說=説. This may cause confusion when people communicate with Taiwanese users, or with others who don't use this font.

Currently, there is a pair U+81F4(致) and U+26936(𦤶) in Traditional Chinese differs from Simplified Chinese, which is reversed. I don't think that 至夊 is equal to 至攵.
       T-Chinese S-Chinese
U+81F4 致  至夊    至攵
U+26936 𦤶  至攵    至夊

@hfhchan
Author

hfhchan commented Jul 24, 2014

@azurerime

It seems that only Hong Kong believes that 兌=兑 or 說=説. This may cause confusion when people communicate with Taiwanese users, or with others who don't use this font.

I think this problem is fairly limited, as users have to explicitly choose a Hong Kong font (or set the Hong Kong locale) to get this behavior. By default even Hong Kong users are assumed to be Taiwan users unless otherwise set :(

Currently, there is a pair U+81F4(致) and U+26936(𦤶) in Traditional Chinese differs from Simplified Chinese, which is reversed. I don't think that 至夊 is equal to 至攵.

致 (U+81F4) and 𦤶 (U+26936) are historically the same character, with 至攵 as 俗寫 (common) and 至夊 as 本寫 (original). Taiwan standardized the original form, while China standardized the common form. It's just due to historical reasons that the common form in both areas was unified to U+81F4 and the less common form unified to U+26936. Too bad, the mappings are more or less fixed. And it is not up to the font to correct them either. Maybe you can submit a disunification request to the IRG, but I doubt it will be accepted due to the sheer amount of content already encoded.

Note: the Hong Kong standardized glyphs do not distinguish between radical 34 (夂) and radical 35 (夊) when they are used as a component of a character. Neither does China. That means separate glyphs are not required: the glyph for U+81F4 in HK should reuse the G-source glyph for U+26936. Great to save another glyph!

@kenlunde
Contributor

kenlunde commented Aug 1, 2014

I am closing this Issue, and have opened Issue #48 to indicate the action that is planned to address the concerns. Please feel free to continue posting to this issue if appropriate.
