Consolidation of Additional Glyph & Character Suggestions (See Issue #180) #115

Closed
ShikiSuen opened this issue Jun 26, 2015 · 60 comments

Comments

@ShikiSuen

Currently, Source Han Sans TW does not include the simplified kanji glyphs used in the PRC. However, the MOE has shown in its 「全字庫正宋體」 and 「全字庫正楷體」 fonts that its standards are also applied to such glyphs.
image
image

The 「全字庫正宋體」 and 「全字庫正楷體」 fonts can be downloaded from this website:
http://data.gov.tw/node/5961

Meanwhile, there are other MOE-oriented fonts that adopt such MOE-standard glyphs for PRC Simplified Chinese characters:
// PingFang, the default Traditional Chinese fallback font since OS X El Capitan and iOS 9:
image
// DFKai-SB, a.k.a. 標楷體, supports Simplified Chinese kanji glyphs since Windows Vista:
image
// MOE Sung UN
image
// PMingLiU, MOE CNS11643 standard font since Windows Vista:
image

I created this thread to follow Ken Lunde's slogan in issue #99:
image
This way, people can discuss the matter here (instead of in issue #99) before Ken Lunde makes his final decisions about it.

@ShikiSuen
Author

Here are my opinions:

As an adult mainland PRC passport holder, I feel that the simplified kanji glyphs used in the PRC are better designed in 「全字庫正宋體」 and 「全字庫正楷體」, since they are easier and faster to write. This design also makes each glyph look more distinctive.

Update: PingFang is manufactured by DynaComware only.

@ShikiSuen
Author

(Carbon copy sent to @jimmymasaru.)

@kenlunde
Contributor

PingFang is a Pan-Chinese typeface family that does not use region-specific subsets, which means that the Simplified and Traditional Chinese fonts have the same Unicode coverage. The character in question, U+604B (恋), is in CNS 11643 Plane 3, but the scope of Traditional Chinese in SHS is capped at Big Five Levels 1 and 2, which are equivalent to CNS 11643 Planes 1 and 2.

@kenlunde self-assigned this on Jun 26, 2015
@ShikiSuen
Author

Reference:
https://zh.wikipedia.org/zh-hant/%E5%A4%A7%E4%BA%94%E7%A2%BC

According to this reference, some glyphs are not included in Big5. Some of them are:
image
However, some of them are still in use in Taiwan today even though they are not in Big5, such as "峯" (used in the name of the Taiwanese singer-songwriter "吳青峯"), "栢" (used in the name of the Hong Kong movie star "張栢芝"), "邨", and "啓" (used in the Traditional Chinese version of C&C Red Alert 2: "天啓坦克" = "Apocalypse tank").

I am not familiar with this, since I never use fonts that support Big5 glyphs only. Thus, I cannot tell whether SHS TW (the region-specific release) supports them or not.

@tamcy

tamcy commented Jun 30, 2015

Unlike 恋, the characters listed in the table above (蟎綫綉滙栢峯頴邨着双啓) are all covered by HKSCS-2008, which should imply that they are also covered by Source Han Sans TW.

@kenlunde
Contributor

The scope of Source Han Sans TW, which is a subset of the 65,535-glyph glyph set, is Big Five + Hong Kong SCS (in terms of code points for hanzi). The best work-around is to simply use Source Han Sans TC, which includes all 65,535 glyphs, and thus has the maximum coverage of code points. (This suggestion is separate from having a glyph that is appropriate for TW.)

In terms of actually extending TW coverage, which means using glyphs that are appropriate for TW use (and thus follow MOE guidelines), the issue is of scope. Big Five is used because it represents the most common hanzi in use, and the problem that we will run into is the lack of available CIDs.

In any case, I am now working on the plan and scope for Version 2.000, and am taking all of this into consideration, though the highest priority is proper Hong Kong support.

@kenlunde changed the title from "Discussions regarding necessity of extending TW glyphs." to "Consolidation of Additional Glyphs & Characters, Mainly for TW" on Jul 20, 2015
@kenlunde
Contributor

I am consolidating Issue #118 (for adding a CN glyph for Extension E U+2C386, 𬎆, ⿰王莹) here.

@kenlunde changed the title from "Consolidation of Additional Glyphs & Characters, Mainly for TW" to "Consolidation of Additional Glyph & Character Suggestions, Mainly for TW" on Jul 20, 2015
@kenlunde
Contributor

kenlunde commented Aug 2, 2015

I am consolidating Issue #121 (for adding glyphs for U+9FD1 through U+9FE9 and U+2B7F7) here. I am adding a note on 2017-02-21 to indicate that U+2B7F7 𫟷 is the Simplified Chinese name of Element 116, which is an outlier in terms of covering all of the elements.

@extc

extc commented Aug 2, 2015

Adobe-CNS1-6 is the CMap standard for Traditional Chinese OpenType CID fonts. It is based on Big5, the extended characters from DynaLab and Monotype, GCCS, HKSCS-1999, HKSCS-2001, HKSCS-2004, and HKSCS-2008. Ken Lunde has released PDFs of CNS 11643 Planes 1 through 7 and 15 (1992 standard) at
ftp://ftp.oreilly.com/examples/cjkvinfo/AppG/

Big5 is a very old standard (1984). CNS 11643 now includes 107,171 characters. I know it is not possible to include them all, as only 86,655 characters are mapped to Unicode. But CNS 11643-1992 was a de facto standard: it was implemented as the EUC-TW encoding in UNIX terminals, and all EUC-TW characters have already been mapped to Unicode. The Adobe CMap should at least include all the characters in the CNS 11643-1992 version so as to reflect its name. The development of Source Han Sans depends on CIDs. Therefore, I think Adobe should update the Adobe-CNS1 CMap in parallel with the development of Source Han Sans.

@kenlunde
Contributor

kenlunde commented Aug 3, 2015

@extc: I read and re-read your note above, and am still at a loss as to what you are requesting, but perhaps what I wrote below may help.

Adobe-CNS1-6 is cumulative, meaning that glyphs are added incrementally, so there is a diachronic effect. Supplement 0 supported only Big Five and the ETen extensions, but the /Ordering was set to CNS1 because of Big Five's relationship to CNS 11643, and such a name opened the possibility of extending the glyph set to cover additional CNS 11643 planes. Supplement 1 added support for Hong Kong GCCS and the Hong Kong extensions from DynaComware (Dynalab) and Monotype Imaging (Monotype). Supplement 2 simply added pre-rotated versions of non–full-width glyphs that are accessible via the (now-deprecated) 'vrt2' GSUB feature. Supplements 3 through 6 were for supporting the 1999, 2001, 2004, and 2008 versions of Hong Kong SCS, respectively.

The PDFs from CJKV Information Processing (First Edition) were made by using an experimental Adobe-CNS2-0 glyph set whose purpose was to simply show all characters in CNS 11643-1992, along with Plane 15.

Although CNS 11643 is large, and has expanded beyond the 1992 version, it is not nearly as interesting as the CJK Unified Ideographs in Unicode, meaning the URO and Extensions A through E. The latter has excellent interchange, but the former has very poor interchange. CNS standards are also quite messy, and provide little or no metadata, such as dictionary mappings or other ways to verify a character's meaning or shape.

@fei0316

fei0316 commented Nov 8, 2015

Add the character 𧒽(U+274BD) #133

The character, although not in any of the supported standards, should be added. It is used in the name of a Guangzhou Metro station (𧒽岗站), in the name of a park near that station (𧒽岗公园), and as the name of a type of seafood produced in that area. The character is supported by the MingLiU_HKSCS-ExtB font, and it is also displayed properly on OS X 10.10 Yosemite and Windows 10 by default. It was proposed for inclusion in 通用规范汉字表 but was later removed. Documents, banners, and websites that need the character usually write it as 「虫雷」 or 「礌」. People have also reported problems finding the station in mobile phone apps, and maps showing the station or the park have to substitute other characters for the unsupported one. As the goal of this font is to maximize compatibility, adding this character would benefit a lot of people, considering that all Android devices running Android 5.0 or above use this font.
Reference:
https://zh.wikipedia.org/wiki/%F0%A7%92%BD%E5%B4%97%E7%AB%99
https://zh.wikipedia.org/wiki/%E9%BB%83%E6%B2%99%E8%9C%86
http://news.sina.com.cn/o/2014-07-24/142430572774.shtml
http://baike.baidu.com/view/4731307.htm
https://www.google.com/maps/place/%E7%A4%8C%E5%B2%97/@23.0442069,113.1465266,16z/data=!4m5!1m2!2m1!1z6Jmr6Zu35bKX5YWs5ZyS!3m1!1s0x0000000000000000:0x84aea54ce06ea2e9
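For readers who want to check why this character is awkward for older software, here is a minimal Python sketch (not part of the original comment) showing that U+274BD lies outside the BMP, so it needs four bytes in UTF-8 and a surrogate pair in UTF-16. The percent-encoded form it prints matches the first four escaped bytes of the first Wikipedia URL above. The helper name `describe` is purely illustrative.

```python
from urllib.parse import quote

# U+274BD, written as an escape so the source stays ASCII-safe.
CHAR = "\U000274BD"


def describe(ch):
    """Summarize how a single character is encoded (illustrative helper)."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        # Anything above U+FFFF is outside the Basic Multilingual Plane.
        "outside_bmp": cp > 0xFFFF,
        # Percent-encoded UTF-8 bytes, as they appear in URLs.
        "utf8_percent": quote(ch),
        # UTF-16 code units (a surrogate pair for non-BMP characters).
        "utf16_units": ch.encode("utf-16-be").hex().upper(),
    }


if __name__ == "__main__":
    for key, value in describe(CHAR).items():
        print(f"{key}: {value}")
```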

@kenlunde
Contributor

Consider adding (KR) glyphs for U+200D7 𠃗, U+2042D 𠐭, U+224E1 𢓡, and U+23D18 𣴘. The last three are used in traditional Korean musical notation.

@hfhchan

hfhchan commented Dec 24, 2015

The HKSCS 2015 update redefines some mappings from Big5 to UCS. Would that affect character coverage, especially for the full-width symbols?

@kenlunde
Contributor

@hfhchan: With regard to Hong Kong support, we sort of have a fresh slate, because to date the project does not include any Hong Kong font resources. This effectively means that accommodating mapping changes should not be problematic.

@kenlunde
Contributor

kenlunde commented Jan 7, 2016

New CN glyphs for U+35F4 (㗴) and U+6D73 (浳), uni35F4-CN and uni6D73-CN, need to be added.

@kenlunde changed the title from "Consolidation of Additional Glyph & Character Suggestions, Mainly for TW" to "Consolidation of Additional Glyph & Character Suggestions" on Jan 13, 2016
@kenlunde
Contributor

Consider adding KR glyphs for Extension B characters 𪓟 (U+2A4DF) and 𣖄 (U+23584) per Issue #136.

This was referenced Jan 13, 2016
@kenlunde
Contributor

Per Issue #137, VN (Chữ nôm) glyphs will be supported when Extension B and beyond are supported in their entirety.

@hfhchan

hfhchan commented Mar 21, 2016

Is "𠻹" (H-9E77) supported? It doesn't show up correctly using Noto Sans TC (http://fonts.googleapis.com/earlyaccess/notosanstc.css) on hk01.com (the character falls back to SimSun-ExtB in both MSEdge and Chrome).

Edit: Nor does 䮎 (H-92D7). On the other hand, 罉 (H-9DD1) displays correctly. 𦉘 (H-9DBC) doesn't.

@kenlunde
Contributor

@hfhchan: 𠻹 (Extension B U+20EF9; CID+59693) is supported by Source Han Sans / Noto Sans CJK, and is also included in the region-specific subset OTFs for Traditional Chinese (which are the fonts that are referenced in that CSS file). I inspected one of the OTFs that is referenced by the CSS file, and it has been further subsetted, and includes only 9,876 glyphs, and only three characters outside the BMP are supported: U+210C1, U+24A12, and U+25683. This is therefore a question to pose to Google.

@kenlunde
Contributor

罉 (URO U+7F49; CID+32230) is among the 9,876 glyphs in the OTFs that are referenced by that CSS file. 䮎 (Extension A U+4B8E; CID+9231) and 𦉘 (Extension B U+26258; CID+60806), on the other hand, are not. The glyphs for all three characters are in the official region-specific subset OTFs for Traditional Chinese. I recommend that you ask Google here.
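The block labels used in the two replies above (URO, Extension A, Extension B) follow directly from the code point ranges. A minimal Python sketch, assuming the standard Unicode block boundaries for these three blocks; the names `CJK_BLOCKS`, `classify`, and `is_bmp` are illustrative, and this says nothing about what any particular font file actually contains.

```python
# Block ranges for the three CJK ideograph blocks named in the discussion.
CJK_BLOCKS = [
    ("URO", 0x4E00, 0x9FFF),            # CJK Unified Ideographs
    ("Extension A", 0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    ("Extension B", 0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def classify(code_point):
    """Return the block label for a code point, or "other" if none matches."""
    for name, first, last in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return "other"


def is_bmp(code_point):
    """True when the code point is in the Basic Multilingual Plane."""
    return code_point <= 0xFFFF


if __name__ == "__main__":
    # The four characters asked about in the thread.
    for cp in (0x20EF9, 0x4B8E, 0x7F49, 0x26258):
        print(f"U+{cp:04X}: {classify(cp)}, BMP={is_bmp(cp)}")
```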

@kenlunde
Contributor

kenlunde commented Nov 12, 2016

@acuteaccent: These have been on my Version 2.000 list for some time, and as Frank mentioned, that list specifies that U+1F10B and U+1F10C will be handled as double mappings. Also, U+312E, U+312F, U+9FD1 through U+9FEA, and U+1F12F are on the same list.

@jungshik

jungshik commented Feb 7, 2017

from: notofonts/noto-cjk#80

I compared the character repertoire of Noto Sans CJK 1.004 against the list of characters allowed in the South Korean family registry and found that 47 characters are missing.

The list is
kr_names_missing_in_noto_sans.txt

The 1st column is Korean reading in Hangul. The second column is a Unicode code point. The 3rd is a character.

@kenlunde
Contributor

kenlunde commented Feb 7, 2017

@jungshik: Thank you. I count 48 characters in your list, not 47, but U+23343 𣍃 appears twice, making it actually 47.

@jungshik

jungshik commented Feb 7, 2017

@kenlunde Yes, that's why I said there are 47 characters :-) (I should have deleted the 2nd line with U+23343 before uploading).

@acuteaccent

@kenlunde @jungshik Well, in fact, there are indeed 48 missing, as there is an unencoded ⿰氵恩 (은). If Source Han Sans is targeting all the South Korean personal name hanja, one glyph needs to be reserved for ⿰氵恩.

Also, I think this (notofonts/noto-cjk#80 (comment)) is a very good idea, as no one actually uses/needs halfwidth hangul jamo. To begin with, I wonder why they are encoded in Unicode.

@acuteaccent

(This is in regard to #115 (comment))

Oh, the suggestion about U+02EA and U+02EB was already made before (notofonts/noto-cjk#56). As I usually don't check the Noto Sans CJK side, I was not aware of it until now.
FYI, I learned about those two characters from here: http://www.unicode.org/versions/Unicode9.0.0/ch18.pdf#page=27

@acuteaccent

BTW, if you are running out of glyphs, you can get rid of Œ, œ, and ƒ, as they are not used in CJKV languages (including common romanization systems).
(If Œ and œ are included to cover French, then Ÿ also needs to be included.)

@justinrleung

œ might be used in IPA and its derivative romanizations, like S. L. Wong (phonetic symbols). It might be useful to keep it for people who need to use IPA (e.g. when dealing with a Chinese dialect that does not have a romanization system).

@acuteaccent

acuteaccent commented Apr 4, 2017

Well, I don't think the IPA is the reason for the inclusion of œ though. Source Han Sans does not cover most letters used in the IPA and its derivative romanizations (ɐ, ɛ, ɔ, ŋ, etc.) anyway.

@acuteaccent

U+2780 ➀ to U+2789 ➉ and U+278A ➊ to U+2793 ➓ can be covered by using the glyphs at U+2460 ① to U+2469 ⑩ and the ones at U+2776 ❶ to U+277F ❿ respectively, as Source Han Sans is a sans-serif font. As this can simply be done by inserting additional code point mappings to existing glyphs, no new glyphs are needed.
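The double mapping described above can be written down as a small table. A minimal Python sketch, assuming only the code point ranges quoted in the comment; the function name is illustrative, and how such a table would be fed into the actual font build is not covered here.

```python
def circled_digit_double_mappings():
    """Map the sans-serif circled digits onto the existing circled digits.

    U+2780..U+2789 reuse the glyphs of U+2460..U+2469, and
    U+278A..U+2793 reuse the glyphs of U+2776..U+277F.
    """
    mapping = {}
    for offset in range(10):
        mapping[0x2780 + offset] = 0x2460 + offset  # circled 1 to 10
        mapping[0x278A + offset] = 0x2776 + offset  # negative circled 1 to 10
    return mapping


if __name__ == "__main__":
    for source, target in sorted(circled_digit_double_mappings().items()):
        print(f"U+{source:04X} {chr(source)} -> U+{target:04X} {chr(target)}")
```

Each entry adds a code point that points at an existing glyph, so the glyph count stays the same.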

@jimmymasaru

Well, œ and the other Latin letters are probably included in SHS because they are part of Adobe-Japan1-6.

@acuteaccent

acuteaccent commented Apr 6, 2017

Another suggestion if you are running out of glyphs:
You can remove the glyphs for Cyrillic letters, as Cyrillic letters are not used in CJK texts.
(Greek letters are needed because they are used in mathematics and science. But Cyrillic letters are not used anywhere in CJK texts.)

@jungshik

jungshik commented Apr 7, 2017

@acuteaccent There are a lot of characters I personally want to drop (not just Cyrillic but also an incomplete set of box drawing, various symbol characters, Latin outside ASCII, Korean Half-width Jamo, etc) to make room for more/better CJK coverage.

@jungshik

jungshik commented Apr 7, 2017

Also, I think this (notofonts/noto-cjk#80 (comment)) is a very good idea, as no one actually uses/needs halfwidth hangul jamo. To begin with, I wonder why they are encoded in Unicode.

My guess is that they're encoded in Unicode because they were encoded in a pre-KS C 5601-1987 (pre-KS X 1001) standard. Nonetheless, they're completely useless, and nobody would notice if they were gone. If we want to keep the character coverage, we can just map them to the corresponding glyphs for U+313x (Hangul Compatibility Jamo).
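One way to derive such a mapping, sketched in Python under the assumption that the Unicode Character Database's `<narrow>` compatibility decompositions are an acceptable source for it: each halfwidth hangul letter in U+FFA0..U+FFDC decomposes to a single Hangul Compatibility Jamo code point. The function name is illustrative. Note that NFKC normalization is deliberately not used, because it would go one step further and turn the compatibility jamo into conjoining jamo.

```python
import unicodedata


def halfwidth_jamo_double_mappings():
    """Map halfwidth hangul letters to Hangul Compatibility Jamo code points."""
    mapping = {}
    # U+FFA0..U+FFDC holds the halfwidth hangul letters (with unassigned gaps).
    for cp in range(0xFFA0, 0xFFDD):
        decomposition = unicodedata.decomposition(chr(cp))
        # Assigned letters look like "<narrow> 3131"; gaps give an empty string.
        if decomposition.startswith("<narrow> "):
            mapping[cp] = int(decomposition.split()[1], 16)
    return mapping


if __name__ == "__main__":
    for source, target in sorted(halfwidth_jamo_double_mappings().items()):
        print(f"U+{source:04X} -> U+{target:04X} {chr(target)}")
```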

Well, in fact, there are indeed 48 missing, as there is unencoded ⿰氵恩 (은). If Source Han Sans is targeting all the South Korean personal name hanja, one glyph needs to be reserved for ⿰氵恩.

You're right, @acuteaccent

@jungshik

jungshik commented Apr 7, 2017

As for the South Korean Hanja list for names, see also http://www.unicode.org/L2/L2017/17084-korean-name-var.pdf (Jaemin Chung's proposal).

@acuteaccent

acuteaccent commented Apr 12, 2017

For Latin letters, I think covering ISO/IEC 8859-1 (or Windows-1252) and the characters used in Hanyu Pinyin and McCune-Reischauer (as McCune-Reischauer is used by libraries around the world) is good enough.
Source Han Sans and Serif don't need to cover Vietnamese Latin letters, as modern Vietnamese is far removed from CJK characters. Vietnamese can be (and should be) covered by Latin fonts, not by CJK fonts.

I agree with removing glyphs for box drawing characters (retracted). I also agree with removing the glyphs for halfwidth hangul jamo (and mapping those code points to the Hangul Compatibility Jamo glyphs instead).

@rschiang

Box drawing characters are still widely used in Traditional Chinese contexts, primarily on BBSes (e.g. PTT) and in plain-text documents with tabular content; removing these characters would break such preformatted tables, rendering them unreadable.

If glyph count really matters, I would suggest fulfilling the character set only on TC / half-width variant, or extracting a fallback font compatible with SHS metrics.

@acuteaccent

Oh, okay. Then I take back what I said about box drawing characters.

@acuteaccent

Come to think of it, since box drawing characters are in the two-byte range of GB 18030, the glyphs for them will not be removed (as Ken wants to completely cover the mandatory portion of GB 18030).

@kenlunde
Contributor

@acuteaccent wrote:

as Ken wants to completely cover the mandatory portion of GB 18030

Precisely.

Also note that glyphs will not be removed on a whim. Everything discussed in this issue will be considered, but there are several factors that will play into the actual decisions.

@kenlunde changed the title from "Consolidation of Additional Glyph & Character Suggestions" to "Consolidation of Additional Glyph & Character Suggestions (TO CLOSE)" on May 26, 2017
@kenlunde changed the title from "Consolidation of Additional Glyph & Character Suggestions (TO CLOSE)" to "Consolidation of Additional Glyph & Character Suggestions (See Issue #180)" on May 26, 2017
@kenlunde
Contributor

Consolidated with Issue #180.

@adobe-fonts locked this as resolved and limited conversation to collaborators on Nov 20, 2018