[subset] add an option to exclude glyphs specific to 'lang/script' #244

jungshik · 2015-04-20T19:43:54Z

I was using subset.py to subset NotoSansCJKkr and NotoSansKR.

The former is the full-repertoire Noto Sans CJK, with the default glyphs set to the Korean variants and the non-Korean glyph variants accessible via locl. It's about 16MB.

The latter is a Korean-specific subset (in terms of both character repertoire and glyph repertoire) and does not have the non-Korean glyph variants for CJK ideographs. Its size is about 5MB.

I specified the exact same subset (of Unicode code points) for each of the two original fonts above.
The results differ mainly because the subset made from the first font keeps all the non-Korean variant glyphs for CJK ideographs, while the subset made from the second does not.

I propose adding an option either to exclude glyphs specific to given languages/scripts (negative list) or to include only the glyphs specific to the specified lang-systems plus the default (positive list).
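To make the two variants of the proposal concrete, here is a minimal sketch on a toy model. It does not use the fontTools API; the `LOCL_LOOKUPS` table, the glyph names, and the `closure` helper are all hypothetical and only illustrate how a positive or negative list could filter which variant glyphs get pulled in.

```python
# Toy model (hypothetical data, NOT the fontTools API): each 'locl' lookup is
# keyed by the (script, language-system) pair that activates it and maps a
# default glyph to its language-specific variant.
LOCL_LOOKUPS = {
    ("hani", "JAN "): {"cid01000": "cid02000"},
    ("hani", "ZHS "): {"cid01000": "cid03000"},
    ("hani", "ZHT "): {"cid01000": "cid04000"},
}


def closure(default_glyphs, keep_langsys=None, drop_langsys=None):
    """Return the set of glyphs to retain.

    keep_langsys: positive list; only these lang-systems contribute variants.
    drop_langsys: negative list; these lang-systems contribute nothing.
    With neither given, every reachable variant is kept.
    """
    glyphs = set(default_glyphs)
    for langsys, substitutions in LOCL_LOOKUPS.items():
        if keep_langsys is not None and langsys not in keep_langsys:
            continue
        if drop_langsys is not None and langsys in drop_langsys:
            continue
        for source, variant in substitutions.items():
            if source in default_glyphs:
                glyphs.add(variant)
    return glyphs
```

With an empty positive list, `closure({"cid01000"}, keep_langsys=set())` keeps only the default glyph, which is roughly the result I would expect when subsetting for Korean only.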

jungshik · 2015-04-20T20:07:17Z

For the record, the two font files I tested are NotoSansCJKkr-Regular.otf and NotoSansKR-Regular.otf (version 1.002).

Below is the 'ttx -l' output of the two subsetted results: first the one made from the 1st font, then the one made from the 2nd font.

    tag     checksum   length   offset
    ----  ----------  -------  -------
    CFF   0x8553A6CE  4565367   191828
    GPOS  0x571D490A    35126  4757196
    GSUB  0x12897EC7    46364  4792324
    OS/2  0x9F5317A0       96   100072
    VORG  0x16DD0EEB      644  4838688
    cmap  0x3DF965D3    89211   102584
    head  0x09307D95       54      244
    hhea  0x0AF96B99       36   100036
    hmtx  0x9D31D292    99734      300
    maxp  0x61BC5000        6      236
    name  0x1E7E5D7C     2416   100168
    post  0xFF860032       32   191796
    vhea  0x097E782E       36  4939068
    vmtx  0x5172020D    99734  4839332

    tag     checksum   length   offset
    ----  ----------  -------  -------
    CFF   0xF0EF99D3  2983018   155892
    GPOS  0x18CB7FDC    35052  3138912
    GSUB  0x0E58F4B7    17260  3173964
    OS/2  0x9F53177C       96    77708
    VORG  0xF7DF0EEB      644  3191224
    cmap  0x5B59684C    75665    80192
    head  0x09307D71       54      244
    hhea  0x0AF955C3       36    77672
    hmtx  0x4C1DFA5B    77372      300
    maxp  0x4BE55000        6      236
    name  0x1C0B5ADF     2388    77804
    post  0xFF860032       32   155860
    vhea  0x097E6258       36  3269240
    vmtx  0x06D3DD14    77372  3191868

The differences in CFF and GSUB can be explained by the former including all the variant (non-default) glyphs for CJK ideographs while the latter does not. Resolving this issue would take care of them.
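For anyone who wants the per-table differences as numbers, here is a small sketch that parses a few rows copied from the two listings above and subtracts the lengths. The helper names `table_lengths` and `size_deltas` are made up for this example.

```python
# Rows copied from the two 'ttx -l' listings above (tag, checksum, length, offset).
FIRST = """\
CFF   0x8553A6CE  4565367   191828
GSUB  0x12897EC7    46364  4792324
cmap  0x3DF965D3    89211   102584
hmtx  0x9D31D292    99734      300
"""

SECOND = """\
CFF   0xF0EF99D3  2983018   155892
GSUB  0x0E58F4B7    17260  3173964
cmap  0x5B59684C    75665    80192
hmtx  0x4C1DFA5B    77372      300
"""


def table_lengths(listing):
    """Map each table tag to its length in bytes."""
    lengths = {}
    for line in listing.splitlines():
        tag, _checksum, length, _offset = line.split()
        lengths[tag] = int(length)
    return lengths


def size_deltas(first, second):
    """Length of each table in `first` minus its length in `second`."""
    a, b = table_lengths(first), table_lengths(second)
    return {tag: a[tag] - b[tag] for tag in a if tag in b}


for tag, delta in size_deltas(FIRST, SECOND).items():
    print(f"{tag:<5}{delta:>10}")
```

CFF accounts for by far the largest share of the gap (1582349 bytes), followed by GSUB (29104 bytes), hmtx (22362 bytes) and cmap (13546 bytes).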

The difference in the size of cmap is a bit puzzling (89kB vs 75kB). When I dumped the cmap table from both subsetted outputs and diff'd them, the only difference was the one shown below. There's no character repertoire difference (as expected, because I used the same character list), but 'length' and 'nGroups' differ in the two format 12 cmap subtables in the dump. I guess this should be a separate issue (cmap optimization). I'll file a new issue on this.

    @@ -17475,7 +17475,7 @@
          <map code="0xffe5" name="cid59206"/><!-- FULLWIDTH YEN SIGN -->
          <map code="0xffe6" name="cid59207"/><!-- FULLWIDTH WON SIGN -->
         </cmap_format_4>
    -    <cmap_format_12 platformID="0" platEncID="4" format="12" reserved="0" length="56512" language="0" nGroups="4708">
    +    <cmap_format_12 platformID="0" platEncID="4" format="12" reserved="0" length="46024" language="0" nGroups="3834">
          <map code="0x0" name="cid00001"/><!-- ???? -->
          <map code="0x20" name="cid00001"/><!-- SPACE -->
          <map code="0x21" name="cid00002"/><!-- EXCLAMATION MARK -->
    @@ -52689,7 +52689,7 @@
          <map code="0xffe5" name="cid59206"/><!-- FULLWIDTH YEN SIGN -->
          <map code="0xffe6" name="cid59207"/><!-- FULLWIDTH WON SIGN -->
         </cmap_format_4>
    -    <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="56512" language="0" nGroups="4708">
    +    <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="46024" language="0" nGroups="3834">
          <map code="0x0" name="cid00001"/><!-- ???? -->
          <map code="0x20" name="cid00001"/><!-- SPACE -->
          <map code="0x21" name="cid00002"/><!-- EXCLAMATION MARK -->
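The 'length' values in the diff are consistent with the cmap format 12 layout in the OpenType specification: a 16-byte header plus 12 bytes per sequential map group. The sketch below checks that arithmetic against the 'nGroups' values from the dump. The `count_groups` helper and its toy glyph IDs are assumptions added for illustration; they show how the same code points can need more groups when their glyph IDs are less contiguous, which may or may not be what is happening here.

```python
HEADER_BYTES = 16  # format, reserved (uint16 each); length, language, numGroups (uint32 each)
GROUP_BYTES = 12   # startCharCode, endCharCode, startGlyphID (uint32 each)


def format12_length(n_groups):
    """Byte length of a cmap format 12 subtable with n_groups groups."""
    return HEADER_BYTES + GROUP_BYTES * n_groups


def count_groups(mapping):
    """Count sequential map groups for a {code point: glyph ID} mapping.

    A new group starts whenever either the code point or the glyph ID
    fails to continue the previous run by exactly one.
    """
    groups = 0
    prev_code = prev_gid = None
    for code in sorted(mapping):
        gid = mapping[code]
        if prev_code is None or code != prev_code + 1 or gid != prev_gid + 1:
            groups += 1
        prev_code, prev_gid = code, gid
    return groups


print(format12_length(4708))  # subset made from the 1st font
print(format12_length(3834))  # subset made from the 2nd font

# Toy glyph IDs (hypothetical): same three code points, different contiguity.
print(count_groups({0x41: 10, 0x42: 11, 0x43: 12}))
print(count_groups({0x41: 10, 0x42: 20, 0x43: 21}))
```

The first two calls reproduce the 56512 and 46024 from the dump, so each format 12 subtable differs by 874 groups, i.e. 10488 bytes.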

behdad · 2015-04-23T03:39:45Z

So, the glyph order is exactly the same?

jungshik mentioned this issue Apr 20, 2015

[subset] cmap table is different with the same subset when starting from different original fonts #245

Open
