Generate font subsets #119

Open
vernnobile opened this issue Sep 23, 2012 · 39 comments

@vernnobile
Contributor

Feature request - When generating a font from File->Generate Fonts... it would be a helpful feature to be able to generate the font in subsets, based on fontforge's unicode encoding sets.

This could be set up / enabled at 'Generate Fonts...' and / or from the 'Font Info' tables.

So, for example, a font that contains characters from ISO 8859-1 Latin, ISO 8859-5 Cyrillic, and ISO-8859-2 Latin, should be able to generate 3 subset fonts; named for example 'Latin', 'Latin-2', and 'Cyrillic'.

@davelab6
Member

"Named": do you mean the file name? Can you provide a set of UI mocks/sketches? :)

@chemoelectric
Contributor

vernon adams notifications@github.com wrote:

Feature request - When generating a font from File->Generate Fonts... it would be a helpful feature to be able to generate the font in subsets, based on fontforge's unicode encoding sets.

This could be set up / enabled at 'Generate Fonts...' and / or from the 'Font Info' tables.

So, for example, a font that contains characters from ISO 8859-1 Latin, ISO 8859-5 Cyrillic, and ISO-8859-2 Latin, should be able to generate 3 subset fonts; named for example 'Latin', 'Latin-2', and 'Cyrillic'.

Scripting is far better for this than building in a feature.

For instance, how should the subset be determined, since encoding is
not a fundamental characteristic of a glyph? Plenty of characters may
have no encoding at all, or they may have more than one encoding. Do
you include a glyph or not, if it might take the place of an encoded
glyph after substitution? If there is more than one encoding in the
font, which one do you use? Are there some glyphs you want to include
‘just because’, such as a foundry logo? These are all decisions to be
made by the person tailoring her own script.

@dmeranda
Contributor

Just a heads-up: I've already been working on a subsetting script.
It's not ready yet, but I've been making progress and hope to add it
into FF. As Barry mentioned, it's not quite as
simple as it may at first seem, but it is possible. If others
have already been working on this, or have additional needs or
requirements, extra input would be welcome. Or of course you can make
your own rather than waiting on me.

Deron Meranda
http://deron.meranda.us/

@vernnobile
Contributor Author

My idea was that the subsets would be determined by the encoding sets in fontforge's 'Encoding' menu, and therefore could include any user made sets too.
Any chars missing from the sfd font file, but present in the encoding set, would create a blank but correctly encoded char slot in the generated subset font.
Dave - 'named', yes, I mean the font name, e.g. 'Font-Regular.latin', 'Font-Regular-Cyrillic', 'Font-Regular-Lao', etc.
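
As a rough illustration of the idea above, the codepoints covered by a legacy 8-bit encoding can be derived from Python's own codec tables. Using the codecs in place of FontForge's Encoding menu is an assumption made for this sketch, as are the helper names and the mapping from subset name to codec; the `Font-Regular-<Subset>` pattern follows the naming examples above. The sketch only computes which Unicode codepoints each subset would contain and does not touch a font.

```python
# Sketch: derive per-subset codepoint sets from Python's codec tables.
# Helper names and the subset-to-codec mapping are illustrative assumptions.

SUBSET_CODECS = {
    "Latin": "iso8859_1",
    "Latin-2": "iso8859_2",
    "Cyrillic": "iso8859_5",
}

def codepoints_for(codec):
    """Return the set of Unicode codepoints reachable from one 8-bit codec."""
    codepoints = set()
    for byte in range(0x20, 0x100):           # skip the C0 control codes
        try:
            char = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue                           # byte is unassigned in this codec
        codepoints.add(ord(char))
    return codepoints

def subset_plan(base_name, subset_codecs=SUBSET_CODECS):
    """Map each output font name to the codepoints that subset should keep."""
    return {
        "%s-%s" % (base_name, subset): codepoints_for(codec)
        for subset, codec in subset_codecs.items()
    }

if __name__ == "__main__":
    for name, cps in sorted(subset_plan("Font-Regular").items()):
        print(name, len(cps), "codepoints")
```

Each codepoint set could then drive whichever subsetting tool is used; user-made encoding sets would need their own table in place of a codec.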

@dmeranda
Contributor

If you need something right away without waiting on me, you may want to
look at the Google font subsetting tool (which by the way uses
FontForge's python scripting). It's far from perfect, but it's a
place to start.

http://code.google.com/p/googlefontdirectory/source/browse/tools/subset/

Deron

Deron Meranda
http://deron.meranda.us/

@vernnobile
Contributor Author

Deron. Google's subsetting tool is what i am having to use now, and so... it is what brought me here :)

@dmeranda
Contributor

There's also Trevor King's font-reduce
http://blog.tremily.us/posts/font-reduce/ but it also has lots of
problems. It only makes an ASCII subset and also has some other bugs.
But it's yet another example of a subsetting script that uses
FontForge's python interface.

Deron Meranda
http://deron.meranda.us/

@davelab6
Member

For instance, how should the subset be determined, since encoding is not a fundamental characteristic of a glyph?

I'm not 100% clear on how FF handled encodings; I believe what Vern means is really handled by 'NameLists,' which have both unicode values and their ascii names and are easily user definable; whereas in FF the concept of an encoding means a well-known platform encoding, which is kind of obsolete today because of Unicode.

I think this ought to be better surfaced to users; and namelists ought to be stored in /usr/local/share/fontforge/namelists/ and then ~/.FontForge/namelists/

Plenty of characters may have no encoding at all

Right, but they always have a name.

or they may have more than one encoding

I'm not quite sure how this works :-)

Do you include a glyph or not, if it might take the place of an encoded
glyph after substitution?

I don't understand this

If there is more than one encoding in the font, which one do you use?

I thought a font in FF can only have one encoding, but FF can re-encode easily.

Are there some glyphs you want to include ‘just because’, such as a foundry logo?

Right, which won't be part of any pre-defined encoding, but can be part of a namelist.

These are all decisions to be made by the person tailoring her own script.

I think what Vern wants is widgets in the Generate dialog.

@davelab6
Member

It would be great for a subset feature in FF to subset OT features correctly (as mentioned in https://groups.google.com/d/msg/googlefontdirectory-discuss/XOPxZm_nyE8/G1oFG_QrogYJ)

@chemoelectric
Contributor

Dave Crossland notifications@github.com wrote:

For instance, how should the subset be determined, since encoding is not a fundamental characteristic of a glyph?

I'm not 100% clear on how FF handled encodings; I believe what Vern
means is really handled by 'NameLists,' which have both unicode
values and their ascii names and are easily user definable; whereas
in FF the concept of an encoding means a well-known platform
encoding, which is kind of obsolete today because of Unicode.

I don’t get that. The encoding or encodings are stored in the font. It
may happen to correspond to Unicode codepoints. A glyph name may
correspond to the encoding according to some convention, or it may
not. These are all factors that a script has to take into account, and
it is going to be too much for any one script to handle all
cases. There will be different scripts for different purposes.

I would take the time to get accustomed to writing such
scripts, if I had need of such features.

Do you include a glyph or not, if it might take the place of an encoded
glyph after substitution?

I don't understand this

If I have a glyph, either unencoded or with an encoding value that
otherwise would be left out, but that glyph might take the place of
another after substitution, do I include it, or do I leave it out, and
leave out its substitution rules as well, to save yet more space?
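
One way to frame the question above is as a closure over substitution rules: start from the encoded glyphs you want, then keep adding any glyph that a rule can produce from a glyph already kept. A minimal sketch in plain Python, assuming substitutions are modeled as a simple mapping from a glyph name to its possible replacements. That is a simplification (real GSUB lookups also include ligatures and contextual rules), and all names here are hypothetical.

```python
def close_over_substitutions(wanted, substitutions):
    """Return `wanted` plus every glyph reachable through substitution rules.

    `substitutions` maps a glyph name to the glyph names that may replace it.
    """
    keep = set(wanted)
    pending = list(wanted)
    while pending:
        glyph = pending.pop()
        for replacement in substitutions.get(glyph, ()):
            if replacement not in keep:
                keep.add(replacement)
                pending.append(replacement)
    return keep

def prune_rules(keep, substitutions):
    """Drop rules whose input glyph is not kept, and replacements not kept."""
    return {
        glyph: [r for r in replacements if r in keep]
        for glyph, replacements in substitutions.items()
        if glyph in keep
    }

if __name__ == "__main__":
    rules = {"a": ["a.sc"], "a.sc": ["a.sc.alt"], "zhe": ["zhe.alt"]}
    keep = close_over_substitutions({"a", "b"}, rules)
    print(sorted(keep))
    print(prune_rules(keep, rules))
```

The space-saving alternative described above corresponds to skipping the closure and calling `prune_rules` with only the wanted glyphs, which discards the alternates together with the rules that would reach them. Which policy is right is exactly the per-script decision under discussion.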

If there is more than one encoding in the font, which one do you use?

I thought a font in FF can only have one encoding, but FF can re-encode easily.

Well, suppose you have a Type 1 or other font with more than one
encoding. I’m not sure what FF does with that.

Are there some glyphs you want to include ‘just because’, such as a foundry logo?

Right, which won't be part of any pre-defined encoding, but can be part of a namelist.

These are all decisions to be made by the person tailoring her own script.

I think what Vern wants is widgets in the Generate dialog.

I know, but this is a sort of thing for which extension languages
exist. Generating fonts with serious options is such a complicated
matter that I long ago quit using the Generate dialog for anything but
testing or playing around. I have used scripts to generate my fonts
for a long time. You get a lot of flexibility. You are hamstringing
yourself if you depend on someone else’s C code, and it would clutter
an already cluttered dialog.

@vernnobile
Contributor Author

Hi Deron, i would be interested to hear more of the script you are developing; what your aim is, and what functions you had thought to include etc.
My main idea is actually a very task-specific tool; to output more than one webfont from a single .sfd file, with the character sets of these fonts being based on name lists in fontforge's Encoding menu. The idea would be that the user designs and builds the font in e.g. a single sfd, and from that single sfd the user can control and mark how the font will be generated into subsets.

@dmeranda
Contributor

Just for information, another source of potentially useful font
subsetting information is the SIL NRSI categorization of characters by
use in a region. Not quite a script nor a language.

http://scripts.sil.org/cms/scripts/page.php?item_id=FontSubsets

Also there look to be a few interesting presentations at next month's
ATypI conference in Hong Kong that promise to touch on real-world
subsetting, especially for CJK.

Deron Meranda
http://deron.meranda.us/

@dmeranda
Contributor

Do you include a glyph or not, if it might take the place of an
encoded glyph after substitution?

I don't understand this

If I have a glyph, either unencoded or with an encoding value that
otherwise would be left out, but that glyph might take the place of
another after substitution, do I include it, or do I leave it out, and
leave out its substitution rules as well, to save yet more space?

Another even simpler condition that can occur, and which the Google
font subset script fails on, is when a glyph is multiply-encoded. This
happens when a glyph object is reused in multiple encoding slots.

The most simple approach to font subsetting (ignoring tables, hints,
etc.) is to simply do set operations to either include or exclude
certain characters/glyphs. But these set operations (in FF at least)
do not do what you might expect in the presence of these multiply-
encoded glyphs.

For example a font might reuse the same glyph for U+0020 (space)
and U+00A0 (non-break space). Now if you do a simple subset
by something like the following pseudo code:

  font.selection.select(('ranges', 'unicode'), 0x20, 0x7E)
  font.selection.invert()
  font.cut()

your new subsetted font will NOT have a glyph for U+0020, because the
shared glyph object is cut along with U+00A0, which falls inside the
inverted selection.
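
The effect can be reproduced without FontForge in a toy model. The sketch below is not FontForge's real data structure: it simply assumes an encoding is a mapping from codepoints ("slots") to glyph objects, that two slots may share one object, and that cutting acts on the glyph object behind each selected slot. The function names are hypothetical.

```python
# Toy model (not FontForge internals): slots map to possibly shared glyphs.

def naive_subset(encoding, first, last):
    """Select [first, last], invert, then cut the glyphs behind the selection."""
    selected = {cp for cp in encoding if first <= cp <= last}
    inverted = set(encoding) - selected
    cut_glyphs = {encoding[cp] for cp in inverted}    # cut hits glyph objects
    return {cp: g for cp, g in encoding.items() if g not in cut_glyphs}

def slot_aware_subset(encoding, first, last):
    """Keep every slot in [first, last], whatever else shares its glyph."""
    return {cp: g for cp, g in encoding.items() if first <= cp <= last}

if __name__ == "__main__":
    encoding = {0x20: "space", 0x41: "A", 0xA0: "space", 0xE9: "eacute"}
    print(sorted(naive_subset(encoding, 0x20, 0x7E)))
    print(sorted(slot_aware_subset(encoding, 0x20, 0x7E)))
```

In the naive version the slot for U+0020 disappears because its glyph is also reachable from U+00A0; deciding per slot rather than per glyph object avoids that.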

Deron Meranda
http://deron.meranda.us/

@davelab6
Member

when a glyph object is reused in multiple encoding slots

I thought you might be referring to that, since I recalled you had some
issue with this kind of caper.

It seems to me that allowing a glyph to have multiple encoding values
(which is different to what I understood Vern originally meant by
'encoding' which is an attribute of a font rather than of a glyph) is more
trouble than it's worth.

@davelab6
Member

On 23 September 2012 00:27, Deron Meranda notifications@github.com wrote:

Also there looks to be a few interesting presentations at next month's
ATypI conference in Hong Kong that promise to touch on real-world
subsetting, especially for CJK.

I will attend and hope to liveblog as much as possible :)

@lemzwerg
Member

For instance, how should the subset be determined, since encoding
is not a fundamental characteristic of a glyph?

I'm not 100% clear on how FF handled encodings; I believe what Vern
means is really handled by 'NameLists,' which have both unicode
values and their ascii names and are easily user definable; whereas
in FF the concept of an encoding means a well-known platform
encoding, which is kind of obsolete today because of Unicode.

FontForge already has a mechanism to create subfonts using 'subfont definition files' (ironically also having the '.sfd' extension, but I
think I created this before George invented his format :-)

Here's the man page which defines the format (see page 7):

http://www.tug.org/svn/texlive/trunk/Master/texmf/doc/man/man1/ttf2tfm.man1.pdf?revision=26689&view=co

and here you can find Unicode.sfd:

http://www.tug.org/svn/texlive/trunk/Master/texmf-dist/fonts/sfd/ttf2pk/Unicode.sfd?revision=11785&view=markup

Maybe this mechanism can be extended to cover more than 256 characters
per font.
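
To make the block-wise split concrete, here is a small sketch that groups codepoints into fixed-size subfonts and prints one line per block. The line layout (a hex suffix followed by a `start_end` range) is a simplified assumption modeled on Unicode.sfd; the man page linked above remains the authority on the actual syntax, and the function names are hypothetical. The `block_size` parameter stands in for the suggested extension beyond 256 characters per font.

```python
def subfont_lines(codepoints, block_size=256):
    """Yield one 'suffix start_end' line per block that contains a codepoint."""
    blocks = sorted({cp // block_size for cp in codepoints})
    for block in blocks:
        start = block * block_size
        end = start + block_size - 1
        yield "%02x 0x%04X_0x%04X" % (block, start, end)

def split_into_subfonts(codepoints, block_size=256):
    """Group codepoints by block: {suffix: sorted codepoints in that block}."""
    groups = {}
    for cp in sorted(codepoints):
        groups.setdefault("%02x" % (cp // block_size), []).append(cp)
    return groups

if __name__ == "__main__":
    used = {0x41, 0x0416, 0x4E2D}
    for line in subfont_lines(used):
        print(line)
    print(split_into_subfonts(used))
```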

Werner

@dmeranda
Contributor

when a glyph object is reused in multiple encoding slots

I thought you might be referring to that, since I recalled you had some
issue with this kind of caper.

Yes, that is in fact the one "itch" which brought me into the whole
fontforge source code, and I've kind of gotten distracted by other
issues and possibilities.

It seems to me that allowing a glyph to have multiple encoding values
(which is different to what I understood Vern originally meant by
'encoding' which is an attribute of a font rather than of a glyph) is more
trouble than it's worth.

Just curious, how does one even get a font into that state where
a glyph is reused? Also is that a property of FontForge itself, the
sfd file format, or is it more intrinsic to TTF or such?

Regarding the fonts I was using that had these, I did report them
upstream and supposedly the next version will use glyph references
instead. But that still doesn't mean this same "problem" can't
occur elsewhere in other fonts.

Deron Meranda
http://deron.meranda.us/

@davelab6
Member

On 23 September 2012 01:33, Deron Meranda notifications@github.com wrote:

But that still doesn't mean this same "problem" can't
occur elsewhere in other fonts.

I think it should be treated as a bug.

Can you say what the fonts were, and do you know what tools the developers
used to create them? :)

@khaledhosny
Contributor

On Sun, Sep 23, 2012 at 01:33:08AM -0700, Deron Meranda wrote:

It seems to me that allowing a glyph to have multiple encoding values
(which is different to what I understood Vern originally meant by
'encoding' which is an attribute of a font rather than of a glyph) is more
trouble than it's worth.

Just curious, how does one even get a font into that state where
a glyph is reused? Also is that a property of FontForge itself, the
sfd file format, or is it more intrinsic to TTF or such?

It is handled by a special cmap subtable format, but FontForge hides the
details.

Regards,
Khaled

@khaledhosny
Contributor

On Sun, Sep 23, 2012 at 01:49:53AM -0700, Dave Crossland wrote:

On 23 September 2012 01:33, Deron Meranda notifications@github.com wrote:

But that still doesn't mean this same "problem" can't
occur elsewhere in other fonts.

I think it should be treated as a bug.

Can you say what the fonts were, and do you know what tools the developers
used to create them? :)

It is a feature supported by the cmap table; just because you don’t know how
to handle it does not make it a bug. Check Apple’s Last Resort font for a
widely used font utilizing this feature heavily.

Actually, I think most (if not all) MS fonts use it.
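
For anyone wanting to check whether a given font is affected, a codepoint-to-glyph mapping can be inverted to list the glyphs that sit behind more than one codepoint. This is a sketch under the assumption that the cmap has already been read into a plain dict of codepoint to glyph name (fontTools' `getBestCmap()` returns a dict of that shape); the function name is hypothetical.

```python
from collections import defaultdict

def multiply_encoded(cmap):
    """Return {glyph name: sorted codepoints} for glyphs with 2+ codepoints."""
    by_glyph = defaultdict(list)
    for codepoint, glyph in cmap.items():
        by_glyph[glyph].append(codepoint)
    return {g: sorted(cps) for g, cps in by_glyph.items() if len(cps) > 1}

if __name__ == "__main__":
    cmap = {0x20: "space", 0xA0: "space", 0x41: "A"}
    for glyph, cps in multiply_encoded(cmap).items():
        print(glyph, ["U+%04X" % cp for cp in cps])
```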

@dmeranda
Contributor

Can you say what the fonts were, and do you know what tools the
developers
used to create them? :)

The particular fonts I ran across were the linux Liberation font family.

But as Khaled said, if this is permitted by the underlying font
formats, then it is not per se a bug.

However perhaps we want to revisit what the selection.invert() function
does, because the results are definitely non-intuitive. Part of the
surprise is, I think, that selections are sets of glyph objects instead
of sets of encoding slots.

Deron Meranda
http://deron.meranda.us/

@wking

wking commented Jan 9, 2013

@dmeranda commented:

There's also Trevor King's font-reduce but it also has lots of problems. It only makes an ASCII subset and also has some other bugs.

It should be able to make non-ASCII subsets (with the -s and -r options). Patches and bug reports against my current script are welcome. I can't fix bugs that I don't know about ;).

@davelab6
Member

davelab6 commented Jul 1, 2014

This is best done with pyftsubset from https://github.com/behdad/fonttools/ - it would be good to add it as a Python script that we bundle with FF rather than a feature in the C code.

@HinTak

HinTak commented Jun 9, 2017

Yes, same problem as #3085.

@HinTak

HinTak commented Jun 9, 2017

It is not very practical to write about 400 subsets for CJK fonts with pyftsubset, and one also wants to re-encode on the way...

@HinTak

HinTak commented Jun 9, 2017

It would be nice if MultipleEncodingsToReferences() does the right thing - it currently does not.

@HinTak

HinTak commented Jun 9, 2017

I have recently encountered, and worked around, a problem with re-encoding and subsetting Source CJK for TeXLive. CJK/TeXLive wants it as 400+ subfonts, each of 256 glyphs :-). #3080

@HinTak

HinTak commented Jun 9, 2017

My freetype-py script to fix fontforge's encoding problem is up at

https://github.com/HinTak/freetype-py/blob/fontval-diag/examples/subfonts-script-generate.py

This is for generating about 400 subfonts from Source CJK / Noto CJK for TeXLive's use, and it also fixes the encoding problems on the way.

@HinTak

HinTak commented Jun 9, 2017

@davelab6: I realize you wrote this some years ago, before Adobe Source CJK / Noto CJK, but they have about 400 glyphs which are multiply-encoded, and hence fontforge cannot cope :-).

It seems to me that allowing a glyph to have multiple encoding values
(which is different to what I understood Vern originally meant by
'encoding' which is an attribute of a font rather than of a glyph) is more
trouble than it's worth.

@khaledhosny
Contributor

khaledhosny commented Jun 12, 2017

pyftsubset serves a different purpose; its main use is cleverly subsetting layout tables. What you want is a dumb subsetter that just subsets the glyph tables. AFDKO has one, sfntly has another (but it does not handle fonts with a CFF table), and Cairo has one too (but tied to its API and not publicly exposed), and I’m pretty sure there are others.

@HinTak

HinTak commented Jun 12, 2017 via email

@gwern

gwern commented Mar 13, 2019

I'm glad to see that there's an active discussion on this topic - I actually came here to file a feature request for a font subsetting tool but fortunately I checked whether there was an open issue first. :)

The discussion looks like it's gotten a bit bogged down in the 7 years since this issue was opened, so let me summarize it from my perspective: it would be useful, to web developers in particular, to have a script or other built-in feature for generating a font file or a set of font files which are subsets of an existing font, such as creating a new font file with only the ASCII range or creating 26 font files with one capital letter each.

This is useful because download sizes and latency are a major barrier to the use of non-system fonts on performant web pages; one or two unusual font choices could easily weigh more than the HTML page itself, and block or disrupt page rendering. In the extreme case, initials for drop cap use, one might be downloading a 700KB file to use on a single letter. Drop caps are a snazzy feature, but that comes at too great a price.

One way to largely eliminate that cost is to create subset font files, which contain only the specific letters necessary. In the extreme case of drop caps, you can break out 26 font files from the original initials font, and using CSS, load one for each capital letter; this means only 4-7 KB must be downloaded & rendered (rather than anywhere up to 700 KB or worse), and it is a near-zero performance hit. For more details, see our feature request upstreaming the subset fonts for yinit.
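
The comment does not say which CSS mechanism was used, so the following is only one possible way to wire up the per-letter files: a short Python sketch that emits one `@font-face` rule per letter, each restricted with `unicode-range` so that a browser fetches a file only when that letter is actually rendered in that family. The function name, the `.woff` extension, and file names of the form `<basename>-<letter>.woff` are assumptions for illustration.

```python
def drop_cap_font_faces(family, basename, letters="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Return CSS text with one @font-face rule per letter subset file."""
    rules = []
    for letter in letters:
        rules.append(
            "@font-face {\n"
            "  font-family: '%s';\n"
            "  src: url('%s-%s.woff') format('woff');\n"
            "  unicode-range: U+%04X;\n"
            "}" % (family, basename, letter, ord(letter))
        )
    return "\n".join(rules)

if __name__ == "__main__":
    print(drop_cap_font_faces("Cheshire Initials", "Cheshire-Initials", "AB"))
```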

Redesigning my site recently to use 5 drop caps as topics/themes, we recently did this for them to eliminate the performance impact, and it works very nicely. We did this manually in FontForge for the first 3, exporting one letter at a time, but that was ridiculous.

For the final 2, I hacked up webfonts.py (run like fontforge -lang=py -script genwebfonts.py Cheshire-Initials.ttf webfont) to dump A-Z to separate files:

#!/usr/bin/env fontforge -lang=py -script
# -*- coding: utf-8 -*-
# Writes one subset font per capital letter (A-Z) for each input font.

import fontforge
import os
import sys

def _select_codepoint(codepoint, selection, more=True):
    # Add to ('more') or remove from ('less') the current selection.
    # Accepts an int, a string of characters, or a (first, last) range tuple.
    more_str = 'more' if more else 'less'
    if isinstance(codepoint, tuple) and len(codepoint) == 2:
        first, last = codepoint
        if isinstance(first, int) and isinstance(last, int):
            selection.select((more_str, 'ranges', 'unicode'), first, last)
        elif isinstance(first, str) and isinstance(last, str):
            selection.select((more_str, 'ranges', 'unicode'), ord(first), ord(last))
        else:
            message = 'invalid codepoint type: ' + str(codepoint)
            raise Exception(message)
    elif isinstance(codepoint, int):
        selection.select((more_str, 'unicode'), codepoint)
    elif isinstance(codepoint, str):  # was 'unicode', which only exists in Python 2
        for c in codepoint:
            selection.select((more_str, 'unicode'), ord(c))
    else:
        message = 'invalid codepoint type: ' + str(codepoint)
        raise Exception(message)

def _clear_unneeded_glyphs(font, codepoints):
    # Select everything, deselect the codepoints to keep, then clear the rest.
    font.selection.all()
    for codepoint in codepoints:
        _select_codepoint(codepoint, font.selection, more=False)
    font.clear()

def generate_webfonts(input_path, outdir, codepoints):
    if not os.path.exists(outdir):
        os.mkdir(outdir)
    if not os.path.isdir(outdir):
        message = 'last argument is not directory: ' + outdir
        raise Exception(message)
    font = fontforge.open(input_path)
    _clear_unneeded_glyphs(font, codepoints)
    filename = os.path.split(input_path)[-1]
    basename, ext = os.path.splitext(filename)
    for newext in [ext, '.woff', '.eot']:
        # The output name uses the first codepoint, which must be an int here.
        output = os.path.join(outdir, basename + "-" + chr(codepoints[0]) + newext)
        font.generate(output)
    font.close()

if __name__ == '__main__':
    # codepoints = range(65, 91)  #[(0x0000, 0x03FF), (0x2200, 0x22FF)]
    if len(sys.argv) > 2:
        for input_path in sys.argv[1:-1]:
            print(input_path)
            print(sys.argv[-1])
            # Reopen the font for every letter, since clearing is destructive.
            for codepoint in range(65, 91):
                generate_webfonts(input_path, sys.argv[-1], [codepoint])
    else:
        print('usage: ' + sys.argv[0] + ' inputs... outdir')
        sys.exit(1)

This is extremely ugly and I assume breaks on a lot of edge-cases, but it kinda mostly worked? At least for the Cheshire & Kanzlei initial fonts I used it on...
A proper subset tool would have better scriptability & control: let you specify multiple ranges, whether to dump to a single font file or multiple ones, and so on.

Apparently, there is no builtin FontForge support for this particular usecase, nothing in the documentation, nor is there any well-known standard tool for doing this? At least, I was unable to find one in a quick search and Obormot didn't know of any in his experience. I suspect this might be one reason this particular font+CSS technique is rarely used - aside from its support in Google Fonts where you can include a specific range/subset as part of the URL query parameters to optimize the font (assuming you don't mind using Google Fonts, of course...)

I see there is a link in this discussion to some Google tool, but the link seems to be broken. pyftsubset is mentioned, but I didn't run into that before, and the only documentation seems to be the source file itself, so not very discoverable.

Reading the pyftsubset help documentation in the source file, it looks like it supports everything necessary for font subsetting? I am not a font expert and don't understand the objections here about how pyftsubset "doesn't really" do font subsetting, because it sounds like it does. Even if it is missing some edge-case or exotic feature, that's better than not doing subsetting at all, I would think. (Letting the perfect be the enemy of better.)

If this is the case, perhaps this issue can be resolved by adding to the FontForge documentation a section on font subsetting: what it is, why you would want to, and explaining that pyftsubset is the 'blessed' way of doing so with perhaps an example command.
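
As a possible starting point for such an example, here is a small sketch that builds one pyftsubset command per capital letter. It only assembles and prints the commands rather than running them; the helper name, paths, and file naming are assumptions, while `--unicodes`, `--output-file`, and `--flavor` are pyftsubset options.

```python
import shlex

def pyftsubset_command(font_path, codepoints, output_path, flavor=None):
    """Build (but do not run) a pyftsubset argument list."""
    unicodes = ",".join("U+%04X" % cp for cp in sorted(codepoints))
    command = [
        "pyftsubset",
        font_path,
        "--unicodes=" + unicodes,
        "--output-file=" + output_path,
    ]
    if flavor:
        command.append("--flavor=" + flavor)   # e.g. "woff"
    return command

if __name__ == "__main__":
    for cp in range(ord("A"), ord("Z") + 1):
        out = "webfont/Cheshire-Initials-%s.woff" % chr(cp)
        cmd = pyftsubset_command("Cheshire-Initials.ttf", [cp], out, flavor="woff")
        print(" ".join(shlex.quote(part) for part in cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute it, assuming fontTools is installed and the output directory exists.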

@davelab6
Member

davelab6 commented Mar 14, 2019 via email

@faywong

faywong commented Oct 10, 2021

There's another option today: hb-subset. It only handles the font subsetting task (it can't change the PostScript name, etc.; it is not meant to be a font editing tool/library), and its performance is 150x better than pyftsubset. If subsetting performance is a key point for you, and the limited feature set is acceptable, then hb-subset may be your best option.

@HinTak

HinTak commented Oct 10, 2021

Font sub-setting is rarely done, and usually on a one-off basis, so doing it correctly is more important than doing it fast (but wrong/buggy). That said, it is always useful to have alternatives...

@pfahlstrom

Another vote of confidence for hb-subset. I came here today looking for subsetting advice, and hb-subset worked perfectly. It's part of the HarfBuzz package, which I found I already had installed in homebrew. The command line I used was:
hb-subset --output-file=fontname-subset.otf --text-file=EmbedCharacters.txt fontname.otf
I just put all the characters I wanted embedded (including relevant ligatures) into the EmbedCharacters.txt file, and hb-subset did everything I wanted, such as removing unused kerning pairs after subsetting.
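
If the character list should track what a site actually uses, the text file can be regenerated from the page text instead of being maintained by hand. A minimal sketch, assuming the page text is already available as Python strings; the helper names are hypothetical, and the `--output-file` and `--text-file` options are the ones from the command above. Ligature sequences would still have to be added to the file by hand.

```python
def characters_used(texts):
    """Return the sorted unique printable characters across all texts."""
    chars = set()
    for text in texts:
        chars.update(ch for ch in text if ch.isprintable())
    return "".join(sorted(chars))

def hb_subset_command(font_path, text_file, output_path):
    """Build (but do not run) an hb-subset argument list."""
    return [
        "hb-subset",
        "--output-file=" + output_path,
        "--text-file=" + text_file,
        font_path,
    ]

if __name__ == "__main__":
    pages = ["Drop caps are snazzy.\n", "Subsets keep them small."]
    print(repr(characters_used(pages)))
    print(hb_subset_command("fontname.otf", "EmbedCharacters.txt",
                            "fontname-subset.otf"))
```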

@gwern

gwern commented Jun 13, 2023

Font sub-setting is rarely done, and usually on a one-off basis

That may be the case, but may also have more to do with the difficulty of doing so than the need being rare.

In our case, we had to do the splitting for each dropcap, and we'd like to add more, so that's 5+ there just for the dropcaps; and we have done so repeatedly for the regular fonts too, for the more conventional usecase of dropping unused glyphs - we do so repeatedly because of course the website has kept changing over the years (as do most websites), and we occasionally need a new letter or symbol. Doing so only on a 'one-off basis' would look bad & be slow as the subset became increasingly out of date and fallbacks were used.

@ctrlcctrlv
Member

Another vote from me for hb-subset. I don't see a reason to use anything else personally.
