Detect and remove duplicated fonts #117

Closed
Toneti777 opened this Issue Apr 8, 2013 · 50 comments


In some PDF files, different fonts may point to the same font descriptor with different encodings.
Suppose the font descriptor contains two glyphs, a and b: one font may specify the encoding c->a and d->b, while another specifies d->a and c->b. PDF viewers can choose the correct glyphs to show, while providing correct text for users to select, according to the encodings and ToUnicode maps, but there is no such mechanism in HTML.

Generally we have to generate different (re-encoded) fonts so that both the rendering and the Unicode code points are correct, which increases the total file size. In some cases, though, we can be smart and avoid that:

  • When both the encodings and the ToUnicode maps are compatible (or the products of the two mappings are compatible), we can use a single font file.
  • When we can ignore one set of encodings or ToUnicode maps, we can re-encode the text of one font according to another (both fonts share the same font descriptor), so that the rendering is correct but the Unicode code points are likely to be wrong.
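The first case above can be sketched in Python as follows. This is only an illustration of the idea, not pdf2htmlEX's implementation: the function names are hypothetical, each map is assumed to be a plain dict keyed by character code, and "product of both mappings" is read here as the glyph-to-Unicode map obtained by combining a font's encoding with its ToUnicode map.

```python
def compatible(map_a, map_b):
    """True if the two mappings agree on every key they share."""
    return all(map_b[key] == value
               for key, value in map_a.items() if key in map_b)


def glyph_to_unicode(encoding, to_unicode):
    """Combine code->glyph and code->unicode into glyph->unicode.

    If two codes select the same glyph, the later code's Unicode value wins.
    """
    return {encoding[code]: to_unicode[code]
            for code in encoding if code in to_unicode}


def can_share_font_file(enc_a, uni_a, enc_b, uni_b):
    """Hypothetical check for the 'single font file' case."""
    # Directly compatible: same codes give the same glyphs and the same text.
    if compatible(enc_a, enc_b) and compatible(uni_a, uni_b):
        return True
    # Otherwise compare the products: each glyph must mean the same character.
    return compatible(glyph_to_unicode(enc_a, uni_a),
                      glyph_to_unicode(enc_b, uni_b))


# The example from the description: codes c and d, glyphs a and b.
enc_1 = {"c": "a", "d": "b"}
enc_2 = {"d": "a", "c": "b"}
uni_1 = {"c": "A", "d": "B"}
uni_2 = {"d": "A", "c": "B"}

print(compatible(enc_1, enc_2))                          # False
print(can_share_font_file(enc_1, uni_1, enc_2, uni_2))   # True
```

In the second call the encodings conflict, but each glyph still stands for the same character in both fonts, so one font file re-encoded by Unicode value could serve both.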

Another case is that multiple font descriptors are embedded in the PDF file although they are actually the same font. They can be recognized with the font comparison function in FontForge, but I think this is more of a PDF optimizer's job, because no extra space is introduced by pdf2htmlEX.

/// original report
I tried to run the program on a PDF with several pages. I have a question about the generated font files.
Why does the program generate several files for (in theory) the same font on different pages?
I opened the font files, and the font name is the same in the cases where I saw the same font type in the output. Is this expected, or am I doing something wrong?

Can I do anything to generate only one font file for the same font? For a PDF with a lot of pages this could make a big difference to optimization.

Thanks a lot...

Owner

coolwanglu commented Apr 8, 2013

Can you provide a sample PDF?

Usually (unless it is a bug) pdf2htmlEX never duplicates fonts. So the reason you got so many fonts should be that those different fonts are actually in the PDF. Here "different fonts" means different objects in the PDF, which have different IDs.

And pdf2htmlEX (so far) does not detect duplicated fonts by the shape of glyphs.

Ok, you are right. We create this PDF by merging a lot of single-page PDFs, and the merge does not detect duplicated fonts.
I'm going to change the merge process.

It would be a great improvement to detect duplicated fonts, even by font name or type.

Thanks.

@Toneti777 Toneti777 closed this Apr 9, 2013

@Toneti777 Toneti777 reopened this Apr 9, 2013

Owner

coolwanglu commented Apr 9, 2013

I found that there are font comparison functions in Fontforge. So theoretically duplicated fonts can be detected and merged. But it might be quite slow.

Please provide a sample PDF for debugging.

As this is not a critical issue right now, please do not expect that it will be fixed soon.

Even in a test with a well-merged PDF containing 20 different fonts (as shown in Adobe Acrobat Pro), pdf2htmlEX increases the number of output font files as I run the program on a larger number of pages.

An example... please download it and tell me so I can delete it...

Thanks..

Owner

coolwanglu commented Apr 9, 2013

Done. Please remove it.

Thanks!

@ghost

ghost commented May 7, 2013

For common fonts, such as Verdana, Arial, Georgia etc., is it possible to skip embedding entirely and just write CSS such as font-family: verdana, sans-serif?

Owner

coolwanglu commented May 7, 2013

For an embedded font, how do you know if it is 'common' or not?


@ghost

ghost commented May 7, 2013

Is there at least any way to tell the name of the embedded font?

@ghost

ghost commented May 7, 2013

By common fonts I meant core web fonts, those are very highly likely to be installed on user's machine.
I think the relevant ones are specifically: Arial, Arial Black, Andale Mono, Courier New, Georgia, Impact, Times New Roman, Trebuchet MS, Verdana

Owner

coolwanglu commented May 7, 2013

Sometimes yes. If you have installed poppler, there is a command pdffonts available. The first column shows the names.

The names are not reliable. The accuracy of the HTML depends on the metrics of the glyphs and the encoding of the font. There may be two sets of metrics stored in a PDF: one is defined in the embedded font, and the other is stored in the font descriptor in the PDF. When they are not consistent, the second one should be used, which means the font has to be transformed, and that cannot be done by the browser.

As for encoding: if you expect to use a good font on the client side, the text must be encoded in Unicode correctly, which is not reliable either. There are too many PDF files whose text cannot be copied out correctly.

Since PDF is not designed for the Web, I actually don't see these fonts often used in PDF (except for Times), at least in my test cases. You may verify this using pdffonts.

So my point is that it might not be worth it to do so. But later I may export the embedded names in CSS, so that you can do post-processing if you want. And this should be done carefully, since malicious code could be injected in this way.
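For post-processing of that kind, font names (for example those in the first column of pdffonts output) could be grouped as a rough hint. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the names are assumed to have been collected into a list already, and it relies only on the PDF convention that a subset font name starts with a six-uppercase-letter tag followed by "+". As noted above, a matching name does not prove that two fonts are interchangeable.

```python
import re
from collections import defaultdict

# PDF subset fonts are named like "ABCDEF+RealName".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_name(font_name):
    """Strip the subset tag, if any, from a PDF font name."""
    return SUBSET_TAG.sub("", font_name)


def group_by_base_name(font_names):
    """Group names that differ only in their subset tag (a hint, not proof)."""
    groups = defaultdict(list)
    for name in font_names:
        groups[base_name(name)].append(name)
    return dict(groups)


names = ["AGKMCI+FZDBSJW--GB1-0", "YOLBEC+FZDBSJW--GB1-0", "IIKSQH+MS-Gothic"]
for base, members in group_by_base_name(names).items():
    print(base, len(members))
```

Groups with more than one member are candidates for closer inspection, for instance by comparing glyph metrics and encodings.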

@ghost

ghost commented May 7, 2013

I admire the depth of your knowledge on the subject.
Having a font name, available at least as a clue would be valuable for post processing. PDF files published on government websites almost always use a subset of core web fonts.

Also I am guessing the problem with duplicate fonts the OP is experiencing might be that bold and regular could end up as two embedded fonts. It seems to be the case for my example PDF.

Owner

coolwanglu commented May 7, 2013

Well, actually the bold variant is a new font. Usually a bold font is not
just the regular font with wider strokes, since lots of tuning would be
necessary. Although it is possible to simulate that by setting
font-weight in CSS, I don't want to do the recognition here.

Duplicated fonts, actually, are what a PDF generator should be concerned
with, not this converter. But I will try to make it smarter if a local font
is matched for multiple external fonts in the PDF.


I would very much appreciate a solution to this issue.
I think other people agree.
It would help improve load time by minimizing requests.

Owner

coolwanglu commented Jun 17, 2013

My thought about this issue is that the fonts are not introduced by pdf2htmlEX, they have been there in the PDF file. So probably you are looking for a PDF optimization tool.

pdf2htmlEX is meant to be a conversion tool, so if there are multiple fonts in the PDF file, there should also be multiple fonts in the HTML output.

Imagine a JPG->PNG tool: you don't expect it to retouch the images for you, do you?

Owner

coolwanglu commented Jun 17, 2013

Please refer to https://github.com/coolwanglu/pdf2htmlEX/wiki/Unfeatures for an explanation of what pdf2htmlEX does not do. Sorry for letting this last so long.

@coolwanglu coolwanglu closed this Jun 17, 2013

I don't agree with you.
At first I thought you were right: I looked at the PDF properties, and it is true that it had duplicated fonts.

The problem is that I optimized the PDF with Acrobat, removing duplicate fonts and applying other optimizations; it is now a third of the size.

Then I ran the conversion, and I get the same number of font files in both cases (786).

I think it is a good idea to review this issue.

Thanks a lot.

I'm sharing the 2 example PDFs with you. Tell me when I can delete the links. (The first is the one without optimization.)

https://www.dropbox.com/s....

https://www.dropbox.com/s/4qty0xje...

Owner

coolwanglu commented Jun 20, 2013

@Toneti777 That's actually what I mean: if you can do that with a PDF optimizer, you should do it, while pdf2htmlEX is focused only on conversion.

You can always optimize the PDF file before feeding it into pdf2htmlEX.

Owner

coolwanglu commented Jun 20, 2013

@Toneti777 And please remove the links, if necessary I might ask you for some smaller examples. Thanks.

I did that!
And I get duplicated fonts too.
In both cases the result of pdf2htmlEX is the same.

Owner

coolwanglu commented Jun 20, 2013

OK, that sounds like a bug in pdf2htmlEX. Is it possible for you to provide a smaller file?

fedebot commented Jun 26, 2013

Same problem here

Owner

coolwanglu commented Jun 30, 2013

Hi @Toneti777 , I have checked both lne and lne_opt, according to the pdffonts command, they have exactly the same (132) fonts embedded.

What kind of optimizations did you apply?

Sorry for the delay... I was on holiday.

I checked in Acrobat and Adobe Reader: the first one has 132 fonts (and you can see how the font names repeat), but the optimized one has only around 40 fonts.

Owner

coolwanglu commented Jul 18, 2013

@Toneti777 Yes, I see that in Adobe Reader. Let me check if this is something defined in the PDF standard, or some trick done by the reader.

Owner

coolwanglu commented Jul 18, 2013

@Toneti777 It seems that there is only one font "stored"; the other fonts refer to the same binary with different encodings. This can be done in PDF, but unfortunately I can't find a counterpart in HTML. Currently I have to store separate, re-encoded fonts in HTML.

On the other hand, such fonts might be merged into one if their encodings are the same. But this should be out of scope for pdf2htmlEX.

I think this isn't out of scope, because in the optimized one there is only one font reference per binary. In this case, pdf2htmlEX creates a lot of output files for the same font.

I can't do anything in this case if my input file has the same reference for all occurrences of the same "stored" font and the library duplicates them.

I and other people would appreciate it if you could look into it.

Thanks..

Owner

coolwanglu commented Aug 2, 2013

@Toneti777 Yes, your points make sense; this is indeed something extra created by pdf2htmlEX, so I'm reopening it.

But right now I don't have a good solution for it. I'll think about it.

@coolwanglu coolwanglu reopened this Aug 2, 2013

This library is awesome, but to complete my project with guarantees I need to reduce the size and number of files that the process creates.
I would very much appreciate some improvements on the duplicate fonts problem. A lot of people who use large PDFs could benefit.

Owner

coolwanglu commented Sep 25, 2013

@Toneti777 Unfortunately I have not found the solution yet. I'm not sure if this happens for only specific PDF files or a number of them.

Also there doesn't seem to be anyone else suffering from this issue, which keeps it at a low priority. Maybe you can consider sponsoring this issue and see if anyone else would like to solve it.

I am also experiencing hundreds of duplicate fonts spit out from a multi-page pdf. I would like to take a look at comparison and merge using fontforge. @coolwanglu, can you provide more detail on what you know and where a good place to perform the comparison/merge would be?

Owner

coolwanglu commented Oct 9, 2013

@mickgiles I've updated the issue description. The comparison part is more like a PDF optimization job. The one concerned here is about encodings.

Owner

coolwanglu commented Oct 11, 2013

@mickgiles Please first check (with pdffonts) if there are actually many fonts in the PDF, otherwise it may be the linking case here.
There is a tool called sfddiff in fontforge, not sure if that will work for you. Also you might try to write Fontforge scripts with python or its native syntax.
http://fontforge.org/scripting-tutorial.html
http://fontforge.org/python.html

Collaborator

duanyao commented Jun 4, 2014

I also encountered this issue for some pdf files recently.
There are fonts that share the same DescendantFonts and have different ToUnicode tables, and these tables are compatible. In this case, I think pdf2htmlEX should be able to avoid duplicated fonts.

Example:
the first font:

379 0 obj
<<
/BaseFont /AGKMCI+FZDBSJW--GB1-0
/DescendantFonts [1146 0 R]
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 1152 0 R
/Type /Font
>>
endobj

1152 0 obj
<<
/Length 380
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<040B> <5143>
<1284> <7B2C>
<02B7> <4E00>
<04ED> <5355>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

endstream
endobj

the second font.

628 0 obj
<<
/BaseFont /YOLBEC+FZDBSJW--GB1-0
/DescendantFonts [1146 0 R]
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 1176 0 R
/Type /Font
>>
endobj

1176 0 obj
<<
/Length 380
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0304> <4E8C>
<040B> <5143>
<1284> <7B2C>
<04ED> <5355>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

endstream
endobj
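
The compatibility of the two ToUnicode streams above can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, only `beginbfchar` sections are read (not `beginbfrange`), and each source code and destination is treated as a single hexadecimal number.

```python
import re

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {code: unicode} from the bfchar sections of a ToUnicode CMap."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(section):
            mapping[int(src, 16)] = int(dst, 16)
    return mapping


def merge_if_compatible(map_a, map_b):
    """Union of both maps, or None if they disagree on a shared code."""
    for code, value in map_a.items():
        if code in map_b and map_b[code] != value:
            return None
    return {**map_a, **map_b}


# bfchar entries copied from objects 1152 and 1176 above.
first = parse_bfchar("""4 beginbfchar
<040B> <5143>
<1284> <7B2C>
<02B7> <4E00>
<04ED> <5355>
endbfchar""")
second = parse_bfchar("""4 beginbfchar
<0304> <4E8C>
<040B> <5143>
<1284> <7B2C>
<04ED> <5355>
endbfchar""")

merged = merge_if_compatible(first, second)
print(len(merged))  # 5 distinct codes, no conflicts
```

The three shared codes (040B, 1284, 04ED) map to the same characters in both tables, so the merged table has five entries and a single font could serve both.
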
Owner

coolwanglu commented Jun 4, 2014

@duanyao Yes indeed, that should be optimized without much difficulty.

But sometimes I hesitate about these features. Do you think there should be a PDF optimizer for this, or should this feature be integrated into a converter?

Collaborator

duanyao commented Jun 5, 2014

These pdf's are produced by acrobat X's optimizer (ghostscript complains about them so no result), so I think it's unlikely that most PDF optimizers would handle this situation. After all, these structures in a pdf don't introduce much bloat.

Collaborator

duanyao commented Jun 5, 2014

@coolwanglu
Update: I re-tested with ghostscript 9.10, and it does optimize these fonts in my pdf's. After that, pdf2htmlEX will not produce duplicated fonts. However, both acrobat and ghostscript have their limitations.

I think this could be a low priority feature, and I suggest that people who encounter this issue give ghostscript a try. These docs may be useful:

PDF manipulation tips, Part 1 Ghostscript
Ghostscript pdfwrite options

Owner

coolwanglu commented Jun 5, 2014

@duanyao Thanks for your input!

I try to reduce WOFF file size by merging WOFF files with the same font name in a post-processing step. Say I find that f19f.woff and f1a6.woff are both IIKSQH+MS-Gothic; then I merge them into one WOFF file, IIKSQH+MS-Gothic.woff, with FontForge, and replace the @font-face src with the correct WOFF file in the CSS file.

I'm not sure whether this is a correct approach, but the total WOFF file size is reduced significantly after merging duplicate fonts, and I haven't encountered anything weird when using the merged font.

The repository is here: https://github.com/yu-liang-kono/pdf2htmlEXOptimize
Please use it at your own risk; any suggestions are welcome.

Collaborator

duanyao commented Jun 10, 2014

@yu-liang-kono
Would you try ghostscript to optimize your pdf? As I said above, ghostscript can probably merge fonts with same name.

@duanyao
I tried gs -sDEVICE=pdfwrite -sOutputFile='optimize.pdf' -dNOPAUSE -dBATCH original.pdf, and it really does some magic to the pdf file, the pdf2htmlEX output woff file size is reduced. That is great, thanks.
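
The Ghostscript pre-pass above could be scripted before the conversion. This is a minimal Python sketch, assuming `gs` and `pdf2htmlEX` are both on the PATH; the function names and file names are hypothetical, and the Ghostscript flags are exactly those from the command quoted above.

```python
import shutil
import subprocess


def gs_optimize_command(src_pdf, dst_pdf):
    """Build the Ghostscript pdfwrite command used in this thread."""
    return ["gs", "-sDEVICE=pdfwrite", f"-sOutputFile={dst_pdf}",
            "-dNOPAUSE", "-dBATCH", src_pdf]


def optimize_then_convert(src_pdf, optimized_pdf="optimize.pdf"):
    """Rewrite the PDF with Ghostscript, then run pdf2htmlEX on the result."""
    for tool in ("gs", "pdf2htmlEX"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(gs_optimize_command(src_pdf, optimized_pdf), check=True)
    subprocess.run(["pdf2htmlEX", optimized_pdf], check=True)


print(" ".join(gs_optimize_command("original.pdf", "optimize.pdf")))
```

Whether the pre-pass actually reduces the number of fonts depends on the PDF, so it is worth comparing the pdffonts output before and after.
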

Collaborator

duanyao commented Jun 10, 2014

@yu-liang-kono
You are welcome. Ghostscript is awesome!

Owner

coolwanglu commented Jun 14, 2014

[Wiki page] added.
I wonder if @Toneti777 could verify it works for the original issue.

@coolwanglu coolwanglu closed this Jun 14, 2014

zowers commented Jul 23, 2014

example pdf: http://worldtracker.org/media/library/How-To/For%20Dummies%20eBook%20Collection/Hacking%20for%20DUMmIES%202nd.pdf
pdf2htmlex takes forever, produces lots of fonts
Ghostscript is sometimes an overkill, and cannot be used

I have tried the solution, but it doesn't work for me.
I still have a lot of duplicate fonts in the pdf2htmlEX result.

I receive this...

-ERROR>Working: 0/48
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,238]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,238]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,305]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,305]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,758]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,759]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,829]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:44,830]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,103]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,104]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,179]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,180]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,338]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,338]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,405]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,406]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,469]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,469]-ERROR>For the latin script, this key character is o' (U+006F).
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,635]-ERROR>No glyph for the key character to derive standard width and height.
INFO-es.renr.gdr.hemeroteca.utils.StreamControl-[Thread-11][2014-10-24 08:57:45,635]-ERROR>For the latin script, this key character is o' (U+006F).

zowers commented Oct 24, 2014

btw: the generated fonts are also reported by the poppler library and xpdf, so this is an old problem rooted in the libraries used by pdf2htmlEX

@yu-liang-kono In my case, optimization with GS like:
gs -sDEVICE=pdfwrite -sOutputFile='optimize.pdf' -dNOPAUSE -dBATCH original.pdf
only removes a few fonts; more than 500 remain.

With your Python optimization the number of fonts is reduced to 132. The problem is that the fonts rendered in the result aren't the same as before.

You can try with this pdf:
https://www.dropbox.com/s/ba1tzdx7lpyoe36/prePDF_fddd5.pdf?dl=0

I tried other gs optimizations, and I've tried other libraries, but I haven't found a good result.

Thanks

During the process you generate the result page by page.
I think you read the fonts used on each page and add them to the result.
You don't check whether the current font has already been created for previous pages.

I think checking that would be a solution.

lmtoo commented Dec 12, 2016

@Toneti777 how did you solve the
No glyph for the key character to derive standard width and height.
problem?
