Change wrapping lines algorithm #783

Closed
diegomura opened this issue Feb 18, 2018 · 3 comments

Comments

@diegomura
Collaborator

Hi @devongovett @alafr !

I'm building react-pdf, which uses pdfkit under the hood. The library has become quite popular and has plenty of users, so I'd like to thank you for helping make that possible.
However, we now need to support justified paragraphs and proper word wrapping (soft hyphens, non-breaking spaces, etc.). I know you already support the first but not the second, and other people are asking for it too. The current paragraph justification is also not optimal.

I started working on a fix using the Breaking Paragraphs into Lines algorithm by Donald E. Knuth and Michael F. Plass, which I think is an excellent solution for this. There is already a JS library that implements it here. However, it's crucial for us to know whether this library is still maintained, and whether someone would be able to test this implementation and eventually merge it into master. If not, we would be forced to come up with another solution 😄

Thanks for your time and work!

@devongovett
Member

I would definitely recommend doing this somewhere higher in the stack than PDFKit. PDFKit basically immediately writes content to the file as you are adding it, but for justification, especially when multiple styles or fonts can be inlined, you need a multi-pass layout algorithm. You'll need to do the layout first, and then once the final glyphs and their positions are known, send them to PDFKit for rendering.
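
A minimal sketch of that layout-then-render split, for a single justified line whose fragments use different fonts. The names `layoutJustifiedLine`, `renderLine`, `measure` and `draw` are hypothetical, and the metrics are a stand-in (a fixed width per character), not real font measurements:

```javascript
// Pass 1: pure layout. Nothing is drawn here, so every fragment can be
// measured in its own font before any position is decided.
// `measure(text, font)` is a hypothetical callback returning a width.
function layoutJustifiedLine(fragments, lineWidth, measure) {
  const widths = fragments.map((f) => measure(f.text, f.font));
  const natural = widths.reduce((sum, w) => sum + w, 0);
  const gaps = fragments.length - 1;
  // Leftover space is shared equally between the gaps. A single-fragment
  // line is left alone. A line wider than lineWidth would get a negative gap.
  const gap = gaps > 0 ? (lineWidth - natural) / gaps : 0;

  let x = 0;
  return fragments.map((f, i) => {
    const placed = { text: f.text, font: f.font, x, width: widths[i] };
    x += widths[i] + gap;
    return placed;
  });
}

// Pass 2: rendering. Positions are final, so output can be streamed.
// `draw(text, font, x, y)` is a hypothetical callback.
function renderLine(placed, y, draw) {
  for (const p of placed) draw(p.text, p.font, p.x, y);
}

// Stand-in metrics: 6 units per character, 7 for the "bold" font.
const measure = (text, font) => text.length * (font === 'bold' ? 7 : 6);

const line = layoutJustifiedLine(
  [
    { text: 'Text', font: 'regular' },
    { text: 'layout', font: 'bold' },
    { text: 'first', font: 'regular' },
  ],
  120,
  measure
);

const drawn = [];
renderLine(line, 0, (text, font, x, y) => drawn.push({ text, font, x, y }));
```

The point of the split is that the gap size depends on the widths of all fragments on the line, so the first fragment cannot be written out until the last one has been measured. With PDFKit, `measure` and `draw` could plausibly be thin wrappers around `doc.widthOfString` and `doc.text` with `lineBreak: false`; that wiring is an assumption and is not shown here.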

Text layout is actually a really hard problem - way harder than it seems at first glance. Getting the details right in a way that works for all languages is crazy challenging. Here's the basic text layout algorithm most text systems (like word processors, operating systems, web browsers, etc.) follow:

  1. Split text into paragraphs - the following steps are applied to each paragraph
  2. Get bidi runs and compute paragraph direction - This is the Unicode bidirectional algorithm. See http://www.unicode.org/reports/tr9/
  3. Font substitution - check whether the user-defined font actually supports each character the user wants to render. If not, replace with a font that does. This produces "runs" of text in the same font. See https://github.com/devongovett/font-manager for a way to do font substitution using the native OS.
  4. Script itemization - in Unicode, each character is part of a script. Break the text into runs of similar scripts. This data is exposed by https://github.com/devongovett/unicode-properties.
  5. Font shaping - for each run of text, convert characters to glyphs from that font. This can be done using http://github.com/devongovett/fontkit - the library PDFKit already uses.
  6. Line breaking - Using the generated glyph runs for the paragraph, break into lines using the Unicode line breaking algorithm. This can be done using https://github.com/devongovett/linebreak.
  7. Bidi reordering - Using the bidi information computed earlier, reorder the generated glyph runs on each line according to the bidi algorithm.
  8. Apply tab stops - make sure the tab characters on each line are the correct width so that they align with tab stops.
  9. Justification - If justification is enabled, adjust the spacing between each glyph on each line to justify it.
  10. Finalize lines - Apply text-decoration, hanging punctuation, etc.
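
Several of these steps work by cutting the paragraph into "runs" that share a property. As a rough illustration of step 4, here is a sketch that groups characters by script. It uses JavaScript's built-in Unicode property escapes as a stand-in data source rather than the unicode-properties library, covers only a handful of scripts, and the names `SCRIPTS`, `scriptOf` and `itemize` are made up for this example:

```javascript
// Scripts this sketch knows about. A real itemizer would cover all of Unicode.
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Greek: /\p{Script=Greek}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Hebrew: /\p{Script=Hebrew}/u,
  Arabic: /\p{Script=Arabic}/u,
  Han: /\p{Script=Han}/u,
};

function scriptOf(char) {
  for (const [name, pattern] of Object.entries(SCRIPTS)) {
    if (pattern.test(char)) return name;
  }
  return null; // spaces, digits, punctuation, or a script not listed above
}

// Split text into runs of a single script. Characters with no script of
// their own join the run before them; if they come first, they are absorbed
// by the first character that does have a script.
function itemize(text) {
  const runs = [];
  for (const char of text) { // iterates by code point, not UTF-16 unit
    const script = scriptOf(char);
    const last = runs[runs.length - 1];
    if (!last) {
      runs.push({ script, text: char });
    } else if (script === null || script === last.script) {
      last.text += char;
    } else if (last.script === null) {
      last.script = script;
      last.text += char;
    } else {
      runs.push({ script, text: char });
    }
  }
  return runs;
}

const runs = itemize('Hello, мир! γεια');
// runs: Latin "Hello, ", Cyrillic "мир! ", Greek "γεια"
```

Attaching neutral characters to the preceding run is a simplification chosen to keep the sketch short. Font substitution (step 3) can be pictured the same way, with "which font covers this character" in place of "which script is this character".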

As you can see, there are a lot of steps here to do text layout correctly in a Unicode friendly way. PDFKit currently implements an extremely basic version of this without a lot of the steps. It basically only works well for unidirectional text in a single font, and you'll hit that limitation very quickly for anything complicated.

I worked on this problem a bit in https://github.com/devongovett/textkit a while ago. I've been meaning to clean that up and release it, but I don't really have time. It's not really finished or well tested at the moment, but if you feel like taking a look at it feel free! Seems like it might be useful for react-pdf and other similar libraries that want to do text layout. Happy to help out - let me know if you have questions or if you want to help take over that code!

@diegomura
Collaborator Author

Thanks for your answer @devongovett. It was very informative and helpful.
I know text layout is a very complex subject.
I will definitely check out textkit and see how I can fit it into my solution.

As I explained, I really need to implement the Knuth and Plass line breaking algorithm for my solution, and based on what you said, the linebreak lib implements the Unicode line breaking algorithm. I'm not an expert on the subject, but I think they do things a bit differently. Do you think this is something we could parametrize in linebreak to support both ways of splitting lines?

@devongovett
Member

The linebreak library only tells you where in a string of characters it is valid to break a line according to Unicode (e.g. on spaces for Latin text). Knuth and Plass is a line layout algorithm. It would use something like linebreak to determine the valid breakpoints.
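
To make that division of labour concrete, here is a much simplified sketch. `findBreaks` is a stand-in for a break finder such as linebreak (it only knows about spaces), and `breakLines` picks among those breakpoints by minimising the sum of squared unused width per line. That is the "consider the whole paragraph" idea behind Knuth and Plass, not their full box/glue/penalty model; both function names and the character-count `measure` are assumptions made for this example:

```javascript
// Stand-in break finder: returns the string offsets where a line may end.
// Here that is just "after each run of spaces", plus the end of the text.
function findBreaks(text) {
  const breaks = [];
  const pattern = / +/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    breaks.push(match.index + match[0].length);
  }
  if (breaks[breaks.length - 1] !== text.length) breaks.push(text.length);
  return breaks;
}

// Choose line ends among the breakpoints so that the total badness is
// minimal. Badness of a line = (unused width)^2; a last line that fits is
// free. Assumes measure() never shrinks when a line gets longer.
function breakLines(text, lineWidth, measure) {
  const points = [0, ...findBreaks(text)];
  const n = points.length - 1;
  const cost = new Array(n + 1).fill(Infinity); // best total badness up to each point
  const prev = new Array(n + 1).fill(-1); // start point of the line ending there
  cost[0] = 0;

  for (let end = 1; end <= n; end++) {
    for (let start = end - 1; start >= 0; start--) {
      const candidate = text.slice(points[start], points[end]).trimEnd();
      const slack = lineWidth - measure(candidate);
      const fits = slack >= 0;
      // Too long, and not a lone unbreakable chunk: earlier starts are longer still.
      if (!fits && start < end - 1) break;
      const badness = end === n && fits ? 0 : slack * slack;
      if (cost[start] + badness < cost[end]) {
        cost[end] = cost[start] + badness;
        prev[end] = start;
      }
    }
  }

  // Walk the chosen breaks backwards to recover the lines.
  const lines = [];
  for (let end = n; end > 0; end = prev[end]) {
    lines.unshift(text.slice(points[prev[end]], points[end]).trimEnd());
  }
  return lines;
}

const lines = breakLines('aaa bb cc ddddd', 6, (s) => s.length);
// lines: ['aaa', 'bb cc', 'ddddd']
```

A first-fit wrapper would produce `aaa bb` / `cc` / `ddddd` for the same input, leaving a very short middle line; looking at the whole paragraph trades a slightly looser first line for a more even result. Swapping `findBreaks` for a Unicode-aware break finder would change which breakpoints are offered without touching `breakLines`.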
