Reflowable Text / Unwrapping Lines #56

Open
coolwanglu opened this Issue Dec 11, 2012 · 10 comments


@coolwanglu
Owner

PDF was designed for printing, and it has limited support for devices with different screen sizes. HTML comes from the opposite direction: it originates from a reflowable plain-text stream.

This page shows the difficulties of recognizing and producing reflowable text for pdf2htmlEX. But maybe we can focus on the simplest cases with proper parameters for users.

This issue was created and left open for discussion of this feature. Please read the wiki page above before leaving a message here.

@404pnf
404pnf commented Dec 12, 2012

Thank you for trying this!

@coolwanglu
Owner

@iapain, I remember that you showed me a video about reflowing text in PDF, but I forgot its name. Could you please tell me?

@coolwanglu
Owner

To begin, we can make some assumptions and start from the easiest case:

No headers, footers, figures, or tables, and a single column of text.

The task is to combine text lines in the same paragraph. We can further extend this to two-column layouts, as text rendering order is usually the same as reading order.
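As a rough illustration of the line-combining task, here is a minimal Python sketch. It is not pdf2htmlEX code: the `Line` record, the `merge_lines` function, and the two thresholds (`gap_factor`, `indent_tol`) are all hypothetical, and it assumes lines arrive in reading order with `y` increasing downward. A new paragraph starts when the vertical gap is unusually large or the line is indented from the left margin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical text line: content plus its box on the page."""
    text: str
    x: float       # left edge
    y: float       # top edge, increasing downward
    height: float


def merge_lines(lines: List[Line],
                gap_factor: float = 1.5,
                indent_tol: float = 2.0) -> List[str]:
    """Join consecutive lines into paragraphs (single column assumed)."""
    if not lines:
        return []
    left_margin = min(l.x for l in lines)
    paragraphs: List[str] = []
    current = [lines[0].text]
    prev = lines[0]
    for line in lines[1:]:
        # A large vertical jump suggests a paragraph break.
        big_gap = (line.y - prev.y) > gap_factor * prev.height
        # A first-line indent also suggests a new paragraph.
        indented = (line.x - left_margin) > indent_tol
        if big_gap or indented:
            paragraphs.append(" ".join(current))
            current = [line.text]
        else:
            current.append(line.text)
        prev = line
    paragraphs.append(" ".join(current))
    return paragraphs
```

Real documents would need tuned thresholds and extra handling (hyphenated line ends, item lists, centered text), which is exactly where manual parameters could help.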

Spacing might be a problem. I think it's hard to preserve the exact spacing (the widths of letter spacing, word spacing, and space characters), so we need to relax that requirement and estimate them.
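One way to "relax and estimate" spacing is to ignore the exact gaps and only decide whether a gap is a word space. The sketch below is an assumption-laden example, not the project's method: glyphs are hypothetical `(char, x, width)` tuples on one line in left-to-right order, and `space_ratio` is a made-up threshold expressed as a fraction of the font size.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, left x, advance width)


def rebuild_text(glyphs: List[Glyph],
                 font_size: float,
                 space_ratio: float = 0.25) -> str:
    """Rebuild a line of text, inserting a space wherever the
    horizontal gap between glyphs exceeds space_ratio * font_size."""
    out: List[str] = []
    prev_end = None
    for ch, x, width in glyphs:
        if prev_end is not None and (x - prev_end) > space_ratio * font_size:
            out.append(" ")  # gap is wide enough to count as a word space
        out.append(ch)
        prev_end = x + width
    return "".join(out)
```

Smaller gaps (kerning, letter spacing) are simply dropped, so the output reflows freely at the cost of not matching the original layout exactly.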

The process should be easy if facilitated with manual marking/adjustments.

Ref: http://scribdtech.wordpress.com/2012/02/29/why-zooming-on-mobile-is-broken-and-how-to-fix-it/

Another direction is to tag the PDF in some way and use the tag information in the PDF file.

@coolwanglu
Owner

Lots of useful information here:
http://wiki.mobileread.com/wiki/PDF

@alvis
alvis commented Oct 30, 2013

For text reflowing, you may want to consider the approach of K2pdfopt. It is an open-source tool that converts PDF files to a different page size with reflowed text.

http://www.willus.com/k2pdfopt

@coolwanglu
Owner

@alvisty Thanks for your information!

It seems to be actively developed, and it looks promising. I'll check it out and see how it works.

@alvis
alvis commented Oct 30, 2013

An Indian PDF converter (Aiox) seems to be able to convert documents to ePub perfectly (with OCR and some manual input). https://www.youtube.com/watch?v=qC1hwJ8KFL8

Infty reader is even able to convert math equations to LaTeX/MathML format.
https://www.youtube.com/watch?v=PHDZEjwWjx0
http://www.inftyproject.org/en/demo.html#0002

Unfortunately, both of them are proprietary, so the sources are unavailable.
Still, technically speaking at least, complete reflowing (even with math and tables) is achievable.
Also, both of them use OCR as input, which helps identify the relationships between pieces of text.

@coolwanglu
Owner

I wouldn't say "reflowing is achievable" based on these videos. For example, in the first video the page box is (or maybe has to be) provided by drawing it manually. The document seems to be well formed and single-column, the paragraphs are well aligned (no item lists, etc.), and I didn't see any tables in the video.

But I do believe that it's doable with some assumptions on the document type and format.

The OCR example is indeed amazing, but I couldn't try it with my own files due to maintenance.

@coolwanglu
Owner

@iapain k2pdfopt seems to be image-based; I'll try to read more of its code later. For now I'll try to create a new text model that can support reflowable text in the future. The new model will be somewhat like Crocodoc's, which consists of text groups containing relatively positioned text lines.
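To make the idea of "text groups with relatively positioned text lines" concrete, here is a small Python sketch. It is only a guess at the shape of such a model, not Crocodoc's actual format or the planned pdf2htmlEX design; `RelLine`, `TextGroup`, and `make_group` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RelLine:
    """A text line stored as an offset from its group's origin."""
    text: str
    dx: float
    dy: float


@dataclass
class TextGroup:
    """A block of lines; only the group carries an absolute position."""
    x: float
    y: float
    lines: List[RelLine] = field(default_factory=list)

    def absolute_positions(self) -> List[Tuple[str, float, float]]:
        """Resolve each line back to page coordinates."""
        return [(l.text, self.x + l.dx, self.y + l.dy) for l in self.lines]


def make_group(abs_lines: List[Tuple[str, float, float]]) -> TextGroup:
    """Build a group from (text, x, y) lines given in page coordinates."""
    if not abs_lines:
        raise ValueError("a text group needs at least one line")
    x0 = min(x for _, x, _ in abs_lines)
    y0 = min(y for _, _, y in abs_lines)
    rel = [RelLine(t, x - x0, y - y0) for t, x, y in abs_lines]
    return TextGroup(x0, y0, rel)
```

Because lines only know their offsets, moving or rescaling a group moves all of its lines together, which is the property that would make later reflowing easier.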
