PDF was designed for printing, and it has limited support for devices with different screen sizes. HTML goes in the other direction: it originates from a reflowable plain text stream.
This page shows the difficulties of recognizing and producing reflowable text for pdf2htmlEX. But maybe we can focus on the simplest cases with proper parameters for users.
This issue was created for discussion of this feature. Please read the wiki page above before leaving a message here.
Thank you for trying this!
@iapain, I remember that you showed me a video about reflowing text in PDF, but I forgot its name. Could you please tell me?
@coolwanglu Here it is: http://www.youtube.com/watch?v=6VjVlhJGs6I
As a start, we can make some assumptions and begin with the easiest case:
no header, footer, figure, or table, and a single column of text.
The task is to combine text lines in the same paragraph. We can further extend this to two-column layouts, as text rendering order is usually the same as reading order.
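To make the single-column case concrete, here is a minimal Python sketch of combining lines into paragraphs. It is not pdf2htmlEX code: the `Line` type, `merge_lines`, `join_lines`, and the `gap_factor` parameter are hypothetical names, and the rule "a vertical step much larger than the typical one starts a new paragraph" is only one possible assumption.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    top: float  # y coordinate of the line, growing downward (assumed)


def join_lines(texts):
    """Join the lines of one paragraph into a single string."""
    out = ""
    for t in texts:
        t = t.strip()
        if out.endswith("-"):
            # Naive de-hyphenation: also (wrongly) removes genuine hyphens.
            out = out[:-1] + t
        elif out:
            out += " " + t
        else:
            out = t
    return out


def merge_lines(lines, gap_factor=1.5):
    """Group lines (already in reading order) into paragraphs."""
    if not lines:
        return []
    # Typical vertical step between consecutive lines.
    steps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    typical = median(steps) if steps else 0.0
    paragraphs = [[lines[0].text]]
    for prev, cur in zip(lines, lines[1:]):
        if cur.top - prev.top > gap_factor * typical:
            paragraphs.append([cur.text])  # large gap: new paragraph
        else:
            paragraphs[-1].append(cur.text)
    return [join_lines(p) for p in paragraphs]
```

Paragraphs marked only by first-line indentation, with no extra vertical gap, would need a different signal, such as the left offset of each line.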
Spacing might be a problem. I think it's hard to preserve the exact spacing (width of letter-space, word-space, space-char), so we need to relax and estimate them.
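One way to relax the spacing is to stop preserving exact widths and only decide whether a horizontal gap is a word break. The Python sketch below assumes glyphs arrive as `(char, x_start, x_end)` tuples sorted left to right; `rebuild_text` and the `word_gap_ratio` threshold are hypothetical, and the threshold value is a guess that would need tuning per font.

```python
def rebuild_text(glyphs, font_size, word_gap_ratio=0.2):
    """Rebuild a line of text, turning wide gaps into single spaces.

    glyphs: list of (char, x_start, x_end), sorted by x_start (assumed).
    """
    out = []
    prev_end = None
    for ch, x0, x1 in glyphs:
        gap = 0.0 if prev_end is None else x0 - prev_end
        wide = gap > word_gap_ratio * font_size
        # An explicit space char and a wide gap both collapse to one space.
        if (ch == " " or wide) and out and out[-1] != " ":
            out.append(" ")
        if ch != " ":
            out.append(ch)
        prev_end = x1
    return "".join(out)
```

Letter-spacing and kerning are discarded here and left to the browser, which is the trade-off of estimating instead of preserving.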
The process should be easy if facilitated with manual marking/adjustments.
Another direction is to tag the PDF in some way, and to make use of the tag information in the PDF file.
Lots of useful information here:
For text reflowing, you may consider the approach of K2pdfopt. It is an open-source tool which converts PDF files to a different page size with reflowed text.
@alvisty Thanks for your information!
It seems to have been actively developed, and it looks promising. I'll check it out and see how it works.
An Indian PDF converter (Aiox) seems to be able to convert documents to ePub perfectly (with OCR & some manual input). https://www.youtube.com/watch?v=qC1hwJ8KFL8
InftyReader is even able to convert math equations to LaTeX/MathML format.
Unfortunately, both of them are proprietary. Sources are hence unavailable.
Yet, at least technically speaking, complete reflowing (even with math & table) is achievable.
Also, both of them use OCR as input. It helps to identify the relationships between pieces of text.
I wouldn't say "reflowing is achievable" based on these videos. For example, in the first video the page box is (or maybe has to be) provided by drawing it manually. The document seems to be well formed and single-column, the paragraphs are quite aligned (no item lists etc.), and I didn't see any tables in the video.
But I do believe that it's doable with some assumptions on the document type and format.
The OCR example is indeed amazing, but I couldn't try it with my own files due to maintenance.
@iapain k2pdfopt seems to be image-based; I'll try to read more of its code later. For now I'll try to create a new text model that can support reflowable text in the future. The new model will be somewhat like Crocodoc's, consisting of text groups with relatively positioned text lines.
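As a rough illustration of "text groups with relatively positioned text lines", here is a small Python sketch. It reflects neither Crocodoc's actual format nor pdf2htmlEX's implementation; `TextGroup`, `TextLine`, and their fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TextLine:
    text: str
    dx: float  # offset from the group's left edge
    dy: float  # offset from the group's top edge


@dataclass
class TextGroup:
    x: float  # absolute position of the group on the page
    y: float
    lines: list = field(default_factory=list)

    def add_absolute(self, text, abs_x, abs_y):
        """Store a line given in page coordinates as a relative offset."""
        self.lines.append(TextLine(text, abs_x - self.x, abs_y - self.y))

    def absolute_positions(self):
        """Resolve every line back to page coordinates."""
        return [(l.text, self.x + l.dx, self.y + l.dy) for l in self.lines]
```

Because lines only store offsets, moving a group (for example `group.x = 50`) moves all of its lines together, which is the property a later reflow step could build on.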