PDFs don't render correctly. #82

Open
bigfatbird opened this issue May 20, 2017 · 19 comments

@bigfatbird

Text is aligned oddly, code indentation doesn't look right, and I guess some characters are not encoded correctly.

@babluboy
Owner

@bigfatbird Yes, I am aware of this issue. I'm currently using poppler-utils to convert PDF to HTML for rendering, and the conversion is not great. I will use this issue to track it and see if I can extract the text and images programmatically to have greater control over rendering. I can also see whether some CSS can be applied to render the text a little better.

For now, you can use the reading preferences for line width and line height to adjust the content a little.
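
For readers unfamiliar with the conversion step described above, here is a minimal sketch of driving poppler's `pdftohtml` from a script. The thread does not say which options Bookworm actually passes, so the choice of `-c` and `-enc UTF-8`, the helper names, and the `book` output prefix are all assumptions made for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_prefix, complex_layout=True):
    """Return the argument list for a pdftohtml run (nothing is executed here)."""
    cmd = ["pdftohtml"]
    if complex_layout:
        cmd.append("-c")  # "complex" output, which tries to keep the page layout
    cmd += ["-enc", "UTF-8"]  # ask for UTF-8 encoded text output
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def convert(pdf_path, out_dir):
    """Run pdftohtml if it is installed; return True when the conversion ran."""
    if shutil.which("pdftohtml") is None:
        return False  # poppler-utils is not available on this machine
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "book" is a hypothetical output prefix chosen for this sketch
    cmd = build_pdftohtml_command(pdf_path, out_dir / "book")
    subprocess.run(cmd, check=True)
    return True
```

Keeping the command construction separate from the execution makes it easy to compare simple and complex output modes on the same PDF when judging which renders closer to the original.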

@babluboy
Owner

@bigfatbird Can you post a screenshot here to show how the text currently looks for a PDF, and say whether the PDF is image-rich or just text?

@bigfatbird
Author

Sure. Here are two screenshots of the same book for example.
http://imgur.com/a/WaZpn

@babluboy
Owner

@bigfatbird Thanks. It looks like the rendering will improve if I can center the content, which should not be hard to achieve. I will update here when I get to this issue.

See if adjusting the line width helps a little until I get the fix in.

@babluboy babluboy self-assigned this May 20, 2017
@babluboy babluboy added the Bug label May 20, 2017
@babluboy babluboy added this to the 0.8 milestone May 20, 2017
@bigfatbird
Author

Just curious: why do you want to style it yourself when there is an existing PDF standard?
I assume a PDF should look exactly as it was released.

@babluboy
Owner

Not sure. At the moment I'm using the poppler utility pdftohtml, and I don't see any option to render the HTML the way it looks in the Evince viewer. Perhaps I should check the Evince code to see how the rendering is done.

@bigfatbird
Author

bigfatbird commented May 20, 2017 via email

@babluboy
Owner

That sounds great, thanks for the suggestion. It looks workable at a quick glance:
https://mozilla.github.io/pdf.js/examples/

@babluboy babluboy removed this from the 0.8 milestone Jul 13, 2017
@unhammer

unhammer commented Jul 29, 2017

Would using on-the-fly PDF rendering instead of cached pdftohtml output give a smaller ~/.config/bookworm DB too? Mine is already up to 1.9 GB.

@babluboy
Owner

@unhammer How many books are in the library, and how many are PDFs? The actual book content is cached on the file system (if the cache preference is set), but the metadata, including the table of contents, is stored in the DB. I have seen that pdftohtml generates a lot of pages, as I split the HTML at the page break tag. I will look at better PDF handling in the future.

If you turn off caching, the book content is cached in /tmp and automatically removed on restart. If you open the book again, it is parsed again and the HTML content is regenerated in /tmp, so it takes slightly longer to resume reading.
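
The two cache locations described above can be sketched roughly as follows. This is not Bookworm's code (the application is not written in Python); the function names, the `bookworm` subdirectory under the temp directory, and the per-book directory layout are hypothetical and only illustrate the trade-off between a persistent cache and one that is regenerated after a restart.

```python
import tempfile
from pathlib import Path


def cache_dir_for(book_id, caching_enabled, persistent_root):
    """Pick where the generated HTML for one book should live."""
    if caching_enabled:
        root = Path(persistent_root)  # survives restarts
    else:
        # hypothetical temp location; temp directories are typically cleared on reboot
        root = Path(tempfile.gettempdir()) / "bookworm"
    return root / str(book_id)


def needs_regeneration(cache_dir):
    """HTML must be regenerated when the directory is missing or empty."""
    cache_dir = Path(cache_dir)
    return not cache_dir.is_dir() or not any(cache_dir.iterdir())
```

With caching off, `needs_regeneration` would return true after every restart, which corresponds to the slightly longer resume time mentioned in the comment.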

@unhammer

unhammer commented Jul 30, 2017

In my case, 297 PDFs and 34 html/txt/epub.

@babluboy
Owner

Hmm... while 300 PDFs seems a large-ish library (I have not tested more than 100 PDFs), 1.9 GB still feels high just for the content data. I will look into this to see if I can replicate it.

@babluboy
Owner

babluboy commented Sep 8, 2017

@bigfatbird It doesn't look like it will be possible to convert PDF to HTML using pdf.js, based on this:
mozilla/pdf.js#8732
Bookworm relies on HTML files to apply all the text/color modifications, highlighting, search, navigation, etc.
I will need to either render the output of pdftohtml in a better way or find some other way to create HTML pages out of a PDF.

@babluboy
Owner

Looks like poppler can be used to get the chapters from the book using this example:
https://stackoverflow.com/questions/7131906/how-to-extract-pdf-index-table-of-contents-with-poppler

At least it will reduce the data in the metadata database by storing just the chapters and their corresponding HTML files. Currently I'm storing the location of every HTML file, one per page of the PDF, which bloats the DB size as mentioned here by @unhammer.
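
A rough sketch of the idea in the comment above: store one row per table-of-contents entry rather than one row per page. Everything here is hypothetical, including the `chapters` table, its columns, and the nested `(title, page, children)` tuples that stand in for whatever outline structure a PDF library returns. It is not Bookworm's actual schema.

```python
import sqlite3


def flatten_outline(outline, depth=0):
    """Yield (depth, title, page) for every entry of a nested outline.

    `outline` is a hypothetical structure: a list of (title, page, children)
    tuples standing in for the result of a PDF library's outline API.
    """
    for title, page, children in outline:
        yield depth, title, page
        yield from flatten_outline(children, depth + 1)


def store_chapters(conn, book_id, outline, html_for_page):
    """Insert one row per outline entry instead of one row per page."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapters ("
        "book_id INTEGER, depth INTEGER, title TEXT, html_path TEXT)"
    )
    rows = [
        (book_id, depth, title, html_for_page(page))
        for depth, title, page in flatten_outline(outline)
    ]
    conn.executemany("INSERT INTO chapters VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up outline for a short book; page numbers are illustrative only.
outline = [
    ("Introduction", 1, []),
    ("Basics", 5, [("Syntax", 6, []), ("Types", 12, [])]),
]
conn = sqlite3.connect(":memory:")
count = store_chapters(conn, 1, outline, lambda page: f"page{page}.html")
```

For a PDF with hundreds of pages and a few dozen outline entries, the number of metadata rows then scales with the outline rather than with the page count.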

@babluboy babluboy removed this from the 0.9.5 milestone Dec 10, 2017
@babluboy
Owner

@Preconf unfortunately I have not spent further time on this. I tried the following library but the extraction was too slow although the rendering was better:
https://github.com/coolwanglu/pdf2htmlEX

Will check evince to see if it is usable.

@prog-amateur

> @Preconf unfortunately I have not spent further time on this. I tried the following library but the extraction was too slow although the rendering was better:
> https://github.com/coolwanglu/pdf2htmlEX
>
> Will check evince to see if it is usable.

Hello, I came here after reading a review of your app on a website. Everything is perfect except the PDF support: a PDF should be displayed like the original, not rearranged. This is clearly a no-go for this specific aspect, and people can have many PDFs in their library.

So do you have any solution, please? This issue was opened in 2017.

Thank you, and please remember: I am complaining here, but I really like this app.

@0xBRM

0xBRM commented Oct 24, 2019

Completely unusable for PDFs. Might as well just stick to KOReader for the EPUB support and Evince for everything else.

@prog-amateur

> Completely unusable for pdfs. Might as well just stick to koreader for the epub support and evince for everything else.

Be careful, you can get a thumbs down (like I did for my request) for saying that this app cannot read PDFs correctly. There are some people here who maybe want to stay with a pdf code reader ( @bigfatbird ?)

@bigfatbird
Author

@prog-amateur I downvoted your reply and the other one because they are completely useless for fixing the bug and just create unnecessary work. An "it doesn't work for me either" comment under a bug report that merely describes something not working as expected is not helpful at all and is more work for the developers. It was not personal.

6 participants
@unhammer @bigfatbird @0xBRM @babluboy @prog-amateur and others