original images not kept unaltered #78
Comments
Unfortunately, it is not (easily) possible to extract the original images from a PDF file using open-source Linux software while keeping their orientation as in the PDF file. Therefore I use pdftoppm to GENERATE an image from the PDF file. The image is generated at the same resolution as the original image, but it is not the original image. If anybody has an idea how to proceed to solve this limitation, please let me know!
The PyPDF2 library can read internal PDF structure and get the page orientation.
The /Rotate field records the rotation anyone has applied to fix the orientation of a given page and should work in a lot of cases. It would be possible to get the page and image dimensions as well, I believe, and faster than the various calls to
For multiple images on a page you would have to interpret the PostScript to determine where images are rendered, because PostScript can apply an arbitrary transformation matrix (translate, rotate to an arbitrary angle, scale, skew) to an image before rendering it. In this case you'd run OCR jobs on the extracted images and then insert the OCR hidden layer into the PostScript stream. Needless to say, this would be much harder.
Even if the page contains only one image, I am not sure that knowing the page orientation would be enough. There would still be two possible rotation angles for the image (x and x+180°). Are there tools to easily extract the transformation matrix of the image (especially the rotation angle)?
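As a rough illustration of the rotation-angle question: once the six numbers of the matrix in effect when the image is painted are known, the angle follows directly from the first two entries, and this also separates x from x+180°. The sketch below assumes the usual PDF convention, where a matrix is written `[a b c d e f]` and maps x' = a·x + c·y + e, y' = b·x + d·y + f. The function names `concat` and `decompose` are made up for this example, and obtaining the matrix (by interpreting the content stream) is not covered.

```python
import math

def concat(m, n):
    """Combine two PDF matrices: apply m first, then n.

    Each matrix is a 6-tuple (a, b, c, d, e, f). Nested transformations
    accumulate this way, which is why the matrix at a given point depends
    on everything that came before it on the page.
    """
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def decompose(m):
    """Return rotation angle (degrees, counter-clockwise, 0 <= angle < 360),
    x/y scale factors and translation of a PDF matrix.

    Assumes the matrix has no skew; with skew the angle is that of the
    transformed x axis only.
    """
    a, b, c, d, e, f = m
    scale_x = math.hypot(a, b)
    det = a * d - b * c
    scale_y = det / scale_x if scale_x else 0.0
    angle = math.degrees(math.atan2(b, a)) % 360
    return {"angle": angle, "scale_x": scale_x, "scale_y": scale_y,
            "translate": (e, f)}

# An image is painted into the unit square, so a typical scanner page uses a
# matrix that scales it to the page size (here 595 x 842 points).
upright = (595, 0, 0, 842, 0, 0)
upside_down = (-595, 0, 0, -842, 595, 842)

print(decompose(upright)["angle"])
print(decompose(upside_down)["angle"])
```

The two example matrices produce the same footprint on the page but differ in sign, which is exactly the x versus x+180° distinction.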
It's not page orientation as in landscape/portrait use of paper. /Rotate records the 0/90/180/270° rotation that is usually set by a user. It should be enough to determine the image rotation for simple cases like scanned PDF output. The /MediaBox is also part of the picture - one can specify the virtual paper size with /MediaBox, and then rotate it. So for the simple case (scanner PDF output), with /MediaBox and /Rotate, you should be able to determine the orientation of the image. If I understand correctly the transformation matrix is sort of like a CPU register in the PostScript language, sensitive to state, so in general you have to interpret all of the preceding PostScript on a page to determine its value at a point of interest. So it is harder (although |
Ok. I understand what you mean now.
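For the simple scanner case discussed above, the /MediaBox and /Rotate reasoning can be sketched in a few lines. This is only an illustration: it assumes the two values have already been read from the page dictionary (for instance with a library such as PyPDF2), that the single image fills the page without any rotation of its own, and that /Rotate is a clockwise angle in multiples of 90°. The function names are hypothetical.

```python
def displayed_size(media_box, rotate=0):
    """Return (width, height) in points of the page as a viewer shows it.

    media_box is (llx, lly, urx, ury); rotate is the /Rotate value.
    """
    llx, lly, urx, ury = media_box
    width, height = abs(urx - llx), abs(ury - lly)
    rotate %= 360
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    if rotate in (90, 270):
        # A quarter turn swaps the visible width and height.
        width, height = height, width
    return width, height

def upright_rotation(rotate=0):
    """Clockwise angle to apply to the extracted image so that it matches
    what the viewer displays (simple case: unrotated full-page image)."""
    rotate %= 360
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    return rotate

# A4 portrait scan that the user turned by 90 degrees in a viewer.
print(displayed_size((0, 0, 595, 842), 90))
print(upright_rotation(90))
```

If the content stream applies its own rotation to the image, this shortcut no longer holds and the transformation matrix has to be taken into account as well.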
Actually, I believe adding a text layer without changing the original contents of the PDF file is easy. PDFTK can do that. Here is what I do with my personal files:
Here is a little shell script that does a very similar task (it takes the images from my scanner, not from a PDF file).
When using the 2.x version available as a zip file on the right side of
https://github.com/fritz-hh/OCRmyPDF
with Xubuntu 14.04, the original PDF is altered although I did not use -i.
The first page of
http://www.loaditup.de/files/817245_gcstsh3wuy.pdf
shows the original black-and-white PDF; the second page shows the altered PDF, which unfortunately looks frazzled. (I merged both pages for convenience.)
Is there a way to avoid this quality loss?
I tested the suggestion from #61, but without success, which is expected since "-i" was not used.
I also tested a PDF with an integer number of pixels, again without success.
Maybe it has to do with the problem described here? http://lists.freedesktop.org/archives/poppler-bugs/2013-August/010469.html
Thanks for the help.
Here is the output with -g: