original images not kept unaltered #78

Closed
femifrak opened this issue May 28, 2014 · 6 comments
@femifrak

When using the 2.x version available as a zip file on the right side of
https://github.com/fritz-hh/OCRmyPDF
on Xubuntu 14.04, the original PDF is altered although I did not use -i.
The first page of
http://www.loaditup.de/files/817245_gcstsh3wuy.pdf
shows the original black and white pdf, the second page the altered pdf which unfortunately looks frazzled. (I merged both pages for convenience.)
Is there a way to avoid this quality loss?

I tested the suggestion from #61 but without success, which is expected since no "-i" was used.
I also tested a PDF with an integer number of pixels, again without success.
Maybe it has to do with the problem described here? http://lists.freedesktop.org/archives/poppler-bugs/2013-August/010469.html

Thanks for the help.

Here is the output with -g:

># /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g sw_original.pdf sw_original_OCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g sw_original.pdf sw_original_OCR.pdf
Checking if all dependencies are installed
--------------------------------
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP    

--------------------------------
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, 
;login: The USENIX Magazine, February 2011:42-47.
--------------------------------
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
--------------------------------
unpaper version:
0.4.2
--------------------------------
tesseract version:
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

--------------------------------
python2 version:
Python 2.7.6
--------------------------------
Ghostscript version:
9.10
--------------------------------
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
--------------------------------
Created temporary folder: "/tmp/tmp.ZIHGjUFKJS"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x594 (h*w in pt)
Page 0001: Size 3508x2477 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.ZIHGjUFKJS/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 25 seconds
@femifrak femifrak changed the title from "original pdf" to "original pdf not kept unaltered" May 28, 2014
@femifrak femifrak changed the title from "original pdf not kept unaltered" to "original images not kept unaltered" May 28, 2014
@fritz-hh
Owner

Unfortunately, it is not (easily) possible to extract the original images from a PDF file using open-source Linux software and keep their orientation as it is in the PDF file. Therefore I use pdftoppm to GENERATE an image from the PDF file. The image is generated with the same resolution as the original image, but it is not the original image.

If anybody has an idea how to proceed to solve this limitation, please let me know!
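
To illustrate what "the same resolution as the original image" means in practice: the resolution can be estimated from an image's pixel size and the page size in points, since one PDF point is 1/72 inch by default. This is only a minimal sketch, not the code OCRmyPDF uses. The function name `estimate_dpi` is hypothetical, and the numbers are the page and pixel sizes from the log above.

```python
def estimate_dpi(pixels: int, points: float) -> float:
    """Estimate resolution from a pixel count and the matching page length in pt."""
    if points <= 0:
        raise ValueError("page length must be positive")
    inches = points / 72.0  # default PDF user space: 1 pt = 1/72 inch
    return pixels / inches


# Values taken from the log: page 842x594 pt, image 3508x2477 px
dpi_long = estimate_dpi(3508, 842)
dpi_short = estimate_dpi(2477, 594)
print(round(dpi_long), round(dpi_short))
```

Both sides come out close to the 300 dpi that the log reports for the extracted pbm file.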

@jbarlow83
Collaborator

The PyPDF2 library can read internal PDF structure and get the page orientation.

$ ipython3
import PyPDF2 as pypdf
pdf = pypdf.PdfFileReader('example.pdf')
pdf.pages[0].get('/Rotate', 0)  # /Rotate is optional; a missing entry means no rotation

That field records the rotation anyone has applied to fix the orientation of a given page and should work in a lot of cases. It would be possible to get the page and image dimensions as well, I believe, and faster than the various calls to pdftoppm and pdfimages since everything could be done in a single process. That would be a first step and should work as long as each page contains one image that fills the page – probably good enough for most scanned PDFs with no OCR.

For multiple images on a page you would have to interpret the PostScript to determine where images are rendered, because PostScript can apply an arbitrary transformation matrix (translate, rotate to an arbitrary angle, scale, skew) to an image before rendering it. In this case you'd run OCR jobs on the extracted images and then insert the hidden OCR layer into the PostScript stream. Needless to say, this would be much harder.
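
As a rough illustration of what such a transformation matrix encodes: a PDF matrix is written as six numbers `[a b c d e f]`, and the rotation, scale and translation can be read back from them. The sketch below is an assumption-laden example, not part of OCRmyPDF: the name `describe_matrix` is hypothetical, and the rotation is only meaningful when the matrix has no skew.

```python
import math


def describe_matrix(a, b, c, d, e, f):
    """Split a PDF matrix [a b c d e f] into rotation, scale and translation.

    Assumes no skew; with skew, the angle is only that of the transformed x axis.
    """
    rotation = math.degrees(math.atan2(b, a)) % 360  # counter-clockwise, in degrees
    scale_x = math.hypot(a, b)
    det = a * d - b * c
    scale_y = det / scale_x if scale_x else 0.0  # a negative value means mirrored
    return {
        "rotation": rotation,
        "scale_x": scale_x,
        "scale_y": scale_y,
        "translate": (e, f),
    }


# An image filling a 594x842 pt page, upright and then turned upside down
upright = describe_matrix(594, 0, 0, 842, 0, 0)
flipped = describe_matrix(-594, 0, 0, -842, 594, 842)
print(upright["rotation"], flipped["rotation"])
```

The two example matrices place an image on the same page area but differ by a half turn, which is the kind of difference that page dimensions alone cannot reveal.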

@fritz-hh
Owner

Even if the page contains only 1 image, I am not sure knowing the page orientation would be enough. Indeed there would still be 2 possible rotation angles for the image (x and x+180°). Are there tools to easily extract the transformation matrix of the image (especially the rotation angle)?

@jbarlow83
Collaborator

It's not page orientation as in landscape/portrait use of paper. /Rotate records the 0/90/180/270° rotation that is usually set by a user. It should be enough to determine the image rotation for simple cases like scanned PDF output. The /MediaBox is also part of the picture - one can specify the virtual paper size with /MediaBox, and then rotate it. So for the simple case (scanner PDF output), with /MediaBox and /Rotate, you should be able to determine the orientation of the image.
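
For the simple case just described, the displayed page size follows directly from the two entries. A minimal sketch, assuming the /MediaBox is given as four numbers and /Rotate is a multiple of 90 (the function name `displayed_size` is hypothetical):

```python
def displayed_size(media_box, rotate=0):
    """Return (width, height) in pt of the page as a viewer shows it."""
    x0, y0, x1, y1 = media_box
    width, height = abs(x1 - x0), abs(y1 - y0)
    rotate %= 360
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    if rotate in (90, 270):
        # A quarter turn swaps the visible width and height
        width, height = height, width
    return width, height


# A portrait /MediaBox shown with /Rotate 90 appears as a landscape page
print(displayed_size((0, 0, 594, 842), rotate=90))
```

With a single full-page image, undoing the /Rotate value on the rendered or extracted image would then give the orientation stored in the file.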

If I understand correctly the transformation matrix is sort of like a CPU register in the PostScript language, sensitive to state, so in general you have to interpret all of the preceding PostScript on a page to determine its value at a point of interest. So it is harder (although pdftoppm would have code to do this). But this should only be necessary for more complex PDFs, not the output of scanning software.
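
To make the "sensitive to state" point concrete: in a PDF page content stream the operators `q` and `Q` save and restore the graphics state, `cm` concatenates a matrix onto the current one, and `Do` paints an image. The following is a deliberately simplified sketch that tracks the current matrix up to each `Do`. It assumes the stream is already decompressed and contains only whitespace-separated numbers, names and operators (no strings or inline images), and the function names are hypothetical.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def multiply(m1, m2):
    """Return the matrix that applies m1 first, then m2 (PDF row-vector convention)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def image_matrices(content):
    """Yield (image_name, current_matrix) for every Do operator in the stream."""
    ctm = IDENTITY
    stack = []      # graphics states saved by q
    operands = []   # tokens seen since the last handled operator
    for token in content.split():
        if token == "q":
            stack.append(ctm)
        elif token == "Q":
            if stack:
                ctm = stack.pop()
        elif token == "cm":
            new = tuple(float(v) for v in operands[-6:])
            ctm = multiply(new, ctm)  # the new matrix is applied before the old one
        elif token == "Do":
            yield operands[-1].lstrip("/"), ctm
        else:
            operands.append(token)
            continue
        operands = []


stream = "q 2 0 0 2 0 0 cm q 1 0 0 1 10 10 cm /A Do Q /B Do Q"
for name, matrix in image_matrices(stream):
    print(name, matrix)
```

A real interpreter would also have to handle form XObjects, strings and inline images, which is why this is harder than reading /Rotate and /MediaBox.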

@fritz-hh fritz-hh added this to the v3.x milestone Sep 20, 2014
@fritz-hh
Owner

OK, I understand what you mean now.
I would propose to introduce this feature once ocrpage has been rewritten in Python.

@kebekus

kebekus commented Nov 24, 2014

Actually, I believe adding a text layer without changing the original contents of the PDF file is easy. PDFTK can do that. Here is what I do with my personal files:

  • extract content of PDF file as image
  • run tesseract on each image, generating hocr files
  • use the program hocrTransform.py that comes with OCRmyPDF on each hocr file, in order to generate a PDF file containing the text. I do not include the graphics file
  • join the so-generated text-layer-PDF into one file that has exactly as many pages as the original file
  • use pdftk 'multibackground' to merge the text-layer-PDF with the original one
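
The steps above, starting from page images that have already been extracted, could be planned roughly as follows. This is a sketch under stated assumptions, not a tested pipeline: the function name `plan_commands`, the intermediate file name `ocr.pdf`, the default resolution and the hOCR file extension are all assumptions (the extension tesseract writes depends on its version). The function only builds the command lines and does not run anything.

```python
from pathlib import Path


def plan_commands(original_pdf, page_images, output_pdf, dpi=300,
                  hocr_ext=".html", hocr_transform="hocrTransform.py"):
    """Return the command lines (as argument lists) for OCR and merging."""
    commands = []
    layer_pdfs = []
    for image in page_images:
        base = Path(image).with_suffix("")
        hocr = f"{base}{hocr_ext}"   # assumed name of tesseract's hOCR output
        layer = f"{base}.pdf"        # text-only PDF for this page
        commands.append(["tesseract", str(image), str(base), "-l", "eng", "hocr"])
        commands.append(["python2", hocr_transform, "-r", str(dpi), hocr, layer])
        layer_pdfs.append(layer)
    # Join the per-page text layers, then lay them behind the original pages
    commands.append(["pdftk", *layer_pdfs, "cat", "output", "ocr.pdf"])
    commands.append(["pdftk", original_pdf, "multibackground", "ocr.pdf",
                     "output", output_pdf])
    return commands


for command in plan_commands("in.pdf", ["p1.pbm", "p2.pbm"], "out.pdf"):
    print(" ".join(command))
```

Each returned list could be passed to `subprocess.run(command, check=True)` once the tools are installed and the page images exist.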

Here is a little shell script that does a very similar task (it takes the images from my scanner, not from a PDF file).

#!/bin/bash

# Clear directory
rm -f *.pnm *.djvu


# Scan images
scanimage --batch=scan\%03d.pnm --mode=Gray --adf-auto-scan=yes -x 210mm -y 296mm --resolution 600 --adf-mode=Simplex


# Threshold and cut scanned pages
for page in scan*.pnm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}

    echo "Cut and threshold page $file_name_without_ext"
    pnmcut -left 105 -bottom 6500 <"$page" | pgmtopbm -threshold -value 0.78 >"$file_name_without_ext.pbm"
done


# Generate PDF document
echo "Compress as PDF"
jbig2 -v -p -s scan*.pbm
pdf.py output >fertig.pdf
# Delete jbig2 temporary files
rm output* 


# OCR

# OCR each page, and produce PDF file(s) containing the background (=text) layer
for page in scan*.pbm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}

    echo "Character recognition $page"
    tesseract -l eng "$page" "$file_name_without_ext" hocr >/dev/null
    python2 ~/bin/OCRmyPDF-2.2-stable/src/hocrTransform.py -r 600 "$file_name_without_ext.html" "$file_name_without_ext.pdf"
    # Delete temporary hocr file
    rm "$file_name_without_ext.html"
done

# Join PDF files into one file that contains all OCR backgrounds
pdftk scan*.pdf cat output ocr.pdf
# Delete temporary scan*.pdf files
rm scan*.pdf

# Merge OCR background PDF into the main PDF document
pdftk fertig.pdf multibackground ocr.pdf output fertig-ocr.pdf
