Output file much bigger (7x) because the original embedded image files are not copied #70

Closed
alphablue52 opened this issue Feb 18, 2014 · 2 comments

@alphablue52

Hello,
First, thanks for the great work on this script. It got me working with OCR again after 10 years of frustrated absence :-)
Only one negative thing: most of my PDFs come from a Canon ImageRunner scanner and have a very good quality-to-size ratio. OCR gives great results, but the output PDFs are 7-8x bigger than the input. As far as I can see, the embedded images get recompressed to JPEG, while the originals use /CCITTFaxDecode.
If this is because of PDF/A compatibility, I suggest adding an option for non-PDF/A output.
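
For anyone who wants to check the same thing on their own files, here is a rough sketch that counts the stream filter names found in the raw bytes of a PDF. It is only a heuristic: it counts every stream that declares a filter (not just images), and it cannot see dictionaries stored inside compressed object streams. The function names `count_stream_filters` and `report` are made up for this example and are not part of OCRmyPDF.

```python
import re
from collections import Counter

# Matches "/Filter /Name" as well as "/Filter [/A /B]" in uncompressed dictionaries.
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def count_stream_filters(pdf_bytes):
    """Count how often each stream filter name appears in raw PDF bytes."""
    counts = Counter()
    for match in _FILTER_RE.finditer(pdf_bytes):
        # A filter entry is either a single name or an array of names.
        for name in _NAME_RE.findall(match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts


def report(path):
    """Print the filter counts for one PDF file (hypothetical helper)."""
    with open(path, "rb") as handle:
        counts = count_stream_filters(handle.read())
    for name, n in sorted(counts.items()):
        print(f"{path}: {name} x{n}")
```

Calling `report("input.pdf")` and `report("output.pdf")` should make a switch from `CCITTFaxDecode` to `DCTDecode` (the filter used for JPEG data) visible, if that is what happened.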

You can download input.pdf and output.pdf here:
https://www.dropbox.com/l/KYlpYRiSs6IjWVOmF1fX39

Here is the output of the script with the -g option.

~/bin/OCRmyPDF-2.0-stable$ sh OCRmyPDF.sh -g -l deu input.pdf output.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -l deu input.pdf output.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2013-09-10 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP


GNU Parallel version:
GNU parallel 20130622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

OCRmyPDF.sh: 190: OCRmyPDF.sh: unpaper: not found

tesseract version:
tesseract 3.02.01
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.8


python2 version:

Python 2.7.5+

Ghostscript version:

9.10

Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.10.1)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Created temporary folder: "/tmp/tmp.X82OQourlI"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0014
Page 0001: Size 842x595 (h_w in pt)
Page 0001: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0001: Continuing anyway, assuming a default resolution of 300 dpi
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Processing page 0002 / 0014
Page 0002: Size 842x595 (h_w in pt)
Page 0002: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0002: Continuing anyway, assuming a default resolution of 300 dpi
Page 0002: Extracting image as ppm file (300 dpi)
Page 0002: Performing OCR
Page 0002: Embedding text in PDF
Page 0002: Embedding text in PDF (debug page)
Processing page 0003 / 0014
Page 0003: Size 842x595 (h_w in pt)
Page 0003: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0003: Continuing anyway, assuming a default resolution of 300 dpi
Page 0003: Extracting image as ppm file (300 dpi)
Page 0003: Performing OCR
Page 0003: Embedding text in PDF
Page 0003: Embedding text in PDF (debug page)
Processing page 0004 / 0014
Page 0004: Size 842x595 (h_w in pt)
Page 0004: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0004: Continuing anyway, assuming a default resolution of 300 dpi
Page 0004: Extracting image as ppm file (300 dpi)
Page 0004: Performing OCR
Page 0004: Embedding text in PDF
Page 0004: Embedding text in PDF (debug page)
Processing page 0005 / 0014
Page 0005: Size 842x595 (h_w in pt)
Page 0005: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0005: Continuing anyway, assuming a default resolution of 300 dpi
Page 0005: Extracting image as ppm file (300 dpi)
Page 0005: Performing OCR
Page 0005: Embedding text in PDF
Page 0005: Embedding text in PDF (debug page)
Processing page 0006 / 0014
Page 0006: Size 842x595 (h_w in pt)
Page 0006: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0006: Continuing anyway, assuming a default resolution of 300 dpi
Page 0006: Extracting image as ppm file (300 dpi)
Page 0006: Performing OCR
Page 0006: Embedding text in PDF
Page 0006: Embedding text in PDF (debug page)
Processing page 0007 / 0014
Page 0007: Size 842x595 (h_w in pt)
Page 0007: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0007: Continuing anyway, assuming a default resolution of 300 dpi
Page 0007: Extracting image as ppm file (300 dpi)
Page 0007: Performing OCR
Page 0007: Embedding text in PDF
Page 0007: Embedding text in PDF (debug page)
Processing page 0008 / 0014
Page 0008: Size 842x595 (h_w in pt)
Page 0008: Expecting exactly 1 image covering the whole page (found 8). Cannot compute dpi value.
Page 0008: Continuing anyway, assuming a default resolution of 300 dpi
Page 0008: Extracting image as ppm file (300 dpi)
Page 0008: Performing OCR
Page 0008: Embedding text in PDF
Page 0008: Embedding text in PDF (debug page)
Processing page 0009 / 0014
Page 0009: Size 842x595 (h_w in pt)
Page 0009: Expecting exactly 1 image covering the whole page (found 5). Cannot compute dpi value.
Page 0009: Continuing anyway, assuming a default resolution of 300 dpi
Page 0009: Extracting image as ppm file (300 dpi)
Page 0009: Performing OCR
Page 0009: Embedding text in PDF
Page 0009: Embedding text in PDF (debug page)
Processing page 0010 / 0014
Page 0010: Size 842x595 (h_w in pt)
Page 0010: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0010: Continuing anyway, assuming a default resolution of 300 dpi
Page 0010: Extracting image as ppm file (300 dpi)
Page 0010: Performing OCR
Page 0010: Embedding text in PDF
Page 0010: Embedding text in PDF (debug page)
Processing page 0011 / 0014
Page 0011: Size 842x595 (h_w in pt)
Page 0011: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0011: Continuing anyway, assuming a default resolution of 300 dpi
Page 0011: Extracting image as ppm file (300 dpi)
Page 0011: Performing OCR
Page 0011: Embedding text in PDF
Page 0011: Embedding text in PDF (debug page)
Processing page 0012 / 0014
Page 0012: Size 842x595 (h_w in pt)
Page 0012: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0012: Continuing anyway, assuming a default resolution of 300 dpi
Page 0012: Extracting image as ppm file (300 dpi)
Page 0012: Performing OCR
Page 0012: Embedding text in PDF
Page 0012: Embedding text in PDF (debug page)
Processing page 0013 / 0014
Page 0013: Size 842x595 (h_w in pt)
Page 0013: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0013: Continuing anyway, assuming a default resolution of 300 dpi
Page 0013: Extracting image as ppm file (300 dpi)
Page 0013: Performing OCR
Page 0013: Embedding text in PDF
Page 0013: Embedding text in PDF (debug page)
Processing page 0014 / 0014
Page 0014: Size 842x595 (h_w in pt)
Page 0014: Size 1240x1753 (in pixel)
Page 0014: Low image resolution detected (150 dpi). If needed, please use the "-o" to try to get better OCR results.
Page 0014: Extracting image as pgm file (150 dpi)
Page 0014: Performing OCR
Page 0014: Embedding text in PDF
Page 0014: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.X82OQourlI/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 20 seconds
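
As a side note on the numbers in the log: the 150 dpi reported for page 0014 is consistent with dividing the pixel count by the page size in inches (a PDF point is 1/72 inch). The sketch below reproduces that arithmetic, assuming the long side of the image maps onto the long side of the page. The function names are illustrative and are not taken from OCRmyPDF.sh. It also shows how large a full-page raster becomes at the 300 dpi fallback used for pages 1 to 13, which may contribute to the size increase.

```python
POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def estimate_dpi(pixels, points):
    """Resolution implied by an image dimension spanning a page dimension."""
    return pixels / (points / POINTS_PER_INCH)


def pixels_at_dpi(points, dpi):
    """Pixel count needed to render a page dimension at a given resolution."""
    return round(points / POINTS_PER_INCH * dpi)


# Page 0014 from the log: 842x595 pt page, 1240x1753 px image.
dpi_long = estimate_dpi(1753, 842)   # long side of image vs. long side of page
dpi_short = estimate_dpi(1240, 595)  # short side of image vs. short side of page
print(round(dpi_long), round(dpi_short))  # both round to 150

# Raster size of an 842x595 pt page at the assumed default of 300 dpi.
print(pixels_at_dpi(842, 300), pixels_at_dpi(595, 300))  # 3508 2479
```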

fritz-hh added this to the v3.x milestone Sep 20, 2014
@fritz-hh
Owner

Thanks for the problem report. I will try to solve this in v3.x.
Please do not remove the files from Dropbox! Thanks.

@gunnicom

Is there a quick way to change the code locally so that it does not reconvert the images? For example, a 3 MB PDF now grows to 25 MB, and that's really big for all the PDFs we would like to convert.
