problem with unpaper #75

femifrak · 2014-05-07T05:34:43Z

When using OCRmyPDF-2.x with the options -d -c -i, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black-and-white PDF.

Here is the output:

root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

0.4.2

tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

python2 version:

Python 2.7.6

Ghostscript version:

9.10

Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)

OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds

femifrak · 2014-05-07T06:45:20Z

I just saw in ocrPage.sh that unpaper is called with the arguments:

--mask-scan-size 100 --no-deskew --no-grayfilter --no-blackfilter --no-mask-center --no-border-align

Are there any disadvantages to removing the last four arguments, or some of them?

Thanks!

fritz-hh · 2014-05-23T18:13:46Z

Hi. I had issues with some documents (e.g. images were removed by unpaper in some cases), so I added these parameters.
Feel free to change them if that better fits your use case.
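
For anyone who wants to experiment with this, here is a minimal sketch of a helper that rebuilds the unpaper option string quoted above from ocrPage.sh while leaving out selected options. The function name `unpaper_opts` is hypothetical and not part of OCRmyPDF. Which option (if any) actually affects the black borders is something to test on your own scans; re-enabling the black filter is only one guess.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCRmyPDF): print the unpaper options quoted
# in this thread, leaving out any option listed in $1 (space-separated).
unpaper_opts() {
    drop=" $1 "
    out=""
    for opt in --no-deskew --no-grayfilter --no-blackfilter --no-mask-center --no-border-align; do
        case "$drop" in
            *" $opt "*) ;;              # caller asked to drop this option
            *) out="$out $opt" ;;       # keep it
        esac
    done
    printf '%s\n' "--mask-scan-size 100$out"
}

# Example: drop only --no-blackfilter, so unpaper's black filter runs again
unpaper_opts "--no-blackfilter"
```

The printed string could then be passed to unpaper by hand on a single extracted page (for example `unpaper $(unpaper_opts "--no-blackfilter") page.pbm page-clean.pbm`, where both file names are placeholders) before changing ocrPage.sh itself. Keep in mind the caveat above: with fewer options disabled, unpaper may remove images on some documents.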

OCRmyPDF-issuebot mentioned this issue Sep 14, 2015

problem with unpaper ocrmypdf/OCRmyPDF#7

Closed

jbarlow83 closed this as completed Dec 5, 2015
