problem with unpaper #75

Closed
femifrak opened this issue May 7, 2014 · 2 comments

femifrak commented May 7, 2014

When using OCRmyPDF-2.x with the options -d, -c and -i, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black-and-white PDF.

Here is the output:

root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP


GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

0.4.2

tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0


python2 version:

Python 2.7.6

Ghostscript version:

9.10

Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)

OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds

femifrak commented May 7, 2014

I just saw in ocrPage.sh that unpaper is called with these arguments:

--mask-scan-size 100 --no-deskew --no-grayfilter --no-blackfilter --no-mask-center --no-border-align

Are there any disadvantages to removing the last four arguments, or some of them?

Thanks!

fritz-hh (Owner) commented

Hi. I had issues with some documents (e.g. unpaper removed images in some cases), so I added these parameters.
Feel free to change them if that better fits your use case.
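
For anyone who wants to try re-enabling individual unpaper steps, here is a minimal sketch of a helper that prints the option list quoted above minus whichever options are passed to it. The function name unpaper_opts and the file names in the final comment are hypothetical and not part of OCRmyPDF; the sketch only builds the option string and does not call unpaper. Dropping --no-blackfilter is presumably the first thing to try for black borders, but that is an assumption to check against your own scans, since that filter may also be the one that removed images in the cases mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCRmyPDF): print the unpaper options
# quoted in this thread, leaving out any option given as an argument.
unpaper_opts() {
    defaults="--mask-scan-size 100 --no-deskew --no-grayfilter --no-blackfilter --no-mask-center --no-border-align"
    out=""
    for opt in $defaults; do
        skip=0
        for drop in "$@"; do
            if [ "$opt" = "$drop" ]; then
                skip=1
            fi
        done
        if [ "$skip" -eq 0 ]; then
            out="$out $opt"
        fi
    done
    # Strip the leading space left by the first append.
    echo "${out# }"
}

# Example: re-enable only the black filter by dropping --no-blackfilter.
unpaper_opts --no-blackfilter

# Possible use (in.pbm and out.pbm are placeholder file names):
#   unpaper $(unpaper_opts --no-blackfilter) in.pbm out.pbm
```

Dropping one option at a time and comparing the output pages makes it easier to see which filter removes the borders and which one damages embedded images.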
