Remove dependency on ghostscript and opencv #13

Open
vinayak-mehta opened this issue Jul 4, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

@vinayak-mehta
Member

vinayak-mehta commented Jul 4, 2019

Something to think about for the future:

  • OpenCV: maybe implement morph transform within the library itself/vendorize the code (not sure about dependency on C extensions)?
  • tk: Required for matplotlib.
  • ghostscript: maybe use some Python library to convert PDF to image (same quality as ghostscript).

Some questions:
[1] Can pdftoppm be an alternative to ghostscript?
[2] Are poppler-utils more widely available (pre-installed) than ghostscript?


@tkelman wrote:

Could the matplotlib dependency be made optional? The plotting features here look like not a lot of code, and it's a pretty complicated dependency to pull in.

Similarly might pillow be a viable smaller alternative to the use of opencv here?


Hello @tkelman! I think making matplotlib optional makes sense. Let me look into it as I add more tests for the plotting code in atlanhq/camelot#127.

Camelot uses adaptive threshold and morphological transformations from OpenCV. I haven't worked with Pillow in the past, but a quick Google search got me this morph transform equivalent in Pillow. I think removing OpenCV as a dependency would mean replacing the current image processing code with a combination of Pillow + adaptive threshold / morph transform implementations. Let me explore this a bit further. Meanwhile, if you have any other alternatives or suggestions on how we could do this, I would love it if you could share them on this thread!
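
To give a feel for what "implementing it within the library" could involve, here is a minimal pure-Python sketch of a mean adaptive threshold plus erode/dilate with a rectangular kernel. This is not Camelot's code and not OpenCV's exact behavior: the function names are made up for illustration, images are plain lists of rows, and window pixels that fall outside the image are simply ignored (OpenCV pads borders differently). It would also be far slower than a C extension on real page images.

```python
def adaptive_threshold_mean(img, block=3, c=0):
    """Return 1 where a pixel is brighter than (local mean - c), else 0.

    Assumption: the window is clipped at the image edges.
    """
    h, w = len(img), len(img[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            mean = sum(window) / len(window)
            out[y][x] = 1 if img[y][x] > mean - c else 0
    return out


def _morph(img, kw, kh, reducer):
    """Apply min (erode) or max (dilate) over a kw x kh rectangular window."""
    h, w = len(img), len(img[0])
    ax, ay = kw // 2, kh // 2  # kernel anchor at its center
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[j][i]
                for j in range(y - ay, y - ay + kh)
                for i in range(x - ax, x - ax + kw)
                if 0 <= j < h and 0 <= i < w  # ignore out-of-bounds pixels
            ]
            out[y][x] = reducer(vals)
    return out


def erode(img, kw, kh):
    return _morph(img, kw, kh, min)


def dilate(img, kw, kh):
    return _morph(img, kw, kh, max)


def opening(img, kw, kh):
    """Erode then dilate: removes specks smaller than the kernel."""
    return dilate(erode(img, kw, kh), kw, kh)
```

With a wide, one-pixel-tall kernel, `opening` keeps long horizontal runs and drops isolated pixels, which is the general idea behind using morphology to pick out ruling lines.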


matplotlib is now an optional requirement!
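
For anyone wanting to do the same in their own project, one common way to make a dependency optional is to import it lazily, only inside the code path that needs it. The sketch below is a generic illustration, not necessarily how Camelot does it; `import_optional`, `plot_points`, and the `yourpackage[plot]` extra are hypothetical names.

```python
import importlib


def import_optional(module_name, extra):
    """Import a module on demand, with a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is needed for this feature; "
            f"try: pip install 'yourpackage[{extra}]'"  # hypothetical extra name
        ) from exc


def plot_points(points):
    """Scatter-plot (x, y) pairs; matplotlib is only imported when called."""
    plt = import_optional("matplotlib.pyplot", extra="plot")
    fig, ax = plt.subplots()
    ax.scatter([p[0] for p in points], [p[1] for p in points])
    return fig
```

Users who never call the plotting function never need matplotlib (or tk) installed.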


@sweco-sekrsv wrote:

I'm not exactly sure what you are using Ghostscript for, but I switched to pdftoppm for rasterizing PDFs to images. I'm using the CLI tool and calling it from Python.
For my scenarios, it's stable and generates images quicker than Ghostscript. I have had better success with fonts using pdftoppm as well.
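
A rough sketch of what calling the CLI from Python can look like is below. It assumes `pdftoppm` from poppler-utils is on the PATH; the helper names and the 300 DPI default are illustrative choices, not taken from my actual code.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, page, dpi=300):
    """Build the argument list for rendering one page to PNG."""
    return [
        "pdftoppm",
        "-png",              # PNG output instead of the default PPM
        "-r", str(dpi),      # resolution in DPI
        "-f", str(page),     # first page to render
        "-l", str(page),     # last page to render
        "-singlefile",       # write <out_prefix>.png without a page suffix
        str(pdf_path),
        str(out_prefix),
    ]


def rasterize_page(pdf_path, out_prefix, page, dpi=300):
    """Run pdftoppm and return the path of the image it should have written."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, page, dpi), check=True)
    return Path(f"{out_prefix}.png")
```

Since each call is its own process, it should be possible to run several pages in parallel, though I have not measured that here.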

I'm on Windows and am using the latest binaries from here:
http://blog.alivate.com.au/poppler-windows

On a side note, it can also fix "broken" PDF files, such as the ones in this ticket:
atlanhq/camelot#306
Resaving them with pdftocairo from the poppler tools makes the files load OK with pdf-miner.
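
The resave step can be scripted in the same way. This is a minimal sketch assuming `pdftocairo` is on the PATH; the function names are hypothetical, and whether a given broken file is actually repaired will depend on the file.

```python
import subprocess


def pdftocairo_resave_command(src_pdf, dst_pdf):
    """Build the argument list for rewriting a PDF through pdftocairo."""
    return ["pdftocairo", "-pdf", str(src_pdf), str(dst_pdf)]


def resave_pdf(src_pdf, dst_pdf):
    """Rewrite src_pdf to dst_pdf; raises CalledProcessError on failure."""
    subprocess.run(pdftocairo_resave_command(src_pdf, dst_pdf), check=True)
    return dst_pdf
```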

On another side note, I tried running Ghostscript with multiprocessing (to speed things up), but that did not seem to work very well. I'm not sure Ghostscript is designed to run on several threads.

@lycanthropes

OCR detection is really tough work. If you can do that, I will bend my knee.


2 participants