Use pdfminer for OSX; retain ocrmypdf for Linux #17

Closed
billfitzgerald opened this issue Jan 24, 2022 · 2 comments

Comments

@billfitzgerald
Owner

pdfminer is already used to extract metadata, and ocrmypdf is not behaving well in testing on OSX (although that's likely due to my own error).

In any case, pdfminer can extract text from PDFs, and it has been working without issue on OSX (so far, anyway).

This thread has info from one of the pdfminer maintainers, and will be a good starting point: https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python/61855361#61855361
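
As a starting point, text extraction with pdfminer could look something like the sketch below. It assumes the pdfminer.six package and its `pdfminer.high_level.extract_text` function; the `pdf_to_text` name and the optional `extractor` parameter are hypothetical, added only so the function can be exercised without a real PDF. This is not the code used in this project.

```python
from typing import Callable, Optional


def pdf_to_text(path: str, extractor: Optional[Callable[[str], str]] = None) -> str:
    """Return the text of the PDF at `path`, with surrounding whitespace trimmed.

    `extractor` is a hypothetical hook for testing; by default the function
    uses pdfminer.six's high-level `extract_text`.
    """
    if extractor is None:
        # Imported lazily so this module still loads where pdfminer.six
        # is not installed (assumption: pdfminer.six is the package in use).
        from pdfminer.high_level import extract_text

        extractor = extract_text
    return extractor(path).strip()
```

Calling `pdf_to_text("some_file.pdf")` would then return the extracted text as a single string. Scanned PDFs with no text layer will likely come back empty, since pdfminer does not perform OCR.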

@billfitzgerald changed the title from "Evaluate replacing ocrmypdf with pdfminer" to "Use pdfminer for OSX; retain ocrmypdf for Linux" on Jan 25, 2022
@billfitzgerald
Owner Author

Updating the title to reflect the status of the issue.

OCRmyPDF handles a broader range of PDF types better than pdfminer does.

The temporary fix will be to check the OS type: OSX users are routed to pdfminer for cleaning PDFs, and Linux users are routed to OCRmyPDF.

This is not ideal; in the future I want a single mechanism for all operating systems, but it will have to do for now.
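
The OS check described above might be sketched roughly as follows. It uses the standard library's `platform.system()`, which returns `"Darwin"` on OSX and `"Linux"` on Linux. The `choose_backend` name and the returned labels are hypothetical, and raising on any other OS is an assumption, since this issue only covers OSX and Linux. The actual fix in the commit may differ.

```python
import platform
from typing import Optional


def choose_backend(system: Optional[str] = None) -> str:
    """Pick a PDF-cleaning backend label based on the operating system.

    `system` defaults to platform.system(); passing it explicitly is a
    hypothetical convenience for testing.
    """
    name = system if system is not None else platform.system()
    if name == "Darwin":  # platform.system() reports OSX as "Darwin"
        return "pdfminer"
    if name == "Linux":
        return "ocrmypdf"
    # Assumption: other platforms are out of scope for the temporary fix.
    raise RuntimeError(f"No PDF backend configured for OS: {name}")
```

The caller would branch on the returned label and invoke the matching tool, which keeps the OS check in one place until a single mechanism replaces it.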

@billfitzgerald
Owner Author

Closing.

Fixed by d3f26ad
