Currently, PDF cleanup is working in both Linux and OSX.
The way it's working definitely needs improvement. This issue documents the current approach, some of the rationale behind this less-than-ideal approach, and some general thoughts on moving forward.
The thoughts in this ticket are reflected in the update pushed to the repo here: d3f26ad
Current approach and rationale
The original version was written and tested in Linux, and used OCRMyPDF. The results from OCRMyPDF are good.
However, using OCRMyPDF in OSX didn't work cleanly, even when using an "ifmain" guard as specified by the documentation here: https://ocrmypdf.readthedocs.io/en/latest/api.html
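For reference, the guard pattern mentioned above looks roughly like the sketch below. It is illustrative only: `run_ocr` and `main` are hypothetical names, not functions from this repo, and the `ocrmypdf` import is deferred so the module can be loaded on machines where the package is not installed.

```python
import sys


def run_ocr(input_pdf: str, output_pdf: str) -> None:
    """Run OCRMyPDF on one file (hypothetical helper, not the repo's code)."""
    # Deferred import: ocrmypdf is a third-party package and starts worker
    # processes, which is why its docs ask for the guard at the bottom.
    import ocrmypdf

    ocrmypdf.ocr(input_pdf, output_pdf)


def main(argv: list) -> int:
    """Return 0 on success, 2 when the arguments are wrong."""
    if len(argv) != 3:
        print("usage: script.py INPUT.pdf OUTPUT.pdf")
        return 2
    run_ocr(argv[1], argv[2])
    return 0


# Only run when executed as a script, never when imported by a child process.
if __name__ == "__main__":
    main(sys.argv)
```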
To address this issue, I switched over to PDFMiner.six, which ran on OSX without throwing exceptions. However, its results were not as clean, and it lacks one of OCRMyPDF's nicer features: OCRMyPDF will also OCR text from images.
The current "solution" (which isn't awesome) is to check for the OS of the machine running the script. OSX users are routed to use PDFMiner; Linux users are routed to use OCRMyPDF. Windows users should probably be routed to use PDFMiner as well, but I don't have a Windows machine to test against, so Windows is not currently supported.
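The routing described above can be sketched as follows. This is a minimal illustration, not the code from the commit: `choose_backend` and `extract_text_from_pdf` are hypothetical names, and using OCRMyPDF's `sidecar` option to collect the text is an assumption about how the output might be read back.

```python
import platform


def choose_backend(system_name: str) -> str:
    """Map an OS name (as returned by platform.system()) to a backend."""
    if system_name == "Linux":
        return "ocrmypdf"
    if system_name == "Darwin":  # platform.system() reports OSX as "Darwin"
        return "pdfminer"
    # Windows is untested, so fail loudly instead of guessing.
    raise NotImplementedError(f"{system_name} is not currently supported")


def extract_text_from_pdf(pdf_path: str, system_name: str = "") -> str:
    """Extract text with the backend chosen for this OS (hypothetical helper)."""
    backend = choose_backend(system_name or platform.system())
    if backend == "ocrmypdf":
        import os
        import tempfile

        import ocrmypdf  # third-party; imported lazily

        with tempfile.TemporaryDirectory() as tmp:
            out_pdf = os.path.join(tmp, "ocr.pdf")
            sidecar = os.path.join(tmp, "ocr.txt")
            # Assumption: PDFs that already contain text may need options
            # such as skip_text or force_ocr; defaults are used here.
            ocrmypdf.ocr(pdf_path, out_pdf, sidecar=sidecar)
            with open(sidecar, encoding="utf-8") as fh:
                return fh.read()

    from pdfminer.high_level import extract_text  # third-party; lazy import

    return extract_text(pdf_path)
```

Keeping the OS check in one small function means a future Windows branch, or a move to a single backend, only touches `choose_backend`.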
Future path
In the future, I'd rather use a single method for cleaning PDFs.
Additionally, even using OCRMyPDF, the average PDF still has a lot of cruft that needs to be cleaned from the output, so future work will also include better text cleanup.
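As a starting point for that cleanup, here is a rough sketch of the kind of post-processing that might help. The rules are assumptions about typical extraction cruft (form feeds, words hyphenated across lines, bare page numbers, runs of blank lines), not a description of what the repo does, and `clean_extracted_text` is a hypothetical name.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Apply a few conservative cleanup rules to extracted PDF text."""
    # Form feeds usually mark page breaks; treat them as line breaks.
    text = raw.replace("\x0c", "\n")
    # Rejoin words split by a hyphen at a line end. Note that this also
    # merges genuinely hyphenated compounds that happen to wrap there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    lines = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Drop lines that are nothing but a short number (likely page numbers).
        if re.fullmatch(r"\d{1,4}", line):
            continue
        lines.append(line)

    # Collapse runs of blank lines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip()
```

Each rule is independent, so rules that turn out to be too aggressive for a given set of PDFs can be removed without affecting the others.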