PDFpig parser not parsing some pdfs correctly #319
Comments
Hi, document layout analysis isn't an exact process; it's based on best guesses. It sounds like the word extraction is the problem, so maybe try using the Nearest Neighbour word extractor https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#nearest-neighbour-method before feeding the data into the Docstrum method. Alternatively it may be necessary to write custom logic to construct words using |
Hi, we have already tried using the Nearest Neighbour word extractor to extract words from the PDF but got a similar output (a sequence of letters without whitespace). While investigating, we found that the whitespace between letters is also not appearing in the The Font name of most of the letters were |
where the gap is small but much larger than all previous gaps at this font size (and still larger than some minimum threshold) then break the word at this gap boundary.
It looks like these documents contain unusually narrow word gaps. I've added a slightly more 'intelligent' approach to word gap calculation in 0.1.5-alpha002 https://www.nuget.org/packages/PdfPig/0.1.5-alpha002. Please give it a try and let me know. |
Hi @EliotJones, I have tested the new changes on lots of documents and it's working as expected. When can we expect these changes in the release branch? |
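The gap rule described above can be illustrated with a minimal sketch. This is not PdfPig's actual implementation (PdfPig is a C# library); the `Letter` class, the `split_words` function, and the `min_gap_ratio` and `jump_factor` parameters are hypothetical names and values chosen only for illustration. It assumes the letters of a single line are already sorted left to right.

```python
from dataclasses import dataclass


@dataclass
class Letter:
    # Hypothetical stand-in for a PDF glyph: its text, horizontal extent and font size.
    text: str
    left: float
    right: float
    font_size: float


def split_words(letters, min_gap_ratio=0.05, jump_factor=2.0):
    """Group letters of one line (sorted left to right) into words.

    A gap starts a new word when it exceeds a minimum threshold
    (min_gap_ratio * font size) AND is much larger (jump_factor times)
    than every earlier non-breaking gap seen at the same font size.
    Both parameters are illustrative assumptions, not PdfPig values.
    """
    words = []
    current = []
    largest_gap = {}  # font size -> largest non-breaking gap seen so far
    previous = None
    for letter in letters:
        if previous is not None:
            gap = max(letter.left - previous.right, 0.0)
            threshold = min_gap_ratio * letter.font_size
            largest = largest_gap.get(letter.font_size, 0.0)
            if gap > threshold and gap > jump_factor * largest:
                # Small in absolute terms, but an outlier for this font size.
                words.append("".join(l.text for l in current))
                current = []
            else:
                # Ordinary letter spacing: remember it for later comparisons.
                largest_gap[letter.font_size] = max(largest, gap)
        current.append(letter)
        previous = letter
    if current:
        words.append("".join(l.text for l in current))
    return words
```

Only non-breaking gaps are recorded, so one word gap does not raise the bar for the next. If no earlier gap has been seen at a font size, the sketch falls back to the minimum threshold alone, which a real implementation would likely want to handle more carefully.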
Hi @EliotJones |
Hi @ArkadeepDey71, sorry for the delay. I have no firm plans to cut a release yet. The changes are available in the publicly available 0.1.5-alpha002 package, but a full release takes time I don't currently have, since it means writing release notes and documenting breaking changes. From a code perspective there's no difference between the pre-releases and the main releases, so you should be able to use the pre-release. To be honest, any lumbering corporation that mandates no pre-releases in production, when the code in full and pre-releases is equally auditable, should probably be buying a package with full support like Aspose 😆 |
Closed as answered - feel free to reopen if need be |
Hi,
I am using the below method to read the text from a PDF. The parsing works fine for most PDFs. However, for some PDFs, the parser is unable to identify the spaces between words. As a result, the output is one long sequence of alphanumeric characters without any spaces.
I have also attached one such PDF that shows the issue, along with the parsed result. Please look into the issue. Thanks in advance.
Sample PDF document.pdf
Parsed result.docx