PDFpig parser not parsing some pdfs correctly #319

Closed
ArkadeepDey71 opened this issue Apr 29, 2021 · 7 comments

Comments

@ArkadeepDey71

Hi,
I am using the method below to read text from a PDF. Parsing works fine for most PDFs. However, for some PDFs the parser fails to identify the spaces between words, so the output is a long run of alphanumeric characters with no spaces.

// strfileDir: the path to the PDF file.
private static string GetTextfromPDF(string strfileDir)
{
    try
    {
        using (var document = PdfDocument.Open(strfileDir))
        {
            string result = "";
            for (var i = 0; i < document.NumberOfPages; i++)
            {
                // Page numbers are 1-based, hence i + 1.
                var page = document.GetPage(i + 1);
                var words_2 = page.GetWords();
                var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words_2);
                foreach (var block in blocks)
                {
                    result = result + "  " + block.Text;
                }
            }

            return result;
        }
    }
    catch (Exception)
    {
        // Any failure is swallowed and reported as empty text.
        return "";
    }
}

I have attached one such PDF that shows the issue, along with the parsed result. Please look into it. Thanks in advance.

Sample PDF document.pdf
Parsed result.docx

@EliotJones
Member

Hi, document layout analysis isn't an exact process; it's based on best guesses. It sounds like word extraction is the problem, so maybe try the Nearest Neighbour word extractor (https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#nearest-neighbour-method) before feeding the words into the Docstrum method. Alternatively, it may be necessary to write custom logic to construct words from page.Letters. The default approach is very naive and doesn't seek to be correct except in the simplest cases.
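
For readers following along, the suggestion above might look roughly like the sketch below. It assumes the PdfPig 0.1.x API as described in the linked wiki (`NearestNeighbourWordExtractor.Instance`, `page.GetWords(IWordExtractor)`, `DocstrumBoundingBoxes.Instance.GetBlocks`); the class name `PdfTextSketch` and the method name `GetTextFromPdf` are made up for illustration and are not part of the library.

```csharp
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Hypothetical helper, not part of PdfPig.
public static class PdfTextSketch
{
    public static string GetTextFromPdf(string path)
    {
        var result = new StringBuilder();

        using (var document = PdfDocument.Open(path))
        {
            foreach (var page in document.GetPages())
            {
                // Swap the default word extractor for the nearest neighbour one.
                var words = page.GetWords(NearestNeighbourWordExtractor.Instance);

                // Group the extracted words into text blocks with Docstrum.
                var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

                foreach (var block in blocks)
                {
                    result.AppendLine(block.Text);
                }
            }
        }

        return result.ToString();
    }
}
```

Unlike the method in the original post, this sketch lets exceptions propagate rather than returning an empty string, which tends to make a bad PDF easier to tell apart from an empty one.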

@ArkadeepDey71
Author

Hi, we have already tried the Nearest Neighbour word extractor to extract words from the PDF, but got the same kind of output (a sequence of letters without whitespace).

While investigating, we found that the whitespace between letters does not appear in Page.Letters either. Could you please suggest how we can use Page.Letters to generate words?

The font name of most of the letters was HGKNAF+AdvOTb49d9406, which is one of the fonts embedded in the PDF. Do embedded fonts cause any issues when scanning the letters in a page?

EliotJones added a commit that referenced this issue May 9, 2021
where the gap is small but much larger than all previous gaps at this
font size (and still larger than some minimum threshold) then break
the word at this gap boundary.
@EliotJones
Member

It looks like these documents contain unusually narrow word gaps. I've added a slightly more 'intelligent' approach to word gap calculation in 0.1.5-alpha002 (https://www.nuget.org/packages/PdfPig/0.1.5-alpha002). Please give it a try and let me know.
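
To make the idea in the commit message concrete, here is a minimal, self-contained sketch of that kind of gap rule. It is not PdfPig's implementation. The `Glyph` record, the `WordGapSketch` class and all three thresholds (`wideGapRatio`, `minGapRatio`, `jumpFactor`) are assumptions chosen for illustration. It breaks a word when a gap is wide in absolute terms, or when it is small but much larger than every in-word gap seen so far at the same font size and still above a minimum.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a letter: its text, horizontal extent and font size.
public record Glyph(string Value, double Left, double Right, double FontSize);

public static class WordGapSketch
{
    // Splits glyphs (assumed to be on one line, in reading order) into words.
    // All default thresholds are illustrative guesses, not PdfPig's values.
    public static List<string> SplitWords(
        IReadOnlyList<Glyph> glyphs,
        double wideGapRatio = 0.25,
        double minGapRatio = 0.05,
        double jumpFactor = 3.0)
    {
        var words = new List<string>();
        var current = new StringBuilder();

        // Largest in-word gap seen so far, keyed by exact font size.
        var largestGapBySize = new Dictionary<double, double>();

        for (var i = 0; i < glyphs.Count; i++)
        {
            var glyph = glyphs[i];

            if (i > 0)
            {
                // Overlapping (kerned) glyphs count as a zero gap.
                var gap = Math.Max(0, glyph.Left - glyphs[i - 1].Right);
                var hasHistory = largestGapBySize.TryGetValue(glyph.FontSize, out var largest);

                // Rule 1: the gap is wide relative to the font size.
                var isWide = gap > wideGapRatio * glyph.FontSize;

                // Rule 2: the gap is small, but above a minimum and much larger
                // than all previous in-word gaps at this font size.
                var isJump = hasHistory
                    && gap > minGapRatio * glyph.FontSize
                    && gap > jumpFactor * largest;

                if (isWide || isJump)
                {
                    words.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    // Only in-word gaps feed the history, so that later word
                    // gaps of the same width are still detected.
                    largestGapBySize[glyph.FontSize] = hasHistory ? Math.Max(largest, gap) : gap;
                }
            }

            current.Append(glyph.Value);
        }

        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }

        return words;
    }
}
```

For example, six glyphs "a" to "f" at font size 10 with in-word gaps of about 0.1 and word gaps of about 0.6 never trigger the fixed rule (0.6 is below 0.25 × 10), but the relative rule splits them into "ab", "cd" and "ef". One limitation of this sketch: a narrow word gap that appears before any in-word gap has been seen at that font size is not detected.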

@ArkadeepDey71
Author

Hi @EliotJones, I have tested the new changes on lots of documents and it's working as expected. When can we expect these changes in the release branch?

@ArkadeepDey71
Author

Hi @EliotJones,
Any updates on this? When can we expect these changes in the main release branch?

@EliotJones
Member

Hi @ArkadeepDey71, sorry for the delay. I have no firm plans to cut a release yet. The changes are in the publicly available 0.1.5-alpha002 package, but a full release means writing release notes and documenting breaking changes, which takes time I don't currently have.

From a code perspective there's no difference between the pre-releases and the main releases, so you should be able to use the pre-release. To be honest, any lumbering corporation that mandates no pre-releases in production, when the code in full and pre-releases is equally auditable, should probably be buying a package with full support like Aspose 😆

@BobLd
Collaborator

BobLd commented Aug 17, 2021

Closed as answered - feel free to reopen if need be

@BobLd BobLd closed this as completed Aug 17, 2021