PDFpig parser not parsing some pdfs correctly #319

ArkadeepDey71 · 2021-04-29T12:46:42Z

Hi,
I am using the method below to read the text from a PDF. The parsing works fine for most PDFs. However, for some PDFs the parser is unable to identify the spaces between words. As a result, the output is one long sequence of alphanumeric characters without any spaces.

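using System;
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;

// strfileDir: path to the PDF file.
private static string GetTextfromPDF(string strfileDir)
{
    try
    {
        using (var document = PdfDocument.Open(strfileDir))
        {
            var result = new StringBuilder();
            // PdfPig page numbers are 1-based.
            for (var pageNumber = 1; pageNumber <= document.NumberOfPages; pageNumber++)
            {
                var page = document.GetPage(pageNumber);
                var words = page.GetWords();
                var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
                foreach (var block in blocks)
                {
                    // Each block is prefixed with two spaces, as in the original output.
                    result.Append("  ").Append(block.Text);
                }
            }

            return result.ToString();
        }
    }
    catch (Exception)
    {
        // Any failure to open or parse the file yields an empty string.
        return "";
    }
}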

I have also attached one such PDF that shows this issue, along with the parsed result. Please look into the issue. Thanks in advance.

Sample PDF document.pdf
Parsed result.docx


EliotJones · 2021-05-01T13:42:53Z

Hi, document layout analysis isn't an exact process; it's based on best guesses. It sounds like word extraction is the problem, so maybe try the Nearest Neighbour word extractor (https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#nearest-neighbour-method) before feeding the data into the Docstrum method. Alternatively, it may be necessary to write custom logic to construct words using page.Letters. The default approach is very naive and doesn't seek to be correct except in the simplest cases.
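
For reference, a minimal sketch of that suggestion, assuming the PdfPig namespaces and the `NearestNeighbourWordExtractor.Instance` / `DocstrumBoundingBoxes.Instance` singletons as they appear on the linked wiki page. The class and method names (`NearestNeighbourExample`, `ExtractText`) are made up for illustration, and the exact API may differ between package versions.

```csharp
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Hypothetical helper, not part of PdfPig.
public static class NearestNeighbourExample
{
    public static string ExtractText(string path)
    {
        var result = new StringBuilder();
        using (var document = PdfDocument.Open(path))
        {
            // PdfPig page numbers are 1-based.
            for (var pageNumber = 1; pageNumber <= document.NumberOfPages; pageNumber++)
            {
                var page = document.GetPage(pageNumber);

                // Build words with the Nearest Neighbour extractor
                // instead of the default word extractor.
                var words = page.GetWords(NearestNeighbourWordExtractor.Instance);

                // Feed those words into Docstrum to group them into blocks.
                var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
                foreach (var block in blocks)
                {
                    result.AppendLine(block.Text);
                }
            }
        }

        return result.ToString();
    }
}
```

The only change from the method in the first post is the argument passed to `GetWords`; everything downstream stays the same.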

ArkadeepDey71 · 2021-05-03T11:51:18Z

Hi, we have already tried the Nearest Neighbour word extractor to extract words from the PDF, but got the same kind of output (a sequence of letters without whitespace).

While investigating, we found that the whitespace between words does not appear in Page.Letters either. Could you please suggest how we can customize the handling of Page.Letters to generate words?

The font name of most of the letters was HGKNAF+AdvOTb49d9406, which is one of the fonts embedded in the PDF. Do embedded fonts cause any issues while scanning the letters in a page?

where the gap is small but much larger than all previous gaps at this font size (and still larger than some minimum threshold) then break the word at this gap boundary.

EliotJones · 2021-05-09T17:12:50Z

It looks like these documents contain unusually narrow word gaps. I've added a slightly more 'intelligent' approach to word gap calculation in 0.1.5-alpha002 (https://www.nuget.org/packages/PdfPig/0.1.5-alpha002). Please give it a try and let me know.
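
To illustrate the kind of heuristic quoted earlier in the thread (break a word where a gap is small in absolute terms but much larger than the previous gaps at that font size, and above some minimum threshold), here is a rough, self-contained sketch. It is not the code shipped in 0.1.5-alpha002. The `Glyph` type, the `GapWordBuilder` class, and the `ratio` and `minGap` values are all assumptions for illustration, and the sketch assumes one line of text in a single font size, sorted left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned letter (text plus horizontal extent).
public sealed class Glyph
{
    public Glyph(string value, double left, double right)
    {
        Value = value;
        Left = left;
        Right = right;
    }

    public string Value { get; }
    public double Left { get; }
    public double Right { get; }
}

public static class GapWordBuilder
{
    // ratio and minGap are illustrative defaults, not values taken from PdfPig.
    public static List<string> BuildWords(
        IReadOnlyList<Glyph> line, double ratio = 3.0, double minGap = 0.5)
    {
        var words = new List<string>();
        if (line.Count == 0)
        {
            return words;
        }

        var current = new StringBuilder(line[0].Value);
        double largestInnerGap = 0; // largest gap seen so far that was not a word break

        for (var i = 1; i < line.Count; i++)
        {
            // Overlapping glyphs (negative gap) are treated as touching.
            var gap = Math.Max(0, line[i].Left - line[i - 1].Right);

            // Break when the gap clears the absolute floor and is much larger
            // than every gap previously seen inside a word.
            if (gap > minGap && gap > ratio * largestInnerGap)
            {
                words.Add(current.ToString());
                current.Clear();
            }
            else
            {
                largestInnerGap = Math.Max(largestInnerGap, gap);
            }

            current.Append(line[i].Value);
        }

        words.Add(current.ToString());
        return words;
    }
}
```

With PdfPig, the inputs would presumably be built from page.Letters (using each letter's value and glyph bounding box), grouped by line and font size first; that grouping step is left out here.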

ArkadeepDey71 · 2021-05-19T08:55:21Z

Hi @EliotJones, I have tested the new changes on lots of documents and it's working as expected. When can we expect these changes in the release branch?

ArkadeepDey71 · 2021-06-01T08:47:44Z

Hi @EliotJones
Any updates on this? When can we expect these changes in the main release branch?

EliotJones · 2021-06-10T20:07:12Z

Hi @ArkadeepDey71, sorry for the delay. I have no firm plans to cut a release yet. The changes are in the publicly available 0.1.5-alpha002 package, but a full release means writing release notes and documenting breaking changes, which takes time I don't currently have.

There's no difference from a code perspective between the pre-releases and the main releases, so you should be able to use the pre-release. To be honest, any lumbering corporation that mandates no pre-releases in production, when the code in full and pre-releases is equally auditable, should probably be buying a package with full support like Aspose 😆

BobLd · 2021-08-17T15:35:00Z

Closed as answered - feel free to reopen if need be

EliotJones added the question label Apr 30, 2021

EliotJones added the testing label May 9, 2021

BobLd closed this as completed Aug 17, 2021
