SimpleTextExtractionStrategy ? #7

jaytonic · 2017-04-18T16:26:38Z

Hi,

I was trying to use your project in my .NET Core app, to replace the .NET 4.0 library, which I can't use.

I managed to create the PdfReader, but I can't find the equivalent of the PdfTextExtractor class. Do you have any idea?

VahidN · 2017-04-18T16:56:29Z

PdfTextExtractor exists in v5.0.2+, which is under the AGPL license (the current project is based on iTextSharp 4.x, not 5.x).

jaytonic · 2017-04-18T17:02:29Z

Just for my understanding: you only ported V2, and therefore I will not be able to parse the text in a PDF?
Or is there another way with V2?

VahidN · 2017-04-18T18:00:10Z

This version doesn't contain the whole iTextSharp.text.pdf.parser namespace, which is present in v5.0.2+.
But this version does have the infrastructure behind the iTextSharp.text.pdf.parser namespace, as shown in the Test_Extract_Text method.

jaytonic · 2017-04-19T11:57:30Z

@VahidN I tried to use it, but the result is just words in the wrong direction; there is no way I can use the result :(

AndrePScope · 2017-11-10T23:39:08Z

Hello @VahidN!
I use your sample Test_Extract_Text method to get words and then build a full-text index from them. This method works well for me, but not for all PDFs. Some extractions return garbage. I suspect it could be a problem with encoding or something like that... Could you suggest something?

VahidN · 2017-11-11T08:00:48Z

I added a new sample to demonstrate how different PDF writers create a PDF file:
Test_Draw_Text()
If you run it, you will see this output:

Which is equivalent to this output in PDF language:

q
BT
36 806 Td
0 -18 Td
/F1 12 Tf
(Test)Tj
0 0 Td
ET
Q
BT
/F1 12 Tf
88.66 367 Td
(ld)Tj
-22 0 Td
(Wor)Tj
-15.33 0 Td
(llo)Tj
-15.33 0 Td
(He)Tj
ET
q 1 0 0 1 36 343 cm /Xf1 Do Q

Or

SaveGraphicsState(); // q
BeginText(); // BT
MoveTextPos(36, 806); // Td
MoveTextPos(0, -18); // Td
SelectFontAndSize("/F1", 12); // Tf
ShowText("(Test)"); // Tj
MoveTextPos(0, 0); // Td
EndTextObject(); // ET
RestoreGraphicsState(); // Q
BeginText(); // BT
SelectFontAndSize("/F1", 12); // Tf
MoveTextPos(88.66, 367); // Td
ShowText("(ld)"); // Tj
MoveTextPos(-22, 0); // Td
ShowText("(Wor)"); // Tj
MoveTextPos(-15.33, 0); // Td
ShowText("(llo)"); // Tj
MoveTextPos(-15.33, 0); // Td
ShowText("(He)"); // Tj
EndTextObject(); // ET
SaveGraphicsState(); // q
TransMatrix(1, 0, 0, 1, 36, 343); // cm
XObject("/Xf1"); // Do
RestoreGraphicsState(); // Q

As you can see, a PDF object is essentially a canvas, and we can draw text on it anywhere and in any order. In this case, your best option for converting this output to its textual form is to use OCR.
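
To make the ordering problem concrete, here is a minimal Python sketch (not part of this project and not iTextSharp code) that reads the second text object from the stream above, tracks only the Td offsets, and compares the fragments in drawing order with the same fragments sorted by position. The function names are hypothetical, and the sketch assumes one operator per line, no Tm or cm transforms, and a font whose bytes are already readable characters. None of that holds for PDFs in general.

```python
import re

# Second text object from the sample content stream above.
STREAM = """\
BT
/F1 12 Tf
88.66 367 Td
(ld)Tj
-22 0 Td
(Wor)Tj
-15.33 0 Td
(llo)Tj
-15.33 0 Td
(He)Tj
ET
"""

TD = re.compile(r"^(-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?) Td$")
TJ = re.compile(r"^\((.*)\)Tj$")


def read_fragments(stream):
    """Return (x, y, text) for each Tj, accumulating Td offsets only."""
    x = y = 0.0
    fragments = []
    for raw_line in stream.splitlines():
        line = raw_line.strip()
        if line == "BT":
            x = y = 0.0  # a new text object starts at the origin
            continue
        move = TD.match(line)
        if move:
            # Td moves relative to the start of the current line.
            x += float(move.group(1))
            y += float(move.group(2))
            continue
        show = TJ.match(line)
        if show:
            fragments.append((x, y, show.group(1)))
    return fragments


def in_drawing_order(fragments):
    """Join the fragments exactly as the writer emitted them."""
    return "".join(text for _, _, text in fragments)


def in_reading_order(fragments):
    """Join the fragments top to bottom, then left to right."""
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    return "".join(text for _, _, text in ordered)


if __name__ == "__main__":
    fragments = read_fragments(STREAM)
    print(in_drawing_order(fragments))
    print(in_reading_order(fragments))
```

Sorting by position restores the letter order here, but the space between the two words is still missing: it was never drawn as a character, so recovering it would require glyph widths from the font to detect the gap. When the font encoding is also custom, position sorting alone does not help.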

VitaliyMF · 2017-11-11T08:28:32Z

@AndrePScope for indexing purposes you can try another approach that might give better results: instead of iTextSharp, try the pdftotext command-line utility (part of the poppler tools; it has both Windows and Linux builds).
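
As a rough illustration of that approach, the Python sketch below shells out to pdftotext and returns the extracted text for indexing. It assumes pdftotext is installed and on PATH; the helper names and the `keep_layout` parameter are hypothetical. The `-layout` and `-enc` options and the `-` output argument (write to stdout) are taken from the poppler pdftotext manual.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list, sending the text to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Ask pdftotext to keep the physical layout of the page.
        command.append("-layout")
    command += [pdf_path, "-"]  # "-" means: write the text to stdout
    return command


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on one file and return its output as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("document.pdf")` (a placeholder file name) would return the text of that file, which can then be tokenized for the index.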

lock · 2020-01-18T07:49:28Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related problems.

VahidN closed this as completed Apr 18, 2017

VahidN mentioned this issue May 8, 2017

Not resolved iTextSharp.text.pdf.parser reference #10

Closed

AndrePScope mentioned this issue Nov 7, 2017

How to get text content from PDF #15

Closed

lock bot locked as resolved and limited conversation to collaborators Jan 18, 2020
