SimpleTextExtractionStrategy ? #7

Closed
jaytonic opened this issue Apr 18, 2017 · 8 comments

Comments

@jaytonic

Hi,

I was trying to use your project in my .NET Core app, to replace the .NET 4.0 library, which I can't use there.

I managed to create the PdfReader, but I can't find an equivalent of the PdfTextExtractor class. Do you have any idea?

@VahidN
Owner

VahidN commented Apr 18, 2017

PdfTextExtractor exists in iTextSharp v5.0.2+, which is under the AGPL license. The current project is based on iTextSharp 4.x, not 5.x.

@jaytonic
Author

Just for my understanding: you only ported the V2, and therefore I will not be able to parse the text in a PDF?
Or is there another way with the V2?

@VahidN
Owner

VahidN commented Apr 18, 2017

  • This version doesn't contain the whole iTextSharp.text.pdf.parser namespace, which is present in v5.0.2+.
  • However, it does have the infrastructure that namespace is built on; see the Test_Extract_Text sample method.

@VahidN closed this as completed Apr 18, 2017
@jaytonic
Author

@VahidN I tried to use it, but the result is just words in an incorrect direction. There is no way I can use the output :(

@AndrePScope

Hello @VahidN!
I use your sample Test_Extract_Text method to get the words and then build a full-text index from them. It works well for me, but not for all PDFs: some extractions return garbage. I suspect a problem with encoding or something similar. Could you suggest something?

@VahidN
Owner

VahidN commented Nov 11, 2017

I added a new sample to demonstrate how different PDF writers can create a PDF file:
Test_Draw_Text()
If you run it, you will see this output:
[image: pdftext]
This is equivalent to the following PDF content stream:

q
BT
36 806 Td
0 -18 Td
/F1 12 Tf
(Test)Tj
0 0 Td
ET
Q
BT
/F1 12 Tf
88.66 367 Td
(ld)Tj
-22 0 Td
(Wor)Tj
-15.33 0 Td
(llo)Tj
-15.33 0 Td
(He)Tj
ET
q 1 0 0 1 36 343 cm /Xf1 Do Q

Or

SaveGraphicsState(); // q
BeginText(); // BT
MoveTextPos(36, 806); // Td
MoveTextPos(0, -18); // Td
SelectFontAndSize("/F1", 12); // Tf
ShowText("(Test)"); // Tj
MoveTextPos(0, 0); // Td
EndTextObject(); // ET
RestoreGraphicsState(); // Q
BeginText(); // BT
SelectFontAndSize("/F1", 12); // Tf
MoveTextPos(88.66, 367); // Td
ShowText("(ld)"); // Tj
MoveTextPos(-22, 0); // Td
ShowText("(Wor)"); // Tj
MoveTextPos(-15.33, 0); // Td
ShowText("(llo)"); // Tj
MoveTextPos(-15.33, 0); // Td
ShowText("(He)"); // Tj
EndTextObject(); // ET
SaveGraphicsState(); // q
TransMatrix(1, 0, 0, 1, 36, 343); // cm
XObject("/Xf1"); // Do
RestoreGraphicsState(); // Q

As you can see, a PDF is essentially a canvas, and a writer can draw pieces of text on it at any position and in any order. In this case, your best option for converting the output to its textual form is to use OCR.
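To illustrate why naive extraction returns the words scrambled, here is a minimal Python sketch (not part of iTextSharp or of this project) that reads only the Td and Tj operators of the stream above, tracks the position each fragment is drawn at, and compares stream order with position order. The names `extract_fragments` and `reading_order` are hypothetical. It assumes literal strings without escapes, and it ignores Tm, the cm matrix, font encodings, and glyph widths, so it cannot tell where spaces belong and will not work on arbitrary PDFs.

```python
import re
from collections import defaultdict

# The content stream from the comment above, copied verbatim.
CONTENT_STREAM = """\
q
BT
36 806 Td
0 -18 Td
/F1 12 Tf
(Test)Tj
0 0 Td
ET
Q
BT
/F1 12 Tf
88.66 367 Td
(ld)Tj
-22 0 Td
(Wor)Tj
-15.33 0 Td
(llo)Tj
-15.33 0 Td
(He)Tj
ET
q 1 0 0 1 36 343 cm /Xf1 Do Q
"""

# Recognises "<tx> <ty> Td", "(<literal>)Tj" and "BT"; everything else is skipped.
TOKEN = re.compile(
    r"(?P<tx>-?\d+(?:\.\d+)?)\s+(?P<ty>-?\d+(?:\.\d+)?)\s+Td"
    r"|\((?P<text>[^)]*)\)\s*Tj"
    r"|\b(?P<bt>BT)\b"
)


def extract_fragments(stream):
    """Return (x, y, text) for every Tj, in the order the stream draws them."""
    x = y = 0.0
    fragments = []
    for m in TOKEN.finditer(stream):
        if m.group("bt"):
            x = y = 0.0  # BT resets the text line position
        elif m.group("text") is not None:
            fragments.append((x, y, m.group("text")))
        else:
            # Td is relative to the start of the current line, so offsets accumulate.
            x += float(m.group("tx"))
            y += float(m.group("ty"))
    return fragments


def reading_order(fragments):
    """Group fragments by baseline, then sort top to bottom and left to right."""
    lines = defaultdict(list)
    for x, y, text in fragments:
        lines[round(y, 2)].append((x, text))
    # PDF y coordinates grow upward, so the highest baseline comes first.
    return ["".join(t for _, t in sorted(lines[y])) for y in sorted(lines, reverse=True)]


if __name__ == "__main__":
    frags = extract_fragments(CONTENT_STREAM)
    print("stream order  :", "".join(text for _, _, text in frags))
    print("position order:", reading_order(frags))
```

Reading the fragments in stream order gives "TestldWorlloHe", while sorting them by position gives "Test" and "HelloWorld". A real extractor would also need the font's glyph widths to decide where to insert spaces, which is part of why results vary so much from one PDF writer to another.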

@VitaliyMF

@AndrePScope for indexing purposes, you could try another approach that might give better results: instead of iTextSharp, use the pdftotext command-line utility (part of the Poppler tools; it has both Windows and Linux builds).
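A minimal sketch of calling pdftotext from a script, written in Python for illustration. It assumes the Poppler `pdftotext` binary is installed and on the PATH; the helper names `build_pdftotext_command` and `pdf_to_text` are hypothetical. The `-layout` and `-enc` options and the `-` output argument (write to stdout) are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for pdftotext, writing the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]  # "-" as the output file means stdout
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH (install the Poppler tools)")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `pdf_to_text("document.pdf")` would return the text of a hypothetical `document.pdf`, ready to be passed to an indexer. A .NET application could start the same command as an external process.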

@lock

lock bot commented Jan 18, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related problems.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2020