SimpleTextExtractionStrategy? #7
Comments
PdfTextExtractor exists in v5.0.2+, under the AGPL license (this project is based on iTextSharp 4.x, not 5.x).
Just for my understanding: you only ported the V2, and therefore I will not be able to parse the text in a PDF?
@VahidN I tried to use it, but the result is just words in the wrong direction, so there is no way I can use it :(
Hello @VahidN!
I added a new sample to demonstrate how different PDF writers create a PDF file:
Or
As you can see, a PDF object is essentially a canvas on which we can draw text. In this case, your best option for converting this output to its textual form is to use OCR.
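To illustrate the canvas point, here is a minimal sketch of why text read back in drawing order can come out scrambled, and how ordering by position can sometimes help. It is not taken from iTextSharp; `TextFragment`, `in_drawing_order`, `in_reading_order` and the `line_tolerance` value are hypothetical names and numbers chosen for the example, and it assumes simple left-to-right, top-to-bottom text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFragment:
    """A piece of text drawn on the page (hypothetical type for this sketch)."""
    text: str
    x: float  # horizontal position of the fragment
    y: float  # vertical position; PDF origin is bottom-left, so larger y is higher


def in_drawing_order(fragments: List[TextFragment]) -> str:
    """Join fragments exactly in the order the writer drew them."""
    return " ".join(f.text for f in fragments)


def in_reading_order(fragments: List[TextFragment], line_tolerance: float = 2.0) -> str:
    """Group fragments into lines by y, then sort each line left to right by x."""
    lines: List[List[TextFragment]] = []
    line_y = 0.0
    # Walk the fragments from the top of the page to the bottom.
    for fragment in sorted(fragments, key=lambda f: -f.y):
        if not lines or abs(fragment.y - line_y) > line_tolerance:
            lines.append([])      # start a new line
            line_y = fragment.y   # remember the y of the line's first fragment
        lines[-1].append(fragment)
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


if __name__ == "__main__":
    # A writer is free to draw the words in any order it likes.
    page = [
        TextFragment("world", 60.0, 700.0),
        TextFragment("Hello", 10.0, 700.5),
        TextFragment("line", 50.0, 680.0),
        TextFragment("Second", 10.0, 680.0),
    ]
    print(in_drawing_order(page))
    print(in_reading_order(page))
```

A position sort like this only helps when the text is stored as real text with usable coordinates; rotated text, multi-column layouts, or text drawn as shapes would need more work or OCR.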
@AndrePScope, for indexing purposes you could try another approach that might give better results: instead of iTextSharp, use the pdftotext command-line utility (part of the Poppler tools; it has both Windows and Linux builds).
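As a rough sketch of the pdftotext suggestion, the utility can be called from a script and its output captured for indexing. The example below assumes pdftotext is installed and on the PATH; the helper names `build_pdftotext_command` and `extract_text` are made up for this illustration. It uses the `-layout` and `-enc` options and passes `-` as the output file so the text goes to standard output.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        command.append("-layout")
    # A text-file argument of "-" sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path: str, keep_layout: bool = True) -> str:
    """Run pdftotext on one file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH; install the Poppler tools first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print(extract_text(sys.argv[1]))
    else:
        print("usage: python extract.py <file.pdf>")
```

The same approach should carry over to a .NET Core app by starting pdftotext as an external process and reading its standard output.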
Hi,
I was trying to use your project in my .NET Core app, to replace the .NET 4.0 library, which I can't use.
I managed to create the PdfReader, but I can't find an equivalent of the PdfTextExtractor class. Do you have any idea?