
Still have issues with CID Characters #39

Open
JSB97 opened this issue Jan 6, 2014 · 17 comments

@JSB97

commented Jan 6, 2014

I am trying to extract information from this file: http://www.kantei.go.jp/jp/singi/tiiki/siryou/pdf/h25yosan2.pdf

Following the example code on the pdfminer website, I put together the simple script below, which extracts text using the LTTextBoxHorizontal class, but I get output like

(cid:5561)(cid:6210)(cid:18446)(cid:18449)(cid:5562)(cid:2979)(cid:10220)(cid:6715)(cid:5587)(cid:7244)(cid:18171)(cid:9490)(cid:18202)(cid:13240)(cid:18190)(cid:18204)(cid:18159)(cid:4485)(cid:4582)(cid:8049)(cid:5878)(cid:3820)(cid:6795)(cid:10183)

instead of the Japanese Unicode characters. I get similar results when using the pdf2txt.py tool.

Could someone suggest what I should do to resolve this?
Thank you in advance.

Code

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LTTextBoxHorizontal
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# Open a PDF file.
fp = open('/Users/Documents/h25yosan2.pdf', 'rb')
password=''
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
document = PDFDocument(parser)
# Supply the password for initialization.
# (If no password is set, give an empty string.)
document.initialize(password)
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # receive the LTPage object for the page.
    layout = device.get_result()
    objstack = list(reversed(layout._objs))

    while objstack:
        b = objstack.pop()
        if type(b) == LTTextBoxHorizontal: # Text Box H
            print "get text line is %s" % b.get_text().encode('utf-8')
@euske

Owner

commented Jan 6, 2014

Your sample code worked fine for me.
You just need the CMap data for extracting non-ASCII text (e.g. Japanese).
Try running
$ make cmap
in the pdfminer directory.
If you already have an old version of the cmap data, delete it and rebuild it.
I think there was an issue with CMap generation at one point, and that might be the cause here.
Sorry for not being clear about this.
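
If you want to verify that the data was built correctly, a quick check along these lines should load one of the standard Japanese CMaps. This is only a sketch: it assumes the CMapDB.get_cmap classmethod and the CMapDB.CMapNotFound exception in the current source, and '90ms-RKSJ-H' is just one common Japanese CMap name.

from pdfminer.cmapdb import CMapDB

# Sanity check (sketch): try loading a Japanese CMap that `make cmap` should have built.
try:
    cmap = CMapDB.get_cmap('90ms-RKSJ-H')
    print('CMap loaded: %r' % cmap)
except CMapDB.CMapNotFound:
    print('CMap not found - the cmap/ data was probably not built correctly')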

@JSB97

Author

commented Jan 8, 2014

Thank you euske, but I am still not having any luck with this.
I've deleted the old pdfminer/cmap folder and ran
$ make cmap

again, which gives this sort of output:
writing: 'pdfminer/cmap/KSCpc-EUC-H.pickle.gz'...
writing: 'pdfminer/cmap/UniKS-UTF16-V.pickle.gz'...

etc. I then run the following;
$ ./pdfminer/tools/pdf2txt.py -p 1 -o /Users/Documents/h25yosan2.html /Users/Documents/h25yosan2.pdf
but I still get only CIDs and no Japanese text.

Is there anything else you can point me to that might help resolve this?
FYI, the version of Python I am using is below:
$ python
Python 2.6.6 (r266:84374, Aug 31 2010, 11:00:51)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin

@euske

Owner

commented Jan 8, 2014

What's in your $PYTHONPATH variable? There could be an older version of pdfminer on your system that is being picked up and responding incorrectly.
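
A quick way to see exactly which copy is being imported (the installed package exposes its path, and recent releases also have a release-date __version__ string):

import pdfminer
print(pdfminer.__file__)     # path of the copy actually being imported
print(pdfminer.__version__)  # release date string, e.g. '20131113'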

@JSB97

Author

commented Jan 18, 2014

I checked this; there were older versions of pdfminer that I had to remove, and I reinstalled from scratch. I think I need to debug this by going into the source code… can you suggest where I might start?

@tastyminerals

commented May 23, 2014

Hi, I am having the same issue with my pdf file.

The examination of news coverage related to the accession of (cid:51)(cid:82)(cid:79)(cid:68)(cid:81)(cid:71)(cid:15)...

I tried to convert the same document with the poppler pdftotext tool and got this:
The examination of news coverage related to the accession of 3RODQG�� &]HFK� 5HSXEOLF�� %XOJDULD

So it looks like something is wrong with the PDF itself. However, I've noticed numerous reports of this behavior from other users, so I wonder if there is a reliable workaround. I have pdfminer installed on a Linux machine via pip.
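
For what it's worth, in this particular output the codes look like plain character codes shifted by a constant: 'P' is code 80 and the first code is (cid:51), i.e. 80 - 29. A minimal post-processing sketch under that assumption (it only applies to fonts with this kind of fixed-offset mapping, the offset can differ between documents, and unichr is Python 2; use chr on Python 3):

import re

CID_OFFSET = 29  # assumption: character code = cid + 29 for this particular font

def decode_cids(text, offset=CID_OFFSET):
    # Replace each "(cid:NNN)" marker with the character at code point NNN + offset.
    return re.sub(r'\(cid:(\d+)\)',
                  lambda m: unichr(int(m.group(1)) + offset),
                  text)

print(decode_cids(u'(cid:51)(cid:82)(cid:79)(cid:68)(cid:81)(cid:71)'))  # -> Poland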

@euske

Owner

commented May 24, 2014

I don't think there's any reliable workaround. The thing is that not everything that looks like text in a PDF can be easily converted to actual text. The prime example would be math symbols that are rendered using math fonts. Converting PDF to text is always a "best effort" approach in the end.

@tastyminerals

commented May 25, 2014

I see, so I thought. I think the best one can do is use the Acrobat Pro trial version to manually extract the problematic text or, as a last resort, tesseract-ocr.
PDF is evil.

@deanmalmgren

commented Jun 9, 2014

I, too, am having an issue with CID font codes. This PDF, for example, has a line ("metacities — defined by UN Habitat as cities with more than 10") that is parsed as a bunch of (cid:%d) characters instead of the actual characters when running pdf2txt.py. Here is the pdf2txt.py output from the 20140328 release after running make cmap.

The font in this PDF is apparently an Adobe font whose CMap isn't included in your database. How can I extend PDFMiner to handle this situation?

@deanmalmgren

commented Jun 9, 2014

On further inspection, it looks like my issue with the CID font codes is likely caused by ligatures: when the letters f and i appear next to each other, they can be combined into a single glyph. What would be the best way to address this in pdfminer?

@euske

Owner

commented Jun 12, 2014

Sadly there's no standard way to address this issue, because the way ligatures are handled is PDF-specific. Unicode actually has characters for ligatures, but they exist only for backward compatibility and there's no guarantee that a PDF uses them. Sometimes a ligature is rendered using a special embedded font whose information is not available to PDFMiner.
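
When the extractor does return the Unicode ligature code points themselves (e.g. U+FB01 for "fi"), compatibility normalization folds them back to plain letters; it obviously doesn't help when the ligature only exists as an unmapped CID in an embedded font. A small sketch:

import unicodedata

text = u'speci\ufb01c \ufb02ow'              # contains the 'fi' and 'fl' ligature code points
print(unicodedata.normalize('NFKC', text))  # -> specific flow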

@deanmalmgren

commented Jun 12, 2014

@euske thanks for the heads up. I've got a hacky workaround to deal with the rest of the line that appears to work across a few different PDFs. If it ends up being something useful that can systematically decode the majority of these lines (with the exception of the ligatures), is there a good place I can stick it in the source code?

@VladimirStarostenkov

commented Apr 6, 2018

@deanmalmgren did you manage to publish the workaround?

@deanmalmgren

commented Apr 6, 2018

This was ages ago. Forgive me, but I don't think I published a workaround, and I don't recall what I did or which project it was for. Eeeee... not very helpful, I'm afraid, @VladimirStarostenkov.

@VladimirStarostenkov

commented Apr 6, 2018

@deanmalmgren no worries :) What about textract? It pushed me to the idea that one can do "pdf -> image -> tesseract -> text", which is a kind of neural-network brute-force workaround...
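
A rough sketch of that pipeline, assuming the pdf2image and pytesseract packages (plus the poppler and tesseract binaries they wrap) are installed, with 'document.pdf' standing in for the problem file:

from pdf2image import convert_from_path
import pytesseract

# Render each page to an image, then OCR it; slow, but independent of the PDF's fonts.
pages = convert_from_path('document.pdf', dpi=300)   # one PIL image per page
text = '\n'.join(pytesseract.image_to_string(page) for page in pages)
print(text)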

@mooncrater31

commented Jun 7, 2018

@VladimirStarostenkov Why would one use OCR instead of directly extracting text from a PDF? I'm using Tesseract on a project, and I'm trying to get rid of it for PDFs that have extractable text, because cropping out the meaningful pieces and OCR'ing them is slow, and the OCR is sometimes inaccurate.

So, is the extraction process really so unreliable that we have to depend on OCR?

@VladimirStarostenkov

commented Jun 7, 2018

@mooncrater31 The short answer would be "yes, for some documents it really is!"
In our project we decided to skip documents where CID chars become dominant, and plenty of other docs besides. In the PDF collection we have, there are dozens of weird issues. Just one example: each page contains not only the text of the current page, but also all the text from the next page (printed outside the visible page area), and pdfminer does not handle that.
PDF is evil. One has to do lots of sanity checks, deduplication, spell checking etc. to get healthy textual output! Even then, we are not immune to the creativity of document creators, as a "fancy" page layout can completely destroy word and even character order (unfortunately, out-of-the-box OCR won't help there).
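
As an illustration of the deduplication step, here is a sketch of one possible heuristic for the "each page also contains the next page's text" case. It assumes the spill-over is verbatim, and repeated headers or footers will get dropped as well:

def drop_next_page_spill(page_texts):
    # page_texts: list of per-page text strings, in document order.
    # Drop any non-blank line on page N that reappears verbatim on page N+1,
    # on the assumption that it is the next page's text leaking in.
    cleaned = []
    for i, page in enumerate(page_texts):
        next_lines = set(page_texts[i + 1].splitlines()) if i + 1 < len(page_texts) else set()
        kept = [line for line in page.splitlines()
                if not line.strip() or line not in next_lines]
        cleaned.append('\n'.join(kept))
    return cleaned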

@wanghaisheng

commented Jun 7, 2018

Perhaps pdf.js may be helpful.
