This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Mix of encoded and decoded strings in final result #84

Open
jorgecarleitao opened this issue Dec 10, 2014 · 1 comment

Comments

@jorgecarleitao

Consider the following function, which returns the contents of a PDF as an HTML string:

from StringIO import StringIO

from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def _run_sample(self, sample_name, format):
    fp = file('samples/' + sample_name + '.pdf', 'rb')
    rsrcmgr = PDFResourceManager()

    result = StringIO()
    device = HTMLConverter(rsrcmgr, result, laparams=LAParams(detect_vertical=True))
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()

    result.seek(0)
    return result.read().decode('utf-8')

This fails in ".../2.7/lib/python2.7/StringIO.py", line 106, in seek, with the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

After some searching, the cause seems to be the following behavior:

The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.
(from a docstring inside StringIO.py).
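
The coercion that docstring describes can be made explicit. Below is a minimal sketch, written for Python 3, where the implicit step no longer exists and has to be spelled out. `join_like_py2` and `to_text` are hypothetical helpers for illustration only, not part of pdfminer or StringIO, and the sample chunks are made up. The first helper imitates what Python 2 does when a buffer holding both kinds of strings is joined; the second normalizes every chunk to text before it is written.

```python
import io


def join_like_py2(chunks):
    """Imitate Python 2 joining a mixed buffer: byte strings are
    decoded as ASCII before being concatenated with text."""
    return "".join(
        c.decode("ascii") if isinstance(c, bytes) else c
        for c in chunks
    )


def to_text(chunk, encoding="utf-8"):
    """Return chunk as text, decoding it first if it is a byte string."""
    return chunk.decode(encoding) if isinstance(chunk, bytes) else chunk


# Made-up sample: text, UTF-8 encoded bytes (first byte is 0xe3), text.
mixed = ["<span>", "\u3042".encode("utf-8"), "</span>"]

# The mixed join fails as soon as it meets a non-ASCII byte.
try:
    join_like_py2(mixed)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Normalizing every chunk to one type before writing avoids the problem.
buf = io.StringIO()
for chunk in mixed:
    buf.write(to_text(chunk))
result = buf.getvalue()
```

Here `error` holds the same kind of `UnicodeDecodeError` as reported above, while `result` contains the full string, because everything written to the buffer is already text.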

Should we try to make the output uniform, so that only one string type is ever written?

@euske
Owner

euske commented Apr 5, 2015

The font name in the PDF was a Unicode string and was being mixed with normal strings. Fixed in 14fd0fd.

euske added a commit that referenced this issue Apr 5, 2015
yu-liang-kono pushed a commit to yu-liang-kono/pdfminer that referenced this issue Apr 8, 2015
pombredanne added a commit to aboutcode-org/pdfminer that referenced this issue Aug 12, 2015
Fixed: euske#84 (fontname was in unicode)
pombredanne pushed a commit to aboutcode-org/pdfminer that referenced this issue Jun 5, 2018
Instead of list comprehension which will call a function to get the integer value of the bytes directly convert it to bytearray which is more optimal structure for storing list of bytes.