Mix of encoded and decoded strings in final result #84

jorgecarleitao · 2014-12-10T16:57:37Z

Consider the following function, which converts a PDF to HTML and returns the result as a string:

from StringIO import StringIO

from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def _run_sample(self, sample_name, format):
    # Open the sample PDF in binary mode.
    fp = open('samples/' + sample_name + '.pdf', 'rb')
    rsrcmgr = PDFResourceManager()

    # Collect the converter's output in an in-memory buffer.
    result = StringIO()
    device = HTMLConverter(rsrcmgr, result, laparams=LAParams(detect_vertical=True))
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()

    # Rewind the buffer and decode the accumulated output.
    result.seek(0)
    return result.read().decode('utf-8')

This fails with the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
in ".../2.7/lib/python2.7/StringIO.py", line 106, in seek.

I searched, and this seems to be explained by the following note (from a docstring inside StringIO.py):

"The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called."
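
To illustrate the rule quoted above: Python 2 implicitly decodes 8-bit strings as ASCII when they are joined with Unicode strings, and that decode fails for any byte above 0x7f. The sketch below only simulates that rule in Python 3 syntax (where the real StringIO no longer allows mixing at all). `join_like_py2` is a hypothetical helper written for this illustration, not part of pdfminer or the standard library.

```python
def join_like_py2(chunks):
    """Join chunks the way Python 2 joins mixed str/unicode values.

    Assumption: this mimics Python 2's implicit ASCII coercion for
    illustration; it is not the actual StringIO implementation.
    """
    if all(isinstance(c, bytes) for c in chunks):
        # Only 8-bit strings: no coercion is needed.
        return b"".join(chunks)
    # At least one Unicode chunk: every byte chunk is coerced via ASCII.
    return u"".join(
        c if isinstance(c, str) else c.decode("ascii") for c in chunks
    )


# Pure-ASCII bytes mix with text without trouble.
print(join_like_py2([u"font: ", b"Helvetica"]))

# UTF-8 encoded bytes (first byte 0xe3) cannot be coerced via ASCII.
try:
    join_like_py2([u"font: ", b"\xe3\x81\x82"])
except UnicodeDecodeError as exc:
    print(exc)
```

The second call fails in the same way as the traceback in this report: the buffer holds both kinds of string, and the non-ASCII bytes cannot be promoted to Unicode implicitly.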

Should we try to make this uniform?

euske · 2015-04-05T10:02:49Z

The font name read from the PDF was a Unicode string and was mixed with normal (8-bit) strings in the output. Fixed in 14fd0fd.
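
One general way to avoid this class of bug is to normalize every chunk to bytes before it reaches the output buffer. The following is a minimal sketch of that idea in Python 3 syntax. `write_encoded` is a hypothetical helper and its default codec is an assumption; this is not necessarily what commit 14fd0fd does.

```python
import io


def write_encoded(out, chunk, codec="utf-8"):
    """Write chunk to a binary stream, encoding text first (hypothetical helper)."""
    if isinstance(chunk, str):
        chunk = chunk.encode(codec)
    out.write(chunk)


result = io.BytesIO()
# A Unicode font name and already-encoded page text end up in one
# consistent byte stream, so a single decode at the end is safe.
write_encoded(result, u"font-family: \u30b4\u30b7\u30c3\u30af; ")
write_encoded(result, b"\xe3\x81\x82")
html = result.getvalue().decode("utf-8")
print(html)
```

Because the buffer only ever holds bytes, the final decode is the single place where an encoding is chosen.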

Fixed: euske#84 (fontname was in unicode)

Instead of a list comprehension, which calls a function to get the integer value of each byte, convert the data directly to a bytearray, which is a more suitable structure for storing a list of bytes.

euske added a commit that referenced this issue Apr 5, 2015

Fixed: #84 (fontname was in unicode)

14fd0fd

yu-liang-kono pushed a commit to yu-liang-kono/pdfminer that referenced this issue Apr 8, 2015

Fixed: euske#84 (fontname was in unicode)

7c6841c

pombredanne added a commit to aboutcode-org/pdfminer that referenced this issue Aug 12, 2015

Merge pull request #6 from euske/master

28db5f0

Fixed: euske#84 (fontname was in unicode)
