Avoid UnicodeDecodeError: use binary stream for xml files, let SAX determine the encoding #303

Closed
StefanBruens opened this issue Mar 2, 2018 · 3 comments
Labels
medium Medium priority bug.

Comments

@StefanBruens
Contributor

Currently, zone_reader in zone.py opens the file in text mode:

with open(name, "r") as f:

The *_reader functions in the other io classes do the same. Opening the file as a character stream fails as soon as, e.g., the "C" locale is in use and the XML file contains a non-ASCII character.

SAX allows feeding a binary stream into the parser, which then determines the encoding itself from the encoding pseudo-attribute of the XML declaration.

The following code works with both Python 2 and Python 3:

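#!/usr/bin/python3
# -*- coding: utf-8 -*-

import xml.sax as sax

name = "./foo.xml"
parser = sax.make_parser()

# Open in binary mode so the locale's default encoding is never applied.
with open(name, "rb") as f:
    source = sax.InputSource(None)
    # Hand the raw bytes to SAX; the parser reads the encoding from the XML declaration.
    source.setByteStream(f)
    try:
        parser.parse(source)
    except Exception as e:
        print(e)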

This is able to parse the following XML file:

<?xml version="1.0" encoding="utf-8"?>
<foo>
  <a>
    asdf
  </a>
  <b>
    äöü
  </b>
</foo>
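
To check that the non-ASCII text really arrives intact, here is a minimal self-contained sketch of the same approach. The `TextCollector` handler and the `parse_binary` and `demo` helpers are hypothetical names used only for illustration; they are not part of firewalld. The sample document is written to a temporary file so the snippet runs without `foo.xml`.

```python
import os
import tempfile
import xml.sax as sax

# Sample document; the umlauts are written as escapes so this snippet
# does not depend on the encoding of the source file itself.
XML_BYTES = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<foo><a>asdf</a><b>\u00e4\u00f6\u00fc</b></foo>\n'
).encode("utf-8")


class TextCollector(sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_binary(path):
    """Parse `path` from a byte stream and return the collected text."""
    handler = TextCollector()
    parser = sax.make_parser()
    parser.setContentHandler(handler)
    with open(path, "rb") as f:
        source = sax.InputSource(None)
        source.setByteStream(f)
        parser.parse(source)
    return u"".join(handler.chunks)


def demo():
    """Write the sample to a temporary file, parse it, and clean up."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(XML_BYTES)
        return parse_binary(path)
    finally:
        os.remove(path)
```

Because the file is never decoded by `open()`, the result should not depend on `LC_ALL`.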
@erig0 erig0 added the medium Medium priority bug. label Mar 2, 2018
StefanBruens added a commit to StefanBruens/firewalld that referenced this issue Sep 21, 2018
SAX is able to determine the encoding of XML files itself if the file
contains a correct "encoding" pseudo attribute, e.g.:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

For this to work, the file stream has to be opened in binary mode, and
the parser has to read the stream via a SAX InputSource, so that it can
autodetect the encoding.

Fix for firewalld#303.
@erig0
Collaborator

erig0 commented Sep 25, 2018

Can you tell me which version of Python you're using? I can reproduce this on 3.4 and 3.6, but not on 3.7.

$ LC_ALL=C python3.4 foo.py
'ascii' codec can't decode byte 0xc3 in position 99: ordinal not in range(128)

$ LC_ALL=C python3.6 foo.py
'ascii' codec can't decode byte 0xc3 in position 99: ordinal not in range(128)

$ LC_ALL=C python3.7 foo.py
works
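
The same kind of error can be provoked without touching the locale by forcing the text-mode decoder to ASCII, which is roughly what the "C" locale does on the affected versions. This is only a sketch: the explicit `encoding="ascii"` stands in for the locale default, and `read_as_ascii_text` and `reproduce_failure` are hypothetical helper names.

```python
import io
import os
import tempfile

# UTF-8 encoded sample; the first umlaut starts with byte 0xc3.
XML_BYTES = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<foo><b>\u00e4\u00f6\u00fc</b></foo>\n'
).encode("utf-8")


def read_as_ascii_text(path):
    """Read the file as a character stream, decoding it as ASCII."""
    with io.open(path, "r", encoding="ascii") as f:
        return f.read()


def reproduce_failure():
    """Return the UnicodeDecodeError raised by the read, or None."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(XML_BYTES)
        try:
            read_as_ascii_text(path)
        except UnicodeDecodeError as e:
            return e
        return None
    finally:
        os.remove(path)
```

The error is raised while the file is being read, before the XML parser sees any data, which is why switching to a byte stream avoids it.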

@erig0 erig0 closed this as completed in 7cdd802 Sep 25, 2018
erig0 added a commit that referenced this issue Sep 25, 2018
Verify the XML parser can parse unicode.
@StefanBruens
Contributor Author

Currently 3.6.5, not sure which version I used in March.

On 3.7 it probably works because of PEP 538 (legacy C locale coercion): https://docs.python.org/3/whatsnew/3.7.html#whatsnew37-pep538

@erig0
Collaborator

erig0 commented Sep 25, 2018

On 3.7 it probably works because of PEP 538 (legacy C locale coercion): https://docs.python.org/3/whatsnew/3.7.html#whatsnew37-pep538

Indeed. Thanks for the pointer.
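
To see which encoding a text-mode `open()` would fall back to on a given interpreter and locale, something like the following can be run under `LC_ALL=C` with each Python version. It is a rough diagnostic sketch, not part of the fix, and `default_text_encoding` is a hypothetical helper name.

```python
import locale
import sys


def default_text_encoding():
    """Return the locale's preferred encoding without changing the locale."""
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    # Print the interpreter version next to the encoding it would use.
    print(sys.version_info[:2], default_text_encoding())
```

Reading the XML files as a byte stream makes the parser independent of this value.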

erig0 pushed a commit that referenced this issue Oct 5, 2018
Fixes: #303
(cherry picked from commit 7cdd802)
erig0 added a commit that referenced this issue Oct 5, 2018
Verify the XML parser can parse unicode.

(cherry picked from commit 34ac8cd)