UltraJson doesn't behave the same way as json.JSONEncoder for unicode chars #156

Open
prosanes opened this issue Nov 27, 2014 · 5 comments

@prosanes

commented Nov 27, 2014

As stated in issue #155:

I'm not sure that ujson.encode('\ud83d\ude80') should give any error.
Python's standard json library (which is based on simplejson) doesn't:

json.JSONEncoder().encode('\ud83d\ude80')
'"\ud83d\ude80"'

When using ujson:

Python 3.3.2 (default, Sep 16 2013, 16:19:35)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ujson
>>> ujson.encode('\ud83d\ude80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> ujson.__version__
'1.34'
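
For comparison, here is a minimal sketch of what happens with the same string using only the standard library on Python 3 (no ujson involved). It assumes the default `ensure_ascii=True`, and the variable names are just for illustration. It suggests the error above comes from a strict UTF-8 encoding step rather than from JSON escaping itself:

```python
import json

# Two surrogate code points written as separate escapes. On Python 3 they
# stay separate (length 2) instead of combining into U+1F680.
s = '\ud83d\ude80'

# With the default ensure_ascii=True, each code point is written as a
# \uXXXX escape, so encoding succeeds.
escaped = json.dumps(s)

# With ensure_ascii=False the surrogates are passed through unescaped.
raw = json.dumps(s, ensure_ascii=False)

# A strict UTF-8 encode is what rejects lone surrogates.
try:
    s.encode('utf-8')
    utf8_error = None
except UnicodeEncodeError as exc:
    utf8_error = exc.reason

print(len(s), escaped, utf8_error)
```

The `reason` reported by the strict encode is the same "surrogates not allowed" text that appears in the ujson traceback.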
@tonicava

commented Nov 28, 2014

Right, and that's a problem...

@Jahaja

Member

commented Dec 6, 2014

We're looking into it.

@vetal4444

commented Mar 26, 2015

I have this issue too:

In [1]: import json, ujson

In [2]: s = '"\ud8df\u4b61"'

In [3]: json.loads(s)
Out[3]: u'\ud8df\u4b61'

In [4]: ujson.loads(s)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-1b4d181e3a09> in <module>()
----> 1 ujson.loads(s)

ValueError: Unpaired high surrogate when decoding 'string'

Python 2.7.9
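
The same check can be sketched on Python 3 with just the standard library, assuming the JSON text holds the literal escapes (hence the raw string). This only shows how the stdlib decoder treats an unpaired high surrogate; it says nothing about what ujson does internally:

```python
import json

# JSON text with an unpaired high surrogate escape followed by an ordinary
# BMP escape. The raw string keeps the backslashes for the JSON parser.
doc = r'"\ud8df\u4b61"'

value = json.loads(doc)

# The stdlib decoder keeps the lone surrogate as its own code point
# instead of raising.
print(len(value), [hex(ord(ch)) for ch in value])
```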

@mittonk

commented Oct 6, 2016

Still reproducible on Python 3.5.1 and ujson 1.35.

gnprice added a commit to zulip/ultrajson that referenced this issue Aug 29, 2017

Fix handling of surrogate pseudocharacters under Python 3.
This is a situation where we have a Python unicode string which doesn't
consist entirely of genuine Unicode characters -- some of the codepoints
in the string are surrogate codepoints, which occur in a UTF-16 encoding
of a string and were also repurposed in PEP 383 for losslessly encoding
arbitrary mostly-UTF-8 bytestrings (like Unix filenames) in Python
strings.  Currently, on Python 3, we cause a UnicodeEncodeError if we
try to encode such a string as JSON.

It's not 100% obvious what the right thing to do here is -- this
situation seems like it must reflect a bug somewhere else in the
program or its environment.  But

 * one way we can get such a string is by loading a JSON document
   (perhaps an invalid JSON document? anyway, we load it without error):

   >>> ujson.dumps(ujson.loads('"\\udcff"'))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed

 * we already pass these strings through without complaint on Python 2;

 * as the included test shows, passing these through matches the
   behavior of the stdlib's `json` module.

So it seems best to pass them through.

Fixes esnme#156.
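
As a rough illustration of the PEP 383 case mentioned in the commit message, the sketch below uses only the standard library on Python 3. The byte value `b'\xff'` is an assumed stand-in for an undecodable byte in a filename, and the variable names are hypothetical. It shows how such a byte becomes the `'\udcff'` surrogate from the example above, that the stdlib `json` module round-trips it, and that the `surrogatepass` error handler lets the UTF-8 codec encode it:

```python
import json

# PEP 383: with the 'surrogateescape' handler, an undecodable byte is
# mapped to a lone low surrogate in the range U+DC80..U+DCFF.
name = b'\xff'.decode('utf-8', 'surrogateescape')

# The stdlib json module passes the surrogate through in both directions.
round_tripped = json.loads(json.dumps(name))

# The original byte can be recovered with the same error handler.
original = round_tripped.encode('utf-8', 'surrogateescape')

# 'surrogatepass' instead encodes the surrogate code point itself.
passed = name.encode('utf-8', 'surrogatepass')

print(ascii(name), original, passed)
```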

gnprice referenced a pull request that will close this issue Aug 29, 2017

Open

Fix handling of surrogate pseudocharacters under Python 3. #284

@rafaelxy

commented Mar 15, 2018

Still happens on Python 2.7.12 and ujson==1.35
