Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode handling meta-issue [JIRA: CLIENTS-791] #334

Open
2 tasks
seancribbs opened this issue Jun 3, 2014 · 21 comments
Open
2 tasks

Unicode handling meta-issue [JIRA: CLIENTS-791] #334

seancribbs opened this issue Jun 3, 2014 · 21 comments

Comments

@seancribbs
Copy link

We need to review and have a policy for all handling of Unicode character data. This issue consolidates and supercedes:

Key issues:

  • Uniform handling of object identifiers (bucket-type, bucket, key)
  • Uniform handling of values encoded to/parsed from JSON
@reiddraper
Copy link
Contributor

I'm happy to help here if I can. Do we have some uniform policy we use in the other clients? Say Java?

@seancribbs
Copy link
Author

The plan I'm thinking of is this:

  1. At the RiakClient object level, allow the user to set the desired charset. It should probably default to utf-8. If the user ever passes a str instead of a unicode type, we should probably assume it's already encoded (but I'm less firm on this point, it needs more investigation).
  2. In all places where identifiers come back from Riak (put w/o key, key lists, bucket lists, index queries), decode them into the configured charset.

@reiddraper
Copy link
Contributor

I suppose the issue with 2. is that if there is already binary data written, we won't be able to decode it as UTF-8 when we read. Would we want/need a 'bypass' decoder that returns str instead of unicode objects? I believe the Java client has had to deal with this as well, but maybe they just only work with Byte Arrays. cc @broach

@seancribbs
Copy link
Author

I agree with your reservations about 2. Without roundtrip charset information however, we can't be sure of anything. As of now, we just puke back at the user if they give us a unicode.

@reiddraper
Copy link
Contributor

As of now, we just puke back at the user if they give us a unicode.

Not sure I completely followed this.

@seancribbs
Copy link
Author

As of now, we just puke back at the user if they give us a unicode.

Not sure I completely followed this.

https://github.com/basho/riak-python-client/blob/master/riak/riak_object.py#L113-L117

@reiddraper
Copy link
Contributor

Oh yes, right. Get it now. We're certainly in a place where incremental improvement 'is a thing'.

@macintux
Copy link
Contributor

macintux commented Jun 9, 2014

Worth mentioning: we've explicitly disallowed the creation of non-ascii YZ indexes until we sort out filesystem compatibility and Riak's internal handling. I imagine the clients will just hand back the (not terribly informative) error message, but fyi.

@seancribbs
Copy link
Author

Thanks @macintux. Edited description.

@macintux
Copy link
Contributor

macintux commented Jun 9, 2014

I'm not sure the client should block it, lest we have to coordinate when we fix it on the backend.

@seancribbs
Copy link
Author

@macintux What does the error coming back look like then?

@macintux
Copy link
Contributor

macintux commented Jun 9, 2014

"Invalid character in index name " via Webmachine. I don't know off-hand what the PB interface does; the error tuple internally is {error, invalid_name}.

https://github.com/basho/yokozuna/pull/381/files

@reiddraper
Copy link
Contributor

What about just requiring that the user always pass in a str (necessarily already encoded). We would return a str as well. This is sort of the closest we can get in Python 2.x to using Byte Arrays (Java).

@mcobzarenco
Copy link

The discussion here focuses on unicode keys, but issue #32 which was closed in favour of this one is about general binary keys. Currently, we have to wrap all keys passed to the library in a very hacky class, as unfortunately the insistence on doing encoding in the library gets in the way:

class BinaryString(str):
    def encode(self, ignored):
        return self

Riak's PBC API (message RpbGetReq) makes the key field bytes rather than string, i.e. not UTF-8 encoded.

It would be really great if there was a way to specify encoding=None across the whole of RiakClient object

@seancribbs
Copy link
Author

@mcobzarenco Yes, that is the goal.

@seancribbs seancribbs self-assigned this Aug 25, 2014
@seancribbs seancribbs modified the milestones: 2.0, 2.0.1 Aug 27, 2014
@seancribbs seancribbs removed their assignment May 8, 2015
@cread
Copy link

cread commented Mar 10, 2016

Any plans on actually fixing this?

@Basho-JIRA Basho-JIRA changed the title Unicode handling meta-issue Unicode handling meta-issue [JIRA: CLIENTS-791] Mar 10, 2016
@lukebakken
Copy link
Contributor

@cread - can you provide information (ideally code as well) of how this issue affects you?

The reason I ask is that I am not very familiar, at this point, with Python's string handling with regard to encoding vs. bytes. Python 2 vs Python 3 complicates things further.

Thank you!

@cread
Copy link

cread commented Mar 10, 2016

The problem is that we're using a uint64 as our riak object key. Up until now we've been using raw protobuf in C++ to communicate with riak and have not had any problems.

We're now trying to write a helper utility in Python but get an error like this:

>>> r = riak.RiakClient()
>>> r._protocol
'pbc'
>>> bucket = r.bucket('mybucket')
>>> key = '\xfd\x01U\xc3\x02\x00\x00\x00'
>>> obj = bucket.get(key)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/riak/bucket.py", line 231, in get
  File "build/bdist.linux-x86_64/egg/riak/riak_object.py", line 110, in __init__
TypeError: Unicode keys are not supported.

@lukebakken
Copy link
Contributor

Using C++, you've been using the bytes representing the uint64 value as the key, correct?

I recommend using Python 3 since the check for "Unicode" keys is only for Python 2 (see here).

I have forwarded this issue on to product management to consider its priority.

@cread
Copy link

cread commented Mar 10, 2016

Correct.

Thanks, moving to Python 3 for this is not an option for us at the moment.

@cread
Copy link

cread commented Mar 10, 2016

While you're talking to product management, can you also ask them about this issue too: basho/riak#789

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants