Handling of non-ascii usernames #8

Closed
brutasse opened this Issue Jun 8, 2013 · 4 comments

@brutasse
Contributor

brutasse commented Jun 8, 2013

tl;dr: replace 'foo' with 'foò' in the test suite and watch things break.

Non-ASCII usernames seem a bit broken at the moment, at least on Python 2. Since to_string (from compat.py) calls str() on the value if it's unicode, this fails with an encoding error when the given value is not plain ASCII.
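The failure can be reproduced without a Python 2 interpreter. The sketch below assumes that Python 2's str() on a unicode value behaves like an implicit encode with the ASCII codec; `py2_style_str` and `try_convert` are hypothetical helpers written for illustration, not the to_string function from compat.py.

```python
def py2_style_str(value):
    """Mimic Python 2's str(unicode_value): an implicit ASCII encode."""
    return value.encode("ascii")


def try_convert(value):
    """Return the encoded bytes, or a short message if encoding fails."""
    try:
        return py2_style_str(value)
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason


print(try_convert(u"foo"))  # plain ASCII encodes fine
print(try_convert(u"foò"))  # the non-ASCII character triggers the error
```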

self.app.get(url, user=u'foò') breaks completely on Python 2. The workaround is to pass a native string instead of a unicode string, with self.app.get(url, user=force_bytes(username)) everywhere. This somewhat works, but the test output ends up with lots of UnicodeWarnings and the client code isn't very friendly. It also prevents using a user instance as the user= kwarg, since get_username() returns a unicode string instead of a native string.

The WSGI PEP says native strings must be used, but they're not the same thing in py2 and py3.

I've tried to come up with an implementation that doesn't break on unicode, based on what webtest does with data, but failed so far.

Ideas?

@kmike

Member

kmike commented Jun 9, 2013

Hey Bruno,

That's a weird ticket. The username is passed through a WSGI environment variable (which is a header, if I'm not mistaken). Headers must be latin1-encoded bytes in Python 2.x and latin1-encodable unicode strings in Python 3.x.

Quote from PEP-3333:

On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.

Latin-1 can't handle the full Unicode range, so I think it is a bad choice of encoding for unicode usernames.
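To make the limitation concrete, here is a small sketch that checks whether a username stays within the ISO-8859-1 range required by the PEP. The helper name `fits_in_latin1` and the sample usernames are assumptions for illustration only.

```python
def fits_in_latin1(username):
    """Return True if every code point is representable in ISO-8859-1."""
    try:
        username.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# "foò" stays below U+0100, but a Cyrillic username does not.
for name in (u"foo", u"foò", u"фу"):
    print(name, fits_in_latin1(name))
```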

I think in Python 2.x we can pass a utf8-encoded bytestring and decode it (because there is no automatic encoding/decoding). In Python 3.x we may create a 'fake' unicode string that will be encoded to latin1 (s.encode('utf8').decode('latin1')). But this is hacky. Also, I don't understand how the following quote from PEP-3333 affects these hacks:

Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end.
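For illustration, the Python 3 'fake' string hack could look roughly like the sketch below. The function names `to_wsgi_string` and `from_wsgi_string` are hypothetical and this is not code from the library; it only shows the round trip described above.

```python
def to_wsgi_string(text):
    """Store UTF-8 bytes in a str whose code points are all <= U+00FF."""
    return text.encode("utf-8").decode("latin-1")


def from_wsgi_string(value):
    """Reverse the hack: recover the original text from the 'fake' string."""
    return value.encode("latin-1").decode("utf-8")


username = u"фу"
smuggled = to_wsgi_string(username)

# Every character of the intermediate string is latin1-encodable.
print(all(ord(ch) <= 0xFF for ch in smuggled))
# Decoding on the other side gives back the original username.
print(from_wsgi_string(smuggled) == username)
```

The intermediate string can contain code points in the 0x80 to 0x9F range (the UTF-8 bytes of 'ф' are 0xD1 0x84, for example), which is where the control character wording quoted above may become relevant.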

In the end, I gave up on understanding the specs :)

The idea implemented in the commit above is just to quote/unquote the WEBTEST_USER variable to make it ASCII without juggling encodings. What do you think?
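A minimal sketch of the quote/unquote idea, written for Python 3 with the standard urllib.parse functions (on Python 2 they live in urllib and work on bytestrings). The helper names and the bare `environ` dict are assumptions for illustration; the actual commit may differ.

```python
from urllib.parse import quote, unquote


def encode_username(username):
    """Percent-encode the username so the stored value is pure ASCII."""
    return quote(username)


def decode_username(value):
    """Recover the original unicode username on the receiving side."""
    return unquote(value)


# Hypothetical stand-in for the WSGI environ used by the test client.
environ = {"WEBTEST_USER": encode_username(u"foò")}

print(environ["WEBTEST_USER"])                   # fo%C3%B2
print(decode_username(environ["WEBTEST_USER"]))  # foò
```

Because the stored value contains only ASCII characters, it satisfies the native string requirement on both Python versions without any latin1 tricks.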

@brutasse

Contributor

brutasse commented Jun 10, 2013

@kmike this looks good. It'll only break when the given value is a non-ASCII bytestring, which seems acceptable given that Django uses unicode strings internally.

@kmike

Member

kmike commented Jun 11, 2013

Thanks for raising this issue and for the feedback! This is now released as 1.7.1.

@kmike kmike closed this Jun 11, 2013

@brutasse

Contributor

brutasse commented Jun 11, 2013

@kmike thanks!
