Handling of non-ascii usernames #8

Closed
brutasse opened this issue Jun 8, 2013 · 4 comments

Comments

Contributor

@brutasse brutasse commented Jun 8, 2013

tl;dr: replace 'foo' with 'foò' in the test suite and watch things break.

Non-ASCII usernames seem a bit broken at the moment, at least on Python 2. Since to_string (from compat.py) calls str() on the value if it's unicode, this fails with an encoding error when the given value is not plain ASCII.
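To illustrate the failure: on Python 2, str() on a unicode object implicitly encodes it with the default encoding, which is ASCII unless it has been changed. The sketch below spells out that encoding step explicitly so it behaves the same on Python 2 and 3. The helper name to_ascii_bytes is hypothetical and is not the to_string function from compat.py.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: the explicit form of what Python 2's str(u'...')
# does implicitly when the default encoding is ASCII.
def to_ascii_bytes(value):
    return value.encode('ascii')


plain = to_ascii_bytes(u'foo')   # works: every character is ASCII

try:
    to_ascii_bytes(u'fo\xf2')    # u'foò': U+00F2 is outside ASCII
    failed = False
except UnicodeEncodeError:
    failed = True                # this is the error the test suite hits
```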

self.app.get(url, user=u'foò') breaks completely on Python 2. The workaround is to pass a native string instead of a unicode string, with self.app.get(url, user=force_bytes(username)) everywhere. This somewhat works, but the test output ends up with lots of UnicodeWarnings and the client code isn't very friendly. It also prevents passing a user instance as the user= kwarg, since get_username() returns a unicode string instead of a native string.

The WSGI PEP says native strings must be used, but they're not the same thing in py2 and py3.

I've tried to come up with an implementation that doesn't break on unicode, based on what webtest does with data, but have failed so far.

Ideas?

kmike added a commit that referenced this issue Jun 9, 2013
Member

@kmike kmike commented Jun 9, 2013

Hey Bruno,

That's a weird ticket. The username is passed through a WSGI environment variable (which is a header, if I'm not mistaken). Headers must be latin1-encoded bytes in Python 2.x and latin1-encodable unicode strings in Python 3.x.

Quote from PEP-3333:

On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.

Latin-1 can't handle the full Unicode range, so I think it is bad to use it as the encoding for unicode usernames.
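A quick way to see the limitation, as a minimal sketch: the helper is_latin1_encodable is a hypothetical name, and the check simply tests whether every code point falls in the \u0000 to \u00FF range that PEP-3333 allows.

```python
# -*- coding: utf-8 -*-
def is_latin1_encodable(value):
    """Return True if every character fits in ISO-8859-1 (U+0000..U+00FF)."""
    try:
        value.encode('latin-1')
    except UnicodeEncodeError:
        return False
    return True


print(is_latin1_encodable(u'fo\xf2'))        # u'foò'    -> True
print(is_latin1_encodable(u'\u0141ukasz'))   # u'Łukasz' -> False
```

So 'foò' from the original report would happen to fit, but plenty of real usernames would not.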

I think in Python 2.x we can pass a utf8-encoded bytestring and decode it (because there is no automatic encoding/decoding). In Python 3.x we may create a 'fake' unicode string that will be encoded to latin1 (s.encode('utf8').decode('latin1')). But this is hacky. Also, I don't understand how the following quote from PEP-3333 affects these hacks:

Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end.
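For reference, here is a sketch of the Python 3 'fake string' trick mentioned before the quote, and of one way it can interact with the control-character rule. The function names smuggle, unsmuggle and has_control_chars are hypothetical. Whether PEP-3333 means its wording to cover the C1 range (U+0080 to U+009F) is an open assumption here; the code only shows that such characters can appear.

```python
# -*- coding: utf-8 -*-
import unicodedata


def smuggle(value):
    """Re-read the UTF-8 bytes of `value` as if they were Latin-1 text."""
    return value.encode('utf-8').decode('latin-1')


def unsmuggle(value):
    """Reverse smuggle(): recover the original text."""
    return value.encode('latin-1').decode('utf-8')


def has_control_chars(value):
    """True if any character is in Unicode category Cc (C0 or C1 controls)."""
    return any(unicodedata.category(ch) == 'Cc' for ch in value)


fake = smuggle(u'fo\xf2')            # u'foò' becomes u'fo\xc3\xb2'
assert unsmuggle(fake) == u'fo\xf2'  # the round trip is lossless

# U+0100 is b'\xc4\x80' in UTF-8; read as Latin-1, the second byte
# becomes U+0080, which is a C1 control character.
print(has_control_chars(smuggle(u'fo\xf2')))   # False
print(has_control_chars(smuggle(u'\u0100')))   # True
```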

In the end, I gave up trying to understand the specs :)

The idea implemented in the commit above is just to quote/unquote the WEBTEST_USER variable to make it ASCII without any encoding juggling. What do you think?
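Roughly, the quote/unquote idea looks like the following. This is a Python 3 sketch using urllib.parse, not the code from the commit itself; encode_user and decode_user are hypothetical names, and only the WEBTEST_USER key comes from this thread.

```python
# -*- coding: utf-8 -*-
from urllib.parse import quote, unquote


def encode_user(username):
    """Percent-encode the username so the environ value is plain ASCII."""
    return quote(username)


def decode_user(value):
    """Recover the original unicode username on the receiving side."""
    return unquote(value)


environ = {'WEBTEST_USER': encode_user(u'fo\xf2')}   # u'foò'
print(environ['WEBTEST_USER'])                # fo%C3%B2
print(decode_user(environ['WEBTEST_USER']))   # foò
```

The quoted value contains only ASCII characters, so it satisfies the Latin-1 restriction on both Python versions without any bytes/unicode reinterpretation.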

Contributor Author

@brutasse brutasse commented Jun 10, 2013

@kmike this looks good. It'll only break when the given value is a non-ASCII bytestring, which seems acceptable given that Django uses unicode strings internally.

Member

@kmike kmike commented Jun 11, 2013

Thanks for raising this issue and for the feedback! This is now released as 1.7.1.

@kmike kmike closed this Jun 11, 2013
Contributor Author

@brutasse brutasse commented Jun 11, 2013

@kmike thanks!
