This repository has been archived by the owner on Aug 4, 2022. It is now read-only.

Trouble with utf-8 encoded input files #7

Closed
esc opened this issue Jan 21, 2016 · 6 comments

Comments

@esc

esc commented Jan 21, 2016

zsh» cram test.t
!
--- test.t
+++ test.t.err
@@ -1,1 +1,2 @@
   $ echo hallö
+  hall\xc3\xb6 (esc)

# Ran 1 tests, 0 skipped, 1 failed.
zsh» cat test.t
  $ echo hallö
@esc
Author

esc commented Jan 21, 2016

Not sure what the problem is, but the file contains a special character which seems to become mangled either when cram reads it, or when the input is handed over to the shell. Ideas highly appreciated. Python version is 2.7.11

cc @karolyi

@karolyi

karolyi commented Jan 21, 2016

It seems the problem is only reproducible on Travis-CI, AND Python 3.

After tracking it down: when the command is run through subprocess.Popen, the UTF-8 characters are interpreted as their surrogate-escaped counterparts by the shell that receives the UTF-8 sequences.

A practical example:
test.t:

# vim set enc=utf-8 :

  $ python -c 'import sys; print(sys.argv[1].encode("utf-8"))' ö

This locally evaluates to b'\xc3\xb6', but when running it on Travis, the result is '\udcc3\udcb6'.
The latter is the surrogate-escape version of the former:

>>> '\udcc3\udcb6'.encode('utf-8', 'surrogateescape')
b'\xc3\xb6'

I tried setting PYTHONIOENCODING=UTF-8 in the environment, still doesn't help.

@svenfuchs, any ideas?

@aiiie
Owner

aiiie commented Jan 22, 2016

@esc, when you say it's being mangled, do you mean the escaped string is wrong? Or that you wouldn't expect Cram to escape the string at all?

I think what's happening now isn't necessarily incorrect. Cram intentionally escapes mismatched output that isn't purely ASCII. The letter ö in UTF-8 is indeed represented by the bytes 0xC3 and 0xB6, so you should see hall\xc3\xb6 (esc) as the mismatched output in the diff.

You can elect to use (esc) like the diff suggests, or you can forgo escaping and add the literal UTF-8-encoded output in the test:

$ cat test.t
  $ echo hallö
  hallö
$ cram test.t
.
# Ran 1 tests, 0 skipped, 0 failed.
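
To make the escaping described above concrete, here is a minimal sketch of how a byte string could be turned into the hall\xc3\xb6 (esc) form seen in the diff. This is not Cram's actual code: the function name escape_for_diff is hypothetical, the doubling of backslashes is an assumption, and control characters such as tabs are handled more simply here than a real implementation probably would.

```python
def escape_for_diff(line: bytes) -> bytes:
    """Escape every byte outside printable ASCII as \\xNN and tag the line.

    Hypothetical helper for illustration only; not taken from Cram.
    """
    pieces = []
    escaped = False
    for byte in line:
        if byte == 0x5C:  # a literal backslash is doubled (assumption)
            pieces.append("\\\\")
            escaped = True
        elif 0x20 <= byte <= 0x7E:  # printable ASCII passes through
            pieces.append(chr(byte))
        else:  # everything else becomes a \xNN escape
            pieces.append("\\x%02x" % byte)
            escaped = True
    text = "".join(pieces)
    if escaped:
        text += " (esc)"
    return text.encode("ascii")


# The UTF-8 encoding of "hallö" ends in the bytes 0xC3 0xB6.
print(escape_for_diff("hallö".encode("utf-8")).decode("ascii"))
# prints: hall\xc3\xb6 (esc)
```

Pure ASCII lines come back unchanged, which matches the statement that only output that isn't purely ASCII gets escaped.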

Again, this is an intentional design decision, but I could be swayed to change the behavior. Perhaps Cram should only escape characters in mismatched output in the diff if they can't be displayed by the terminal's current locale. Do you think that would be a better behavior? Then if your terminal encoding is UTF-8, it shouldn't escape this particular string.

The only edge case I can think of is if the test file is encoded in one way (let's say UTF-16), then you run it and there's mismatched output that's UTF-8 and your terminal is set to UTF-8. If you're running Cram with --interactive and you choose to patch the test, you'd end up with a test file that doesn't have a consistent encoding. Cram would have to do a fair bit of work to prevent that situation from happening, but maybe it's possible to deal with. I'll have to think about it some more.

@karolyi, I'm not sure I follow what you're saying. Are you seeing the same thing @esc is, or is something different happening? Do you have a test repo I can run in Travis to reproduce the issue you're seeing?

@karolyi

karolyi commented Jan 22, 2016

Cram works fine with Python 3 on our local OS X and Ubuntu Linux machines, but does the mysterious thing I mentioned on Travis.

What you explained here is right: we get the same escaping locally, and everything works. But when we test our stuff on Travis CI (Scout24/afp-cli#46), it fails miserably, with cram converting the non-ASCII characters to their surrogate-escaped counterparts. This must be something specific to Travis, but after spending 30+ hours on it, I still couldn't figure out the reason. So I tried to pull @svenfuchs in, as he works on that project.

I can set you up with a minimalistic repo that will reproduce the bug.

@aiiie
Owner

aiiie commented Jan 22, 2016

I think I see what the issue you're running into is.

First of all, regarding this:

$ python -c 'import sys; print(sys.argv[1].encode("utf-8"))' ö
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

On Python 2, that's what's supposed to happen. sys.argv contains byte strings. Calling .encode() on a byte string first decodes it as ASCII, and that will fail. What you'd really want to do is remove the .encode('utf-8') part. For Python 3, where sys.argv contains Unicode strings, that code makes sense, but not for 2.
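
One way to sidestep the Python 2 versus Python 3 difference is a small helper that leaves byte strings alone and only encodes text. This is a sketch under assumptions: the name arg_to_bytes is hypothetical, and the surrogateescape error handler exists only on Python 3, which is fine here because on Python 2 the argument is already a byte string and is returned before any encoding happens.

```python
import sys


def arg_to_bytes(arg):
    """Return a command line argument as UTF-8 bytes (hypothetical helper)."""
    if isinstance(arg, bytes):
        # Python 2: sys.argv entries are already byte strings.
        return arg
    # Python 3: sys.argv entries are text; surrogateescape restores any
    # bytes that could not be decoded when the interpreter started.
    return arg.encode("utf-8", "surrogateescape")


if __name__ == "__main__" and len(sys.argv) > 1:
    print(repr(arg_to_bytes(sys.argv[1])))
```

With this helper, a properly decoded "ö" and its surrogate-escaped form '\udcc3\udcb6' both map to the same two bytes.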

As for what you're seeing with Python 3, I was able to reproduce that. If you have a test like this, you can see what's happening:

  $ python3 -c 'import sys; print(sys.getfilesystemencoding())'

On my local OS X box, that prints utf-8. On my Linode box running Ubuntu, it prints ascii. On Travis CI, it prints ascii. Because it's ascii, it can't decode the non-ASCII characters, so to preserve them, the surrogate-escape error handler kicks in and you get '\udcc3\udcb6' for ö.
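
The effect can be reproduced directly in Python 3, without Travis, by decoding the UTF-8 bytes of ö as ASCII with the surrogateescape error handler. The variable names below are only for illustration.

```python
# The two bytes that encode "ö" in UTF-8.
raw = "ö".encode("utf-8")

# Decoding as ASCII fails for both bytes, so surrogateescape maps each
# undecodable byte 0xNN to the lone surrogate U+DCNN.
as_ascii = raw.decode("ascii", "surrogateescape")
print(ascii(as_ascii))  # prints: '\udcc3\udcb6'

# The mapping is reversible: encoding with the same handler restores
# the original bytes exactly.
restored = as_ascii.encode("ascii", "surrogateescape")
print(restored == raw)  # prints: True
```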

So, what's happening is that before Cram runs the shell, it sets LANG, LC_ALL, and LANGUAGE to C. It does this so tests have consistent locale behavior, regardless of the environment they're run in. So the behavior you're seeing on Travis makes sense. (You can disable this behavior with the -E/--preserve-env flag, but I don't recommend it.)
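
As a rough illustration of the environment setup described above, the sketch below builds an environment with LANG, LC_ALL, and LANGUAGE forced to C and asks a child Python process what filesystem encoding it sees. The function names are hypothetical and this is not Cram's code; it only covers the three variables named here. As far as I know, Python 3.7 and later may report utf-8 in the C locale because of UTF-8 mode, so the printed value depends on the interpreter version, and no expected output is given.

```python
import os
import subprocess
import sys


def c_locale_env(base=None):
    """Copy an environment and force the three locale variables to C."""
    env = dict(os.environ if base is None else base)
    for name in ("LANG", "LC_ALL", "LANGUAGE"):
        env[name] = "C"
    return env


def probe_filesystem_encoding(env):
    """Run a child interpreter under env and return its filesystem encoding."""
    code = "import sys; print(sys.getfilesystemencoding())"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(probe_filesystem_encoding(c_locale_env()))
```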

The behavior on OS X is due to Python intentionally hardcoding sys.getfilesystemencoding() to return utf-8 for that platform. That explains the discrepancy you're seeing.

As for making your tests work the way you'd expect, try this:

  $ export LC_ALL=C.UTF-8

You could also set it to en_US.UTF-8 if that doesn't work (but that might be less portable).

I'm somewhat surprised PYTHONIOENCODING=UTF-8 wouldn't also have the same effect, but the docs say it only affects the encoding of stdin, stdout, and stderr. I don't think there's any way to influence what sys.getfilesystemencoding() returns except through setting LC_ALL (and again, there's no way to influence it on OS X—it's always utf-8 there.)

PS: It might be worth considering making your Cram tests or your application work without having to change LC_ALL from C to C.UTF-8. I haven't looked enough at your repo to say how that would work, but it might be good to try to handle the surrogate-escaped Unicode characters gracefully if possible.

You could take a look at the current master branch of Cram itself for ideas. Cram works on Python 2.4-3.5, and it prefers to deal with raw bytes for I/O, filenames, command line arguments, and environment variables. I use os.fsencode() and os.fsdecode() to facilitate that.
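
A minimal sketch of the bytes-first approach with os.fsdecode() and os.fsencode(), assuming Python 3 (both functions were added in 3.2). The variable names are illustrative, and this is not taken from Cram's source.

```python
import os

# Raw bytes as they might arrive from a filename or an argument.
raw = b"hall\xc3\xb6"

# fsdecode() turns bytes into text using the filesystem encoding; under an
# ASCII locale on POSIX the two non-ASCII bytes become lone surrogates.
text = os.fsdecode(raw)

# fsencode() reverses that, so the original bytes survive the round trip
# whichever of the two encodings is in effect.
round_tripped = os.fsencode(text)
print(round_tripped == raw)  # prints: True

# Once the bytes are back, they can be decoded with a known encoding.
print(round_tripped.decode("utf-8"))  # prints: hallö
```

Keeping data as bytes at the boundaries and decoding explicitly as UTF-8 only where text is needed avoids depending on what the C locale reports.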

@karolyi

karolyi commented Jan 25, 2016

Hey @Brodie,

thanks for the -E tip, it seems that it solved our problem.

It would be nice to know which environment variable caused this havoc. I played around with it a lot and still couldn't figure it out. I took all the environment variables from Travis and set up a script locally, but couldn't reproduce the bug, so it remains a mystery.

Thanks for your insight, appreciated.

Cheers,
László

P.S.: @esc, you can close this now as I can't.
