Trouble with utf-8 encoded input files #7
Comments
Not sure what the problem is, but the file contains a special character that seems to get mangled either when Cram reads it or when the input is handed over to the shell. Ideas highly appreciated. Python version is 2.7.11. cc @karolyi
It seems the problem is only reproducible on Travis CI, and only with Python 3. After tracking it down, what happens is that the non-ASCII characters come out as their surrogate-escape counterparts. A practical example:

```python
>>> '\udcc3\udcb6'.encode('utf-8', 'surrogateescape')
b'\xc3\xb6'
```

Locally this evaluates to `b'\xc3\xb6'`, as expected. I tried setting the locale-related environment variables, without luck. @svenfuchs, any ideas?
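To make the mechanics concrete, here is a minimal sketch (illustrative, not from the issue itself) of the surrogateescape round trip: decoding UTF-8 bytes with the ASCII codec and the `surrogateescape` handler turns each byte >= 0x80 into a lone surrogate in the U+DC80–U+DCFF range, and encoding with the same handler restores the original bytes.

```python
# Sketch of the surrogateescape round trip (illustrative, not Cram code).
raw = "ö".encode("utf-8")                         # b'\xc3\xb6'
mangled = raw.decode("ascii", "surrogateescape")  # lone surrogates
assert mangled == "\udcc3\udcb6"
restored = mangled.encode("utf-8", "surrogateescape")
assert restored == b"\xc3\xb6"                    # the original bytes come back

# Note: printing `mangled` directly would raise an error, since lone
# surrogates cannot be encoded by a normal UTF-8 terminal; ascii() is safe.
print(ascii(mangled))
```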
@esc, when you say it's being mangled, do you mean the escaped string is wrong? Or that you wouldn't expect Cram to escape the string at all?

I think what's happening now isn't necessarily incorrect. Cram intentionally escapes mismatched output that isn't purely ASCII. The letter ö in UTF-8 is indeed represented by the bytes 0xC3 and 0xB6, so you should see those bytes in the escaped diff output. You can elect to use the unescaped character in the expected output, and the test still passes:

```console
$ cat test.t
  $ echo hallö
  hallö
$ cram test.t
.
# Ran 1 tests, 0 skipped, 0 failed.
```

Again, this is an intentional design decision, but I could be swayed to change the behavior. Perhaps Cram should only escape characters in mismatched diff output if they can't be displayed in the terminal's current locale. Do you think that would be a better behavior? Then, if your terminal encoding is UTF-8, it wouldn't escape this particular string.

The only edge case I can think of is if the test file is encoded one way (let's say UTF-16), you run it, and there's mismatched output that's UTF-8 while your terminal is set to UTF-8. If you're running Cram with

@karolyi, I'm not sure I follow what you're saying. Are you seeing the same thing @esc is, or is something different happening? Do you have a test repo I can run on Travis to reproduce the issue you're seeing?
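The escaping behavior described above can be illustrated with a small sketch. This is a hypothetical helper, not Cram's actual implementation: it keeps printable ASCII as-is and renders every other byte as a `\xNN` escape, similar in spirit to what Cram shows for mismatched non-ASCII output.

```python
def escape_non_ascii(line: bytes) -> str:
    """Render printable ASCII verbatim and everything else as \\xNN.

    Hypothetical illustration of diff escaping; not Cram's real code.
    """
    out = []
    for b in line:
        if 0x20 <= b < 0x7F or b == 0x09:  # printable ASCII or tab
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)

print(escape_non_ascii("hallö\n".encode("utf-8")))  # -> hall\xc3\xb6\x0a
```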
Cram is working okay on OS X and Ubuntu Linux locally with Python 3, but does this mystical thing I mentioned on Travis. What you explained here is right: we get the same escaping locally, and everything is fine. But when we test our stuff on Travis CI (Scout24/afp-cli#46), it fails miserably, with Cram formatting the non-ASCII characters to their surrogate-escape counterparts. This must be something specific to Travis, but after spending 30+ hours on the subject, I still couldn't figure out the reason. So I tried to pull in @svenfuchs, as he works on that project. I can set you up with a minimalistic repo that reproduces the bug.
I think I see what the issue you're running into is. First of all, regarding this:

```console
$ python -c 'import sys; print(sys.argv[1].encode("utf-8"))' ö
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
```

On Python 2, that's what's supposed to happen: `sys.argv[1]` is a byte string, so calling `.encode("utf-8")` on it makes Python first decode it with the default ASCII codec, which fails on the byte 0xC3.

As for what you're seeing with Python 3, I was able to reproduce that. If you have a test like this, you can see what's happening:

```console
$ python3 -c 'import sys; print(sys.getfilesystemencoding())'
```

On my local OS X box, that prints `utf-8`. So, what's happening is that before Cram runs the shell, it sets the locale environment variables to `C` so that test output is deterministic. Under the `C` locale, Python 3 on Linux uses ASCII as its filesystem encoding, so non-ASCII bytes in command line arguments and environment variables get decoded with surrogate escapes. The behavior on OS X is due to Python intentionally hardcoding the filesystem encoding to UTF-8 there, regardless of the locale.

As for making your tests work the way you'd expect, try this at the top of your test:

```console
$ export LC_ALL=C.UTF-8
```

You could also set it to any other UTF-8 locale available on the machine. I'm somewhat surprised

PS: It might be worth considering making your Cram tests or your application work without having to change `LC_ALL` from `C`. You could take a look at the current master branch of Cram itself for ideas. Cram works on Python 2.4-3.5, and it prefers to deal with raw bytes for I/O, filenames, command line arguments, and environment variables.
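The bytes-first approach mentioned in the PS can be sketched as follows. This is an illustration of the principle, not Cram's actual code: `os.fsencode()`/`os.fsdecode()` apply the filesystem encoding together with the `surrogateescape` handler, so arbitrary bytes survive a round trip through `str` even under a `C` locale.

```python
import os

# Illustration only; assumption: this is not Cram's real implementation.
data = b"\xc3\xb6"             # the UTF-8 bytes for 'ö'
text = os.fsdecode(data)       # may contain lone surrogates under LC_ALL=C
roundtrip = os.fsencode(text)  # surrogateescape restores the exact bytes
assert roundtrip == data
print("round trip ok")
```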
Hey @Brodie, thanks for the thorough explanation. It would be nice to know which environment variable caused this havoc; I played around with it a lot and still couldn't figure it out. I took all the environment variables from Travis and set up a script locally, without being able to reproduce the bug, so it remains a mystery yet to be resolved. Thanks for your insight, appreciated. Cheers

P.S.: @esc, you can close this now, as I can't.