Issues when trying to run tests on Windows #237

AnthonyBriggs · 2016-08-16T01:43:13Z

(This is on Windows 8.1; I imagine similar issues appear on other Windows systems)

I've had a try at running the tests under Windows. The main failure that I run into seems to be related to the default encoding under Windows (cp1252 rather than utf-8):

E AssertionError: 'utf-8' codec can't decode byte 0xff in position 10: invalid start byte

This seems to be an issue on both the Java and Python versions. See the attached voc_output.txt, which is a dump of the main_code for both runAsPython() and runAsJava().

Another, possibly related issue is that Windows has a "charmap" encoding in its terminals (cmd and powershell), which doesn't handle some unicode characters. I've also attached unicode_test.py.txt, which blows up with a similar error.

In the process of debugging, I've also found a couple of places where utf-8 encoding seems to have been missed - pull request here: #236

AnthonyBriggs · 2016-08-22T13:02:09Z

I've been (slowly) chasing this up this evening, and making some progress through judicious commenting-out of exception handlers :) This is more of an infodump than anything, but might be helpful.

Commenting out the first exception handler in assertCodeExecution (lines 345/346) reveals that it's runAsJava that's throwing the dodgy string, specifically

`line = self.jvm.stdout.readline().decode("utf-8")`

on line 505. Printing out the return string from self.jvm.stdout.readline() shows that it's returning the hovercraft-is-full-of-eels string as

`b'>>> x = "M\xff h\xf4v\xe8r\xe7r\xe0ft \xee\xdf f\xfb?l \xf6f \xe9\xeal?"\n'`

which I'm pretty sure is not right. \xff is never a valid byte in UTF-8 (it only turns up in things like the UTF-16 byte order mark), for a start, which is why it's exploding. (For the record, it's supposed to be "Mÿ hôvèrçràft îß fûłl öf éêlś")
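
For what it's worth, the bytes quoted above look consistent with the expected string having been encoded as cp1252, with the two characters cp1252 lacks (ł and ś) replaced by `?`. A minimal sketch to check that using only Python's standard codecs; the variable names are made up for illustration, and this doesn't confirm what the JVM is actually doing:

```python
# Expected text from the test, and the raw bytes reported from readline().
expected = 'Mÿ hôvèrçràft îß fûłl öf éêlś'
observed = b'>>> x = "M\xff h\xf4v\xe8r\xe7r\xe0ft \xee\xdf f\xfb?l \xf6f \xe9\xeal?"\n'

# Encode the full line as cp1252, substituting "?" for anything cp1252
# cannot represent.
line = '>>> x = "' + expected + '"\n'
as_cp1252 = line.encode("cp1252", errors="replace")
print(as_cp1252 == observed)

# Decoding the observed bytes as UTF-8 fails at the first non-ASCII byte.
decode_error = None
try:
    observed.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
print(decode_error.start, decode_error.reason)

# Decoding as cp1252 succeeds, minus the two characters already lost.
recovered = observed.decode("cp1252")
print(recovered == '>>> x = "Mÿ hôvèrçràft îß fû?l öf éêl?"\n')
```

If that reading is right, the failing byte offset should line up with the position reported in the original assertion error.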

I've tried a few simple things to try and work out what's going on. Encoding the java output to cp1252 instead of utf-8 just turns it into

 `x = "M├┐ h├┤v├¿r├ºr├áft ├«├ƒ f├╗?l ├Âf ├®├¬l?"`

and various other combinations of encoding/decoding that string in a test script (without explicitly setting \xff) didn't replicate the exact error.
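
The box-drawing output quoted above is what UTF-8 bytes look like when they are read back through an OEM code page. A small sketch of that round trip, assuming the console code page is cp850 (a guess based on the characters shown, not something confirmed in this thread):

```python
# The line with the two unrepresentable characters already replaced by "?".
text = 'x = "Mÿ hôvèrçràft îß fû?l öf éêl?"'

# UTF-8 bytes interpreted through the cp850 OEM code page (assumed).
mojibake = text.encode("utf-8").decode("cp850")

# The string as reported in the comment above.
reported = 'x = "M├┐ h├┤v├¿r├ºr├áft ├«├ƒ f├╗?l ├Âf ├®├¬l?"'
print(mojibake == reported)
```

Each accented character becomes two console characters because UTF-8 encodes it as two bytes and the code page renders every byte separately.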

I've done some light googling for "windows java stdout encoding" and "windows check powershell encoding", which turns up some potentially helpful info:

Default character encoding for java console output
Powershell: Get default system encoding
UTF-8 output from PowerShell (long)

Based on these, I've tried a few things:

Setting [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 made no difference.
I did find that [System.Text.Encoding]::Default.EncodingName is set to Western European (Windows), which is cp1252 (also known as windows-1252 or ANSI Latin 1).
Adding "-Dfile.encoding", "UTF-8" to the subprocess.Popen call didn't do anything.
Neither did switching to decoding as UTF-16.
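
One thing that may be worth ruling out: the `java` launcher takes system properties in `-Dname=value` form as a single argument. If "-Dfile.encoding" and "UTF-8" went into the Popen list as two separate items, the property would not have been set to UTF-8. A hedged sketch of building the argument list; `build_java_command`, the class name and the classpath are hypothetical placeholders rather than anything from the test harness, and whether this property changes what the JVM writes to stdout on Windows is not something this sketch can confirm:

```python
def build_java_command(main_class, classpath, encoding="UTF-8"):
    """Return an argv list for launching a JVM with file.encoding set.

    main_class and classpath are hypothetical placeholders.
    """
    return [
        "java",
        # One token, "-Dname=value". Split across two list items, the
        # value would be passed as a separate argument instead.
        "-Dfile.encoding={}".format(encoding),
        "-classpath", classpath,
        main_class,
    ]


command = build_java_command("example.Main", "build/classes")
print(command)
```

The resulting list can be handed to subprocess.Popen as-is, since Popen passes each list item to the child as exactly one argument.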

Anyway, not really sure what I'm doing here, but I'll keep trying things out until I figure out what's going on, or someone more knowledgeable can jump in :)

AnthonyBriggs · 2016-08-22T13:12:20Z

Update: it seems like I can recreate that specific error by writing a file out as UTF-16 and then reading it back as UTF-8 (see attached script, test_encoding.py.txt). Perhaps Java does this by default on Windows?
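
The attached script isn't reproduced here, but the mechanism it describes can be sketched roughly as follows, assuming little-endian UTF-16 with a byte order mark (the sample text and variable names are illustrative only):

```python
import codecs

text = 'x = "Mÿ hôvèrçràft"'

# UTF-16 little-endian with an explicit byte order mark (b"\xff\xfe").
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Reading those bytes back as UTF-8 trips over the BOM immediately.
bom_error = None
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    bom_error = exc

print(bom_error.start, hex(bom_error.object[bom_error.start]), bom_error.reason)
```

Note that in this sketch the failure lands on the first byte of the data, so comparing the reported byte offset against the original traceback may help tell a UTF-16 round trip apart from other causes.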

In any case, it's progress, and late here, so I'll pick this up later if it's still open.

freakboy3742 · 2016-08-23T01:04:11Z

You might be on to something with the UTF-16/8 thing. Java strings are UTF-16 internally, and class files and serialized data use an odd format called MUTF-8 (modified UTF-8). The key feature of MUTF-8 is an odd way of encoding nulls.

I'm not sure why this would be manifesting on console output, and only on Windows - but it's worth some investigation.

For Python <3.6 on Windows, encoding strings for output to the Windows command prompt may result in a UnicodeEncodeError. VOC does not take encoding into consideration, and so the output differs. Until such time as proper encoding support is implemented in VOC, use an environment variable that Python provides for overriding the IO encoding. By setting this to UTF-8, the output may appear garbled, but the error is avoided, and it matches the run-as-Java output. For consistency, pass the environment variable to Java as well. Addresses beeware#610 and beeware#237.
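
The environment variable referred to here is presumably PYTHONIOENCODING, which is the standard way to override the encoding of Python's stdin/stdout/stderr. A minimal sketch of the workaround as described, not the actual change; `run_with_utf8_io` and `CHILD_CODE` are hypothetical names:

```python
import os
import subprocess
import sys

# Child program written with escapes so the command line itself stays ASCII.
CHILD_CODE = r"print('M\xff h\xf4v\xe8rcraft')"


def run_with_utf8_io(code):
    """Run `code` in a child Python with PYTHONIOENCODING set to UTF-8."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = "utf-8"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # With the override in place, the child's piped stdout is UTF-8.
    return result.stdout.decode("utf-8").strip()


output = run_with_utf8_io(CHILD_CODE)
print(output == "M\xff h\xf4v\xe8rcraft")
```

Copying os.environ rather than replacing it keeps PATH and friends intact for the child process, which matters when the same environment is also handed to the Java subprocess.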

whydoubt mentioned this issue Nov 18, 2017

Work around Unicode encoding error on Windows #703

Closed
