This repository has been archived by the owner on May 31, 2020. It is now read-only.

Issues when trying to run tests on Windows #237

Open
AnthonyBriggs opened this issue Aug 16, 2016 · 3 comments

Comments

@AnthonyBriggs
Contributor

(This is on Windows 8.1; I imagine similar issues appear on other Windows systems)

I've had a try at running the tests under Windows. The main failure that I run into seems to be related to the default encoding under Windows (cp1252 rather than utf-8):

E AssertionError: 'utf-8' codec can't decode byte 0xff in position 10: invalid start byte

This seems to be an issue on both the Java and Python versions; see the attached voc_output.txt, which is a dump of the main_code for both runAsPython() and runAsJava().

Another, possibly related issue is that Windows terminals (cmd and PowerShell) use a "charmap" encoding, which doesn't handle some Unicode characters. I've also attached unicode_test.py.txt, which blows up with a similar error.
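The "charmap" failure can be illustrated without a Windows console. The sketch below is not the attached unicode_test.py.txt; it assumes cp1252 as the target encoding (the code page a given terminal actually uses may differ) and uses the test string quoted later in this thread. The helper names are made up for illustration.

```python
# Illustrative sketch: which characters of the test string cannot be
# represented in a single-byte "charmap" code page such as cp1252?
TEXT = "Mÿ hôvèrçràft îß fûłl öf éêlś"


def unencodable_chars(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


def encode_error_message(text, encoding):
    """Return the error message from a strict encode, or None on success."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


print(unencodable_chars(TEXT, "cp1252"))
print(encode_error_message(TEXT, "cp1252"))
```

The two characters cp1252 has no byte for are ł and ś, and the error message names the 'charmap' codec, which matches the kind of error described above.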

In the process of debugging, I've also found a couple of places where utf-8 encoding seems to have been missed - pull request here: #236

@AnthonyBriggs
Contributor Author

AnthonyBriggs commented Aug 22, 2016

I've been (slowly) chasing this up this evening, and making some progress through judicious commenting-out of exception handlers :) This is more of an infodump than anything, but might be helpful.

Commenting out the first exception handler in assertCodeExecution (lines 345/346) reveals that it's runAsJava that's throwing the dodgy string, specifically

`line = self.jvm.stdout.readline().decode("utf-8")`

on line 505. Printing out the return string from self.jvm.stdout.readline() shows that it's returning the hovercraft-is-full-of-eels string as

`b'>>> x = "M\xff h\xf4v\xe8r\xe7r\xe0ft \xee\xdf f\xfb?l \xf6f \xe9\xeal?"\n'`

which I'm pretty sure is not right. \xff is never a valid byte in UTF-8 (it only turns up as half of the UTF-16 byte order mark), which is why it's exploding. (For the record, it's supposed to be "Mÿ hôvèrçràft îß fûłl öf éêlś")
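A quick way to probe those bytes, sketched here outside the VOC test suite: decode them as UTF-8 to reproduce the failure, then try cp1252. The byte string is copied from the readline() output above; the helper and variable names are made up for illustration. The result suggests, but does not prove, that the JVM wrote its output in cp1252, substituting ? for the two characters that code page lacks.

```python
# Bytes copied from the readline() output quoted above.
RAW = b'>>> x = "M\xff h\xf4v\xe8r\xe7r\xe0ft \xee\xdf f\xfb?l \xf6f \xe9\xeal?"\n'
EXPECTED = "Mÿ hôvèrçràft îß fûłl öf éêlś"


def describe_utf8_failure(data):
    """Return (position, reason) of the first UTF-8 decode error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# Reproduces the error from the test run: byte 0xff at position 10.
print(describe_utf8_failure(RAW))

# Decoding as cp1252 succeeds; only the two characters that cp1252
# cannot represent come back as '?'.
print(RAW.decode("cp1252"))

# Going the other way: encoding the expected text to cp1252 with
# replacement yields exactly the bytes inside the quotes.
AS_CP1252 = EXPECTED.encode("cp1252", errors="replace")
print(AS_CP1252 in RAW)
```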

I've tried a few simple things to work out what's going on. Encoding the Java output to cp1252 instead of utf-8 just turns it into

 `x = "M├┐ h├┤v├¿r├ºr├áft ├«├ƒ f├╗?l ├Âf ├®├¬l?"`

and various other combinations of encoding/decoding that string in a test script (without explicitly setting \xff) didn't replicate the exact error.

I've done some light googling for "windows java stdout encoding" and "windows check powershell encoding", which turns up some potentially helpful info:

Default character encoding for java console output
Powershell: Get default system encoding
UTF-8 output from PowerShell (long)

Based on these, I've tried a few things:

  • Setting [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 was to no avail.
  • I did find that [System.Text.Encoding]::Default.EncodingName is set to Western European (Windows), which is cp1252 (also known as windows-1252 or ANSI Latin 1).
  • Adding "-Dfile.encoding", "UTF-8", to the subprocess.Popen call didn't do anything.
  • Neither did switching to decoding UTF-16.
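One detail worth checking on the -Dfile.encoding attempt: the JVM expects a system property as a single -Dname=value argument. If the flag was passed as two separate list items, as written above, the JVM would not see it as one property assignment. The sketch below only builds an argument list; the function name, classpath and class name are hypothetical, and whether this property actually changes the encoding of the JVM's stdout on a given Windows setup is not established here.

```python
def build_java_command(classpath, main_class, encoding="UTF-8"):
    """Build an argv list for `java` with file.encoding set as ONE argument.

    All names here are illustrative; this is not VOC's actual Popen call.
    """
    return [
        "java",
        "-Dfile.encoding=" + encoding,  # single -Dname=value token
        "-classpath",
        classpath,
        main_class,
    ]


print(build_java_command("build/classes", "Example"))
```

The resulting list could be handed to subprocess.Popen in place of a hand-written one.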

Anyway, not really sure what I'm doing here, but I'll keep trying things out until I figure out what's going on, or someone more knowledgeable can jump in :)

@AnthonyBriggs
Contributor Author

AnthonyBriggs commented Aug 22, 2016

Update: it seems like I can recreate that specific error by writing a file out as UTF-16 and then reading it back as UTF-8 (see attached script, test_encoding.py.txt). Perhaps Java does this by default on Windows?
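The attached script is not reproduced in this thread. A minimal in-memory sketch of the same idea, assuming little-endian UTF-16 with a byte order mark (names are illustrative):

```python
import codecs

TEXT = "Mÿ hôvèrçràft"

# UTF-16 output usually starts with a byte order mark; for little-endian
# data that is the two bytes 0xff 0xfe.
UTF16_BYTES = codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le")


def utf8_decode_error(data):
    """Return the UnicodeDecodeError raised by a UTF-8 decode, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


ERR = utf8_decode_error(UTF16_BYTES)
print(ERR)
```

This raises the same "invalid start byte" complaint about 0xff, but at position 0, whereas the traceback in the first comment reports position 10. That difference may be a hint that the byte order mark is not the whole story.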

In any case, it's progress, and late here, so I'll pick this up later if it's still open.

@freakboy3742
Member

You might be on to something with the UTF-16/8 thing. Java strings are UTF-16 internally, but class files and some serialization APIs use an odd format called Modified UTF-8 (MUTF-8). The key feature of MUTF-8 is an odd way of encoding nulls.

I'm not sure why this would be manifesting on console output, and only on Windows - but it's worth some investigation.
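For reference, a rough Python sketch of how Modified UTF-8 differs from standard UTF-8: a null is written as the two bytes 0xc0 0x80, and characters outside the Basic Multilingual Plane are written as two separately encoded surrogates. The function name is made up, and this is an illustration of the format rather than anything VOC or the JVM exposes.

```python
import struct


def encode_mutf8(text):
    """Encode `text` as Modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    # Walk the UTF-16 code units, as Java does.
    for (unit,) in struct.iter_unpack(">H", text.encode("utf-16-be")):
        if unit == 0:
            out += b"\xc0\x80"  # the "odd" null encoding
        else:
            # surrogatepass lets each half of a surrogate pair be
            # encoded on its own as a three-byte sequence.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


print(encode_mutf8("a\x00b"))
print(encode_mutf8("ÿ"))
```

Note that ÿ still comes out as the two bytes 0xc3 0xbf, so MUTF-8 alone would not explain a bare 0xff in the output.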

whydoubt added a commit to whydoubt/voc that referenced this issue Nov 14, 2017
For Python <3.6 on Windows, encoding strings for output to the Windows
command prompt may result in a UnicodeEncodeError.  VOC does not take
encoding into consideration, and so the output differs.

Until such time as proper encoding support is implemented in VOC, use an
environment variable that Python provides for overriding the IO
encoding.  By setting this to UTF-8, the output may appear garbled, but
the error is avoided, and it matches the run-as-Java output.  For
consistency, pass the environment variable to Java as well.

Addresses beeware#610 and beeware#237.
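
The environment variable Python provides for this is PYTHONIOENCODING. A minimal sketch of the approach the commit message describes, not the commit's actual code (the helper name is hypothetical): run a child Python with the variable set and decode its stdout as UTF-8.

```python
import os
import subprocess
import sys


def run_python_with_utf8_io(code):
    """Run `code` in a child Python with UTF-8 stdio and return its stdout."""
    # Copy the parent environment and override only the IO encoding.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # The child was told to write UTF-8, so decode with the same codec.
    return result.stdout.decode("utf-8")


# The child source is ASCII-only; the escapes are expanded by the child.
OUTPUT = run_python_with_utf8_io(r"print('f\u00fb\u0142l')")
print(OUTPUT.strip())
```

Because the child writes bytes to a pipe rather than to the console, the parent sees the same bytes regardless of the terminal's code page.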
whydoubt added a commit to whydoubt/voc that referenced this issue Nov 18, 2017