Change Response Files encoding to be identified by response file suffix? #15292
Comments
I remember digging into this a while back and finding it very thorny. I would not be opposed to your suggestion if it solves a real problem. What other tools are you imagining?
Today, none other than the ones you list. But the interface does not reflect that the internals deal with LLVM internal issues. Maybe we don't need to care about changing the creation side, since there aren't any issues there; just the read side.
Yes, sounds like it's mostly the reading of these response files that is the issue. We can clarify that. On the reading side, I wonder if we can just copy the logic from LLVM itself? I imagine it must have code to auto-detect the encoding of the file it's reading. (I seem to remember reading this code, or at least looking it up, a while back.)
Detecting the difference between ASCII and UTF-8 is hard without a byte order mark (BOM).
But we don't know whether Windows presents that problem to us or not. Could it be that the BOM is present in all the cases we care about?
There is no guarantee that users who have crafted response files would have created them with a BOM. Recall that these files all come from the users of Emscripten, and are not just created internally by the Emscripten toolchain when invoking subprocesses.
Also to clarify, the issue here was not ASCII vs UTF-8 (ASCII is a subset of UTF-8, so it would be fine to unconditionally decode ASCII files as UTF-8 in that case), but e.g. Windows codepage 437 vs UTF-8 and/or Windows codepage 1252 vs UTF-8, or some other "default system encoding" vs UTF-8. We could try autodetecting the encoding by first attempting to decode the file as UTF-8 and, if that fails, loading it using the current system locale. But I find that it is better to have support for being explicit. Posted #15406 to fix this.
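The UTF-8-first fallback mentioned above could be sketched roughly as follows. This is only an illustration: `read_response_file_text` and its `fallback_encoding` parameter are hypothetical names, not the function in Emscripten's `tools/response_file.py`, and the sketch assumes the locale's preferred encoding is the right fallback.

```python
import locale


def read_response_file_text(path, fallback_encoding=None):
    """Decode a response file as UTF-8, else fall back to another encoding.

    Hypothetical helper for illustration only. If fallback_encoding is None,
    the current locale's preferred encoding is used (an assumption).
    """
    if fallback_encoding is None:
        fallback_encoding = locale.getpreferredencoding(False)
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        # The 'utf-8-sig' codec also strips a leading BOM if one is present.
        return raw.decode('utf-8-sig')
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding)
```

One caveat with this kind of detection is that some byte sequences written in a legacy codepage also happen to be valid UTF-8, in which case the file would be decoded silently with the wrong result. That is one argument for an explicit mechanism.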
When Python sees `open(file, 'r')` without an explicit `encoding=` parameter, it opens a text file using the default system encoding. This happens today at emscripten/tools/response_file.py, line 80 in ff23b8c.
So when users are calling emcc and other tools with response files, they need to encode their response files using the current system locale encoding.
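To make the failure mode concrete, here is a small sketch of what happens when a response file written in one encoding is decoded with another. The file contents and the choice of codepage 1252 are assumptions made for illustration; on a real system the default would come from the locale.

```python
import locale
import os
import tempfile

# The encoding open() typically uses when no encoding= is given.
print('locale default:', locale.getpreferredencoding(False))

# Hypothetical response file content with a non-ASCII character.
args = '-DAUTHOR=Jos\u00e9'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file.rsp')

    # Simulate a tool that wrote the file using Windows codepage 1252.
    with open(path, 'w', encoding='cp1252') as f:
        f.write(args)

    # Reading with the matching encoding round-trips correctly.
    with open(path, 'r', encoding='cp1252') as f:
        same_encoding = f.read()

    # Reading the same bytes as UTF-8 raises, because the single byte
    # 0xE9 is not a valid UTF-8 sequence.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True
```

Pure ASCII content would decode identically under both encodings, so the mismatch only surfaces once arguments contain non-ASCII characters.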
It would be preferable to always use UTF-8 to encode response files, but there is a danger that changing to `open(file, 'r', encoding='utf-8')` there will break existing user build systems. For example, on Windows, if a shell script did `echo arg1 arg2 arg3 arg4 > file.rsp`, I believe that will create a file encoded in the current system locale encoding.
Also related, we have this quirky code when different Emscripten tools create response files:
emscripten/tools/response_file.py
Lines 44 to 53 in ff23b8c
Here the code is making an assumption that the only time the Emscripten toolchain creates response files is in a call to an LLVM tool. This might be true, though the intent of `create_response_file()` definitely is not to create response files only for LLVM to consume. (By the way, I believe the answer to that TODO above is 'yes'.)
To fix both of these, I would propose that the response file logic use the file suffix to decide what the encoding of the file should be:
- If the suffix is `.utf8`, then the response file should be created and/or read using explicit UTF-8 encoding.
- Otherwise (e.g. `.rsp`), then the response file should be created using the current system locale encoding.
Then when Emscripten users are creating response files, they can create files with suffix `.rsp` or `.rsp.utf8` to choose either the current locale (`open(file, 'r')` in Python) or UTF-8 (`open(file, 'r', encoding='utf-8')` in Python). Likewise, the function `create_response_file()` can then be extended to take in a suffix, and the call sites of that function can choose which encoding to use (I think they all will want `.rsp.utf8` at the moment).
Does that sound good?
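A minimal sketch of the suffix-based scheme might look like this. The helper names (`encoding_for`, `write_response_file`, `read_response_file`) are hypothetical and do not reflect the actual code in `tools/response_file.py`. The sketch only illustrates choosing the encoding from the suffix, and it uses a plain whitespace split instead of real argument quoting and escaping.

```python
import os
import tempfile


def encoding_for(path):
    """Return the encoding implied by a response file's suffix.

    '.utf8' means explicit UTF-8. Anything else (e.g. '.rsp') returns None,
    which makes open() use the current system default encoding.
    """
    return 'utf-8' if path.endswith('.utf8') else None


def write_response_file(args, directory, suffix='.rsp.utf8'):
    """Write args, one per line, to a new file with the given suffix."""
    fd, path = tempfile.mkstemp(suffix=suffix, dir=directory)
    with os.fdopen(fd, 'w', encoding=encoding_for(path)) as f:
        f.write('\n'.join(args) + '\n')
    return path


def read_response_file(path):
    """Read the arguments back using the suffix-selected encoding."""
    with open(path, 'r', encoding=encoding_for(path)) as f:
        # Simplification: no handling of quoted arguments containing spaces.
        return f.read().split()
```

With this shape, a caller that wants locale-encoded output would pass `suffix='.rsp'`, and existing user-written `.rsp` files would keep being read exactly as they are today.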