Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Response Files encoding to be identified by response file suffix? #15292

Closed
juj opened this issue Oct 14, 2021 · 6 comments · Fixed by #15406
Closed

Change Response Files encoding to be identified by response file suffix? #15292

juj opened this issue Oct 14, 2021 · 6 comments · Fixed by #15406

Comments

@juj
Copy link
Collaborator

juj commented Oct 14, 2021

When python sees open(file, 'r') without explicit encoding= parameter, it opens a text file using the default system encoding. This happens today at

with open(response_filename) as f:

So when users are calling emcc and other tools with response files, they need to encode their response files using the current system encoding locale.

It would be preferable to always use utf-8 encoding to encode response files with, but there is a danger that changing open(file, 'r', encoding='utf-8') there will break existing user build systems.

For example on Windows if a shell script did echo arg1 arg2 arg3 arg4 > file.rsp, I believe that will create a current system encoding locale encoded file.

Also related, we have this quirky code when different Emscripten tools create response files:

# When writing windows repsonse files force the encoding to UTF8 which we know
# that llvm tools understand. Without this, we get whatever the default codepage
# might be.
# See: https://github.com/llvm/llvm-project/blob/3f3d1c901d7abcc5b91468335679b1b27d8a02dd/llvm/include/llvm/Support/Program.h#L168-L170
# And: https://github.com/llvm/llvm-project/blob/63d16d06f5b8f71382033b5ea4aa668f8150817a/clang/include/clang/Driver/Job.h#L58-L69
# TODO(sbc): Should we also force utf-8 on non-windows?
if WINDOWS:
encoding = 'utf-8'
else:
encoding = None

here the code is making an assumption that the only time the Emscripten toolchain is creating response files, it would be creating those in a call to an LLVM tool. This might be true, though the intent of create_response_file() definitely is not to create response files only for LLVM to consume.

(btw I believe the answer to that TODO above is 'yes')

To fix both of these, I would propose that the response file logic would use the file suffix to decide what the encoding of the file should be:

  • if the suffix is .utf8, then the response file should be created and/or read using explicit utf-8 encoding.
  • if the suffix is anything else (e.g. the usual .rsp), then the response file should be created using current system encoding locale.

Then when Emscripten users are creating response files, they can create files with suffix .rsp or .rsp.utf8 to choose either current locale (open(file, 'r') in python), or utf-8 locale (open(file, 'r', encoding='utf-8') for python). Likewise, the function create_response_file() can then be extended to take in a suffix, and the call sites to that tool can choose which encoding to use (I think they all will want .rsp.utf8 at the moment)

Does that sound good?

@sbc100
Copy link
Collaborator

sbc100 commented Oct 14, 2021

I remember digging into this a while back finding it very thorny. I would not be opposed to your suggestion if it solves a real problem.

What other tools are you imagining create_response_file files being fed too? Aren't llvm and binaryen the only tools that emscripten invokes internally?

@juj
Copy link
Collaborator Author

juj commented Oct 14, 2021

What other tools are you imagining create_response_file files being fed too?

Today, none other that you list. But the interface does not reflect that the internals deal about LLVM internal issues. Maybe we don't need to care about changing the creation side, since there aren't any issues there; just on the read side.

@sbc100
Copy link
Collaborator

sbc100 commented Oct 14, 2021

Yes, sounds like its mostly the reading in of these response files that is it the issue. We can clarify that create_response_file is only indented for binaryn and llvm to read and therefore can always be in utf-8.

On the reading side I wonder if we can just copy the logic from llvm itself? I imagine it must have code to auto-detect the encoding of the file its reading? (I seem to remember reading this code, or at least looking it up a while back).

@ab-unity
Copy link

Detecting the difference between ASCII and UTF-8 is hard without a Byte order Mark.

@sbc100
Copy link
Collaborator

sbc100 commented Oct 15, 2021

But we don't know if windows presents that problem to us or not. It could be that the BOM is present in all the cases we care about?

@juj
Copy link
Collaborator Author

juj commented Nov 2, 2021

But we don't know if windows presents that problem to us or not. It could be that the BOM is present in all the cases we care about?

There is no guarantee that users who have crafted response files would have created the files with a BOM? Recall that these files all come from the users of Emscripten, and not just internally created by the Emscripten toolchain when invoking subprocesses.

Detecting the difference between ASCII and UTF-8 is hard without a Byte order Mark.

Also to clarify, the issue here was not ASCII vs UTF-8 (ASCII is a subset of UTF-8, so would be fine to hoist to unconditionally decode ASCII files as UTF-8 in that case), but e.g. Windows Codepage 437 vs UTF-8 and/or Windows Codepage 1252 vs UTF-8, or some other "default system encoding" vs UTF-8.

We could try autodetecting the encoding by first attempting to decode it as UTF-8 and if parsing the file using that fails, then load it using the system current locale. But I find that it is better to have support to be explicit.

Posted #15406 to fix this.

@juj juj closed this as completed in #15406 Nov 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants