Change Response Files encoding to be identified by response file suffix? #15292

juj · 2021-10-14T06:58:34Z

When python sees open(file, 'r') without explicit encoding= parameter, it opens a text file using the default system encoding. This happens today at

emscripten/tools/response_file.py

Line 80 in ff23b8c

with open(response_filename) as f:

So when users are calling emcc and other tools with response files, they need to encode their response files using the current system encoding locale.

It would be preferable to always use utf-8 encoding to encode response files with, but there is a danger that changing open(file, 'r', encoding='utf-8') there will break existing user build systems.

For example on Windows if a shell script did echo arg1 arg2 arg3 arg4 > file.rsp, I believe that will create a current system encoding locale encoded file.

Also related, we have this quirky code when different Emscripten tools create response files:

emscripten/tools/response_file.py

Lines 44 to 53 in ff23b8c

    
           # When writing windows repsonse files force the encoding to UTF8 which we know 
        
           # that llvm tools understand.  Without this, we get whatever the default codepage 
        
           # might be. 
        
           # See: https://github.com/llvm/llvm-project/blob/3f3d1c901d7abcc5b91468335679b1b27d8a02dd/llvm/include/llvm/Support/Program.h#L168-L170 
        
           # And: https://github.com/llvm/llvm-project/blob/63d16d06f5b8f71382033b5ea4aa668f8150817a/clang/include/clang/Driver/Job.h#L58-L69 
        
           # TODO(sbc): Should we also force utf-8 on non-windows? 
        
           if WINDOWS: 
        
             encoding = 'utf-8' 
        
           else: 
        
             encoding = None

here the code is making an assumption that the only time the Emscripten toolchain is creating response files, it would be creating those in a call to an LLVM tool. This might be true, though the intent of create_response_file() definitely is not to create response files only for LLVM to consume.

(btw I believe the answer to that TODO above is 'yes')

To fix both of these, I would propose that the response file logic would use the file suffix to decide what the encoding of the file should be:

if the suffix is .utf8, then the response file should be created and/or read using explicit utf-8 encoding.
if the suffix is anything else (e.g. the usual .rsp), then the response file should be created using current system encoding locale.

Then when Emscripten users are creating response files, they can create files with suffix .rsp or .rsp.utf8 to choose either current locale (open(file, 'r') in python), or utf-8 locale (open(file, 'r', encoding='utf-8') for python). Likewise, the function create_response_file() can then be extended to take in a suffix, and the call sites to that tool can choose which encoding to use (I think they all will want .rsp.utf8 at the moment)

Does that sound good?

The text was updated successfully, but these errors were encountered:

sbc100 · 2021-10-14T14:36:00Z

I remember digging into this a while back finding it very thorny. I would not be opposed to your suggestion if it solves a real problem.

What other tools are you imagining create_response_file files being fed too? Aren't llvm and binaryen the only tools that emscripten invokes internally?

juj · 2021-10-14T15:02:54Z

What other tools are you imagining create_response_file files being fed too?

Today, none other that you list. But the interface does not reflect that the internals deal about LLVM internal issues. Maybe we don't need to care about changing the creation side, since there aren't any issues there; just on the read side.

sbc100 · 2021-10-14T15:08:58Z

Yes, sounds like its mostly the reading in of these response files that is it the issue. We can clarify that create_response_file is only indented for binaryn and llvm to read and therefore can always be in utf-8.

On the reading side I wonder if we can just copy the logic from llvm itself? I imagine it must have code to auto-detect the encoding of the file its reading? (I seem to remember reading this code, or at least looking it up a while back).

ab-unity · 2021-10-14T23:45:39Z

Detecting the difference between ASCII and UTF-8 is hard without a Byte order Mark.

sbc100 · 2021-10-15T00:06:51Z

But we don't know if windows presents that problem to us or not. It could be that the BOM is present in all the cases we care about?

juj · 2021-11-02T10:26:36Z

But we don't know if windows presents that problem to us or not. It could be that the BOM is present in all the cases we care about?

There is no guarantee that users who have crafted response files would have created the files with a BOM? Recall that these files all come from the users of Emscripten, and not just internally created by the Emscripten toolchain when invoking subprocesses.

Detecting the difference between ASCII and UTF-8 is hard without a Byte order Mark.

Also to clarify, the issue here was not ASCII vs UTF-8 (ASCII is a subset of UTF-8, so would be fine to hoist to unconditionally decode ASCII files as UTF-8 in that case), but e.g. Windows Codepage 437 vs UTF-8 and/or Windows Codepage 1252 vs UTF-8, or some other "default system encoding" vs UTF-8.

We could try autodetecting the encoding by first attempting to decode it as UTF-8 and if parsing the file using that fails, then load it using the system current locale. But I find that it is better to have support to be explicit.

Posted #15406 to fix this.

juj mentioned this issue Nov 2, 2021

Add support for specifying response file encoding #15406

Merged

juj closed this as completed in #15406 Nov 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change Response Files encoding to be identified by response file suffix? #15292

Change Response Files encoding to be identified by response file suffix? #15292

juj commented Oct 14, 2021 •

edited

sbc100 commented Oct 14, 2021

juj commented Oct 14, 2021

sbc100 commented Oct 14, 2021

ab-unity commented Oct 14, 2021

sbc100 commented Oct 15, 2021 •

edited

juj commented Nov 2, 2021

Change Response Files encoding to be identified by response file suffix? #15292

Change Response Files encoding to be identified by response file suffix? #15292

Comments

juj commented Oct 14, 2021 • edited

sbc100 commented Oct 14, 2021

juj commented Oct 14, 2021

sbc100 commented Oct 14, 2021

ab-unity commented Oct 14, 2021

sbc100 commented Oct 15, 2021 • edited

juj commented Nov 2, 2021

juj commented Oct 14, 2021 •

edited

sbc100 commented Oct 15, 2021 •

edited