Failures handling UTF-16, UTF-32 encoded files #22

cgmb · 2017-10-08T22:02:03Z

It seems that Visual Studio may generate UTF-16 header files for Resource Files. An example of such a file is renderdoccmd/resource.h. I expect that UTF-32 files have the same problem, though I have never encountered one.

Under Python 2, guardonce actually happens to handle this case correctly, as resource files don't have guards. checkguard notes that no guard was found, and both guard2once and once2guard ignore it. This is not because guardonce is behaving intelligently. Even if there were a guard, it would not be recognized, and guardonce would exhibit the same behaviour. That's not ideal, but as long as checkguard is telling you that the files are a problem, and as long as guard2once and once2guard do no harm to the files, it's acceptable.

Under Python 3, guardonce fails to decode the file to string, and prints a cryptic error message:

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is from Linux, where utf-8 is the default codec. There's probably a different message under Windows. The behaviour is mostly the same as under Python 2, but all utilities print that error message, and checkguard does not print out the file name. It's hard to track down what file has the problem, because I'm not including enough information in that error message. That's not acceptable.

It's hard to say what the right thing to do is. Programs like file and vim will guess these encodings, though sed and gcc won't. UTF-16 and UTF-32 are pretty distinctive. They will usually have a BOM, and when the content is mostly ASCII, as C headers tend to be, a large percentage of the bytes in the file will be null. It's very unlikely that a real C header in a single-byte-compatible encoding would start with a UTF-16 or UTF-32 BOM, or be full of null bytes.
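The BOM and null-byte heuristics described above could be sketched roughly as follows. This is only an illustration, not code from guardonce: the names guess_encoding and null_fraction are hypothetical, and only the BOM constants from Python's standard codecs module are relied on.

```python
import codecs

# Hypothetical table for illustration. UTF-32 must be checked before UTF-16,
# because the UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, default="utf-8"):
    """Return a codec name guessed from a leading BOM, else `default`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def null_fraction(data):
    """Return the fraction of bytes in `data` that are null (0.0 if empty)."""
    if not data:
        return 0.0
    return data.count(b"\x00") / float(len(data))
```

The "utf-16" and "utf-32" codecs consume the BOM themselves when decoding, so the guessed name can be passed straight to bytes.decode. A file with no BOM but a high null fraction would still need separate handling, which this sketch leaves out.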

Another possibility is to allow the user to specify the encodings of their files, but that may be complicated, as even in the renderdoc example above, most files in the repository are UTF-8 and there's only a single UTF-16 file. Many developers probably don't know how all their files are encoded, and there's probably a mixture of encodings within the repository.

At least for now, the plan is to make Python 3's behaviour match Python 2. Everything beyond complaining about and ignoring these files is a bonus.
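A minimal sketch of "complain and ignore", assuming a hypothetical helper named read_header (guardonce's actual internals may look quite different). The main point is that the message names the offending file, which the error shown earlier does not.

```python
import io
import sys


def read_header(path, encoding="utf-8"):
    """Return the decoded text of `path`, or None if it cannot be decoded.

    Hypothetical helper for illustration. On a decoding failure it prints
    a message that includes the file name, so the file can be tracked down.
    """
    try:
        # io.open accepts an encoding argument under both Python 2 and 3.
        with io.open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as e:
        sys.stderr.write("{0}: skipped, could not decode as {1}: {2}\n".format(
            path, encoding, e))
        return None
```

A caller would treat a None result as "no guard found" and leave the file untouched, which matches the Python 2 behaviour described above.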

cgmb · 2017-10-11T05:53:20Z

At present, UTF-8 is the only encoding supported by guardonce. UTF-16 and UTF-32 files will be reported by checkguard as problem files, and will be ignored by guard2once and once2guard.

I hope to improve upon that eventually, but for now this will do.

cgmb added bug Python3 and removed Python3 labels Oct 8, 2017

cgmb closed this as completed Oct 11, 2017

cgmb mentioned this issue Mar 19, 2018

Add UTF-16 and UTF-32 support #28

Open
