Failures handling UTF-16, UTF-32 encoded files #22

Closed
cgmb opened this issue Oct 8, 2017 · 1 comment

@cgmb
Owner

cgmb commented Oct 8, 2017

It seems that Visual Studio may generate UTF-16 header files for Resource Files. An example of such a file is renderdoccmd/resource.h. I expect that UTF-32 files have the same problem, though I have never encountered one.

Under Python 2, guardonce actually happens to handle this case correctly, as resource files don't have guards. checkguard notes that no guard was found, and both guard2once and once2guard ignore it. This is not because guardonce is behaving intelligently. Even if there were a guard, it would not be recognized, and guardonce would exhibit the same behaviour. That's not ideal, but as long as checkguard is telling you that the files are a problem, and as long as guard2once and once2guard do no harm to the files, it's acceptable.

Under Python 3, guardonce fails to decode the file to a string, and prints a cryptic error message:

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is from Linux, where utf-8 is the default codec. There's probably a different message under Windows. The behaviour is mostly the same as under Python 2, but all utilities print that error message, and checkguard does not print out the file name. It's hard to track down what file has the problem, because I'm not including enough information in that error message. That's not acceptable.
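A minimal sketch of the kind of error reporting that would make the offending file easy to find. The function name `read_header` and the wording of the message are hypothetical, not guardonce's actual code; the only point is that the path is attached to the decode error and the file is skipped instead of aborting.

```python
import sys


def read_header(path, encoding="utf-8"):
    """Return the decoded contents of path, or None if it cannot be decoded.

    Hypothetical helper: on failure it names the file on stderr so the
    user can see which header is the problem.
    """
    try:
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as err:
        # Include the path; the bare codec message alone is not actionable.
        print("{}: skipped, not valid {} ({})".format(path, encoding, err),
              file=sys.stderr)
        return None
```

A caller would treat a `None` result as "report the file as a problem and leave it untouched", which matches the Python 2 behaviour described above.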

It's hard to say what the right thing to do is. Programs like file and vim will guess these encodings, though sed and gcc won't. UTF-16 and UTF-32 are pretty distinctive. They will usually start with a BOM, and it's very likely that a large percentage of bytes in the file are going to be null. It's very unlikely that a real C header in another encoding would start with those BOM bytes or be full of null bytes.
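The BOM and null-byte heuristic could be sketched roughly as follows. This is an illustration only: `guess_encoding` and the `null_threshold` value of 0.2 are assumptions made for the example, not anything guardonce implements, and the threshold has not been tuned against real files.

```python
import codecs

# Check the UTF-32 BOMs before the UTF-16 ones: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, null_threshold=0.2):
    """Guess a codec name for raw bytes, or None if the data looks wrong.

    data is the raw file contents (bytes). The returned names are Python
    codecs that consume the BOM themselves when decoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM. Many null bytes suggests BOM-less UTF-16/32 or binary data.
    if data and data.count(b"\x00") / len(data) > null_threshold:
        return None
    return "utf-8"
```

One known weakness of this ordering: a UTF-16 LE file whose first character after the BOM is U+0000 would be misread as UTF-32, though that seems very unlikely for a header.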

Another possibility is to allow the user to specify the encodings of their files, but that may be complicated, as even in the renderdoc example above, most files in the repository are UTF-8 and there's only a single UTF-16 file. Many developers probably don't know how all their files are encoded, and there's probably a mixture of encodings within the repository.
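If user-specified encodings were ever added, one way to cope with a mostly-UTF-8 repository containing a few odd files would be per-pattern overrides with a UTF-8 default. The sketch below is purely hypothetical: `encoding_for` and the `overrides` list are invented for illustration, and no such option exists in guardonce.

```python
import fnmatch


def encoding_for(path, overrides, default="utf-8"):
    """Pick an encoding for path from a list of (glob pattern, encoding).

    The first matching pattern wins; files matching nothing get the
    default. Note that fnmatch's '*' also matches path separators.
    """
    for pattern, encoding in overrides:
        if fnmatch.fnmatch(path, pattern):
            return encoding
    return default
```

For the renderdoc case, a single override such as `[("*/resource.h", "utf-16")]` would cover the one UTF-16 file while everything else stays UTF-8. It still relies on the developer knowing which files are the exceptions, which is the weakness noted above.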

At least for now, the plan is to make Python 3's behaviour match Python 2. Everything beyond complaining about and ignoring these files is a bonus.

@cgmb
Owner Author

cgmb commented Oct 11, 2017

At present, UTF-8 is the only encoding supported by guardonce. UTF-16 and UTF-32 files will be reported by checkguard as problem files, and will be ignored by guard2once and once2guard.

I hope to improve upon that eventually, but for now this will do.
