Fix unicode exceptions #156

jkloetzke · 2016-11-22T07:41:46Z

Source files may not be properly encoded. Make the handling of such
files more tolerant.

Fixes #148.

Source files may not be properly encoded. The compiler and gcov do not care, but Python 3 expects proper encoding and raises an exception on such files. Make the handling of these files more tolerant by using the 'surrogateescape' error policy. Python 2, on the other hand, does not care about the encoding, so wrap the open() function there to add the missing 'errors' parameter. Fixes gcovr#148.
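The wrapper idea can be sketched roughly as follows. This is an illustration only, not the code from this PR: the name `open_tolerant` is hypothetical, and the encoding is left to Python's locale default as described in the discussion.

```python
import sys

if sys.version_info[0] >= 3:
    def open_tolerant(path, mode="r", errors="surrogateescape"):
        """Open a file so that undecodable bytes do not raise (hypothetical helper)."""
        if "b" in mode:
            # Binary mode performs no decoding, so 'errors' does not apply.
            return open(path, mode)
        # Text mode: keep undecodable bytes as escape code points instead of raising.
        return open(path, mode, errors=errors)
else:
    import __builtin__

    def open_tolerant(path, mode="r", errors=None):
        """Python 2 variant: accept 'errors' for a uniform call signature and ignore it."""
        # Python 2's built-in open() returns byte strings and has no 'errors' parameter.
        return __builtin__.open(path, mode)
```

Callers can then use `open_tolerant(filename)` on both Python versions without checking the interpreter version at each call site.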

strahlc · 2017-08-22T12:00:25Z

can you please merge this PR?

strahlc · 2017-12-21T11:57:34Z

please merge

latk · 2018-02-11T20:48:10Z

Thank you for these changes. I think that overriding the open() function is an elegant way to handle the Python 2 vs 3 differences.

But this solution only hides any errors and doesn't actually support other encodings. Resilience against encoding errors is good, but I am deferring this PR until a general strategy for dealing with source encodings can be developed. See also my comment at #148 for more context.

jkloetzke · 2018-02-12T10:29:20Z

I think there are two parts that can be considered almost independently. One is overriding the default source encoding (inferred by Python from the locale), and the other is how to treat errors. This PR is solely about handling encoding errors. Note that even if it were possible to specify a source encoding, it would not be sufficient for our use case. We have quite a bit of legacy and 3rd party code that is all compiled together. Because of that there are unfortunately multiple different encodings in the code base.

This PR proposes to use the surrogateescape error handler. Quoting the Python documentation:

'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.

I think this is a good trade-off. This way gcovr can read incorrectly encoded source files, process the correctly encoded parts and write back the result with the same byte representation. The only other options are basically to either replace the offending bytes or to bail out. IMHO neither of them looks attractive.
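To make the three options concrete, here is a small sketch comparing them on a byte string. It assumes UTF-8 as the nominal encoding and uses a made-up Latin-1 comment line as input; neither is taken from gcovr itself.

```python
# A source line whose comment is Latin-1 encoded, so it is not valid UTF-8.
raw = b"// gr\xe4\xdfe\n"

# Option 1: bail out. Strict decoding raises on the first bad byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 2: replace. Decoding succeeds, but the original bytes are lost.
replaced = raw.decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8")

# Option 3: surrogateescape. Each bad byte becomes a code point in
# U+DC80..U+DCFF and is turned back into the same byte when encoding.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok)        # False
print(lossy == raw)     # False
print(restored == raw)  # True
```

Note that the escaped string must also be written with the `surrogateescape` handler; encoding it with the default strict handler raises `UnicodeEncodeError`.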

So I think it is still valuable to have this kind of error handling even if the source encoding can be properly specified.

latk · 2018-06-03T20:11:49Z

An alternative approach has been implemented in #256, so I'm closing this PR. Still, many thanks for putting this prototype forward as it helped the discussion! If you feel that the currently implemented solution does not support a specific use case, please open a new issue.

jkloetzke force-pushed the fix-unicode-crashes branch from 22d36a7 to ae68a8b Compare November 22, 2016 19:44

lisongmin mentioned this pull request May 21, 2018: Support --source-encoding option. #256 (Merged)

latk closed this Jun 3, 2018
