Fix unicode exceptions #156

Closed
wants to merge 1 commit into from

jkloetzke

Source files may not be properly encoded. Make the handling of such
files more tolerant.

Fixes #148.

Source files may not be properly encoded. The compiler and gcov do not
care, but Python 3 expects a proper encoding and raises an exception
otherwise. Make the handling of such files more tolerant by using the
'surrogateescape' error policy.

On the other hand Python 2 does not care about the encoding. Wrap the
open() function there to add the missing 'errors' parameter.

Fixes gcovr#148.
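
For readers following along, the wrapper described in the commit message might look roughly like the sketch below. This is not the PR's actual code: the name `tolerant_open` is hypothetical, only text mode is considered, and the Python 2 branch rests on the assumption that the built-in open() there returns byte strings and never decodes.

```python
import os
import sys
import tempfile

if sys.version_info[0] >= 3:
    def tolerant_open(path, mode="r", errors="surrogateescape"):
        # Hypothetical helper. Python 3 decodes text files with the locale
        # encoding; undecodable bytes become lone surrogates instead of
        # raising UnicodeDecodeError.
        return open(path, mode, errors=errors)
else:
    def tolerant_open(path, mode="r", errors=None):
        # Python 2's built-in open() has no 'errors' parameter and returns
        # byte strings, so the argument is accepted and ignored.
        return open(path, mode)

# Demo: a source line containing a Latin-1 byte (0xE9) that is not valid UTF-8.
fd, demo_path = tempfile.mkstemp(suffix=".c")
with os.fdopen(fd, "wb") as f:
    f.write(b"int x; // caf\xe9\n")

with tolerant_open(demo_path) as f:
    text = f.read()  # does not raise, whatever the locale encoding is

os.remove(demo_path)
```

How the 0xE9 byte shows up in `text` depends on the locale encoding, so the sketch makes no claim about that character, only that reading succeeds.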
@strahlc
strahlc commented Aug 22, 2017

can you please merge this PR?

@strahlc
strahlc commented Dec 21, 2017

please merge

@latk
Member

latk commented Feb 11, 2018

Thank you for these changes. I think that overriding the open() function is an elegant way to handle the Python 2 vs 3 differences.

But this solution only hides any errors and doesn't actually support other encodings. Resilience against encoding errors is good, but I am deferring this PR until a general strategy for dealing with source encodings can be developed. See also my comment at #148 (comment) for more context.

@jkloetzke
Author

I think there are two parts that can be considered almost independently. One is overriding the default source encoding (inferred by Python from the locale), and the other is how to treat errors. This PR is solely about handling encoding errors. Note that even if it were possible to specify a source encoding, that would not be sufficient for our use case. We have quite a bit of legacy and 3rd-party code that is all compiled together. Because of that, there are unfortunately multiple different encodings in the code base.
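
To illustrate why the two parts are independent, here is a small sketch (not taken from the PR). The sample byte strings and the helper `try_decode` are made up for the example. It decodes one UTF-8 line and one Latin-1 line with the same specified encoding, first strictly and then with a tolerant error handler.

```python
# Two "source files" from a mixed code base: same comment, different encodings.
utf8_source = u"// Gr\u00f6\u00dfe\n".encode("utf-8")
latin1_source = u"// Gr\u00f6\u00dfe\n".encode("latin-1")


def try_decode(data, encoding, errors="strict"):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        return None


sources = (utf8_source, latin1_source)

# Choosing one encoding for everything: the Latin-1 file fails to decode.
strict_utf8 = [try_decode(d, "utf-8") for d in sources]

# Same encoding, tolerant error policy: both files can be read.
tolerant_utf8 = [try_decode(d, "utf-8", "surrogateescape") for d in sources]
```

Here `strict_utf8[1]` is `None` while `tolerant_utf8[1]` is a string, so a configurable encoding alone would not cover the mixed code base.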

This PR proposes to use the surrogateescape error handler. Quoting the Python documentation:

'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.

I think this is a good trade-off. This way gcovr can read incorrectly encoded source files, process the correctly encoded parts and write back the result with the same byte representation. The only other options are to replace the offending bytes or to bail out. IMHO neither of them looks attractive.
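
The three options can be compared in a few lines. This is a minimal sketch assuming UTF-8 as the expected encoding; the sample bytes are invented for the illustration.

```python
# 0xA9 is the copyright sign in Latin-1 but an invalid start byte in UTF-8.
raw = b"/* \xa9 2017 */ int main(void);\n"

# Option 1: surrogateescape. The bad byte is smuggled through as U+DCA9 and
# restored on output, as long as the same handler is used when encoding.
escaped = raw.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# Option 2: replace. The bad byte becomes U+FFFD and the original is lost.
replaced = raw.decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8")

# Option 3: strict (the default). Decoding bails out.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Caveat: the escaped string cannot be encoded without the handler.
try:
    escaped.encode("utf-8")
    plain_encode_failed = False
except UnicodeEncodeError:
    plain_encode_failed = True
```

Only the first option gives back `raw` unchanged. The caveat at the end matters for any output path that does not also use `surrogateescape`.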

So I think it is still valuable to have this kind of error handling even if the source encoding can be properly specified.

@latk
Member

latk commented Jun 3, 2018

An alternative approach has been implemented in #256, so I'm closing this PR. Still, many thanks for putting this prototype forward as it helped the discussion! If you feel that the currently implemented solution does not support a specific use case, please open a new issue.

@latk latk closed this Jun 3, 2018

3 participants