Support --source-encoding option. #256

lisongmin · 2018-05-21T16:47:49Z

This PR add --source-encoding suggest in #148 .

Note that io.open will load file as unicode under python2, which may change the coding behavior.

codecov · 2018-05-21T16:55:22Z

Codecov Report

Merging #256 into master will increase coverage by 0.06%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #256      +/-   ##
==========================================
+ Coverage   88.76%   88.83%   +0.06%     
==========================================
  Files          13       13              
  Lines        1478     1487       +9     
  Branches      267      268       +1     
==========================================
+ Hits         1312     1321       +9     
  Misses        108      108              
  Partials       58       58

Impacted Files	Coverage Δ
gcovr/tests/test_gcovr.py	`97.67% <100%> (+0.11%)`	⬆️
gcovr/tests/test_gcov_parser.py	`100% <100%> (ø)`	⬆️
gcovr/html_generator.py	`96.53% <100%> (+0.01%)`	⬆️
gcovr/__main__.py	`89.17% <100%> (+0.13%)`	⬆️
gcovr/gcov.py	`83.46% <100%> (-0.05%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17dfaf6...21ebbaa. Read the comment docs.

latk · 2018-05-21T17:06:37Z

This is fantastic! Thank you very much for looking into this. I'll review properly in the next couple of days.

BTW the Travis test failure was just a small style problem. To check for those before you commit you can run flake8 locally, e.g. python3 -m flake8 gcovr/.

If you'd like to work on other issues, please consider first making a comment on the corresponding issue or mentioning this in the chat room. I really appreciate your work, but some issues first need some discussion on the best design.

lisongmin · 2018-05-21T23:14:04Z

Ok. Test should pass now.

I forgot run it locally. It is greate if there is a .flake8 file under the project root and autopep8 as well. modern editor with plugin may auto check if it is there(like vim, vs code).

Do you mind change the file contents to unicode in python2?

lisongmin · 2018-05-21T23:47:56Z

And it is less tolerant than #156 because encode/decode error handler is replace but not surrogateescape for python2 compat. Any way, we cat get coverage even if encoding error. that is important.

latk

I've reviewed this PR more thoroughly. Encoding issues are difficult! I'll try to work on test cases with various encodings, and will not merge until there are some tests that work on all platforms.

latk · 2018-05-29T09:59:50Z

gcovr/html_generator.py

@@ -235,7 +236,8 @@ def print_html_report(covdata, options):
    if options.output is None:
        sys.stdout.write(htmlString + '\n')
    else:
-        OUTPUT = open(options.output, 'w')
+        OUTPUT = io.open(options.output, 'w', encoding=options.html_encoding,
+                         errors='replace')


for HTML output, setting errors='xmlcharrefreplace' might be more appropriate

latk · 2018-05-29T10:03:49Z

gcovr/__main__.py

+    options.add_argument(
+        '--source-encoding',
+        help="Override the declared source encoding. "
+             "Defaults to the system default encoding.",


The --source-encoding and --html-encoding options should reference each other. Now that the source encoding can be specified explicitly, the --html-encoding help message is incorrect.
It might also be nice to show the system default encoding (available through locale.getpreferredencoding()).

They should independency? the transition in python:

source -- read and convert to --> unicode(or str on python3) -- convert to utf-8 and write --> html(utf-8)

--html-encoding is show in broswer, and utf-8 is suitable for that.

latk · 2018-05-29T10:55:04Z

gcovr/gcov.py

@@ -87,7 +85,8 @@ def is_non_code(code):
 #
 def process_gcov_data(data_fname, covdata, source_fname, options, currdir=None):
    logger = Logger(options.verbose)
-    INPUT = open(data_fname, "r")
+    INPUT = io.open(data_fname, "r", encoding=options.source_encoding,
+                    errors='replace')


I've been thinking about whether it is correct to decode the gcov files in one go. The problem is that the gcov tool is encoding transparent. The output files contain ASCII annotations and the original source file contents. So opening these files will only work under an ASCII-compatible encoding.

Perhaps this restriction should be mentioned in the --source-encoding help message. It would also be good to have relevant tests. I'll try to develop a test case in the course of this week.

I tested it locally with non-ascii chars and output is correct before make this PR.

input with utf-8 and output with gbk

input with utf-8 and output with utf-8

Which tests are you wanted to add? I can have a try although i'm not familiar to gcovr's test framwork.

latk · 2018-05-29T11:06:18Z

gcovr/gcov.py

-    if sys.version_info < (3, 0):
-        code = code.strip().translate(None, '}{')
-    else:
-        code = code.strip().translate(noncode_mapper)


I was a bit confused by this change, but apparently unicode.translate() from Python 2.7 behaves like str.translate() in Python 3. It might be good to make that requirement explicit, e.g. by adding a six dependency and:

import six ... def is_non_code(code): assert isinstance(code, six.text_type) ...

io.open returns unicode on python2 and str on python3

Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn’t. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as unicode strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

As we can ensure io.open return unicode, adding additional module six is not necessary in my opinion.

…les.

This checks --source-encoding for utf8 and cp1252 (iso-8895-1, latin-1), and --html-encoding for utf8, cp1252, and iso-8859-15 (latin-9). Note that utf8 is the default for --html-encoding. Untested: treatment of encoding errors when reading gcov files.

latk

I have updated the PR with tests for the encoding behaviour, and a small documentation improvement. I'll do the merge now. Sorry that it took so long!

lisongmin force-pushed the not-utf-8-encoding-support branch from b11c88a to a0df533 Compare May 21, 2018 16:55

Support --source-encoding option.

a0df533

latk added Type: Enhancement Format: HTML Gcov labels May 21, 2018

latk added the needs review label May 22, 2018

latk reviewed May 29, 2018

View reviewed changes

latk mentioned this pull request May 29, 2018

gcovr crashes with files which are not utf-8 encoded #148

Closed

lisongmin and others added 3 commits May 29, 2018 22:07

encoding error handler change to xmlcharrefreplace when write html fi…

8260d50

…les.

add encoding tests

07c304a

This checks --source-encoding for utf8 and cp1252 (iso-8895-1, latin-1), and --html-encoding for utf8, cp1252, and iso-8859-15 (latin-9). Note that utf8 is the default for --html-encoding. Untested: treatment of encoding errors when reading gcov files.

encoding option documentation

21ebbaa

latk approved these changes Jun 3, 2018

View reviewed changes

latk merged commit ff4c43d into gcovr:master Jun 3, 2018

latk removed the needs review label Jun 3, 2018

latk mentioned this pull request Jun 3, 2018

Fix unicode exceptions #156

Closed

lisongmin deleted the not-utf-8-encoding-support branch June 4, 2018 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support --source-encoding option. #256

Support --source-encoding option. #256

lisongmin commented May 21, 2018

codecov bot commented May 21, 2018 •

edited

latk commented May 21, 2018

lisongmin commented May 21, 2018

lisongmin commented May 21, 2018

latk left a comment

latk May 29, 2018

latk May 29, 2018

lisongmin May 29, 2018 •

edited

latk May 29, 2018

lisongmin May 29, 2018

lisongmin May 29, 2018 •

edited

latk May 29, 2018

lisongmin May 29, 2018 •

edited

lisongmin May 29, 2018

latk left a comment

Support --source-encoding option. #256

Support --source-encoding option. #256

Conversation

lisongmin commented May 21, 2018

codecov bot commented May 21, 2018 • edited

Codecov Report

latk commented May 21, 2018

lisongmin commented May 21, 2018

lisongmin commented May 21, 2018

latk left a comment

Choose a reason for hiding this comment

latk May 29, 2018

Choose a reason for hiding this comment

latk May 29, 2018

Choose a reason for hiding this comment

lisongmin May 29, 2018 • edited

Choose a reason for hiding this comment

latk May 29, 2018

Choose a reason for hiding this comment

lisongmin May 29, 2018

Choose a reason for hiding this comment

lisongmin May 29, 2018 • edited

Choose a reason for hiding this comment

latk May 29, 2018

Choose a reason for hiding this comment

lisongmin May 29, 2018 • edited

Choose a reason for hiding this comment

lisongmin May 29, 2018

Choose a reason for hiding this comment

latk left a comment

Choose a reason for hiding this comment

codecov bot commented May 21, 2018 •

edited

lisongmin May 29, 2018 •

edited

lisongmin May 29, 2018 •

edited

lisongmin May 29, 2018 •

edited