gcovr crashes with files which are not utf-8 encoded #148

Closed
strahlc opened this Issue Sep 7, 2016 · 10 comments

strahlc commented Sep 7, 2016

Unfortunately, we have some submodules that are not UTF-8 encoded.
When we run gcovr on our project, we get a traceback:

$ gcovr -r .
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.4/gcovr", line 2312, in <module>
    process_datafile(file_, covdata, options)
  File "/usr/lib/python-exec/python3.4/gcovr", line 891, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/usr/lib/python-exec/python3.4/gcovr", line 489, in process_gcov_data
    line = INPUT.readline()
  File "/usr/lib64/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1750: invalid start byte

In projects where all files are UTF-8 encoded, everything works fine.

We are using gcovr-3.3.
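
For reference, the same error can be reproduced outside gcovr. This is a minimal sketch that assumes the offending file is Latin-1 (ISO-8859-1) encoded, where byte 0xfc is "ü"; the actual encoding of the submodules is not stated in this report, and the sample text is made up.

```python
# Byte 0xfc is "ü" in Latin-1, but it is never a valid start byte in UTF-8.
raw = "Gr\u00fc\u00dfe".encode("latin-1")  # assumed sample content: b"Gr\xfc\xdfe"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Same exception type as in the traceback above.
    error = exc

print(error)
# Decoding with the encoding the bytes were written in succeeds.
print(raw.decode("latin-1"))
```

Python 3 decodes text-mode files while reading them, so a single such byte in a .gcov file is enough to abort the run.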

balegoff commented Sep 8, 2016

We have had exactly the same issue since we updated gcovr from 3.2 to 3.3.

balegoff commented Sep 8, 2016

gcovr -v -r . gives me this:

...
Parsing coverage data for file /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/vector
  Filtering coverage data for file /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/vector
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/gcovr", line 4, in <module>
    __import__('pkg_resources').run_script('gcovr==3.2', 'gcovr')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 735, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1659, in run_script
    exec(script_code, namespace, namespace)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gcovr-3.2-py3.5.egg/EGG-INFO/scripts/gcovr", line 1961, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gcovr-3.2-py3.5.egg/EGG-INFO/scripts/gcovr", line 749, in process_datafile
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gcovr-3.2-py3.5.egg/EGG-INFO/scripts/gcovr", line 416, in process_gcov_data
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1049: invalid continuation byte

It seems random: it doesn't always fail on the same file, though mostly on our own files.

balegoff commented Sep 14, 2016

It seems that I'm facing the issue with v3.2 when I install from source.
Installing from Homebrew works fine, though.
Any chance of getting v3.3 on Homebrew?

jkloetzke added a commit to jkloetzke/gcovr that referenced this issue Nov 22, 2016

Fix unicode exceptions
Source files may not be properly encoded. Make the handling of such
files more tolerant.

Fixes gcovr#148.

jkloetzke added a commit to jkloetzke/gcovr that referenced this issue Nov 22, 2016

Fix unicode exceptions on Python 3
Source files may not be properly encoded. While the compiler and gcov do
not care it will blow up Python 3 that expects proper encoding. Make the
handling of such files more tolerant by using the 'surrogateescape'
error policy.

On the other hand Python 2 does not care about the encoding. Wrap the
open() function there to add the missing 'errors' parameter.

Fixes gcovr#148.
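
The 'surrogateescape' policy mentioned in this commit message can be illustrated in isolation. This is a sketch of the general technique, not the code from the commit; the sample bytes and the temporary file are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical sample: a "source line" containing one Latin-1 byte (0xe9, "é").
raw = b"int caf\xe9 = 1;\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors="surrogateescape" maps each undecodable byte to a lone surrogate
    # (0xe9 -> U+DCE9) instead of raising UnicodeDecodeError.
    with open(path, "r", encoding="utf-8", errors="surrogateescape") as f:
        text = f.read()
finally:
    os.remove(path)

# Encoding with the same policy restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(repr(text))
print(roundtrip == raw)
```

Note that the escaped characters are not real text: writing them to a stream with the default strict error handling raises UnicodeEncodeError, so the output side has to use a matching policy.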

shw9 commented Dec 21, 2017

I'm facing the same issue and have no clue what is causing it.

Traceback (most recent call last):
  File "/opt/gcovr/noarch/3.3-2/bin/gcovr", line 2312, in <module>
    process_datafile(file_, covdata, options)
  File "/opt/gcovr/noarch/3.3-2/bin/gcovr", line 891, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/opt/gcovr/noarch/3.3-2/bin/gcovr", line 489, in process_gcov_data
    line = INPUT.readline()
  File "/opt/python/x86_64/3.5.1-1/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 427: invalid start byte

Member

latk commented Feb 11, 2018

As a workaround, using gcovr under Python 2.7 might sidestep these issues when using a single-byte encoding.

The gcovr source currently ignores file encoding. The PR #157 suggests a way to address these issues (inserting replacement characters when the input doesn't decode via UTF-8), but I think that solution is mostly wrong because it doesn't actually support non-UTF-8 encodings – it just paints over any errors. The --html-encoding option has a similar intention but works in reverse, by declaring the encoding of HTML files that include the source code directly.

I think the correct solution is to adapt #157 and introduce a --source-encoding switch to properly decode the input (defaulting to UTF-8). This still won't work for mixed-encoding code bases, but I don't know how such a use case can be addressed.
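
The difference between painting over errors and declaring the real encoding can be sketched as follows. The --source-encoding flag here is only the proposal from this comment, wired up with argparse for illustration; the sample comment text and the cp1251 encoding are assumptions.

```python
import argparse

# Hypothetical option: the name follows the proposal above, defaulting to UTF-8.
parser = argparse.ArgumentParser()
parser.add_argument("--source-encoding", default="utf-8")
options = parser.parse_args(["--source-encoding", "cp1251"])

# Assumed sample: a Cyrillic comment stored as cp1251 bytes.
original = "// \u043f\u0440\u0438\u0432\u0435\u0442"
raw = original.encode("cp1251")

# Painting over errors: decoding succeeds, but the text becomes U+FFFD characters.
replaced = raw.decode("utf-8", errors="replace")

# Declaring the real encoding recovers the text.
decoded = raw.decode(options.source_encoding)

print(replaced)
print(decoded)
```

As noted above, a single switch like this still assumes one encoding for the whole code base.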

I'm deferring this issue because other tasks have to be done first, but I understand that gcovr is broken regarding encodings and needs to be fixed.

latk added the Type: Bug label Feb 11, 2018

Contributor

goriy commented Feb 12, 2018

It's a good idea to introduce something like a --source-encoding parameter, maybe during the big refactoring planned after the 3.4 release.

Surprisingly, I've had problems even with UTF-8 encoded sources on Windows (Python 3.6)!

It seems like there is more than one default encoding:

  • returned by sys.getdefaultencoding
  • returned by locale.getpreferredencoding

They can differ. It seems that the default encoding for open() is the one returned by locale.getpreferredencoding. I don't know of a way to change it at the moment: it doesn't obey environment variables (LANG, LANGUAGE, PYTHONIOENCODING, LC_xxx), and the calls to locale.setlocale that I tried don't affect it either.
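
The two defaults can be inspected directly. This is a small diagnostic sketch; the values in the comments are typical examples, not output captured from the environment described here.

```python
import locale
import sys

# Encoding used for implicit str <-> bytes conversions; always "utf-8" on Python 3.
default_encoding = sys.getdefaultencoding()

# Encoding that open() falls back to in text mode when no encoding= is passed.
# Often "cp1252" or "cp1251" on Windows, usually "UTF-8" on Linux and macOS.
preferred_encoding = locale.getpreferredencoding(False)

print("sys.getdefaultencoding():     ", default_encoding)
print("locale.getpreferredencoding():", preferred_encoding)
```

If the second value does not match the encoding of the source files, any open() call without an explicit encoding can fail.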

So if your source code is in that encoding (UTF-8 or not), you are lucky: it's enough to adjust the encoding of the HTML reports produced by gcovr with --html-encoding.

If your source code is not in that encoding, gcovr (like any other Python 3 script that relies on the default) crashes with:

UnicodeDecodeError: 'charmap' codec can't decode byte...

The only simple way to get it to work is to explicitly pass the encoding parameter in the open() calls.
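
The 'charmap' failure and the explicit-encoding fix can be reproduced on any platform. In this sketch, cp1252 stands in for a Windows locale default (an assumption; the locale in question is not named above), and the file content is made up.

```python
import os
import tempfile

text = "// \u0442\u0435\u0441\u0442\n"  # assumed sample: a Cyrillic comment
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    # Reading UTF-8 bytes with a mismatched single-byte codec: the UTF-8
    # sequence contains byte 0x81, which cp1252 leaves undefined.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        failed = False
    except UnicodeDecodeError as exc:
        failed = True
        print(exc)

    # Passing the real encoding explicitly makes the read independent of the locale.
    with open(path, "r", encoding="utf-8") as f:
        fixed = f.read()
finally:
    os.remove(path)

print(fixed)
```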

As far as I know, Python 2.7 reads files "as is", gcovr doesn't interfere either, so there should be no such problem.

Contributor

goriy commented Feb 12, 2018

I've got some sources in utf-8 encoding. gcovr crashes on Windows using Python 3.6.
I've tried this hack and it worked:

diff --git a/scripts/gcovr b/scripts/gcovr                          
index abc8108..3ecc8a1 100755                                       
--- a/scripts/gcovr                                                 
+++ b/scripts/gcovr                                                 
@@ -456,7 +456,7 @@ def is_non_code(code):                          
 # Process a single gcov datafile                                   
 #                                                                  
 def process_gcov_data(data_fname, covdata, source_fname, options): 
-    INPUT = open(data_fname, "r")                                  
+    INPUT = open(data_fname, "r", encoding='utf-8')                
     #                                                              
     # Get the filename                                             
     #                                                              
@@ -1716,7 +1716,7 @@ def print_html_report(covdata, details):      
         data['ROWS'] = []                                          
         currdir = os.getcwd()                                      
         os.chdir(root_dir)                                         
-        INPUT = open(data['FILENAME'], 'r')                        
+        INPUT = open(data['FILENAME'], 'r', encoding='utf-8')      
         ctr = 1                                                    
         for line in INPUT:                                         
             data['ROWS'].append(                                   
@@ -1728,7 +1728,7 @@ def print_html_report(covdata, details):      
         data['ROWS'] = '\n'.join(data['ROWS'])                     
                                                                    
         htmlString = source_page.substitute(**data)                
-        OUTPUT = open(cdata._sourcefile, 'w')                      
+        OUTPUT = open(cdata._sourcefile, 'w', encoding='utf-8')    
         OUTPUT.write(htmlString + '\n')                            
         OUTPUT.close()                                             

In this case, the --html-encoding value should be passed as the encoding argument when saving files (and maybe not only for HTML reports).
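
A sketch of that idea, separate from the gcovr code: the variable html_encoding is a hypothetical stand-in for the value of --html-encoding, and the HTML snippet is invented for the example.

```python
import os
import tempfile

html_encoding = "cp1251"  # hypothetical stand-in for the --html-encoding value
html = '<meta charset="{0}"><pre>// \u0442\u0435\u0441\u0442</pre>\n'.format(html_encoding)

fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    # Write the report with the same encoding that the page declares.
    with open(path, "w", encoding=html_encoding) as out:
        out.write(html)
    # Read the raw bytes back to check what actually landed on disk.
    with open(path, "rb") as f:
        written = f.read()
finally:
    os.remove(path)

print(written)
```

Keeping the declared charset and the encoding used for writing in one variable avoids reports whose bytes disagree with their own header.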

lisongmin added a commit to lisongmin/gcovr that referenced this issue May 20, 2018

Member

latk commented May 29, 2018

@goriy As you've previously looked into encoding issues – could you perhaps test PR #256 on your projects? I think it should fix these problems, although the PR doesn't yet have any testcases.

Member

latk commented Jun 3, 2018

Source file encoding support has been implemented in #256. If it doesn't address your use case, please add a comment with more information.

latk closed this Jun 3, 2018

Contributor

goriy commented Jun 6, 2018

Sorry for the late answer.
I've just tested gcovr with the changes introduced in #256 on the same projects as before and in the same environment. It works fine now! Thanks a lot!

OS: Windows, Python version: 3.6, encodings:

  • input with utf-8 and output with utf-8
  • input with utf-8 and output with cp1251
  • input with cp1251 and output with cp1251
  • input with cp1251 and output with utf-8
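
The four combinations above can be mimicked with a pure-Python round trip. This is not the test that was run against gcovr, only a sketch of the same matrix; the sample string is an assumption chosen to be representable in both encodings.

```python
import itertools

encodings = ["utf-8", "cp1251"]
sample = "// \u0442\u0435\u0441\u0442"  # assumed sample: a Cyrillic comment

results = {}
for source_enc, output_enc in itertools.product(encodings, repeat=2):
    raw_input = sample.encode(source_enc)   # bytes as stored in the source file
    text = raw_input.decode(source_enc)     # decode with the declared source encoding
    raw_output = text.encode(output_enc)    # bytes written to the report
    results[(source_enc, output_enc)] = raw_output.decode(output_enc) == sample

for combo, ok in results.items():
    print(combo, "ok" if ok else "FAILED")
```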

P. S. Special thanks to @lisongmin!
