Add an option to disable long file names #528

Merged: 7 commits, Dec 31, 2021

ebmmy
Contributor

@ebmmy ebmmy commented Nov 22, 2021

Using the option --long-file-names with llvm-cov leads to file names longer than 255 characters on our setup, which makes gcovr unusable. Hence this PR, which adds an option to disable it.

@latk
Member

latk commented Nov 22, 2021

Thank you for this pull request! This seems to be fallout from the change in #524/#525.

Could you explain a bit under which circumstances this problem occurs? Is this mainly a Windows thing?

I am not the biggest fan of adding a new option that requires the user to figure out the correct combination of flags. I am also a bit unhappy about making parallel processing impossible in the general case, even though it's a niche feature right now.

I think it might be better to run gcov once to discover available options, and then use the --hash-filenames option if it is available. If not, we can issue a warning similar to the one you've shown (though it should only be shown once, or possibly be a hard error since data will be incorrect in all nontrivial cases). The --hash-filenames option was added in GCC 7 and is also available in llvm-cov.

@ebmmy
Contributor Author

ebmmy commented Nov 22, 2021

I discovered this issue while investigating a more general problem: upgrading from llvm 11 to 12 broke coverage generation for some files in our project (only files generated in our build folder, which is not contained in the source tree).
It seems that older versions of llvm-cov created .gcov files even if the original source file was not found, replacing all lines with EOF.

Starting from llvm-cov 12, an empty .gcov is generated if the source file is not found. This file-not-found issue is fixed by this commit, but it introduces an issue with filenames longer than 255 characters (on Ubuntu 18.04 running on WSL2 Windows 10). One such filename, for instance, is /home/user/project/build/unittest-clang/#home#user#project#build#unittest-clang#external-dependencies#common#mock#MyComponentMock#CMakeFiles#MyComponentMock-Mock.dir#MyComponentMock.cpp.gcda###home#user#project#application#external-dependencies#Common#mock#MyComponentMock#MyComponentMock.cpp.gcov (please note that I redacted some parts of the name).

I already tried the --hash-filenames option, but without success: I still get some names that are too long (e.g. /home/user/project/build/unittest-clang/#home#user#project#build#unittest-clang#test#MyComponentTest#CMakeFiles#MyComponent-Test.dir#__#__#src#MyComponent#SourceFileInComponent.cpp.gcda###home#user#project#application#src#MyComponent#SourceFileInComponent.cpp##54f388945663abd375d80713245a927f.gcov).

Hence, except for a custom option, I have no idea how to handle this issue 😕

@latk
Member

latk commented Nov 22, 2021

Thank you for that background info. It looks like my suggestion of --long-file-names + --hash-filenames is completely pointless when --preserve-paths is enabled. I've looked at the GCC and Clang implementations for these options. This comment contains some notes along the way so it's a bit longer.

TL;DR: I don't think we need --long-file-names or --preserve-paths at this time if we add --hash-filenames instead, if supported by the gcov version.

In GCC:

Permalink: https://github.com/gcc-mirror/gcc/blob/f58bf16f672cda3ac55f92f12e258c817ece6e3c/gcc/gcov.c#L2615-L2650

static string
make_gcov_file_name (const char *input_name, const char *src_name)
{
  string str;

  /* When hashing filenames, we shorten them by only using the filename
     component and appending a hash of the full (mangled) pathname.  */
  if (flag_hash_filenames)
    str = (string (mangle_name (src_name)) + "##"
	   + get_md5sum (src_name) + ".gcov");
  else
    {
      if (flag_long_names && input_name && strcmp (src_name, input_name) != 0)
	{
	  str += mangle_name (input_name);
	  str += "##";
	}

      str += mangle_name (src_name);
      str += ".gcov";
    }

  return str;
}

Pseudocode:

def make_gcov_file_name(gcda_path, source_path, options) -> str:
  if options.hash_filenames:
    return (mangle(source_path, options.preserve_paths)
            + "##" + md5(source_path) + ".gcov")

  coverage_path = ""

  if options.long_filenames and gcda_path != source_path:
    coverage_path += mangle(gcda_path, options.preserve_paths) + "##"

  coverage_path += mangle(source_path, options.preserve_paths) + ".gcov"

  return coverage_path

In LLVM/Clang

Permalink: https://github.com/llvm/llvm-project/blob/137d3474ca39a9af6130519a41b62dd58672a5c0/llvm/lib/ProfileData/GCOV.cpp#L637-L659

std::string Context::getCoveragePath(StringRef filename,
                                     StringRef mainFilename) const {
  if (options.NoOutput)
    // This is probably a bug in gcov, but when -n is specified, paths aren't
    // mangled at all, and the -l and -p options are ignored. Here, we do the
    // same.
    return std::string(filename);

  std::string CoveragePath;
  if (options.LongFileNames && !filename.equals(mainFilename))
    CoveragePath =
        mangleCoveragePath(mainFilename, options.PreservePaths) + "##";
  CoveragePath += mangleCoveragePath(filename, options.PreservePaths);
  if (options.HashFilenames) {
    MD5 Hasher;
    MD5::MD5Result Result;
    Hasher.update(filename.str());
    Hasher.final(Result);
    CoveragePath += "##" + std::string(Result.digest());
  }
  CoveragePath += ".gcov";
  return CoveragePath;
}

Pseudocode:

def get_coverage_path(source_path, gcda_path, options) -> str:
  coverage_path = ""

  if options.long_file_names and (gcda_path != source_path):
    coverage_path = mangle(gcda_path, options.preserve_paths) + "##"

  coverage_path += mangle(source_path, options.preserve_paths)

  if options.hash_filenames:
    coverage_path += "##" + md5(source_path)

  return coverage_path + ".gcov"

In either case, the mangle() function is something like:

def mangle(path, preserve_paths) -> str:
  if preserve_paths:
    return path.replace('/', '#')
  else:
    return basename(path)

So at least llvm-cov and GCC gcov seem consistent with each other.
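To make the effect on name length concrete, here is a minimal runnable sketch of the two pseudocode functions above. It assumes the simplified mangle() shown in this comment (real gcov implementations do more rewriting than this), and the function names and the example paths are hypothetical, chosen only for illustration.

```python
import hashlib
import posixpath


def mangle(path: str, preserve_paths: bool) -> str:
    # Simplified on purpose, mirroring the pseudocode; not gcov's full logic.
    return path.replace("/", "#") if preserve_paths else posixpath.basename(path)


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def gcc_style_name(gcda_path, source_path, *, preserve_paths=False,
                   long_file_names=False, hash_filenames=False) -> str:
    if hash_filenames:
        # The quoted GCC code returns early here, so no gcda part is added.
        return (mangle(source_path, preserve_paths)
                + "##" + md5_hex(source_path) + ".gcov")
    name = ""
    if long_file_names and gcda_path != source_path:
        name += mangle(gcda_path, preserve_paths) + "##"
    return name + mangle(source_path, preserve_paths) + ".gcov"


def llvm_style_name(gcda_path, source_path, *, preserve_paths=False,
                    long_file_names=False, hash_filenames=False) -> str:
    name = ""
    if long_file_names and gcda_path != source_path:
        name = mangle(gcda_path, preserve_paths) + "##"
    name += mangle(source_path, preserve_paths)
    if hash_filenames:
        name += "##" + md5_hex(source_path)
    return name + ".gcov"


# Hypothetical paths, not taken from any real project.
GCDA = "/home/user/project/build/obj/lib.gcda"
SRC = "/home/user/project/src/lib.c"

long_name = gcc_style_name(GCDA, SRC, preserve_paths=True, long_file_names=True)
hashed_name = gcc_style_name(GCDA, SRC, hash_filenames=True)
print(len(long_name), long_name)
print(len(hashed_name), hashed_name)
```

With --hash-filenames alone, the name length depends only on the source file's basename plus a fixed-size hash, whereas the --preserve-paths --long-file-names name grows with the depth of both paths.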

The problem is that --preserve-paths --long-file-names and --hash-filenames are not a 100% replacement for each other. If gcovr runs multiple gcov processes in parallel, the generated filenames must be unique based on both the gcda-path and the source-path. This is guaranteed with --preserve-paths --long-file-names. The --hash-filenames option only ensures this with respect to the source-path, but doesn't include the gcda-path.

If we were to add --hash-filenames and remove --preserve-paths+--long-file-names in exchange, the following scenario could lead to incorrect coverage data (but see discussion below):

project/
  lib.gcda
  foo/lib.gcda
  lib.c

Let there be a project directory that contains a lib.gcda, and a subdirectory foo that also contains a lib.gcda. Let's assume that both were generated from a compiler that was run in the project directory, so that gcov will be run from that directory as well.

  • With options --preserve-paths --long-file-names we get distinct filenames. In this scenario, appending the hash does not add information.
    • for gcov lib.gcda:
      #project#lib.gcda##lib.c.gcov
    • for gcov foo/lib.gcda:
      #project#foo#lib.gcda##lib.c.gcov
  • Without --long-file-names we get conflicts, regardless of whether --hash-filenames and --preserve-paths are enabled as well. The hash doesn't contain information about the gcda-path, so appending it is pointless as well.
    • for gcov lib.gcda
      lib.c.gcov
    • for gcov foo/lib.gcda
      lib.c.gcov
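The scenario above can be reproduced with a short sketch. It follows the simplified LLVM-style pseudocode from earlier in this comment; coverage_name and names_for are hypothetical helpers, and the absolute /project/... paths are an assumption made to match the names listed above.

```python
import hashlib
import posixpath


def coverage_name(gcda_path, source_path, *, preserve_paths,
                  long_file_names, hash_filenames) -> str:
    """LLVM-style naming, following the simplified pseudocode."""
    def mangle(path):
        return path.replace("/", "#") if preserve_paths else posixpath.basename(path)

    name = ""
    if long_file_names and gcda_path != source_path:
        name = mangle(gcda_path) + "##"
    name += mangle(source_path)
    if hash_filenames:
        name += "##" + hashlib.md5(source_path.encode("utf-8")).hexdigest()
    return name + ".gcov"


# Two coverage data files that both refer to the same source file.
GCDA_FILES = ["/project/lib.gcda", "/project/foo/lib.gcda"]
SOURCE = "lib.c"


def names_for(**flags):
    return [coverage_name(gcda, SOURCE, **flags) for gcda in GCDA_FILES]


distinct = names_for(preserve_paths=True, long_file_names=True, hash_filenames=False)
clashing = names_for(preserve_paths=True, long_file_names=False, hash_filenames=True)

print(distinct)            # two different output names
print(set(clashing))       # a single name shared by both gcov runs
```

Because the hash is computed from the source path only, both runs in the second configuration write to the same file name.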

At this point, I think the best resolution is the following:

  • We thought we needed --long-file-names for correctness, so that the gcda-path is part of the coverage file name in case more than one gcov process were to run in each directory.

    • But this case is currently prevented, so --long-file-names is superfluous:

      gcovr/gcovr/gcov.py

      Lines 410 to 413 in 044b5c9

      with locked_directory(chdir):
          out, err = subprocess.Popen(
              cmd, env=env, cwd=chdir, stdout=subprocess.PIPE, stderr=subprocess.PIPE
          ).communicate()
    • My comment in zero coverage rate #524 (comment) was a bit incorrect.
    • I now think this option can be safely removed, until we overhaul parallel coverage collection.
  • We need either --hash-filenames or --preserve-paths so that distinct source files result in different coverage file names.

    • Since --preserve-paths leads to very long filenames, we should prefer --hash-filenames, but only if it is available in that gcov version.
  • In case parallel coverage collection in the same directory shall be implemented in the future, either of the following strategies would work:

    • --long-file-names + --preserve-paths, if the file system can support such long file names
    • --long-file-names + --hash-filenames WITHOUT --preserve-paths, but ensure that there is at most one active gcov process for each (working_directory, basename(gcda_path)) pair.

cc @Spacetown I need your thoughts on this.

@latk
Member

latk commented Nov 22, 2021

@ebmmy Could you try the following patch?

     gcov_options = [
         "--branch-counts",
         "--branch-probabilities",
-        "--preserve-paths",
-        "--long-file-names",
+        "--hash-filenames",
     ]

@ebmmy
Contributor Author

ebmmy commented Nov 22, 2021

Thanks for the really detailed explanation! Based on your input I updated this PR. I think the check_gcov_option function could use a cache though, what do you think?
I'll test tomorrow morning on our codebase and let you know the result with --hash-filenames (--preserve-paths was previously already working fine).

@Spacetown
Member

@latk I agree with removing the option and adding a BIG comment to the directory lock with a link to this PR.
The problem is how to get the available options. There are two ways:

  • We can do it like @ebmmy: use the option and check the output.
  • We can call gcov --help and parse the output for all available options (my preferred solution). If the gcov is LLVM based, gcov --help-hidden should be used to get the help for the needed options.

Of course a cache is needed because the result is fixed and e.g. on Windows starting a process can take up to 500ms.
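A minimal sketch of the help-parsing approach with a cache might look like this. The function names are hypothetical and this is not the code that was merged; the only flags assumed are --help and --help-hidden, as named above, and the command is passed as a tuple so that it can serve as a cache key.

```python
import functools
import re
import subprocess
from typing import Tuple


def help_mentions_option(help_text: str, option: str) -> bool:
    """Return True if `option` occurs as a whole token in the help text."""
    pattern = r"(?<![\w-])" + re.escape(option) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


@functools.lru_cache(maxsize=None)
def read_help_text(cmd: Tuple[str, ...]) -> str:
    """Run `cmd` once per distinct command line and cache its output."""
    done = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print help to stderr
        universal_newlines=True,
    )
    return done.stdout


def supports_option(gcov_cmd: Tuple[str, ...], option: str) -> bool:
    """Check the regular help first, then the hidden help used by LLVM."""
    for help_flag in ("--help", "--help-hidden"):
        try:
            text = read_help_text(gcov_cmd + (help_flag,))
        except OSError:
            continue  # executable missing or not runnable
        if help_mentions_option(text, option):
            return True
    return False


# Example usage (the result depends on the gcov installed locally):
# supports_option(("gcov",), "--hash-filenames")
```

Since lru_cache stores the help text per command line, repeated option checks cost one process start per help flag rather than one per check.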

Other solutions (https://docs.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=cmd):

  • Remove the path limitation by policy (available since Win10, 1607).
    • No real solution because old SW won't work anymore. E.g. Buffers with MAX_PATH are used in many executables.
  • Use DOS device paths on Windows.
    • The wide character API supports paths up to 32767 characters if a DOS device path is used.
    • Needs to be supported by gcov and by python.exe.
    • Files can't be handled by Windows Explorer and DOS commands.

BTW: The max path length in Windows is MAX_PATH, which is defined as 260.

@ebmmy
Contributor Author

ebmmy commented Nov 22, 2021

I'll rework this PR to use a cache and the --help command then 👍

Regarding the path limitation, maybe I was not clear about my problem. The issue occurs on WSL2 which is, as far as I know, not limited by MAX_PATH but by the wide char API, so the maximum full path length is about 32,767 characters. In my case, the limit being exceeded is the maximum length of a single path component (here the file name), which is usually 255 characters.
That being said, the problem would remain for users running gcovr on "native" Windows, so I think it is good to get rid of --long-file-names if it is not absolutely needed 🙂
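The distinction between the full-path limit and the per-component limit can be checked with a small helper. This is a sketch under the assumption of a 255-byte component limit (the value mentioned above, which is typical for file systems such as ext4); overlong_components is a hypothetical name and the example path is made up.

```python
import os

# Assumed per-component limit in bytes; the real value depends on the file system.
MAX_COMPONENT_BYTES = 255


def overlong_components(path: str, limit: int = MAX_COMPONENT_BYTES):
    """Return the path components whose encoded length exceeds `limit`."""
    parts = [part for part in path.replace("\\", "/").split("/") if part]
    return [part for part in parts if len(os.fsencode(part)) > limit]


# A made-up mangled name: short directory, very long file name.
example = "/home/user/build/" + "#dir" * 70 + "#file.cpp.gcov"
print(len(example), [len(part) for part in overlong_components(example)])
```

A path like this stays far below a 32,767 character total, yet its last component alone exceeds the per-component limit.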

@ebmmy
Contributor Author

ebmmy commented Nov 23, 2021

@latk I tested with your patch and I get the expected coverage report without the long filename issue 👍
@Spacetown I updated the PR using --help and --help-hidden for option detection

@Spacetown
Member

@ebmmy But the cache for the used options is missing.

@Spacetown Spacetown added this to the 5.1 milestone Dec 9, 2021
Member

@Spacetown Spacetown left a comment

The cache for the help output is missing.

@Spacetown
Member

Please update the CHANGELOG.rst and the AUTHORS.txt.

@Spacetown
Member

@latk Can you take a look at my changes to wrap the gcov command and cache the options?

@codecov

codecov bot commented Dec 29, 2021

Codecov Report

Merging #528 (5976f60) into master (cd8020a) will decrease coverage by 0.08%.
The diff coverage is 89.36%.

@@            Coverage Diff             @@
##           master     #528      +/-   ##
==========================================
- Coverage   96.18%   96.09%   -0.09%     
==========================================
  Files          22       22              
  Lines        2934     2969      +35     
  Branches      544      554      +10     
==========================================
+ Hits         2822     2853      +31     
- Misses         49       51       +2     
- Partials       63       65       +2     
Flag          Coverage Δ
ubuntu-18.04  95.11% <85.10%> (-0.15%) ⬇️
windows-2019  95.75% <78.72%> (-0.26%) ⬇️

Impacted Files  Coverage Δ
gcovr/gcov.py   82.62% <89.36%> (+1.03%) ⬆️


Member

@latk latk left a comment

This looks reasonable, let's merge it. Any issues that I see are more stylistic nitpicks and I can change them later. (I have been toying with some refactorings to see what recent Python versions have to offer, since Python 3.6 was EOL'd last week).

For example, I'd probably change:

  • naming conventions
  • a more functional (less OOP) approach
  • using subprocess.run() instead of Popen() (we used to not do this as run() was only added in Python 3.5)
  • raising an error if not at least one --help option works

But again, those are nitpicks that are not pressing.
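For reference, the subprocess.run() variant of a Popen(...).communicate() call could look like the sketch below. run_and_capture is a hypothetical helper, and the Python interpreter stands in for gcov so that the example runs anywhere.

```python
import subprocess
import sys


def run_and_capture(cmd, cwd=None, env=None):
    """Run `cmd` to completion and return (stdout, stderr) as bytes."""
    done = subprocess.run(
        cmd, env=env, cwd=cwd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    return done.stdout, done.stderr


# Stand-in command; a real caller would pass the gcov command line instead.
out, err = run_and_capture([sys.executable, "-c", "print('hello')"])
print(out, err)
```

The return value also carries done.returncode, which the Popen(...).communicate() form only exposes through the Popen object.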

The only important change we need before merging is to document the new behaviour and add a changelog item. Since we never made any promises about which gcov command line arguments we use, mentioning it as an improvement in the changelog might be sufficient.

@Spacetown
Member

  • naming conventions
    Do you mean in general or the functions which I introduced?
  • a more functional (less OOP) approach
    For this I used the class to encapsulate the cached data. For other parts we need a more OOP approach like in Add abstract interface for reader/writer. #474.

The only important change we need before merging is to document the new behaviour and add a changelog item. Since we never made any promises about which gcov command line arguments we use, mentioning it as an improvement in the changelog might be sufficient.

I'll update the changelog and add the used options.
I also need to add a comment to the directory lock that this is essential for parallel execution.

@latk
Member

latk commented Dec 30, 2021

naming conventions
Do you mean in general or the functions which I introduced?

E.g. gcov is a class but does not use PascalCase. I'd have called it GcovOptions or something. Doesn't really matter though. This is negligible technical debt, and it's more important to me to merge this PR without extra hurdles.

a more functional (less OOP) approach
For this I used the class to encapsulate the cached data. For other parts we need a more OOP approach like in Add abstract interface for reader/writer. #474.

This was just a subjective remark about my personal preferences. I find stateful objects difficult because every method has to consider the valid object states (is the cache already initialized or not). The more functional approach is to only create an object if it is fully valid. Sketch:

from typing import Optional

class GcovOptions:
  # the cached instance is either absent or fully initialized
  _instance: Optional["GcovOptions"] = None

  @classmethod
  def get_cached(cls, options) -> "GcovOptions":
    # placeholder validity check; a real one might also compare `options`
    if cls._instance is None:
      cls._instance = cls(options)
    return cls._instance

  def __init__(self, options):
    self.foo = ...
    self.bar = ...

  # methods can assume self to be fully initialized,
  # don't have to check any fields
  def whatever(self): ...

GcovOptions.get_cached(options).whatever()  # example usage

This looks quite similar to your solution, but harmonizes much better with type checkers and IDE features like autocomplete.

But I'm just saying this to broadcast values (typechecking is nice, minimal use of mutable class variables is nice), not to suggest a change that should be implemented as part of this PR. It is more important to me that the queue of pending PRs becomes more reasonable so that I can try my hand at some cross-cutting changes without invalidating other work.

I'll update the changelog and add the used options. I also need to add a comment to the directory lock that this is essential for parallel execution.

Thanks for that. But now the changelog only mentions that the documentation changed, not that we made a change to stay within file system limitations. Or is that unnecessary because gcovr 5.0 didn't use --long-file-names?

@Spacetown
Member

Thanks for that. But now the changelog only mentions that the documentation changed, not that we made a change to stay within file system limitations. Or is that unnecessary because gcovr 5.0 didn't use --long-file-names?

I think this is unnecessary because it fixes a regression introduced after 5.0. On the other hand, if a project had problems with long file names in 5.0, this can be solved by using the option --hash-filenames. So I'll change the headline in the changelog.

I was also not sure how to document this in the changelog. The changelog lists the improvements since 5.0, and this is only a fix for one of those improvements. The problem is that some users seem to be using the master branch in production, and for them it is a change. Maybe we can also list regression fixes, e.g.

- (regression) Fix file path limitation from :issue:`525`. (:issue:`528`)

And as a step of the release process, go through the changelog and remove such regression fixes.

@Spacetown Spacetown merged commit 2a2214e into gcovr:master Dec 31, 2021