
gcovr fails with GCC 8.0.1 #226

Closed
ernestask opened this issue Feb 22, 2018 · 10 comments

Comments

@ernestask

This is actually a different trace than I got the first time, but trying to generate a coverage report in a CI pipeline results in this:

Traceback (most recent call last):
  File "/usr/local/bin/gcovr", line 2412, in <module>
    process_datafile(file_, covdata, options)
  File "/usr/local/bin/gcovr", line 884, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/usr/local/bin/gcovr", line 616, in process_gcov_data
    covered[lineno] = int(segments[0].strip())
ValueError: invalid literal for int() with base 10: '1*'

The one I saw before was an IndexError at this line:

code = segments[2].strip()

The distro is current Fedora Rawhide with gcovr installed via pip, as the distro package is outdated now.

Full log: https://gitlab.gnome.org/GNOME/nautilus/-/jobs/8807/raw.
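
For the record, the failing statement reduces to splitting each `.gcov` line on colons and calling `int()` on the first column; the code below paraphrases the parser lines named in the traceback, it is not gcovr's exact source:

```python
# Sketch of the parsing step the traceback points at.  A GCC 8 .gcov
# source line can look like "       1*:    7:  Foo(): b (1000) {}".
line = "       1*:    7:  Foo(): b (1000) {}"

segments = line.split(":", 2)       # [count, line number, source code]
lineno = int(segments[1].strip())   # 7 -- parses fine

try:
    covered = int(segments[0].strip())   # "1*" is not a plain integer
except ValueError as err:
    print(err)    # invalid literal for int() with base 10: '1*'

# The IndexError variant: a "------------------" separator line
# contains no colons, so segments[2] does not exist for it.
assert len("------------------".split(":", 2)) == 1
```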

@latk
Member

latk commented Feb 22, 2018

Thank you for this report. Gcovr depends on GCOV to pre-process the coverage reports. We then parse the human-readable .gcov files. Our tests currently only use GCC 5, but other versions are known to work as well. Your report suggests that GCC 8 may change the GCOV report format.

In order to debug this problem, could you run gcovr in verbose mode (gcovr -v)? This will indicate the .gcov file causing these problems. It would be super helpful if you could provide a .gcov file on which gcovr chokes. I currently can't install GCC 8 to check this myself.

Until I can address the root cause, it might be possible to make gcovr more robust. If parsing one file fails, it will be excluded and gcovr continues with the next file. After generating the reports, gcovr would exit with an error. Would such a fail-later behaviour be helpful for you?

In the meanwhile, please consider only collecting coverage with an earlier compiler, e.g. GCC 7.x.

@ernestask
Author

> In order to debug this problem, could you run gcovr in verbose mode (gcovr -v)? This will indicate the .gcov file causing these problems. It would be super helpful if you could provide a .gcov file on which gcovr chokes. I currently can't install GCC 8 to check this myself.

I was hoping I wouldn’t have to do this, because I have 7.3.1 installed on my machine, but I’ll report back if I can reproduce the issue locally.

> Until I can address the root cause, it might be possible to make gcovr more robust. If parsing one file fails, it will be excluded and gcovr continues with the next file. After generating the reports, gcovr would exit with an error. Would such a fail-later behaviour be helpful for you?

Really, I’d be happy with anything, including a NotImplementedError. Then I’d know that I should avoid doing the thing I’m doing. :)

> In the meanwhile, please consider only collecting coverage with an earlier compiler, e.g. GCC 7.x.

That’s not a problem, just that I switched from F27 to Rawhide for a dependency that hasn’t hit stable yet.

@latk
Member

latk commented Feb 23, 2018

Ok, the development version of the GCOV documentation shows major changes to the .gcov file format. These changes are good, but break our current parser in many ways.

Here is an excerpt of such a .gcov file from their docs:

        -:    3:template<class T>
        -:    4:class Foo
        -:    5:{
        -:    6:  public:
       1*:    7:  Foo(): b (1000) {}
------------------
Foo<char>::Foo():
function Foo<char>::Foo() called 0 returned 0% blocks executed 0%
    #####:    7:  Foo(): b (1000) {}
------------------
Foo<int>::Foo():
function Foo<int>::Foo() called 1 returned 100% blocks executed 100%
        1:    7:  Foo(): b (1000) {}
------------------
       2*:    8:  void inc () { b++; }
------------------
Foo<char>::inc():
function Foo<char>::inc() called 0 returned 0% blocks executed 0%
    #####:    8:  void inc () { b++; }
------------------
Foo<int>::inc():
function Foo<int>::inc() called 2 returned 100% blocks executed 100%
        2:    8:  void inc () { b++; }
------------------
        -:    9:
        -:   10:  private:
        -:   11:  int b;
        -:   12:};

<snip>

        -:   29:
       1*:   30:  int v = total > 100 ? 1 : 2;
branch  0 taken 0% (fallthrough)
branch  1 taken 100%

They explain:

> Note that line 7 is shown in the report multiple times. First occurrence presents total number of execution of the line and the next two belong to instances of class Foo constructors. As you can also see, line 30 contains some unexecuted basic blocks and thus execution count has asterisk symbol.

That is really cool, but complicates parsing. This weekend I want to clean up the gcov parsing code. As part of that, I'll try to add a few GCC 8-related improvements:

  • If the execution count column ends with an asterisk, strip it – we parse branch counts explicitly.
  • Exclude the template specialization sections. Unfortunately there's no clear start and end. Each template section seems to start with a function name in the first column, which should look sufficiently different from source lines and tags.
  • If a line fails to be parsed, warn and continue. This may throw off single lines in the report, but will generally still be correct.

However, I don't have a way to test this myself. I'll update this issue if I have something that I believe might work ;-)

@ernestask
Author

That’s awesome, thank you for looking into this. I haven’t had time to set up a testing environment properly, but I’ll test whatever you throw my way.

@ernestask
Author

This one causes this error:

Traceback (most recent call last):
  File "/home/ernestas/.local/bin/gcovr", line 2412, in <module>
    process_datafile(file_, covdata, options)
  File "/home/ernestas/.local/bin/gcovr", line 884, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/home/ernestas/.local/bin/gcovr", line 616, in process_gcov_data
    covered[lineno] = int(segments[0].strip())
ValueError: invalid literal for int() with base 10: '2*'

And this one seems to be causing this error:

Traceback (most recent call last):
  File "/home/ernestas/.local/bin/gcovr", line 2412, in <module>
    process_datafile(file_, covdata, options)
  File "/home/ernestas/.local/bin/gcovr", line 884, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/home/ernestas/.local/bin/gcovr", line 601, in process_gcov_data
    code = segments[2].strip()
IndexError: list index out of range

Hope these files help.

@latk
Member

latk commented Feb 24, 2018 via email

latk added a commit to latk/gcovr that referenced this issue Feb 24, 2018
See gcovr#226 for background.

At this opportunity, the parsing was made a bit more robust.

  - Any exceptions during parsing are swallowed.
    Parsing will always succeed, but may be garbage.
  - After parsing, failed lines and any exceptions are summarized.
  - These errors are shown always, not just in verbose mode.
  - The error message links directly to the Github issues.
@latk
Member

latk commented Feb 24, 2018

I've implemented something that should mostly work. Please install from my branch to test this:

pip install git+https://github.com/latk/gcovr.git@simplify-gcov

Gcovr should now always carry on even when there are parse errors, and will report problems on STDERR. Please let me know if there are remaining issues.

@ernestask
Author

I don’t see anything in stderr and the coverage is as pathetic as ever (it works!). ;)

@latk
Member

latk commented Mar 6, 2018

OK, the changes have now been merged into master. You can use this feature by pip-installing the development version from this repository. A PyPI release is still some time away.

Thanks again for bringing this issue to my attention!

latk closed this as completed Mar 6, 2018
@ernestask
Author

Cool, thanks for your work. :)

JamesReynolds pushed a commit to JamesReynolds/gcovr that referenced this issue Mar 8, 2018
latk added a commit to latk/gcovr that referenced this issue Sep 10, 2021
latk added a commit to latk/gcovr that referenced this issue Sep 10, 2021
Summary: This commit introduces a completely rewritten parser for gcov's
human-readable report format. The implementation pays closer attention to the
upstream file format, e.g. with respect to the exact number formatting. Compared
to the previous parser, the data flow has been greatly clarified, making this
parser more obviously correct. The new parser will make it easier to react to
future changes in the format, and to make use of improvements from newer GCC
versions – such as the `Working directory` metadata.

Background
----------

The old parser had some problems that could be addressed by this rewrite.

* The old parser was very complex due to mixing low level parsing and high-level
  decisions about coverage exclusion.

  Fixed in the new parser: now, parsing is split into multiple phases that
  clearly separate concerns. The low-level parsing phase produces a data model
  that is safe to use for the other phases.

* The old parser used an object-oriented design where parsing state was stored in
  instance fields. This created a huge "state space" that was difficult to
  track, potentially leading to bugs. For example, it was not immediately clear
  in the investigation of <gcovr#511> whether
  the problem was just that the parser forgot to update its line number
  correctly.

  Fixed in the new parser: using a more functional/procedural design, the data
  flows are very clear. State is tracked explicitly. By separating parsing
  phases, the state space is much smaller.

* To expand on the previous point: The old parser essentially consisted of
  multiple interleaved state machines. There was one state machine for
  processing coverage data, and an interleaved state machine for tracking active
  coverage exclusion markers. This interleaving made it difficult to understand
  whether the state transitions were correct.

  Fixed in the new parser: coverage exclusion patterns are collected in a
  separate phase, before the actual coverage data is processed.

* The old parser made use of very fragile parsing strategies, such as using
  `str.split()` excessively. This gave rise to fragile assumptions about the
  exact format. For example, the IndexError in
  <gcovr#226> was an example of wrong
  assumptions.

  The new parser uses regular expressions to parse tag lines, and only uses
  `str.split()` for the more structured source code lines. This is more
  self-documenting. Treatment of numerical values was aligned with the routines
  in the gcov source code. Should the format deviate in the future, the regexes
  will fail to match, making it possible to detect and fix the errors. (Until
  then, `--gcov-ignore-parse-errors` can be used).

Design of the new parser
------------------------

The new parser is more complex in the sense that there is a lot more code. But
there is a clearer separation of concerns, and the parser was closely informed
by the gcov documentation and source code. As a result, I am confident that it
handles far more edge cases correctly, in particular relating to the handling of
numbers/percentages.

There are three items for external use:

**`parse_metadata(lines)`** creates a dict of values from the metadata lines.
The old parser handled the very first line of the report separately to extract
the filename. The new parser uses the same more robust parsing code for this
metadata.
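
Metadata lines are the entries with line number 0 at the top of a `.gcov` report (`Source`, `Runs`, `Working directory`, …). As a sketch of the idea (not gcovr's exact implementation):

```python
def parse_metadata(lines):
    """Collect gcov metadata lines into a dict.

    Metadata lines use line number 0, e.g.
    "        -:    0:Source:foo.c".
    """
    metadata = {}
    for line in lines:
        count, lineno, rest = line.split(":", 2)
        if lineno.strip() != "0":
            break                      # metadata only precedes the source
        key, _, value = rest.partition(":")
        metadata[key] = value
    return metadata

lines = [
    "        -:    0:Source:foo.c",
    "        -:    0:Runs:1",
    "        -:    1:int main() {",
]
assert parse_metadata(lines) == {"Source": "foo.c", "Runs": "1"}
```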

**`ParserFlags`** is a flag-enum that describes various boolean features. A single
object with flags seems simpler to handle than multiple variables like
`exclude_throw_branches`.

**`parse_coverage(lines, ...)`** is the main function for parsing the coverage.
It performs multiple phases:

* Each input line is parsed/tokenized into an internal data model. The data
  model is strongly typed. The various classes like `_SourceLine` are
  implemented as NamedTuples, which is both very convenient and very
  memory-efficient.

  Relevant items: `_parse_line()`, `_parse_tag_line()`, data model

* Exclusion markers are extracted from source code lines and arranged into a
  data structure for later lookup.

  Relevant items: `_find_excluded_ranges()`, `_make_is_in_any_range()`

* Parsed lines are processed to populate a `FileCoverage` model. At this stage,
  exclusions are applied. The state space is very small, with only four
  variables that have to be tracked.

  Relevant items: `_ParserState`, `_gather_coverage_from_line()`,
  `_line_noncode_and_count()`, `_function_can_be_excluded()`,
  `_branch_can_be_excluded()`, `_add_coverage_for_function()`

* Warnings are reported, and any potential errors re-raised. This is equivalent
  to the previous parser.

  Relevant items: `_report_lines_with_errors()`

Impact on tests
---------------

The new parser is almost completely bug-compatible with the old parser. This is
e.g. visible in the potentially unintuitive handling of the `noncode` status.
The new separation between low-level parsing and high-level decisions makes it
more clear what is actually going on.

There was a significant change in the **Nautilus parser test**. The input file
contains the following pattern:

    ------------------
        #####:   52:foo() ? bar():

Previously, this line 52 was not reported as uncovered. I consider that to be an
error, and have updated the expected output correspondingly. This could indicate
that the new parser is in fact more robust than the old parser when it comes to
template specialization sections.

In the **excl-branch test**, gcovr will encounter gcov input such as the following
when using GCC-8 or later:

        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    ------------------
    Bar::~Bar():
    function Bar::~Bar() called 0 returned 0% blocks executed 0%
        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    call    0 never executed
    call    1 never executed
    ------------------
    Bar::~Bar():
    function Bar::~Bar() called 0 returned 0% blocks executed 0%
        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    ------------------

The old parser associated the `function` annotations with line 11. This was
clearly incorrect. The test reference was updated to associate the destructor
with line 9.

Other than that, the tests were only updated to account for the different parser
APIs.

Internally, the new parser uses a lot of doctests.

Future directions
-----------------

The new parser extracts *all* available data, only to throw it away. It might
now become feasible to make use of more of this data. In particular:

* handling template specialization sections properly
* collecting block-level coverage data
* using the `working directory` metadata field

Conflicts with other development efforts
----------------------------------------

* <gcovr#503> report of excluded coverage

  Makes a small patch to the parser. The same effect can be achieved by adding a
  few lines in `_gather_coverage_from_line()`.

* <gcovr#484> tests with clang-10

  Touches neighboring lines. Will be reported as a merge conflict by Git, but
  there's no semantic conflict.

* <gcovr#474> abstract interface for reader/writer

  Small change in the parser code regarding `sys.exit(1)` (new parser:
  `raise SystemExit(1)`). It's worth noting that this is effectively unreachable
  code. Lines will only be reported if there was an exception, and if there was
  an exception it will be re-thrown.

* <gcovr#361> --no-markers to ignore exclusion markers

  Touches the exclusion handling code. This is of course totally changed by the
  new parser. But the new parser would make it even easier to implement that
  functionality.

* <gcovr#350> decision coverage

  Adds significant new parsing code, but most of it is outside of the
  gcov-parser. These changes could be ported with moderate effort to the new
  parser.
latk added a commit to latk/gcovr that referenced this issue Sep 10, 2021
latk added a commit to latk/gcovr that referenced this issue Sep 15, 2021