Robust type name to type identifier conversion for C harnesses #5420

thomasspriggs · 2020-07-13T15:38:29Z

This PR makes the type name to type identifier conversion used in generating C harnesses more robust.

Each commit message has a non-empty body, explaining why the change was made.
Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
n/a: this improves existing functionality by reducing potential compile errors in generated harnesses, and I assume that these potential errors are not documented. ~~The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/~~
Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
n/a - None claimed ~~My commit message includes data points confirming performance improvements (if claimed).~~
My PR is restricted to a single feature or bugfix.
White-space or formatting changes outside the feature-related changed lines are in commits of their own.

In order to specify the properties which are expected to hold for this function.

In order to make the conversion from type name to type identifier unit testable in isolation.

In order to quickly test that no regressions have occurred as part of adding extra functionality.

In order to be certain that we don't emit C harnesses containing invalid identifiers. The unit test includes both a function to generate the string of all printable characters and the same string as a string literal, both to show that the string does include all printable characters and to show what these characters are.
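As a rough illustration of the kind of test described above, the following sketch generates the printable ASCII characters and checks whether a string is a plain ASCII C identifier. Both function names are hypothetical and are not taken from the PR; the default "C" locale is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: collects every printable character in the 7-bit ASCII
// range, using std::isprint under the default "C" locale.
std::string all_printable_characters()
{
  std::string result;
  for(int c = 0; c < 128; ++c)
  {
    if(std::isprint(c))
      result.push_back(static_cast<char>(c));
  }
  return result;
}

// Hypothetical check: true if `s` is non-empty, does not start with a digit
// and contains only ASCII letters, digits and underscores.
bool is_ascii_c_identifier(const std::string &s)
{
  if(s.empty() || std::isdigit(static_cast<unsigned char>(s[0])))
    return false;
  for(const char ch : s)
  {
    const auto c = static_cast<unsigned char>(ch);
    if(!std::isalnum(c) && c != '_')
      return false;
  }
  return true;
}

// Illustrative checks only.
void printable_character_examples()
{
  const std::string printable = all_printable_characters();
  assert(printable.front() == ' ' && printable.back() == '~');
  assert(!is_ascii_c_identifier(printable));
  assert(is_ascii_c_identifier("char_ptr_"));
}
```

A test along these lines could feed the printable string through the conversion and then assert that the result passes the identifier check.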

codecov · 2020-07-13T16:47:12Z

Codecov Report

Merging #5420 into develop will increase coverage by 0.00%.
The diff coverage is 95.23%.

@@           Coverage Diff            @@
##           develop    #5420   +/-   ##
========================================
  Coverage    68.20%   68.21%           
========================================
  Files         1177     1177           
  Lines        97528    97540   +12     
========================================
+ Hits         66523    66535   +12     
  Misses       31005    31005

| Flag | Coverage Δ | |
| --- | --- | --- |
| #cproversmt2 | `42.76% <95.23%> (+0.01%)` | ⬆️ |
| #regression | `65.38% <95.23%> (+<0.01%)` | ⬆️ |
| #unit | `32.26% <71.42%> (+0.01%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| src/ansi-c/c_typecheck_expr.cpp | `75.01% <75.00%> (+0.01%)` | ⬆️ |
| src/ansi-c/type2name.cpp | `95.49% <100.00%> (+0.49%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0c08382...a9f7f01.

NlightNFotis

Solid!

NlightNFotis · 2020-07-14T10:12:27Z

src/ansi-c/type2name.cpp

+
+std::string type2identifier(const typet &type, const namespacet &ns)
+{
+  return type_name2type_identifier(type2name(type, ns));


hannes-steffenhagen-diffblue · 2020-07-14T10:21:21Z

src/ansi-c/type2name.cpp

@@ -277,27 +279,52 @@ std::string type2name(const typet &type, const namespacet &ns)
 /// If we want utilities like dump_c to work properly identifiers
 /// should ideally always be valid C identifiers
 /// This replaces some invalid characters that can appear in type2name output.
-std::string type2identifier(const typet &type, const namespacet &ns)
+std::string type_name2type_identifier(const std::string &name)


I still don’t like the 2 naming convention here, but changing that isn’t really in scope for this PR.

hannes-steffenhagen-diffblue · 2020-07-14T10:38:05Z

src/ansi-c/type2name.cpp

+    };
+  const auto remove_duplicate_underscores = [](const std::string &identifier) {
+    static const std::regex duplicate_underscore{"_+"};
+    return std::regex_replace(identifier, duplicate_underscore, "_");


I’m not completely sure this is always ‘the right thing’ to do, for example some compilers (including gcc) have a builtin __int128 type, i.e. double underscores in names are a relatively normal thing to see.

That being said, since we only really need these to be related to the original names and not necessarily completely match up with them I suppose this is fine.

Yeah, I'm not sure this is the right thing to do either - double underscores are perfectly valid in identifiers for internal symbols, which you may well get in things like header files.

Superfluous double underscores could be generated in the preceding section of code. For example char** would be replaced with char_ptr__ptr, which is what I wanted to tidy up.

However I would also argue that we should not emit identifiers which begin with an underscore in the harness code. This is because identifiers which start with an underscore are reserved for use by the compiler / standard libraries. See section 7.1.3 of the standard - http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
I would argue that because we are outputting the harnesses as C code for user modification, the harnesses are therefore user code. This means that use of identifiers with leading underscores would invoke "undefined behaviour" in terms of the language standards.

@thomasspriggs note that this is currently not necessarily being used only for the harness – that’s where we found the issue, but the reason it occurred was actually because of the way we generated names for specialisations of generic builtin functions.

I know that double-underscore is reserved for compiler/library use, and partly that's why I think it may be ok to emit them - because if you are generating a C file from a goto-binary, that goto-binary may well have legitimate typenames from include files (e.g. from things like #defines, etc.) and as such, they will be at least semi-legitimate to emit (assuming the emitted code is then compiled against the same headers/libraries).

I may of course be missing something about the context for this change, so if you have an example of where this fails and you need this, do feel free to point me at it :-)

Ok. I have updated this, in order to retain more underscores.
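The trade-off discussed in this thread can be reproduced in isolation. The sketch below lifts the lambda quoted in the diff into a free function (the free-function form and the examples function are only for illustration) and shows that it tidies up `char_ptr__ptr` but also alters a name such as `__int128`.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Same regex and replacement as the lambda quoted in the review: every run of
// one or more underscores is replaced by a single underscore.
std::string remove_duplicate_underscores(const std::string &identifier)
{
  static const std::regex duplicate_underscore{"_+"};
  return std::regex_replace(identifier, duplicate_underscore, "_");
}

// Illustrative checks only.
void duplicate_underscore_examples()
{
  // The intended tidy-up of underscores introduced by earlier replacements.
  assert(remove_duplicate_underscores("char_ptr__ptr") == "char_ptr_ptr");
  // The side effect raised in review: a builtin type name is changed too.
  assert(remove_duplicate_underscores("__int128") == "_int128");
}
```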

chrisr-diffblue · 2020-07-14T11:01:05Z

src/ansi-c/type2name.cpp

+  const auto replace_invalid_characters_with_underscore =
+    [](const std::string &identifier) {
+      static const std::regex non_alpha_numeric{"[^A-Za-z0-9]"};
+      return std::regex_replace(identifier, non_alpha_numeric, "_");


This, I assume, risks generating clashing names?

@chrisr-diffblue Yes, in general this isn’t generating a unique identifier. The idea would be to use this as an identifier fragment useful for generating a human-readable-but-unique identifier elsewhere.

We were already risking clashing names, rather than guaranteeing uniqueness. For example fruit* will be replaced with fruit_ptr_, but there was no guarantee that something else in user code wasn't already called fruit_ptr.
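To make the clash concrete, the sketch below wraps the regex from the quoted diff in a free function (the wrapper and the example inputs are illustrative assumptions) and shows two different inputs mapping to the same output.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Same regex and replacement as the lambda quoted in the review: each
// character outside A-Z, a-z and 0-9 becomes an underscore.
std::string
replace_invalid_characters_with_underscore(const std::string &identifier)
{
  static const std::regex non_alpha_numeric{"[^A-Za-z0-9]"};
  return std::regex_replace(identifier, non_alpha_numeric, "_");
}

// Illustrative checks only: distinct inputs can collapse to one output.
void clashing_name_examples()
{
  assert(
    replace_invalid_characters_with_underscore("unsigned int") ==
    "unsigned_int");
  assert(
    replace_invalid_characters_with_underscore("unsigned-int") ==
    "unsigned_int");
}
```

Because the mapping is many-to-one, the output is only suitable as a fragment of a name that is made unique by some other means.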

chrisr-diffblue · 2020-07-14T11:05:31Z

src/ansi-c/type2name.h

+/**
+ * Constructs a string describing the given type, which can be used as part of
+ * a `C` identifier. The resulting identifier is not guaranteed to uniquely
+ * identify the type in all cases.


This might need clarifying - in fact, it might actually be worth making this explicit in the function name, e.g. rename it to something like type_to_approximate_identifier or type_to_nonunique_identifier.

I get that "identifier" is a bit of a misnomer in this case. However nonunique_identifier seems like a bit of an oxymoron to me. How does the term partial_identifier sound to you?

partial_identifier sounds reasonable to me - it makes it clearer to an unsuspecting caller that they need to understand what it is returning :-)

Because the return value of this function does not uniquely identify a type, it is only used as part of a unique identifier.
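One way a caller might turn a partial identifier into a unique one is sketched below. This is a hypothetical illustration of the idea only: the function name, the numeric-suffix scheme and the use of a `std::set` of taken names are assumptions and do not describe how CBMC itself generates unique names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: returns `partial_identifier` if it is still unused,
// otherwise appends "_1", "_2", ... until an unused name is found. The chosen
// name is recorded in `used_names`.
std::string make_unique_identifier(
  const std::string &partial_identifier,
  std::set<std::string> &used_names)
{
  std::string candidate = partial_identifier;
  for(std::size_t suffix = 1; !used_names.insert(candidate).second; ++suffix)
    candidate = partial_identifier + "_" + std::to_string(suffix);
  return candidate;
}

// Illustrative checks only.
void unique_identifier_examples()
{
  std::set<std::string> used_names;
  assert(make_unique_identifier("fruit_ptr", used_names) == "fruit_ptr");
  assert(make_unique_identifier("fruit_ptr", used_names) == "fruit_ptr_1");
}
```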

thomasspriggs · 2020-07-14T16:10:40Z

@chrisr-diffblue Please can you re-review? Comments are addressed in new commits at the end of the PR.

In order to retain duplicate underscores from original type identifiers. Invalid character removal is also tweaked to reduce the number of underscores introduced in the previous step.

In order to better support UTF-8.

Because the resulting partial identifiers are considered to be internal, it is fine for them to start with an underscore.
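How UTF-8 characters and leading underscores can both survive the conversion may be easier to see in code. The sketch below is an assumption about one possible approach, not the code merged in this PR: it keeps ASCII letters, digits and underscores, keeps every byte with the high bit set (such bytes only occur inside multi-byte UTF-8 sequences) and replaces everything else with an underscore. The function name is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: replaces invalid ASCII characters with '_' while
// leaving underscores and non-ASCII (UTF-8 multi-byte) bytes untouched.
std::string replace_invalid_ascii_keep_utf8(const std::string &name)
{
  std::string result;
  result.reserve(name.size());
  for(const char ch : name)
  {
    const auto byte = static_cast<unsigned char>(ch);
    const bool is_ascii_alnum = (byte >= '0' && byte <= '9') ||
                                (byte >= 'A' && byte <= 'Z') ||
                                (byte >= 'a' && byte <= 'z');
    const bool is_non_ascii = byte >= 0x80;
    const bool keep = is_ascii_alnum || byte == '_' || is_non_ascii;
    result.push_back(keep ? ch : '_');
  }
  return result;
}

// Illustrative checks only.
void utf8_retention_examples()
{
  // "\xC3\xA9" is the UTF-8 encoding of U+00E9; both bytes are retained.
  assert(
    replace_invalid_ascii_keep_utf8("caf\xC3\xA9 type*") ==
    "caf\xC3\xA9_type_");
  // Leading and duplicate underscores are retained as well.
  assert(replace_invalid_ascii_keep_utf8("__int128") == "__int128");
}
```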

thomasspriggs · 2020-07-14T17:24:15Z

I have pushed only to restart CI. The Windows job appeared to fail for reasons unrelated to my changes.

chrisr-diffblue

Looks good, thanks. Approving, with a slight note of caution that if CBMC ever supports (or claims to support) internationalised source code (i.e. getting into the whole input file encoding/internal encoding question) then this might need revisiting. It won't be the only place that'll need revisiting though :-)

thomasspriggs · 2020-07-16T10:25:17Z

I did take a look at the implementation of the lexer and there does seem to be some UTF-8 support in there. But it is reasonable to assume that there will be incompatibilities, if there are no end-to-end regression tests for such functionality.

thomasspriggs added 4 commits July 10, 2020 19:24

Document type2identifier function

5a3e001

In order to specify the properties which are expected to hold for this function.

Refactor type2identifier, splitting into 2 functions

7a20ab6

In order to make the conversion from type name to type identifier unit testable in isolation.

Add unit test of existing type_name2type_identifier behaviour

83e2756

In order to quickly test that no regressions have occurred as part of adding extra functionality.

thomasspriggs requested review from chrisr-diffblue, kroening and tautschnig as code owners July 13, 2020 15:38

thomasspriggs requested review from NlightNFotis and hannes-steffenhagen-diffblue July 13, 2020 15:38

NlightNFotis approved these changes Jul 14, 2020

hannes-steffenhagen-diffblue approved these changes Jul 14, 2020

chrisr-diffblue reviewed Jul 14, 2020

Rename type2identifier to type_to_partial_identifier

4a8ca1a

Because the return value of this function does not uniquely identify a type, it is only used as part of a unique identifier.

thomasspriggs force-pushed the tas/type2identifier_v2 branch from d80bdcf to 92a1d26 Compare July 14, 2020 16:08

thomasspriggs added 3 commits July 14, 2020 18:23

Remove duplicate underscore removal

05f85d2

In order to retain duplicate underscores from original type identifiers. Invalid character removal is also tweaked to reduce the number of underscores introduced in the previous step.

Make type_name2type_identifier retain UTF-8 characters

2733b56

In order to better support UTF-8.

Make type_name2type_identifier retain leading underscores

a9f7f01

Because the resulting partial identifiers are considered to be internal, it is fine for them to start with an underscore.

thomasspriggs force-pushed the tas/type2identifier_v2 branch from 92a1d26 to a9f7f01 Compare July 14, 2020 17:23

chrisr-diffblue approved these changes Jul 16, 2020

thomasspriggs merged commit 069db73 into diffblue:develop Jul 16, 2020

thomasspriggs deleted the tas/type2identifier_v2 branch July 16, 2020 10:22
