-
Notifications
You must be signed in to change notification settings - Fork 280
Robust type name to type identifier conversion for C harnesses #5420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Robust type name to type identifier conversion for C harnesses #5420
Conversation
In order to specify the properties which are expected to hold for this function.
In order to make the conversion from type name to type identifier unit testable in isolation.
In order to quickly test that no regressions have occured as part of adding extra functionality.
In order to certain that we don't emmit C harnesses containing invalid identifiers. The unit test includes both a function to generate the string of all printable characters and the same string as a string literal in order to show that the string does include all printable characters and to show what these characters are.
Codecov Report
@@ Coverage Diff @@
## develop #5420 +/- ##
========================================
Coverage 68.20% 68.21%
========================================
Files 1177 1177
Lines 97528 97540 +12
========================================
+ Hits 66523 66535 +12
Misses 31005 31005
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid!
|
||
std::string type2identifier(const typet &type, const namespacet &ns) | ||
{ | ||
return type_name2type_identifier(type2name(type, ns)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
@@ -277,27 +279,52 @@ std::string type2name(const typet &type, const namespacet &ns) | |||
/// If we want utilities like dump_c to work properly identifiers | |||
/// should ideally always be valid C identifiers | |||
/// This replaces some invalid characters that can appear in type2name output. | |||
std::string type2identifier(const typet &type, const namespacet &ns) | |||
std::string type_name2type_identifier(const std::string &name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don’t like the 2
naming convention here, but changing that isn’t really in scope for this PR.
src/ansi-c/type2name.cpp
Outdated
}; | ||
const auto remove_duplicate_underscores = [](const std::string &identifier) { | ||
static const std::regex duplicate_underscore{"_+"}; | ||
return std::regex_replace(identifier, duplicate_underscore, "_"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not completely sure this is always ‘the right thing’ to do, for example some compilers (including gcc
) have a builtin __int128
type, i.e. double underscores in names are a relatively normal thing to see.
That being said, since we only really need these to be related to the original names and not necessarily completely match up with them I suppose this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure this is the right thing to do either - double underscore is a perfectly valid identifier for internal symbols - which you may well get in things like header files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superflous double underscores could be generated in the preceding section of code. For example char**
would be replaced with char_ptr__ptr
, which is what I wanted to tidy up.
However I would also argue that we should not emit identifiers which begin with an underscore in the harness code. This is because identifiers which start with an underscore are reserved for use by the compiler / standard libraries. See section 7.1.3 of the standard - http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
I would argue that because we are outputting the harnesses as C code for user modification, the harnesses are therefore user code. This means that use of identifiers with leading underscores would invoke "undefined behaviour" in terms of the language standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasspriggs note that this is currently not necessarily being used only for the harness – that’s where we found the issue, but the reason it occurred was actually because of the way we generated names for specialisations of generic builtin functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that double-underscore is reserved for compiler/library use, and partly thats why I think it may be ok to emit them - because if you are generating a C file from a goto-binary, that goto-binary may well have legitimate typenames from include files (e.g. from things like #defines, etc) and as such, they will be at least semi-legitimate to emit (assuming the emitted code is then compiled against the same headers/libraries).
I may of course be missing something about the context for this change, so if you have an example of where this fails and you need this, do feel free to point me at it :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I have updated this, in order to retain more underscores.
const auto replace_invalid_characters_with_underscore = | ||
[](const std::string &identifier) { | ||
static const std::regex non_alpha_numeric{"[^A-Za-z0-9]"}; | ||
return std::regex_replace(identifier, non_alpha_numeric, "_"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, I assume, risks generating clashing names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chrisr-diffblue Yes, in general this isn’t generating a unique identifier. The idea would be to use this as an identifier fragment useful for generating a human-readable-but-unique identifier elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were already risking clashing names, rather than guaranteeing uniqueness. For example fruit*
will be replaced with fruit_ptr_
, but there was no guarantee that something else in user code wasn't already called fruit_ptr
.
/** | ||
* Constructs a string describing the given type, which can be used as part of | ||
* a `C` identifier. The resulting identifier is not guaranteed to uniquely | ||
* identify the type in all cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might need clarifying - in fact, it might actually be worth making this explicit in the function name, e.g. rename it to something like type_to_approximate_identifier
or type_to_nonunique_identifier
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that "identifier is a bit of a misnomer in this case. However nonunique_identifier
seems like a bit of an oxymoron to me. How does the term partial_identifier
sound to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial_identifer
sounds reasonable to me - it makes it clearer to an unsuspecting caller that they need to understand what is returning :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Because the return value of this function does not uniquely identify a type, it is only used as part of a unique identifier.
d80bdcf
to
92a1d26
Compare
@chrisr-diffblue Please can you re-review? Comments are addressed in new commits at the end of the PR. |
In order to retain duplicate underscores from original type identifiers. Invalid character removal is also tweaked to reduce the number of underscores introduced in the previous step.
In order to better support UTF-8.
Because the resulting partial identifiers are considered to be internal, it is fine for them to start with an underscore.
92a1d26
to
a9f7f01
Compare
I have pushed to restart CI only. The windows job appeared to fail for reasons unrelated to my changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks. Approving, with a slight note of caution that if CBMC ever supports (or claims to support) internationalised source code (i.e. getting into the whole input file encoding/internal encoding question) then this might need revisiting. It won't be the only place that'll need revisiting though :-)
I did take a look at the implementation of the lexer and there does seem to be some UTF-8 support in there. But it is reasonable to assume that there will be incompatibilities, if there are no end-to-end regression tests for such functionality. |
This PR makes the type name to type identifier conversion used in generating C harnesses more robust.
The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/My commit message includes data points confirming performance improvements (if claimed).