Skip to content

Robust type name to type identifier conversion for C harnesses #5420

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 16, 2020

Conversation

thomasspriggs
Copy link
Contributor

@thomasspriggs thomasspriggs commented Jul 13, 2020

This PR makes the type name to type identifier conversion used in generating C harnesses more robust.

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • n/a Because this improves existing functionality, to reduce potential compile errors in generated harnesses and I assume that these potential errors are not documented. The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • n/a - Non claimed My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

In order to specify the properties which are expected to hold for this
function.
In order to make the conversion from type name to type identifier unit
testable in isolation.
In order to quickly test that no regressions have occured as part of
adding extra functionality.
In order to certain that we don't emmit C harnesses containing invalid
identifiers.

The unit test includes both a function to generate the string of all
printable characters and the same string as a string literal in order to
show that the string does include all printable characters and to show
what these characters are.
@codecov
Copy link

codecov bot commented Jul 13, 2020

Codecov Report

Merging #5420 into develop will increase coverage by 0.00%.
The diff coverage is 95.23%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #5420   +/-   ##
========================================
  Coverage    68.20%   68.21%           
========================================
  Files         1177     1177           
  Lines        97528    97540   +12     
========================================
+ Hits         66523    66535   +12     
  Misses       31005    31005           
Flag Coverage Δ
#cproversmt2 42.76% <95.23%> (+0.01%) ⬆️
#regression 65.38% <95.23%> (+<0.01%) ⬆️
#unit 32.26% <71.42%> (+0.01%) ⬆️
Impacted Files Coverage Δ
src/ansi-c/c_typecheck_expr.cpp 75.01% <75.00%> (+0.01%) ⬆️
src/ansi-c/type2name.cpp 95.49% <100.00%> (+0.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c08382...a9f7f01. Read the comment docs.

Copy link
Contributor

@NlightNFotis NlightNFotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid!


std::string type2identifier(const typet &type, const namespacet &ns)
{
return type_name2type_identifier(type2name(type, ns));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

@@ -277,27 +279,52 @@ std::string type2name(const typet &type, const namespacet &ns)
/// If we want utilities like dump_c to work properly identifiers
/// should ideally always be valid C identifiers
/// This replaces some invalid characters that can appear in type2name output.
std::string type2identifier(const typet &type, const namespacet &ns)
std::string type_name2type_identifier(const std::string &name)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don’t like the 2 naming convention here, but changing that isn’t really in scope for this PR.

};
const auto remove_duplicate_underscores = [](const std::string &identifier) {
static const std::regex duplicate_underscore{"_+"};
return std::regex_replace(identifier, duplicate_underscore, "_");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not completely sure this is always ‘the right thing’ to do, for example some compilers (including gcc) have a builtin __int128 type, i.e. double underscores in names are a relatively normal thing to see.

That being said, since we only really need these to be related to the original names and not necessarily completely match up with them I suppose this is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure this is the right thing to do either - double underscore is a perfectly valid identifier for internal symbols - which you may well get in things like header files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superflous double underscores could be generated in the preceding section of code. For example char** would be replaced with char_ptr__ptr, which is what I wanted to tidy up.

However I would also argue that we should not emit identifiers which begin with an underscore in the harness code. This is because identifiers which start with an underscore are reserved for use by the compiler / standard libraries. See section 7.1.3 of the standard - http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
I would argue that because we are outputting the harnesses as C code for user modification, the harnesses are therefore user code. This means that use of identifiers with leading underscores would invoke "undefined behaviour" in terms of the language standards.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasspriggs note that this is currently not necessarily being used only for the harness – that’s where we found the issue, but the reason it occurred was actually because of the way we generated names for specialisations of generic builtin functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that double-underscore is reserved for compiler/library use, and partly thats why I think it may be ok to emit them - because if you are generating a C file from a goto-binary, that goto-binary may well have legitimate typenames from include files (e.g. from things like #defines, etc) and as such, they will be at least semi-legitimate to emit (assuming the emitted code is then compiled against the same headers/libraries).

I may of course be missing something about the context for this change, so if you have an example of where this fails and you need this, do feel free to point me at it :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I have updated this, in order to retain more underscores.

const auto replace_invalid_characters_with_underscore =
[](const std::string &identifier) {
static const std::regex non_alpha_numeric{"[^A-Za-z0-9]"};
return std::regex_replace(identifier, non_alpha_numeric, "_");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, I assume, risks generating clashing names?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisr-diffblue Yes, in general this isn’t generating a unique identifier. The idea would be to use this as an identifier fragment useful for generating a human-readable-but-unique identifier elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were already risking clashing names, rather than guaranteeing uniqueness. For example fruit* will be replaced with fruit_ptr_, but there was no guarantee that something else in user code wasn't already called fruit_ptr.

/**
* Constructs a string describing the given type, which can be used as part of
* a `C` identifier. The resulting identifier is not guaranteed to uniquely
* identify the type in all cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might need clarifying - in fact, it might actually be worth making this explicit in the function name, e.g. rename it to something like type_to_approximate_identifier or type_to_nonunique_identifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that "identifier is a bit of a misnomer in this case. However nonunique_identifier seems like a bit of an oxymoron to me. How does the term partial_identifier sound to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial_identifer sounds reasonable to me - it makes it clearer to an unsuspecting caller that they need to understand what is returning :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Because the return value of this function does not uniquely identify a
type, it is only used as part of a unique identifier.
@thomasspriggs thomasspriggs force-pushed the tas/type2identifier_v2 branch from d80bdcf to 92a1d26 Compare July 14, 2020 16:08
@thomasspriggs
Copy link
Contributor Author

@chrisr-diffblue Please can you re-review? Comments are addressed in new commits at the end of the PR.

In order to retain duplicate underscores from original type identifiers.

Invalid character removal is also tweaked to reduce the number of
underscores introduced in the previous step.
Because the resulting partial identifiers are considered to be internal,
it is fine for them to start with an underscore.
@thomasspriggs thomasspriggs force-pushed the tas/type2identifier_v2 branch from 92a1d26 to a9f7f01 Compare July 14, 2020 17:23
@thomasspriggs
Copy link
Contributor Author

thomasspriggs commented Jul 14, 2020

I have pushed to restart CI only. The windows job appeared to fail for reasons unrelated to my changes.

Copy link
Contributor

@chrisr-diffblue chrisr-diffblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks. Approving, with a slight note of caution that if CBMC ever supports (or claims to support) internationalised source code (i.e. getting into the whole input file encoding/internal encoding question) then this might need revisiting. It won't be the only place that'll need revisiting though :-)

@thomasspriggs thomasspriggs merged commit 069db73 into diffblue:develop Jul 16, 2020
@thomasspriggs thomasspriggs deleted the tas/type2identifier_v2 branch July 16, 2020 10:22
@thomasspriggs
Copy link
Contributor Author

I did take a look at the implementation of the lexer and there does seem to be some UTF-8 support in there. But it is reasonable to assume that there will be incompatibilities, if there are no end-to-end regression tests for such functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants