Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integer overflow (optional) detection #75

Closed
wants to merge 2 commits into from

Conversation

fargies
Copy link

@fargies fargies commented Feb 28, 2022

Hello 👋

Here are some modifications to add a slightly stricter mode on integer parsing (+ some fixes and minor improvements).

  • Small fix on gcc-4.8.hpp amalgamate (actually started to use it)
  • Adding C4CORE_OVERFLOW_DETECT feature
    • when enabled parse will fail on overflow, small performance impact (see attached files)
    • when disabled code is back to full-perfs (old behavior)
    • tests updated to validate with/out option
  • Small performance enhancements:
    • moving returns out of loops allows some compiler optimizations
    • casting symbols on uint8_t and comparing with only < '9' provides a small boost (also applied on oct, similar thing could be done on hex)
  • Removed tests on output when parsing fails: it was valid only for the first byte parsed (ex: 0xK0 would not touch input when 0x0K would)

Note: in graphs version without suffix is before modification, _2 is with feature disabled, _3 is with feature enabled, complete benchmarks results are in the tar.gz archive.

atox_c4_atoi
atox_c4_atou
atox_c4_atox
atox_c4_from_chars
benchmark.tar.gz

@codecov
Copy link

codecov bot commented Feb 28, 2022

Codecov Report

Merging #75 (d37473a) into master (42bd976) will increase coverage by 0.10%.
The diff coverage is 95.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   93.96%   94.06%   +0.10%     
==========================================
  Files          53       53              
  Lines       11449    11359      -90     
==========================================
- Hits        10758    10685      -73     
+ Misses        691      674      -17     
Impacted Files Coverage Δ
src/c4/charconv.hpp 96.69% <95.12%> (-0.61%) ⬇️
test/test_charconv.cpp 100.00% <100.00%> (ø)
src/c4/ext/fast_float_all.h 26.81% <0.00%> (-1.31%) ⬇️
test/test_memory_resource.cpp 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42bd976...d37473a. Read the comment docs.

@biojppm
Copy link
Owner

biojppm commented Mar 1, 2022

Thanks for the PR, @fargies . It is very solid, and special thanks for providing also the benchmark plots (by the way, what did you use to make those plots?).

Before we go into details, I'm curious to what kind of situation is prompting you to have to detect overflow. I guess this is needed from using rapidyaml to parse data that may overflow. Can you show me example input and the code you are using to get the data to the integers?

I ask this because I'm thinking there may be a more high-level way to detect overflow without having to pay for it in the low-level code. To be clear, I really like the previous situation where the overflow silently wraps around just as a native integer does, but I do concede that there should/could be a way to detect the overflow. But maybe that way can be a separate one, ie functions different than read_dec()/atoi() etc.

I think the conditional macro may be acceptable as a last resort, but if both approaches can be present simultaneously then I will prefer it over the conditional macro. For example, we could maybe have an atou_range_checked()/atoi_range_checked()/atox_range_checked()/from_chars_range_checked() and some functions at the top in rapidyaml to offer both approaches (but note the _range_checked() suffix is terrible, and is used here just as an indication).

So again, can you show me the high level code where you're converting to integer?

@fargies
Copy link
Author

fargies commented Mar 1, 2022

Thanks for the PR, @fargies . It is very solid, and special thanks for providing also the benchmark plots (by the way, what did you use to make those plots?).

I've used this project : https://github.com/gaujay/jomt.git (first time I use it, quite nice and easy)

Before we go into details, I'm curious to what kind of situation is prompting you to have to detect overflow. I guess this is needed from using rapidyaml to parse data that may overflow. Can you show me example input and the code you are using to get the data to the integers?

I ask this because I'm thinking there may be a more high-level way to detect overflow without having to pay for it in the low-level code. To be clear, I really like the previous situation where the overflow silently wraps around just as a native integer does, but I do concede that there should/could be a way to detect the overflow. But maybe that way can be a separate one, ie functions different than read_dec()/atoi() etc.

We don't use RapidYaml yet, I'm still preparing and bundling (testing with compiler in our production env as you noticed), but in the long term we plan to use it to load config and settings but also to encode payload in RPC and REST apis.
My main concern is that there's no means to detect that such an overflow occurred, feedback sent to user will simply be "OK" when underneath something totally different than what was requested is being applied (and can still be in valid operational range).

We are controlling devices and heavy machinery, a simple undetected parameter change can have a very large impact and would definitely be on our back (user provided input may be incorrect but we did not detect and prevented its side-effects).

I do get that it's nice to have a full-speed library that copes well with errors, still in some cases being a bit more strict and safe is preferred regarding performance.

Note: if you wish to go in that direction, shouldn't all serialization errors be silently discarded (setting a default value) ? as an example this code would throw (bar doesn't exist) :

int i;
ryml::parse_in_arena("foo: 1").rootref()["bar"] >> i;

I think the conditional macro may be acceptable as a last resort, but if both approaches can be present simultaneously then I will prefer it over the conditional macro. For example, we could maybe have an atou_range_checked()/atoi_range_checked()/atox_range_checked()/from_chars_range_checked() and some functions at the top in rapidyaml to offer both approaches (but note the _range_checked() suffix is terrible, and is used here just as an indication).

Understood and agreed, we can eventually re-use itoa_min template and make a lightweight parsing (string size + symbol comparison without real parsing).

Then Eventually provide a template method on NodeRef (ex: bool is_overflow<T>() or is_range<T>()).

If you wish to go further (maybe not in this PR), we could add some feature flags to as an example turn off/on strict mode (maybe as a parser argument) ? having once yaml document is parsed either flexible (error-free) stream operators (that set default neutral values on error, or leave untouched), or stricter mode that does error-checking at each step.

Note: another of my concerns is that error handling callbacks are shared for the entire lib, and don't have much info about where the error came from (parent ref, parser-error vs data-extraction error) so we can't really play with this mechanism (enabling errors on parse and disabling on data-extraction is a no go on multi-threaded apps), we've decided to throw on error (abort in a server app when user sends corrupted data is not a good solution), that would be awesome improve this giving more flexibility.

So again, can you show me the high level code where you're converting to integer?

Quick example, I have a device where it's max speed is driven in mm/min and stored in an uint16_t, considering it fits our driver hardware limits, user provides a large number (> 65535 still makes sense physically speaking), everything goes without any warning (current library doesn't offer a way to detect that such thing happened), only once moving the device we may notice that something really bad happened.

Json (and by extension Yaml) integer format is quite flexible, so even if we were to extract all binaries in (unsigned) long long we would have no guarantee whereas something bad happened, unless parsing and testing it externally (which removes some magic from rapidyaml).

@biojppm
Copy link
Owner

biojppm commented Mar 1, 2022

Quick one to address this:

Note: another of my concerns is that error handling callbacks are shared for the entire lib, and don't have much info about where the error came from (parent ref, parser-error vs data-extraction error) so we can't really play with this mechanism (enabling errors on parse and disabling on data-extraction is a no go on multi-threaded apps), we've decided to throw on error (abort in a server app when user sends corrupted data is not a good solution), that would be awesome improve this giving more flexibility.

You definitely can have per-tree callbacks, different from the global library-scope callbacks. Further, you can also have per-phase callbacks (ie parsing vs data-extraction), by instantiating (prior to parsing) a parser and a tree with different callbacks, and then providing the existing tree to the appropriate method in the existing parser. This second aspect I've not explored much, so if something is not working as I described, then that's definitely a rapidyaml bug which should be addressed; further, if there are indeed problems, they will be very easy to fix by simply ensuring the proper callbacks object is used in the parser code (the data-extraction logic happens in the tree/noderef and is by definition calling into the right callbacks object).

HTH.

@biojppm
Copy link
Owner

biojppm commented Mar 1, 2022

No contest. Again, I want to make it clear that I have no problem with having overflow-detecting code; in fact I'd like to have it. It's just that I'd prefer to keep the lean read_dec() code for the cases where we know there is no overflow.

So I'm envisioning something like this (for example, with atoi):

template<class T> bool atoi(csubstr s, T *var); // the existing function

// ... and then either this:
template<class T> bool atoi_rc(csubstr s, T *var) // range-checked version
{
  if(overflows<T>(s))
    return false; 
  return atoi(s, var);
}

// ... or this:
template<class T> bool atoi_rc(csubstr s, T *var) // range-checked version
{
   // ... your code
}

The first version will have a throughput approximately half of the original atoi() as it needs to loop twice over the string. The second version (your approach) will be quicker than the first, as the benchmark figures show, but at the cost of having more code to write and maintain. I think that cost is fair if the estimates above are in range. If OTOH the costs of both approaches are close together, then the bool overflows<T>(csubstr s) is preferable because of its simplicity.

So one of the above would be the first step.

Then, the second step would be providing c4::fmt tag wrappers (eg c4::fmt::in_range(T&)) and appropriate from_chars() overload to dispatch into the proper function. With this, almost no or maybe even no changes would be needed in rapidyaml; you could have something like this client code:

ryml::Tree t;
uint16_t a, b;
t["foo"] >> a;    // silent overflow
t["foo"] >> ryml::fmt::range_checked(b);  // error on overflow, explicit opt-in

The explicit opt-in approach would be in keeping with the very correct philosophy that you don't need to pay for what you don't want to use.

The tag type implementation would be something like this in c4/format.hpp:

namespace fmt {
template<class T>
struct range_checked_
{
    static_assert(std::is_integral<T>::value, "range checking only for integral types");
    C4_ALWAYS_INLINE range_checked_(T &val_) : val(&val_) {}
    T *val;
};
template<class T>
C4_ALWAYS_INLINE range_checked_<T> range_checked(T &val)
{
   return range_checked_<T>(val);
}
} // namespace fmt

template<class T>
C4_ALWAYS_INLINE bool from_chars(csubstr s, fmt::range_checked_<T> wrapper)
{
    return atox_rc(s, wrapper.val);
}

// or we could use SFINAE to bypass `atox_rc()` and 
// call directly `atoi_rc(), atou_rc(), atof_rc()`, etc.

@fargies
Copy link
Author

fargies commented Mar 1, 2022

This sounds like a plan, I'm not sure about the throughput load, it'll depend on the implementation.

I think a simple str.len comparison can eliminate most cases, for the other's it'd indeed have to iterate on string but making only comparisons and iterating up to a certain point (so still O(1) but with lightweight computations), in the end I think it may even be more efficient than current implementation + the iteration will probably load the complete string in L2 which will prepare for the real parsing.

I'll work on that as soon as soon as I can :)

@fargies
Copy link
Author

fargies commented Mar 1, 2022

Here's a draft implementation : 09a0d7e

I'll add format operator and will run a benchmark.

@biojppm if you had some time to review that would be nice :)

@biojppm
Copy link
Owner

biojppm commented Mar 1, 2022

Here's a draft implementation : 09a0d7e

I'll add format operator and will run a benchmark.

@biojppm if you had some time to review that would be nice :)

@fargies exactly, this overflows<T>(csubstr) implementation is exactly what I was thinking as the base. From a glance the code seems correct, and the tests as well. And special kudos for nailing the idiom style; it looks like I was writing it. There are some very minor formatting issues which are no trouble at all.

Just add the commit to the PR or reset the branch (whatever you choose) and I'll add minor reviews inline in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants