-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating to simdjson 1.0 #70
Conversation
Thanks so much for working on this "for us". I am not sure if anyone hands out prices for most amazing upstream, but I would be compelled to nominate you .... Now, while I am currently fighting other fires elsewhere, one brief comment:
No, in fact, every call generated by Rcpp contains wrapping glue code with a try/catch. We fetch exceptions, and turn them into R errors. So that is not the issue. |
Damn. It could allow for much code simplification... but it is too late for this PR as I started going all exceptionless. Working to find my final bugs. |
Codecov Report
@@ Coverage Diff @@
## master #70 +/- ##
=======================================
Coverage 99.55% 99.55%
=======================================
Files 18 18
Lines 1336 1337 +1
=======================================
+ Hits 1330 1331 +1
Misses 6 6
Continue to review full report at Codecov.
|
Locally, this passes all my tests but it seems to fail in CI I think that in many cases, the PR improves upon the existing code. Either it becomes clearer or the dangers are more evident. Importantly, it follows more closely our recommended approach. Feel free to drop this and rewrite it, I have no ego invested in it. cc @jkeiser |
I am not too worried about random segfaults in the CI ... because we are lazy and installing a few things (R packages we need) as binaries. Sometimes things just need a rebuild. I'll take a look later. |
And the switch to the error status variable is not all that mortal -- @knapply and I will chat and see which style we like better and find more idiomatic. Again, big big thank you for making time and working through this. It should not be far from here to the goal line.... |
You used to do this construction... if (auto [result, parse_error] = something; !parse_error) {
} It did not work under libc++ though @ldionne just fixed that. We no longer allow it because we don't want to give people direct access to the value without enticing them to check the error... but you can do it like so... if (simdjson::dom::element element; something.get(element) == SUCCESS) {
} I would argue that the latter is just a clean. (I did not use this construction because I am old-style guy and I don't declare variables in Importantly, if are lazy and do... simdjson::dom::element element;
something.get(element) It should complain that you are not checking the error condition. That's what we want. We want to bug people who fail to check their errors. |
Well it bombs on my box too (Ubuntu 20.10). With the |
@eddelbuettel My guess is a typo somewhere. I had several. I am probably missing at least one. My guess is that you guys are probably better equipped to track down the offending code. Know that I am not at all opposed to the idea of finishing the PR and making it work. |
Still getting segfaults even when it builds fine. If you do Call > library(RcppSimdJson)
> example(fparse)
fparse> # simple parsing ============================================================
fparse> json_string <- '{"a":[[1,null,3.0],["a","b",true],[10000000000,2,3]]}'
fparse> fparse(json_string)
Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x00007fff94636556 in SEXPREC* rcppsimdjson::deserialize::matrix::build_matrix_mixed<16>(simdjson::dom::array, unsigned long) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
(gdb) backtrace
#0 0x00007fff94636556 in SEXPREC* rcppsimdjson::deserialize::matrix::build_matrix_mixed<16>(simdjson::dom::array, unsigned long) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#1 0x00007fff946606d2 in SEXPREC* rcppsimdjson::deserialize::simplify_element<(rcppsimdjson::deserialize::Type_Policy)0, (rcppsimdjson::utils::Int64_R_Type)0, (rcppsimdjson::deserialize::Simplify_To)0>(simdjson
::dom::element, SEXPREC*, SEXPREC*, SEXPREC*) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#2 0x00007fff94661ef1 in SEXPREC* rcppsimdjson::deserialize::simplify_object<(rcppsimdjson::deserialize::Type_Policy)0, (rcppsimdjson::utils::Int64_R_Type)0, (rcppsimdjson::deserialize::Simplify_To)0>(simdjson:
:dom::object, SEXPREC*, SEXPREC*, SEXPREC*) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#3 0x00007fff94660527 in SEXPREC* rcppsimdjson::deserialize::simplify_element<(rcppsimdjson::deserialize::Type_Policy)0, (rcppsimdjson::utils::Int64_R_Type)0, (rcppsimdjson::deserialize::Simplify_To)0>(simdjson
::dom::element, SEXPREC*, SEXPREC*, SEXPREC*) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#4 0x00007fff9460753c in ?? () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#5 0x00007fff9460e773 in deserialize(SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, bool, SEXPREC*, bool, SEXPREC*, int, int, int) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#6 0x00007fff945dd103 in _RcppSimdJson_deserialize () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#7 0x00007ffff7c0ef76 in ?? () from /usr/lib/R/lib/libR.so
## many more lines omitted |
So to take the concrete example, Via the old-fashioned method of a print statement, I make it to a print just above but not below. Dimensions seem fine (3 rows, 3 cols), but for |
@eddelbuettel I am sure it is something stupid that I did. I will have a look tomorrow morning. It is 10 pm here and Jack Daniels is calling. |
@eddelbuettel I need to throw better tooling at this problem. There is a lot of code in there, most of which I am only vaguely familiar with. Normally, I would just use sanitizers. I read the R documentation, I searched on the Internet... and I can't find a way to do it... Furthermore, it really looks like an heisenbug. I can influence where things fail by adding printouts. So something gets the wrong address or the wrong cast. Now with sanitizers, I should pick this up right away. You (@eddelbuettel) state on stackoverflow that one can modify Here is what I have currently...
It is not picked up. In fact, I can do this...
and it is not used at all. For context, I just call
It is possible that your script is overriding this default or else I am not understand. |
I know there is a lot of complexity here, but, frankly, you could just ask. We're here, and this medium is quick. Nobody expects you to go off and re-figure all of the things out by yourself. I created the initial Docker containers for ASAN and UBSAN as CRAN / Brian Ripley have some cryptic instructions (see older posts on my blog). As my builds are not regularly updated, these days the best bet is the 'sumo' style aggregate container by @wch here: https://github.com/wch/r-debug You need those as R itself has to be built with sanitizers. Alternatively, one can also use the builder service here https://builder.r-hub.io/ which has ASAN and Valgrind options (and the CRAN package |
I thought that's what I did? :-) I am not at all against asking for help... but I want to explain what I tried and what I read before I do. |
Point token. I am simply still sore from the missed opportunity to show you exception handling before you went off firmly assuming it does not exist. Stuff happens. Use of valgrind and/or asan here is a very good suggestion. |
@eddelbuettel I am reengineering with the knowledge that exception handling is supported. This will make the code cleaner. Next comment will tell you what the bug was. Wait for it. |
I won't assume that there are no more bugs, but I ended up fixing one nagging one just by reading more carefully my own code. I had the following... for (auto element : array) {
simdjson::dom::array sub_array;
if(!element.get(sub_array)) { // <==== The bug is here
return std::nullopt;
}
matrix_doctor.update(
Type_Doctor<type_policy, int64_opt>(sub_array));
n_cols.insert(std::size(sub_array));
if (std::size(n_cols) > 1 || !matrix_doctor.is_vectorizable()) {
return std::nullopt;
}
}
This was saying... if this is not an array, then cast it as an array... Here is the correct code... for (auto element : array) {
simdjson::dom::array sub_array;
if(element.get(sub_array) != simdjson::SUCCESS) {
return std::nullopt;
}
matrix_doctor.update(
Type_Doctor<type_policy, int64_opt>(sub_array));
n_cols.insert(std::size(sub_array));
if (std::size(n_cols) > 1 || !matrix_doctor.is_vectorizable()) {
return std::nullopt;
}
} Elsewhere, I am now using in this PR casts and other constructions that may throw. They are embedded in noexcept methods so it is unsafe. I could have removed the noexcept but... This PR is for demonstration/discussion... please do not merge as-is. It is a proof of concept. Review and reengineer to your taste. In early commits, I described all of the options you have... basically, suppose that you are given an element and you want to cost it to a double... you can do...
This might throw but only if you got the type wrong, so if you checked the type first, this should be fine. If you don't want to throw, then do...
You can instrument this to support default values like so...
The |
I must have been a little asleep at the wheel, or overwhelmed from the earlier flood of post 😉 , because I seem to have missed that the PR is now good :) As a sign of repentence I did first update my copy of the wch1/r-debug container and ran this with RDsan (which still fails a few tests) but on a normal R(-release) session it passes, as it does with R-devel. So shall we merge now? |
I just rebased this to master given how we were discussing in #75 that this may be a merge candidate too. |
@eddelbuettel Yes. It might be a safer choice in the sense that it does not lead to a performance regression. It might be viable to see the On Demand wrapper as "future work" in the sense that it should be merged after the performance regression has been addressed. Nicolas cannot be expected to do this right now. But it is good future work that we could address next summer? :-) |
Ok will likely be merging, and I will then wear a fool's cap for day or longer as I had completely missed that this was sitting here, ready. I was still hung up under the earlier error. We can always support on-demand as an option, or in a branch. @knapply Can you think of a reason not to merge? |
I am releasing simdjson 1.0 today, so I refreshed this PR with simdjson 1.0. Before definitively releasing 1.0, I will wait to make sure that everything is green. (It should be!) |
It is green. |
Hm. I was about to merge this evening as it had nicely stabilized; now you chose to rock the boat again -- but thanks for updating to 1.0.0 and congrats on that release. Can you please remove hacking.md? That does not belong here (maybe in the wiki?) and I tend to prefer a simpler / more compact style Docker (see eg here for a set or here for a concrete one. |
I did not expect my change to 'rock the boat'. It was conservative update. I will check the hacking file... |
I see that you went ahead and removed it. I honestly do not remember adding this file. I do have notes, but they are on a gist somewhere. |
I just removed it, and merged. I looked into C++14 and/or C++11 -- not a chance right now. Part of that may by now be our fault :-/ as @knapply and I thought we could plow ahead with C++17 given that you imposed it. I'd be up for opening an issue and cleaning this up. But before that I think I should release what we have as either 0.1.6 or 0.2.0. It nice to have a refresher, and it is very nice to be current with simdjson 1.0 -- so big, big Thank You! to you for updating the PR for it. And, of course, for having made the PR in the first place. |
Thanks to you. It is definitively the case that simdjson supports C++11. I cannot recall when we went back and decided to support C++11 exactly, but I think that the credit goes to @jkeiser who worked a h*ll of a lot on making this happen. We now have C++11 tests in CI. Note that now that you have merged this, it will become easier, in the future to adopt On Demand for extra performance. |
I may not have made myself sufficiently clear. As you know, there is code is RcppSimdJson that is not the included Line 3 in 631ea2e
as well as adjusting accordingly little corners likes this one rcppsimdjson/src/deserialize.cpp Lines 1 to 3 in 631ea2e
the you will see, with a tear or two in your eyes, that compilation goes belly up. Irregardless of the fine work you folks did upstream. To paraphrase, "we have met the enemy and he is in this code directory". PRs welcome. I won't have time to chase this, and as CRAN leaves me alone with C++17 as long as I am explicit about it (which 0.1.5 was not that sufficiently, the next release will be) I am also not all that concerned. I hope this clarifies the issue. It is not your upstream code. It is use down here in RcppSimdJson currently making it C++17. |
You were perfectly clear, I was just giving you the confirmation that we fully support C++11. |
Yes for approximately the third time, and it is probably also the third time I am trying to explain that that alone does not buy me lunch, nor that I care too much. So I can assume we are both good on this now? |
You and I have fun exchanges. :-) It is almost like we disagree all the time, except we never disagree. Yes, we are good. |
I think I'll ship it later to CRAN as 0.1.6. Or should I wait for you to pop the cork on 1.0.0 over at https://github.com/simdjson/simdjson ? |
This went to CRAN late-ish Central time and arrived there early today, see https://cran.r-project.org/package=RcppSimdJson |
Great news. We have pushed simdjson 1.0 as well. This was more work than I thought it would be. I am a bit sad that we do not have (yet) a super fast On Demand wrapper. But Rome was not built in a day. |
This PR should not be merged and released as is. But I wrote it to help fix the issue.
Ok. So what is going on in the transition to simdjson 0.9 that requires so many changes ?
Well. We were returning results in the form of an
std::pair
but we were not expecting people to use it as anstd::pair
... (although, to be fair, we sometimes did it ourselves) yet people did. And what happened was that people would just ignore the error field. They would then consume garbage, and complain that simdjson was producing garbage.So we closed off the
std::pair
by making it a protected inheritance. Sadly, it breaks structured binding which was nice but never worked right with all C++ runtimes (libc did not like it).This does not impact at all people who used our normal API. Want a
double
? Dodouble(element)
. But I understand that R does not want exceptions thrown? Well. Thankfully, we support both exception-based and no-exception usage.What about rcppsimdjson? Well rcppsimdjson would do
result.first
to get the result and ignore the error code. This was, presumably, always safe because error conditions were otherwise checked.We don't want you to continue doing that. Instead, you now must do
result.value_unsafe()
.Yes. It is ugly. But the use of
value_unsafe()
is meant to indicate that you are very much doing something unsafe. And that's what we want the code to show so that people are careful. The.first
syntax does not have the same air of danger around it.