Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating to simdjson 1.0 #70

Merged
merged 25 commits into from
Sep 7, 2021
Merged

Updating to simdjson 1.0 #70

merged 25 commits into from
Sep 7, 2021

Conversation

lemire
Copy link
Collaborator

@lemire lemire commented Mar 18, 2021

This PR should not be merged and released as is. But I wrote it to help fix the issue.

Ok. So what is going on in the transition to simdjson 0.9 that requires so many changes ?

Well. We were returning results in the form of an std::pair but we were not expecting people to use it as an std::pair... (although, to be fair, we sometimes did it ourselves) yet people did. And what happened was that people would just ignore the error field. They would then consume garbage, and complain that simdjson was producing garbage.

So we closed off the std::pair by making it a protected inheritance. Sadly, it breaks structured binding which was nice but never worked right with all C++ runtimes (libc did not like it).

This does not impact at all people who used our normal API. Want a double? Do double(element). But I understand that R does not want exceptions thrown? Well. Thankfully, we support both exception-based and no-exception usage.

What about rcppsimdjson? Well rcppsimdjson would do result.first to get the result and ignore the error code. This was, presumably, always safe because error conditions were otherwise checked.

We don't want you to continue doing that. Instead, you now must do result.value_unsafe().

Yes. It is ugly. But the use of value_unsafe() is meant to indicate that you are very much doing something unsafe. And that's what we want the code to show so that people are careful. The .first syntax does not have the same air of danger around it.

@eddelbuettel
Copy link
Owner

Thanks so much for working on this "for us". I am not sure if anyone hands out prices for most amazing upstream, but I would be compelled to nominate you ....

Now, while I am currently fighting other fires elsewhere, one brief comment:

But I understand that R does not want exceptions thrown?

No, in fact, every call generated by Rcpp contains wrapping glue code with a try/catch. We fetch exceptions, and turn them into R errors. So that is not the issue.

@lemire
Copy link
Collaborator Author

lemire commented Mar 18, 2021

No, in fact, every call generated by Rcpp contains wrapping glue code with a try/catch. We fetch exceptions, and turn them into R errors. So that is not the issue.

Damn. It could allow for much code simplification... but it is too late for this PR as I started going all exceptionless.

Working to find my final bugs.

@codecov
Copy link

codecov bot commented Mar 18, 2021

Codecov Report

Merging #70 (0870053) into master (de81b7c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 0870053 differs from pull request most recent head ef837d0. Consider uploading reports for the commit ef837d0 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##           master      #70   +/-   ##
=======================================
  Coverage   99.55%   99.55%           
=======================================
  Files          18       18           
  Lines        1336     1337    +1     
=======================================
+ Hits         1330     1331    +1     
  Misses          6        6           
Impacted Files Coverage Δ
inst/include/RcppSimdJson/deserialize.hpp 100.00% <100.00%> (ø)
...t/include/RcppSimdJson/deserialize/Type_Doctor.hpp 100.00% <100.00%> (ø)
...nst/include/RcppSimdJson/deserialize/dataframe.hpp 100.00% <100.00%> (ø)
inst/include/RcppSimdJson/deserialize/matrix.hpp 100.00% <100.00%> (ø)
inst/include/RcppSimdJson/deserialize/scalar.hpp 100.00% <100.00%> (ø)
inst/include/RcppSimdJson/deserialize/simplify.hpp 100.00% <100.00%> (ø)
inst/include/RcppSimdJson/deserialize/vector.hpp 100.00% <100.00%> (ø)
src/exported-utils.cpp 100.00% <100.00%> (ø)
src/simdjson_example.cpp 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de81b7c...ef837d0. Read the comment docs.

@lemire
Copy link
Collaborator Author

lemire commented Mar 18, 2021

Locally, this passes all my tests but it seems to fail in CI

I think that in many cases, the PR improves upon the existing code. Either it becomes clearer or the dangers are more evident.

Importantly, it follows more closely our recommended approach.

Feel free to drop this and rewrite it, I have no ego invested in it.

cc @jkeiser

@eddelbuettel
Copy link
Owner

I am not too worried about random segfaults in the CI ... because we are lazy and installing a few things (R packages we need) as binaries. Sometimes things just need a rebuild. I'll take a look later.

@eddelbuettel
Copy link
Owner

And the switch to the error status variable is not all that mortal -- @knapply and I will chat and see which style we like better and find more idiomatic.

Again, big big thank you for making time and working through this. It should not be far from here to the goal line....

@lemire
Copy link
Collaborator Author

lemire commented Mar 18, 2021

You used to do this construction...

if (auto [result, parse_error] = something; !parse_error) {
}

It did not work under libc++ though @ldionne just fixed that.

We no longer allow it because we don't want to give people direct access to the value without enticing them to check the error... but you can do it like so...

if (simdjson::dom::element element; something.get(element) == SUCCESS) {
}

I would argue that the latter is just a clean. (I did not use this construction because I am old-style guy and I don't declare variables in if branches, but you could.)

Importantly, if are lazy and do...

simdjson::dom::element element; 
something.get(element)

It should complain that you are not checking the error condition. That's what we want. We want to bug people who fail to check their errors.

@eddelbuettel
Copy link
Owner

eddelbuettel commented Mar 18, 2021

Well it bombs on my box too (Ubuntu 20.10). With the tinytest framework I can run most of the test files individually (after the updated package is installed, and some of those test file lack a library(RcppSimdJson)) and of those that run a number die on segfault. As does the initial R CMD check on running the examples. Maybe something got chopped up with the error code and exceptions, maybe it ise something else. Will need to dig...

@lemire
Copy link
Collaborator Author

lemire commented Mar 18, 2021

@eddelbuettel My guess is a typo somewhere. I had several. I am probably missing at least one.

My guess is that you guys are probably better equipped to track down the offending code.

Know that I am not at all opposed to the idea of finishing the PR and making it work.

@eddelbuettel
Copy link
Owner

Still getting segfaults even when it builds fine. If you do R -d gdb then say run<return> and in R library(RcppSimdJson); example(fparse) it blows up in deserialization, as it does in some of this test file.

Call backtrace does not scream anything immediately useful at me:

> library(RcppSimdJson)                                                                                  
> example(fparse)                                                                                        
                                                                                                         
fparse> # simple parsing ============================================================
fparse> json_string <- '{"a":[[1,null,3.0],["a","b",true],[10000000000,2,3]]}'
                                                                                                         
fparse> fparse(json_string)                                                                              
                                                                                                         
Thread 1 "R" received signal SIGSEGV, Segmentation fault.  
0x00007fff94636556 in SEXPREC* rcppsimdjson::deserialize::matrix::build_matrix_mixed<16>(simdjson::dom::array, unsigned long) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
(gdb) backtrace                  
#0  0x00007fff94636556 in SEXPREC* rcppsimdjson::deserialize::matrix::build_matrix_mixed<16>(simdjson::dom::array, unsigned long) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#1  0x00007fff946606d2 in SEXPREC* rcppsimdjson::deserialize::simplify_element<(rcppsimdjson::deserialize::Type_Policy)0, (rcppsimdjson::utils::Int64_R_Type)0, (rcppsimdjson::deserialize::Simplify_To)0>(simdjson
::dom::element, SEXPREC*, SEXPREC*, SEXPREC*) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#2  0x00007fff94661ef1 in SEXPREC* rcppsimdjson::deserialize::simplify_object<(rcppsimdjson::deserialize::Type_Policy)0, (rcppsimdjson::utils::Int64_R_Type)0, (rcppsimdjson::deserialize::Simplify_To)0>(simdjson:
:dom::object, SEXPREC*, SEXPREC*, SEXPREC*) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#3  0x00007fff94660527 in SEXPREC* rcppsimdjson::deserialize::simplify_element<(rcppsimdjson::deserialize::Type_Policy)0, (rcppsimdjson::utils::Int64_R_Type)0, (rcppsimdjson::deserialize::Simplify_To)0>(simdjson
::dom::element, SEXPREC*, SEXPREC*, SEXPREC*) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#4  0x00007fff9460753c in ?? () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#5  0x00007fff9460e773 in deserialize(SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, bool, SEXPREC*, bool, SEXPREC*, int, int, int) () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#6  0x00007fff945dd103 in _RcppSimdJson_deserialize () from /usr/local/lib/R/site-library/RcppSimdJson/libs/RcppSimdJson.so
#7  0x00007ffff7c0ef76 in ?? () from /usr/lib/R/lib/libR.so
## many more lines omitted

@eddelbuettel
Copy link
Owner

eddelbuettel commented Mar 19, 2021

So to take the concrete example, example(fparse) blows up here, apparently on the dom::array access:

https://github.com/lemire/rcppsimdjson/blob/a65c00d2e5709dfe754d1871d6d9bd554eb0fa3e/inst/include/RcppSimdJson/deserialize/matrix.hpp#L212

Via the old-fashioned method of a print statement, I make it to a print just above but not below. Dimensions seem fine (3 rows, 3 cols), but for i==0 we bonk. Any hunch, @lemire?

@lemire
Copy link
Collaborator Author

lemire commented Mar 19, 2021

@eddelbuettel I am sure it is something stupid that I did. I will have a look tomorrow morning. It is 10 pm here and Jack Daniels is calling.

@lemire
Copy link
Collaborator Author

lemire commented Mar 19, 2021

@eddelbuettel I need to throw better tooling at this problem. There is a lot of code in there, most of which I am only vaguely familiar with.

Normally, I would just use sanitizers. I read the R documentation, I searched on the Internet... and I can't find a way to do it...

Furthermore, it really looks like an heisenbug. I can influence where things fail by adding printouts. So something gets the wrong address or the wrong cast. Now with sanitizers, I should pick this up right away.

You (@eddelbuettel) state on stackoverflow that one can modify ~/.R/Makevars and that this gets picked up automagically. Someone comments that he was afraid that you would say that. You dismiss him, but I have the same problem.

Here is what I have currently...

$ more ~/.R/Makevars 
CXXFLAGS = -fsanitize=undefined,address -fno-omit-frame-pointer
CFLAGS = -fsanitize=undefined,address -fno-omit-frame-pointer

It is not picked up. In fact, I can do this...

$ more ~/.R/Makevars 
CC = go to hell
CXX = go to hell 2
CXXFLAGS = -unvalidaerewrewflag
CFLAGS = -iahteate

and it is not used at all.

For context, I just call

./run.sh run_tests

It is possible that your script is overriding this default or else I am not understand.

@eddelbuettel
Copy link
Owner

eddelbuettel commented Mar 19, 2021

I know there is a lot of complexity here, but, frankly, you could just ask. We're here, and this medium is quick. Nobody expects you to go off and re-figure all of the things out by yourself.

I created the initial Docker containers for ASAN and UBSAN as CRAN / Brian Ripley have some cryptic instructions (see older posts on my blog). As my builds are not regularly updated, these days the best bet is the 'sumo' style aggregate container by @wch here: https://github.com/wch/r-debug You need those as R itself has to be built with sanitizers.

Alternatively, one can also use the builder service here https://builder.r-hub.io/ which has ASAN and Valgrind options (and the CRAN package rhub lets upload locally). I can look into any and all of that in a day or two (or ideally, an evening).

@lemire
Copy link
Collaborator Author

lemire commented Mar 19, 2021

I know there is a lot of complexity here, but, frankly, you could just ask.

I thought that's what I did? :-) I am not at all against asking for help... but I want to explain what I tried and what I read before I do.

@eddelbuettel
Copy link
Owner

Point token. I am simply still sore from the missed opportunity to show you exception handling before you went off firmly assuming it does not exist. Stuff happens. Use of valgrind and/or asan here is a very good suggestion.

@lemire
Copy link
Collaborator Author

lemire commented Mar 19, 2021

@eddelbuettel I am reengineering with the knowledge that exception handling is supported. This will make the code cleaner.

Next comment will tell you what the bug was. Wait for it.

@lemire
Copy link
Collaborator Author

lemire commented Mar 19, 2021

I won't assume that there are no more bugs, but I ended up fixing one nagging one just by reading more carefully my own code. I had the following...

    for (auto element : array) {
        simdjson::dom::array sub_array;
        if(!element.get(sub_array)) { // <==== The bug is here
            return std::nullopt;
        }
        matrix_doctor.update(
            Type_Doctor<type_policy, int64_opt>(sub_array));
        n_cols.insert(std::size(sub_array));

        if (std::size(n_cols) > 1 || !matrix_doctor.is_vectorizable()) {
            return std::nullopt;
        }
    }

This was saying... if this is not an array, then cast it as an array...

Here is the correct code...

    for (auto element : array) {
        simdjson::dom::array sub_array;
        if(element.get(sub_array) != simdjson::SUCCESS) {
            return std::nullopt;
        }
        matrix_doctor.update(
            Type_Doctor<type_policy, int64_opt>(sub_array));
        n_cols.insert(std::size(sub_array));

        if (std::size(n_cols) > 1 || !matrix_doctor.is_vectorizable()) {
            return std::nullopt;
        }
    }

Elsewhere, I am now using in this PR casts and other constructions that may throw. They are embedded in noexcept methods so it is unsafe. I could have removed the noexcept but... This PR is for demonstration/discussion... please do not merge as-is. It is a proof of concept. Review and reengineer to your taste.

In early commits, I described all of the options you have... basically, suppose that you are given an element and you want to cost it to a double... you can do...

double(element);

This might throw but only if you got the type wrong, so if you checked the type first, this should be fine.

If you don't want to throw, then do...

double value; // could put in the next line as part the if clause if you have fancy C++
if(element.get(value) == simdjson::SUCCESS) {
  // "value" contains what you need... 
} else {
 // do something
}

You can instrument this to support default values like so...

double value;
value = (element.get(value) == simdjson::SUCCESS) & value :  default_value;

The == simdjson::SUCCESS is verbose and there is the usual auto cast to a bool but I just messed it up so that the current code always explicitly compares against SUCCESS (my brain likes this approach better).

@eddelbuettel
Copy link
Owner

I must have been a little asleep at the wheel, or overwhelmed from the earlier flood of post 😉 , because I seem to have missed that the PR is now good :) As a sign of repentence I did first update my copy of the wch1/r-debug container and ran this with RDsan (which still fails a few tests) but on a normal R(-release) session it passes, as it does with R-devel.

So shall we merge now?

@eddelbuettel
Copy link
Owner

I just rebased this to master given how we were discussing in #75 that this may be a merge candidate too.

@lemire
Copy link
Collaborator Author

lemire commented Sep 5, 2021

@eddelbuettel Yes. It might be a safer choice in the sense that it does not lead to a performance regression. It might be viable to see the On Demand wrapper as "future work" in the sense that it should be merged after the performance regression has been addressed.

Nicolas cannot be expected to do this right now. But it is good future work that we could address next summer? :-)

@eddelbuettel
Copy link
Owner

Ok will likely be merging, and I will then wear a fool's cap for day or longer as I had completely missed that this was sitting here, ready. I was still hung up under the earlier error.

We can always support on-demand as an option, or in a branch.

@knapply Can you think of a reason not to merge?

@lemire lemire changed the title Updating to simdjson 0.9 Updating to simdjson 1.0 Sep 7, 2021
@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

I am releasing simdjson 1.0 today, so I refreshed this PR with simdjson 1.0.

Before definitively releasing 1.0, I will wait to make sure that everything is green. (It should be!)

@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

It is green.

@eddelbuettel
Copy link
Owner

Hm. I was about to merge this evening as it had nicely stabilized; now you chose to rock the boat again -- but thanks for updating to 1.0.0 and congrats on that release.

Can you please remove hacking.md? That does not belong here (maybe in the wiki?) and I tend to prefer a simpler / more compact style Docker (see eg here for a set or here for a concrete one.

@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

now you chose to rock the boat again

I did not expect my change to 'rock the boat'. It was conservative update.

I will check the hacking file...

@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

I see that you went ahead and removed it.

I honestly do not remember adding this file. I do have notes, but they are on a gist somewhere.

@eddelbuettel
Copy link
Owner

I will check the hacking file...

I just removed it, and merged.

I looked into C++14 and/or C++11 -- not a chance right now. Part of that may by now be our fault :-/ as @knapply and I thought we could plow ahead with C++17 given that you imposed it. I'd be up for opening an issue and cleaning this up.

But before that I think I should release what we have as either 0.1.6 or 0.2.0. It nice to have a refresher, and it is very nice to be current with simdjson 1.0 -- so big, big Thank You! to you for updating the PR for it. And, of course, for having made the PR in the first place.

@eddelbuettel eddelbuettel merged commit 7d0ce04 into eddelbuettel:master Sep 7, 2021
@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

Thanks to you.

It is definitively the case that simdjson supports C++11. I cannot recall when we went back and decided to support C++11 exactly, but I think that the credit goes to @jkeiser who worked a h*ll of a lot on making this happen. We now have C++11 tests in CI.

Note that now that you have merged this, it will become easier, in the future to adopt On Demand for extra performance.

@eddelbuettel
Copy link
Owner

eddelbuettel commented Sep 7, 2021

I may not have made myself sufficiently clear.

As you know, there is code is RcppSimdJson that is not the included simdjson.{cpp,h}. If you go in and enforce a different compilation standard, easiest by just wrriting CXX_STD = CXX11, say, in this line

CXX_STD = @CXXSTD@

as well as adjusting accordingly little corners likes this one

#if __cplusplus >= 201703L
# include <RcppSimdJson.hpp>
#endif

the you will see, with a tear or two in your eyes, that compilation goes belly up. Irregardless of the fine work you folks did upstream. To paraphrase, "we have met the enemy and he is in this code directory". PRs welcome. I won't have time to chase this, and as CRAN leaves me alone with C++17 as long as I am explicit about it (which 0.1.5 was not that sufficiently, the next release will be) I am also not all that concerned.

I hope this clarifies the issue. It is not your upstream code. It is use down here in RcppSimdJson currently making it C++17.

@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

You were perfectly clear, I was just giving you the confirmation that we fully support C++11.

@eddelbuettel
Copy link
Owner

Yes for approximately the third time, and it is probably also the third time I am trying to explain that that alone does not buy me lunch, nor that I care too much. So I can assume we are both good on this now?

@lemire
Copy link
Collaborator Author

lemire commented Sep 7, 2021

You and I have fun exchanges. :-) It is almost like we disagree all the time, except we never disagree.

Yes, we are good.

@eddelbuettel
Copy link
Owner

eddelbuettel commented Sep 7, 2021

I think I'll ship it later to CRAN as 0.1.6.

Or should I wait for you to pop the cork on 1.0.0 over at https://github.com/simdjson/simdjson ?

@eddelbuettel
Copy link
Owner

This went to CRAN late-ish Central time and arrived there early today, see https://cran.r-project.org/package=RcppSimdJson

@lemire
Copy link
Collaborator Author

lemire commented Sep 8, 2021

Great news. We have pushed simdjson 1.0 as well.

This was more work than I thought it would be.

I am a bit sad that we do not have (yet) a super fast On Demand wrapper. But Rome was not built in a day.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants