
Feature/deserialize #20

Merged (16 commits) on Jun 17, 2020

Conversation

@knapply (Collaborator) commented on Jun 16, 2020

Many refinements following some real-world usage:

  • found/fixed an int64_t to double casting bug
  • reorganized code to (hopefully) make everything more digestible (and IDE-friendly)
    • I also added some documentation and notes to facilitate that
      • functions closest to user level
      • compile-time things (enums, constexpr functions, exception options) used in multiple places
  • compiled with as many ultra-strict flags as possible through multiple versions of GCC and Clang to catch anything I'm missing
    • fixed conflicting const qualifiers
    • made anything involving integers as type-explicit as possible
  • added a .load_json() function to read JSON files
    • API is identical to .deserialize_json() (a minimal usage sketch follows this list)
  • synced with upstream simdjson (12 Jun 2020)
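
To make the .load_json()/.deserialize_json() parallel concrete, here is a minimal usage sketch. The function names come straight from this PR, but the triple-colon access and the exact call pattern (a single character scalar in, an R object out) are assumptions for illustration, not a confirmed API.

```r
# Minimal sketch, assuming both internal functions take a single character
# scalar (JSON text vs. a file path) and return the same R structure.
json_txt  <- '{"id": 1, "tags": ["a", "b"], "ok": true}'
json_file <- tempfile(fileext = ".json")
writeLines(json_txt, json_file)

from_string <- RcppSimdJson:::.deserialize_json(json_txt)  # parse JSON held in a string
from_file   <- RcppSimdJson:::.load_json(json_file)        # parse JSON read from a file

identical(from_string, from_file)  # same API, so the results should match
```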

Fingers crossed, but I feel more confident this is sustainable.

Last thoughts:

  • So far, everything "just works" whether or not simdjson is compiled with exceptions enabled.
    • Since we're so early in the process, I'm leaving the line to disable exceptions commented out for anything that leaves my machine (you can find it at line 69 in inst/include/RcppSimdJson/common.hpp), but R CMD check is passing for both cases.
    • We can explore in the future whether they're worth disabling, but compiling both ways has provided a nice sanity-check.
  • I hoped to have made more progress on .load_many()/.parse_many() by now, but they're going to take some more thought.
    • A naive "chuck every line in its own list" approach (sketched after this list) handled a real-world JSONL dataset (~3 GB, 100k documents) way better than I expected, but I'm thinking something truly kick-ass may require 2 passes through everything (1 to diagnose, 1 to pull everything into R).
    • But... parsing files like this with existing R tooling has brought anything under 16 GB of RAM to its knees (and it's never just 1 file), so I'm extremely optimistic about this approach.
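
For reference, a rough sketch of that naive per-line approach, using a hypothetical helper built on .deserialize_json(); this is illustration only, not the code in this PR or the eventual .load_many()/.parse_many() design.

```r
# Hypothetical helper, for illustration only: read a JSONL file and chuck
# every line into its own list element via .deserialize_json().
parse_jsonl_naive <- function(path) {
  lines <- readLines(path)                          # one JSON document per line
  lapply(lines, RcppSimdJson:::.deserialize_json)   # one parsed document per element
}

# docs <- parse_jsonl_naive("big-dump.jsonl")  # e.g. ~100k documents -> a list of ~100k elements
```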

@eddelbuettel (Owner) left a comment


This looks like very fine work.

One thing that is a bit odd is that you seem to have commingled this with an upgrade to (upstream) simdjson. Was that intentional / needed by something else?

@knapply (Collaborator, Author) commented on Jun 16, 2020

Nothing actually requires it and I should've done it separately.

Initially it was another sanity check that everything not only works, but works with the updates upstream... and then I left it in.

I have vectorized (in the R sense) versions to parse multiple strings and multiple files ready to go, which maximize parser efficiency by reusing a single parser instead of creating a new one over and over with lapply(). But it seems something went funky with reusing the parser to read files; I reproduced it and opened a simdjson issue here: simdjson/simdjson#938.
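
To illustrate what "vectorized (in the R sense)" means here, a small sketch; whether the final functions accept character vectors exactly like this is an assumption for illustration, not the confirmed API.

```r
# Illustration only; the vector-accepting signatures are an assumption.
files <- c("a.json", "b.json", "c.json")

# lapply() route: every call sets up a fresh simdjson parser behind the scenes.
docs_slow <- lapply(files, RcppSimdJson:::.load_json)

# Vectorized route: one call receives the whole vector, so a single parser can
# (in principle) be reused across all files on the C++ side.
docs_fast <- RcppSimdJson:::.load_json(files)
```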

I should've done the sync separately. That's 100% my bad.

@eddelbuettel (Owner)

> I should've done the sync separately. That's 100% my bad.

Stuff happens. Do you just want to drop another commit on top and restore the two files?

@knapply (Collaborator, Author) commented on Jun 16, 2020

Yes, I'll get it sorted ASAP.

@eddelbuettel (Owner)

Sounds good, and no need to rush.

@lemire (Collaborator) commented on Jun 17, 2020

The upstream issue has been reproduced. It should be "easy" to fix. :-) We will fix it before release 0.4 (which is coming soon).

cc @jkeiser

@knapply (Collaborator, Author) commented on Jun 17, 2020

@lemire Your response time is amazing. We all really appreciate it.

@eddelbuettel The previous file versions have been restored and vectorized versions of .deserialize_json() and .load_json() with tests/docs are in.

@eddelbuettel (Owner) commented on Jun 17, 2020

I'll merge. It is still only from your branch (off a fork of this repo) into a branch here, so we do need another pass anyway before any of this becomes "real".

@eddelbuettel merged commit 1c7ef14 into eddelbuettel:feature/deserialize on Jun 17, 2020