
Feature/deserialize #20

Merged (16 commits) on Jun 17, 2020

Conversation

@knapply (Collaborator) commented on Jun 16, 2020

Many refinements following some real-world usage:

  • found/fixed an int64_t to double casting bug
  • reorganized code to (hopefully) make everything more digestible (and IDE-friendly)
    • I also added some documentation and notes to facilitate that
      • functions closest to user level
      • compile-time things (enums, constexpr functions, exception options) used in multiple places
  • compiled with as many ultra-strict flags as possible through multiple versions of GCC and Clang to catch anything I'm missing
    • fixed conflicting const qualifiers
    • made anything involving integers as type-explicit as possible
  • added a .load_json() function to read JSON files
    • API is identical to .deserialize_json() (a minimal usage sketch follows this list)
  • synced with upstream simdjson (12 Jun 2020)
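
To make the .load_json()/.deserialize_json() parallel concrete, here is a minimal usage sketch. The function names come straight from this PR, but the triple-colon access and the exact call pattern (a single character scalar in, an R object out) are assumptions for illustration, not a confirmed API.

```r
# Minimal sketch, assuming both internal functions take a single character
# scalar (JSON text vs. a file path) and return the same R structure.
json_txt  <- '{"id": 1, "tags": ["a", "b"], "ok": true}'
json_file <- tempfile(fileext = ".json")
writeLines(json_txt, json_file)

from_string <- RcppSimdJson:::.deserialize_json(json_txt)  # parse JSON held in a string
from_file   <- RcppSimdJson:::.load_json(json_file)        # parse JSON read from a file

identical(from_string, from_file)  # same API, so the results should match
```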

Fingers crossed, but I feel more confident this is sustainable.

Last thoughts:

  • So far, everything "just works" whether or not simdjson is compiled with exceptions enabled.
    • Since we're so early in the process, I'm leaving the line to disable exceptions commented out for anything that leaves my machine (you can find it at line 69 in inst/include/RcppSimdJson/common.hpp), but R CMD check is passing for both cases.
    • We can explore in the future whether they're worth disabling, but compiling both ways has provided a nice sanity-check.
  • I hoped to have made more progress on .load_many()/.parse_many() by now, but they're going to take some more thought.
    • A naive "chuck every line in its own list" approach (sketched after this list) handled a real-world JSONL dataset (~3 GB, 100k documents) way better than I expected, but I'm thinking something truly kick-ass may require 2 passes through everything (1 to diagnose, 1 to pull everything into R).
    • But... parsing files like this with existing R tooling has brought anything under 16 GB of RAM to its knees (and it's never just 1 file), so I'm extremely optimistic about this approach.
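
For reference, a rough sketch of that naive per-line approach, using a hypothetical helper built on .deserialize_json(); this is illustration only, not the code in this PR or the eventual .load_many()/.parse_many() design.

```r
# Hypothetical helper, for illustration only: read a JSONL file and chuck
# every line into its own list element via .deserialize_json().
parse_jsonl_naive <- function(path) {
  lines <- readLines(path)                          # one JSON document per line
  lapply(lines, RcppSimdJson:::.deserialize_json)   # one parsed document per element
}

# docs <- parse_jsonl_naive("big-dump.jsonl")  # e.g. ~100k documents -> a list of ~100k elements
```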

@eddelbuettel (Owner) left a comment


This looks like very fine work.

One thing that is a bit odd is that you seem to have commingled this with an upgrade to (upstream) simdjson. Was that intentional / needed by something else?

@knapply (Collaborator, Author) commented on Jun 16, 2020

Nothing actually requires it and I should've done it separately.

Initially it was another sanity check that everything not only works, but works with the updates upstream... and then I left it in.

I have vectorized (in the R sense) versions to parse multiple strings and multiple files ready to go, which maximize parser efficiency by reusing a single parser instead of creating a new one over and over with lapply(). But it seems something went funky with reusing the parser to read files; I reproduced it and opened a simdjson issue here: simdjson/simdjson#938.
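
To illustrate what "vectorized (in the R sense)" means here, a small sketch; whether the final functions accept character vectors exactly like this is an assumption for illustration, not the confirmed API.

```r
# Illustration only; the vector-accepting signatures are an assumption.
files <- c("a.json", "b.json", "c.json")

# lapply() route: every call sets up a fresh simdjson parser behind the scenes.
docs_slow <- lapply(files, RcppSimdJson:::.load_json)

# Vectorized route: one call receives the whole vector, so a single parser can
# (in principle) be reused across all files on the C++ side.
docs_fast <- RcppSimdJson:::.load_json(files)
```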

I should've done the sync separately. That's 100% my bad.

@eddelbuettel (Owner)

> I should've done the sync separately. That's 100% my bad.

Stuff happens. Do you just want to drop another commit on top and restore the two files?

@knapply (Collaborator, Author) commented on Jun 16, 2020

Yes, I'll get it sorted ASAP.

@eddelbuettel (Owner)

Sounds good, and no need to rush.

@lemire (Collaborator) commented on Jun 17, 2020

The upstream issue has been reproduced. It should be "easy" to fix. :-) We will fix it before release 0.4 (which is coming soon).

cc @jkeiser

@knapply (Collaborator, Author) commented on Jun 17, 2020

@lemire Your response time is amazing. We all really appreciate it.

@eddelbuettel The previous file versions have been restored and vectorized versions of .deserialize_json() and .load_json() with tests/docs are in.

@eddelbuettel (Owner) commented on Jun 17, 2020

I'll merge. It is still only from your branch (off a fork of this repo) into a branch here, so we do need another pass anyway before any of this becomes "real".

@eddelbuettel merged commit 1c7ef14 into eddelbuettel:feature/deserialize on Jun 17, 2020