
Merge deserialize branch #21

Merged: 33 commits into master from feature/deserialize, Jun 18, 2020

Conversation

eddelbuettel
Owner

Assuming there is nothing (major) left to do, shall we merge this?

It's an almost clean commit line apart from the one interim merge and the continued commits on the branch afterwards (you can sort of see that in the middle).

[screenshot: commit graph of the branch, showing the interim merge in the middle]

Anyway, we likely have bigger fish to fry (and I could insist on squash-merging/rebasing, but I think I won't). But it might be a good time to carry this over to the main branch...

Thoughts, gentlemen?

knapply added 30 commits June 7, 2020 16:01
…imdJson/common.hpp so build_data_frame() can go where it should have been (inst/include/RcppSimdJson/deserialize/dataframe.hpp)
@knapply
Collaborator

knapply commented Jun 17, 2020

@eddelbuettel I don't have an issue with however you want the commits to happen. I'm trying to minimize any hand-holding, but your continued patience is appreciated.

Short answer to your question: let's merge!

  • I'm sure we're going to find little improvements as time goes on (I expect I'll want to revisit data frame columns, but it'll require a bunch of experiments first), but nothing major at this time.
  • I have a "path forward" that I'll propose separately regarding what I think is the next (last?) major hurdle (NDJSON, JSONL, and friends) as well as the little JSON utilities, but wrapping up this stage so it's "user-ready" should happen first.

From where I stand, coming to a consensus on the user-facing R API should be the next priority.

@eddelbuettel
Owner Author

Agreed! This is a really nice (big) step forward from "oh look, Dirk connected to simdjson, but we can't actually do anything with it" :)

We'll figure the rest out as we go along. By data.frame columns do you mean list columns as in data.table and tibble (and base R, but without a pretty printer there)?

@knapply
Collaborator

knapply commented Jun 17, 2020

For what it's worth, anyone in R land questioning your impact on the ecosystem hasn't actually looked into it ;)

Regarding data frames, I think you're more getting at something I've been playing with mentally, but haven't been able to fully articulate. I'll eventually have some stuff to show rather than try and tell.

What I'm getting at above is how columns are currently diagnosed (name, position, simdjson type, ultimate R type):

// Per-column diagnosis: the position at which the key was encountered plus the
// accumulated type information (simdjson type and ultimate R type).
template <Type_Policy type_policy> struct Column {
  R_xlen_t index = 0L;
  Type_Doctor<type_policy> schema = Type_Doctor<type_policy>();
};

// Column name -> column diagnosis, keyed by the object keys encountered.
template <Type_Policy type_policy> struct Column_Schema {
  std::map<std::string_view, Column<type_policy>> schema =
      std::map<std::string_view, Column<type_policy>>();
};

The initial improvement is to use std::unordered_map instead of std::map, especially since Column already tracks the indices separately anyways 🤦‍♂️. I think this is a leftover from using std::map in the early prototype: I was only reminded that data frame columns should follow the order in which object keys are encountered after running round-trip tests on all the data frames in {datasets}.
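
For illustration only (this is not RcppSimdJson code, and Column here is a simplified stand-in for the struct above), a minimal, self-contained sketch of the idea: the container can be unordered as long as each column records the index at which its key was first encountered, since that index is enough to rebuild encounter order when the data frame is assembled.

// Standalone sketch, not RcppSimdJson code: keep the container unordered and
// record encounter order explicitly, since that order (not std::map's sorted
// order) is what the resulting data frame's columns should follow.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Column {
  std::size_t index;      // order in which the key was first encountered
  std::string type_note;  // stand-in for the Type_Doctor-style type diagnosis
};

int main() {
  std::unordered_map<std::string, Column> schema;
  for (const char* key : {"mpg", "cyl", "disp"}) {
    // try_emplace inserts only on the first encounter, fixing that key's index
    schema.try_emplace(key, Column{schema.size(), "TBD"});
  }

  // Rebuild encounter order from the stored indices when assembling columns.
  std::vector<std::string> ordered(schema.size());
  for (const auto& [key, col] : schema) ordered[col.index] = key;

  for (const auto& key : ordered) std::cout << key << '\n';  // mpg, cyl, disp
}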

But.... there's another issue I only realized over the weekend: the JSON spec doesn't actually require unique object keys.

"The names within an object SHOULD be unique."

I had wrongly 🤦‍♂️🤦‍♂️ assumed that duplicate keys are technically illegal, so we'd only end up handling them after we felt good about valid JSON (and only if simdjson even supported them), but they are legal, so simdjson does.

In my experience, 1) duplicate keys are not common, 2) they have always been a confirmable red flag that something else is wrong with the data, and 3) they are not likely to be part of an object inside an array of only objects (meaning they're not data-frame-able and would never pass through this code anyway). I feel all of those apply to R data frames with duplicate column names as well. But that's me, and unfortunately that's all totally anecdotal.

So a few questions need to be answered, which will require some experiments and checking if there's an informal standard among current R packages (CC @dcooley):

  1. How do edge cases get handled?
  • If one "row" has 2 of the same key, I guess the resulting data frame should have 2 identically named columns.
    • What happens when the next "row" only has one of that key?
      • Does it just go in the first of those identically named columns?
      • If so, why?
  2. How do you track maybe-duplicate keys (that may have completely different types) while also keeping track of the order in which those keys are found, without unreasonable impacts on performance for yuge files? (A rough sketch of one possibility follows this list.)
  • After doing more research (this is by far the most time I've spent with C++, so the learning curve is still near-vertical), I suspect the ideal solution will be rather custom.
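
To make the second question concrete, here's a rough, standalone sketch (not RcppSimdJson code; ColumnSlot and ColumnSchema are made-up names for illustration) of one duplicate-tolerant, order-preserving layout: columns live in a vector in encounter order, so repeated names are representable, and a separate name-to-positions index keeps lookups cheap.

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only: the vector is the source of truth for order
// and allows repeated names; the multimap merely maps a name to its positions.
struct ColumnSlot {
  std::string name;
  std::string type_note;  // stand-in for accumulated type information
};

struct ColumnSchema {
  std::vector<ColumnSlot> columns;                          // encounter order
  std::unordered_multimap<std::string, std::size_t> index;  // name -> positions

  std::size_t add(std::string name) {
    columns.push_back(ColumnSlot{name, "TBD"});
    index.emplace(std::move(name), columns.size() - 1);
    return columns.size() - 1;
  }
};

int main() {
  ColumnSchema schema;
  for (const char* key : {"a", "b", "a"}) schema.add(key);

  for (const auto& col : schema.columns) std::cout << col.name << ' ';  // a b a
  std::cout << "\ncolumns named \"a\": " << schema.index.count("a") << '\n';  // 2
}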

All of that said, I don't think supporting duplicate column names should be prioritized, nor should it delay this, but it should be addressed eventually.

Funny enough, this taught me that seemingly every Python JSON module is wrong if it uses a dictionary. As far as I know, that's all of them, including the standard library's.

>>> import json
>>> json.loads('{"a":1,"a":2}')
{'a': 2}

@knapply
Collaborator

knapply commented Jun 18, 2020

Since it turned out the duplicate key behavior actually follows the norm of other packages, I swapped out the std::maps for std::unordered_map (and removed a vestigial variable in the same function), so everything I said above is no longer relevant.

We're still waiting for a thumbs up from Dave, but if another commit can be stomached, the final touch is available. Otherwise, it'll be on standby: https://github.com/knapply/rcppsimdjson/tree/feature/deserialize

@dcooley
Collaborator

dcooley commented Jun 18, 2020

thumbs up from me!

@eddelbuettel
Owner Author

Ok, will merge. Any objections to merging as a squash-and-merge, since 33 commits is a little on the large side?

@dcooley
Collaborator

dcooley commented Jun 18, 2020

as you see fit.

@eddelbuettel eddelbuettel merged commit fd472b1 into master Jun 18, 2020
@eddelbuettel eddelbuettel deleted the feature/deserialize branch June 18, 2020 22:37