-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge deserialize branch #21
Conversation
…ts that will be thrown away
…instead of "untyped" in function names
feature/deserialize
…imdJson/common.hpp so build_data_frame() can go where it should have been (inst/include/RcppSimdJson/deserialize/dataframe.hpp)
…ment, add tests, and rebuild
Feature/deserialize
@eddelbuettel I don't have an issue with however you want the commits to happen. I'm trying to minimize any hand-holding, but your continued patience is appreciated. Short answer to your question: let's merge!
From where I stand, coming to a consensus on the user-facing R API should be the next priority. |
Agreed! Thiis is a really nice (big) step forward from "oh look Dirk connected to simdjson, but we can't actually do anything with it" :) We'll figure the rest out as we go along. By data.frame column do you mean list columns as in data.table and tibble (and base R, but without a pretty printer there) ? |
For what it's worth, anyone in R land questioning your impact on the ecosystem hasn't actually looked into it ;) Regarding data frames, I think you're more getting at something I've been playing with mentally, but haven't been able to fully articulate. I'll eventually have some stuff to show rather than try and tell. What I'm getting at above is how columns are currently diagnosed (name, position, simdjson type, ultimate R type): rcppsimdjson/inst/include/RcppSimdJson/deserialize/dataframe.hpp Lines 11 to 19 in 1c7ef14
The initial improvement is to use But.... there's another issue I only realized over the weekend: the JSON spec doesn't actually require unique object keys. "The names within an object SHOULD be unique." I had wrongly 🤦♂️🤦♂️ assumed that duplicate keys are techinically illegal, so we'd end up handling that after we feel good about valid JSON (if simdjson even supported it), but it is legal so simdjson does. In my experience 1) duplicate keys are not common, 2) have always been a confirmable red flag that something else is wrong with the data, and 3) are not likely part of an object inside an array of only objects (meaning they're not data-frame-able and would never pass through this code anyway). I feel all of those go for R data frames will duplicate column names as well. But that's me, and unfortunately that's all totally anecdotal. So a few questions need to be answered, which will require some experiments and checking if there's an informal standard among current R packages (CC @dcooley):
All of that said, I don't think supporting duplicate column names should be prioritized nor should they delay this, but it should be addressed eventually. Funny enough, this taught me that seemingly every Python JSON module is wrong if it uses a dictionary. As far as I know, that's all of them, including the standard library's. >>> import json
>>> json.loads('{"a":1,"a":2}')
{'a': 2} |
Since it turned out the duplicate key behavior actually follows the norm of other packages, I swapped out the We're still waiting for a thumbs up from Dave, but if another commit can be stomached the final touch is available. Otherwise, it'll be on standby: https://github.com/knapply/rcppsimdjson/tree/feature/deserialize |
thumbs up from me! |
Ok, will merge. Any objections to merging as a squash-and-merge because 33 is a little on the large side? |
as you see fit. |
Assuming there is nothing (major) left to do, shall we merge this?
It's an almost clean commit line apart from the one interim merge and then continued commits on the branch (you can sort-of see that in the middle )
Anyway, we likely have bigger fish to fry (and I could insist squash-merging/rebasing but I think I won't). But it might be a good time to carry this over to the main branch...
Thoughts, gentlemen?