
Merge deserialize branch #21

Merged: 33 commits into master from feature/deserialize, Jun 18, 2020

Conversation

eddelbuettel
Owner

Assuming there is nothing (major) left to do, shall we merge this?

It's an almost clean commit line apart from the one interim merge and the continued commits on the branch afterwards (you can sort of see that in the middle).

[screenshot: commit graph of the branch, showing the interim merge in the middle]

Anyway, we likely have bigger fish to fry (and I could insist on squash-merging/rebasing, but I think I won't). But it might be a good time to carry this over to the main branch...

Thoughts, gentlemen?

knapply added 30 commits June 7, 2020 16:01
…imdJson/common.hpp so build_data_frame() can go where it should have been (inst/include/RcppSimdJson/deserialize/dataframe.hpp)
@knapply
Collaborator

knapply commented Jun 17, 2020

@eddelbuettel I don't have an issue with however you want the commits to happen. I'm trying to minimize any hand-holding, but your continued patience is appreciated.

Short answer to your question: let's merge!

  • I'm sure we're going to find little improvements as time goes on (I expect I'll want to revisit data frame columns, but it'll require a bunch of experiments first), but nothing major at this time.
  • I have a "path forward" that I'll propose separately regarding what I think is the next (last?) major hurdle (NDJSON, JSONL, and friends) as well as the little JSON utilities, but wrapping up this stage so it's "user-ready" should happen first.

From where I stand, coming to a consensus on the user-facing R API should be the next priority.

@eddelbuettel
Owner Author

Agreed! This is a really nice (big) step forward from "oh look, Dirk connected to simdjson, but we can't actually do anything with it" :)

We'll figure the rest out as we go along. By data.frame columns do you mean list columns as in data.table and tibble (and base R, but without a pretty printer there)?

@knapply
Collaborator

knapply commented Jun 17, 2020

For what it's worth, anyone in R land questioning your impact on the ecosystem hasn't actually looked into it ;)

Regarding data frames, I think you're more getting at something I've been playing with mentally, but haven't been able to fully articulate. I'll eventually have some stuff to show rather than try and tell.

What I'm getting at above is how columns are currently diagnosed (name, position, simdjson type, ultimate R type):

// Per-column diagnosis: the position at which the key was encountered plus the
// accumulated type information (simdjson type and ultimate R type).
template <Type_Policy type_policy> struct Column {
  R_xlen_t index = 0L;
  Type_Doctor<type_policy> schema = Type_Doctor<type_policy>();
};

// Column name -> column diagnosis, keyed by the object keys encountered.
template <Type_Policy type_policy> struct Column_Schema {
  std::map<std::string_view, Column<type_policy>> schema =
      std::map<std::string_view, Column<type_policy>>();
};

The initial improvement is to use std::unordered_map instead of std::map, especially since Column already tracks the indices separately anyways 🤦‍♂️. I think this is a leftover from using std::map in the early prototype: I was only reminded that data frame columns should follow the order in which object keys are encountered after running round-trip tests on all the data frames in {datasets}.
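
For illustration only (this is not RcppSimdJson code, and Column here is a simplified stand-in for the struct above), a minimal, self-contained sketch of the idea: the container can be unordered as long as each column records the index at which its key was first encountered, since that index is enough to rebuild encounter order when the data frame is assembled.

// Standalone sketch, not RcppSimdJson code: keep the container unordered and
// record encounter order explicitly, since that order (not std::map's sorted
// order) is what the resulting data frame's columns should follow.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Column {
  std::size_t index;      // order in which the key was first encountered
  std::string type_note;  // stand-in for the Type_Doctor-style type diagnosis
};

int main() {
  std::unordered_map<std::string, Column> schema;
  for (const char* key : {"mpg", "cyl", "disp"}) {
    // try_emplace inserts only on the first encounter, fixing that key's index
    schema.try_emplace(key, Column{schema.size(), "TBD"});
  }

  // Rebuild encounter order from the stored indices when assembling columns.
  std::vector<std::string> ordered(schema.size());
  for (const auto& [key, col] : schema) ordered[col.index] = key;

  for (const auto& key : ordered) std::cout << key << '\n';  // mpg, cyl, disp
}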

But.... there's another issue I only realized over the weekend: the JSON spec doesn't actually require unique object keys.

"The names within an object SHOULD be unique."

I had wrongly 🤦‍♂️🤦‍♂️ assumed that duplicate keys are technically illegal, so we'd only end up handling them after we felt good about valid JSON (and only if simdjson even supported them), but they are legal, so simdjson does.

In my experience, 1) duplicate keys are not common, 2) they have always been a confirmable red flag that something else is wrong with the data, and 3) they are not likely to be part of an object inside an array of only objects (meaning they're not data-frame-able and would never pass through this code anyway). I feel all of those apply to R data frames with duplicate column names as well. But that's me, and unfortunately that's all totally anecdotal.

So a few questions need to be answered, which will require some experiments and checking if there's an informal standard among current R packages (CC @dcooley):

  1. How do edge cases get handled?
  • If one "row" has 2 of the same key, I guess the resulting data frame should have 2 identically named columns.
    • What happens when the next "row" only has one of that key?
      • Does it just go in the first of those identically named columns?
      • If so, why?
  2. How do you track maybe-duplicate keys (that may have completely different types) while also keeping track of the order in which those keys are found, without unreasonable impacts on performance for yuge files? (A rough sketch of one possibility follows this list.)
  • After doing more research (this is by far the most time I've spent with C++, so the learning curve is still near-vertical), I suspect the ideal solution will be rather custom.
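
To make the second question concrete, here's a rough, standalone sketch (not RcppSimdJson code; ColumnSlot and ColumnSchema are made-up names for illustration) of one duplicate-tolerant, order-preserving layout: columns live in a vector in encounter order, so repeated names are representable, and a separate name-to-positions index keeps lookups cheap.

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only: the vector is the source of truth for order
// and allows repeated names; the multimap merely maps a name to its positions.
struct ColumnSlot {
  std::string name;
  std::string type_note;  // stand-in for accumulated type information
};

struct ColumnSchema {
  std::vector<ColumnSlot> columns;                          // encounter order
  std::unordered_multimap<std::string, std::size_t> index;  // name -> positions

  std::size_t add(std::string name) {
    columns.push_back(ColumnSlot{name, "TBD"});
    index.emplace(std::move(name), columns.size() - 1);
    return columns.size() - 1;
  }
};

int main() {
  ColumnSchema schema;
  for (const char* key : {"a", "b", "a"}) schema.add(key);

  for (const auto& col : schema.columns) std::cout << col.name << ' ';  // a b a
  std::cout << "\ncolumns named \"a\": " << schema.index.count("a") << '\n';  // 2
}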

All of that said, I don't think supporting duplicate column names should be prioritized, nor should it delay this, but it should be addressed eventually.

Funny enough, this taught me that seemingly every Python JSON module is wrong if it uses a dictionary. As far as I know, that's all of them, including the standard library's.

>>> import json
>>> json.loads('{"a":1,"a":2}')
{'a': 2}

@knapply
Collaborator

knapply commented Jun 18, 2020

Since it turned out the duplicate key behavior actually follows the norm of other packages, I swapped out the std::maps for std::unordered_map (and removed a vestigial variable in the same function), so everything I said above is no longer relevant.

We're still waiting for a thumbs up from Dave, but if another commit can be stomached, the final touch is available. Otherwise, it'll be on standby: https://github.com/knapply/rcppsimdjson/tree/feature/deserialize

@dcooley
Collaborator

dcooley commented Jun 18, 2020

thumbs up from me!

@eddelbuettel
Owner Author

Ok, will merge. Any objections to merging as a squash-and-merge, since 33 commits is a little on the large side?

@dcooley
Collaborator

dcooley commented Jun 18, 2020

as you see fit.

@eddelbuettel eddelbuettel merged commit fd472b1 into master Jun 18, 2020
@eddelbuettel eddelbuettel deleted the feature/deserialize branch June 18, 2020 22:37