Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'enforce_lapply' or similar to fparse/fload #64

Closed
tdeenes opened this issue Feb 8, 2021 · 9 comments · Fixed by #65
Closed

Add 'enforce_lapply' or similar to fparse/fload #64

tdeenes opened this issue Feb 8, 2021 · 9 comments · Fixed by #65

Comments

@tdeenes
Copy link

tdeenes commented Feb 8, 2021

It is a great feature of the package that one does not need to call lapply() on character vectors since fparse\fload are already vectorized. However, this also means that the output type depends on whether the input contains a single element or multiple elements, e.g.:

inp2 <- as.character(1:2)
RcppSimdJson::fparse(inp2)
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 2

inp1 <- as.character(1)
RcppSimdJson::fparse(inp1)
# [1] 1

So to be on the safe side, one has to check if the input vector is of length 1, and if yes, one has to wrap the result in a list. E.g.:

safe_fparse <- function(json, ...) {
  out <- RcppSimdJson::fparse(json, ...)
  if (length(json) == 1L) out <- list(out)
  out
}
safe_fparse("1")
# [[1]]
# [1] 1

What about adding an argument to fparse/fload which enforces that each element of the input json is wrapped in a list?

@eddelbuettel
Copy link
Owner

I wrestle with that in other contexts too where we have data downloaders. With multiple requests it makes sense to returns list around individual sub-components. The question then is what to do for N==1. I usually also 'unlist' if the length is one for convenience. But I hear you -- it may make sense to add a toggle to not do that here so that we know we always have a list.

Thoughts, @knapply ?

@knapply
Copy link
Collaborator

knapply commented Feb 8, 2021

My gut is telling me safe_fparse() (or an equivalent that doesn't imply fparse() is unsafe 😅) will be fairly niche and better left to the user. I'd probably write it something like the following.

fparse_to_list <- function(json, ...) {
  if (length(json) == 1L) lapply(json, RcppSimdJson::fparse, ...) 
  else RcppSimdJson::fparse(json, ...)
}

I recall considering having fparse()/fload() variants that would have been explicitly intended for multi-element vectors (which would always return a list), but decided the "vectorized" approach of having one function seemed more idiomatically R (for better, but maybe for worse). But, we do diverge from the behavior of something like strsplit() (always returning a list) because...

  1. the majority of usage will be for single characters
  2. you're far more likely to know the length of the vector you're passing to fparse()/fload() than the exact structure of the JSON... meaning you're almost always going to have to do some post-parse-processing for anything but trivial schemata (or those than be parsed directly to a "tidy" data.frame).

Another approach I recall considering was to provide a parser= parameter. The user can then create a parser object (externalptr) that can be passed to fparse()/fload() to reuse the parser. The use pattern would then look like something like the following...

my_parser <- parser()
lapply(many_json, fparse, parser =  my_parser)

The issue here is that going to a full-on Rcpp modules interface would make more sense for anything that feels object-oriented-ish -- we'd want to be able to access or modify the parser's state (alive? capacity? max_depth?).

All of that said, I'm not exactly opposed to adding the option, I'm just not convinced it's worthwhile. I'll ponder some more, but @tdeenes do you have a more involved use-case where this has been a pain to deal with?

@tdeenes
Copy link
Author

tdeenes commented Feb 8, 2021

Thanks for the details, @knapply . As for my use case: nowadays more and more RDMS-s support JSON-encoded columns (for example Snowflake has dedicated data types for it). So if you query a table which has a JSON column, you will get it as a character vector in R, which can be parsed to a list vector very efficiently by fparse(). Or if you combine it with data.table and rbindlist(), you can easily parse it to a data.table having as many rows as the length of the JSON vector (probably this is what you meant by "tidy" data.frame).

If someone expects multi-element JSON vectors, and fails to cover the one-element case by proper unit tests, the current defaults of fparse() can easily lead to disastrous bugs downstream. Fortunately I was getting the error message in the development phase and could easily figure out what caused it, but I was not expecting the current (non-type safe) behaviour.

So in a nutshell, reflecting directly upon 1) and 2):

  1. I am sure you implemented and documented fparse() as a vectorized function with a good reason - the pure existence of the vectorized nature of it proves that this is not a niche use case.
  2. You are absolutely right, but I do not get the point why it would support on anti-strsplit behaviour. Or on the contrary, IMO it fully supports the view that your strsplit() example is a perfect analogue: just as with strsplit, you do not know how many "parts" each element of your vector consist of, so it is much more consistent to return the result as a length-consistent list. If you know that you have a single element, you just call result[[1]] in the end, or do whatever customization your ETL process requires.

@knapply
Copy link
Collaborator

knapply commented Feb 8, 2021

As the day went on, the inconsistency has seemed like more of a mistake. Thanks for talking it out; I'm sold.

@eddelbuettel, in terms of API, this can be accomplished either via an additional parameter like @tdeenes initially suggested, or as separate functions that wrap fparse()/fload() like the examples mentioned here (probably w/ a preference for explicit arguments instead of ...).

Before diving in, are there particularly strong feelings either way?

@eddelbuettel
Copy link
Owner

Or (and just thinking out loud) fparse() and fload() can (and likely will) easily add another boolean toggle. We could then also add (if we feel it makes sense) another layer in liststablefparse() (not a great name, but you get the idea). We could also do the second step 'later'.

@tdeenes
Copy link
Author

tdeenes commented Feb 8, 2021

Actually I do favor the additional parameter-based approach (which would default to the current behavior first and then switched to the consistent one after a deprecation period), but do not like the name that I suggested in the title of this thread. enforce_lapply sounds as if fparse used the much less efficient lapply-implementation. Maybe preserve_length or return_list are more explicit and also do not interfere with the max_simplify_lvl argument (which is orthogonal to the recent issue).

@eddelbuettel
Copy link
Owner

as_list ? return_list ?

preserve_length sounds wrong as we weren't about to add or remove elements. enforce sounds too harsh too.

@knapply
Copy link
Collaborator

knapply commented Feb 8, 2021

always_list?

@knapply
Copy link
Collaborator

knapply commented Feb 10, 2021

Try #65 on for size...

# remotes::install_github("eddelbuettel/rcppsimdjson@issues/64")

inp2 <- as.character(1:2)
RcppSimdJson::fparse(inp2)
##> [[1]]
##> [1] 1
##> 
##> [[2]]
##> [1] 2
RcppSimdJson::fparse(inp2[[1L]], always_list = TRUE)
##> [[1]]
##> [1] 1
RcppSimdJson::fparse(c(some_name = inp2[[1L]]), always_list = TRUE)
##> $some_name
##> [1] 1

eddelbuettel added a commit that referenced this issue Feb 10, 2021
(issue #64) add `always_list` parameter (fparse/fload)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants