from_json #10

Closed
dcooley opened this issue May 17, 2020 · 34 comments

Comments

@dcooley
Collaborator

dcooley commented May 17, 2020

I've been working on a prototype from_json() functionality here in my fork, which follows the exact same logic as jsonify

A demo of its current output is

js <- '{}'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# named list()

js <- '[]'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# character(0)

js <- '1'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# [1] 1

js <- '"a"'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# [1] "a"

js <- '[{"x":1,"y":2},[1,2,3]]'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# [[1]]
# [[1]]$x
# [1] 1
# 
# [[1]]$y
# [1] 2
# 
# 
# [[2]]
# [1] 1 2 3

js <- '[1,2,3,4]'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# [1] 1 2 3 4

js <- '[[1,2,3],[4,5,6]]'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

js <- '[["a","b"],["c","d"]]'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
#     [,1] [,2]
# [1,] "a"  "b" 
# [2,] "c"  "d" 

js <- '[1,2,3,4,[1,2,3]]'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 2
# 
# [[3]]
# [1] 3
# 
# [[4]]
# [1] 4
# 
# [[5]]
# [1] 1 2 3

js <- '{"x":1,"y":2}'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# $x
# [1] 1
# 
# $y
# [1] 2

js <- '{"x":1,"y":2,"z":[1,2,3]}'
RcppSimdJson:::rcpp_from_json( js, TRUE, TRUE )
# $x
# [1] 1
# 
# $y
# [1] 2
# 
# $z
# [1] 1 2 3
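The shapes in the demo above suggest a small set of simplification rules. As a rough illustration only, here is a Python sketch of rules inferred from those outputs; it is not the logic jsonify or the fork actually uses, `simplify` and the `"vector"` / `"matrix"` / `"list"` / `"named_list"` tags are made-up names, and type coercion between mixed scalars is ignored.

```python
import json

def is_scalar(v):
    return v is None or isinstance(v, (bool, int, float, str))

def simplify(v):
    """Return a (kind, payload) pair describing the R-like shape of v."""
    if is_scalar(v):
        return ("vector", [v])            # R scalars are length-1 vectors
    if isinstance(v, dict):
        return ("named_list", {k: simplify(x) for k, x in v.items()})
    # v is a JSON array from here on
    if all(is_scalar(x) for x in v):
        return ("vector", list(v))        # also covers the empty array
    rows_ok = all(isinstance(x, list) and all(is_scalar(y) for y in x) for x in v)
    if rows_ok and len({len(x) for x in v}) == 1:
        return ("matrix", [list(x) for x in v])   # equal-length rows of scalars
    return ("list", [simplify(x) for x in v])     # anything else stays a list

for js in ['[1,2,3,4]', '[[1,2,3],[4,5,6]]', '[1,2,3,4,[1,2,3]]']:
    print(js, "->", simplify(json.loads(js))[0])
```

The three inputs in the loop are taken from the demo and come out as a vector, a matrix and a list respectively, matching the shapes printed above.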

Are you happy for me to make a PR so this from_json() lives inside RcppSimdJson, or would you prefer RcppSimdJson to remain as an 'Interface' library, clear of any R clutter?


Also tagging @knapply, who has been working on something similar and may have another implementation.

@eddelbuettel
Owner

Lovely :)

Fair point re a possible interface package, but I think it would be lovely to have this and expose it. This also turned into a really neat conversation with upstream, who may enjoy seeing the feature present and, if you wish, the extra 'test coverage' it provides, since more user data may be coming this way.

@knapply
Collaborator

knapply commented May 17, 2020

I started playing with simdjson + Rcpp back in September(?), but I clearly didn't make much public progress: https://github.com/knapply/simdjsonr

@dcooley, I haven't spent any time on a "simplify" workflow (I just don't really use it myself), but now that I've seen the JSON Pointer part in action, I think simdjson is a total game-changer.

I have no idea how I missed the Pointer functionality in RapidJSON, but I barely knew what I was doing in C++ when I was working with it more regularly (not that I know what I'm doing 7ish months later). Regardless, it has already made working with enormous (ND)JSON(L) data sets from R actually viable.


All that said, it doesn't make much sense to do my own thing elsewhere, especially since this is already on CRAN and rapport with the folks upstream has been established.

The approach I'm using follows (I just dropped the files in gists instead of pushing a bunch of garbage to the old repo). I'll start aggregating things into my fork for proper PRs.

Dave, I suspect some combination of our approaches may make sense 🤷‍♂, but it's a pretty safe bet you've spent more time thinking about JSON (and I default to assuming my C++ code is a ticking time bomb).

https://gist.github.com/knapply/0cfda08e85ba3fa4f7e61071f83d4768

parse_json.cpp
// SIMDJSON_VERSION == 0.3.1

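#include <climits>      // INT_MAX, INT_MIN
#include <cstdint>      // int64_t, uint64_t
#include <cstring>      // std::memcpy
#include <string>       // std::string, std::to_string
#include <string_view>  // std::string_view

#include <Rcpp.h>
#include <simdjson/simdjson.h>

#include <simdjson/simdjson.cpp>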

namespace Rcpp {

template <>
inline SEXP wrap<int64_t>(const int64_t& obj) {
  auto out = Rcpp::NumericVector(1);
  std::memcpy(&(out[0]), &(obj), sizeof(double));
  out.attr("class") = "integer64";
  return out;
}

}  // namespace Rcpp

namespace simdjsonr {

namespace dom = simdjson::dom;  // the helpers below refer to dom:: unqualified

template <typename int_T>
inline constexpr bool is_really_int64_t(int_T);

template <>
inline constexpr bool is_really_int64_t<uint64_t>(uint64_t x) {
  return x > INT_MAX - 1;
}

template <>
inline constexpr bool is_really_int64_t<int64_t>(int64_t x) {
  return x > INT_MAX - 1 || x < INT_MIN + 1;
}

template <typename int_T, bool bit64_integer64, bool int_64_strings>
inline constexpr SEXP resolve_int(int_T x) {
  return is_really_int64_t<int_T>(x)
             ? (bit64_integer64 ? Rcpp::wrap<int64_t>(x)
                                : int_64_strings ? Rcpp::wrap(std::to_string(x)) : Rcpp::wrap<double>(x))
             : Rcpp::wrap<int>(x);
}

template <typename F>
inline SEXP build_object(dom::object&& object, F f) {
  const R_xlen_t n = std::size(object);

  Rcpp::List out(n);
  Rcpp::CharacterVector out_names(n);

  R_xlen_t i = 0;
  for (auto [key, val] : object) {
    out[i] = f(val);
    out_names[i] = std::string(key);
    i++;
  }

  out.attr("names") = out_names;
  return out;
}

template <typename F>
inline auto build_array(dom::array&& object, F f) {
  Rcpp::List out;
  for (dom::element child : object) {
    out.push_back(f(child));
  }
  return out;
}

template <bool bit64_integer64, bool int_64_strings>
SEXP dump_json(dom::element element) {
  switch (element.type()) {
    case dom::element_type::ARRAY:
      return build_array(element, dump_json<bit64_integer64, int_64_strings>);

    case dom::element_type::OBJECT:
      return build_object(element, dump_json<bit64_integer64, int_64_strings>);

    case dom::element_type::INT64:
      return resolve_int<int64_t, bit64_integer64, int_64_strings>(element);

    case dom::element_type::UINT64:
      return resolve_int<uint64_t, bit64_integer64, int_64_strings>(element);

    case dom::element_type::DOUBLE:
      return Rcpp::wrap<double>(element);

    case dom::element_type::STRING:
      return Rcpp::wrap(std::string(element));

    case dom::element_type::BOOL:
      return Rcpp::wrap<bool>(element);

    case dom::element_type::NULL_VALUE:
      [[fallthrough]];
    default:
      return R_NilValue;
  }
}

template <bool use_json_pointer>
inline constexpr simdjson::dom::element stage_element(simdjson::dom::element element,
                                                      const std::string_view& json_pointer) {
  return use_json_pointer ? element.at(json_pointer) : element;
}

template <bool warning>
inline constexpr void throw_bad_parse(const char* msg) {
  warning ? Rcpp::warning(msg) : Rcpp::stop(msg);
}

template <bool warning, bool use_json_pointer, bool bit64_integer64, bool int_64_strings>
SEXP parse_json(const Rcpp::CharacterVector& json, const std::string_view& json_pointer) {
  const R_xlen_t n = std::size(json);

  Rcpp::List out(n);
  simdjson::dom::parser parser;

  for (R_xlen_t i = 0; i < n; ++i) {
    auto [res, error] = parser.parse(std::string_view(json[i]));

    if (error) {
      throw_bad_parse<warning>("parse error");
      continue;
    }

    out[i] = dump_json<bit64_integer64, int_64_strings>(stage_element<use_json_pointer>(res, json_pointer));
  }

  return out;
}

inline constexpr auto parse_int64_as_integer64_stop = parse_json<false, false, true, false>;
inline constexpr auto parse_int64_as_string_stop = parse_json<false, false, false, true>;
inline constexpr auto parse_int64_as_double_stop = parse_json<false, false, false, false>;
inline constexpr auto parse_pointer_int64_as_integer64_stop = parse_json<false, true, true, false>;
inline constexpr auto parse_pointer_int64_as_string_stop = parse_json<false, true, false, true>;
inline constexpr auto parse_pointer_int64_as_double_stop = parse_json<false, true, false, false>;

inline constexpr auto parse_int64_as_integer64_warning = parse_json<true, false, true, false>;
inline constexpr auto parse_int64_as_string_warning = parse_json<true, false, false, true>;
inline constexpr auto parse_int64_as_double_warning = parse_json<true, false, false, false>;
inline constexpr auto parse_pointer_int64_as_integer64_warning = parse_json<true, true, true, false>;
inline constexpr auto parse_pointer_int64_as_string_warning = parse_json<true, true, false, true>;
inline constexpr auto parse_pointer_int64_as_double_warning = parse_json<true, true, false, false>;

}  // namespace simdjsonr

//
//

// [[Rcpp::export(.parse_json_impl)]]
SEXP parse_json_impl(const Rcpp::CharacterVector& json,
                     const std::string& json_pointer,
                     const bool bit64_integer64,
                     const bool int_64_strings,
                     const bool error_on_bad_parse) {
  using namespace simdjsonr;

  const auto use_pointer = !json_pointer.empty();

  if (error_on_bad_parse) {
    if (bit64_integer64) {
      return use_pointer ? parse_pointer_int64_as_integer64_stop(json, json_pointer)
                         : parse_int64_as_integer64_stop(json, json_pointer);
    }
    if (int_64_strings) {
      return use_pointer ? parse_pointer_int64_as_string_stop(json, json_pointer)
                         : parse_int64_as_string_stop(json, json_pointer);
    } else {
      return use_pointer ? parse_pointer_int64_as_double_stop(json, json_pointer)
                         : parse_int64_as_double_stop(json, json_pointer);
    }

  } else {
    if (bit64_integer64) {
      return use_pointer ? parse_pointer_int64_as_integer64_warning(json, json_pointer)
                         : parse_int64_as_integer64_warning(json, json_pointer);
    }
    if (int_64_strings) {
      return use_pointer ? parse_pointer_int64_as_string_warning(json, json_pointer)
                         : parse_int64_as_string_warning(json, json_pointer);
    } else {
      return use_pointer ? parse_pointer_int64_as_double_warning(json, json_pointer)
                         : parse_int64_as_double_warning(json, json_pointer);
    }
  }
}

... and simdjson_parse.md has the R wrapper function and some examples of what it looks like in action...

simdjson_parse <- function(x, json_pointer = "",
                           int64 = c("auto", "integer64", "string", "double"),
                           error_on_bad_parse = TRUE) {
  int64 <- match.arg(int64, c("auto", "integer64", "string", "double"))

  bit64_available <- requireNamespace("bit64", quietly = TRUE)
  if (int64 == "integer64" && !bit64_available) {
    stop('`int64` set to `"integer64"`, but {bit64} is not installed.')
  }
  if (int64 == "auto") {
    # prefer bit64::integer64; fall back to character when {bit64} is missing
    int64 <- if (bit64_available) "integer64" else "string"
  }

  if (int64 == "integer64") { # int64_t as bit64::integer64
    out <- .parse_json_impl(
      json = x, json_pointer,
      bit64_integer64 = TRUE, int_64_strings = FALSE,
      error_on_bad_parse = error_on_bad_parse
    )
  } else if (int64 == "string") { # int64_t as character
    out <- .parse_json_impl(
      json = x, json_pointer,
      bit64_integer64 = FALSE, int_64_strings = TRUE,
      error_on_bad_parse = error_on_bad_parse
    )
  } else { # int64_t as double
    out <- .parse_json_impl(
      json = x, json_pointer,
      bit64_integer64 = FALSE, int_64_strings = FALSE,
      error_on_bad_parse = error_on_bad_parse
    )
  }
  
  if (length(out) > 1L) out else out[[1L]]
}
simdjson_parse("[]")
## list()
simdjson_parse("{}")
## named list()
simdjson_parse('{"simd":["j","s","o","n"]}')
## $simd
## $simd[[1]]
## [1] "j"
## 
## $simd[[2]]
## [1] "s"
## 
## $simd[[3]]
## [1] "o"
## 
## $simd[[4]]
## [1] "n"
simdjson_parse(c("bad_json", '{"good_json":true}'))
## Error in .parse_json_impl(json = x, json_pointer, bit64_integer64 = TRUE, : parse error
simdjson_parse(c("bad_json", '{"good_json":true}'), error_on_bad_parse = FALSE)
## Warning in .parse_json_impl(json = x, json_pointer, bit64_integer64 = TRUE, :
## parse error

## [[1]]
## NULL
## 
## [[2]]
## [[2]]$good_json
## [1] TRUE
simdjson_parse('{"ints":[1,2,3]}')
## $ints
## $ints[[1]]
## [1] 1
## 
## $ints[[2]]
## [1] 2
## 
## $ints[[3]]
## [1] 3
is.integer(unlist(simdjson_parse('{"ints":[1,2,3]}')))
## [1] TRUE
simdjson_parse('{"big_int":1178007955838509057}')
## $big_int
## integer64
## [1] 1178007955838509057
simdjson_parse('{"big_int":2356015911677018114}', int64 = "string")
## $big_int
## [1] "2356015911677018114"
simdjson_parse('{"big_int":3534023867515527171}', int64 = "double")
## $big_int
## [1] 3.534024e+18
simdjson_parse(
  '{"big_ints":[{"a":1178007955838509057,"b":2356015911677018114,"c":[2356015911677018114,4712031823354036228]}]}',
  json_pointer = "big_ints/0/c/1"
)
## integer64
## [1] 4712031823354036228
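
The `json_pointer` lookup in the last example walks the parsed document one path segment at a time. The sketch below illustrates that idea in Python with the standard `json` module; `resolve_pointer` is a hypothetical helper, not part of simdjson or this gist. It accepts the un-prefixed path form used above as well as a leading `/` (RFC 6901 pointers start with `/`), and it does not handle the `~0` / `~1` escapes.

```python
import json

def resolve_pointer(doc, pointer):
    """Walk `doc` along a slash-separated path of object keys / array indices."""
    node = doc
    for token in pointer.strip("/").split("/"):
        if token == "":
            continue                      # empty pointer: return the whole document
        if isinstance(node, list):
            node = node[int(token)]       # array step: token is a zero-based index
        else:
            node = node[token]            # object step: token is a key
    return node

doc = json.loads('{"big_ints":[{"a":1178007955838509057,"b":2356015911677018114,'
                 '"c":[2356015911677018114,4712031823354036228]}]}')
print(resolve_pointer(doc, "big_ints/0/c/1"))   # 4712031823354036228
```

Only the requested element is returned here; the whole document is still parsed first.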

Benchmarking

tweet_json <- readr::read_lines("../tweetio/inst/example-data/ufc-tweet-stream.json")
test_json <- tweet_json[vapply(tweet_json, jsonlite::validate, logical(1L))]

length(test_json)
## [1] 100000
library(jsonlite)
# library(jsonify, warn.conflicts = FALSE)

bench::mark(
  simdjson = simdjson <- simdjson_parse(test_json),
  fairer_simdjson = fairer_simdjson <- lapply(test_json, simdjson_parse)
  # jsonify = jsonify <- lapply(test_json, from_json, simplify = FALSE)  # segfaults when knitting...?
  ,
  jsonlite = jsonlite <- lapply(test_json, parse_json)
  ,
  check = FALSE
)
## Warning: Some expressions had a GC in every iteration; so filtering is disabled.

## # A tibble: 3 x 6
##   expression           min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 simdjson           6.85s    6.85s    0.146      243MB   0.146 
## 2 fairer_simdjson   11.02s   11.02s    0.0907     474MB   0.0907
## 3 jsonlite          41.89s   41.89s    0.0239     234MB   0.0239

Comparing Output

simdjson[[200]]$entities$user_mentions[[1]][c("id", "id_str", "indices")]
## $id
## integer64
## [1] 3230043854
## 
## $id_str
## [1] "3230043854"
## 
## $indices
## $indices[[1]]
## [1] 3
## 
## $indices[[2]]
## [1] 14
# jsonify[[200]]$entities$user_mentions[[1]][c("id", "id_str", "indices")]
jsonlite[[200]]$entities$user_mentions[[1]][c("id", "id_str", "indices")]
## $id
## [1] 3230043854
## 
## $id_str
## [1] "3230043854"
## 
## $indices
## $indices[[1]]
## [1] 3
## 
## $indices[[2]]
## [1] 14

@dcooley
Collaborator Author

dcooley commented May 18, 2020

yeah I've been focussed on getting the correct R object for the given JSON, which includes the simplification processes. And I haven't so much concentrated on performance.


here are a few tests and examples. Currently some int64s are returned to R as numeric, so that needs to be handled, but most of the logic for returning the correct R structure is there.
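
To illustrate why int64 values coming back as numeric need handling: a double carries a 53-bit significand, so large 64-bit integers get rounded. The Python sketch below shows this (Python ints are arbitrary precision and `float` is an IEEE double). `classify_int` is a hypothetical helper that mirrors the range test in `is_really_int64_t()` from parse_json.cpp earlier in this thread, including its conservative bounds; it is not code from either fork.

```python
INT_MAX = 2**31 - 1
INT_MIN = -2**31

def classify_int(x):
    """Mirror of the range test in parse_json.cpp: does x need more than an R integer?"""
    return "int64" if (x > INT_MAX - 1 or x < INT_MIN + 1) else "int"

big = 1178007955838509057
as_double = float(big)                 # what "returned as numeric" amounts to

print(classify_int(3), classify_int(3230043854), classify_int(big))
print(int(as_double) == big)           # False: the low digits were rounded away
print(str(big))                        # the string route keeps every digit
```

This is the trade-off behind the `integer64`, `string` and `double` options: only the first two preserve the exact value.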

@dcooley
Collaborator Author

dcooley commented May 18, 2020

A quick* benchmark suggests there's some overhead I haven't accounted for, as this implementation is currently slower than jsonify

library(jsonify)
library(jsonlite)
library(RcppSimdJson)
library(microbenchmark)

js <- readLines('http://opendata.canterburymaps.govt.nz/datasets/fb00b553120b4f2fac49aa76bc8d82aa_26.geojson')
js <- paste0(js, collapse = "")

microbenchmark::microbenchmark(
  jsonify = { jfy <- jsonify::from_json( js ) },
  jsonlite = { jlt <- jsonlite::fromJSON( js ) },
  simdjson = { sim <- RcppSimdJson::from_json( js ) },
  times = 5
)

# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  jsonify  138.4742  139.7152  140.6192  139.7596  141.3846  143.7623     5
# jsonlite 1230.4436 1232.6133 1256.6139 1251.1161 1267.9963 1300.9003     5
# simdjson  201.8796  202.3721  203.7961  202.5413  204.1732  208.0143     5

* I haven't verified or tested the outputs.

@knapply
Collaborator

knapply commented May 19, 2020

@dcooley , try benchmarking a non-geojson file: https://github.com/simdjson/simdjson/blob/master/doc/performance.md#number-parsing

I'm trying to figure out how to get this to pass R CMD check with the latest simdjson (this one is still missing .size() for dom::objects), but it seems that it's going to require changes upstream (stderr and abort calls).

Now that RTools40 is considered stable, this should be viable for Windows R users. It's worth noting that a string_view implementation is now bundled (although it's unclear to me what the minimum GCC version is). I'm all for ditching C++11 from the start, but it's worth considering.

@eddelbuettel Are you opposed to RcppSimdJson housing the R/Rcpp interface while the simdjson library itself is self-contained (like how @dcooley approached rapidjsonr + jsonify)? There are pros and cons, but I've found this dependency-free header silo + R/Rcpp interface paradigm to be extremely useful.

@eddelbuettel
Owner

eddelbuettel commented May 19, 2020

  • stderr and abort: maybe talk to upstream, I seem to recall my last few updates were simple after I explained the issue, and it was my understanding that simdjson would not revert (some statements to the "no i/o in library" etc pp)

  • are you opposed: I am not sure what it is you are proposing. This package exists, and already provides the simdjson header via CRAN. I would suggest to keep it that way. @dcooley and I don't seem to have a problem adding functionality to the src/ and R/ directories of this package. So can you maybe reword your suggestion? Thx!

@eddelbuettel
Owner

Let me rephrase. The way I see it we have a few options:

  • current status: RcppSimdJson ships the header-only library, this already works, little to no added functionality
  • we add some functionality to take advantage of simdjson, this is still exploratory, likely 'real' work still uses the more feature-rich other json packages; no need to split packages
  • if ever we have so much performance and functionality here that it would make sense to split we still can

I don't have any other package where header and use are split. We could do that but I don't yet see a really compelling reason besides "well we can". But I may miss something. In any event we can revisit...

@knapply
Collaborator

knapply commented May 19, 2020

> maybe talk to upstream

That's the plan. I'd just like to have a solution ready first.

> We could do that but I don't yet see a really compelling reason besides "well we can". But I may miss something.

The reason is flexibility. The omnipresence of JSON extends to environments and systems with all kinds of requirements; some valid, some nonsensical.

> In any event we can revisit...

I was thinking that splitting now (while it has a minimal amount of users) would prevent disruptive headaches later. After more consideration, keeping things together is probably safer: simdjson itself is relatively young and it's clearly evolving... and I suppose copying the two amalgamated files still works in a pinch.

@eddelbuettel
Owner

Right. I still think keeping it as one is preferable, the whole may offer more. I missed that chance with CCTZ (wrapped as RcppCCTZ) and now the source lingers in three other packages for no benefit. It's somewhat suboptimal.

@lemire
Collaborator

lemire commented May 19, 2020

> although it's unclear to me what the minimum GCC version is

Please see

https://github.com/simdjson/simdjson/blob/master/doc/basics.md#requirements

> stderr and abort: maybe talk to upstream

There is no use of stderr or abort in the main library as far as we know. If there is, please report it as a bug.

@lemire
Collaborator

lemire commented May 19, 2020

Follow-up: abort and stderr did get back into the library. The problem is that we did not have tests. I have added such tests this time around so it should stay away.

@eddelbuettel
Owner

I said it last time, I say it again: really really appreciate that. Makes our downstream work a lot easier.

@lemire
Collaborator

lemire commented May 19, 2020

@eddelbuettel Yes. Removing offending code is easy. Tracking new commits and checking every line to make sure that we don't fall back is harder. This sort of work needs to be automated.

@knapply
Collaborator

knapply commented May 20, 2020

@lemire Thank you so much!

@knapply
Collaborator

knapply commented May 20, 2020

simdjson/simdjson#893 has not yet been merged, but I took the new amalgamation for a test drive in my fork.

@dcooley I pulled your changes and added is_really_int64_t() and resolve_int() here, then swapped all the integer getters for resolve_int().

I think a portion of the overhead you're seeing is coming from redundant checks. Specifically, if types are confirmed with simdjson::dom::element::is() or dom::element::type(), the value can be safely extracted with T(dom::element), T = dom::element, or even Rcpp::wrap<T>(dom::element). Extracting elements with dom::element::get() and dom::array::at() is going to slow things down because they return simdjson_results and can throw exceptions. I certainly have some more exploring to do though.

After making a few other small modifications, RcppSimdJson builds on Linux, Mac, and Windows w/ 100% passing on the tests @dcooley referenced at #10 (comment) via GitHub Actions. The only R CMD check warnings are coming from the undocumented exports.

@eddelbuettel , if you want to keep CI to only Travis + Docker, please say the word.

@eddelbuettel
Owner

eddelbuettel commented May 20, 2020

> if you want to keep CI to only Travis + Docker, please say the word.

"word"

I looked into the alternatives, and remain content with Travis CI.

@lemire
Collaborator

lemire commented May 20, 2020

Note that simdjson can be used with or without exceptions. We have two distinct "sub-APIs" depending on the mode you are using. It is possible to control this with macros and it depends in part on how you compile the library (with or without exceptions). So you definitely do not have to deal with exceptions if you do not want to. It is usually the case that relying on exceptions comes with a performance overhead.

@dcooley
Collaborator Author

dcooley commented May 20, 2020

@knapply

> I pulled your changes and added is_really_int64_t() and resolve_int() here, then swapped all the integer getters for resolve_int().

Do you want to make a PR with these changes included, as well as anything else you've got covered in your other PR?

@dcooley
Collaborator Author

dcooley commented May 20, 2020

> try benchmarking a non-geojson file:

Yeah there's definitely something wrong with the way I'm using simdjson given these results

library(jsonify)
library(jsonlite)
library(RcppSimdJson)
library(microbenchmark)

n <- 1e5
df <- data.frame(
  x = 1:n
  , y = sample( letters, size = n, replace = TRUE)
)

js <- jsonify::to_json(df)

microbenchmark::microbenchmark(
  jsonify = { jfy <- jsonify::from_json( js ) },
  jsonlite = { jlt <- jsonlite::fromJSON( js ) },
  simdjson = { sim <- RcppSimdJson::from_json( js ) },
  times = 5
)

# Unit: milliseconds
#     expr        min         lq       mean     median         uq        max neval
#  jsonify   247.4799   285.1659   335.8276   366.9796   373.1176   406.3950     5
# jsonlite   289.5658   292.8082   335.9969   300.2588   325.5345   471.8174     5
# simdjson 37025.7764 37443.0639 37941.6096 37634.0160 38023.0598 39582.1316     5

@lemire
Collaborator

lemire commented May 20, 2020

You are taking 38 seconds to parse about 2MB of data? That's just not possible.

It is a bit difficult to reason in milliseconds. It is easier if you break it down in, say, GB/s.

I am not an R user, so I tried to guess what the script would generate... and I implemented it in Python:

import random
import string

lower_upper_alphabet = string.ascii_letters

def randomletter():
    return random.choice(lower_upper_alphabet)

print("[", end="")
# records 1..99999, each followed by a comma
for i in range(1, 100000):
    print("{\"id\":" + str(i) + ",\"val\":\"" + randomletter() + "\"},", end="")
# closing record without a trailing comma; i is still 99999 here, so the last id repeats
print("{\"id\":" + str(i) + ",\"val\":\"" + randomletter() + "\"}", end="")
print("]", end="\n")

This generates a crazy file which I called crazy.json. Then I ran a benchmark over it...

$ ./benchmark/parsingcompetition ../crazy.json
simdjson (dynamic mem)                  	:    4.485 cycles per input byte (best)    4.957 cycles (avg)    0.760 GB/s (error margin: 0.072 GB/s)           332 documents/s (best)           300 documents/s (avg)
simdjson                                	:    4.486 cycles per input byte (best)    4.511 cycles (avg)    0.760 GB/s (error margin: 0.004 GB/s)           332 documents/s (best)           330 documents/s (avg)
RapidJSON                               	:   17.655 cycles per input byte (best)   17.766 cycles (avg)    0.193 GB/s (error margin: 0.001 GB/s)            84 documents/s (best)            84 documents/s (avg)
RapidJSON (accurate number parsing)     	:   19.179 cycles per input byte (best)   19.202 cycles (avg)    0.178 GB/s (error margin: 0.000 GB/s)            78 documents/s (best)            78 documents/s (avg)
RapidJSON (insitu)                      	:   15.888 cycles per input byte (best)   15.945 cycles (avg)    0.214 GB/s (error margin: 0.001 GB/s)            94 documents/s (best)            93 documents/s (avg)
RapidJSON (insitu, accurate number parsing)	:   17.798 cycles per input byte (best)   17.827 cycles (avg)    0.191 GB/s (error margin: 0.000 GB/s)            84 documents/s (best)            84 documents/s (avg)

So we achieve ~0.75 GB/s which is very low for simdjson, but it is a somewhat adversarial (synthetic example) case.

Ok. So let us turn this into milliseconds. My file spans 2288896 bytes, so we have about 0.2% of a GB. We need to divide this by 0.75 GB/s to get the time in seconds, and then multiply by 1000 to get the number of milliseconds: 2288896/1000000000./0.75 * 1000, which is about 3 milliseconds.

So I would expect simdjson to take about 3 milliseconds to parse this file. Of course, there may be overhead that I am not aware of...

But there is no possible way that it goes up to 38 seconds.
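
The back-of-the-envelope conversion above can be written out as a short Python sketch. The helper names `expected_ms` and `observed_gb_per_s` are made up for illustration, 1 GB is taken as 1e9 bytes as in the arithmetic above, and the 38 s figure is applied to the size of crazy.json on the assumption that the R-generated input is of similar size.

```python
def expected_ms(n_bytes, gb_per_s):
    """Time to scan n_bytes at a given throughput, in milliseconds (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9 / gb_per_s * 1000

def observed_gb_per_s(n_bytes, seconds):
    """Effective throughput implied by a measured wall-clock time."""
    return n_bytes / 1e9 / seconds

size = 2288896                              # bytes in crazy.json, from the comment above
print(round(expected_ms(size, 0.75), 2))    # about 3 ms at 0.75 GB/s
print(observed_gb_per_s(size, 38))          # the 38 s run: roughly 6e-5 GB/s
```

Expressed as throughput, the 38-second measurement is roughly four orders of magnitude below the benchmark figure.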

@lemire
Collaborator

lemire commented May 20, 2020

> try benchmarking a non-geojson file

With canada.json (a geojson file), which is one of our standard test files, we get better than 0.8 GB/s on a 3.4 GHz Skylake processor. A 3.4 GHz Skylake is really quite ordinary at this point.

I don't think it is possible to build an input JSON such that it would take 38 seconds to parse 3 MB... I would argue that no non-broken JSON parser could possibly be that slow.

@eddelbuettel
Owner

Let's keep it apples to apples. It is no longer parse speed alone.

@dcooley is trying to build a data structure to return to R, and we typically have a few constraints on the way (having a limited set of types is one). So there will be copies, and in phase one there may be extra copies. Such is life. I trust Dave, who has put together amazing stuff (off JSON input) for the mapdeck viz. Let's not quite shoot with real bullets yet.

@knapply
Collaborator

knapply commented May 23, 2020

@dcooley , I think I have a reasonable workflow for the integer stuff that won't be regretted (too badly) later that I brought up in #13. I just want to confirm that's the desired direction before it invades any code.

I'm still trying to grok what exactly the simplify rules are, but there will definitely be a speed jump from separating the simplify and "vanilla" routines.

.parse_json() is a "refined" version of what I discussed earlier in the thread.

It's meant to be a clone of jsonlite::parse_json() (I haven't actually looked at its internals), with its default arguments (so no simplify), so I'm sure having fewer branches helps it zip along. I'm kinda amazed that, so far, its results are identical() to jsonlite::parse_json()'s.

Of course, this is the "easy" part; you're tackling a lot more with from_json(simplify = TRUE)...

js <- paste0(readLines("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json"), 
             collapse = "")
pryr::object_size(js)
#> 189 MB

microbenchmark::microbenchmark(
  jsonlite = jsonlite::parse_json(js),
  simdjson = RcppSimdJson:::.parse_json(js)
  ,
  times = 1,
  check = "identical"
)
#> Unit: seconds
#>      expr      min       lq     mean   median       uq      max neval
#>  jsonlite 4.402949 4.402949 4.402949 4.402949 4.402949 4.402949     1
#>  simdjson 1.212326 1.212326 1.212326 1.212326 1.212326 1.212326     1

rcppsimdjson_dir <- "~/Documents/rcppsimdjson/inst/jsonexamples/"
json_file_paths <- dir(rcppsimdjson_dir, pattern = "\\.json$", full.names = TRUE)
names(json_file_paths) <- dir(rcppsimdjson_dir, pattern = "\\.json$")

jsons <- vapply(
  json_file_paths,
  function(.x) paste0(readLines(.x, warn = FALSE), collapse = ""),
  character(1L)
)

bench_marks <- mapply(
  function(.x, .y) {
    res <- microbenchmark::microbenchmark(
      jsonlite = jsonlite::parse_json(.x),
      simdjson = RcppSimdJson:::.parse_json(.x)
      ,
      times = 10,
      unit = "ms",
      check = "identical"
    )
    cat("********************** ", .y, "\n")
    print(res, order = "median")
    cat("\n\n")
    cbind(data.frame(file_name = .y), as.data.frame(res))
  },
  jsons,
  names(jsons),
  SIMPLIFY = FALSE
)
#> **********************  apache_builds.json 
#> Unit: milliseconds
#>      expr      min       lq      mean    median       uq      max neval
#>  simdjson 0.434170 0.451185 0.5107266 0.4735555 0.592085 0.608644    10
#>  jsonlite 1.142164 1.173881 1.3241093 1.2536620 1.310450 2.024043    10
#> 
#> 
#> **********************  canada.json 
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq       max neval
#>  simdjson  7.230551  7.456116  7.875385  7.849235  8.224713  8.706415    10
#>  jsonlite 42.288008 43.653252 44.711665 44.029844 46.164748 47.662902    10
#> 
#> 
#> **********************  citm_catalog.json 
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq       max neval
#>  simdjson  3.638028  3.704805  4.300136  3.886545  5.053951  5.899548    10
#>  jsonlite 22.725979 22.934923 23.873911 23.696750 24.524041 25.694399    10
#> 
#> 
#> **********************  github_events.json 
#> Unit: milliseconds
#>      expr      min       lq      mean    median       uq      max neval
#>  simdjson 0.195300 0.207428 0.2620517 0.2547240 0.280591 0.427534    10
#>  jsonlite 0.918447 0.960214 0.9794883 0.9879705 1.010506 1.017453    10
#> 
#> 
#> **********************  gsoc-2018.json 
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq       max neval
#>  simdjson  7.289759  7.891569  8.145152  7.990608  8.276493  9.496555    10
#>  jsonlite 13.240555 13.401006 14.031844 13.863335 14.754366 15.289081    10
#> 
#> 
#> **********************  instruments.json 
#> Unit: milliseconds
#>      expr      min       lq     mean    median       uq      max neval
#>  simdjson 0.587803 0.589676 0.694585 0.6076295 0.811805 0.892138    10
#>  jsonlite 2.040344 2.153086 2.228978 2.1947955 2.283114 2.598088    10
#> 
#> 
#> **********************  marine_ik.json 
#> Unit: milliseconds
#>      expr      min       lq     mean   median      uq      max neval
#>  simdjson 12.97164 13.33626 13.82794 13.56599 13.6018 16.91775    10
#>  jsonlite 72.88100 73.37122 76.34738 75.26905 77.1654 87.12118    10
#> 
#> 
#> **********************  mesh.json 
#> Unit: milliseconds
#>      expr       min        lq      mean    median       uq       max neval
#>  simdjson  2.305229  2.391161  2.618197  2.558874  2.82882  3.046341    10
#>  jsonlite 19.313140 19.674100 20.896420 21.152274 21.49891 23.122117    10
#> 
#> 
#> **********************  mesh.pretty.json 
#> Unit: milliseconds
#>      expr       min       lq      mean    median        uq      max neval
#>  simdjson  2.875675  2.90280  3.047431  3.013965  3.160781  3.32808    10
#>  jsonlite 19.739430 20.92576 25.905227 27.926010 29.914193 30.26666    10
#> 
#> 
#> **********************  numbers.json 
#> Unit: milliseconds
#>      expr      min       lq      mean    median       uq      max neval
#>  simdjson 0.339428 0.341582 0.3677705 0.3715365 0.377728 0.417071    10
#>  jsonlite 2.934479 2.941936 3.0214713 2.9529865 2.987781 3.581713    10
#> 
#> 
#> **********************  random.json 
#> Unit: milliseconds
#>      expr       min        lq      mean    median        uq       max neval
#>  simdjson  2.483639  2.506914  2.905289  2.568928  3.212158  4.380286    10
#>  jsonlite 10.048730 10.181996 11.190389 10.640860 12.015381 13.486832    10
#> 
#> 
#> **********************  twitter.json 
#> Unit: milliseconds
#>      expr      min       lq     mean   median        uq       max neval
#>  simdjson 1.767523 2.140532 2.225777 2.201807  2.332409  2.853772    10
#>  jsonlite 9.487721 9.667403 9.996055 9.958649 10.020133 10.824681    10
#> 
#> 
#> **********************  twitterescaped.json 
#> Unit: milliseconds
#>      expr      min       lq     mean   median       uq      max neval
#>  simdjson 1.890421 1.938308 2.100133 1.983586 2.315762 2.384929    10
#>  jsonlite 5.442077 5.511250 5.902488 5.638079 6.064283 7.470741    10
#> 
#> 
#> **********************  update-center.json 
#> Unit: milliseconds
#>      expr      min       lq      mean    median        uq       max neval
#>  simdjson 2.194598 2.274436  2.625021  2.625768  2.965949  3.070396    10
#>  jsonlite 9.733094 9.853983 10.574512 10.206547 10.945863 12.812568    10

df <- do.call(rbind, bench_marks)

@dcooley
Collaborator Author

dcooley commented May 24, 2020

I've done some tests on my from_json() and the bottleneck is coming from getting the data types inside each element. The jsonify version is here, and my RcppSimdJson version is here

(I've made this test small and quick 'cos I was getting annoyed waiting for it to run each time I did a test. But this result is representative of larger examples)

n <- 1e4L
df <- data.frame(x = 1L:n)
js <- jsonify::to_json( df )

microbenchmark::microbenchmark(
  jsonify = { res <- jsonify:::rcpp_get_dtypes( js ) },
  rcppsimd = { RcppSimdJson:::rcpp_get_dtypes( js ) }
)

# Unit: microseconds
#     expr       min        lq       mean    median        uq        max neval
#  jsonify   657.069   673.832   715.9397   697.081   730.814   1013.196   100
# rcppsimd 85469.820 86976.375 94386.8532 92468.832 98570.788 124948.499   100

@knapply FYI the get_dtypes() gets the data types of each element inside an object or an array, which I use to determine if the object can be simplified or not. In jsonify this has very little cost, so I thought I could simply bring it across to here. But these tests suggest it's not the correct approach.
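As a rough illustration of the idea (not the actual C++ in either package), the check can be sketched in base R on an already-parsed list. `get_dtypes_sketch()` and `can_simplify_sketch()` are hypothetical names, and R's `typeof()` stands in for the JSON element types:

```r
# Hypothetical helpers; typeof() stands in for JSON element types.
get_dtypes_sketch <- function(x) {
  # unique set of types found among the direct children of `x`
  unique(vapply(x, typeof, character(1L)))
}

can_simplify_sketch <- function(x) {
  dtypes <- get_dtypes_sketch(x)
  # exactly one scalar type: the children can be collapsed into a vector
  length(dtypes) == 1L && !dtypes %in% c("list", "NULL")
}

can_simplify_sketch(list(1, 2, 3))          # all doubles
can_simplify_sketch(list(1, "a", TRUE))     # mixed types
can_simplify_sketch(list(1, list(2, 3)))    # contains a nested array
```

Only the first call returns `TRUE`; the other two contain more than one type, so they would stay as lists.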

@knapply
Collaborator

knapply commented May 24, 2020

@dcooley Yea, that seems weird.

Have you tried it without using .get<SIMDJSON-TYPE>() (kinda like below) or not passing everything by reference?

w/o .get()
// [[Rcpp::plugins(cpp17)]]                                        
// [[Rcpp::depends(RcppSimdJson)]]

#include <simdjson.h>
#include <simdjson.cpp>

#include <Rcpp.h>

#include <iterator>       // std::size
#include <unordered_set>  // std::unordered_set

using namespace simdjson;
using Rcpp::_;

typedef std::unordered_set<dom::element_type> DTypes;

template <typename T>
bool is_homogeneous(T x) {
  DTypes dtypes;
  for (auto value : x) {
    dtypes.insert(value.type());
  }
  return std::size(dtypes) == 1;
}

template <>
bool is_homogeneous<dom::object>(dom::object x) {
  DTypes dtypes;
  for (auto [key, value] : x) {
    dtypes.insert(value.type());
  }
  return std::size(dtypes) == 1;
}


// [[Rcpp::export(.dtypes)]]
SEXP test() {
  auto cars_json = R"( [
  { "make": "Toyota", "model": "Camry",  "year": 2018,
    "tire_pressure": [ 40.1, 39.9 ] },
    { "make": "Kia",    "model": "Soul",   "year": 2012,
      "tire_pressure": [ 30.1, 31.0 ] },
      { "make": "Toyota", "model": "Tercel", "year": 1999,
        "tire_pressure": [ 29.8, 30.0 ] }
  ] )"_padded;
  
  dom::parser parser;
  dom::element cars = parser.parse(cars_json);
  
  return Rcpp::List::create(
    _["element"] = is_homogeneous<dom::element>(cars),
    _["array"] = is_homogeneous<dom::array>(cars),
    _["object"] = is_homogeneous<dom::object>(cars.at(0)),
    _["array"] = is_homogeneous<dom::array>(cars.at(0).at("tire_pressure"))
  );
}


/*** R
.dtypes()
##> $element
##> [1] TRUE
##> 
##> $array
##> [1] TRUE
##> 
##> $object
##> [1] FALSE
##> 
##> $array
##> [1] TRUE
*/

After walking through and playing with jsonify earlier, I'm a bit confused. Is there a reference you're using for the simplify routine? I'm getting different enough results when comparing to jsonlite's that I'm not confident I get the rules.

test <- '{"test":[1,[2,[3]]]}'
jsonlite::fromJSON(test)
#> $test
#> $test[[1]]
#> [1] 1
#> 
#> $test[[2]]
#> $test[[2]][[1]]
#> [1] 2
#> 
#> $test[[2]][[2]]
#> [1] 3
jsonify::from_json(test)
#> $test
#> $test[[1]]
#> [1] 1
#> 
#> $test[[2]]
#>      [,1]
#> [1,]    2
#> [2,]    3

To me, there's nothing to simplify here, so jsonlite is closer to what I'd expect (but still weird). That said, very little of the JSON data I deal with is numerical so I'm sure there's something I'm missing.

I also realized that things can be simplified down to matrices, which complicates the integer handling (possible integer64 matrices and beyond). Do you stop at matrices, or are 3D+ arrays possible?

@dcooley
Collaborator Author

dcooley commented May 24, 2020

Is there a reference you're using for the simplify routine

Not really a reference, but my rule is, round-trips have to work.

So if you simplified down to a matrix, you couldn't then get back to the [, [, [...]]] structure.

So I'm using "simplify" to mean, the simplest structure possible, without breaking the original JSON structure.
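To make the round-trip rule concrete, here is a minimal base R sketch. `to_json_sketch()` is a hypothetical, numbers-only serializer written just for this comparison (it is not jsonify's `to_json()`), and it assumes length-1 vectors are scalars while lists and longer vectors are arrays:

```r
# Hypothetical, numbers-only serializer used just to compare structures.
to_json_sketch <- function(x) {
  wrap <- function(parts) paste0("[", paste(parts, collapse = ","), "]")
  if (is.matrix(x)) {
    # a matrix becomes an array of row arrays
    return(wrap(vapply(seq_len(nrow(x)), function(i) wrap(x[i, ]), character(1L))))
  }
  if (is.list(x)) {
    return(wrap(vapply(x, to_json_sketch, character(1L))))
  }
  # assumption: length-1 vectors are scalars, longer vectors are arrays
  if (length(x) == 1L) as.character(x) else wrap(x)
}

original <- "[1,[2,[3]]]"

as_list   <- list(1, list(2, list(3)))             # unsimplified
as_matrix <- list(1, matrix(c(2, 3), ncol = 1L))   # inner part forced into a matrix

to_json_sketch(as_list) == original     # TRUE: structure preserved
to_json_sketch(as_matrix) == original   # FALSE: serializes as [1,[[2],[3]]]
```

Under these assumptions the matrix version cannot be turned back into the original nesting, which is the kind of simplification the rule is meant to exclude.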

But it looks like you've found an issue

@knapply
Collaborator

knapply commented May 24, 2020

Sorry! That wasn't what I intended. 🤦‍♂️😬

Edit: I'll move the example over there.

@knapply
Collaborator

knapply commented Jun 8, 2020

I have a deserialization routine that seems to check a lot of boxes (multiple levels of type-strictness and simplification): https://github.com/knapply/rcppsimdjson/tree/feature/deserialize

I haven't quite sorted out how to best handle nested data frames. The way jsonify and jsonlite go about it seems different enough that I need to reevaluate.

What I'm envisioning is being able to clone jsonify::from_json() and whatever else, but having the individual parts be sufficiently modular that alternatives and modifications don't require a total rebuild. Be that here, or in other packages that want to leverage the pre-built boilerplate.

I'm sure the code has issues (C++ has been a total uphill battle for me), but the results seem promising.

json1 <- readr::read_file(
  "~/Documents/rcppsimdjson/inst/jsonexamples/canada.json"
)
json2 <- readr::read_file(
  "~/Documents/rcppsimdjson/inst/jsonexamples/gsoc-2018.json"
)

microbenchmark::microbenchmark(
  rcppsimdjson1 = RcppSimdJson:::.deserialize_json(json1),
  jsonify1 = jsonify::from_json(json1),
  jsonlite = jsonlite::fromJSON(json1)
  ,
  times = 3
)

#> Unit: milliseconds
#>           expr        min         lq      mean     median         uq        max neval
#>  rcppsimdjson1   4.474563   5.140195   6.12159   5.805827   6.945103   8.084379     3
#>       jsonify1  45.232373  45.339811  47.12165  45.447250  48.066282  50.685314     3
#>       jsonlite 462.022187 467.628038 478.05169 473.233888 486.066435 498.898981     3

microbenchmark::microbenchmark(
  rcppsimdjson2 = RcppSimdJson:::.deserialize_json(json2),
  jsonify2 = jsonify::from_json(json2),
  jsonlite2 = jsonlite::fromJSON(json2)
  ,
  times = 3
)
#> Unit: milliseconds
#>           expr       min       lq      mean    median        uq       max neval
#>  rcppsimdjson2  9.390968 10.85924  12.44297  12.32752  13.96897  15.61043     3
#>       jsonify2 30.635981 30.80953  33.61642  30.98309  35.10664  39.23019     3
#>      jsonlite2 96.001989 98.44159 108.28356 100.88118 114.42435 127.96752     3

This is what it looks like in action ...

type_policy <- list(
  anything_goes = 0,
  ints_as_dbl = 1,
  strict = 2
)

int64_opt <- list(
  double = 0,
  string = 1,
  integer64 = 2
)

js <- '[[1,2,3],
        ["4","5",null],
        [1,2,3.3],
        [true,false,true],
        [10000000000,20000000000,30000000000]]'
RcppSimdJson:::.deserialize_json(js)
#>      [,1]          [,2]          [,3]         
#> [1,] "1"           "2"           "3"          
#> [2,] "4"           "5"           NA           
#> [3,] "1"           "2"           "3.30"       
#> [4,] "TRUE"        "FALSE"       "TRUE"       
#> [5,] "10000000000" "20000000000" "30000000000"
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$ints_as_dbl)
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] "4" "5" NA 
#> 
#> [[3]]
#> [1] 1.0 2.0 3.3
#> 
#> [[4]]
#> [1]  TRUE FALSE  TRUE
#> 
#> [[5]]
#> [1] 1e+10 2e+10 3e+10
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$strict)
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] "4" "5" NA 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] 1
#> 
#> [[3]][[2]]
#> [1] 2
#> 
#> [[3]][[3]]
#> [1] 3.3
#> 
#> 
#> [[4]]
#> [1]  TRUE FALSE  TRUE
#> 
#> [[5]]
#> [1] 1e+10 2e+10 3e+10
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$strict,
                                 int64_r_type = int64_opt$string)
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] "4" "5" NA 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] 1
#> 
#> [[3]][[2]]
#> [1] 2
#> 
#> [[3]][[3]]
#> [1] 3.3
#> 
#> 
#> [[4]]
#> [1]  TRUE FALSE  TRUE
#> 
#> [[5]]
#> [1] "10000000000" "20000000000" "30000000000"
RcppSimdJson:::.deserialize_json(js, type_policy = type_policy$strict,
                                 int64_r_type = int64_opt$integer64)
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] "4" "5" NA 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] 1
#> 
#> [[3]][[2]]
#> [1] 2
#> 
#> [[3]][[3]]
#> [1] 3.3
#> 
#> 
#> [[4]]
#> [1]  TRUE FALSE  TRUE
#> 
#> [[5]]
#> integer64
#> [1] 10000000000 20000000000 30000000000

RcppSimdJson:::.deserialize_json('[{"id":1,"val":"a"},{"id":2,"val":"b"}]')
#>   id val
#> 1  1   a
#> 2  2   b
RcppSimdJson:::.deserialize_json('[{"id":1,"val":"a"},{"id":2,"val":["b","c"]}]')
#>   id  val
#> 1  1    a
#> 2  2 b, c

RcppSimdJson:::.deserialize_json('[{"id":1,"val":"a"},{"id":2,"val":["b","c"]}]',
                                 json_pointer = '1/val/0')
#> [1] "b"
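The three `type_policy` levels in the output above can be approximated in base R for a single flat array. This is only a sketch of the decision rule as it appears from those results, not the package's C++ code: `simplify_scalars_sketch()` is a hypothetical name, the input is a list of scalars, and `null` handling is ignored.

```r
# Hypothetical sketch of the three type policies for one flat JSON array,
# given here as a list of scalars. Nulls are not handled.
simplify_scalars_sketch <- function(x,
                                    type_policy = c("anything_goes", "numbers", "strict")) {
  type_policy <- match.arg(type_policy)
  types <- unique(vapply(x, typeof, character(1L)))
  mergeable <- switch(
    type_policy,
    # always coerce to a common type
    anything_goes = TRUE,
    # identical types, or integers and doubles mixed together
    numbers = length(types) == 1L || all(types %in% c("integer", "double")),
    # identical types only
    strict = length(types) == 1L
  )
  if (mergeable) unlist(x) else x
}

nums  <- list(1L, 2L, 3.3)
mixed <- list(1L, "a", TRUE)

simplify_scalars_sketch(nums, "numbers")   # double vector 1.0 2.0 3.3
simplify_scalars_sketch(nums, "strict")    # stays a list
simplify_scalars_sketch(mixed)             # character vector "1" "a" "TRUE"
simplify_scalars_sketch(mixed, "numbers")  # stays a list
```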

... and these are the types of nested data frames that still need some thought...

x <- data.frame(driver = c("Bowser", "Peach"), occupation = c("Koopa", "Princess"))
x$vehicle <- data.frame(model = c("Piranha Prowler", "Royal Racer"))
x$vehicle$stats <- data.frame(speed = c(55, 56), weight = c(67, 24), drift = c(35, 32))
js <- jsonify::to_json(x)

str(jsonlite::fromJSON(js)) # identical() to jsonify
#> 'data.frame':    2 obs. of  3 variables:
#>  $ driver    : chr  "Bowser" "Peach"
#>  $ occupation: chr  "Koopa" "Princess"
#>  $ vehicle   :'data.frame':  2 obs. of  2 variables:
#>   ..$ model: chr  "Piranha Prowler" "Royal Racer"
#>   ..$ stats:'data.frame':    2 obs. of  3 variables:
#>   .. ..$ speed : num  55 56
#>   .. ..$ weight: num  67 24
#>   .. ..$ drift : num  35 32


str(RcppSimdJson:::.deserialize_json(js))
#> 'data.frame':    2 obs. of  3 variables:
#>  $ driver    : chr  "Bowser" "Peach"
#>  $ occupation: chr  "Koopa" "Princess"
#>  $ vehicle   :List of 2
#>   ..$ :List of 2
#>   .. ..$ model: chr "Piranha Prowler"
#>   .. ..$ stats:List of 3
#>   .. .. ..$ speed : num 55
#>   .. .. ..$ weight: num 67
#>   .. .. ..$ drift : num 35
#>   ..$ :List of 2
#>   .. ..$ model: chr "Royal Racer"
#>   .. ..$ stats:List of 3
#>   .. .. ..$ speed : num 56
#>   .. .. ..$ weight: num 24
#>   .. .. ..$ drift : num 32

As someone who came to R well after data.table and dplyr came about, the multi-column data frames that jsonify/jsonlite build are completely bizarre to me. There's also this...

Warning message:
In data.table::setDT(x) :
  Some columns are a multi-column type (such as a matrix column): [3]. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.

I'm not suggesting that only "enhanced" data frame users be considered, it's more that the power of RcppSimdJson is going to be in the ability to ingest yuge data sets, so being able to hand off to those packages with minimal fidgeting would be nice.

With that in mind, if there's a standard (or even legacy?) use-case for them, it'd be helpful to know what that is so we can consider what options best support that. If anyone has thoughts on that, I'd love to hear them.
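For reference, one way to hand such multi-column data frames to data.table or dplyr is to flatten them first. jsonlite ships a `flatten()` function for this; the base R sketch below shows the same idea with a hypothetical helper, `flatten_df_sketch()`, whose name and `sep` argument are assumptions made for illustration:

```r
# Hypothetical helper: recursively promote the columns of data-frame columns
# to top-level columns, joining the names with `sep`.
flatten_df_sketch <- function(df, sep = ".") {
  cols <- list()
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.data.frame(col)) {
      nested <- flatten_df_sketch(col, sep = sep)
      names(nested) <- paste(nm, names(nested), sep = sep)
      cols <- c(cols, as.list(nested))
    } else {
      cols[[nm]] <- col
    }
  }
  as.data.frame(cols, stringsAsFactors = FALSE)
}

x <- data.frame(driver = c("Bowser", "Peach"), occupation = c("Koopa", "Princess"))
x$vehicle <- data.frame(model = c("Piranha Prowler", "Royal Racer"))
x$vehicle$stats <- data.frame(speed = c(55, 56), weight = c(67, 24), drift = c(35, 32))

flat <- flatten_df_sketch(x)
names(flat)
```

The result has one ordinary column per leaf (`vehicle.model`, `vehicle.stats.speed`, and so on), so no multi-column warning should be triggered downstream.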

@eddelbuettel
Owner

eddelbuettel commented Jun 8, 2020

The timings are very enticing. And being able to deal with 'simple' structures (but at scale) has total merit. Think ndjson logs for example. Potentially yuge but not nested.

@knapply
Collaborator

knapply commented Jun 8, 2020

That's 99% my use-case, but sadly they're not always simple.

tweetio began as an exercise to figure out how to handle large, complicated json streams while staying in R (and fortunately found some practical uses, but it's sorely in need of an update now that I kinda know what I'm doing).

The cool thing about simdjson's "JSON pointer" capability is that it will minimize the need for the insanely tedious mapping to custom structures I had to do there.
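To show what a pointer such as '1/val/0' selects, here is a base R sketch that walks an already-parsed nested list. `resolve_pointer_sketch()` is a hypothetical helper, not part of simdjson or RcppSimdJson. It assumes all-digit tokens are 0-based array indices, treats every other token as an object key, and ignores escaping:

```r
# Hypothetical resolver for pointers written like '1/val/0'.
# Assumption: all-digit tokens are 0-based array indices, other tokens are
# object keys; escape sequences ("~0", "~1") are not handled.
resolve_pointer_sketch <- function(x, json_pointer) {
  if (!nzchar(json_pointer)) return(x)  # empty pointer: whole document
  for (token in strsplit(json_pointer, "/", fixed = TRUE)[[1L]]) {
    if (grepl("^[0-9]+$", token)) {
      x <- x[[as.integer(token) + 1L]]  # R is 1-based
    } else {
      x <- x[[token]]
    }
  }
  x
}

# parsed form of '[{"id":1,"val":"a"},{"id":2,"val":["b","c"]}]'
parsed <- list(
  list(id = 1, val = "a"),
  list(id = 2, val = list("b", "c"))
)
resolve_pointer_sketch(parsed, "1/val/0")
#> [1] "b"
```

The point of doing this inside the parser instead is that only the selected element needs to be converted to an R object.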

I'll pull the deserialize routine over here. It is not exactly simple (the type dynamism was... rough), but more sets of eyes may help.

@dcooley
Collaborator Author

dcooley commented Jun 8, 2020

I think as long as the underlying data relationships are maintained, then the R representation shouldn't really matter. So if there is a good way of representing nested JSON objects in a way suitable for data.table, tibble, whatever, etc, then I see no reason not to use those structures.

@knapply
Collaborator

knapply commented Jun 9, 2020

That's what I'm thinking as well, but I'm wondering if any folks rely on that structure. Just food for thought.

Here's a "fairer" benchmark from #17 using a bigger file.

# js <- readr::read_file("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json")
js <- readr::read_file("~/Documents/citylots.json")
bench::mark(
  rcppsimdjson = rcppsimdjson <-  RcppSimdJson:::.deserialize_json(js),
  jsonify = jsonify <-  jsonify:::rcpp_from_json(js, simplify = TRUE, fill_na = FALSE),
  jsonlite = jsonlite <-  jsonlite:::parse_and_simplify(js, simplifyVector = TRUE, simplifyDataFrame = TRUE, simplifyMatrix = TRUE)
  ,
  filter_gc = FALSE,
  check = FALSE
)
#> # A tibble: 3 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rcppsimdjson 895.88ms 895.88ms    1.12      58.6MB    0    
#> 2 jsonify         7.49s    7.49s    0.134    104.5MB    0.668
#> 3 jsonlite       37.85s   37.85s    0.0264     369MB    1.24

microbenchmark::microbenchmark(
  rcppsimdjson = rcppsimdjson <-  RcppSimdJson:::.deserialize_json(js),
  jsonify = jsonify <-  jsonify:::rcpp_from_json(js, simplify = TRUE, fill_na = FALSE),
  jsonlite = jsonlite <-  jsonlite:::parse_and_simplify(js, simplifyVector = TRUE, simplifyDataFrame = TRUE, simplifyMatrix = TRUE)
  ,
  times = 1
)
#> Unit: milliseconds
#>          expr       min        lq      mean    median        uq       max neval
#>  rcppsimdjson   887.617   887.617   887.617   887.617   887.617   887.617     1
#>       jsonify  6964.756  6964.756  6964.756  6964.756  6964.756  6964.756     1
#>      jsonlite 35507.670 35507.670 35507.670 35507.670 35507.670 35507.670     1

I'm not sure how accurate {bench}'s memory measurements actually are, but they seem to reflect my goal of diagnosing what R structures should look like upfront so they can be essentially treated as immutable once they're created and populated. Now that I think about it, that probably can (should?) be enforced.

It isn't surprising that the simplification process is the bottleneck, but it's much more than I expected. For comparison, this is what happens without any simplification.

bench::mark(
  rcppsimdjson = rcppsimdjson <-  RcppSimdJson:::.deserialize_json(js, simplify_to = 3),
  jsonify = jsonify <-  jsonify:::rcpp_from_json(js, simplify = FALSE, fill_na = FALSE),
  jsonlite = jsonlite <-  jsonlite:::parse_and_simplify(js, simplifyVector = FALSE, simplifyDataFrame = FALSE, simplifyMatrix = FALSE)
  ,
  filter_gc = FALSE,
  check = FALSE
)
#> # A tibble: 3 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rcppsimdjson    1.11s    1.11s     0.901    13.1MB    0    
#> 2 jsonify         3.27s    3.27s     0.306    13.1MB    0.612
#> 3 jsonlite        6.68s    6.68s     0.150    13.1MB    0.150

@knapply
Collaborator

knapply commented Jun 20, 2020

To move the conversation forward about what the user-facing API should look like, here's a prototype w/ some data.table-inspired functionality (text, url, file download, decompress, etc), but "vectorized" over multiple strings, URLs, and files.

Don't be shy. It's meant to instigate opinions, criticism, discussion etc. (and definitely has bugs)

Early prototype
.file_extension <- function(x, dot = TRUE, ignore_zip_ext = FALSE) {
  if (ignore_zip_ext) {
    base_name <- sub("\\.[bgx]z2?$", "", basename(x))
  } else {
    base_name <- basename(x)
  }
  captures <- regexpr("(?<!^|[.]|/)[.]([^.]+)$", base_name, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[captures > 0L] <- substring(base_name[captures > 0L], captures[captures > 0L])
  if (dot) out else substring(out, 2L)
}


.url_prefix <- function(x) {
  vapply(x, function(.x) {
    if (substring(.x, 1L, 8L) == "https://") {
      "https://"
    } else if ((prefix <- substring(.x, 1L, 7L)) %in% c("http://", "ftps://", "file://")) {
      prefix
    } else if (substring(.x, 1L, 6L) == "ftp://") {
      "ftp://"
    } else {
      NA_character_
    }
  }, character(1L), USE.NAMES = FALSE)
}


.diagnose_input <- function(x, diagnose_type = TRUE) {
  init <- list(
    input = x,
    url_prefix = .url_prefix(x),
    file_ext = .file_extension(x)
  )
  init$compressed <- init$file_ext %in% c(".gz", ".bz", ".bz2", ".xz")
  
  if (diagnose_type) {
    if (!anyNA(init$url_prefix)) {
      init$type <- "url"
    } else if (!anyNA(init$file_ext)) {
      init$type <- "file"
    } else {
      init$type <- "text"
    }
  }
  structure(init, class = "data.frame", row.names = seq_along(x))
}


fparse <- function(input = NULL,
                   json_pointer = "",
                   input_type = c("auto", "text", "file", "url"),
                   empty_array = NULL,
                   empty_object = NULL,
                   max_simplify_lvl = c("data_frame", "matrix", "vector", "none"),
                   type_policy = c("anything_goes", "numbers", "strict"),
                   int64_opt = c("double", "string", "integer64"),
                   verbose = FALSE,
                   temp_dir = tempdir(),
                   keep_temp_files = FALSE) {
  # validate arguments =========================================================
  # types ----------------------------------------------------------------------
  # check the length before `is.na()` so `||` never sees a vector of length > 1
  if (!is.character(json_pointer) || length(json_pointer) != 1L || is.na(json_pointer)) {
    stop("`json_pointer=` must be a single, non-`NA` `character`.")
  }
  if (!is.character(input)) {
    stop("`input=` must be a `character`.")
  }
  if (any(is.na(input)) || any(nchar(input) == 0L)) {
    stop("`input=` contains `NA`s or empty strings.")
  }
  if (!dir.exists(temp_dir)) {
    stop("`temp_dir=` does not exist.")
  }
  # prep options ===============================================================
  # max_simplify_lvl -----------------------------------------------------------
  if (!is.character(max_simplify_lvl) && !is.numeric(max_simplify_lvl)) {
    stop("`max_simplify_lvl` must be of type `character` or `numeric`.")
  }

  if (is.numeric(max_simplify_lvl)) {
    stopifnot(max_simplify_lvl %in% 0:3)
  } else { # (is.character(max_simplify_lvl)) {
    max_simplify_lvl <- switch(
      match.arg(max_simplify_lvl, c("data_frame", "matrix", "vector", "none")),
      data_frame = 0L,
      matrix = 1L,
      vector = 2L,
      none = 3L,
      stop("Unknown `max_simplify_lvl` argument.")
    )
  }
  # type_policy ----------------------------------------------------------------
  if (!is.character(type_policy) && !is.numeric(type_policy)) {
    stop("`type_policy` must be of type `character` or `numeric`.")
  }

  if (is.numeric(type_policy)) {
    stopifnot(type_policy %in% 0:2)
  } else { # if (is.character(type_policy)) {
    type_policy <- switch(
      match.arg(type_policy, c("anything_goes", "numbers", "strict")),
      anything_goes = 0L,
      numbers = 1L,
      strict = 2L,
      stop("Unknown `type_policy` argument.")
    )
  }
  # int64_opt ------------------------------------------------------------------
  if (!is.character(int64_opt) && !is.numeric(int64_opt)) {
    stop("`int64_opt` must be of type `character` or `numeric`.")
  }

  if (is.numeric(int64_opt)) {
    stopifnot(int64_opt %in% 0:2)
  } else { # if (is.character(int64_opt)) {
    int64_opt <- switch(
      match.arg(int64_opt, c("double", "string", "integer64")),
      double = 0L,
      string = 1L,
      integer64 = 2L,
      stop("Unknown `int64_opt` argument.")
    )
  }

  if (int64_opt == 2L && !requireNamespace("bit64", quietly = TRUE)) {
    stop('`int64_opt = "integer64"`, but the {bit64} package is not installed.')
  }
  # diagnose input_type ========================================================
  input_type <- match.arg(input_type, c("auto", "text", "file", "url"))
  seq_input <- seq_along(input)
  # auto -----------------------------------------------------------------------
  if (input_type == "auto") {
    if (any(substring(input, 1L, 1L) %in% c(" ", "{", "[", '"')) || any(substring(input, 1L, 4L) == "null")) {
      input_type <- "text"
    } else {
      diagnosis <- .diagnose_input(input)
      input_type <- unique(diagnosis$type)
      if (length(input_type) != 1L) {
        stop("`input` should all be of the same `input_type`. Types detected:",
             sprintf("\n\t- %s", input_type))
      }
    }
  } else if (input_type != "text") {
    # explicit "file" or "url": the branches below still need `diagnosis`
    diagnosis <- .diagnose_input(input, diagnose_type = FALSE)
    diagnosis$type <- input_type
  }
  # url ------------------------------------------------------------------------
  if (input_type == "url") {
    for (i in seq_input) {
      temp_file <- tempfile(fileext = diagnosis$file_ext[[i]], tmpdir = temp_dir)

      switch(
        diagnosis$url_prefix[[i]],
        "https://" = ,
        "ftps://" = ,
        "http://" = ,
        "ftp://" = download.file(diagnosis$input[[i]], destfile = temp_file, method = getOption("download.file.method", default = "auto"), quiet = !verbose),
        "file://" = download.file(diagnosis$input[[i]], destfile = temp_file, method = "internal", quiet = !verbose),
        stop("Unknown URL prefix")
      )

      diagnosis$input[[i]] <- temp_file
      diagnosis$type[[i]] <- "file"
    }

    input_type <- unique(diagnosis$type)
    stopifnot(length(input_type) == 1L)
    if (!keep_temp_files) {
      on.exit(unlink(diagnosis$input), add = TRUE)
    }
  }
  # file -----------------------------------------------------------------------
  input_decompressed <- FALSE
  if (input_type == "file") {
    if (any(diagnosis$compressed)) { # temporary... this can be done w/o materializing R strings in C++ for at least .gz, and Suggests to support others (?)
      .input <- vector("character", length = length(input))
      input_decompressed <- TRUE

      if (verbose) message("Compressed files found. Decompressing...")
      for (i in seq_input) {
        if (diagnosis$compressed[[i]]) {
          decomp_type <- switch(
            diagnosis$file_ext[[i]],
            ".gz" = "gzip",
            ".bz" = ,
            ".bz2" = "bzip2",
            ".xz" = "xz"
            ,
            "unknown"
          )

          con <- file(diagnosis$input[[i]], open = "rb")
          raw_vec <- readBin(con, what = "raw", n = file.size(diagnosis$input[[i]]))
          close(con)
          .input[[i]] <- memDecompress(raw_vec, type = decomp_type, asChar = TRUE)
        }
      }
      input_type <- "text"
    } else {
      diagnosis$input <- Sys.glob(diagnosis$input)
    }
  }
  # set names ==================================================================
  if (input_type != "text") {
    .input <- diagnosis$input
  }
  if (input_type != "text" || input_decompressed) {
    if (length(names(input))) {
      names(.input) <- names(input)
    } else {
      names(.input) <- basename(input)
    } 
  }
  # deserialize ================================================================
  switch(
    input_type,
    "text" = RcppSimdJson:::.deserialize_json(
      json = if (input_decompressed) .input else input,
      json_pointer = json_pointer,
      empty_array = empty_array,
      empty_object = empty_object,
      simplify_to = max_simplify_lvl,
      type_policy = type_policy,
      int64_r_type = int64_opt
    ),
    "file" = RcppSimdJson:::.load_json(
      file_path = .input,
      json_pointer = json_pointer,
      empty_array = empty_array,
      empty_object = empty_object,
      simplify_to = max_simplify_lvl,
      type_policy = type_policy,
      int64_r_type = int64_opt
    )
    ,
    stop("Unknown `input_type`.")
  )
}
files <- dir("~/Documents/rcppsimdjson/inst/jsonexamples/", pattern = "\\.json$", full.names = TRUE, recursive = TRUE)

urls <- c(
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/apache_builds.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/mesh.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/citm_catalog.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/canada.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/twitter.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/github_events.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/gsoc-2018.json"
)

gz_files <- sapply(
  files[1:10], 
  function(.x) {
    R.utils::compressFile(
      .x, remove = FALSE, FUN = gzfile, ext = "gz",
      destname = sprintf("%s/%s%s", tempdir(), basename(.x), ".gz")
    )
  }, USE.NAMES = FALSE
)

json_text <- c("[1,2,3]",'[4,5,6]')
fparse(json_text)
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] 4 5 6

parsed_files <- fparse(files)
names(parsed_files)
#>  [1] "apache_builds.json"                   
#>  [2] "canada.json"                          
#>  [3] "citm_catalog.json"                    
#>  [4] "github_events.json"                   
#>  [5] "gsoc-2018.json"                       
#>  [6] "instruments.json"                     
#>  [7] "marine_ik.json"                       
#>  [8] "mesh.json"                            
#>  [9] "mesh.pretty.json"                     
#> [10] "numbers.json"                         
#> [11] "random.json"                          
#> [12] "adversarial.json"                     
#> [13] "demo.json"                            
#> [14] "flatadversarial.json"                 
#> [15] "che-1.geo.json"                       
#> [16] "che-2.geo.json"                       
#> [17] "che-3.geo.json"                       
#> [18] "google_maps_api_compact_response.json"
#> [19] "google_maps_api_response.json"        
#> [20] "twitter_api_compact_response.json"    
#> [21] "twitter_api_response.json"            
#> [22] "repeat.json"                          
#> [23] "smalldemo.json"                       
#> [24] "truenull.json"                        
#> [25] "twitter_timeline.json"                
#> [26] "twitter.json"                         
#> [27] "twitterescaped.json"                  
#> [28] "update-center.json"

download_and_parse_files <- fparse(urls)
names(download_and_parse_files)
#> [1] "apache_builds.json" "mesh.json"          "citm_catalog.json" 
#> [4] "canada.json"        "twitter.json"       "github_events.json"
#> [7] "gsoc-2018.json"

inflate_and_parse <- fparse(gz_files)
names(inflate_and_parse)
#>  [1] "apache_builds.json.gz" "canada.json.gz"        "citm_catalog.json.gz" 
#>  [4] "github_events.json.gz" "gsoc-2018.json.gz"     "instruments.json.gz"  
#>  [7] "marine_ik.json.gz"     "mesh.json.gz"          "mesh.pretty.json.gz"  
#> [10] "numbers.json.gz"
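The "auto" detection step in the prototype boils down to classifying each input string. A simplified sketch of that idea is below. `guess_input_type_sketch()` is a hypothetical helper that uses regular expressions instead of the prototype's substring checks, and the example URL and file path are placeholders:

```r
# Hypothetical, simplified classifier for a single input string.
guess_input_type_sketch <- function(x) {
  if (grepl("^(https?|ftps?|file)://", x, perl = TRUE)) {
    "url"
  } else if (grepl("^\\s*([\\[{\"]|null)", x, perl = TRUE)) {
    "text"   # looks like the start of a JSON value
  } else {
    "file"   # anything else is treated as a path
  }
}

inputs <- c(
  "https://example.com/data.json",  # placeholder URL
  "[1,2,3]",
  "logs/events.json.gz"             # placeholder path
)
vapply(inputs, guess_input_type_sketch, character(1L), USE.NAMES = FALSE)
#> [1] "url"  "text" "file"
```

Like the prototype, this treats bare JSON scalars such as `1` or `true` as file paths, so an explicit `input_type` remains useful for those cases.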

@eddelbuettel
Owner

I think we can close this now that 0.1.0 is out. Please re-open with details if something is still amiss.
