overhaul deserialize.cpp for extremely flexible queries, generalize decompression #45
|
@eddelbuettel I (finally) got back to this. I think the only point of concern you might have is regarding the parameters. Originally, we had After approaching this a dozen different ways, I landed on separating parsing errors (where the JSON is simply invalid) and query errors (where a query doesn't return anything, but there's nothing wrong with the JSON itself). With that in mind, I split them into Other than that, there's some cool query functionality and we can handle .gz, .bz2, and .xz compressed files. I'd like to reinforce the tests and re-walk through (and yes, hit the ChangeLog and such), but almost there. |
|
I think we're not at a stage where we have to worry about breaking params. We are still evolving quite a bit. Will try to take a look later or tomorrow but still have another burning issue to take care of myself. No rush. And still really lovely to see you chipping away at it and building something awesome. |
Merge branch 'experimental/pointer' of https://github.com/eddelbuettel/rcppsimdjson into experimental/pointer
# Conflicts:
# inst/include/RcppSimdJson/deserialize.hpp
# inst/include/RcppSimdJson/utils.hpp
# inst/tinytest/test_fparse_fload.R
|
Naturally, I forgot NEWS/ChangeLog..... will fix ASAP |
|
Ooof. That is a big one. Not pretending I looked line by line -- but it looks like pretty amazing and extensive work, once again. I guess next is merge and platform testing... |
|
I'm working on the ChangeLog and decided I need some better notes myself (which uncovered some refinements that I'm now fixing). These are the "big-picture" ideas:
library(RcppSimdJson)
Better Queries
We can still pass a single
json_to_query <- c(json1 = '["a",{"b":{"c": [[1,2,3],[4,5,6]]}}]',
                   json2 = '["a",{"b":{"c": [[7,8,9],[10,11,12]],"d":[[13,14,15,16],[17,18,19,20]]}}]')
# ^^^ json1 doesn't have "d"
fparse(json_to_query, query = "1/b/c")
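To make the path syntax concrete: judging from the examples here, a query such as "1/b/c" reads as slash-separated steps, with numeric steps indexing arrays from zero. Below is a minimal base-R sketch of that reading, walking an already-parsed nested list. `walk_path()` is a hypothetical helper written purely for illustration; it is not part of RcppSimdJson and says nothing about how the package resolves queries internally.

```r
# Hypothetical helper (not an RcppSimdJson function): follow a
# slash-delimited path through a nested list, returning NULL when a
# step cannot be resolved.
walk_path <- function(x, path) {
  tokens <- strsplit(path, "/", fixed = TRUE)[[1]]
  for (token in tokens) {
    if (is.null(names(x)) && grepl("^[0-9]+$", token)) {
      # unnamed container + numeric token: treat as a zero-based array index
      idx <- as.integer(token) + 1L
      if (idx > length(x)) return(NULL)
      x <- x[[idx]]
    } else {
      # otherwise treat the token as an object key
      if (!token %in% names(x)) return(NULL)
      x <- x[[token]]
    }
  }
  x
}

# hand-built R equivalent of json1: '["a",{"b":{"c": [[1,2,3],[4,5,6]]}}]'
doc <- list("a", list(b = list(c = list(c(1, 2, 3), c(4, 5, 6)))))

walk_path(doc, "1/b/c/0")  # first inner array under "c"
walk_path(doc, "1/b/d")    # NULL, since json1 has no "d"
```

Returning NULL for an unresolvable step is only a stand-in here for a query that finds nothing in otherwise valid JSON.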
But now we can also pass multiple “flat” queries (a named or unnamed character vector). Each element of This is the preferred method if each
fparse(json_to_query, query = c(query1 = "1/b/c",
                                query2 = "1/b/c/0",
                                query3 = "1/b/c/1"))
When we want to extract different data from each
fparse(json_to_query,
       query = list(queries1 = c(c1 = "1/b/c/0",
                                 c2 = "1/b/c/1"),
                    queries2 = c(d1 = "1/b/d/0",
                                 d2 = "1/b/d/1")))
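The difference between the flat and nested forms can be sketched in base R. This assumes (my reading of the examples, not confirmed here) that a character vector of queries is applied to every JSON input, while a list of query vectors is paired element-wise with the inputs. `resolve()` and `apply_queries()` are hypothetical names, and the documents are hand-built named lists rather than parsed JSON.

```r
# Hypothetical key lookup over nested named lists ("b/c" -> doc$b$c);
# returns NULL when a key is missing.
resolve <- function(doc, path) {
  for (key in strsplit(path, "/", fixed = TRUE)[[1]]) {
    if (!is.list(doc) || !key %in% names(doc)) return(NULL)
    doc <- doc[[key]]
  }
  doc
}

# Hypothetical dispatcher illustrating the two query shapes.
apply_queries <- function(docs, query) {
  if (is.character(query)) {
    # flat: every query runs against every document
    lapply(docs, function(doc) lapply(query, function(q) resolve(doc, q)))
  } else if (is.list(query)) {
    # nested: the i-th set of queries runs only against the i-th document
    stopifnot(length(query) == length(docs))
    Map(function(doc, qs) lapply(qs, function(q) resolve(doc, q)), docs, query)
  } else {
    stop("`query` must be a character vector or a list of character vectors")
  }
}

docs <- list(
  doc1 = list(b = list(c = 1:3)),           # no "d", like json1
  doc2 = list(b = list(c = 4:6, d = 7:9))
)

flat   <- apply_queries(docs, c(q1 = "b/c", q2 = "b/d"))
nested <- apply_queries(docs, list(c(c1 = "b/c"), c(d1 = "b/d")))
```

With the flat form, `doc1` is asked for "b/d" even though it has none; the nested form avoids that by only asking each document for what it is expected to contain.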
Compressed Files
We now handle .gz, .bz2, and .xz files that are decompressed to a raw vector (via
.read_compress_write_load <- function(file_path, temp_dir) {
  types <- c("gzip", "bzip2", "xz")
  exts <- c("gz", "bz2", "xz")
  init <- readBin(file_path, n = file.size(file_path), what = "raw")
  mapply(function(type, ext) {
    target_path <- paste0(temp_dir, "/", basename(file_path), ".", ext)
    writeBin(memCompress(init, type = type), target_path)
    RcppSimdJson::fload(target_path)
  }, types, exts, SIMPLIFY = FALSE)
}
my_temp_dir <- sprintf("%s/rcppsimdjson-compressed-files", tempdir())
dir.create(my_temp_dir)
all_files <- dir(
  system.file("jsonexamples", package = "RcppSimdJson"),
  recursive = TRUE,
  pattern = "\\.json$",
  full.names = TRUE
)
names(all_files) <- basename(all_files)
res <- t(sapply(all_files, .read_compress_write_load, my_temp_dir))
unlink(my_temp_dir, recursive = TRUE)  # recursive = TRUE is required to delete a directory
stopifnot(all(apply(
  res, 1L,
  function(.x) identical(.x[[1]], .x[[2]]) &&
    identical(.x[[1]], .x[[3]])
)))
res
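For readers without the package at hand, the decompress-to-raw step can be sketched with base R alone. This is only an illustration under the assumption that the compression type is chosen from the file extension and that base R's `memDecompress()` is an adequate stand-in; the mechanism the PR actually uses is not shown above. `ext_to_type` is a hypothetical lookup table.

```r
json_text <- '{"a":[1,2,3],"b":{"c":"text"}}'
original  <- charToRaw(json_text)

# hypothetical mapping: file extension -> memCompress()/memDecompress() type
ext_to_type <- c(gz = "gzip", bz2 = "bzip2", xz = "xz")

round_trip <- vapply(names(ext_to_type), function(ext) {
  path <- tempfile(fileext = paste0(".json.", ext))
  on.exit(unlink(path))

  # write a compressed copy of the JSON bytes to disk
  writeBin(memCompress(original, type = ext_to_type[[ext]]), path)

  # read the file back as raw bytes, pick the type from the extension,
  # and decompress to a raw vector
  compressed <- readBin(path, what = "raw", n = file.size(path))
  type       <- ext_to_type[[tools::file_ext(path)]]
  restored   <- memDecompress(compressed, type = type)

  identical(restored, original)
}, logical(1))

round_trip
```

Each element of `round_trip` reports whether the decompressed bytes match the original JSON bytes for that format.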
Smarter URL Handling
With compressed files supported, we can better leverage the Additionally, remote JSON files are now downloaded simultaneously
json_urls <- c(
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/smalldemo.json",
  "https://raw.githubusercontent.com/eddelbuettel/rcppsimdjson/master/inst/jsonexamples/small/demo.json"
)
my_temp_dir <- sprintf("%s/rcppsimdjson-downloads", tempdir())
dir.create(my_temp_dir)
fload(json_urls,
      query = list(c(width = "Thumbnail/Width",
                     height = "Thumbnail/Height"),
                   c(width = "Image/Thumbnail/Width",
                     height = "Image/Thumbnail/Height")),
      temp_dir = my_temp_dir,
      keep_temp_files = TRUE,
      compressed_download = TRUE)
list.files(my_temp_dir)
Lurking Windows Trap Fixed
Windows was mangling non-ASCII UTF-8. The issue and fix are essentially the same as SymbolixAU/jsonify#57, and it’s now tested (rather, a test is present) using a mix of 1-4 byte characters.
extended_unicode <- '"լ ⿕ ٷ 豈 ٸ 㐀 ٹ 丂 Ɗ 一 á ٵ ̝ ѵ ̇ ˥ ɳ Ѡ · վ й ף ޑ ц Ґ ӎ Љ ß ϧ ͎ ƽ ޜ է ϖ y Î վ Ο Ӊ ٻ ʡ ө ȭ ˅ ޠ ɧ ɻ ث ́ ܇ ܧ ɽ Ո 戸 Ð 坮 ٳ 䔢 찅 곂 묨 ß ᇂ ƻ 䏐 ܄ 㿕 ս ّ 昩 僫 똠 Ɯ ٰ É"'
fparse(extended_unicode)
fparse(charToRaw(extended_unicode))
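As background on why a 1-4 byte mix is a useful test: UTF-8 encodes each code point in one to four bytes, so a string covering all four widths exercises every branch of the encoding. The base-R sketch below checks the byte width of four sample characters and that their bytes survive a raw round trip once the result is marked as UTF-8. The sample characters are my own picks for illustration (the four-byte one, U+10348, does not come from the test string above).

```r
# one sample per UTF-8 width; \u/\U escapes keep this file ASCII-only
samples <- c(
  one   = "y",           # U+0079
  two   = "\u00e1",      # U+00E1
  three = "\u4e00",      # U+4E00
  four  = "\U00010348"   # U+10348
)

# number of bytes each character occupies when encoded as UTF-8
byte_widths <- vapply(
  samples,
  function(ch) length(charToRaw(enc2utf8(ch))),
  integer(1)
)

# bytes -> string loses the encoding mark, so it is restored explicitly;
# skipping that step is one way non-ASCII text can end up mangled on
# platforms whose native encoding is not UTF-8
round_trip <- vapply(samples, function(ch) {
  restored <- rawToChar(charToRaw(enc2utf8(ch)))
  Encoding(restored) <- "UTF-8"
  identical(restored, enc2utf8(ch))
}, logical(1))

byte_widths
round_trip
```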
|
commit 342b08c19ecf2bea802be2426665222142c73e9f
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sat Aug 8 17:12:35 2020 -0700
more clean up, update ChangeLog/NEWS, add notes
commit 62d74ff
Author: Brendan <brendan.g.knapp@gmail.com>
Date: Fri Aug 7 12:36:58 2020 -0700
add more encoding tests, verify on windows system
commit 89f5bf8
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Fri Aug 7 07:45:20 2020 -0700
more clean up
commit c31c15d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Aug 6 17:46:31 2020 -0700
confirmed windows string mangling... checking likely solution
commit 6e41d18
Merge: a16908c 4e8337d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Aug 6 16:41:57 2020 -0700
fix merge
Merge branch 'experimental/pointer' of https://github.com/eddelbuettel/rcppsimdjson into experimental/pointer
# Conflicts:
# inst/include/RcppSimdJson/deserialize.hpp
# inst/include/RcppSimdJson/utils.hpp
# inst/tinytest/test_fparse_fload.R
commit a16908c
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Aug 6 16:30:22 2020 -0700
rebase, more cleaning, add likely fix/check for potential Windows encoding issue
commit 2b5bf82
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sun Aug 2 10:25:26 2020 -0700
cleaning up structure and vestigial junk
commit dac90d8
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sat Aug 1 15:47:46 2020 -0700
queries and compressed files passing
commit 289b66f
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Jul 23 20:34:46 2020 -0700
overhaul deserialize.cpp for extremely flexible queries, generalize decompression
commit f1088bb
Author: Dirk Eddelbuettel <edd@debian.org>
Date: Wed Aug 5 07:05:11 2020 -0500
fix README thinko s/fparse/fload/ (closes #46)
commit 4e8337d
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sun Aug 2 10:25:26 2020 -0700
cleaning up structure and vestigial junk
commit 6120334
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Sat Aug 1 15:47:46 2020 -0700
queries and compressed files passing
commit df7e711
Author: Brendan Knapp <brendan.g.knapp@gmail.com>
Date: Thu Jul 23 20:34:46 2020 -0700
overhaul deserialize.cpp for extremely flexible queries, generalize decompression
commit 6e4a27c
Author: Dirk Eddelbuettel <edd@debian.org>
Date: Thu Jul 16 13:34:18 2020 -0500
updated changelog, rolled minor version
also ran M-x untabify on ChangeLog so unholy amount of whitespace change
addresses #43
Life got busy and I underestimated how tricky the more flexible queries would get.
Still WIP, but on track.