rcppsimdjson has inefficient network downloads #44
I have actually located the cause of the slowdown: it is `utils::download.file`. I have replaced it with `curl::curl_download`, and it is twice as fast as `jsonlite::fromJSON`. I think it can go faster with more curl tweaks.
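As a minimal sketch of the swap being described (illustrative only, not the package's actual internals; the helper name `download_json` is hypothetical):

```r
# Hypothetical helper illustrating the swap described above:
# download with curl::curl_download() instead of utils::download.file(),
# then hand the local copy to the parser.
download_json <- function(url) {
  tf <- tempfile(fileext = ".json")
  curl::curl_download(url, tf, quiet = TRUE)
  tf                      # local path, ready for RcppSimdJson::fload()
}
```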
@melsiddieg If you have done the legwork of changing the code and running benchmarks, maybe you could open a PR and include the benchmark results... this might help motivate @eddelbuettel.
Sure, no problem. It is a simple tweak, actually.
Please please please please hold off. I have a few days of experience with R, expected something like this, and can tell you it is complicated. The code does the right thing _by defaulting to
@lemire And, with all due respect, the benchmark above is silly. You are doing precisely what you belabored earlier. Maybe this is fairer:

```r
url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
res <- microbenchmark::microbenchmark(straight = curl::curl_download(url, tempfile()),
                                      baseR    = download.file(url, tempfile()),
                                      times = 5L)
```

(still needs a `quiet=` argument for both) leading to

```
R> res
Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval cld
 straight  520.735  525.681  569.826  547.477  577.775   677.463     5   a
    baseR 1086.744 1184.752 5415.459 1336.059 1389.182 22080.559     5   a
R>
```

So select
@eddelbuettel RcppSimdJson with curl_download by default
@eddelbuettel @lemire I think someone else could implement it in a separate R package or something; still, it seems like a small imperfection to have in this otherwise brilliant package.
@melsiddieg Have you considered not benchmarking download and parse? I really do not care who wins; that is like timing who boils an egg faster when you include a drive to the store to buy eggs. Here is mine, updated over all `method` arguments.

Output:

```
edd@rob:~/git/rcppsimdjson(master)$ Rscript demo/downloadBenchmark.R
Unit: milliseconds
          expr      min      lq     mean   median       uq       max neval cld
      straight  554.601  610.33  607.669  621.958  624.695   626.763     5   a
     baseRwget 2339.938 2353.82 2557.452 2385.150 2808.722  2899.631     5   a
 baseRinternal 2160.950 2395.10 5600.646 2429.357 2615.772 18402.050     5   a
  baseRlibcurl 1407.968 2458.61 2582.007 2512.940 2811.564  3718.959     5   a
     baseRauto 1399.441 2331.44 2405.339 2541.951 2781.279  2972.583     5   a
     baseRcurl 2164.464 2221.29 2423.828 2547.877 2559.365  2626.139     5   a
Unit: relative
          expr     min      lq    mean  median      uq      max neval cld
      straight 1.00000 1.00000 1.00000 1.00000 1.00000  1.00000     5   a
     baseRwget 4.21914 3.85664 4.20863 3.83491 4.49615  4.62636     5   a
 baseRinternal 3.89641 3.92427 9.21660 3.90599 4.18728 29.36048     5   a
  baseRlibcurl 2.53870 4.02832 4.24903 4.04037 4.50070  5.93360     5   a
     baseRauto 2.52333 3.81996 3.95830 4.08702 4.45222  4.74276     5   a
     baseRcurl 3.90274 3.63950 3.98873 4.09654 4.09698  4.19000     5   a
edd@rob:~/git/rcppsimdjson(master)$
```

Code:

```r
url <- "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/gene/TET1/snp?limit=200&skip=-1&skipCount=false&count=false&Output%20format=json&merge=false"
res <- microbenchmark::microbenchmark(straight      = curl::curl_download(url, tempfile()),
                                      baseRauto     = download.file(url, tempfile(), quiet=TRUE),
                                      baseRinternal = download.file(url, tempfile(), method="internal", quiet=TRUE),
                                      baseRlibcurl  = download.file(url, tempfile(), method="libcurl", quiet=TRUE),
                                      baseRwget     = download.file(url, tempfile(), method="wget", quiet=TRUE),
                                      baseRcurl     = download.file(url, tempfile(), method="curl", quiet=TRUE),
                                      times = 5L)
print(res, order="median")
print(res, order="median", unit="relative")
```
@eddelbuettel I speak with the utmost respect and love for both of you and the whole RcppSimdJson team. My point is that fload does both download and parse, and that the parsing part is flawless and blazing fast; it is a matter of whether we care about optimizing the speed of downloads as well. In my opinion, it is a shame to have a blazing fast parser sitting behind slow downloaders. The whole point of RcppSimdJson is speed, and with the default download.file this advantage is wasted when loading from a URL, which is a common if not the commonest use case of JSON parsers.
```r
myfload <- function(file, ...) {
  tf <- tempfile()
  curl::curl_download(file, tf)
  fload(tf, ...)
}
```

Untested, but you get the idea.
I locked this, and I would appreciate it if you could respect this by not opening another issue on the same topic. The issue is closed. We can revisit in two weeks or so, which gives us some time to look at things that may be more consequential than relitigating this question. Thanks for your patience.
I think that @melsiddieg was correct to complain about the performance of `RcppSimdJson::fload`. Something does not add up. Given how fast RcppSimdJson is, it should be roughly as fast as `curl::curl_download`. But it is not! The file is 784 KB. If you replace `tempfile()` with a file name and inspect the result, you find that, indeed, curl is grabbing every little byte. So it seems that RcppSimdJson could go faster by invoking curl_download and then parsing the resulting temporary file. It is also possible to load the file directly into memory (`curl_fetch_memory`), but I did not want to use that as a benchmark since you might argue (rightly so) that it might be cheating.

I am not 100% clear on why there is such a difference, but it does warrant investigation.
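The in-memory route can be sketched as follows (a sketch under stated assumptions: `curl::curl_fetch_memory()` returns a list with a raw `content` vector, and `RcppSimdJson::fparse()` accepts raw input; the wrapper name `fload_memory` is hypothetical):

```r
# Hypothetical wrapper: fetch straight into memory and parse the raw
# bytes, skipping the temporary file entirely.
fload_memory <- function(url) {
  res <- curl::curl_fetch_memory(url)
  stopifnot(res$status_code == 200L)   # fail loudly on HTTP errors
  RcppSimdJson::fparse(res$content)    # res$content is a raw vector
}
```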