understanding performance in streaming -- when to download first? #93

Open

cboettig opened this issue Jan 6, 2024 · 1 comment

cboettig commented Jan 6, 2024

Naively, I had thought that for a remote proxy cube, taking advantage of gdalcubes' lazy evaluation to perform the whole computation in a single pass would be faster than forcing it to write out to disk halfway through, read that back in, and continue. That is, I expected that:

proxy_cube |> 
    gdalcubes::crop(extent) |> 
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))

would be faster than:

proxy_cube |> 
    gdalcubes::crop(extent) |> 
    write_ncdf("sst.nc")

 ncdf_cube("sst.nc") |>
      reduce_time("sd(sst)") |>
      plot(col = viridisLite::viridis(10))

However, in my current examples I'm seeing over a 5x improvement in runtime with the two-step version: from over an hour in the first case to about 12 minutes in the second. Notably, monitoring my network download rates, the first case stays steadily around 100 Mb/s, whereas in the second case I see rates of 700-999 Mb/s (my max), suggesting that however gdalcubes computes the temporal reduce in streaming mode is limiting how fast the network can read in? (But I'm not sure what part of the hardware is the bottleneck here -- CPU does not max out either.)

It kinda makes sense to me -- the crop is still worth doing so we never download bytes we don't touch, but the temporal reduction requires all the data, so maybe it is faster to stream it to disk in big chunks and then process it locally? Do you think that's true in general, or just here?
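One way to probe that (an untested sketch -- it assumes stack_cube() accepts a chunking argument the way raster_cube() does, and it reuses vrt(), urls, url_dates, and extent from the reprex below): build the proxy cube with time chunks spanning the whole series, so each chunk read streams one long time series per spatial block before reduce_time() runs.

# Untested sketch: request large time chunks so each worker pulls the full
# time series for a spatial block in one go (chunking order is t, y, x;
# a chunking argument for stack_cube() is assumed here).
chunked_cube <- gdalcubes::stack_cube(vrt(urls),
                                      datetime_values = url_dates,
                                      band_names = "sst",
                                      chunking = c(length(urls), 128, 128))

chunked_cube |>
    gdalcubes::crop(extent) |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))

If that version gets close to the two-step timing, it would suggest the cost is in the many small per-chunk reads rather than in the reduction itself.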

This should be a reprex for the full data. (Aside: note that we annoyingly need a custom URL wrapper function -- I'm thrilled gdalcubes supports that even for stac_image_collection, though here I read from NASA's custom API, which is more reliable than NASA's STAC API. Also note this example reads NetCDF, not COG, because that's what NASA gives us.)

library(earthdatalogin) # remotes::install_github("boettiger-lab/earthdatalogin")
library(gdalcubes)

urls <- edl_search(short_name = "MUR-JPL-L4-GLOB-v4.1",
                   temporal = c("2020-01-01", "2021-12-31"))


# wrap each NetCDF URL as a GDAL VRT connection string, attaching the CRS and
# extent metadata via the vrt:// query string
vrt <- function(url) {
  prefix <- "vrt://NETCDF:/vsicurl/"
  suffix <- ":analysed_sst?a_srs=OGC:CRS84&a_ullr=-180,90,180,-90"
  paste0(prefix, url, suffix)
}

gdalcubes_options(parallel = parallel::detectCores()*2)
with_gdalcubes() # make the Earthdata credentials available to gdalcubes

# extract the YYYYMMDD stamp from each granule filename
url_dates <- as.Date(gsub(".*(\\d{8})\\d{6}.*", "\\1", urls), format="%Y%m%d")

bench::bench_time({
  proxy_cube <- gdalcubes::stack_cube(vrt(urls),
                                      datetime_values = url_dates,
                                      band_names = "sst")
})

extent <- list(left=-93, right=-76, bottom=41, top=49,
               t0="2020-01-01", t1="2021-12-31")

bench::bench_time({ # 12 min
  proxy_cube |> 
    gdalcubes::crop(extent) |> 
    write_ncdf("sst.nc")
})

bench::bench_time({
  ncdf_cube("sst.nc") |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))
})
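
For comparison, the single-pass streaming version (the first case above) can be timed the same way; that is the run that took over an hour here:

bench::bench_time({ # over an hour in the streaming case
  proxy_cube |>
    gdalcubes::crop(extent) |>
    reduce_time("sd(sst)") |>
    plot(col = viridisLite::viridis(10))
})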
@seamusrobertmurphy

Super useful! I was suspicious that this might be the case. Thanks for confirming.
