Using memoise to enhance performance and reduce network traffic #366

Closed
rburghol opened this issue Nov 17, 2023 · 4 comments

Comments

@rburghol

rburghol commented Nov 17, 2023

This example is based on some code that @dblodgett-usgs shared in an issue on caching NWIS queries: DOI-USGS/dataRetrieval#681

Setup

Essentially, one selects a data directory to store the caches and wraps the nhdplusTools query functions with the memoise package. I used a 1-year timeout because I figured this data changes slowly, but whether that assumption holds matters less than the technique:

# Directory where the on-disk cache is stored
dir <- "/media/model/usgs/cache"
db <- memoise::cache_filesystem(dir)
# Timeouts are expressed in seconds
one_day <- 24*60^2
one_year <- 365 * one_day
# Memoised wrappers: results are keyed on the arguments and expire with the timeout
memo_get_nhdplus <- memoise::memoise(nhdplusTools::get_nhdplus, ~memoise::timeout(one_year), cache = db)
memo_get_UT <- memoise::memoise(nhdplusTools::get_UT, ~memoise::timeout(one_year), cache = db)
memo_plot_nhdplus <- memoise::memoise(nhdplusTools::plot_nhdplus, ~memoise::timeout(one_year), cache = db)
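
The same pattern can be tried without network access. Below is a minimal sketch, assuming only that memoise is installed; `slow_lookup` and the temporary cache directory are hypothetical stand-ins for illustration, not part of nhdplusTools.

```r
library(memoise)

# Hypothetical stand-in for a slow web request (not an nhdplusTools function)
slow_lookup <- function(id) {
  Sys.sleep(0.5)                 # simulate network latency
  paste0("feature-", id)
}

# Throwaway cache directory for this demo; use a permanent path in practice
cache_dir <- file.path(tempdir(), "memo-demo")
db <- memoise::cache_filesystem(cache_dir)
one_day <- 24 * 60^2             # seconds in a day

memo_lookup <- memoise::memoise(slow_lookup, ~memoise::timeout(one_day), cache = db)

# First call runs slow_lookup(); second call with the same argument reads the cache
t_first  <- system.time(r1 <- memo_lookup(42))[["elapsed"]]
t_second <- system.time(r2 <- memo_lookup(42))[["elapsed"]]
```

One detail worth knowing: `memoise::timeout()` works by rounding the current time down to a multiple of the interval and adding that value to the cache key, so entries expire at fixed interval boundaries rather than a fixed time after they were first stored.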

Retrieving point and basin info

After the initial setup, repeated calls to these functions with the same arguments within a year are answered from the cache instead of going out to fetch fresh data.

  • Retrieving point info
plat = 36.80806
plon = -80.93889
out_point = sf::st_sfc(sf::st_point(c(plon, plat)), crs = 4326)

# The first call takes almost a full second in my instance, since the data is not yet cached
system.time(nhd <- memo_get_nhdplus(out_point))
Spherical geometry (s2) switched off
Spherical geometry (s2) switched on
   user  system elapsed 
   0.13    0.00    0.83 
# subsequent query is instantaneous
nhd <- memo_get_nhdplus(out_point)

  • Retrieving basin info saves much more time: about 7 seconds for a relatively large basin (~3,500 sq km)
system.time(nhd <- memo_get_nhdplus(m_cat$basin))
Spherical geometry (s2) switched off
although coordinates are longitude/latitude, st_intersects assumes that they are planar
Spherical geometry (s2) switched on
   user  system elapsed 
   3.83    0.14    6.59 

# again:
system.time(nhd <- memo_get_nhdplus(m_cat$basin))
   user  system elapsed 
      0       0       0
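
Timing is one way to confirm a cache hit; memoise also lets you ask directly. The following is a small sketch using `memoise::has_cache()` and `memoise::forget()`. The `square` function is a hypothetical stand-in for a slow query, and the default in-memory cache is used to keep the example self-contained.

```r
library(memoise)

square <- function(x) x^2                 # hypothetical stand-in for a slow query
memo_square <- memoise::memoise(square)   # default in-memory cache

# has_cache() returns a function that takes the same arguments as the memoised one
before <- memoise::has_cache(memo_square)(4)   # nothing stored yet
invisible(memo_square(4))                      # populate the cache
after <- memoise::has_cache(memo_square)(4)    # result is now cached

# forget() clears everything cached for this function, forcing a fresh call next time
memoise::forget(memo_square)
cleared <- memoise::has_cache(memo_square)(4)
```

Something like `forget()` may be handy when a long timeout (such as the one-year value above) is in place but you know the upstream data has changed.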

Testing

  • Point and basin data caches work well in my limited testing:
    • Data caches persist and are accessible after restarting my Windows machine.
    • I once got an error retrieving from the cache after quitting RStudio.
  • Plots work within a session, but appear to fail after restarts:
    • Memoised plots successfully re-render within a single RStudio session.
    • Memoised plots achieve substantial time savings (3-5 seconds in my test).
    • Plots DO NOT re-render after a restart: the cached result is retrieved and the data seems intact, but no image shows up in the plot window.
      • This looks to be a consequence of how plot_nhdplus() plots: it renders the plot as a side effect, but does not return it as part of the function's data list.
m_cat <- memo_plot_nhdplus(list(nhd_out$comid))
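
The suspicion in the last bullet matches how memoisation works in general: a cache hit returns the stored value without re-running the function body, so side effects are not replayed. Here is a minimal sketch of that mechanism; `draw_and_return` is a hypothetical stand-in in which printed text plays the role of drawing, and it is not a claim about the internals of plot_nhdplus().

```r
library(memoise)

# Hypothetical stand-in for a plotting function: the printed line plays the role
# of the drawing side effect, and the list is the data the function returns.
draw_and_return <- function(n) {
  cat("drawing", n, "points\n")
  list(n = n)
}
memo_draw <- memoise::memoise(draw_and_return)

# capture.output() records what each call prints
first  <- capture.output(res1 <- memo_draw(10))   # body runs, side effect happens
second <- capture.output(res2 <- memo_draw(10))   # cache hit, body is skipped
```

In this sketch the second call returns the same data but prints nothing. If the same thing is happening with plots, one possible workaround is to memoise only the data retrieval and call the plotting code itself on every run.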


@dblodgett-usgs
Collaborator

+1 Thanks for the prompt @rburghol -- I need to think about how to best use this kind of thing in the package. nhdplusTools has undergone a lot of change recently and needs some further clean up.

@dblodgett-usgs
Collaborator

I've introduced memoise as a dependency for something else and will start working it in over time. Sorry this has been on the back burner for a while.

@dblodgett-usgs
Collaborator

I roughed in an implementation that I'm pretty happy with. No doubt there'll be issues, but it's a good start. See #364.

You can now set environment variables to control cache location (memory or disk) and duration. I've wrapped the functions that make cacheable requests in memoise, using the pattern discussed above to control cache behavior. It defaults to a filesystem cache with a one-day timeout.
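
For readers who want the general shape of environment-variable-driven caching, here is a rough sketch. The variable names `DEMO_CACHE_MODE` and `DEMO_CACHE_TIMEOUT`, the helper functions, and `fetch` are all hypothetical and chosen for this illustration only; they are not the names nhdplusTools uses, so check the package documentation for the real ones.

```r
library(memoise)

# Hypothetical environment variable names, used for this sketch only
cache_from_env <- function() {
  mode <- Sys.getenv("DEMO_CACHE_MODE", unset = "filesystem")
  if (identical(mode, "memory")) {
    memoise::cache_memory()                                  # lost when the session ends
  } else {
    memoise::cache_filesystem(file.path(tempdir(), "demo-cache"))
  }
}

timeout_from_env <- function() {
  # Timeout in seconds; defaults to one day (24 * 60^2 = 86400)
  as.numeric(Sys.getenv("DEMO_CACHE_TIMEOUT", unset = "86400"))
}

fetch <- function(id) paste0("record-", id)   # stand-in for a cacheable request

memo_fetch <- memoise::memoise(
  fetch,
  ~memoise::timeout(timeout_from_env()),
  cache = cache_from_env()
)

result <- memo_fetch(1)
```

In this sketch the cache backend is chosen once, when `memo_fetch` is created, while the timeout is re-read from the environment on every call because it sits inside the formula.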

@dblodgett-usgs
Collaborator

A start is in now -- please test and open follow-up issues if things are not right.
