Cache scores object #169
Conversation
This approach to caching won't extend elegantly to other function calls (e.g. plots), so we'd want to use `memoise` to support those.
I don't think that plots or filtering are the bottleneck here; downloading the data is.
The number of machines in the pool should be << the number of users. (It is possible to store data in an on-disk cache that would be accessible to all app/R sessions, depending on server pool setup, but session load time is slower when reading from disk.)
Don't forget the middle step that is there: Docker containers. Each physical machine could host multiple Shiny containers at the same time.
I definitely agree that downloading is the slow part. I just mean anything else we might want to cache in the future wouldn't fit super well into the structure added here; whether or not we want to cache anything else is a separate decision.
Thanks for the correction; I updated the description above.
Kate doesn't recall any particular reason why caching wasn't being done and doesn't think adding caching will cause any problems.
Looks good from what I can see. Though, I'm not an R expert.
Description
Store scores as a global-ish variable (the environment where it's stored is not actually the global environment, but it is a unique environment accessible to all user sessions within the app instance/R session) and only update it when the contents of the S3 bucket change in some way. This reduces the number of times we read the score data from once per user session to twice a week.
Changes
- `server` function
- `getData` and `getFallbackData`
- `getS3Bucket` function
- `getAllData` function
- `getRecentDataHelper` closure. This function generates a function, `getRecentData`, that returns either the stored score data or new score data if the score data has been updated since it was last read. Whether it was updated since it was last read is determined by comparing the contents of the stored S3 bucket object, saved at the same time as the stored score data, with a newly read S3 bucket object. This check happens at the beginning of each user session. `getRecentData` updates the score data and the S3 bucket object stored in its parent (`getRecentDataHelper`) environment.
- The `getRecentData` function is generated from `getRecentDataHelper` so that a single environment is shared by, and persists between, all calls to `getRecentData`.
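The closure pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the bodies and signatures of `getS3Bucket` and `getAllData` are placeholder assumptions here.

```r
# Placeholders for the PR's real functions; signatures are assumptions.
getS3Bucket <- function() list(keys = c("scores.rds"), etags = c("abc123"))
getAllData  <- function(bucket) data.frame(score = 1)  # slow S3 read in reality

getRecentDataHelper <- function() {
  # These variables live in the closure's environment, which persists between
  # all calls to the generated function, acting as an in-memory cache shared
  # by every user session in this R process.
  cachedBucket <- NULL
  cachedScores <- NULL

  function() {
    bucket <- getS3Bucket()  # cheap check of the bucket's current contents
    if (is.null(cachedBucket) || !identical(bucket, cachedBucket)) {
      # First call, or the bucket changed since the last read: refresh.
      cachedScores <<- getAllData(bucket)
      cachedBucket <<- bucket
    }
    cachedScores
  }
}

# Generated once, outside `server`, so a single environment is shared by and
# persists between all calls across user sessions:
getRecentData <- getRecentDataHelper()
```

Because `getRecentData` is created once at app start-up, every Shiny session in the same R process calls the same closure and therefore reads from the same cache.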
Fixes
Right now, the dashboard is slow to load, especially when multiple people are using it simultaneously. This is because R Shiny is single-threaded by default, so user requests are queued and handled one at a time. If there are many users, the queue grows multiplicatively, exacerbated by requests that take a long time, notably reading score data from the S3 bucket.
Implications
This change means that score data should only be pulled twice a week: it is read for the first dashboard visitor following each pipeline run and cached for all subsequent users. Locally, sessions using cached data load in about 1 sec, down from about 10 sec.
Since a given instance of the "cache" (scores stored as a variable in memory) is only available within a single R session, it is desirable to have many users share a given app/R session. It's not desirable to have a huge server/session pool, since the data is loaded (the slow step) again for each R session; for example, the number of R sessions should be << the number of users. (It is possible to store data in an on-disk cache that would be accessible to all app/R sessions, depending on server pool setup, but session load time is slower when reading from disk. It might also be possible to allocate a shared memory region using `mmap`.)

This approach to caching won't extend elegantly to other function calls (e.g. plots), so we'd want to use `memoise` to support those if desired.
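For reference, a `memoise`-based version might look like the sketch below. The `getAllData` body here is a stand-in, and the on-disk variant assumes the `cachem` package; nothing in this block is from the PR itself.

```r
library(memoise)

# Placeholder for the real (slow) S3 read; illustrative only.
getAllData <- function() {
  Sys.sleep(2)          # stands in for the slow S3 download
  data.frame(score = 1) # stands in for the scores object
}

# In-memory cache, one per R process (mirrors the approach in this PR):
getAllDataMem <- memoise(getAllData)

# On-disk cache accessible to all app/R sessions on the machine, at the cost
# of slower reads (uses the cachem package):
getAllDataDisk <- memoise(getAllData, cache = cachem::cache_disk("score-cache"))

getAllDataMem()  # first call runs getAllData; later calls hit the cache
```

The appeal of `memoise` is that the same one-line wrapping works for any pure function, so plot or filter results could be cached the same way later without a bespoke closure for each.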