Cache scores object #169

Merged 3 commits on Sep 15, 2021

Conversation

@nmdefries (Collaborator) commented Sep 9, 2021

Description

Store scores as a global-ish variable (the environment where they are stored is not actually the global environment, but it is a unique environment accessible to all user sessions within the app instance/R session) and only update it when the contents of the S3 bucket change in some way. Reduces the number of times we read the score data, from once per user session to twice a week.
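The shared-environment idea can be sketched as follows. This is a minimal illustration, not the PR's actual code; `cache`, `get_scores`, and the stand-in data frame are hypothetical names.

```r
# Sketch: objects created at the top level of app.R, outside the server
# function, live in an environment shared by every user session running
# in the same R process.
cache <- new.env(parent = emptyenv())

get_scores <- function() {
  if (!exists("scores", envir = cache)) {
    # Hypothetical stand-in for the slow read from the S3 bucket; in the
    # app this would only run for the first session after a change.
    cache$scores <- data.frame(forecaster = c("a", "b"), wis = c(0.1, 0.2))
  }
  cache$scores
}
```

Every session that calls `get_scores()` after the first gets the in-memory copy instead of triggering another read.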

Changes

  • Move data-fetching code out of server function.
    • Functions getData and getFallbackData
    • Code to get S3 bucket object. Create replacement getS3Bucket function.
    • Code to load data from S3 bucket, filter columns, and rename columns. Create replacement getAllData function.
  • Create getRecentDataHelper closure. This function generates a function, getRecentData, that returns the stored score data, or newly read score data if the data has been updated since it was last read. Whether the data has been updated is determined by comparing a freshly read S3 bucket object against the stored S3 bucket object, which is saved at the same time as the stored score data. This check happens at the beginning of each user session. getRecentData updates the score data and the S3 bucket object stored in its parent (getRecentDataHelper) environment.
  • Create a global getRecentData function from getRecentDataHelper so that a single environment is shared by and persists between all calls to getRecentData.
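The closure described in the bullets above might look roughly like this. It is a sketch: `getS3Bucket` and `getAllData` are the helper names from the bullets, and the internal variable names are illustrative, not copied from the PR.

```r
getRecentDataHelper <- function() {
  # State shared across all calls, held in this function's environment.
  stored_bucket <- NULL  # S3 bucket object saved alongside the cached scores
  stored_scores <- NULL  # cached score data

  getRecentData <- function() {
    new_bucket <- getS3Bucket()  # helper described in the bullets above
    if (is.null(stored_bucket) || !identical(stored_bucket, new_bucket)) {
      # First read, or the bucket contents changed: reload and cache.
      stored_scores <<- getAllData(new_bucket)  # helper from the bullets above
      stored_bucket <<- new_bucket
    }
    stored_scores
  }

  getRecentData
}

# Called once at app start-up, so a single environment is shared by,
# and persists between, all calls to getRecentData.
getRecentData <- getRecentDataHelper()
```

The `<<-` assignments walk up to the enclosing `getRecentDataHelper` environment, which is what makes the cached values survive between calls.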

Fixes

Right now, the dashboard is slow to load, especially when multiple people are using it simultaneously. This is because R Shiny is single-threaded by default, so user requests are queued and handled one at a time. With many users, the queue grows multiplicatively longer, exacerbated by requests that take a long time, notably reading score data from the S3 bucket.

Implications

This change means that score data should only be pulled twice a week, by the first dashboard visitor following each pipeline run, and served from the cache for all other users. Locally, sessions using cached data load in about 1 second, down from about 10 seconds.

Since a given instance of the "cache" (scores stored as a variable in memory) is only available within a single R session, it is desirable to have many users share a given app/R session. It's not desirable to have a huge server/session pool, since the data is loaded (the slow step) again for each R session. For example, the number of R sessions should be << the number of users. (It is possible to store data in an on-disk cache that would be accessible to all app/R sessions, depending on server pool setup, but session load time is slower when reading from disk. It might also be possible to allocate a shared memory region using mmap.)

This approach to caching won't extend elegantly to other function calls (e.g. plots), so we'd want to use memoise to support those if desired.
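For that future case, the memoise package wraps a function so that results are cached per argument set. A hypothetical sketch (`slow_plot_data` is illustrative, not code from this PR):

```r
library(memoise)

# Hypothetical expensive call we might want to cache later (e.g. data
# preparation for a plot); not part of this PR.
slow_plot_data <- function(forecaster) {
  Sys.sleep(0.5)  # stand-in for expensive work
  paste("plot data for", forecaster)
}

# memoise() returns a wrapped function that remembers results keyed on
# its arguments, so repeated calls with the same argument are instant.
slow_plot_data <- memoise(slow_plot_data)
```

Unlike the bucket-comparison closure in this PR, a memoised cache is invalidated with memoise::forget() or a cache timeout rather than by checking the S3 bucket contents.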

@nmdefries nmdefries requested review from krivard and sgratzl and removed request for krivard September 10, 2021 16:05
@sgratzl (Member) left a comment

This approach to caching won't extend elegantly to other function calls (e.g. plots), so we'd want to use memoise to support those.

I don't think that plots or filtering are the bottleneck here, but downloading the data.

number machines in the pool should be << the number of users. (It is possible to store data in an on-disk cache that would be accessible to all app/R sessions, depending on server pool setup, but session load time is slower when reading from disk.)

don't forget the middle step that is there: docker containers. Each physical machine could host multiple shiny containers at the same time.

@nmdefries (Collaborator, Author) commented Sep 10, 2021

This approach to caching won't extend elegantly to other function calls (e.g. plots), so we'd want to use memoise to support those.

I don't think that plots or filtering are the bottleneck here but downloading the data.

I definitely agree that downloading is the slow part. I just mean anything else we might want to cache in the future wouldn't fit super well into the structure added here; whether or not we want to cache anything else is a separate decision.

don't forget the middle step that is there: docker containers. Each physical machine could host multiple shiny containers at the same time.

Thanks for the correction; I updated the description above.

@nmdefries (Collaborator, Author) commented:

Kate doesn't recall any particular reason why caching wasn't being done and doesn't think adding caching will cause any problems.

@nmdefries nmdefries requested a review from sgratzl September 14, 2021 20:32
@sgratzl (Member) left a comment

Looks good from what I can see, though I'm not an R expert.
