Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cache_key_wrapper to Jinja template processor #7816

Merged
merged 1 commit into from
Jul 20, 2019
Merged

Add cache_key_wrapper to Jinja template processor #7816

merged 1 commit into from
Jul 20, 2019

Conversation

villebro
Copy link
Member

@villebro villebro commented Jul 3, 2019

CATEGORY

Choose one

  • Enhancement (new features, refinement)
  • Add tests
  • Documentation

SUMMARY

Currently using dynamic filters that change based on the logged in user can unintended consequences when caching is enabled. For example, if a where-clause references the currenct_username() function to filter only rows for the currently logged in user, the result will be cached with the same key that another user would get, despite both users getting different results from the current_username() function. This is because the rendered result of calling current_username() is never stored in the query_obj that is the basis for the cache key.

This PR adds a new function cache_key_wrapper to the jinja context, which can be wrapped around any function call, and stores the called values in a list extra_cache_keys, which are added to the cache_dict prior to hashing. This ensures that both users get unique values when referencing the same datasource. In practice this is done by "compiling" the query before calculation of the cache key, and storing all values that have been passed to cache_key_wrapper, which is then considered when calling the cache_key function in viz.py (legacy) and query_context.py/query_object.py (future). This adds some overhead, as the full SQLAlchemy selectable has to be generated and compiled once prior to cache key calculation, and again if there isn't a cache hit. The selectable could be easily stored and reused if there isn't a cache hit, but since the overhead is rather unnoticeable, I decided against it in favor of code readability.

SCREENSHOT OF NEW DOCS

image

TEST PLAN

Tested locally + CI

ADDITIONAL INFORMATION

REVIEWERS

@mistercrunch @betodealmeida @john-bodley @duffar12

@villebro
Copy link
Member Author

villebro commented Jul 8, 2019

@duffar12 I would appreciate your thoughts on this PR, as I believe it fixes your immediate problem and could be useful to others, too.

@villebro
Copy link
Member Author

It seems feedback from the original issue poster isn't forthcoming, but would like to get this processed anyway. @mistercrunch @john-bodley do you have any thoughts on this?

@john-bodley
Copy link
Member

@villebro I think this approach seems valid. My only question is it seems that cache_key takes a dictionary of extra cache key arguments whereas in your implementation you're using a single predefined key extra_cache_keys with a list of keys. I can understand why this is the case from your implementation but was wondering whether it's viable to use a dictionary.

@villebro
Copy link
Member Author

@john-bodley The idea is basically to gather up all the objects that have been passed to query_key_wrapper and make sure they appear in the order they've been called. Therefore having them under one key extra_cache_keys seemed to me the most valid approach, as adding a key to each value probably wouldn't be of much utility, and might confuse/interfere with the other keys in cache_obj.

@john-bodley
Copy link
Member

@villebro agreed.

@villebro
Copy link
Member Author

Ok to merge this?

@mistercrunch mistercrunch merged commit 4568b2a into apache:master Jul 20, 2019
@duffar12
Copy link

@villebro, sorry I missed all of this. Looks good though. Thanks for getting this in

alex-mark pushed a commit to alex-mark/incubator-superset that referenced this pull request Jul 29, 2019
@villebro villebro deleted the jinja_cache branch August 21, 2019 15:37
@etr2460 etr2460 mentioned this pull request Aug 23, 2019
12 tasks
@etr2460
Copy link
Member

etr2460 commented Aug 27, 2019

@villebro: We’re running into an issue where charts and dashboards are significantly slower after your change. This seems to be because we’re “compiling” the query every time before checking the cache for it, which results in several requests to our Presto/Hive backend (show partitions, show columns, etc.) to resolve the templates. When we load a dashboard with 20 charts, all with a use of latest partition, it significantly slows down and reduces the usefulness of caching.

I had a couple possible solutions to this:

  1. Only compile the jinja templates that relate to the current user. Your fix was aimed at fixing uses of current_username, maybe we can just special case this here?
  2. Cache the results of the queries that compile the jinja templates. That way we wouldn’t need to repeatedly ping Presto/Hive for the partitions/columns in the table.

What do you think?
cc @graceguo-supercat @michellethomas

@villebro
Copy link
Member Author

I propose introducing a method uses_jinja_templates() which checks the query and filters for {{.*cache_key_wrapper(.*}} or similar, and if none are present, doesn't do the double compilation.

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 0.34.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Queries using current_username() method do not run properly when caching is enabled
5 participants