Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cache] Using the query as the basis of the cache key #4016

Merged
merged 1 commit into from Jan 12, 2018

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Dec 6, 2017

This PR resolves issue #3840 where previously the visualization data query key was a hashed form of the form_data.

Per the description in the issue the form_data stored in the DB which is what the /warm_up_cache/ endpoint leverages is not the same form data in the explorer view (which is augmented by both the front- and back-ends). Given the cached data is the query response it makes more sense to simply use the hashed query string as the cache key. In addition to resolving this issue the cache is based purely on the relevant portions of the form_data which are used to generate the query, thus UI changes to the color palette etc. will not violate the cache.

Note there is no way to determine via the cache API when the cache key was created and thus it is necessary to cache the DTTM alongside the data.

Note the cache is now explorer view agnostic, i.e., a cache key may be shared by multiple explorer views/slices.

to: @graceguo-supercat @michellethomas @mistercrunch @timifasubaa

@@ -203,8 +203,6 @@ def query_obj(self):

@property
def cache_timeout(self):
if self.form_data.get('cache_timeout'):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the cache is explorer view agnostic the cache timeout is only associated with the query and not the explorer form_data.

@john-bodley john-bodley force-pushed the john-bodley-issue-3840 branch 3 times, most recently from b09ca5e to 88c341e Compare December 7, 2017 08:22
@mistercrunch
Copy link
Member

If using the query as the cache key, then we should use the query results as the payload.

@john-bodley
Copy link
Member Author

@mistercrunch shouldn't the payload which includes annotations etc. remain unchanged per https://github.com/apache/incubator-superset/blob/master/superset/views/core.py#L1082. Wouldn't a augmented response to the /explore_json/ API endpoint require a number of additional frontend changes?

@mistercrunch
Copy link
Member

My point is there are numerous controls that don't affect the query but do affect the pandas logic. Take the rolling averages on line charts for instance: same SQL, different backend pandas logic...

So there are two potential approaches here:

  • change the caching location from the whole payload to just the interactions from the backend to the database
  • be more cautious around the subset of the form data keys (which controls really) affect the cache_key

@john-bodley
Copy link
Member Author

@mistercrunch sorry for the delay in responding to your comment. The PR is doing what you suggest as the first potential approach, i.e., the whole payload is not cached, only the data which is fetched from the database. The cache key is the deterministic database query.

superset/viz.py Outdated
merge_extra_filters(form_data)
s = str([(k, form_data[k]) for k in sorted(form_data.keys())])
def cache_key(self, query_obj):
s = self.datasource.get_query_str(query_obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it such that get_query_str needs to be deterministic. I think that's the case. Let's note that if we see cache-hit-misses in the future it may be related to get_query_str not being fully deterministic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I'll add a docstring to provide more context.

@mistercrunch
Copy link
Member

👍 This looks good to me. I'll let @graceguo-supercat push the merge button so that you can fit it best in your internal release schedule.

@john-bodley john-bodley force-pushed the john-bodley-issue-3840 branch 2 times, most recently from 5e4940a to 96542e1 Compare January 4, 2018 07:20
superset/viz.py Outdated
form_data = self.form_data.copy()
merge_extra_filters(form_data)
s = str([(k, form_data[k]) for k in sorted(form_data.keys())])
def cache_key(self, query_obj):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to include the datasource itself => you can have 2 databases with the same tables, which as far as I can tell would cause collisions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I think so too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabianmenges/@mistercrunch I agree. I made the cache key a tuple of the database name and query string. PTAL.

@mistercrunch
Copy link
Member

LGTM

superset/viz.py Outdated

return hashlib.md5(
json.dumps((
self.datasource.database.database_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just using the Database name as part of the cache key can cause authorization issues. I think we need to include the superset Datasource unique id.

E.g. I'm authorized to only query a single schema in a Database but data for a schema that I don't have access to is cached.

Also the database name can still collide with other database names on a different server.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabianmenges I've made the change to use the datasource ID (which indirectly has the database information encoded in it).

superset/viz.py Outdated
return hashlib.md5(
json.dumps((
self.datasource.id,
self.datasource.get_query_str(query_obj).encode('utf-8'),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably already an issue with the existing setup, but you may also want to consider cases where different users may see different results for the same query and datasource based on out of band information, such as a user that gets passed through in the query context (e.g. row-level permission in a view).

@john-bodley john-bodley force-pushed the john-bodley-issue-3840 branch 4 times, most recently from ddf1a92 to 932cd27 Compare January 10, 2018 03:21
@graceguo-supercat graceguo-supercat merged commit a7a6678 into apache:master Jan 12, 2018
mistercrunch added a commit to mistercrunch/superset that referenced this pull request Jan 17, 2018
mistercrunch added a commit that referenced this pull request Jan 17, 2018
mistercrunch pushed a commit that referenced this pull request Jan 18, 2018
* Working polygon layer for deckGL

* add js controls

* add thumbnail

* better description

* refactor to leverage line_column controls

* templates: open code and documentation on a new tab (#4217)

As they are external resources.

* Fix tutorial doesn't match the current interface #4138 (#4215)

* [bugfix] markup and iframe viz raise 'Empty query' (#4225)

closes #4222

Related to: #4016

* [bugfix] time_pivot entry got missing in merge conflict (#4221)

PR here #3518 missed a
line of code while merging conflicts with time_pivot viz

* Improve deck.gl GeoJSON visualization (#4220)

* Improve geoJSON

* Addressing comments

* lint

* refactor to leverage line_column controls

* refactor to use DeckPathViz

* oops
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Jan 18, 2018
hughhhh added a commit to lyft/incubator-superset that referenced this pull request Jan 18, 2018
* Working polygon layer for deckGL

* add js controls

* add thumbnail

* better description

* refactor to leverage line_column controls

* templates: open code and documentation on a new tab (apache#4217)

As they are external resources.

* Fix tutorial doesn't match the current interface apache#4138 (apache#4215)

* [bugfix] markup and iframe viz raise 'Empty query' (apache#4225)

closes apache#4222

Related to: apache#4016

* [bugfix] time_pivot entry got missing in merge conflict (apache#4221)

PR here apache#3518 missed a
line of code while merging conflicts with time_pivot viz

* Improve deck.gl GeoJSON visualization (apache#4220)

* Improve geoJSON

* Addressing comments

* lint

* refactor to leverage line_column controls

* refactor to use DeckPathViz

* oops
@john-bodley john-bodley deleted the john-bodley-issue-3840 branch January 19, 2018 01:28
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
* Working polygon layer for deckGL

* add js controls

* add thumbnail

* better description

* refactor to leverage line_column controls

* templates: open code and documentation on a new tab (apache#4217)

As they are external resources.

* Fix tutorial doesn't match the current interface apache#4138 (apache#4215)

* [bugfix] markup and iframe viz raise 'Empty query' (apache#4225)

closes apache#4222

Related to: apache#4016

* [bugfix] time_pivot entry got missing in merge conflict (apache#4221)

PR here apache#3518 missed a
line of code while merging conflicts with time_pivot viz

* Improve deck.gl GeoJSON visualization (apache#4220)

* Improve geoJSON

* Addressing comments

* lint

* refactor to leverage line_column controls

* refactor to use DeckPathViz

* oops
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* Working polygon layer for deckGL

* add js controls

* add thumbnail

* better description

* refactor to leverage line_column controls

* templates: open code and documentation on a new tab (apache#4217)

As they are external resources.

* Fix tutorial doesn't match the current interface apache#4138 (apache#4215)

* [bugfix] markup and iframe viz raise 'Empty query' (apache#4225)

closes apache#4222

Related to: apache#4016

* [bugfix] time_pivot entry got missing in merge conflict (apache#4221)

PR here apache#3518 missed a
line of code while merging conflicts with time_pivot viz

* Improve deck.gl GeoJSON visualization (apache#4220)

* Improve geoJSON

* Addressing comments

* lint

* refactor to leverage line_column controls

* refactor to use DeckPathViz

* oops
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants