Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add datasource.changed_on to cache_key #8901

Merged
merged 3 commits into from
Jan 16, 2020
Merged

fix: add datasource.changed_on to cache_key #8901

merged 3 commits into from
Jan 16, 2020

Conversation

villebro
Copy link
Member

@villebro villebro commented Dec 29, 2019

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

When changing the SQL query in a datasource, the cache key doesn't reflect this change, as only the datasource id is added to the cache key. The same problem also applies to metrics and expressions. This PR adds the changed_on property of the datasource to the dict on which cache_key is based, ensuring that any change to the datasource will invalidate cached results. While this might cause unnecessary cache misses if datasources are automatically refreshed periodically, this at least ensures that no updates to datasources (sql query, metrics, expressions) go unnoticed.

TEST PLAN

Tested locally by replicating the problem in the issue, and ensuring that the bug goes away with this diff. New unit tests added to ensure that the new QueryContext/QueryObject classes work as intended. A small bug was also fixed in QueryContext.

ADDITIONAL INFORMATION

REVIEWERS

@durchgedreht

@codecov-io
Copy link

Codecov Report

Merging #8901 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #8901   +/-   ##
=======================================
  Coverage   58.97%   58.97%           
=======================================
  Files         359      359           
  Lines       11333    11333           
  Branches     2787     2787           
=======================================
  Hits         6684     6684           
  Misses       4471     4471           
  Partials      178      178

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 191aca1...7713e08. Read the comment docs.

@willbarrett
Copy link
Member

Thanks for this @villebro! My recommendation would be to go with the changed_on timestamp instead of the sql column. As you mention, Datasources define a number of different things that could affect cache behavior or results. Causing cache misses on every Datasource update feels more future-proof, as this logic won't need to change every time Datasource behavior changes. We may cause some unnecessary cache misses, but I think we should lean in the direction of ensuring correctness first and thinking about performance second.

@villebro
Copy link
Member Author

Thanks for chiming in @willbarrett ; personally I also started leaning towards using changed_on instead of all potentially interesting attributes. I'll update the PR to reflect this. It would be valuable to hear if large deployments of Superset are using large scale automated metadata refreshes of existing datasources, and if this might cause unnecessary cache misses. @john-bodley @etr2460 @betodealmeida any feedback would be valuable to understand if this would be a blocker in your deployments?

@villebro
Copy link
Member Author

villebro commented Jan 8, 2020

As there doesn't seem to be any objections to adding changed_on to cache_key, I'll try to finalize this in the coming days.

@villebro
Copy link
Member Author

FYI working on this fix+unit tests surfaced some other issues I will be addressing with this PR, i.e. will take slightly longer to complete than anticipated.

@villebro villebro changed the title [WIP] Add sql query to cache_key Add datasource.changed_on to cache_key Jan 13, 2020
Comment on lines +160 to +172
def cache_key(self, query_obj: QueryObject, **kwargs) -> Optional[str]:
extra_cache_keys = self.datasource.get_extra_cache_keys(query_obj.to_dict())
cache_key = (
query_obj.cache_key(
datasource=self.datasource.uid,
extra_cache_keys=extra_cache_keys,
changed_on=self.datasource.changed_on,
**kwargs
)
if query_obj
else None
)
return cache_key
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache key calculation logic was broken into a method of its own to enable easier unit testing.

metric if "expressionType" in metric else metric["label"] # type: ignore
for metric in metrics
]
self.metrics = [utils.get_metric_name(metric) for metric in metrics]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During testing I noticed that the existing logic was incomplete; utils.get_metric_name on the other hand is used elsewhere and handles all metric types correctly (legacy and ad-hoc).

Comment on lines -134 to +164
# TODO: update once get_data is implemented for QueryObject
with self.assertRaises(Exception):
self.get_resp("/api/v1/query/", {"query_context": data})
qc_dict = self._get_query_context_dict()
data = json.dumps(qc_dict)
resp = json.loads(self.get_resp("/api/v1/query/", {"query_context": data}))
self.assertEqual(resp[0]["rowcount"], 100)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old unit test seemed to be incomplete, so fixed a few bugs in the body (limit -> row_limit and removed time_range) to make it work properly.

@villebro
Copy link
Member Author

This is ready for review.

@villebro villebro changed the title Add datasource.changed_on to cache_key fix: add datasource.changed_on to cache_key Jan 13, 2020
@john-bodley
Copy link
Member

john-bodley commented Jan 13, 2020

@villebro do you think there is merit in adding a line item to UPDATING.md to mention that this PR will invalidate the entire cache given that the keys used for hashing have changed?

@villebro
Copy link
Member Author

@john-bodley oh absolutely, good idea. Will add a note in UPDATING.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jan 13, 2020
Copy link
Member

@willbarrett willbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lookin' good!

@mistercrunch mistercrunch merged commit c087a48 into apache:master Jan 16, 2020
john-bodley added a commit that referenced this pull request Jan 28, 2020
john-bodley added a commit that referenced this pull request Jan 28, 2020
john-bodley added a commit to airbnb/superset-fork that referenced this pull request Jan 28, 2020
extra_cache_keys = self.datasource.get_extra_cache_keys(query_obj.to_dict())
cache_key = (
query_obj.cache_key(
datasource=self.datasource.uid,
extra_cache_keys=extra_cache_keys,
changed_on=self.datasource.changed_on,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro apologies for the late comment. In your PR description you mention that the cache key should be a function on whether the column or metric definitions associated with a datasource are changed.

On line #166 you merely use the datasource change_on and thus I was wondering whether a changes to the columns and/or metrics cascade, i.e., trigger an update to the datasource changed_on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my understanding that changed_on is updated every time any change is applied, i.e. does not check if only relevant metrics/expressions have changed. While this can cause unnecessary cache misses, I felt the added complexity of checking only for relevant changes was not warranted unless the ultimately proposed simpler solution was found to be too generic (I tried to convey this in the unit test which only changed the description). If this does cause unacceptable amounts of cache misses I think we need to revisit this logic; until then I personally think this is a good compromise. However, I'm happy to open up the discussion again if there are opinions to the contrary.

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.36.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 0.36.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL Lab Query doesn't get reflected in chart
5 participants