feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509

ktmud · 2020-10-31T03:30:23Z

SUMMARY

💥 Breaking Change

Rename config value TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG, and save datasource query results to this cache, too.

Motivation

Companies like Airbnb have different security levels for Superset itself and the datasources Superset connects to: anyone can login to the same Superset, but not everyone have access to all data. Superset has SecurityManager to check whether a user has access to certain datasource, but all query results are still stored in the same cache, regardless of users' access levels. If a malicious party obtained access to the cache server via certain Superset user in a high-risk region, they get access to all cached data.

By separating cache storage for Superset metadata and the actual data users consume, we can have a better insulation of sensitive data by configuring different cache backend for high-risk regions while keeping the cache for all the Superset metadata in sync.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TEST PLAN

CI

ADDITIONAL INFORMATION

Has associated issue:
Changes UI
Requires DB Migration: Superset admins need to manually update cache configs
Confirm DB Migration upgrade and downgrade tested.
Introduces new feature or API
Removes existing feature or API

codecov-io · 2020-10-31T03:40:13Z

Codecov Report

Merging #11509 (226fbe9) into master (68693c7) will increase coverage by 4.26%.
The diff coverage is 65.68%.

@@            Coverage Diff             @@
##           master   #11509      +/-   ##
==========================================
+ Coverage   62.86%   67.12%   +4.26%     
==========================================
  Files         889      889              
  Lines       43055    43062       +7     
  Branches     4017     4017              
==========================================
+ Hits        27065    28905    +1840     
+ Misses      15811    14060    -1751     
+ Partials      179       97      -82

Flag	Coverage Δ
cypress	`56.05% <ø> (?)`
javascript	`62.81% <ø> (ø)`
python	`62.87% <65.68%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
superset/utils/decorators.py	`100.00% <ø> (+34.17%)`	⬆️
superset/utils/screenshots.py	`36.36% <0.00%> (ø)`
superset/viz_sip38.py	`0.00% <0.00%> (ø)`
superset/utils/cache.py	`57.57% <54.38%> (-22.43%)`	⬇️
superset/models/dashboard.py	`75.50% <75.00%> (+0.12%)`	⬆️
superset/viz.py	`59.40% <88.88%> (+0.07%)`	⬆️
superset/__init__.py	`100.00% <100.00%> (ø)`
superset/common/query_context.py	`83.23% <100.00%> (-4.72%)`	⬇️
superset/config.py	`90.07% <100.00%> (-0.04%)`	⬇️
superset/db_engine_specs/hive.py	`84.04% <100.00%> (+0.06%)`	⬆️
... and 194 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68693c7...226fbe9. Read the comment docs.

ktmud · 2020-10-31T03:33:24Z

superset/common/query_context.py

 from superset.common.query_object import QueryObject
 from superset.connectors.base.models import BaseDatasource
 from superset.connectors.connector_registry import ConnectorRegistry
 from superset.exceptions import QueryObjectValidationError
+from superset.extensions import cache_manager, security_manager


Import the cache_manager instead of cache to make it clear which cache we are using.

ktmud · 2020-10-31T03:33:46Z

superset/common/query_context.py

@@ -285,7 +286,11 @@ def get_df_payload(  # pylint: disable=too-many-statements
                status = utils.QueryStatus.FAILED
                stacktrace = utils.get_stacktrace()

-            if is_loaded and cache_key and cache and status != utils.QueryStatus.FAILED:


and cache is not needed because of #9020

ktmud · 2020-10-31T03:35:45Z

superset/models/core.py

@@ -450,8 +451,8 @@ def inspector(self) -> Inspector:
        return sqla.inspect(engine)

    @cache_util.memoized_func(
-        key=lambda *args, **kwargs: "db:{}:schema:None:table_list",
-        attribute_in_key="id",
+        key=lambda self, *args, **kwargs: f"db:{self.id}:schema:None:table_list",


A small refactor to make the memoize function more flexible.

ktmud · 2020-10-31T03:37:30Z

superset/utils/cache.py

+
+def etag_cache(
+    max_age: int, check_perms: Callable[..., Any], cache: Cache = cache_manager.cache
+) -> Callable[..., Any]:


Moved over from superset.utils.decorators

ktmud · 2020-10-31T03:42:32Z

superset/utils/cache_manager.py

        self._thumbnail_cache = Cache()

    def init_app(self, app: Flask) -> None:
        self._cache.init_app(app, app.config["CACHE_CONFIG"])
-        self._tables_cache.init_app(app, app.config["TABLE_NAMES_CACHE_CONFIG"])
+        self._data_cache.init_app(app, app.config["DATA_CACHE_CONFIG"])


Didn't add backward compatibility because the data objects put into this cache will become much much larger. If a Superset admin has configured in-memory cache for TABLE_NAMES_CACHE_CONFIG, they might run into problems. Therefore it's better to make things break to make them aware of this change.

For most users, the migration is as simple as renaming the config value.

bkyryliuk · 2020-10-31T05:38:23Z

I will be on PTO next week, won't have a change to give a full review. 1 suggestion it would be nice to add a unit test that will document the expected behavior and will prevent from the regressions, tests can impersonate the user to hit various endpoints.

ktmud · 2020-10-31T07:44:39Z

I will be on PTO next week, won't have a change to give a full review. 1 suggestion it would be nice to add a unit test that will document the expected behavior and will prevent from the regressions, tests can impersonate the user to hit various endpoints.

I added two test assertions in cache_tests.py. Do you think we need more?

UPDATING.md

superset/config.py

superset/utils/cache.py

UPDATING.md

superset/models/dashboard.py

ktmud · 2020-11-02T19:25:47Z

superset/utils/cache_manager.py

+        self._cache.init_app(
+            app,
+            {
+                "CACHE_DEFAULT_TIMEOUT": app.config["CACHE_DEFAULT_TIMEOUT"],


Make sure the global default is applied as documented.

ktmud · 2020-11-10T21:21:49Z

@john-bodley @etr2460 So previously the test tests/databases/api_tests.py::TestDatabaseApi::test_create_database_conn_fail was failing because I setup a cache backend for DATA_CACHE_CONFIG while in the old test config TABLE_NAMES_CACHE_CONFIG was set to cache_type: null.

Not sure why this cache would fail test_create_database_conn_fail though. Maybe @dpgaspar have some idea?

ktmud · 2020-11-10T22:01:17Z

Here's a test run that fails when the only change comparing to master is to enable TABLE_NAMES_CACHE_CONFIG.

ktmud · 2020-11-10T22:04:36Z

tests/databases/api_tests.py

@@ -95,7 +95,7 @@ def test_get_items(self):
            "function_names",
            "id",
        ]
-        self.assertEqual(response["count"], 2)


count = 2 seems to assume some tests run before this test. If you just initialized the database and run this test alone, this test will fail.

graceguo-supercat

please resolve conflicts. otherwise LGTM.

villebro

LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching?

villebro · 2020-11-12T21:30:48Z

superset/utils/cache.py

-        if cache_manager.tables_cache:
-
-            def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any:
-                if not kwargs.get("cache", True):
-                    return f(self, *args, **kwargs)
-
-                if attribute_in_key:
-                    cache_key = key(*args, **kwargs).format(
-                        getattr(self, attribute_in_key)
-                    )
-                else:
-                    cache_key = key(*args, **kwargs)
-                o = cache_manager.tables_cache.get(cache_key)
-                if not kwargs.get("force") and o is not None:
-                    return o
-                o = f(self, *args, **kwargs)
-                cache_manager.tables_cache.set(
-                    cache_key, o, timeout=kwargs.get("cache_timeout")
-                )
-                return o
-
-        else:
-            # noop
-            def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any:
+        def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any:
+            if not kwargs.get("cache", True):
                return f(self, *args, **kwargs)

+            cache_key = key(self, *args, **kwargs)
+            obj = cache.get(cache_key)
+            if not kwargs.get("force") and obj is not None:
+                return obj
+            obj = f(self, *args, **kwargs)
+            cache.set(cache_key, obj, timeout=kwargs.get("cache_timeout"))
+            return obj
+


ktmud · 2020-11-12T23:52:58Z

LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching?

I think it is referring to this logic here: https://github.com/apache/incubator-superset/blob/600a6fa92a0bbe5bfd93371db5ced6af556c3697/superset/viz.py#L423-L433

villebro · 2020-11-13T07:56:34Z

LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching?

I think it is referring to this logic here:

https://github.com/apache/incubator-superset/blob/600a6fa92a0bbe5bfd93371db5ced6af556c3697/superset/viz.py#L423-L433

Ah sorry, I misinterpreted the comment, never mind.

villebro

This is a good layer of optional additional security, thanks for implementing.

bkyryliuk

looks great

superset/utils/cache.py

The corresponding cache will now also cache the query results.

) * feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG The corresponding cache will now also cache the query results. * Slice use DATA_CACHE_CONFIG CACHE_DEFAULT_TIMEOUT * Add test for default cache timeout * rename FAR_FUTURE to ONE_YEAR_IN_SECS

pull-request-size bot added the size/XL label Oct 31, 2020

ktmud force-pushed the data-cache branch from 27ee045 to 900a10b Compare October 31, 2020 03:39

ktmud commented Oct 31, 2020

View reviewed changes

ktmud requested review from villebro, john-bodley, bkyryliuk, dpgaspar, etr2460 and graceguo-supercat October 31, 2020 03:44

mistercrunch added the risk:breaking-change Issues or PRs that will introduce breaking changes label Oct 31, 2020

john-bodley reviewed Oct 31, 2020

View reviewed changes

UPDATING.md Outdated Show resolved Hide resolved

superset/config.py Outdated Show resolved Hide resolved

superset/config.py Outdated Show resolved Hide resolved

superset/utils/cache.py Outdated Show resolved Hide resolved

superset/utils/cache.py Outdated Show resolved Hide resolved

ktmud force-pushed the data-cache branch 2 times, most recently from 45ce586 to c0eb415 Compare November 1, 2020 23:29

ktmud commented Nov 2, 2020

View reviewed changes

UPDATING.md Show resolved Hide resolved

ktmud force-pushed the data-cache branch from d37ae39 to b0a857f Compare November 2, 2020 18:51

ktmud commented Nov 2, 2020

View reviewed changes

superset/models/dashboard.py Show resolved Hide resolved

ktmud commented Nov 2, 2020

View reviewed changes

ktmud force-pushed the data-cache branch 4 times, most recently from 93729d6 to 9543b47 Compare November 9, 2020 05:46

ktmud force-pushed the data-cache branch 2 times, most recently from c2a46af to ca25463 Compare November 10, 2020 20:34

ktmud commented Nov 10, 2020

View reviewed changes

ktmud mentioned this pull request Nov 11, 2020

fix: database connection validation when creation #11653

Merged

6 tasks

graceguo-supercat approved these changes Nov 12, 2020

View reviewed changes

ktmud force-pushed the data-cache branch from ca25463 to 0b7d0a4 Compare November 12, 2020 17:24

villebro reviewed Nov 12, 2020

View reviewed changes

ktmud force-pushed the data-cache branch from 0b7d0a4 to a7c6842 Compare November 12, 2020 23:52

villebro approved these changes Nov 13, 2020

View reviewed changes

bkyryliuk approved these changes Nov 14, 2020

View reviewed changes

superset/utils/cache.py Outdated Show resolved Hide resolved

ktmud force-pushed the data-cache branch 2 times, most recently from 0811d14 to 60283e2 Compare November 14, 2020 05:48

ktmud added 4 commits November 13, 2020 21:48

feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG

a12d808

The corresponding cache will now also cache the query results.

Slice use DATA_CACHE_CONFIG CACHE_DEFAULT_TIMEOUT

b6d3b88

Add test for default cache timeout

36ca694

rename FAR_FUTURE to ONE_YEAR_IN_SECS

226fbe9

ktmud force-pushed the data-cache branch from 60283e2 to 226fbe9 Compare November 14, 2020 05:49

ktmud merged commit 4cfcaeb into apache:master Nov 14, 2020

ktmud deleted the data-cache branch November 14, 2020 06:35

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509

feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509

ktmud commented Oct 31, 2020 •

edited

Loading

codecov-io commented Oct 31, 2020 •

edited

Loading

ktmud Oct 31, 2020

ktmud Oct 31, 2020

ktmud Oct 31, 2020

ktmud Oct 31, 2020

ktmud Oct 31, 2020 •

edited

Loading

bkyryliuk commented Oct 31, 2020

ktmud commented Oct 31, 2020

ktmud Nov 2, 2020

ktmud commented Nov 10, 2020

ktmud commented Nov 10, 2020 •

edited

Loading

ktmud Nov 10, 2020

graceguo-supercat left a comment

villebro left a comment

villebro Nov 12, 2020

ktmud commented Nov 12, 2020

villebro commented Nov 13, 2020

villebro left a comment

bkyryliuk left a comment

feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509

feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509

Conversation

ktmud commented Oct 31, 2020 • edited Loading

SUMMARY

Motivation

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

ADDITIONAL INFORMATION

codecov-io commented Oct 31, 2020 • edited Loading

Codecov Report

ktmud Oct 31, 2020

Choose a reason for hiding this comment

ktmud Oct 31, 2020

Choose a reason for hiding this comment

ktmud Oct 31, 2020

Choose a reason for hiding this comment

ktmud Oct 31, 2020

Choose a reason for hiding this comment

ktmud Oct 31, 2020 • edited Loading

Choose a reason for hiding this comment

bkyryliuk commented Oct 31, 2020

ktmud commented Oct 31, 2020

ktmud Nov 2, 2020

Choose a reason for hiding this comment

ktmud commented Nov 10, 2020

ktmud commented Nov 10, 2020 • edited Loading

ktmud Nov 10, 2020

Choose a reason for hiding this comment

graceguo-supercat left a comment

Choose a reason for hiding this comment

villebro left a comment

Choose a reason for hiding this comment

villebro Nov 12, 2020

Choose a reason for hiding this comment

ktmud commented Nov 12, 2020

villebro commented Nov 13, 2020

villebro left a comment

Choose a reason for hiding this comment

bkyryliuk left a comment

Choose a reason for hiding this comment

ktmud commented Oct 31, 2020 •

edited

Loading

codecov-io commented Oct 31, 2020 •

edited

Loading

ktmud Oct 31, 2020 •

edited

Loading

ktmud commented Nov 10, 2020 •

edited

Loading