Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query result cache [experimental] #43797

Merged
merged 89 commits into from
Jan 23, 2023
Merged

Query result cache [experimental] #43797

merged 89 commits into from
Jan 23, 2023

Conversation

rschu1ze
Copy link
Member

@rschu1ze rschu1ze commented Nov 29, 2022

Query Result Cache [experimental]

Continuation of #34773. Also see #34011. Similar in spirit to caching with chproxy.

Query Result Cache [experimental]

The query result cache allows to compute SELECT queries just once and to serve further executions of the same query directly from the cache. Depending on the type of the queries, this can dramatically reduce latency and resource consumption of the ClickHouse server.

Background, Design and Limitations

Query result caches can generally be viewed as transactionally consistent or inconsistent.

  • In transactionally consistent caches, the database invalidates (discards) cached query results if the result of the SELECT query changes or potentially changes. In ClickHouse, operations which change the data include inserts/updates/deletes in/of/from tables or collapsing merges. Transactionally consistent caching is especially suitable for OLTP databases, for example MySQL (which removed query result cache after v8.0) and Oracle.
  • In transactionally inconsistent caches, slight inaccuracies in query results are accepted under the assumption that all cache entries are assigned a validity period after which they expire (e.g. 1 minute) and that the underlying data changes only little during this period. This approach is overall more suitable for OLAP databases. As an example where transactionally inconsistent caching is sufficient, consider an hourly sales report in a reporting tool which is simultaneously accessed by multiple users. Sales data changes typically slowly enough that the database only needs to compute the report once (represented by the first SELECT query). Further queries can be served directly from the query result cache. In this example, a reasonable validity period could be 30 min.

Transactionally inconsistent caching is traditionally provided by client tools or proxy packages interacting with the database. As a result, the same caching logic and configuration is often duplicated. With ClickHouse's query result cache, the caching logic moves to the serverside. This reduces maintenance effort and avoids redundancy.

Warning: The query result cache is an experimental feature that should not be used in production. There are known cases (e.g. in distributed query processing) where wrong results are returned.

Configuration Settings and Usage

Parameter "enable_experimental_query_result_cache" controls whether query results are inserted into / retrieved from the cache for the current query or session. For example, the first execution of query

SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS enable_experimental_query_result_cache = true;

stores the query result into the query result cache. Subsequent executions of the same query (also with parameter
enable_experimental_query_result_cache = true) will read the computed result directly from the cache.

Sometimes, it is desirable to use the query result cache only passively, i.e. to allow reading from it but not writing into it (if the cache result is not stored yet). Parameter "enable_experimental_query_result_cache_passive_usage" instead of "enable_experimental_query_result_cache" can be used for that.

For maximum control, it is generally recommended to provide settings "enable_experimental_query_result_cache" or
"enable_experimental_query_result_cache_passive_usage" only with specific queries. It is also possible to enable caching at user or profile level but one should keep in mind that all SELECT queries may return a cached results, including monitoring or debugging queries to system tables.

The query result cache can be cleared using statement SYSTEM DROP QUERY RESULT CACHE. The content of the query result cache is displayed in system table SYSTEM.QUERY_RESULT_CACHE. The number of query result cache hits and misses are shown as events "QueryResultCacheHits" and "QueryResultCacheMisses" in system table SYSTEM.EVENTS. Both counters are only updated for SELECT queries which run with settings "enable_experimental_query_result_cache = true" or "enable_experimental_query_result_cache_passive_usage = true". Other queries do not affect the cache miss counter.

The query result cache exists once per ClickHouse server process. However, cache results are by default not shared between users. This can be changed (see below) but doing so is not recommended for security reasons.

Query results are referenced in the query result cache by the Abstract Syntax Tree (AST) of their query. This means that caching is agnostic to upper/lowercase, for example SELECT 1 and select 1 are treated as the same query. To make the matching more natural, all query-level settings related to the query result cache are removed from the AST.

If the query was aborted due to an exception or user cancellation, no entry is written into the query result cache.

The size of the query result cache, the maximum number of cache entries and the maximum size of cache entries (in bytes and in records) can be configured using different server configuration options.

To define how long a query must run at least such that its result can be cached, you can use setting
"query_result_cache_min_query_duration". For example, the result of query

SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS enable_experimental_query_result_cache = true, query_result_cache_min_query_duration = 5000;

is only cached if the query runs longer than 5 seconds. It is also possible to specify how often a query needs to run until its result is cached - for that use setting "query_result_cache_min_query_runs".

Entries in the query result cache become stale after a certain time period (time-to-live). By default, this period is 60 seconds but a different value can be specified at session, profile or query level using setting "query_result_cache_ttl".

Also, results of queries with non-deterministic functions such as rand() and now() are not cached. This can be overruled using the setting "query_result_cache_store_results_of_queries_with_nondeterministic_functions".

Finally, entries in the query cache are not shared between users due to security reasons. For example, user A must not be able to bypass a row policy on a table by running the same query as another user B for whom no such policy exists. However, if necessary, cache entries can be marked accessible by other users (i.e. shared) by supplying setting
"query_result_cache_share_between_users".

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add experimental query result cache

Documentation entry for user-facing changes

  • [ X ] Documentation is written (mandatory for new features)

@rschu1ze rschu1ze mentioned this pull request Nov 29, 2022
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-feature Pull request with new product feature label Nov 29, 2022
@rschu1ze rschu1ze force-pushed the query-result-cache branch 8 times, most recently from 468c895 to af9f9f3 Compare December 1, 2022 22:12
@rschu1ze rschu1ze force-pushed the query-result-cache branch 2 times, most recently from a3b04d6 to ec49ff1 Compare December 7, 2022 21:43
@rschu1ze rschu1ze changed the title WIP Query result cache Query result cache Dec 12, 2022
@rschu1ze rschu1ze marked this pull request as ready for review December 12, 2022 10:37
@rschu1ze rschu1ze changed the title Query result cache Query result cache [experimental] Dec 12, 2022
@rschu1ze
Copy link
Member Author

rschu1ze commented Dec 12, 2022

This PR is RFR, if someone likes to take a look ...

I'll make a separate push where active/passive cache usage is globally enabled, to find out if something breaks.

EDIT: It's here - #44364

@rschu1ze rschu1ze mentioned this pull request Dec 12, 2022
22 tasks
@rschu1ze
Copy link
Member Author

@nickitat All your feedback was included except:

  • associate cache lookups instead of writes with a TTL: I like the idea but it will be a bigger refactoring (--> later/separate PR)
  • add the stage to the cache key: also makes sense, of course. The single-node case works fine (I know of no bugs) and for the distributed case, I added an extra bold warning into the docs that the QRC is experimental and not for productive use.

Since v23.1 approaches, I think this PR is in a state good enough to include it. So if you don't mind, perhaps you could take another brief look and (hopefully :-)) approve. Thanks!

programs/server/config.xml Show resolved Hide resolved
src/Processors/Sources/SourceFromChunks.cpp Show resolved Hide resolved
@nickitat
Copy link
Member

jackpot, Robert! all green

@cangyin
Copy link
Contributor

cangyin commented Jun 27, 2023

excuse me. I found that query cache is marked as production ready. so is this still a concern ?

Warning: The query result cache is an experimental feature that should not be used in production. There are known cases (e.g. in distributed query processing) where wrong results are returned.

@rschu1ze
Copy link
Member Author

It isn't a concern and the warning was removed from the docs as per #47977

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants