Cache: Query result cache #4521

Closed
BohuTANG opened this issue Mar 21, 2022 · 6 comments

@BohuTANG
Member

BohuTANG commented Mar 21, 2022

Summary

Background

For a read, the main flow is:

  1. Get the source plan: the file names (partition files).
  2. Read those files by name from object storage (such as AWS S3).

With a query result cache, the flow becomes:

  • Step 1. Parse the query and calculate its fingerprint: query_id
  • Step 2. Get the source plan (read_plan) and calculate its fingerprint: source_plan_id
  • Step 3. Check the cache
    • 3.1 If the cache entry /query_id/source_plan_id/result exists, get it and return the result.
    • 3.2 If the cache entry does not exist, run the query and put the result into the cache.
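
The lookup in Step 3 could be sketched roughly as follows. This is only an illustration under assumptions: the two fingerprints are taken as given, a plain dict stands in for the object-storage cache, and the names result_path, read_with_cache, and execute are hypothetical, not part of any existing API.

```python
from typing import Callable, Dict, List

Rows = List[tuple]


def result_path(query_id: str, source_plan_id: str) -> str:
    # Cache layout from the proposal: /query_id/source_plan_id/result
    return f"/{query_id}/{source_plan_id}/result"


def read_with_cache(
    query_id: str,
    source_plan_id: str,
    execute: Callable[[], Rows],
    cache: Dict[str, Rows],
) -> Rows:
    """Return the cached result if present; otherwise run the query and cache it.

    `cache` is an in-memory stand-in for object storage (an assumption),
    and `execute` is a hypothetical callback that actually runs the query.
    """
    path = result_path(query_id, source_plan_id)
    if path in cache:
        # Step 3.1: cache hit, return the stored result as a whole.
        return cache[path]
    # Step 3.2: cache miss, run the query and store the complete result.
    rows = execute()
    cache[path] = rows
    return rows
```

Because the source plan fingerprint is part of the path, a change in the underlying partition files leads to a different path, so a stale result is simply never looked up again.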

Where the cache is stored

The cache is stored in S3 under the path /<bucket>/<tenant>/result/cache/, and the user can download it.

How to calculate the fingerprint

  • query_id: does it need to be based on the AST, so that select * from t1 where a>1 has the same fingerprint as select * from t1 where a>1 and 1=1?
  • source_plan_id is based on the partition file names and the file offsets.
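
As a rough sketch of the two fingerprints, assuming the answer to the AST question is "yes": the code below uses a toy text-level normalization (collapse whitespace, drop a trailing "and 1=1") as a stand-in for real AST canonicalization, which it is not. The function names, the SHA-256 choice, and the (file name, offset) tuple shape are all assumptions made for illustration.

```python
import hashlib
import re
from typing import Iterable, Tuple


def _digest(text: str) -> str:
    # Short hex digest; SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def query_fingerprint(sql: str) -> str:
    """Toy query_id: a text-level stand-in for an AST-based fingerprint.

    A real implementation would canonicalize the parsed AST. This version
    only collapses whitespace and strips the tautology "and 1=1", and it
    would mangle whitespace inside string literals.
    """
    normalized = " ".join(sql.strip().rstrip(";").split())
    normalized = re.sub(
        r"\s+and\s+1\s*=\s*1\b", "", normalized, flags=re.IGNORECASE
    )
    return _digest(normalized)


def source_plan_fingerprint(parts: Iterable[Tuple[str, int]]) -> str:
    """Toy source_plan_id over (partition file name, file offset) pairs.

    Sorting makes the fingerprint independent of the order in which the
    read plan lists its files.
    """
    canonical = "\n".join(f"{name}:{offset}" for name, offset in sorted(parts))
    return _digest(canonical)
```

With this normalization the two example queries above map to the same query_id, while any change to a partition file name or offset yields a different source_plan_id.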
@BohuTANG BohuTANG added the C-feature Category: feature label Mar 21, 2022
@youngsofun
Member

/assignme

@flaneur2020
Member

cc @Chasen-Zhang this may help with the issues we talked about yesterday

@BohuTANG
Member Author

cc @drmingdrmer

@drmingdrmer
Member

The requirements for the query result cache and for the data block cache are different:

  • The result cache must be complete, i.e., caching only part of a result is not allowed. The entire result data is added and removed as a whole.
  • The data block cache prefers partial caching: unused data should not be cached.

AFAIK, these two goals may conflict with each other:

  • Result cache entries may be evicted when too many blocks are cached.
  • The result cache would be better served by a least-recently-added eviction policy (a result is kept for a certain time no matter how often it is read), while the block cache would be better served by a least-recently-used policy.
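
To make the difference between the two policies concrete, here is a minimal sketch, not the actual implementation: ResultCache drops whole entries a fixed time after they were added, regardless of reads, while BlockCache evicts the least recently used entry once it is over capacity. Both class names, the ttl_seconds and capacity parameters, and the injectable clock are assumptions for illustration.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class ResultCache:
    """Least-recently-added: an entry lives ttl_seconds after insertion."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (added_at, value), oldest first

    def put(self, key: str, value: Any) -> None:
        # Re-adding a key moves it to the end, keeping insertion-time order.
        self._entries.pop(key, None)
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        self._expire()
        entry = self._entries.get(key)
        # Reads do not refresh the entry's age.
        return None if entry is None else entry[1]

    def _expire(self) -> None:
        now = self._clock()
        while self._entries:
            key, (added_at, _) = next(iter(self._entries.items()))
            if now - added_at < self._ttl:
                break
            del self._entries[key]  # the whole result is removed at once


class BlockCache:
    """Least-recently-used: reads refresh an entry; the coldest is evicted."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._entries = OrderedDict()

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)

    def get(self, key: str) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]
```

Keeping the two caches as separate objects with separate budgets, as in this sketch, is one way to avoid block cache pressure evicting results.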

@Xuanwo
Member

Xuanwo commented Aug 15, 2022

This issue has been moved to v0.9

@Xuanwo Xuanwo removed this from the v0.8 milestone Aug 15, 2022
@BohuTANG
Member Author

The result cache is finished but not used yet, so let's close this.
