Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In a cloud-native architecture with completely stateless executors, every executor has to fetch its source data from remote storage. When the source data is very large, network throughput easily becomes the bottleneck and the fetch step takes too long. For example, a compute layer of 20 nodes with 10 Gb/s of bandwidth per node needs at least 4 s to fetch 100 GB of source data, while the remaining steps of processing that 100 GB often finish in less than 1 s. It would therefore be better to introduce a cache layer into the cloud-native architecture, making the executors weakly stateful so that they can cache hot data on local disk, as Snowflake does.
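As a rough sanity check of the 4 s figure, here is a minimal sketch of the lower-bound arithmetic. It assumes that "10Gb" means 10 gigabits per second per node, that every link is fully saturated, and that the data is spread evenly across nodes; the function name `min_fetch_seconds` is made up for illustration.

```python
def min_fetch_seconds(data_gbytes: float, nodes: int, node_gbit_per_s: float) -> float:
    """Lower bound on fetch time when every node's link is fully saturated."""
    # Aggregate bandwidth in gigabytes per second (8 bits per byte).
    aggregate_gbyte_per_s = nodes * node_gbit_per_s / 8
    return data_gbytes / aggregate_gbyte_per_s


# 20 nodes * 10 Gb/s = 200 Gb/s = 25 GB/s, so 100 GB takes at least 4 s.
print(min_fetch_seconds(100, 20, 10))  # 4.0
```

In practice the fetch would take longer than this bound, since object-store latency and uneven file sizes are ignored here.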
I am definitely interested in this feature, thanks for posting it. I came to exactly the same conclusion while testing on large datasets stored on S3: the approach suggested here is definitely the best. Bandwidth is the bottleneck, and using a hashing algorithm to send requests for the same data to the same executors, which cache it on local disk, is definitely the best solution.
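To make the routing idea concrete, here is a minimal sketch of one possible hashing scheme, a consistent hash ring with virtual nodes. This is an assumption about how such routing could look, not the design from the linked document; the names `HashRing`, `executor_for`, the executor IDs, and the S3 path are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only as a stable hash; the built-in hash() of str
    # changes between processes, which would break cache affinity.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps file paths to executors so the same file keeps hitting the same cache."""

    def __init__(self, executors, vnodes=100):
        if not executors:
            raise ValueError("at least one executor is required")
        # Each executor owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{ex}#{i}"), ex) for ex in executors for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def executor_for(self, file_path: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(file_path)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["executor-1", "executor-2", "executor-3"])
path = "s3://bucket/table/part-0001.parquet"
# The same path is always routed to the same executor.
assert ring.executor_for(path) == ring.executor_for(path)
```

The point of a ring over a plain `hash(path) % n` is that when an executor joins or leaves, only the files owned by that executor move, so most of the cached data on the other executors stays useful.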
Describe the solution you'd like
https://docs.google.com/document/d/1iMFv3S-TuiwBoTzp4KX0Ltrrenm86ULr0q_PwIKdW6g/edit?usp=sharing
To achieve this goal, we need to finish the following tasks:
Describe alternatives you've considered
Additional context