Replies: 6 comments 13 replies
-
We have to clarify the terminology. Feldera can do view maintenance at any granularity. There is no gap between streaming, micro-batch, and batch; they all work in exactly the same way. You decide when to feed inputs (accumulating batches) and when you want to see the output (letting Feldera accumulate batches). Feldera offers no particular benefit for ad-hoc queries that run only once. There is nothing special you need to do to implement the diagram you described. I will look at the paper you mention for details.
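To illustrate the point about granularity, here is a toy sketch (not the Feldera API; all names are illustrative) of view maintenance over weighted deltas, where the caller chooses when to push inputs and when to read output. Streaming (one record per step) and batch (many records per step) use the same code path and produce the same result:

```python
# Toy sketch of granularity-agnostic incremental view maintenance:
# the same engine serves streaming and batch; only the caller's
# choice of when to push deltas differs.
from collections import defaultdict

class IncrementalSum:
    """Maintains SELECT key, SUM(val) GROUP BY key over input deltas."""
    def __init__(self):
        self.totals = defaultdict(int)

    def push(self, delta):
        # delta: iterable of (key, val, weight); weight -1 retracts.
        for key, val, w in delta:
            self.totals[key] += val * w

    def output(self):
        return {k: v for k, v in self.totals.items() if v != 0}

# Streaming: one record per step.
streaming = IncrementalSum()
for rec in [("a", 3, 1), ("b", 5, 1), ("a", 2, 1)]:
    streaming.push([rec])

# Batch: the same records in a single step -- identical result.
batch = IncrementalSum()
batch.push([("a", 3, 1), ("b", 5, 1), ("a", 2, 1)])

assert streaming.output() == batch.output() == {"a": 5, "b": 5}
```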
-
In fact, Feldera is even more efficient with batched inputs.
-
Hi @Myasuka! Let me know if I understand your idea correctly:
Is this what you had in mind?
-
My next blog post, which will be published this week, touches on the topic you mention. It describes how the work performed by a DBSP circuit can be decoupled from the input and output deltas using buffers, although it covers only the buffering mechanism itself, not optimizations.

Thank you for the pointer to the paper on "incrementability". If I read it right, translating the paper's proposal into the language of DBSP would simply involve breaking a DBSP circuit with a single "logical clock" into multiple circuits separated by buffers, each with an independent clock, under the assumption that some parts of the plan benefit from larger batches. The clock of the output buffer can be driven by the user, i.e., the equivalent of the triggers in the paper.

The paper only uses the metric of work and does not discuss idle time. If the system has nothing else to do, processing smaller batches may be a good idea (if you don't pay by CPU usage). The more general problem requires an on-line algorithm, since you don't necessarily know when the user wants to inspect the data.

For choosing a good input batch size for one circuit, I suspect a simple control mechanism could work: try different sizes and see which one works best; it could probably use a simple regression model. It would be a fun experiment to reproduce the results in the paper using our system, and I suspect it wouldn't entail too much work. It could be a good intern/research project. So far we have done very little with regard to tuning.
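The "try different sizes and see which one works best" controller can be sketched in a few lines. The cost model below is a stand-in (a hypothetical fixed per-batch overhead plus per-record work), not a measurement of a real circuit:

```python
# Minimal sketch of a batch-size controller: evaluate the per-record
# cost at several candidate batch sizes and keep the cheapest.
def step_cost(batch_size, overhead=10.0, per_record=0.5):
    # Hypothetical cost of one circuit step over `batch_size` records.
    # In a real system this would be a measured timing, not a formula.
    return overhead + per_record * batch_size

def pick_batch_size(candidates):
    # Amortized cost per record = total step cost / records processed.
    return min(candidates, key=lambda n: step_cost(n) / n)

best = pick_batch_size([1, 10, 100, 1000])
# Under this model the per-batch overhead is amortized away, so the
# largest candidate wins; a real controller would also weigh latency.
assert best == 1000
```

A production version would measure actual step times online and could fit a simple regression (as suggested above) instead of an exhaustive scan.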
-
Sorry to join the discussion late; I'll share some info from my side (use cases, problems, and potential solutions -- some use cases have been simplified so that I can discuss them in public). The picture shows three cases; I'll describe them separately below.

**The first case**

The first case is the common warehouse scenario. The user ingests data into a Hive/data-lake table and then uses Spark to process it.

Problem:
User's Desire:
Potential Solution:
**The second case**

The second case is generally used in recommendation scenarios with high real-time requirements. It can be considered the real-time pipeline of the first case. However, there are also some problems.

Issues:
User's Desire:
Possible Solutions: none for now. If possible, the user hopes to process the first and second cases in a single pipeline.

**The third case**

The third use case is generally an online service that consumes upstream data, joins external dimension tables, and sends the processed results downstream.

Issues:
User's Desire:
Possible Solutions:
**Other questions**

*Streaming-streaming join with retraction.* IIUC the delta join is more efficient in batch mode but has little effect in a streaming-streaming join; is this understanding right?

*DBSP and Tempura.* Regarding the mentioned Tempura, my understanding is that DBSP can cover what Tempura does. I'll briefly explain my understanding:
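For context on the delta-join question above, here is a small Z-set sketch (illustrative names, not any engine's API). Because join is bilinear, each step's output delta is `d(A ⋈ B) = dA ⋈ B + A ⋈ dB + dA ⋈ dB`, so only deltas are joined against the accumulated state, never the full inputs against each other:

```python
# Z-set delta join: inputs are Counters mapping (key, payload) -> weight,
# where negative weights represent retractions.
from collections import defaultdict, Counter

def join(a, b):
    out = Counter()
    by_key = defaultdict(list)
    for (k, pb), wb in b.items():
        by_key[k].append((pb, wb))
    for (k, pa), wa in a.items():
        for pb, wb in by_key[k]:
            out[(k, pa, pb)] += wa * wb
    return out

A, B = Counter(), Counter()          # accumulated inputs so far
dA = Counter({(1, "a1"): 1, (2, "a2"): 1})  # this step's deltas
dB = Counter({(1, "b1"): 1})

# Output delta for this step: dA ⋈ B + A ⋈ dB + dA ⋈ dB.
# Counter.update adds weights and, unlike Counter's `+`, keeps
# negative entries, which retractions require.
d_out = Counter()
for part in (join(dA, B), join(A, dB), join(dA, dB)):
    d_out.update(part)
A.update(dA)
B.update(dB)

assert d_out == Counter({(1, "a1", "b1"): 1})
```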
-
Thank you for sharing the use cases; these are very interesting. We should also make a distinction between the DBSP theory and the DBSP Rust library. The theory does not care about things like scale-out or fault tolerance, whereas the implementation clearly does. The Rust library is also continuously evolving, so we have to distinguish what it does today from what it can do in the future, and the SQL compiler is evolving as well. I am saying this because whether DBSP Rust/Feldera can implement these use cases may also depend on the nature of the data (size, frequency, etc.), the nature of the computation (Spark and Flink can each express algorithms the other cannot), and the maturity of our implementation.
-
Currently, Feldera focuses on the streaming area, acting as a continuous query engine. For the BI area, however, people don't rely on microsecond-level or even minute-level latency to make decisions; that's why T+1 batch-style ETL is a better fit and has a much larger market. If we can make batch processing generate fresher output without high cost, I think that could make a real difference to the world.
And for the cost analysis of query execution, we can use the model in the SIGMOD '20 paper "Thrifty Query Execution via Incrementability". Long-running streaming engines have to pay a high cost for retraction operations.
Since incremental view maintenance (IVM) is very well suited to batch processing, and DBSP already supports making any materialized view incremental, from my point of view this is a great chance to make the change.
If we refer to Snowflake's Dynamic Tables product, I think we can change the architecture to the following:
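The dynamic-tables-style architecture suggested above can be sketched as a buffer plus a refresh schedule: ingestion only appends deltas, and the incremental view advances one step per refresh, amortizing work over everything buffered since the last refresh. All names here (`BufferedView`, `apply_sum`) are illustrative, not a real API:

```python
# Sketch: decouple ingestion from view maintenance with a buffer,
# so the refresh cadence (Snowflake calls it "target lag") controls
# the batch size, independent of how deltas arrive.
class BufferedView:
    def __init__(self, apply_delta):
        self.pending = []        # deltas waiting for the next refresh
        self.state = {}          # materialized view contents
        self.apply_delta = apply_delta

    def ingest(self, delta):
        # Cheap: just buffer; no view work happens on ingest.
        self.pending.append(delta)

    def refresh(self):
        # One incremental step over everything buffered since the last
        # refresh -- a longer refresh interval means larger batches.
        for delta in self.pending:
            self.apply_delta(self.state, delta)
        self.pending.clear()
        return self.state

def apply_sum(state, delta):
    # Example view logic: running sum per key.
    for key, val in delta:
        state[key] = state.get(key, 0) + val

view = BufferedView(apply_sum)
view.ingest([("clicks", 3)])
view.ingest([("clicks", 2), ("views", 7)])
assert view.refresh() == {"clicks": 5, "views": 7}
```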