Replies: 6 comments 13 replies
-
We have to clarify the terminology. Feldera can do view maintenance at any granularity. There is no gap between streaming, micro-batch, and batch; they all work in exactly the same way. You decide when to feed inputs (accumulating batches) and when you want to see the output (letting Feldera accumulate batches). Feldera offers no particular benefit for ad-hoc queries that run only once. There is nothing special you need to do to implement the diagram you described. I will look at the paper you mention for details.
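To illustrate the point about granularity, here is a toy sketch (not the Feldera API; all names are illustrative) of view maintenance over weighted deltas, where the caller chooses when to push inputs and when to read output. Streaming (one record per step) and batch (many records per step) use the same code path and produce the same result:

```python
# Toy sketch of granularity-agnostic incremental view maintenance:
# the same engine serves streaming and batch; only the caller's
# choice of when to push deltas differs.
from collections import defaultdict

class IncrementalSum:
    """Maintains SELECT key, SUM(val) GROUP BY key over input deltas."""
    def __init__(self):
        self.totals = defaultdict(int)

    def push(self, delta):
        # delta: iterable of (key, val, weight); weight -1 retracts.
        for key, val, w in delta:
            self.totals[key] += val * w

    def output(self):
        return {k: v for k, v in self.totals.items() if v != 0}

# Streaming: one record per step.
streaming = IncrementalSum()
for rec in [("a", 3, 1), ("b", 5, 1), ("a", 2, 1)]:
    streaming.push([rec])

# Batch: the same records in a single step -- identical result.
batch = IncrementalSum()
batch.push([("a", 3, 1), ("b", 5, 1), ("a", 2, 1)])

assert streaming.output() == batch.output() == {"a": 5, "b": 5}
```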
-
In fact, Feldera is even more efficient with batched inputs.
-
Hi @Myasuka! Let me know if I understand your idea correctly:
Is this what you had in mind?
-
My next blog post, which will be published this week, touches on the topic you mention. It describes how the work performed by a DBSP circuit can be decoupled from the input and output deltas using buffers, although it covers only the buffering mechanism itself, not optimizations.

Thank you for the pointer to the paper on "incrementability". If I read it right, translating the paper's proposal into the language of DBSP would simply involve breaking a DBSP circuit with a single "logical clock" into multiple circuits separated by buffers, each with an independent clock, under the assumption that some parts of the plan benefit from larger batches. The clock of the output buffer can be driven by the user, i.e., the equivalent of the triggers in the paper.

The paper only uses the metric of work and does not discuss idle time. If the system has nothing else to do, processing smaller batches may be a good idea (if you don't pay by CPU usage). The more general problem requires an on-line algorithm, since you don't necessarily know when the user wants to inspect the data.

For choosing a good input batch size for one circuit, I suspect a simple control mechanism could work: try different sizes and see which one works best; it could probably use a simple regression model. It would be a fun experiment to reproduce the results in the paper using our system, and I suspect it wouldn't entail too much work. It could be a good intern/research project. So far we have done very little with regard to tuning.
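The "try different sizes and see which one works best" controller can be sketched in a few lines. The cost model below is a stand-in (a hypothetical fixed per-batch overhead plus per-record work), not a measurement of a real circuit:

```python
# Minimal sketch of a batch-size controller: evaluate the per-record
# cost at several candidate batch sizes and keep the cheapest.
def step_cost(batch_size, overhead=10.0, per_record=0.5):
    # Hypothetical cost of one circuit step over `batch_size` records.
    # In a real system this would be a measured timing, not a formula.
    return overhead + per_record * batch_size

def pick_batch_size(candidates):
    # Amortized cost per record = total step cost / records processed.
    return min(candidates, key=lambda n: step_cost(n) / n)

best = pick_batch_size([1, 10, 100, 1000])
# Under this model the per-batch overhead is amortized away, so the
# largest candidate wins; a real controller would also weigh latency.
assert best == 1000
```

A production version would measure actual step times online and could fit a simple regression (as suggested above) instead of an exhaustive scan.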
-
Sorry to join the discussion late; I'll share some info from my side (use cases, problems, and potential solutions -- some use cases have been simplified so that I can discuss them in public). The picture shows three cases; I'll describe them separately below.

**The first case**

The first case is the common warehouse scenario. The user ingests data into a Hive/data-lake table and then uses Spark to process it.

Problem:
User's Desire:
Potential Solution:
**The second case**

The second case is generally used in recommendation scenarios with high real-time requirements. It can be considered the real-time pipeline of the first case. However, there are also some problems.

Issues:
User's Desire:
Possible Solutions: none for now. If possible, the user hopes to process the first and second cases in a single pipeline.

**The third case**

The third use case is generally an online service that consumes upstream data, joins external dimension tables, and sends the processed results downstream.

Issues:
User's Desire:
Possible Solutions:
**Other questions**

*Streaming-streaming join with retraction.* IIUC the delta join is more efficient in batch mode but has little effect in a streaming-streaming join; is this understanding right?

*DBSP and Tempura.* Regarding the mentioned Tempura, my understanding is that DBSP can cover what Tempura does. I'll briefly explain my understanding:
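For context on the delta-join question above, here is a small Z-set sketch (illustrative names, not any engine's API). Because join is bilinear, each step's output delta is `d(A ⋈ B) = dA ⋈ B + A ⋈ dB + dA ⋈ dB`, so only deltas are joined against the accumulated state, never the full inputs against each other:

```python
# Z-set delta join: inputs are Counters mapping (key, payload) -> weight,
# where negative weights represent retractions.
from collections import defaultdict, Counter

def join(a, b):
    out = Counter()
    by_key = defaultdict(list)
    for (k, pb), wb in b.items():
        by_key[k].append((pb, wb))
    for (k, pa), wa in a.items():
        for pb, wb in by_key[k]:
            out[(k, pa, pb)] += wa * wb
    return out

A, B = Counter(), Counter()          # accumulated inputs so far
dA = Counter({(1, "a1"): 1, (2, "a2"): 1})  # this step's deltas
dB = Counter({(1, "b1"): 1})

# Output delta for this step: dA ⋈ B + A ⋈ dB + dA ⋈ dB.
# Counter.update adds weights and, unlike Counter's `+`, keeps
# negative entries, which retractions require.
d_out = Counter()
for part in (join(dA, B), join(A, dB), join(dA, dB)):
    d_out.update(part)
A.update(dA)
B.update(dB)

assert d_out == Counter({(1, "a1", "b1"): 1})
```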
-
Thank you for sharing the use cases; these are very interesting. We should also make a distinction between the DBSP theory and the DBSP Rust library. The theory does not care about things like scale-out or fault tolerance, whereas the implementation clearly does. The Rust library is also continuously evolving, so we have to distinguish what it does today from what it can do in the future, and the SQL compiler is evolving as well. I am saying this because whether DBSP Rust/Feldera can implement these use cases may also depend on the nature of the data (size, frequency, etc.), the nature of the computation (Spark and Flink can each express algorithms the other cannot), and the maturity of our implementation.
-
Currently, Feldera focuses on the streaming area, acting as a continuous query engine. For the BI area, however, people don't rely on microsecond-level or even minute-level latency to make decisions; that's why T+1 batch-style ETL is a better fit and has a much larger market. If we can make batch processing generate fresher output without high cost, I think that could make a real difference to the world.
And for the cost analysis of query execution, we can use the model in the SIGMOD '20 paper "Thrifty Query Execution via Incrementability". Long-running streaming engines have to pay a high cost for retraction operations.
Since incremental view maintenance (IVM) is very well suited to batch processing, and DBSP already supports making any materialized view incremental, from my point of view this is a great chance to make the change.
If we refer to Snowflake's Dynamic Tables product, I think we can change the architecture to the following:
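The dynamic-tables-style architecture suggested above can be sketched as a buffer plus a refresh schedule: ingestion only appends deltas, and the incremental view advances one step per refresh, amortizing work over everything buffered since the last refresh. All names here (`BufferedView`, `apply_sum`) are illustrative, not a real API:

```python
# Sketch: decouple ingestion from view maintenance with a buffer,
# so the refresh cadence (Snowflake calls it "target lag") controls
# the batch size, independent of how deltas arrive.
class BufferedView:
    def __init__(self, apply_delta):
        self.pending = []        # deltas waiting for the next refresh
        self.state = {}          # materialized view contents
        self.apply_delta = apply_delta

    def ingest(self, delta):
        # Cheap: just buffer; no view work happens on ingest.
        self.pending.append(delta)

    def refresh(self):
        # One incremental step over everything buffered since the last
        # refresh -- a longer refresh interval means larger batches.
        for delta in self.pending:
            self.apply_delta(self.state, delta)
        self.pending.clear()
        return self.state

def apply_sum(state, delta):
    # Example view logic: running sum per key.
    for key, val in delta:
        state[key] = state.get(key, 0) + val

view = BufferedView(apply_sum)
view.ingest([("clicks", 3)])
view.ingest([("clicks", 2), ("views", 7)])
assert view.refresh() == {"clicks": 5, "views": 7}
```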