[Feature Request] Enable ability to read stream from delta table without duplicates. #1490
Comments
I plan to create an MR for the first option later today.
Gentle ping, any comments on this?
Just to confirm, are you aware of the Delta Change Data Feed feature? It seems to me like you could use a CDF stream to accomplish your use case by just filtering down to versions with only "insert" rows (which would basically fulfill option (2) that you've outlined above).
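For reference, a minimal sketch of this CDF-based idea, assuming Spark Structured Streaming and a hypothetical table name `events` (it filters at the row level on `_change_type` rather than at the version level, so treat it as an approximation of the suggestion above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cdf-insert-only-stream").getOrCreate()

// Stream the table's Change Data Feed and keep only inserted rows,
// dropping the CDF metadata columns before downstream processing.
val inserts = spark.readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 0) // assumption: replay from the first version
  .table("events")              // hypothetical table name
  .filter(col("_change_type") === "insert")
  .drop("_change_type", "_commit_version", "_commit_timestamp")

inserts.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/events_inserts") // hypothetical path
  .start("/tmp/delta/events_inserts")                              // hypothetical path
```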
Yes, but it's not related.
While the proposed solutions give the ability:
(2) shouldn't be significantly different from CDF, by the way (and the others largely depend on the workload). Regardless, it sounds like the semantics of what you're proposing are unclear on multiple different edge cases (for example, logical blind appends vs. insert-only operations). We would need to clarify the exact semantics and also see if this is something others in the community would use. But first I think it would be helpful to see some performance numbers w.r.t. CDF to justify adding a new feature when it can be done with CDF.
I am not sure I will be able to do such a comparison on my volume of data, and on a small amount of data I am not sure the difference will be obvious.
I don't get your point. P.S.: As I mentioned in the issue description, I just want the ability to stream all these appends from the table and ignore any update/delete/overwrite commits, without any additional overhead.
If your use case is 99.9% blind appends, the overhead of CDF should actually be close to zero. The way CDF works, for operations like appends, no additional data files are written per commit.
For your workload, the above should not be true given how CDF is implemented. Given that CDF provides a working solution and should have little to no overhead for your use case, it doesn't make sense to add a whole new feature. If this is not the performance you are seeing with CDF enabled, then this is where some performance numbers would really help (and we would want to address why you are seeing significant slowdowns with CDF enabled). Happy to discuss in more depth how CDF works if you would like, and here's the design doc if you would like to take a look.
Yes, of course. Blind appends are cheap, and I don't even need CDF for them. They just work without any changes.
But the problems in that last 1% can be huge. For example, the last GDPR deletion I applied rewrote more than 2 million files and took 2 days on a 1k-node Spark cluster. In such situations, costs grow significantly and are unpredictable, and I don't want to enable CDF for these cases.
Can I disable CDF dataset generation for specific jobs/commits?
From the design doc that you mention, I found the 20% overhead very significant.
The numbers you are referring to from the design doc are comparing different approaches for writing the CDF data. The exact number you referred to is for the rejected approach to writing CDF. The correct overheads to consider are as follows:
So unless you have actually tested your workload and have concrete numbers showing that either the write or read overheads caused by enabling CDF are too high, I really think CDF is the right approach. Footnote: we are aware of the massive baseline cost of copy-on-write. That's why we are building Deletion Vector based write-optimized updates/deletes/merges.
OK, thanks for the detailed explanation. I will come back with concrete numbers later.
Hey, I think this was done in #1616 and should suffice for your use case. Closing this now; feel free to re-open if you see fit.
Feature request
Overview
Enable the ability to read a stream from a Delta table without duplicates.
Motivation
While there are many techniques to deal with duplicates over a stream, like dirty partitioning, SCD(2), or ingesting from CDC events, they bring costly overhead from a CPU/memory perspective, especially for big data workloads. But there are a lot of business use cases where updates/overwrites/deletes can be separated from appends of new data.
Mostly these are streams of facts, like IoT metrics or user activity, which are immutable; there are only a few cases when this data must be updated or deleted. As you may notice, this is a pretty rare event in comparison to near real-time data ingestion.
So if data engineers can consume a stream of append-only data and treat all updates as an exceptional case, they can achieve a significant cost reduction across their data pipelines.
It is obvious that separating append-only ingestion from reruns brings additional technical complexity to the system. But for a lot of data pipelines and businesses, keeping the cost per event low is worth that extra complexity.
P.S. Iceberg has a similar feature. It's not really fair to compare, because Iceberg is not a transaction log and did not support CDC before version 2.0. But still, I think Delta may support `append-only` streams as first-class citizens, and not only CDC.
Further details
Suggested technical implementation: add an `append-only` option to the delta.io stream source, implemented in one of two ways (a minimal sketch of the first option follows this list):
1. Use `isBlindAppend` from `CommitInfo` to filter out commits which are not `blindAppend`.
2. Filter commits by whether they only insert data, so that insert-only operations are also streamed.
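As a rough illustration of the first option, here is a minimal sketch, not the actual implementation, using Delta's internal Scala APIs (`DeltaLog.getChanges` and `CommitInfo.isBlindAppend`) to collect only the files added by blind-append commits; the function name and the wiring into the streaming source are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.actions.{Action, AddFile, CommitInfo}

// Sketch: walk the commit log from a given version and keep only the AddFile
// actions of commits whose CommitInfo is marked as a blind append.
def blindAppendFiles(spark: SparkSession, tablePath: String, startVersion: Long): Seq[AddFile] = {
  val deltaLog = DeltaLog.forTable(spark, tablePath)
  deltaLog
    .getChanges(startVersion)
    .flatMap { case (_, actions: Seq[Action]) =>
      val isBlindAppend = actions
        .collectFirst { case ci: CommitInfo => ci.isBlindAppend.getOrElse(false) }
        .getOrElse(false)
      if (isBlindAppend) actions.collect { case add: AddFile => add } else Nil
    }
    .toSeq
}
```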
The first approach is dead simple but will ignore some interesting cases, like a `MERGE` operation which only inserts data.
The second approach will handle `MERGE`, but from my point of view this can make the API less concise, because people may expect that when new data is added during an `OVERWRITE` operation, this new data will be available through the stream source. Also, it's not 100% clear what the behavior should be when a DELETE and then a MERGE with only inserts happen: this may lead to logical duplicates in some cases.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute to the implementation of this feature?