Flink: Add passthroughRecords option to DynamicIcebergSink #15433
sqd wants to merge 1 commit into apache:main
Conversation
```java
if (passthroughRecords) {
  if (!immediateUpdate) {
    throw new UnsupportedOperationException(
        "Immediate update must be enabled to pass through records");
  }
  rowDataDataStreamSink = converted.sinkTo(sink).uid(prefixIfNotNull(uidPrefix, "-sink"));
} else {
```
This will ignore DistributionMode and partitioning in DynamicRecord. I saw that you listed this in the docs, but I'm not sure we should diverge too much from the normal mode of operation. I think what we can do is add a new chained side output with an extra DynamicWriter for this quick path.
It may be worth adding a new DistributionMode. Currently NONE does a round-robin, which is slightly confusing; we could rename it to ROUND_ROBIN and use NONE for this direct path.
How do we handle DistributionMode in the normal sink?
We should be consistent.
DistributionMode.NONE in the regular sink does strict forward partitioning (no redistribution), which is similar to what the PR does. It leads to Flink chaining the input with the writer. For DynamicSink, because we have many tables, the idea was to spread out the data onto the available workers, which is why we opted for a round-robin across the workers chosen for the table.
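To make the two strategies concrete, here is a toy plain-Java sketch (class and method names are mine, not from the PR) contrasting the round-robin spread that DynamicSink currently does for NONE with the strict forward routing of the regular sink:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the two routing strategies discussed above.
class ChannelSelectors {
    // Round-robin: spread records across the writer subtasks chosen for a table.
    // This forces a network edge between the upstream operator and the writer.
    static class RoundRobin {
        private final AtomicInteger next = new AtomicInteger();
        private final int numChannels;

        RoundRobin(int numChannels) {
            this.numChannels = numChannels;
        }

        int select() {
            // floorMod keeps the index valid even after integer overflow
            return Math.floorMod(next.getAndIncrement(), numChannels);
        }
    }

    // Forward: keep the record on the current subtask. This is what lets Flink
    // chain the upstream operator with the writer and skip serde entirely.
    static int forward(int currentSubtask) {
        return currentSubtask;
    }
}
```

The forward variant is what enables operator chaining, which is the source of the throughput gain discussed below.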
> to add a new chained side output with an extra DynamicWriter

Could you elaborate on this, please? I could be wrong, but my understanding is that Flink cannot chain any operator with a side output. (Edit: I was wrong.)

> rename it to ROUND_ROBIN and use NONE

Yes, that sounds like a big improvement in clarity.
@sqd Could you share a bit more about your use case? Ignoring the distribution mode might work if your input records are already correctly distributed, but any mistake there will lead to small files or skewed writes: fast for the writers, but potentially very costly for the readers.
I think the idea here is valid, but we should implement this in a clean way, e.g. by adding a new DistributionMode.
Either you have very tricky balancing logic before the sink, or every table of yours is similarly loaded and continuously written. Not too dynamic IMHO 😄
I actually have some numbers! Before the change, the pipeline took around 1 to 1.5 TB of memory and around 200 cores. The change shaved off 50 to 70 cores (not to mention the increased throughput). Of course there are other computations going on as well, but Parquet writing and Flink RowData serdes showed up in the profiler as taking >90% of CPU combined, with serdes alone costing around 75% as much CPU as the actual Parquet writing.
My use case is that I have a firehose of data that I want to ingest into Iceberg. Because the volume is so high, it doesn't really matter which writer subtask a record is routed to; there won't be small files either way. I was running DistributionMode.NONE and noticed that serdes was taking up a ridiculous amount of resources and also caused a lot of unnecessary network shuffling.
I am a big fan of calling it ROUND_ROBIN instead, but are we not worried about breaking existing code? Maybe introduce ROUND_ROBIN as an alias for NONE, and call this new mode PASSTHROUGH or something?
Re: side output. I can definitely see the argument for not silently ignoring other distribution modes, but if the feature is disabled by default and we have extensive documentation, maybe it's not that big of a deal? Also, even if we add a side output and only enable the side-output switching behavior when this feature is toggled on, a similar argument can be made that we are silently ignoring DistributionMode.PASS_THROUGH (or whatever we call it) depending on the feature toggle.
@sqd: Given your use case, it makes perfect sense to skip the routing step. The javadoc for DistributionMode.NONE describes exactly the behavior we want here. If this were a greenfield project, adding a ROUND_ROBIN option would indeed be a good idea.
We could call this a bugfix, and make the change very obvious in the documentation.
@sqd I would prefer not to add a special flag, but to use a DistributionMode instead. A special flag would deviate from the current design of the dynamic sink and can create unexpected failures: e.g. you have the pass-through flag on and everything works fine, but then a record with DistributionMode.HASH arrives and the pipeline crashes.
@sqd All distribution modes should continue to work, even in the presence of a pass-through DistributionMode. That's why I'm against the flag. We need to modify the topology so that it supports both a direct path and the round-robin/shuffle case.
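A minimal sketch of the per-record decision this implies (the enum and class here are my own illustration, not the actual Iceberg API): each record's mode decides whether it takes the chained direct path, while every other mode keeps flowing through the shuffled edge as before.

```java
// Hypothetical stand-in for org.apache.iceberg.DistributionMode.
enum Mode { NONE, ROUND_ROBIN, HASH, RANGE }

class Router {
    /** Returns true when the record can take the chained, direct path. */
    static boolean isPassthrough(Mode mode) {
        // Only NONE skips redistribution entirely; HASH, RANGE, and
        // ROUND_ROBIN still go through the network edge, so mixing modes
        // in one stream keeps working instead of crashing.
        return mode == Mode.NONE;
    }
}
```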
@pvary +1 for treating this as a bugfix. We haven't documented the existing round-robin behavior of DistributionMode.NONE.
@mxm Sure. Just to confirm I understand what you have in mind:
Am I missing anything?
I think we don't even need another distribution mode. We can change the behavior of the existing DistributionMode.NONE.
The current round-robin behavior of NONE was never documented anyway.
You are right @aiborodin that all of this could be achieved manually by the user, but the current API of DynamicRecord allows users to specify the write parallelism, and it feels wrong to ignore it entirely when some static flag is set. I'm OK with ignoring it for an actual "none" DistributionMode, though.
@sqd Sounds good!
Force-pushed: ef6f483 to b7fe04c
@mxm, @sqd: I think we have a misunderstanding here. Is there a way to have duplicated instances of writers (some chained, some not chained) and route records out of the chained versions based on the mode? A side output seems like a possibility, but maybe complicating things so much is overkill.
Unless there is more to come after b7fe04c, I think there is indeed a misunderstanding. Let's recap criterion (1) from above:
I still see the flag in the above commit. We need to remove the flag and add a direct chained path, as an add-on to the existing topology.
Yes, this is what I suggested above, and I think we had agreement on it. Using a side output is the only option when we want both a chained and a non-chained variant. "Side output" is just a fancy word for adding another output; semantically, it is not different from the main output. Another option would be to multiplex via one output, but I think that makes things more complicated and harder to maintain.
The flag is internal and not exposed through the public builder. I use a flag so I don't have to copy the code of DynamicIcebergSink into a very similar DynamicIcebergForwardSink or something; instead we can reuse most of the logic. I.e., this line sets the flag to false to create a shuffled sink, and this line sets it to true to create a forward sink.
Yes that's what I am doing (unless I am still missing something?).
Sorry about the CI failure. I know what's going on but am a little swamped. I'll address it as soon as I can.
You are right, we missed that. This is indeed achieving what we discussed.
flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/sink/dynamic/HashKeyGenerator.java
Force-pushed: e5e1d07 to 646dc5e
Currently, DistributionMode.NONE actually performs a round-robin. This commit changes the behavior so that records tagged as NONE go to a passthrough side output to enable chaining. A new distribution mode ROUND_ROBIN is added, which behaves like NONE did before this change.
Force-pushed: 646dc5e to 95a28a2
Hi @mxm @pvary, I would like some guidance here if you don't mind. To make the CI pass, I had to make some changes to the Spark code. The new ROUND_ROBIN distribution mode (which is a core API enum) was causing some Spark tests to fail, because the tests iterate through all possible distribution modes on this line. I fixed the tests by treating ROUND_ROBIN as an alias for NONE in Spark, because the Spark connector doesn't seem to have a similar concept. I am pretty new to this project and not sure if that's acceptable.
When enabled, records are forwarded directly from the record generator to the writer using a forward edge instead of a hash edge. This allows Flink to chain the two operators, avoiding serialization/deserialization overhead and drastically increasing throughput in high-volume pipelines.
Current topology:

Same pipeline, with the new change enabled:

Serdes of Flink RowData can be very expensive:
