[SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming #25618
Conversation
By reading the doc without super deep understanding I've found this in the caveats section:
The …
Before reviewing the design, I may need to say: you are encouraged to at least mention it if you borrow code from somewhere, so that we can be sure there's no license issue; and even if there's no license issue, at least they get credit. https://github.com/apache/spark/pull/25618/files#diff-c1e1dbc4a986c69ef54e1eebe880d4e9
Just skimmed the design doc (need to take a deeper look at fault tolerance) and it's basically the known approach Flink is using (2PC). Please mention what you were inspired by, for the same reason: credit.

I planned to propose something similar before (last year; I never proposed the design itself). More precisely, I asked to support 2PC at the DSv2 API level, as Spark doesn't support 2PC natively, but the feedback wasn't positive since it would be a very invasive change to the Spark codebase. There have been more cases asking for exactly-once write, and I guess the common answer was leveraging intermediate output. While some storages can leverage that (e.g. RDBMS: writers write to a temp table, the driver copies the rows the writers reported into the output table), it doesn't make sense for Kafka, at least for performance reasons, as there's no way to let Kafka copy its records from topic A to topic B (right?), so I gave up.

If the code change implements 2PC correctly, I guess it would work in many cases in general, though as explained, a transaction timeout leads to data loss. I raised the issue of transaction timeouts when I designed this, and it was one of the major concerns as well. Whatever the producer writes must be committed within the timeout under any kind of failure, otherwise "data loss" happens. Even if we decide to invalidate that batch and rerun it, we're then "at-least-once", since some partitions have already committed successfully. (I'm wondering whether Flink's Kafka producer with 2PC has a similar issue or whether they have some safeguard.)
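To make the trade-off concrete, here is a minimal standalone sketch (not taken from the PR) of the transactional write flow being discussed, using the plain Kafka producer API from Scala; the broker address, topic and transactional.id are placeholder values. The comment on transaction.timeout.ms marks where the data-loss concern above comes in: if the final commit does not happen within the timeout, the broker aborts the transaction and the written records are discarded from a read_committed reader's point of view.

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    object TransactionalWriteSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
        // One transactional.id per writer; this value is just a placeholder.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "query-1-partition-0")
        // If commitTransaction() is not reached within this timeout (e.g. the job
        // dies between writing and committing), the broker aborts the transaction
        // and read_committed consumers never see the records: the "data loss" case.
        props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "900000")

        val producer = new KafkaProducer[String, String](props)
        producer.initTransactions()
        producer.beginTransaction()   // phase 1: write records inside a transaction
        producer.send(new ProducerRecord[String, String]("output-topic", "k", "v"))
        producer.flush()              // records are on the broker, still uncommitted
        producer.commitTransaction()  // phase 2: make them visible to read_committed readers
        producer.close()
      }
    }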
    extends KafkaRowWriter(inputSchema, targetTopic) with DataWriter[InternalRow] {

      private lazy val producer = {
        val kafkaProducer = CachedKafkaProducer.getOrCreate(producerParams)
Spark leverages the fact that the Kafka producer is thread-safe. You may need to check whether that still holds for a transactional producer. (My instinct says it may not; otherwise you'd be dealing with 2PC for multiple partitions with the same producer id in the same executor. Sounds weird.)
I have considered a Kafka producer per executor. But there would be data loss when multiple tasks share one transaction and the transaction is aborted because some task failed and retried on another executor.

So to avoid creating too many producers, tasks reuse the created producer, and the config producer.create.factor limits the total number of producers in abnormal scenarios, such as long-tail tasks.
What I meant is that Spark shares a Kafka producer across multiple threads in an executor once producerParams is the same. So what you considered is exactly what Spark is doing now (that was my point). According to your explanation, the caching logic should be changed to restrict multi-threaded usage.
I think the caching logic is OK: we can control producer creation per task, and also handle failover with transactional.id in producerParams.

A transactional producer is not thread-safe, so what I do is one producer per task within one micro-batch; in the next batch the created producer is reused instead of recreated, since the transaction completes within every micro-batch. With producerParams, transactional.id is different between tasks within one micro-batch, but the same across micro-batches.

And if the task count per executor is the same in every micro-batch, no more producers are created after the first micro-batch.
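A minimal sketch of the scheme described above (the names, such as TaskProducerCache and the transactional.id pattern, are hypothetical and may not match the PR's actual classes): producers are cached per transactional.id, derived from the query and partition, so the same producer is reused across micro-batches but never shared between concurrently running tasks.

    import java.util.Properties
    import java.util.concurrent.ConcurrentHashMap

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

    // Hypothetical per-executor cache: one producer per transactional.id, so a
    // producer is never shared by two concurrently running tasks, while the task
    // writing the same partition in the next micro-batch gets the same producer
    // back instead of creating a new one.
    object TaskProducerCache {
      private val cache =
        new ConcurrentHashMap[String, KafkaProducer[Array[Byte], Array[Byte]]]()

      // Stable across micro-batches for the same query/partition, distinct
      // between tasks within one micro-batch.
      def transactionalId(queryId: String, partitionId: Int): String =
        s"spark-$queryId-$partitionId"

      // baseParams is assumed to already contain bootstrap servers and serializers.
      def getOrCreate(
          baseParams: Properties,
          queryId: String,
          partitionId: Int): KafkaProducer[Array[Byte], Array[Byte]] = {
        val txId = transactionalId(queryId, partitionId)
        cache.computeIfAbsent(txId, _ => {
          val props = new Properties()
          props.putAll(baseParams)
          props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, txId)
          val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
          producer.initTransactions()
          producer
        })
      }
    }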
@gaborgsomogyi @HeartSaVioR Thanks for your reply about the config …
@HeartSaVioR Thanks for your advice.
Spark doesn't have 2PC semantics natively, as you've seen in the DSv2 API. If I understand correctly, Spark's HDFS sink doesn't leverage 2PC either.

Previously it used a temporary directory: all tasks write to that directory, and the driver moves the directory to the final destination only when all tasks succeed. It leverages the fact that "rename" is atomic, so it didn't support "exactly-once" if the underlying filesystem doesn't support atomic rename.

Now it leverages metadata: all tasks write files and pass the list of written file paths back to the driver. When the driver has received the file lists from all tasks, it writes the overall list of files to the metadata. So exactly-once for HDFS is only guaranteed when the output is read by "Spark", which is aware of the metadata.
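A rough illustration of that metadata-based commit (this is a simplified stand-in, not Spark's actual FileStreamSink code, and the paths are placeholders): tasks write their own files and report paths, the driver only publishes the file list once all tasks succeeded, and only readers that consult the list get exactly-once output.

    import java.nio.file.{Files, Paths, StandardOpenOption}

    import scala.collection.JavaConverters._

    object MetadataCommitSketch {
      // Each task writes its own file and only reports the path back to the driver.
      def writeTask(batchId: Long, partitionId: Int, rows: Seq[String]): String = {
        val path = Paths.get(s"/tmp/output/part-$batchId-$partitionId")
        Files.createDirectories(path.getParent)
        Files.write(path, rows.asJava)
        path.toString
      }

      // The driver "commits" the batch by recording the complete list of files.
      // A reader that consults this metadata never sees files from uncommitted
      // batches, which is why exactly-once only holds for metadata-aware readers.
      def commitBatch(batchId: Long, files: Seq[String]): Unit = {
        val metadataLog = Paths.get(s"/tmp/output/_metadata/$batchId")
        Files.createDirectories(metadataLog.getParent)
        Files.write(metadataLog, files.asJava, StandardOpenOption.CREATE_NEW)
      }
    }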
Sorry for the late reply.
Well, someone could call it 2PC since the behavior is similar, but generally 2PC assumes a coordinator and participants. In the second phase, the coordinator "asks" the participants to commit/abort; the participants don't commit/abort directly on their own right after the first phase. Based on that, the driver should request tasks to commit their outputs, but Spark doesn't provide such a flow. So this is a pretty simplified version of 2PC, and also pretty limited. I think the point is whether we are OK with having exactly-once with some restrictions end users need to be aware of. Could you please initiate a discussion on this on the Spark dev mailing list? It would be good to hear others' voices.
+1 on having a discussion about that. My perspective is clear: having such a limitation is a bit too much in this scenario, so I'm not feeling comfortable with it.
In the meantime I'm speaking with Gyula from the Flink side to understand things more deeply...
@HeartSaVioR @gaborgsomogyi Thanks for your advice. I have started the discussion on the dev mailing list and am looking forward to your input.
So we've sat together with the Flink guys and had a deeper look at this area. Basically they have a similar solution: data is sent to Kafka with … I have mainly 2 concerns with this PR: …
I'm not an expert on Kafka (specifically on how transactions work in Kafka), but given that Kafka still writes "sequentially" to the topic and needs to provide records "in order", I don't think Kafka will allow multiple transactions to write to the topic concurrently. (Say, I expect a "topic lock", in RDBMS terms.) If I assume correctly, turning off the transaction timeout may leave the topic unwritable forever, once you lose the producer/transaction id and are unable to restore the transaction. I feel "no timeout" seems unrealistic; that's the thing we have to live with in a distributed system.

So there aren't many options beyond the known two: 1) "data loss" after the transaction timeout, 2) falling back to "at-least-once" after the transaction timeout. Even that assumes the 2PC logic is properly coupled with SS checkpointing with respect to fault tolerance. The majority of other options would give up parallelism and let one writer write all outputs, so I can't imagine a better option for now.
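As a point of reference on the timeout discussion (this is a generic Kafka configuration sketch, not code from the PR): the timeout in question is the producer's transaction.timeout.ms, and the broker enforces an upper bound with transaction.max.timeout.ms, so the timeout cannot simply be disabled.

    import java.util.Properties

    import org.apache.kafka.clients.producer.ProducerConfig

    object TimeoutConfigSketch {
      // Producer side: how long the broker lets this producer's transaction stay
      // open before aborting it (default 60s). Raising it only delays the abort.
      val producerProps = new Properties()
      producerProps.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "900000") // 15 min

      // Broker side (server.properties) caps what any producer may request:
      //   transaction.max.timeout.ms=900000
      // Asking for more than the cap is rejected when the producer initializes
      // transactions, so the timeout cannot effectively be turned off.
    }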
I would like to hear the Kafka guys' opinion before we say these are the only options.
You might want to know that the Kafka transactional producer is designed for Kafka Streams (explained below), so I'm wondering whether they've considered other cases. https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/ This article clearly describes that fact; search for the sentence …

Streaming frameworks have to guarantee transactional writes among 1. writing outputs, 2. storing state, 3. storing offsets/commits (checkpoint) to properly support "exactly-once". As the article describes, this is easier for Kafka Streams to achieve, since for Kafka Streams 1-3 are all Kafka topics and the Kafka producer guarantees atomic writes to multiple Kafka topics (assuming readers read these topics as "read committed").

For other frameworks, there's no such guarantee, and the framework has to provide some mechanism to guarantee it or give up end-to-end exactly-once at some point. SS provides a mechanism for 2 and 3 to ensure stateful exactly-once, and also for 1, but only if the driver can transactionally commit the outputs from tasks. (The metadata in FileStreamSink is one such case: the metadata is leveraged to read outputs as "read committed".) A loosened contract for 1 is "idempotent write", which behaves similarly to exactly-once given that replays must happen, though the output is not transactional. The Kafka transactional producer is not that case, so it requires Spark itself to take care of the difference, or to accept limitations somewhere.
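For completeness, "read committed" here refers to the downstream consumer's isolation level; a minimal consumer-side sketch (placeholder broker, group and topic names) would look like this.

    import java.util.{Collections, Properties}

    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.serialization.StringDeserializer

    object ReadCommittedSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "downstream-app")
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
        // Only records from committed transactions are returned; records from
        // open or aborted transactions are filtered out.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(Collections.singletonList("output-topic"))
      }
    }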
Had a small chat with @viktorsomogyi from the Kafka team; he mentioned that Flink and maybe Hive are dealing with this issue too. There are some ugly hacks around, but since at least 3 tech areas are struggling with this, it would be good to create a real solution with an API.
@gaborgsomogyi @HeartSaVioR Thanks for your reply.

The first way is rejected since it leads to data loss. The second one is rejected since it is not consistent with exactly-once semantics. @gaborgsomogyi About a new Kafka API to resolve Kafka transactions in a distributed system: as @HeartSaVioR mentioned above, the Kafka producer transaction is not provided only for Kafka Streams, and a new API for Spark/Flink/Hive may need to be customized. So I also think we should adapt Spark/Flink/Hive to it.
I would add this feature to Spark only when no hacks like this are needed.
Sorry, you are understanding my comment in the opposite way. My claim was that the Kafka producer transaction is designed "for" Kafka Streams. Please take a careful look at my comment. https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics According to the design doc, the Kafka community took the approach of "transaction per task":

… with which they never need to worry about transactions across multiple connections/JVMs, unlike other streaming frameworks. Based on that information, I guess Kafka Streams leverages Kafka topics as shuffle storage and has multiple connected …
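To show why Kafka Streams can take the "transaction per task" shortcut, here is a minimal Kafka Streams application (placeholder names, not related to the PR) where exactly-once is a single config switch; that is only possible because everything it reads and writes lives in Kafka.

    import java.util.Properties

    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

    object StreamsEosSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-demo")
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)
        // One switch suffices because inputs, state changelogs and outputs are all
        // Kafka topics, written atomically by per-task producer transactions.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)

        val builder = new StreamsBuilder()
        builder.stream[String, String]("input-topic").to("output-topic")
        new KafkaStreams(builder.build(), props).start()
      }
    }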
Can one of the admins verify this patch? |
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
What changes were proposed in this pull request?
Implement Kafka sink exactly-once semantics with a transactional Kafka producer.
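A hedged sketch (hypothetical names, not the PR's actual implementation) of how a task-side writer can map the Structured Streaming write lifecycle onto a Kafka transaction; the producer is assumed to be transactional with initTransactions() already called once when it was created.

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Hypothetical shape of a task-side writer; the real PR plugs into Spark's
    // DataWriter API, and this only mirrors its open/write/commit/abort lifecycle.
    class TransactionalRowWriterSketch(
        producer: KafkaProducer[Array[Byte], Array[Byte]],
        topic: String) {

      // Called when the task starts: all records of this task and epoch go into
      // a single transaction.
      def open(): Unit = producer.beginTransaction()

      def write(key: Array[Byte], value: Array[Byte]): Unit =
        producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value))

      // Called when the task finishes successfully. Committing here, rather than
      // waiting for a driver-side second phase, is the "simplified 2PC" discussed
      // in the conversation above.
      def commit(): Unit = {
        producer.flush()
        producer.commitTransaction()
      }

      // Called on task failure; the uncommitted records stay invisible to
      // read_committed consumers.
      def abort(): Unit = producer.abortTransaction()
    }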
Why are the changes needed?
Does this PR introduce any user-facing change?
No
How was this patch tested?