
Conversation

@QuentinAmbard

QuentinAmbard commented Jul 30, 2018

What changes were proposed in this pull request?

Update with a better fix:
With this fix, the offsets are scanned to determine the ranges. We ensure that when we determine the range [fromOffset, untilOffset), untilOffset is always an offset holding an existing record that we have been able to fetch at least once.
This logic is applied as soon as allowNonConsecutiveOffsets is enabled.
Since we scan all the records a first time, we use this pass to count them: OffsetRange now contains the number of records in each partition, so rdd.count() becomes a free operation.

The read_uncommitted isolation level remains unsafe: untilOffset might become "empty" if the transaction is aborted just after the offset range is created. The same thing could happen if the record at untilOffset gets compacted (this is also a potential issue before this change).
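
As a rough illustration of the scanning idea (a simplified sketch using the plain consumer API, not the PR's actual code; the name scanRange is made up for this example, and the consumer is assumed to already be assigned to the partition):

import java.time.Duration
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.Consumer
import org.apache.kafka.common.TopicPartition

// Scan forward from fromOffset, counting the records that actually exist, and
// stop at the last offset for which a record was fetched. The returned
// untilOffset is exclusive and always points just past a record we have seen.
def scanRange[K, V](
    consumer: Consumer[K, V],
    tp: TopicPartition,
    fromOffset: Long,
    maxUntilOffset: Long,
    pollTimeoutMs: Long): (Long, Long) = {
  consumer.seek(tp, fromOffset)
  var count = 0L
  var lastSeen = fromOffset - 1
  var done = false
  while (!done) {
    val polled = consumer.poll(Duration.ofMillis(pollTimeoutMs)).records(tp).asScala
      .filter(r => r.offset() >= fromOffset && r.offset() < maxUntilOffset)
    if (polled.isEmpty) {
      done = true                       // nothing more we can fetch right now
    } else {
      count += polled.size
      lastSeen = polled.last.offset()
      if (lastSeen >= maxUntilOffset - 1) done = true
    }
  }
  (lastSeen + 1, count)                 // (untilOffset, record count)
}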

How was this patch tested?

Unit tests for the offset scan. No integration test for transactions since the Kafka version currently used doesn't support them. Also tested against a custom streaming use case.

@holdensmagicalunicorn

@QuentinAmbard, thanks! I am a bot who has found some folks who might be able to help with the review: @tdas, @zsxwing and @koeninger

@koeninger
Contributor

jenkins, ok to test

@SparkQA

SparkQA commented Jul 30, 2018

Test build #93803 has finished for PR 21917 at commit 05c7e7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94055 has finished for PR 21917 at commit 70ecd38.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94056 has finished for PR 21917 at commit 29c5406.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94058 has finished for PR 21917 at commit 69582f4.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    val offsetAndCount = localRw.getLastOffsetAndCount(localOffsets(tp), tp, o)
    (tp, offsetAndCount)
  }
}).collect()
Contributor

What exactly is the benefit gained by doing a duplicate read of all the messages?

@QuentinAmbard
Author

QuentinAmbard commented Aug 3, 2018

With this solution we don't read the data another time "just to support transactions".
The current implementation for compacted topics already reads all the messages twice in order to get a correct count for the input info tracker: val inputInfo = StreamInputInfo(id, rdd.count, metadata).
Since RDDs over compacted topics, or topics with "empty" offsets due to transaction markers/aborts, can't infer the count from the offsets alone, the only way to get the exact count is to read the data.
The same logic applies to selecting a range: if you want to get N records per partition, the only way to know which untilOffset will make you read N records is to read, stop once you've read N records, and take that offset.
So one advantage is being able to fetch exactly the number of records per partition you really want (for compacted topics and transactions).
But the real advantage is that it lets you pick an untilOffset that is not empty.
For example, if you have no records for offsets [4, 6]:

offset: 1, 2, 3, 4, 5, 6
record: a, b, c,  ,  ,

If you use an offset range of [1, 5), you will try to read offset 4 but won't receive any data. In this scenario you can't tell whether the offset is simply empty (which is ok) or whether you lost data because Kafka is down (not ok).

To deal with this situation, we first scan the offsets and stop at the last offset where we had data: in the example, instead of [1, 5) we would go with [1, 4), because offset 3 has data so it's safe to stop there.
During the next batch, if extra data has arrived we select the next range as [4, 8) and have no issue (see the sketch after the table below):

offset: 1, 2, 3, 4, 5, 6, 7
record: a, b, c,  ,  ,  , g
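
In code terms, the two batches from this toy example would be chosen roughly like this (illustrative values only):

// Batch 1: the naive range would be [1, 5), but offset 4 holds no record, so
// the scan stops just past the last fetched record (offset 3).
val batch1Range = (1L, 4L)   // [fromOffset, untilOffset) covering a, b, c
// Batch 2: restarts from the previous untilOffset; once a record with a higher
// offset (g at offset 7) is fetched, offsets 4-6 are known to be a harmless
// gap rather than lost data.
val batch2Range = (4L, 8L)   // [fromOffset, untilOffset) covering g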

Does that make sense?
(PS: sorry for the multiple commits, I wasn't running MiMa properly; I'll fix the remaining issue soon.)

@koeninger
Contributor

Still playing devil's advocate here: I don't think stopping at 3 in your example actually tells you anything about the cause of the gap in the sequence at 4. I'm not sure you can know that the gap is because of a transaction marker without a modified Kafka consumer library.

If the actual problem is that when allowNonConsecutiveOffsets is set we need to allow gaps even at the end of an offset range... why not just fix that directly?

Master is updated to Kafka 2.0 at this point, so we should be able to write a test for your original JIRA example of a partition consisting of 1 message followed by 1 transaction commit.

@QuentinAmbard
Author

I'm not sure I understand your point. The cause of the gap doesn't matter; we just want to stop on an existing offset so that we can poll it. Whether the gap comes from a transaction marker, an aborted transaction, or even just a temporary poll failure isn't relevant in this case.
The driver is smart enough to restart from any offset, even in the middle of a transaction (aborted or not).
The issue with a gap at the end is that you can't know whether it's a gap or whether the poll failed.
For example, seekToEnd gives you 5 but the last record you get is 3, and there is no way to know whether 4 is missing or is just an offset gap.
How could we fix that in a different way?

@koeninger
Contributor

If the last offset in the range as calculated by the driver is 5, and on the executor all you can poll up to after a repeated attempt is 3, and the user already told you to allowNonConsecutiveOffsets... then you're done, no error.

Why does it matter whether you do this logic when you're reading all the messages in advance and counting, or when you're actually computing?

To put it another way: this PR is a lot of code change and refactoring; why not just change the logic of, e.g., how CompactedKafkaRDDIterator interacts with compactedNext?

@QuentinAmbard
Author

If you do it in advance you'll change the range: for example, you read until 3 and don't get any extra results. Maybe it's because of a transaction offset, maybe another issue; it's ok in both cases.
The big difference is that the next batch will restart from offset 3 and poll from that value. If seeking to 3 and polling gets you another record (for example at offset 6), then everything is fine: it's not data loss, just a gap.
The issue with your proposal is that seekToEnd gives you the last offset, which might not be the offset of the last record.
So in your example, if the last offset is 5 and after a few polls the last record you get is 3, what do you do, continue and execute the next batch from 5? How do you know that offset 4 isn't just lost because the poll failed?
The only way to know would be to get a record with an offset higher than 5; in that case you know it's just a gap.
But if the message you are reading is the last of the topic, you won't get records higher than 3, so you can't tell whether it's a poll failure or an empty offset left by the transaction commit.
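
A minimal sketch of that ambiguity using the plain consumer API (illustrative only; it assumes the consumer is already assigned to the partition, and a real scan would keep polling until nothing more arrives):

import java.time.Duration
import java.util.Collections
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def lastOffsetVsLastRecord(consumer: KafkaConsumer[String, String], tp: TopicPartition): Unit = {
  // endOffsets (like seekToEnd) reports the end of the partition, i.e. the
  // position after the last message, not the offset of the last record you
  // can actually fetch.
  val logEnd = consumer.endOffsets(Collections.singletonList(tp)).get(tp)
  consumer.seek(tp, 0L)
  val fetched = consumer.poll(Duration.ofSeconds(5)).records(tp).asScala
  val lastRecordOffset = if (fetched.nonEmpty) fetched.last.offset() else -1L
  // If lastRecordOffset < logEnd - 1 we cannot tell, from this information
  // alone, whether the trailing offsets are transaction markers / compacted
  // away (a harmless gap) or data that the poll simply failed to return.
  println(s"log end offset = $logEnd, last record fetched = $lastRecordOffset")
}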

@koeninger
Contributor

> How do you know that offset 4 isn't just lost because poll failed?

By failed, you mean returned an empty collection after timing out, even though records should be available? You don't. You also don't know that it isn't just lost because Kafka skipped a message. AFAIK, from the information you have from a Kafka consumer, once you start allowing gaps in offsets, you don't know.

I understand your point, but even under your proposal you have no guarantee that the poll won't work in your first pass during RDD construction, and then fail on the executor during computation, right?

> The issue with your proposal is that seekToEnd gives you the last offset, which might not be the last record.

Have you tested comparing the results of consumer.endOffsets for consumers with different isolation levels?

Your proposal might end up being the best approach anyway, just because of the unfortunate effect of StreamInputInfo and count, but I want to make sure we think this through.
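
For what it's worth, that comparison could look something like this (a sketch assuming a local broker and a topic named test-topic; not code from this PR):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

def endOffsetFor(isolationLevel: String, tp: TopicPartition): Long = {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, s"probe-$isolationLevel")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, isolationLevel)
  val consumer = new KafkaConsumer[String, String](props)
  try consumer.endOffsets(Collections.singletonList(tp)).get(tp)
  finally consumer.close()
}

// With read_committed, endOffsets reflects the last stable offset (LSO), so a
// partition whose log ends with transactional data can report a different
// value than it does with read_uncommitted.
val tp = new TopicPartition("test-topic", 0)
println(endOffsetFor("read_uncommitted", tp))
println(endOffsetFor("read_committed", tp))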

if (nonConsecutive) {
  val localRw = rewinder()
  val localOffsets = currentOffsets
  context.sparkContext.parallelize(offsets.toList).mapPartitions(tpos => {
Contributor

Because this isn't a KafkaRDD, it isn't going to take advantage of preferred locations, which means it's going to create cached consumers on different executors.

Author

Are you suggesting I should create a new KafkaRDD instead, and consume from that RDD to get the last offset range?

}

/**
* Similar to compactedStart but will return None if poll doesn't
Contributor

Did you mean compactedNext?

buffer.previous()
}

def seekAndPoll(offset: Long, timeout: Long): ConsumerRecords[K, V] = {
Contributor

Is this used anywhere?

  val fromOffset: Long,
- val untilOffset: Long) extends Serializable {
+ val untilOffset: Long,
+ val recordNumber: Long) extends Serializable {
Contributor

Does MiMa actually complain about binary compatibility if you just make recordNumber the existing count? It's just an accessor either way...

If so, and you have to do this, I'd name this recordCount consistently throughout. Number could refer to a lot of things that aren't counts.
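
If renaming turns out to be necessary, one shape that keeps the existing accessor name might look like this (a sketch only; the real OffsetRange has more members and a different constructor):

final class OffsetRange private (
    val topic: String,
    val partition: Int,
    val fromOffset: Long,
    val untilOffset: Long,
    recordCount: Long) extends Serializable {
  // Keep the existing count() accessor: return the scanned record count when
  // we have one, otherwise fall back to the plain offset delta.
  def count(): Long =
    if (recordCount >= 0) recordCount else untilOffset - fromOffset
}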

private def records(offsets: Option[Long]*) = {
offsets.map(o => o.map(new ConsumerRecord("topic", 0, _, "k", "v"))).toList
}
}
Contributor

These tests aren't really testing the actual scenario we care about (transaction markers at the end of an offset range), which should be directly testable now that Kafka has been upgraded to 2.0.
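
Something along these lines could produce that scenario for a test (a sketch assuming a test broker at localhost:9092 and a topic named test-topic; not code from this PR):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "offset-range-test-txn")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
// One record followed by a commit: the commit writes a transaction marker, so
// the partition ends on an offset that holds no consumable record.
producer.send(new ProducerRecord[String, String]("test-topic", "key", "value"))
producer.commitTransaction()
producer.close()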

@QuentinAmbard
Author

> By failed, you mean returned an empty collection after timing out, even though records should be available? You don't. You also don't know that it isn't just lost because Kafka skipped a message. AFAIK, from the information you have from a Kafka consumer, once you start allowing gaps in offsets, you don't know.

OK, that's interesting; my understanding was that if you successfully poll and get results you are 100% sure that you haven't lost anything. Do you have more details on that? Why would Kafka skip a record while consuming?

> Have you tested comparing the results of consumer.endOffsets for consumers with different isolation levels?

endOffsets returns the last offset (same as seekToEnd). But you're right that the easiest solution for us would be to have something like a seekToLastRecord method instead. Maybe something we could also ask for?

@koeninger
Contributor

koeninger commented Aug 6, 2018 via email

@koeninger
Contributor

koeninger commented Aug 6, 2018 via email

@QuentinAmbard
Author

SPARK-25005 actually has a far better solution to detect message loss. Will try to apply the same logic...

@ghost

ghost commented Dec 17, 2018

Based on my limited understanding, I think this PR will fix this issue: https://stackoverflow.com/q/48344055/2272910

We're facing the same issue and would love to see a solution land in the Spark Kafka streaming package.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 9, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Jan 9, 2020
github-actions bot closed this Jan 10, 2020