[SPARK-34187][SS] Use available offset range obtained during polling when checking offset validation #31275

viirya · 2021-01-21T07:40:42Z

What changes were proposed in this pull request?

This patch validates offsets against the available offset range obtained while polling Kafka records.

Why are the changes needed?

We have supported non-consecutive offsets for Kafka since 2.4.0. In fetchRecord, we validate an offset by checking whether it falls within the available offset range. Currently, however, we fetch the latest available offset range to do the check. This looks incorrect: the available offset range can change during the batch, so it may differ from the range that was in effect when we polled the records from Kafka.

It is possible that an offset is valid when polling but, by the time we perform the check above, falls outside the latest available offset range. We would then wrongly treat it as a data loss case and fail the query or drop the record.
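
To illustrate the scenario described above, here is a minimal sketch. It is not the Spark implementation: `AvailableOffsetRange` as defined here, `Polled`, `OffsetValidation`, and `isValid` are hypothetical names, and the half-open interval check is an assumption made for illustration only.

```scala
// Hypothetical types for illustration; not the classes used in KafkaDataConsumer.
final case class AvailableOffsetRange(earliest: Long, latest: Long)

// Result of one poll: the record offsets plus the range observed at poll time.
final case class Polled(offsets: Seq[Long], rangeAtPoll: AvailableOffsetRange)

object OffsetValidation {
  // Assumption: an offset is valid if it lies in [earliest, latest).
  def isValid(offset: Long, range: AvailableOffsetRange): Boolean =
    offset >= range.earliest && offset < range.latest

  def main(args: Array[String]): Unit = {
    // At poll time, offsets 10 to 12 are available.
    val polled = Polled(Seq(10L, 11L, 12L), AvailableOffsetRange(10L, 13L))

    // Later in the same batch, old records are aged out and new ones arrive,
    // so a freshly fetched range no longer covers offset 10.
    val latestRange = AvailableOffsetRange(12L, 20L)

    val offset = 10L
    // Checking against the latest range reports the offset as missing.
    println(isValid(offset, latestRange))        // false
    // Checking against the range captured during polling accepts it.
    println(isValid(offset, polled.rangeAtPoll)) // true
  }
}
```

Carrying the range together with the polled records, as in the sketch, means the check sees the same state the poll saw.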

Does this PR introduce any user-facing change?

No

How was this patch tested?

This should pass existing unit tests.

This is hard to cover with a unit test because the Kafka producer and consumer are asynchronous. Further, the test would also need to push the offset out of the new available offset range.

...kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala

HyukjinKwon · 2021-01-21T13:51:03Z

cc @HeartSaVioR @gaborgsomogyi @xuanyuanking

SparkQA · 2021-01-21T19:49:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38927/

SparkQA · 2021-01-21T19:50:02Z

Test build #134340 has finished for PR 31275 at commit 81c52e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-21T19:54:08Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38927/

jaceklaskowski · 2021-01-22T10:49:13Z

...kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala

@@ -192,6 +197,13 @@ private[consumer] case class FetchedData(
   * Returns the next offset to poll after draining the pre-fetched records.
   */
  def offsetAfterPoll: Long = _offsetAfterPoll
+
+  /**
+   * Returns the tuple of earliest and latest offsets that is the available offset range when


nit: Use @returns annotation.

Sorry for being late with this, but I've just noticed it.

Hm, I think this just follows other methods above and below.

dongjoon-hyun

Given your assessment, could you mark it as a correctness issue, @viirya ?

It looks not correct ...
... drop the record.

...l/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/FetchedDataPool.scala

dongjoon-hyun

Although we don't have a test case, this looks reasonable. I left a few minor comments.

viirya · 2021-01-24T18:47:29Z

Given your assessment, could you mark it as a correctness issue, @viirya ?

Done. Thanks.

dongjoon-hyun · 2021-01-24T19:49:33Z

Thank you for the update, @viirya .

…when checking offset validation

Closes #31275 from viirya/SPARK-34187.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit ab6c0e5)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2021-01-24T19:51:54Z

Merged to master/3.1. Could you make a backport to 3.0/2.4?

SparkQA · 2021-01-24T20:02:13Z

Test build #134414 has finished for PR 31275 at commit 8716a16.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-01-24T20:19:23Z

Thanks @dongjoon-hyun. Will file backport later.

gaborgsomogyi

Late LGTM. It took me some time to reproduce it on a cluster, but it works like a charm.

HeartSaVioR · 2021-01-26T00:41:38Z

Nice finding, and thanks for the fix!

Use available offset range obtained during polling.

3494e18

github-actions bot added SQL STRUCTURED STREAMING labels Jan 21, 2021

jaceklaskowski reviewed Jan 21, 2021

...kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala

Fix typo.

81c52e5

jaceklaskowski reviewed Jan 22, 2021

dongjoon-hyun reviewed Jan 24, 2021

...l/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/FetchedDataPool.scala

dongjoon-hyun approved these changes Jan 24, 2021

viirya added 2 commits January 24, 2021 10:41

Move import.

ce9d1ea

Merge remote-tracking branch 'upstream/master' into SPARK-34187

8716a16

dongjoon-hyun closed this in ab6c0e5 Jan 24, 2021

gaborgsomogyi reviewed Jan 25, 2021

viirya deleted the SPARK-34187 branch December 27, 2023 18:24
