From 1bd3cd27660a1639eb9a2c1b7f62f599b5f99b98 Mon Sep 17 00:00:00 2001
From: Liang-Chi Hsieh
Date: Thu, 7 Jan 2021 22:49:45 -0800
Subject: [PATCH 1/4] Add doc.

---
 docs/structured-streaming-kafka-integration.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 5336695478c14..938e1c8e99e31 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -878,7 +878,14 @@ group id, however, please read warnings for this option and use it with caution.
   where to start instead. Structured Streaming manages which offsets are consumed internally, rather
   than rely on the kafka Consumer to do it. This will ensure that no data is missed when new
   topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
-  streaming query is started, and that resuming will always pick up from where the query left off.
+  streaming query is started, and that resuming will always pick up from where the query left off. Note
+  that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
+  offsets are out of range, or offsets are removed after offset retention period), because the offsets
+  are not reset and the streaming application will see data lost. In extream cases, for example the
+  throughput of the streaming application cannot catch up the retention speed of Kafka, the input rows
+  of a batch might be gradually reduced until zero when the offset ranges of the batch are completely
+  not in Kafka. Enabling `failOnDataLoss` option can ask Structured Streaming to fail the query for such
+  cases.
 - **key.deserializer**: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use
   DataFrame operations to explicitly deserialize the keys.
 - **value.deserializer**: Values are always deserialized as byte arrays with ByteArrayDeserializer.

From db72dc7532af6a11722f98581ff97d04cb42ad45 Mon Sep 17 00:00:00 2001
From: Liang-Chi Hsieh
Date: Thu, 7 Jan 2021 23:02:20 -0800
Subject: [PATCH 2/4] for review comment.

---
 docs/structured-streaming-kafka-integration.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 938e1c8e99e31..9a880aa7e55ee 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -880,8 +880,8 @@ group id, however, please read warnings for this option and use it with caution.
   topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
   streaming query is started, and that resuming will always pick up from where the query left off. Note
   that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
-  offsets are out of range, or offsets are removed after offset retention period), because the offsets
-  are not reset and the streaming application will see data lost. In extream cases, for example the
+  offsets are out of range, or offsets are removed after retention period), the offsets
+  are not reset and the streaming application will see data lost. In extreme cases, for example the
   throughput of the streaming application cannot catch up the retention speed of Kafka, the input rows
   of a batch might be gradually reduced until zero when the offset ranges of the batch are completely
   not in Kafka. Enabling `failOnDataLoss` option can ask Structured Streaming to fail the query for such

From 17fd73c82fdfa07d8bc301d2dfed3bfc2417922b Mon Sep 17 00:00:00 2001
From: Liang-Chi Hsieh
Date: Thu, 7 Jan 2021 23:38:44 -0800
Subject: [PATCH 3/4] For review comment.

---
 docs/structured-streaming-kafka-integration.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 9a880aa7e55ee..9ff36e1f9094c 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -879,13 +879,12 @@ group id, however, please read warnings for this option and use it with caution.
   than rely on the kafka Consumer to do it. This will ensure that no data is missed when new
   topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
   streaming query is started, and that resuming will always pick up from where the query left off. Note
-  that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
-  offsets are out of range, or offsets are removed after retention period), the offsets
-  are not reset and the streaming application will see data lost. In extreme cases, for example the
-  throughput of the streaming application cannot catch up the retention speed of Kafka, the input rows
-  of a batch might be gradually reduced until zero when the offset ranges of the batch are completely
-  not in Kafka. Enabling `failOnDataLoss` option can ask Structured Streaming to fail the query for such
-  cases.
+  that when the offsets consumed by a streaming application are not in Kafka (e.g., topics are deleted,
+  offsets are out of range, or offsets are removed after retention period), the offsets will not be reset
+  and the streaming application will see data loss. In extreme cases, for example the throughput of the
+  streaming application cannot catch up the retention speed of Kafka, the input rows of a batch might be
+  gradually reduced until zero when the offset ranges of the batch are completely not in Kafka. Enabling
+  `failOnDataLoss` option can ask Structured Streaming to fail the query for such cases.
 - **key.deserializer**: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use
   DataFrame operations to explicitly deserialize the keys.
 - **value.deserializer**: Values are always deserialized as byte arrays with ByteArrayDeserializer.

From e1f5a33e06c523267f01f31862d8b9ec3561a01a Mon Sep 17 00:00:00 2001
From: Liang-Chi Hsieh
Date: Fri, 8 Jan 2021 19:12:59 -0800
Subject: [PATCH 4/4] For review comment.

---
 docs/structured-streaming-kafka-integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 9ff36e1f9094c..c6928abfad191 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -879,7 +879,7 @@ group id, however, please read warnings for this option and use it with caution.
   than rely on the kafka Consumer to do it. This will ensure that no data is missed when new
   topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
   streaming query is started, and that resuming will always pick up from where the query left off. Note
-  that when the offsets consumed by a streaming application are not in Kafka (e.g., topics are deleted,
+  that when the offsets consumed by a streaming application no longer exist in Kafka (e.g., topics are deleted,
   offsets are out of range, or offsets are removed after retention period), the offsets will not be reset
   and the streaming application will see data loss. In extreme cases, for example the throughput of the
   streaming application cannot catch up the retention speed of Kafka, the input rows of a batch might be
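
For context, the options discussed in the documentation above are set directly on the Kafka source. Below is a minimal sketch, not part of the patch series: the broker address, topic name, and checkpoint path are placeholders, and it assumes the `spark-sql-kafka-0-10` package is available.

```scala
import org.apache.spark.sql.SparkSession

object KafkaSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaSourceSketch")
      .getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
      .option("subscribe", "topic1")                   // placeholder topic
      // Only applies when the query is first started; a resumed query
      // always continues from its checkpointed offsets.
      .option("startingOffsets", "earliest")
      // Fail the query instead of silently shrinking batches when the
      // consumed offsets no longer exist in Kafka.
      .option("failOnDataLoss", "true")
      .load()

    // Keys and values arrive as byte arrays; deserialize them explicitly
    // with DataFrame operations.
    val decoded = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    decoded.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/kafka-sketch-checkpoint") // placeholder path
      .start()
      .awaitTermination()
  }
}
```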