
[MINOR][SS] Add some description about auto reset and data loss note to SS doc #31089

Closed
viirya wants to merge 4 commits into apache:master from viirya:ss-minor-5

Conversation

@viirya
Member

@viirya viirya commented Jan 8, 2021

What changes were proposed in this pull request?

This patch adds some description to the SS doc about offset reset and data loss.

Why are the changes needed?

During a recent SS test, the behavior of gradually reducing input rows confused me. Compared with Flink, I do not see similar behavior. After looking into the code and doing some tests, I feel it is better to add more description about this to the SS doc.

Does this PR introduce any user-facing change?

No, doc only.

How was this patch tested?

Doc only.

@github-actions github-actions bot added the DOCS label Jan 8, 2021
streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), because the offsets
Member

Maybe we need to remove "because" in this sentence?

Member Author

ok

streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), because the offsets
are not reset and the streaming application will see data lost. In extream cases, for example the
Member

extream -> extreme

Member

@sunchao sunchao left a comment

I'm not familiar with SS, so just trying to help on the grammar. :)

topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
Member

"is not in" -> "are not in"

streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), the offsets
Member

"offset retention period" : not sure if the offset is redundant.

Also, perhaps "the offsets are not reset" -> "they will not be reset".

streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), the offsets
are not reset and the streaming application will see data lost. In extreme cases, for example the
Member

"see data lost" -> "see data loss"
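The behavior the revised sentence warns about can be sketched in plain Python, independent of Spark (the function name `resolve_start_offset` and the offset numbers are made up for illustration): when a query resumes from a checkpointed offset that Kafka has already expired via retention (or topic deletion), the source can only start from the earliest offset still available, and the records in between are gone.

```python
def resolve_start_offset(checkpointed, earliest_available):
    """Return (effective_start, records_lost) when a streaming query resumes.

    If the checkpointed offset is still retained by Kafka, resume there and
    nothing is lost. If retention has advanced the earliest offset past the
    checkpoint, the source can only start from the earliest available offset,
    skipping the gap -- the data loss the doc note describes.
    """
    if checkpointed >= earliest_available:
        return checkpointed, 0
    return earliest_available, earliest_available - checkpointed

# Normal resume: the checkpointed offset is still within the retained range.
assert resolve_start_offset(checkpointed=500, earliest_available=100) == (500, 0)

# Retention expired offsets 0-699 while the query was down: 200 records lost.
assert resolve_start_offset(checkpointed=500, earliest_available=700) == (700, 200)
```

This is why the doc stresses that `startingOffsets` does not help here: resuming always uses the checkpoint, and a checkpoint pointing at expired offsets means loss, not a reset.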

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133824 has finished for PR 31089 at commit db72dc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jan 8, 2021

Thanks @dongjoon-hyun @sunchao

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38413/

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133826 has finished for PR 31089 at commit 17fd73c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38413/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38415/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38415/

@viirya
Member Author

viirya commented Jan 8, 2021

cc @HeartSaVioR

topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application are not in Kafka (e.g., topics are deleted,
Contributor

It feels more natural to say "no longer exist in" instead of "are not in", but as I'm not a native speaker, please take this with a grain of salt.

Member Author

@viirya viirya Jan 9, 2021

thanks. ok for me. updated.

@HeartSaVioR
Contributor

I feel it is better to add this under "failOnDataLoss" instead of "auto reset", but let's hear others' voices.
cc. @gaborgsomogyi @xuanyuanking
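For context, both options under discussion are set on the Kafka source when the query is defined. A minimal configuration sketch, not from this PR (it assumes pyspark, plus an illustrative broker address `localhost:9092` and topic name `events`, and will only run against a real Spark installation with a reachable broker):

```python
# Sketch only: requires a Spark installation and a running Kafka broker.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-offsets-sketch").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # illustrative address
      .option("subscribe", "events")                        # illustrative topic
      # Applies only when a NEW query starts; a resumed query always picks up
      # from the offsets in its checkpoint, regardless of this setting.
      .option("startingOffsets", "earliest")
      # If the checkpointed offsets are no longer in Kafka (retention expired,
      # topic deleted), fail the query instead of silently skipping the gap.
      .option("failOnDataLoss", "true")
      .load())
```

With `failOnDataLoss` set to `false`, the query would instead continue from the earliest available offsets and the skipped records would be lost silently, which is the behavior the added doc note describes.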

@SparkQA

SparkQA commented Jan 9, 2021

Test build #133859 has finished for PR 31089 at commit e1f5a33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38448/

@SparkQA

SparkQA commented Jan 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38448/

@dongjoon-hyun
Member

Merged to master for Apache Spark 3.2.0.

@viirya viirya deleted the ss-minor-5 branch December 27, 2023 18:24