
[MINOR][SS] Add some description about auto reset and data loss note to SS doc #31089

Closed
viirya wants to merge 4 commits into apache:master from viirya:ss-minor-5

Conversation

@viirya
Member

@viirya viirya commented Jan 8, 2021

What changes were proposed in this pull request?

This patch adds some description to the SS doc about offset reset and data loss.

Why are the changes needed?

During a recent SS test, the behavior of gradually reducing input rows confused me. Compared with Flink, I do not see similar behavior. After looking into the code and doing some tests, I feel it is better to add more description about this to the SS doc.

Does this PR introduce any user-facing change?

No, doc only.

How was this patch tested?

Doc only.

@github-actions github-actions bot added the DOCS label Jan 8, 2021
streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), because the offsets
Member

Maybe we need to remove "because" in this sentence?

Member Author

ok

streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), because the offsets
are not reset and the streaming application will see data lost. In extream cases, for example the
Member

extream -> extreme

Member

@sunchao sunchao left a comment

I'm not familiar with SS, so just trying to help on the grammar. :)

topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
Member

"is not in" -> "are not in"

streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), the offsets
Member

"offset retention period" : not sure if the offset is redundant.

Also, perhaps "the offsets are not reset" -> "they will not be reset".

streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application is not in Kafka (e.g., topics are deleted,
offsets are out of range, or offsets are removed after offset retention period), the offsets
are not reset and the streaming application will see data lost. In extreme cases, for example the
Member

"see data lost" -> "see data loss"
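The behavior the revised sentence warns about can be sketched in plain Python, independent of Spark (the function name `resolve_start_offset` and the offset numbers are made up for illustration): when a query resumes from a checkpointed offset that Kafka has already expired via retention (or topic deletion), the source can only start from the earliest offset still available, and the records in between are gone.

```python
def resolve_start_offset(checkpointed, earliest_available):
    """Return (effective_start, records_lost) when a streaming query resumes.

    If the checkpointed offset is still retained by Kafka, resume there and
    nothing is lost. If retention has advanced the earliest offset past the
    checkpoint, the source can only start from the earliest available offset,
    skipping the gap -- the data loss the doc note describes.
    """
    if checkpointed >= earliest_available:
        return checkpointed, 0
    return earliest_available, earliest_available - checkpointed

# Normal resume: the checkpointed offset is still within the retained range.
assert resolve_start_offset(checkpointed=500, earliest_available=100) == (500, 0)

# Retention expired offsets 0-699 while the query was down: 200 records lost.
assert resolve_start_offset(checkpointed=500, earliest_available=700) == (700, 200)
```

This is why the doc stresses that `startingOffsets` does not help here: resuming always uses the checkpoint, and a checkpoint pointing at expired offsets means loss, not a reset.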

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133824 has finished for PR 31089 at commit db72dc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jan 8, 2021

Thanks @dongjoon-hyun @sunchao

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38413/

@SparkQA

SparkQA commented Jan 8, 2021

Test build #133826 has finished for PR 31089 at commit 17fd73c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38413/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38415/

@SparkQA

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38415/

@viirya
Member Author

viirya commented Jan 8, 2021

cc @HeartSaVioR

topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
streaming query is started, and that resuming will always pick up from where the query left off.
streaming query is started, and that resuming will always pick up from where the query left off. Note
that when the offsets consumed by a streaming application are not in Kafka (e.g., topics are deleted,
Contributor

It feels more natural to say "no longer exist in" instead of "are not in", but as I'm not a native speaker, please take this with a grain of salt.

Member Author

@viirya viirya Jan 9, 2021

thanks. ok for me. updated.

@HeartSaVioR
Contributor

I feel it is better to add this under "failOnDataLoss" instead of "auto reset", but let's hear others' voices.
cc. @gaborgsomogyi @xuanyuanking
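For context, both options under discussion are set on the Kafka source when the query is defined. A minimal configuration sketch, not from this PR (it assumes pyspark, plus an illustrative broker address `localhost:9092` and topic name `events`, and will only run against a real Spark installation with a reachable broker):

```python
# Sketch only: requires a Spark installation and a running Kafka broker.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-offsets-sketch").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # illustrative address
      .option("subscribe", "events")                        # illustrative topic
      # Applies only when a NEW query starts; a resumed query always picks up
      # from the offsets in its checkpoint, regardless of this setting.
      .option("startingOffsets", "earliest")
      # If the checkpointed offsets are no longer in Kafka (retention expired,
      # topic deleted), fail the query instead of silently skipping the gap.
      .option("failOnDataLoss", "true")
      .load())
```

With `failOnDataLoss` set to `false`, the query would instead continue from the earliest available offsets and the skipped records would be lost silently, which is the behavior the added doc note describes.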

@SparkQA

SparkQA commented Jan 9, 2021

Test build #133859 has finished for PR 31089 at commit e1f5a33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38448/

@SparkQA

SparkQA commented Jan 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38448/

@dongjoon-hyun
Member

Merged to master for Apache Spark 3.2.0.

@viirya viirya deleted the ss-minor-5 branch December 27, 2023 18:24