Couldn't restart streams when using Spark Structured Streaming when Kafka offset goes out of range #16528

@hudi-bot

Description

When using Spark Structured Streaming with Kafka and writing data to Hudi, jobs sometimes can't keep up with the input rate and fail because the Kafka offset goes out of range (i.e. the earliest Kafka messages have been cleaned up by the retention policy). When we then try to restart the job by clearing the previous checkpoint and consuming from the latest offset, we see that the batches are skipped by the `HoodieStreamingSink`.

There is currently no way to restart these streams.
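Assuming `HoodieStreamingSink` follows Spark's Sink idempotency contract (a sink may skip any `batchId` at or below the last batch it has already committed, as recovered from the table's commit metadata), the reported skipping can be sketched as follows. All class and variable names below are hypothetical illustrations, not Hudi internals:

```python
class StreamingSinkSketch:
    """Hypothetical stand-in for a sink that persists its last committed batch id."""

    def __init__(self, last_committed_batch_id: int):
        # e.g. recovered from the table's commit metadata after a restart
        self.last_committed_batch_id = last_committed_batch_id
        self.written = []

    def add_batch(self, batch_id: int, rows) -> bool:
        # Idempotency guard: skip any batch the sink believes it already committed.
        if batch_id <= self.last_committed_batch_id:
            return False  # batch skipped
        self.written.extend(rows)
        self.last_committed_batch_id = batch_id
        return True  # batch written


# Suppose the table had committed up to batch 41 before the failure.
sink = StreamingSinkSketch(last_committed_batch_id=41)

# After clearing the checkpoint, Spark numbers micro-batches from 0 again,
# so every new batch falls at or below the sink's recorded id and is skipped.
results = [sink.add_batch(b, [f"row-{b}"]) for b in range(3)]
print(results)       # [False, False, False] -> all batches skipped
print(sink.written)  # [] -> no data reaches the table
```

This illustrates why clearing only the Spark checkpoint is not enough: the sink's own record of the last committed batch id survives in the table, so the restarted query's fresh batch ids never exceed it.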
