Flink job stuck at the “INITIALIZING” stage when using RANGE distribution mode #14079

@hydrogenlee

Description

Apache Iceberg version

1.8.1

Query engine

Flink

Please describe the bug 🐞

ENV:
Flink version: 1.19.0
parallelism: 128

Problem:
When running a Flink job with RANGE distribution mode, the range shuffle operator eventually gets stuck in the “INITIALIZING” state. This happens after several cycles of running the job for a while, stopping it with a savepoint, and restarting it from that savepoint. No error or warning logs are emitted, and all other operators transition to the RUNNING state successfully.

Steps to Reproduce:

  1. Run a job in range distribution mode.
  2. Stop with savepoint.
  3. Restart from the savepoint.
  4. Repeat steps 1–3 multiple times.
  5. Eventually, after a restart, the job gets stuck in the INITIALIZING stage.
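
For anyone trying to reproduce this, here is a minimal sketch of the stop/restart cycle from steps 2 and 3, written as small Python helpers that build the Flink CLI invocations. The report does not say how the job was actually stopped and restarted, so this assumes the standard `flink stop --savepointPath` and `flink run --fromSavepoint` commands. The job ID, savepoint paths, and jar name are hypothetical placeholders.

```python
from typing import List


def stop_with_savepoint_cmd(job_id: str, savepoint_dir: str) -> List[str]:
    """Build the CLI call that stops a job and writes a savepoint (step 2)."""
    return ["flink", "stop", "--savepointPath", savepoint_dir, job_id]


def restart_from_savepoint_cmd(
    savepoint_path: str, jar: str, parallelism: int = 128
) -> List[str]:
    """Build the CLI call that resubmits the job from a savepoint (step 3).

    The default parallelism of 128 matches the ENV section of this report.
    """
    return [
        "flink", "run", "--detached",
        "--fromSavepoint", savepoint_path,
        "--parallelism", str(parallelism),
        jar,
    ]


if __name__ == "__main__":
    # Hypothetical values; substitute the real job ID, paths, and jar.
    print(" ".join(stop_with_savepoint_cmd("<job-id>", "/tmp/savepoints")))
    print(" ".join(restart_from_savepoint_cmd(
        "/tmp/savepoints/<savepoint-dir>", "my-iceberg-job.jar")))
```

The helpers only build argument lists, so they can be passed to `subprocess.run` in a loop to repeat the cycle until the operator stays in INITIALIZING.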

The hot thread of the range-shuffle operator is:
[Image: hot thread of the range-shuffle operator]

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels: bug
