[FLINK-33932][checkpointing] Add retry mechanism in RocksDBStateUploader #23986

xiangyuf · 2023-12-25T07:11:24Z

What is the purpose of the change

Add retry mechanism in RocksDBStateUploader

Brief change log

Introduce a RetryStrategy in RocksDBStateUploader that is applied in uploadFilesToCheckpointFs

Verifying this change

This change added tests and can be verified as follows:

Added Unit Test

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

flinkbot · 2023-12-25T07:15:24Z

CI report:

7ca00fd Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

pnowojski

Thanks for the contribution, I've left a couple of comments. Apart from those, could you implement a test for this in RocksDBStateUploader? You could inject a testing CheckpointStreamFactory that throws a configured number of exceptions first.
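
A minimal sketch of the kind of test double suggested here, assuming nothing about the PR's actual code: `StreamOpener`, `FailingStreamOpener`, and `openWithRetries` are hypothetical stand-ins written for illustration. They are not Flink's `CheckpointStreamFactory` or the retry logic in `RocksDBStateUploader`.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: all types here are hypothetical, not Flink classes. */
public class FailingFactorySketch {

    /** Hypothetical stand-in for a factory that opens a checkpoint stream. */
    interface StreamOpener {
        String open() throws IOException;
    }

    /** Fails the first {@code failuresBeforeSuccess} calls, then succeeds. */
    static final class FailingStreamOpener implements StreamOpener {
        private final int failuresBeforeSuccess;
        private final AtomicInteger calls = new AtomicInteger();

        FailingStreamOpener(int failuresBeforeSuccess) {
            this.failuresBeforeSuccess = failuresBeforeSuccess;
        }

        @Override
        public String open() throws IOException {
            int attempt = calls.incrementAndGet();
            if (attempt <= failuresBeforeSuccess) {
                throw new IOException("injected failure #" + attempt);
            }
            return "stream-" + attempt;
        }

        int calls() {
            return calls.get();
        }
    }

    /** Hypothetical retry loop: one initial attempt plus up to maxRetries retries. */
    static String openWithRetries(StreamOpener opener, int maxRetries) throws IOException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return opener.open();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed
    }
}
```

A test would then assert both that the upload eventually succeeds when the retry budget covers the injected failures, and that the exception propagates when it does not.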

pnowojski · 2024-01-04T16:24:01Z

...end-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateUploader.java

+    private static final int DEFAULT_RETRY_TIMES = 3;
+
+    private static final Duration DEFAULT_RETRY_DELAY = Duration.ofSeconds(1L);


This should be configurable and, in the first release, most likely default to the current behaviour.
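
As a rough illustration of this request, the sketch below reads the retry settings from a plain map and defaults to zero retries, so the pre-existing fail-on-first-error behaviour is kept unless a user opts in. The class name and both configuration keys are hypothetical; a real change would presumably define the options through Flink's configuration classes rather than a raw map.

```java
import java.time.Duration;
import java.util.Map;

/** Illustrative only: keys and class name are hypothetical, not Flink options. */
public class UploadRetryConfigSketch {

    // Hypothetical keys, invented for this example.
    static final String RETRY_TIMES_KEY = "hypothetical.upload.retry-times";
    static final String RETRY_DELAY_MS_KEY = "hypothetical.upload.retry-delay-ms";

    final int retryTimes;
    final Duration retryDelay;

    private UploadRetryConfigSketch(int retryTimes, Duration retryDelay) {
        this.retryTimes = retryTimes;
        this.retryDelay = retryDelay;
    }

    /** Missing keys fall back to 0 retries, i.e. the behaviour before the change. */
    static UploadRetryConfigSketch fromMap(Map<String, String> conf) {
        int times = Integer.parseInt(conf.getOrDefault(RETRY_TIMES_KEY, "0"));
        long delayMs = Long.parseLong(conf.getOrDefault(RETRY_DELAY_MS_KEY, "1000"));
        if (times < 0 || delayMs < 0) {
            throw new IllegalArgumentException("retry settings must be non-negative");
        }
        return new UploadRetryConfigSketch(times, Duration.ofMillis(delayMs));
    }

    boolean retriesEnabled() {
        return retryTimes > 0;
    }
}
```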

pnowojski · 2024-01-04T16:24:17Z

...end-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateUploader.java

-                                                                stateScope,
-                                                                closeableRegistry,
-                                                                tmpResourcesRegistry)),
+                                                () -> {


Please extract this into a separate method.

pnowojski · 2024-01-04T16:25:39Z

...end-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateUploader.java

+                                                        TimeUnit.MILLISECONDS.sleep(
+                                                                retryStrategy
+                                                                        .getRetryDelay()
+                                                                        .toMillis());


It would be better to not synchronously wait on the timeout, and free up the thread pool to do other things.
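
One way to avoid the blocking sleep, sketched under stated assumptions: instead of sleeping on the worker thread, the retry is handed to a `ScheduledExecutorService`, which runs the next attempt after the delay and leaves the thread free in the meantime. `AsyncRetrySketch` and `retryAsync` are hypothetical names for illustration. Only JDK classes are used, and this is not how the PR or Flink implements it.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative only: a non-blocking retry helper built on JDK classes. */
public class AsyncRetrySketch {

    /** Runs the operation, retrying up to retriesLeft times with a fixed delay. */
    static <T> CompletableFuture<T> retryAsync(
            Supplier<T> operation,
            int retriesLeft,
            Duration delay,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, retriesLeft, delay, scheduler, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<T> operation,
            int retriesLeft,
            Duration delay,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletableFuture.supplyAsync(operation, scheduler)
                .whenComplete(
                        (value, error) -> {
                            if (error == null) {
                                result.complete(value);
                            } else if (retriesLeft > 0) {
                                // Schedule the next attempt; no thread sleeps while waiting.
                                scheduler.schedule(
                                        () -> attempt(operation, retriesLeft - 1, delay, scheduler, result),
                                        delay.toMillis(),
                                        TimeUnit.MILLISECONDS);
                            } else {
                                // Retry budget exhausted: surface the last failure.
                                result.completeExceptionally(error);
                            }
                        });
    }
}
```

The caller gets a future back immediately and can compose it with other uploads. The trade-off is that cancellation and resource cleanup have to be wired through the future rather than through a simple try/catch around a loop.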


Zakelly · 2024-02-20T15:57:57Z

Sorry for jumping in but may I ask about the current status?

xiangyuf · 2024-02-21T02:42:43Z

> Sorry for jumping in but may I ask about the current status?

Hi @Zakelly, I'm still working on FLIP-414 to support a more general retry mechanism for all state backends.

pnowojski requested changes Jan 4, 2024


[FLINK-33932][checkpointing] Add retry mechanism in RocksDBStateDataTransfer

7ca00fd

xiangyuf force-pushed the FLINK-33932 branch from 4f3d085 to 7ca00fd Compare January 7, 2024 16:43

flinkbot added the component=Runtime/Checkpointing label Apr 4, 2024
