
Conversation

@zhaozijun109 (Contributor) commented Feb 18, 2025

Purpose

Linked issue: #480
Currently, Fluss directly throws an exception when getting an available tablet server node fails, regardless of the reason: if the tablet server list is empty on the first attempt, an exception is thrown immediately. This is unfriendly, so we can add retry logic to make the client more robust.

Tests

FlussFailServerTableITCase.testRetryGetTabletServerNodes()
TestingMetadataUpdater

API and Format

getOneAvailableTabletServerNode()

Documentation

Add a retry limit to getOneAvailableTabletServerNode() to avoid throwing an exception immediately. When the list of available tablet server nodes in the cluster is empty, we send a metadata request to update the cluster and then try again to get one available tablet server node.
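A minimal sketch of that flow, under assumptions: MetadataUtils, MAX_RETRY_TIMES, getCluster(), and getAliveTabletServerList() appear in the snippets reviewed below, and ServerNode and MetadataUpdater are existing Fluss client types, but updateMetadata() and the thrown exception type are placeholders for the actual refresh call and error handling, not the merged code:

import java.util.List;

public class MetadataUtils {

    private static final int MAX_RETRY_TIMES = 5;

    // Sketch of the behavior described above, not the exact PR code: retry
    // while the alive tablet server list is empty, refreshing the cluster
    // metadata between attempts so a retry can observe newly registered
    // servers instead of re-reading the same stale cache.
    public static ServerNode getOneAvailableTabletServerNode(MetadataUpdater metadataUpdater) {
        for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) {
            List<ServerNode> aliveTabletServers =
                    metadataUpdater.getCluster().getAliveTabletServerList();
            if (!aliveTabletServers.isEmpty()) {
                return aliveTabletServers.get(0);
            }
            // The cached cluster view has no alive tablet servers: send a
            // metadata request to refresh it before the next attempt
            // (updateMetadata() is a placeholder for the actual refresh call).
            metadataUpdater.updateMetadata();
        }
        throw new IllegalStateException(
                "No available tablet server node after " + MAX_RETRY_TIMES + " retries.");
    }
}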

metadataUpdater.getCluster()),
rpcClient,
AdminReadOnlyGateway.class);

Member:

Unnecessary change. Please revert this line.

Contributor Author:

Ok, I will revert this line.

public class MetadataUtils {
private static final Logger LOG = LoggerFactory.getLogger(MetadataUtils.class);

private static final int MAX_RETRY_TIMES = 5;
Member:

Could this retry time be configurable?

Contributor Author:

Yes, I will make it a system configuration option.
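A sketch of what that option could look like, assuming the Flink-style ConfigOptions builder that Fluss configuration code follows; the option name matches the snippet reviewed later in this thread, but the key string, default value, and description here are illustrative assumptions:

// Imports of ConfigOption/ConfigOptions omitted; the package depends on the
// Fluss version. The key, default value, and description are assumptions.
public static final ConfigOption<Integer> CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES =
        ConfigOptions.key("client.get-tablet-server-node.max-retry-times")
                .intType()
                .defaultValue(5)
                .withDescription(
                        "The maximum number of retries when getting one available "
                                + "tablet server node from the cluster metadata.");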

@polyzos (Contributor) commented Feb 19, 2025

@gkatzioura also added a retry mechanism here to fix some tests. I assume retry logic will be needed in other places as well, so it may make sense to introduce a class to handle that, or to use a library. WDYT? cc @wuchong

UPDATE: There is a retry method, but it's only available in the test utils.

@zhaozijun109 (Contributor Author)

@polyzos Good idea, I will try to create a common retry utility method. WDYT? @SteNicholas, please leave some comments, thank you.
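A sketch of what such a shared helper could look like; the class name, signature, and fixed backoff are hypothetical, not necessarily what was ultimately merged:

import java.util.function.Supplier;

// Hypothetical shared retry helper sketched from the discussion above.
public final class RetryUtils {

    private RetryUtils() {}

    // Retries the supplier until it returns a non-null value or the retry
    // budget is exhausted, sleeping a fixed backoff between attempts.
    public static <T> T retry(Supplier<T> supplier, int maxRetries, long backoffMs)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            T result = supplier.get();
            if (result != null) {
                return result;
            }
            if (attempt < maxRetries) {
                Thread.sleep(backoffMs);
            }
        }
        throw new IllegalStateException(
                "Operation did not succeed after " + maxRetries + " retries");
    }
}

With a helper like this, getOneAvailableTabletServerNode could pass a lambda that refreshes metadata and reads the alive server list, instead of hand-rolling its own loop.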

List<ServerNode> aliveTabletServers = null;
for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) {
aliveTabletServers = cluster.getAliveTabletServerList();
if (aliveTabletServers.isEmpty()) {
Contributor:

When aliveTabletServers is empty in the cluster, the retry doesn't work, since the cluster will still be empty for aliveTabletServers; we need to send a metadata request to update the cluster.

Contributor Author:

OK, I will adjust it, thank you.

@SteNicholas (Member) commented Feb 21, 2025

@zhaozijun109, @polyzos, IMO, the retry mechanism may differ for different behaviors, so it may not be necessary to introduce a common retry utility method.

@zhaozijun109 (Contributor Author)

@SteNicholas Thanks for your advice, I will re-adjust based on all the comments above.

@polyzos (Contributor) commented Feb 21, 2025

@SteNicholas Indeed, but there might be reuse of some implementations. My thought was that it might be best to keep them in a centralized place, to avoid introducing multiple retry implementations across the codebase, because I have already seen three different cases popping up in different places. Whatever you think works best.

@zhaozijun109 (Contributor Author)

@luoyuxia @SteNicholas Could you please review this again when you have free time? Thank you.

"Enable metrics for client. When metrics is enabled, the client "
+ "will collect metrics and report by the JMX metrics reporter.");

public static final ConfigOption<Integer> CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES =
@SteNicholas (Member) commented Feb 24, 2025

Add this option in configuration.md.

@SteNicholas (Member)

@zhaozijun109, could you create a new issue for this pull request? BTW, could you also add a description to this pull request?

@zhaozijun109 (Contributor Author)

@SteNicholas Thank you for your review, I will create a new issue.

@zhaozijun109 (Contributor Author)

@luoyuxia @SteNicholas The new issue is #480. Could you please review this again when you have free time? Thank you.

@wuchong linked an issue on Feb 25, 2025 that may be closed by this pull request.
@wuchong (Member) commented Feb 25, 2025

Thank you for your contribution, @zhaozijun109! You're absolutely right—the current client metadata implementation has significant issues with retrying, timeout handling, and error management. While this PR effectively addresses the specific case of getOneAvailableTabletServerNode, there are many other methods in the client metadata module that could benefit from similar improvements.

To tackle this comprehensively, I've created Issue #483 to propose a broader refactoring of the client metadata implementation. The goal is to address all recent bugs and provide a more robust, general solution.

@zhaozijun109 (Contributor Author)

@wuchong Thank you, I will continue studying the principles of Fluss and try my best to contribute. Also, refactoring the client metadata implementation sounds great; many methods will benefit from it.

@polyzos polyzos force-pushed the main branch 3 times, most recently from d88c76c to 434a4f4 Compare August 31, 2025 15:13


Development

Successfully merging this pull request may close these issues.

Support retry mechanism when get available tablet server nodes failed
