
Conversation

@zhaozijun109 (Contributor) commented Feb 18, 2025

Purpose

Linked issue: #480
Currently, Fluss directly throws an exception when getting an available tablet server node fails, regardless of the reason: if the tablet server list is empty on the first attempt, an exception is thrown immediately. This is unfriendly, so we can add retry logic to make the client more robust.

Tests

FlussFailServerTableITCase.testRetryGetTabletServerNodes()
TestingMetadataUpdater

API and Format

getOneAvailableTabletServerNode()

Documentation

Add a retry limit to getOneAvailableTabletServerNode() to avoid throwing an exception immediately. When the list of available tablet server nodes in the cluster is empty, we send a metadata request to update the cluster and then try again to get one available tablet server node.
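A minimal sketch of that flow, under assumptions: MetadataUtils, MAX_RETRY_TIMES, getCluster(), and getAliveTabletServerList() appear in the snippets reviewed below, and ServerNode and MetadataUpdater are existing Fluss client types, but updateMetadata() and the thrown exception type are placeholders for the actual refresh call and error handling, not the merged code:

import java.util.List;

public class MetadataUtils {

    private static final int MAX_RETRY_TIMES = 5;

    // Sketch of the behavior described above, not the exact PR code: retry
    // while the alive tablet server list is empty, refreshing the cluster
    // metadata between attempts so a retry can observe newly registered
    // servers instead of re-reading the same stale cache.
    public static ServerNode getOneAvailableTabletServerNode(MetadataUpdater metadataUpdater) {
        for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) {
            List<ServerNode> aliveTabletServers =
                    metadataUpdater.getCluster().getAliveTabletServerList();
            if (!aliveTabletServers.isEmpty()) {
                return aliveTabletServers.get(0);
            }
            // The cached cluster view has no alive tablet servers: send a
            // metadata request to refresh it before the next attempt
            // (updateMetadata() is a placeholder for the actual refresh call).
            metadataUpdater.updateMetadata();
        }
        throw new IllegalStateException(
                "No available tablet server node after " + MAX_RETRY_TIMES + " retries.");
    }
}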

metadataUpdater.getCluster()),
rpcClient,
AdminReadOnlyGateway.class);

Member:

Unnecessary change. Please revert this line.

Contributor Author:

Ok, I will revert this line.

public class MetadataUtils {
private static final Logger LOG = LoggerFactory.getLogger(MetadataUtils.class);

private static final int MAX_RETRY_TIMES = 5;
Member:

Could this retry time be configurable?

Contributor Author:

Yes, I will make it a system configuration option.
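A sketch of what that option could look like, assuming the Flink-style ConfigOptions builder that Fluss configuration code follows; the option name matches the snippet reviewed later in this thread, but the key string, default value, and description here are illustrative assumptions:

// Imports of ConfigOption/ConfigOptions omitted; the package depends on the
// Fluss version. The key, default value, and description are assumptions.
public static final ConfigOption<Integer> CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES =
        ConfigOptions.key("client.get-tablet-server-node.max-retry-times")
                .intType()
                .defaultValue(5)
                .withDescription(
                        "The maximum number of retries when getting one available "
                                + "tablet server node from the cluster metadata.");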

@polyzos (Contributor) commented Feb 19, 2025

@gkatzioura also added a retry mechanism here to fix some tests. I assume retry logic will be needed in other places as well, so it may make sense to introduce a class to handle that, or to use a library. WDYT? cc @wuchong

UPDATE: There is a retry method, but it's only available in the test utils.

@zhaozijun109 (Contributor Author)

@polyzos Good idea, I will try to create a common retry utility method. WDYT? @SteNicholas, please leave some comments, thank you.
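A sketch of what such a shared helper could look like; the class name, signature, and fixed backoff are hypothetical, not necessarily what was ultimately merged:

import java.util.function.Supplier;

// Hypothetical shared retry helper sketched from the discussion above.
public final class RetryUtils {

    private RetryUtils() {}

    // Retries the supplier until it returns a non-null value or the retry
    // budget is exhausted, sleeping a fixed backoff between attempts.
    public static <T> T retry(Supplier<T> supplier, int maxRetries, long backoffMs)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            T result = supplier.get();
            if (result != null) {
                return result;
            }
            if (attempt < maxRetries) {
                Thread.sleep(backoffMs);
            }
        }
        throw new IllegalStateException(
                "Operation did not succeed after " + maxRetries + " retries");
    }
}

With a helper like this, getOneAvailableTabletServerNode could pass a lambda that refreshes metadata and reads the alive server list, instead of hand-rolling its own loop.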

List<ServerNode> aliveTabletServers = null;
for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) {
aliveTabletServers = cluster.getAliveTabletServerList();
if (aliveTabletServers.isEmpty()) {
Contributor:

When aliveTabletServers is empty in the cluster, the retry doesn't work, since the cluster will still be empty for aliveTabletServers; we need to send a metadata request to update the cluster.

Contributor Author:

OK, I will adjust it, thank you.

@SteNicholas (Member) commented Feb 21, 2025

@zhaozijun109, @polyzos, IMO, the retry mechanism may differ for different behaviors, so it may not be necessary to introduce a common retry utility method.

@zhaozijun109 (Contributor Author)

@SteNicholas Thanks for your advice, I will re-adjust based on all the comments above.

@polyzos (Contributor) commented Feb 21, 2025

@SteNicholas Indeed, but there might be reuse of some implementations. My thought was that it might be best to keep them in a centralized place, to avoid introducing multiple retry implementations across the codebase, because I have already seen three different cases popping up in different places. Whatever you think works best.

@zhaozijun109 (Contributor Author)

@luoyuxia @SteNicholas Could you please review this again when you have free time? Thank you.

"Enable metrics for client. When metrics is enabled, the client "
+ "will collect metrics and report by the JMX metrics reporter.");

public static final ConfigOption<Integer> CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES =
@SteNicholas (Member) commented Feb 24, 2025

Add this option in configuration.md.

@SteNicholas (Member)

@zhaozijun109, could you create a new issue for this pull request? BTW, could you also add a description to this pull request?

@zhaozijun109 (Contributor Author)

@SteNicholas Thank you for your review, I will create a new issue.

@zhaozijun109 (Contributor Author)

@luoyuxia @SteNicholas The new issue is #480. Could you please review this again when you have free time? Thank you.

@wuchong linked an issue on Feb 25, 2025 that may be closed by this pull request.
@wuchong (Member) commented Feb 25, 2025

Thank you for your contribution, @zhaozijun109! You're absolutely right—the current client metadata implementation has significant issues with retrying, timeout handling, and error management. While this PR effectively addresses the specific case of getOneAvailableTabletServerNode, there are many other methods in the client metadata module that could benefit from similar improvements.

To tackle this comprehensively, I've created Issue #483 to propose a broader refactoring of the client metadata implementation. The goal is to address all recent bugs and provide a more robust, general solution.

@zhaozijun109 (Contributor Author)

@wuchong Thank you, I will continue studying the principles of Fluss and try my best to contribute. Also, refactoring the client metadata implementation sounds great; many methods will benefit from it.

@polyzos polyzos force-pushed the main branch 3 times, most recently from d88c76c to 434a4f4 Compare August 31, 2025 15:13


Development

Successfully merging this pull request may close these issues.

Support retry mechanism when get available tablet server nodes failed
