With multi-host HTTP client, stop after trying all the addresses #4272
java tests failed with "Build timed out (after 400 minutes). Marking the build as aborted."
run java8 tests
run integration tests
I think the default value was set to 60s in this pull request, but a conflicting PR (#4235) was merged before this one, so the problem was not caught by CI. I am not sure whether we should break the loop after retrying all the hosts or simply honor the request timeout that the user sets. We can adjust the request timeout for our tests, and we can continue the discussion in #4272. The motivation for having a request timeout is that the Pulsar client should honor it, and the CLI should keep retrying until the request timeout is reached. So I think the right fix is to correct the default value, or to change the default value for tests, since this is a side effect introduced by #4235.
If the change is to keep the original behavior for one host, I am fine with reverting the behavior for one host. However, based on my conversations with Pulsar admin users, I would prefer that multi-host URLs keep retrying until the request timeout has elapsed.
Considering that the async path is completely broken now for multi-host, this could be fixed at the same time in a subsequent PR.
I don't think it is "broken". "multi-host" is an incomplete feature anyway.
Sure, I mean it's not working in its current form. This PR was merged and I don't see any plan to add it for the async path. I just discovered this by chance. I guess it would have gone out otherwise.
There is a multi-task master issue tracking all the multi-host features: #3218. The only completed task is the Java client. If you are interested in this feature, you should just follow the master issue.
I'm not interested in this feature, but these kinds of problems should have been caught at PR review time. This is not the first time something (perfectly avoidable) like this went in and broke master or production.
I explained this in my first comment. The PR passed all the integration tests and did not break master. The problem was that there were two concurrent merges: one set the default value to 5 minutes, and the other introduced the multi-host feature. Such a problem can still happen with concurrent merges under the current CI pipeline, unless we add some sort of merge queue before merging.
I know about that, but still the PR had 2 problems:
run java8 tests
Again, I'm not talking about the merge problem. There's no easy solution for that and, thankfully, it is not something that comes up very frequently.
Sure. I have agreed with you on that. I am fine with changing the behavior back for single-host URL.
Explained above. The multi-host URL feature is a multi-tasks feature.
run java8 tests
run java8 tests
Motivation
The current multi-host handler for HTTP requests in pulsar-client-admin (from #4018) keeps retrying in a loop when there is a failure, until the request timeout (5 min) elapses.
We need to break out of the loop immediately after having tried all the addresses.
This is currently causing tests to fail consistently on master.
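To illustrate the intended behavior, here is a minimal sketch of "try each address once, then stop". It is not the code from this PR or from pulsar-client-admin: the class `MultiHostRetry`, the method `callOnce`, the exception `AllHostsFailedException`, and the host names are all hypothetical names chosen for the example, and the request is modeled as a plain `Function` rather than a real HTTP call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class MultiHostRetry {

    /** Hypothetical exception: every configured address was tried once and all failed. */
    public static class AllHostsFailedException extends RuntimeException {
        public AllHostsFailedException(String message, Throwable lastCause) {
            super(message, lastCause);
        }
    }

    /**
     * Tries the request against each host at most once, in order.
     * Returns the first successful result; if every host fails, throws
     * instead of looping again until a request timeout expires.
     */
    public static <T> T callOnce(List<String> hosts, Function<String, T> request) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no hosts configured");
        }
        RuntimeException last = null;
        for (String host : hosts) {
            try {
                return request.apply(host);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next address
            }
        }
        // All addresses have been tried exactly once: stop here.
        throw new AllHostsFailedException(
                "all " + hosts.size() + " addresses failed", last);
    }

    public static void main(String[] args) {
        // Placeholder host names, assumed for the example only.
        List<String> hosts = Arrays.asList("broker-a:8080", "broker-b:8080");
        try {
            MultiHostRetry.<String>callOnce(hosts, host -> {
                throw new IllegalStateException(host + " unreachable");
            });
        } catch (AllHostsFailedException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this shape, a total failure is reported after one pass over the address list, which bounds the number of attempts by the number of hosts rather than by the request timeout.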