Skip to content

With multi-host HTTP client, stop after trying all the addresses#4272

Merged
merlimat merged 2 commits into
apache:masterfrom
merlimat:fix-multi-host-errors
May 15, 2019
Merged

With multi-host HTTP client, stop after trying all the addresses#4272
merlimat merged 2 commits into
apache:masterfrom
merlimat:fix-multi-host-errors

Conversation

@merlimat
Copy link
Copy Markdown
Contributor

Motivation

The current multi-host handler for HTTP request in pulsar-client-admin (from #4018) is going into an infinite loop of retries when there is a failure, up to the request timeout (of 5min).

We need to break it immediately after having tried all the addresses.

This is currently making tests to fail all the times on master.

@merlimat merlimat added the type/bug The PR fixed a bug or issue reported a bug label May 14, 2019
@merlimat merlimat added this to the 2.4.0 milestone May 14, 2019
@merlimat merlimat self-assigned this May 14, 2019
@jiazhai
Copy link
Copy Markdown
Member

jiazhai commented May 14, 2019

java tests failed with "Build timed out (after 400 minutes). Marking the build as aborted."

@jiazhai
Copy link
Copy Markdown
Member

jiazhai commented May 14, 2019

run java8 tests

Copy link
Copy Markdown
Member

@jiazhai jiazhai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch

@srkukarni
Copy link
Copy Markdown
Contributor

run integration tests

@sijie
Copy link
Copy Markdown
Member

sijie commented May 14, 2019

I think the default value was set to 60s in this pull request, but there was a conflicting PR #4235 merged before merging this PR hence the problem was not captured by the CI.

I am not sure if we should break the loop after retrying all the hosts or shall we just honor to the request timeout that the user sets. We can adjust the request timeout for our tests. we can continue the discussion in #4272

The motivation of having request timeout is that pulsar client should be honored to request timeout and the cli should try to do retries before request timeout. so I think the right fix is to fix the default value or change the default value for tests as it is a side effort introduced by #4235

@sijie
Copy link
Copy Markdown
Member

sijie commented May 14, 2019

If the change is to keep the original behavior for one host, I am fine with reverting the behavior back for one host. However based on my conversations with pulsar admin users, I would prefer keeping the behavior for multi-hosts as keeping retries until request timeout has elapsed.

@merlimat
Copy link
Copy Markdown
Contributor Author

However based on my conversations with pulsar admin users, I would prefer keeping the behavior for multi-hosts as keeping retries until request timeout has elapsed.

Considering that the async path is completely broken now for multi-host, this could be fixed at same time in subsequent PR.

@sijie
Copy link
Copy Markdown
Member

sijie commented May 14, 2019

the async path is completely broken now for multi-host

I don't think it is "broken". "multi-host" is an incomplete feature anyway.

@merlimat
Copy link
Copy Markdown
Contributor Author

I don't think it is "broken". "multi-host" is an incomplete feature anyway.

Sure, I mean it's not working in the current form. This PR was merged and I don't see any plan to add it for async path. I just discovered this by chance.. I guess it would have gone out otherwise.

@sijie
Copy link
Copy Markdown
Member

sijie commented May 14, 2019

This PR was merged and I don't see any plan to add it for async path. I just discovered this by chance.

There is a multi-tasks master issue for tracking all the multi-hosts features #3218. The only completed task is the java client. If you are interested in this feature, you should just follow the master issue.

@merlimat
Copy link
Copy Markdown
Contributor Author

If you are interested in this feature, you should just follow the master issue.

I'm not interested in this feature but these kind of problems should have been caught in the PR review time. This is not the first time something (perfectly avoidable) like this went in that broke master or production.

@sijie
Copy link
Copy Markdown
Member

sijie commented May 14, 2019

This is not the first time something (perfectly avoidable) like this went in that broke master or production.

I have explained at my first comment. The PR passed all the integration tests which didn't break master. The problem was that there are two concurrent merges, one set the default value to 5 minutes, one introduced multi-hosts feature. Such a problem can still happen if there are concurrent merges based on current CI pipeline, unless we add some sort of merge queues before merging.

@merlimat
Copy link
Copy Markdown
Contributor Author

I know about that, but still the PR had 2 problems:

  • Changed behavior for single-host URLs
  • Only half-implementation for multi-host URLs

@merlimat
Copy link
Copy Markdown
Contributor Author

run java8 tests

@merlimat
Copy link
Copy Markdown
Contributor Author

Again, I'm not talking about the merge problem. There's no easy solution for that and, thankfully is not something that comes up very frequently.

@sijie
Copy link
Copy Markdown
Member

sijie commented May 14, 2019

still the PR had 2 problems:
Changed behavior for single-host URLs

Sure. I have agreed with you on that. I am fine with changing the behavior back for single-host URL.

Only half-implementation for multi-host URLs

Explained above. The multi-host URL feature is a multi-tasks feature.

  1. I initiated the efforts. The java change was even merged in 2.3.0 without other features to complete. But I don't have time to complete. so this task falls back into a community driven task. People picked up the task to continue the task in their interests.

  2. No one is declaring the feature was completed in pulsar-admin. No Javadoc, no website documentation. Assume there are more interests from community to push the things forward to complete the whole picture.

  3. If the problem was the reviewers, I listened to your opinions but I do have my own opinions as well. I don't expect us to be aligned with everything. If we can find the path to the fix, let's move forward with the fix.

@srkukarni
Copy link
Copy Markdown
Contributor

run java8 tests

@merlimat
Copy link
Copy Markdown
Contributor Author

run java8 tests

@merlimat merlimat merged commit 6b2eaa8 into apache:master May 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type/bug The PR fixed a bug or issue reported a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants