Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARTEMIS-3496 Replica connection to its live should fail fast #3771

Merged
merged 2 commits into from Oct 12, 2021

Conversation

franz1981
Copy link
Contributor

@franz1981
Copy link
Contributor Author

@clebertsuconic do you remember any reason why reconnect attempts was set to 1 on purpose on the replica connection to its live?

@clebertsuconic
Copy link
Contributor

it was set to 1 as in not meant to retry.. .a single connection...

its a bug... I meant for one connection only.

Replication and clustering should have the retries through the cluster bridge and clustered connection. not through the serverLocator.

Is there a test that fails with this? run the whole test suite just in case.. and if you can add a failing test it would be even greater.

@franz1981
Copy link
Contributor Author

franz1981 commented Sep 27, 2021

Is there a test that fails with this?

I'm still investigating on multiple test failure after applied this change :(
Probably I was using an old branch version that was leaking fds: just sent another round on the CI, let's see

@franz1981 franz1981 marked this pull request as ready for review September 27, 2021 17:32
@franz1981
Copy link
Contributor Author

@clebertsuconic I'm going to add a separate test for this tomorrow: I've rebased and now the CI is fully green with this change

@gtully
Copy link
Contributor

gtully commented Oct 7, 2021

I guess a test would be best, but this looks a trivial fix and it is important for time to recover from failure. I would like to see this in 2.19.0.

@gtully
Copy link
Contributor

gtully commented Oct 8, 2021

I started with mokito and wow was it involved, the end result is not pretty but it does validate the fix. comments welcome!

@gemmellr
Copy link
Member

Main comment, without looking at the code, the tests all hung and caused the GHA job to time out after 6 hours.

The tests should have an appropriate timeout to stop any one test taking an excessive amount of time before failing (also makes clear which one goes bang when they do, aiding following analysis of what happened).

@gtully
Copy link
Contributor

gtully commented Oct 11, 2021

fair, I guess it is related to the use of the default port, that needs sorting.

@gtully gtully force-pushed the ARTEMIS-3496 branch 3 times, most recently from daf5f5f to 650a92c Compare October 11, 2021 16:05
@gtully gtully merged commit 6f4c609 into apache:main Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants