IGNITE-20852 Opposite connection attempts may cause connection failure#2850

Closed

rpuch wants to merge 16 commits into apache:main from gridgain:ignite-20852

Contributor

rpuch commented Nov 18, 2023

https://issues.apache.org/jira/browse/IGNITE-20852

Thank you for submitting the pull request.

To streamline the review process of the patch and ensure better code quality
we ask both an author and a reviewer to verify the following:

The Review Checklist

Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
- There is a single JIRA ticket related to the pull request.
- The web-link to the pull request is attached to the JIRA ticket.
- The JIRA ticket has the Patch Available state.
- The description of the JIRA ticket explains WHAT was made, WHY and HOW.
- The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
Design: new code conforms with the design principles of the components it is added to.
Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
Code quality: code is clean and readable, necessary developer documentation is added if needed.
Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.

Notes

Apache Ignite Coding Guidelines

rpuch added 10 commits

November 20, 2023 13:09


          IGNITE-20852 / a late handshake always terminates itself

41a0544


          IGNITE-20852 / fix javadoc

ed216c1


          IGNITE-20852 / in case of a clinch, the client of the losing side get…

703ebe4

…s a ChannelAlreadyExistsException


          IGNITE-20852 / in case of a clinch, use the incoming handshake result…

60b31d6

…s either directly or after a retry; never throw ChannelAlreadyExistsException to the users of NettyClient


          IGNITE-20852 / remove handling of ChannelAlreadyExistsException from …

8f3f314

…DefaultMessagingService


          IGNITE-20852 / add javadoc

b4f3130


          IGNITE-20852 / change to CompletionStage

50e52b4


          IGNITE-20852 / fix checkstyle violations

671384f


          IGNITE-20852 / complete with competitor's future if the competitor is…

843d4f3

… available


          IGNITE-20852 / add more tests

0cac556

rpuch force-pushed the ignite-20852 branch from 76b91e1 to 0cac556 Compare

November 20, 2023 09:53

rpuch changed the title ~~IGNITE-20852 Connection attempt clinches may cause connection failure~~ IGNITE-20852 Opposite connection attempts may cause connection failure

sergey-chugunov-1985 reviewed

...les/network/src/main/java/org/apache/ignite/internal/network/handshake/HandshakeManager.java

+                   *
+                   * @return Final future that represents the handshake operation.
+                   */
+                  CompletionStage<NettySender> finalHandshakeFuture();

Contributor

sergey-chugunov-1985 Nov 21, 2023

How about different name - globalHandshakeFuture?

Contributor Author

rpuch Nov 23, 2023

'final' says that this is the result we are interested in. 'ultimate' seems to be ok as well, but it seems too loud.

'global' would be about a different property: not about the 'final result we want to obtain', but that it's common for everyone, and this is not true (another side would have its own future).

...rc/main/java/org/apache/ignite/internal/network/recovery/RecoveryClientHandshakeManager.java

+                      boolean ignorable = stopping.get() || !msg.reason().critical();
+                      if (ignorable) {
+                          LOG.debug("Handshake rejected by server: {}", msg.message());

Contributor

sergey-chugunov-1985 Nov 21, 2023

Maybe check for debug enabled?

Contributor Author

rpuch Nov 23, 2023

The check is already done inside debug(), so nothing is logged when the debug level is disabled. Adding the check here would save one call to message() (negligible) and one allocation of the vararg array. That saving does not seem important because this code is not hot: we don't handle a million handshakes per second. In exchange, we would pay with an extra line of code. I'm not sure it's worth it.
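To make the trade-off in this thread concrete, here is a minimal sketch. `SketchLogger` and `RejectionMessage` are hypothetical stand-ins written for this illustration, not the Ignite logger or the real message class; the only assumption carried over from the thread is that the level check happens inside debug().

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a logger whose debug() checks the level internally. */
final class SketchLogger {
    private final boolean debugEnabled;
    final AtomicInteger written = new AtomicInteger();

    SketchLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    boolean isDebugEnabled() {
        return debugEnabled;
    }

    void debug(String template, Object... params) {
        // The level check lives here, so nothing is written when debug is off.
        if (debugEnabled) {
            written.incrementAndGet();
        }
    }
}

/** Hypothetical stand-in for the rejection message; counts calls to message(). */
final class RejectionMessage {
    final AtomicInteger messageCalls = new AtomicInteger();

    String message() {
        messageCalls.incrementAndGet();
        return "rejected";
    }
}

final class DebugGuardSketch {
    /** Unguarded: message() is evaluated and a vararg array is allocated even if debug is off. */
    static void logUnguarded(SketchLogger log, RejectionMessage msg) {
        log.debug("Handshake rejected by server: {}", msg.message());
    }

    /** Guarded: one extra line, but arguments are not evaluated when debug is off. */
    static void logGuarded(SketchLogger log, RejectionMessage msg) {
        if (log.isDebugEnabled()) {
            log.debug("Handshake rejected by server: {}", msg.message());
        }
    }
}
```

With debug disabled, both variants write nothing; the only difference is whether message() is called and the argument array is created, which matters only on hot paths.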

...rc/main/java/org/apache/ignite/internal/network/recovery/RecoveryClientHandshakeManager.java Outdated

+                   * Master future used to complete the handshake either with the results of this handshake of the competing one
+                   * (in the opposite direction), if it wins.
+                   */
+                  private final CompletableFuture<CompletionStage<NettySender>> masterHandshakeCompleteFuture = new CompletableFuture<>();

Contributor

sergey-chugunov-1985 Nov 22, 2023

And here it could be final, terminal or resulting - WDYT?

Contributor Author

rpuch Nov 23, 2023

Changed it to resulting
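The nested type in the field above, `CompletableFuture<CompletionStage<NettySender>>`, can be read as "a future that decides whose handshake result to use". The sketch below shows one way such a future could be flattened for callers; it is an illustration under assumptions, not the code from this patch. The class name, `completeLocally` and `yieldTo` are hypothetical, and a generic `T` stands in for `NettySender`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

/** Hypothetical sketch: T stands in for NettySender. */
final class ResultingFutureSketch<T> {
    /** Completed when this side's own handshake finishes. */
    private final CompletableFuture<T> localHandshakeFuture = new CompletableFuture<>();

    /** Completed exactly once with the stage whose result callers should see. */
    private final CompletableFuture<CompletionStage<T>> resultingFuture = new CompletableFuture<>();

    /** This handshake won (or had no competitor): expose its own result. */
    void completeLocally(T sender) {
        localHandshakeFuture.complete(sender);
        resultingFuture.complete(localHandshakeFuture);
    }

    /** This handshake lost: expose the competitor's result instead. */
    void yieldTo(CompletionStage<T> competitor) {
        resultingFuture.complete(competitor);
    }

    /** Flattens the nested future, so callers see a single stage either way. */
    CompletionStage<T> finalHandshakeFuture() {
        return resultingFuture.thenCompose(stage -> stage);
    }
}
```

A caller holding the stage returned by `finalHandshakeFuture()` does not need to know which side won: the stage completes with the local result or, after `yieldTo`, with whatever the competing handshake produces.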

...rc/main/java/org/apache/ignite/internal/network/recovery/RecoveryClientHandshakeManager.java

+                      );
+                      DescriptorAcquiry myAcquiry = descriptor.holder();
+                      assert myAcquiry != null;

Contributor

sergey-chugunov-1985 Nov 22, 2023

I may be a bit paranoid here, but I would really prefer to have some sort of IDs on recovery descriptors. I cannot imagine a scenario in which we get a HandshakeRejectedMessage out of thin air, but if we do, we'll fail on these asserts immediately.

Or these messages could be constructed maliciously e.g. to fail a node so this code could be a security vulnerability.

What do you think about these ideas? However this is not a blocker for this improvement right now.

Contributor Author

rpuch Nov 23, 2023

We cannot do anything here 'easily', and the protocol seems to be designed around trust in the other side. If this has to be changed, we'll redesign the protocol, but I think this should be solved by other means (firewall and TLS auth).

...es/network/src/main/java/org/apache/ignite/internal/network/recovery/RecoveryDescriptor.java

+                      if (oldAcquiry != null && oldAcquiry.channel() == ctx.channel()) {
+                          // We have successfully released the descriptor.
+                          // Let's mark the clinch resolved just in case.
+                          oldAcquiry.markClinchResolved();

Contributor

sergey-chugunov-1985 Nov 22, 2023

Can we check if the acquiry is in clinch state? AFAIK it should be a wrong state for acquiry here, so this fact deserves to be logged for further investigation.

Contributor Author

rpuch Nov 23, 2023

An acquiry is a local thing, and a clinch is a distributed state, so we cannot see whether there is (was) a clinch or not.

Here, it's just a cleanup procedure to make sure that we always release the clinch (if it existed).

.../main/java/org/apache/ignite/internal/network/recovery/message/HandshakeRejectionReason.java Outdated

+                   * Returns {@code true} iff the rejection is not expected and should be treated as a critical failure (requiring
+                   * the rejected node to restart).
+                   */
+                  public boolean critical() {

Contributor

sergey-chugunov-1985 Nov 22, 2023

How about renaming critical to something like hazardous to make it clearer that we'd better send this node into the FailureHandler mouth?

Contributor Author

rpuch Nov 23, 2023

Renamed it to restartRequired()
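For readers following the rename, here is a small sketch of how a reason enum with `restartRequired()` might feed the `ignorable` check quoted earlier from RecoveryClientHandshakeManager. The enum name and its two constants are invented for illustration; the actual constants of `HandshakeRejectionReason` are not shown in this conversation.

```java
/** Hypothetical reasons; the real enum constants are not shown in this PR conversation. */
enum SketchRejectionReason {
    /** An expected rejection, for example when losing to a competing handshake. */
    EXPECTED_REJECTION(false),
    /** An unexpected rejection that should be escalated. */
    UNEXPECTED_REJECTION(true);

    private final boolean restartRequired;

    SketchRejectionReason(boolean restartRequired) {
        this.restartRequired = restartRequired;
    }

    /** Returns true iff the rejected node is expected to restart. */
    boolean restartRequired() {
        return restartRequired;
    }
}

final class RejectionHandlingSketch {
    /** Mirrors the shape of the check in the diff: a rejection is ignorable when stopping or when no restart is needed. */
    static boolean ignorable(boolean stopping, SketchRejectionReason reason) {
        return stopping || !reason.restartRequired();
    }
}
```

Read this way, the new name states the consequence for the rejected node directly, which is the point the reviewer raised about `critical`.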

rpuch added 6 commits

November 23, 2023 14:57


          IGNITE-20852 / fix English grammar mistake

4a71cc9


          IGNITE-20852 / remove TODOs

b56fcf8


          IGNITE-20852 / rename a method

4094d09


          IGNITE-20852 / rename a field

e7b5ea7


          IGNITE-20852 / rename a field

0765bdb


          IGNITE-20852 / fix checkstyle violations

9790f69

sergey-chugunov-1985 approved these changes

asfgit pushed a commit that referenced this pull request


          IGNITE-20852 Simultaneous incoming and outgoing connection attempts m…

29dedf6

…ay cause connection failure (#2850)

* Handshake protocol is extended to allow a node losing a clinch to notify its origin
* As a result of a handshake, the caller always gets a NettySender, even if the caller lost the clinch
* If, during a handshake, a party cannot obtain a lock at its side, it gives way to the competitor unconditionally (as the competitor has advanced further)

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>
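The last bullet of the commit message (give way when the local lock cannot be obtained) can be sketched as follows. This is a simplified, single-process illustration under assumptions: `DescriptorSketch` and `acquireOrYield` are hypothetical names, a `String` stands in for `NettySender`, and the real recovery descriptor and handshake messages are more involved.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a per-peer descriptor that at most one handshake may hold. */
final class DescriptorSketch {
    /** Future of the handshake currently holding the descriptor, or null if it is free. */
    private final AtomicReference<CompletableFuture<String>> holder = new AtomicReference<>();

    /**
     * Tries to take the descriptor for the handshake represented by {@code mine}.
     * Returns {@code mine} on success; otherwise returns the holder's future,
     * meaning the caller gives way to the competitor that got there first.
     */
    CompletableFuture<String> acquireOrYield(CompletableFuture<String> mine) {
        // compareAndExchange (Java 9+) returns the witness value: null if we acquired.
        CompletableFuture<String> witness = holder.compareAndExchange(null, mine);
        return witness == null ? mine : witness;
    }

    /** Releases the descriptor only if it is still held by {@code mine}. */
    void release(CompletableFuture<String> mine) {
        holder.compareAndSet(mine, null);
    }
}
```

In this sketch the losing side never fails: it simply waits on the future of the handshake that already holds the descriptor, which matches the commit message's statement that the caller always gets a sender.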

rpuch closed this
