Skip to content

IGNITE-20852 Opposite connection attempts may cause connection failure#2850

Closed
rpuch wants to merge 16 commits intoapache:mainfrom
gridgain:ignite-20852
Closed

IGNITE-20852 Opposite connection attempts may cause connection failure#2850
rpuch wants to merge 16 commits intoapache:mainfrom
gridgain:ignite-20852

Conversation

@rpuch
Copy link
Contributor

@rpuch rpuch commented Nov 18, 2023

https://issues.apache.org/jira/browse/IGNITE-20852

Thank you for submitting the pull request.

To streamline the review process of the patch and ensure better code quality
we ask both an author and a reviewer to verify the following:

The Review Checklist

  • Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
    - There is a single JIRA ticket related to the pull request.
    - The web-link to the pull request is attached to the JIRA ticket.
    - The JIRA ticket has the Patch Available state.
    - The description of the JIRA ticket explains WHAT was made, WHY and HOW.
    - The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
  • Design: new code conforms with the design principles of the components it is added to.
  • Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
  • Code quality: code is clean and readable, necessary developer documentation is added if needed.
  • Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.

Notes

@rpuch rpuch changed the title IGNITE-20852 Connection attempt clinches may cause connection failure IGNITE-20852 Opposite connection attempts may cause connection failure Nov 20, 2023
*
* @return Final future that represents the handshake operation.
*/
CompletionStage<NettySender> finalHandshakeFuture();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about different name - globalHandshakeFuture?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'final' says that this is the result we are interested in. 'ultimate' seems to be ok as well, but it seems too loud.

'global' would be about a different property: not about the 'final result we want to obtain', but that it's common for everyone, and this is not true (another side would have its own future).

boolean ignorable = stopping.get() || !msg.reason().critical();

if (ignorable) {
LOG.debug("Handshake rejected by server: {}", msg.message());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check for debug enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside, there is a check, so if debug level is not enabled, nothing will be logged. If we add the check here, we'll save one method call to message() (negligible) and one allocation to create a vararg array. This saving seems to be not important as this code is not hot, we don't handle a million handshakes per second. But we'll have to pay with one line for this. I'm not sure it's worth it.

* Master future used to complete the handshake either with the results of this handshake of the competing one
* (in the opposite direction), if it wins.
*/
private final CompletableFuture<CompletionStage<NettySender>> masterHandshakeCompleteFuture = new CompletableFuture<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here it could be final, terminal or resulting - WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to resulting

);

DescriptorAcquiry myAcquiry = descriptor.holder();
assert myAcquiry != null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I maybe a bit paranoid here but I really would prefer to have some sort of IDs on recovery descriptors. I cannot imaging a scenario when we get a HandshakeRejectedMessage out of thin air but if we do we'll fail on these asserts immediately.

Or these messages could be constructed maliciously e.g. to fail a node so this code could be a security vulnerability.

What do you think about these ideas? However this is not a blocker for this improvement right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do anything here 'easily', and the protocol seems to be designed around trust to the other side. If this has to be changed, we'll redesign the protocol, but I think this should be solved by other means (firewall and TLS auth)

if (oldAcquiry != null && oldAcquiry.channel() == ctx.channel()) {
// We have successfully released the descriptor.
// Let's mark the clinch resolved just in case.
oldAcquiry.markClinchResolved();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check if the acquiry is in clinch state? AFAIK it should be a wrong state for acquiry here, so this fact deserves to be logged for further investigation.

Copy link
Contributor Author

@rpuch rpuch Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An acquiry is a local thing, and a clinch is a distributed state, so we cannot see whether there is (was) a clinch or not.

Here, it's just a cleanup procedure to make sure that we always release the clinch (if it existed).

* Returns {@code true} iff the rejection is not expected and should be treated as a critical failure (requiring
* the rejected node to restart).
*/
public boolean critical() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming critical to something like hazardous to make it clearer that we'd better send this node into the FailureHandler mouth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed it to restartRequired()

asfgit pushed a commit that referenced this pull request Nov 23, 2023
…ay cause connection failure (#2850)

* Handshake protocol is extended to allow a node losing a clinch notify its origin
* As a result of a handshake, the caller always gets a NettySender, even if the caller lost the clinch
* If, during a handshake, a party cannot obtain a lock at its side, it gives the competitor way unconditionally (as the competitor has advanced further)

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>
@rpuch rpuch closed this Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants