quorum during self repair should fetch from all nodes #859

tenmoves
Contributor

Description

During self repair, we use a quorum read to fetch the missed transactions and the previous summary aggregate.
But if there is a synchronization error, the 3 nodes requested by the quorum may all return that the transaction does not exist, and the self repair crashes.
If the transaction is in the beacon summary, it exists, so it should be fetched during self repair.

If the quorum read returns an unexpected response (transaction does not exist, transaction invalid, ...), we remove the previously requested nodes from the node list and retry the quorum until we get the expected response or the node list is empty.
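
The retry described above can be sketched roughly as follows. This is a simplified illustration and not the actual `P2P.quorum_read` implementation: the module name, the `fetch_fun` and `accept?` callbacks (standing in for the network request and the acceptance resolver), and the `:acceptance_failed` error atom are all assumptions, and conflict resolution between differing responses is left out.

```elixir
defmodule QuorumRetrySketch do
  # Hypothetical sketch: `fetch_fun` stands in for sending the request to one
  # node, `accept?` stands in for the acceptance resolver.
  def quorum_read(nodes, fetch_fun, accept?, consistency_level \\ 3)

  # Node list exhausted without any acceptable response.
  def quorum_read([], _fetch_fun, _accept?, _consistency_level),
    do: {:error, :acceptance_failed}

  def quorum_read(nodes, fetch_fun, accept?, consistency_level) do
    # Ask the next batch of nodes; the others stay available for a retry.
    {batch, remaining} = Enum.split(nodes, consistency_level)

    case batch |> Enum.map(fetch_fun) |> Enum.find(accept?) do
      # Nothing acceptable: drop the nodes already asked and retry with the rest.
      nil -> quorum_read(remaining, fetch_fun, accept?, consistency_level)
      response -> {:ok, response}
    end
  end
end

# The first three nodes are out of sync, the others hold the transaction.
fetch = fn node -> if node in [:n1, :n2, :n3], do: :not_found, else: :transaction end
accept? = &(&1 != :not_found)

QuorumRetrySketch.quorum_read([:n1, :n2, :n3, :n4, :n5], fetch, accept?)
# => {:ok, :transaction}
```

With only the three out-of-sync nodes in the list, the same call returns `{:error, :acceptance_failed}` instead of crashing.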

Fixes #856

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Unit tests

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@tenmoves tenmoves added the bug, self repair, and core team labels Jan 24, 2023
@tenmoves tenmoves self-assigned this Jan 24, 2023
@tenmoves
Contributor Author

What happens if we get both network_issue and transaction_not_exists? Which one should we raise?
Should we expect errors other than these two?

@tenmoves tenmoves marked this pull request as ready for review January 25, 2023 12:11
@tenmoves tenmoves force-pushed the quorum_during_self_repair_should_fetch_from_all_nodes branch from 510d9c7 to 120dc16 Compare January 25, 2023 12:12
lib/archethic/p2p.ex (resolved)
lib/archethic/p2p.ex (outdated, resolved)
lib/archethic/transaction_chain.ex (outdated, resolved)
lib/archethic/p2p.ex (outdated, resolved)
message,
conflict_resolver,
acceptance_resolver,
consistency_level - 1,
Member

I think you should not reduce the consistency level, as we want at least 3 results (by default). Here only 2 nodes will be asked instead of 3, then 1, then 0.

Member

You forgot to remove one - 1

Contributor Author

ty
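
The consistency-level point raised in this thread can be illustrated with a small sketch. It only computes how many nodes each successive attempt would ask; `BatchSizeSketch` and `batch_sizes/3` are hypothetical names, and `step = -1` mimics the `consistency_level - 1` that the review asks to remove.

```elixir
defmodule BatchSizeSketch do
  # Sizes of the successive batches when each attempt asks `level` nodes and
  # the level changes by `step` between attempts.
  def batch_sizes(node_count, level, step) when node_count > 0 and level > 0 do
    asked = min(node_count, level)
    [asked | batch_sizes(node_count - asked, level + step, step)]
  end

  # No nodes left, or the level dropped to 0: nothing more is asked.
  def batch_sizes(_node_count, _level, _step), do: []
end

BatchSizeSketch.batch_sizes(9, 3, -1)
# => [3, 2, 1]   (three of the nine nodes are never asked)

BatchSizeSketch.batch_sizes(9, 3, 0)
# => [3, 3, 3]   (every attempt keeps the full consistency level)
```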

message,
conflict_resolver,
acceptance_resolver,
consistency_level - 1,
Member

You forgot to remove one - 1

end
)

assert {:ok, %NotFound{}} =
Member

@Neylix Neylix Jan 31, 2023

I ran this test multiple times, and in some cases the last message is not %NotFound. We should make sure the test is deterministic. Maybe the mocked %NotFound is returned faster than the last %Transaction, so the last element of the list given to the conflict resolver is %Transaction.

Contributor Author

I can't reproduce it on my machine, which is weird. Normally this shouldn't happen, but I will add a timer.sleep() before the NotFound message to make sure that the other messages have arrived.

@@ -130,6 +130,37 @@ defmodule Archethic.SelfRepair.Sync.TransactionHandlerTest do
)
end

test "download_transaction/2 should download the transaction even after a first failure" do
Member

@Neylix Neylix Jan 31, 2023

I think this test does not test what it should.
As this function uses quorum, it will send the request to the 3 connected nodes (consistency level of 3), so the first quorum attempt will get both results (network_issue and tx) and the conflict resolver will return the tx. So your test actually verifies that the conflict resolver works well.
To test the acceptance resolver behavior, you should add more nodes: the first 3 should return an error, then the next ones should return the transaction.

Contributor Author

you are right
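
The difference between the two test setups discussed in this thread can be sketched with a toy model. It is not the project's test code: `RetryScenarioSketch`, the node names, and the response atoms are hypothetical, and the function returns the number of batches asked so that one can see whether the retry path was exercised at all.

```elixir
defmodule RetryScenarioSketch do
  # Returns {result, batches_asked}. `responses` maps each node to what it
  # would answer; `level` is the number of nodes asked per attempt.
  def fetch(nodes, responses, level \\ 3, batches \\ 1)

  # Every node was asked and none returned the transaction.
  def fetch([], _responses, _level, batches), do: {:not_found, batches - 1}

  def fetch(nodes, responses, level, batches) do
    {batch, rest} = Enum.split(nodes, level)

    case Enum.find(batch, &(Map.fetch!(responses, &1) == :transaction)) do
      nil -> fetch(rest, responses, level, batches + 1)
      _node -> {:transaction, batches}
    end
  end
end

# Three nodes, one holding the transaction: the first batch already succeeds,
# so only the conflict resolution between responses is exercised.
three = %{n1: :network_issue, n2: :network_issue, n3: :transaction}
RetryScenarioSketch.fetch([:n1, :n2, :n3], three)
# => {:transaction, 1}

# Five nodes, the first three failing: a second batch is required, which is
# the behavior the acceptance resolver is meant to trigger.
five = %{n1: :network_issue, n2: :not_found, n3: :not_found, n4: :transaction, n5: :transaction}
RetryScenarioSketch.fetch([:n1, :n2, :n3, :n4, :n5], five)
# => {:transaction, 2}
```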

@Neylix Neylix merged commit 17d3e73 into archethic-foundation:develop Feb 1, 2023
@samuelmanzanera samuelmanzanera mentioned this pull request May 30, 2023
Labels
bug, core team, self repair
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quorum during self repair should fetch from all the node
3 participants