Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive checker implementation. #225

Merged
merged 7 commits into from
Jan 14, 2023
Merged

Conversation

ptabor
Copy link
Contributor

@ptabor ptabor commented Jun 26, 2020

Recursive checker confirms database consistency with respect to b-tree
key order constraints:

  • keys on pages must be sorted
  • keys on children pages are between 2 consecutive keys on parent
    branch page).

This PR moves checker logic to separate tx_checker file as the logic is already substantial (200).
It does not modify the original logic apart of running recursivelyCheckPages.

The reason to create this stronger checker is data corruption issue in etcd I was investigating that led to following branch state:

000000000000014d5f0000000000000000: <pgid=618>
000000000000014d5f0000000000000000: <pgid=778>
00000000013a29135f0000000000000000: <pgid=343>

...

Such state - according to the previous checker was correct.
The new checker reports:

key (0: 000000000000014d5f0000000000000000) on the branch page(618) needs to be < than key of the next element reachable from the ancestor (000000000000014d5f0000000000000000). Pages stack: [56 618]

@mndrix
Copy link
Contributor

mndrix commented Aug 29, 2020

I'm not from etcd, but I've incorporated parts of this patch into my private fork. Thanks.

@ahrtr
Copy link
Member

ahrtr commented Nov 30, 2022

@ptabor Let's resurrect this PR. Could you please rebase this PR to resolve conflict?

@ptabor
Copy link
Contributor Author

ptabor commented Dec 1, 2022

[scheduling this on my TODO queue]

@ptabor ptabor force-pushed the extend_checker2 branch 3 times, most recently from 8d1bd97 to bdade3c Compare December 8, 2022 09:25
@ptabor
Copy link
Contributor Author

ptabor commented Dec 8, 2022

Resurrected. But still need to add a test.

@ptabor
Copy link
Contributor Author

ptabor commented Jan 4, 2023

The PR has tests and resolved conflicts.

cmd/bbolt/main.go Outdated Show resolved Hide resolved
internal/tests/tx_check_test.go Outdated Show resolved Hide resolved
tx_check.go Outdated Show resolved Hide resolved
tx_check.go Outdated Show resolved Hide resolved
tx_check.go Outdated Show resolved Hide resolved
tx_check.go Outdated
maxKeyInSubtree = tx.recursivelyCheckPagesInternal(elem.pgid, elem.key(), maxKey, pagesStack, keyToString, ch)
runningMin = maxKeyInSubtree
}
return
Copy link
Member

@ahrtr ahrtr Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to define maxKey outside the for loop

Suggested change
return
return maxKey

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return maxKeyInSubtree that is implicit ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we should return maxKeyInSubtree. For the rightest element in a branch page, the maxKey in the page it links to should be the largest key in the tree rooted from current branch page.

tx_check.go Outdated Show resolved Hide resolved
xRay := surgeon.NewXRay(db.Path())

path1, err := xRay.FindPathsToKey([]byte("0451"))
require.NoError(t, err, "Cannot find page that contains key:'0451'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: better to lower case the first letter when a message is wrapped in any other utilities (github.com/stretchr/testify in this case) instead of being outputted directly. This comment applies to the following similar messages as well.

Suggested change
require.NoError(t, err, "Cannot find page that contains key:'0451'")
require.NoError(t, err, "cannot find page that contains key:'0451'")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


import (
"fmt"
"github.com/stretchr/testify/require"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reorder this import item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

tx_check.go Outdated
// - Are in right ordering relationship to their parents.
// `pagesStack` is expected to contain IDs of pages from the tree root to `pgid` for the clean debugging message.
func (tx *Tx) recursivelyCheckPagesInternal(
pgid pgid, minKeyClosed, maxKeyOpen []byte, pagesStack []pgid,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pgid pgid... the name is the same string as the type. I know it works, but I'd suggest to avoid it to make it clearer and less confusion.

How about change it to something like pid pgid ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done pgId pgid --> pid sounds too much like process id.

tx_check.go Outdated
maxKeyInSubtree = tx.recursivelyCheckPagesInternal(elem.pgid, elem.key(), maxKey, pagesStack, keyToString, ch)
runningMin = maxKeyInSubtree
}
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we should return maxKeyInSubtree. For the rightest element in a branch page, the maxKey in the page it links to should be the largest key in the tree rooted from current branch page.

tx_check.go Outdated
for i := range p.leafPageElements() {
elem := p.leafPageElement(uint16(i))
if i == 0 && runningMin != nil && compareKeys(runningMin, elem.key()) > 0 {
ch <- fmt.Errorf("The first key[%d]=(hex)%s on leaf page(%d) needs to be >= the key in the ancestor (%s). Stack: %v",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ch <- fmt.Errorf("The first key[%d]=(hex)%s on leaf page(%d) needs to be >= the key in the ancestor (%s). Stack: %v",
ch <- fmt.Errorf("the first key[%d]=(hex)%s on leaf page(%d) needs to be >= the key in the ancestor (%s). Stack: %v",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ahrtr
Copy link
Member

ahrtr commented Jan 13, 2023

In (*Tx) recursivelyCheckPages, the case p.flags&branchPageFlag != 0 should have almost the same logic as case p.flags&leafPageFlag != 0. The only difference is that it needs to recursively check the child page each branch element links to.

So the existing logic in case p.flags&branchPageFlag != 0 isn't completed as compared to the case p.flags&leafPageFlag != 0.

I would suggest to implement a common function or method to be shared by both cases.

Signed-off-by: Piotr Tabor <ptab@google.com>
Recursive checker confirms database consistency with respect to b-tree
key order constraints:
  - keys on pages must be sorted
  - keys on children pages are between 2 consecutive keys on parent
branch page).

Signed-off-by: Piotr Tabor <ptab@google.com>
Signed-off-by: Piotr Tabor <ptab@google.com>
Signed-off-by: Piotr Tabor <ptab@google.com>
Signed-off-by: Piotr Tabor <ptab@google.com>
Signed-off-by: Piotr Tabor <ptab@google.com>
Signed-off-by: Piotr Tabor <ptab@google.com>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ptabor ptabor merged commit 51370e4 into etcd-io:master Jan 14, 2023
@ahrtr ahrtr added this to the 1.3.7 milestone Jan 16, 2023
// because of caching. This overhead can be removed if running on a read-only
// transaction, however, it is not safe to execute other writer transactions at
// the same time.
func (tx *Tx) Check(kvStringer KVStringer) <-chan error {
Copy link
Member

@ahrtr ahrtr Jan 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. It changed the signature of method Check. Previously there was no parameter. FYI. etcd-io/etcd@87c0da8

The proposals:

  1. keep it as it's, and add a breaking change note when we release v1.3.7;
  2. revert the change, and add a separate method something like CheckWithStringer(kvString KVString), and use a default KVStringer (e.g. bolt.HexKVStringer) when calling Check();
  3. allow users to pass a nil KVStringer, and we use a default one in that case.

"2" seems to be the safest solution, because it has no any impact on users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants