storage: slice bounds out of range during a restore #64508

Open
adityamaru opened this issue May 1, 2021 · 4 comments
Labels
A-disaster-recovery · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · T-disaster-recovery

Comments

adityamaru (Contributor) commented May 1, 2021

While stressing https://github.com/cockroachdb/cockroach/pull/64136/files#diff-bba123fef2874274ad1daec1f4663fe6aa4dc555e1bf655015414d4fa6c4a9acR8153, I occasionally run into a panic with the following stack trace:

panic: runtime error: slice bounds out of range [-8:]

goroutine 29742 [running]:
github.com/cockroachdb/pebble/sstable.readFooter(0x8f345c0, 0xc005da0690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc000118030, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/table.go:233 +0xcac
github.com/cockroachdb/pebble/sstable.NewReader(0x8f345c0, 0xc005da0690, 0x0, 0xb096ce0, 0x0, 0x8276f98, 0x12, 0x0, 0x0, 0x0, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2302 +0x233
github.com/cockroachdb/cockroach/pkg/storage.NewSSTIterator(0x8f345c0, 0xc005da0690, 0x0, 0x0, 0x4000000000000000, 0x2)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/storage/sst_iterator.go:42 +0xa5
github.com/cockroachdb/cockroach/pkg/ccl/storageccl.ExternalSSTReader(0x8f91c00, 0xc0044d7358, 0x8fe28e0, 0xc0010271f0, 0xc005243d00, 0x16, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/ccl/storageccl/import.go:360 +0x390
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).processRestoreSpanEntry(0xc00464ed00, 0xc0055a0348, 0x3, 0x8, 0xc0055a0350, 0x3, 0x8, 0xc0025fdd90, 0x1, 0x1, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:183 +0x3c5
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).Next(0xc00464ed00, 0x0, 0x0, 0x0, 0xc005e7bab0)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:135 +0x52b
github.com/cockroachdb/cockroach/pkg/sql/execinfra.Run(0x8f91c00, 0xc0044d7358, 0x8f96a80, 0xc00464ed00, 0x8f45140, 0xc00390f180)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:175 +0x35
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBase).Run(0xc00464ed00, 0x8f74740, 0xc0026db480)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:774 +0x96
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal.func1(0xc002eebb30, 0x2, 0x3, 0x8f74740, 0xc0026db480, 0xc004ce6900, 0x1)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:327 +0x5c
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:326 +0x2e8
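
For anyone else reading the trace: the [-8:] is consistent with a footer check slicing the trailing 8-byte table magic off a buffer that came back shorter than 8 bytes. A minimal sketch of that pattern (not the actual pebble source):

```go
package main

// SSTable footers end in an 8-byte magic string; the reader slices it off the
// tail of the buffer it read. If that buffer is shorter than 8 bytes (e.g. a
// zero-byte read), len(buf)-8 is negative and the slice expression panics.
const magicLen = 8

func trailingMagic(buf []byte) []byte {
	// Panics with "slice bounds out of range [-8:]" when len(buf) < magicLen.
	return buf[len(buf)-magicLen:]
}

func main() {
	_ = trailingMagic([]byte("0123456789abcdef")) // fine: returns the last 8 bytes
	_ = trailingMagic(nil)                        // panics: slice bounds out of range [-8:]
}
```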

The test shuts down a node during the backup to ensure the backup job is resilient to transient node failures. Once the backup job completes, we attempt to restore from the backup to check its correctness. During this restore, I have seen us hit the above panic. I've managed to grab the test logs for one such failure, and I am attempting to reproduce it so that I can grab the backup as well.

restorepanic.txt

Jira issue: CRDB-7092

adityamaru added the C-bug label May 1, 2021
adityamaru added this to Incoming in Storage via automation May 1, 2021
adityamaru added this to Triage in Disaster Recovery Backlog via automation May 1, 2021
mwang1026 moved this from Triage to Bug in Disaster Recovery Backlog May 3, 2021
adityamaru removed this from Incoming in Storage May 3, 2021
dt (Member) commented May 3, 2021

I suspect this is our virtual file-like reader that streams reads: Stat said the file had bytes, but then Read didn't return any. Maybe a userfile bug. I'll dig a bit.
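
A minimal sketch of that suspected failure mode, with hypothetical types standing in for the userfile-backed reader (not the real implementation): the size metadata reports a positive length, so any size checks pass, but the streamed read hands back zero bytes, and the subsequent footer-magic slice goes negative.

```go
package main

import (
	"fmt"
	"io"
)

// fakeStreamingFile is a hypothetical stand-in for a virtual, streaming file
// whose size metadata and streamed contents disagree.
type fakeStreamingFile struct{}

// Size reports that the file has bytes (as Stat did in the failure)...
func (fakeStreamingFile) Size() int64 { return 1 << 20 }

// ...but ReadAt hands back nothing, as if the stream were already exhausted.
func (fakeStreamingFile) ReadAt(p []byte, off int64) (int, error) { return 0, io.EOF }

func readFooterish(f fakeStreamingFile) {
	const footerLen = 53 // illustrative only, not pebble's real constant
	buf := make([]byte, footerLen)
	n, _ := f.ReadAt(buf, f.Size()-footerLen)
	buf = buf[:n]        // n == 0, so buf is empty
	_ = buf[len(buf)-8:] // panics: slice bounds out of range [-8:]
}

func main() {
	defer func() { fmt.Println("recovered:", recover()) }()
	readFooterish(fakeStreamingFile{})
}
```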

adityamaru (Contributor, Author) commented Mar 17, 2023

FWIW, this still happens every time I stress TestBackupWorkerFailure in my six-monthly attempt to unskip the test. I'm going to switch to nodelocal to see whether it's a userfile-specific issue.

adityamaru (Contributor, Author) commented Mar 28, 2023

This should be investigated using TestBackupWorkerFailure as the reproduction. It has been failing with this error since 2021. This might be related to #98964.

adityamaru (Contributor, Author) commented
This might be fixed by #106503. Try stressing the test after the change merges.

dt moved this from Bug to Backlog in Disaster Recovery Backlog Feb 13, 2024