storage: slice bounds out of range during a restore #64508

Open
adityamaru opened this issue May 1, 2021 · 4 comments
Labels
A-disaster-recovery · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · T-disaster-recovery

Comments

adityamaru (Contributor) commented May 1, 2021

While stressing https://github.com/cockroachdb/cockroach/pull/64136/files#diff-bba123fef2874274ad1daec1f4663fe6aa4dc555e1bf655015414d4fa6c4a9acR8153, I occasionally run into a panic with the following stack trace:

panic: runtime error: slice bounds out of range [-8:]

goroutine 29742 [running]:
github.com/cockroachdb/pebble/sstable.readFooter(0x8f345c0, 0xc005da0690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc000118030, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/table.go:233 +0xcac
github.com/cockroachdb/pebble/sstable.NewReader(0x8f345c0, 0xc005da0690, 0x0, 0xb096ce0, 0x0, 0x8276f98, 0x12, 0x0, 0x0, 0x0, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2302 +0x233
github.com/cockroachdb/cockroach/pkg/storage.NewSSTIterator(0x8f345c0, 0xc005da0690, 0x0, 0x0, 0x4000000000000000, 0x2)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/storage/sst_iterator.go:42 +0xa5
github.com/cockroachdb/cockroach/pkg/ccl/storageccl.ExternalSSTReader(0x8f91c00, 0xc0044d7358, 0x8fe28e0, 0xc0010271f0, 0xc005243d00, 0x16, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/ccl/storageccl/import.go:360 +0x390
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).processRestoreSpanEntry(0xc00464ed00, 0xc0055a0348, 0x3, 0x8, 0xc0055a0350, 0x3, 0x8, 0xc0025fdd90, 0x1, 0x1, ...)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:183 +0x3c5
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).Next(0xc00464ed00, 0x0, 0x0, 0x0, 0xc005e7bab0)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:135 +0x52b
github.com/cockroachdb/cockroach/pkg/sql/execinfra.Run(0x8f91c00, 0xc0044d7358, 0x8f96a80, 0xc00464ed00, 0x8f45140, 0xc00390f180)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:175 +0x35
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBase).Run(0xc00464ed00, 0x8f74740, 0xc0026db480)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:774 +0x96
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal.func1(0xc002eebb30, 0x2, 0x3, 0x8f74740, 0xc0026db480, 0xc004ce6900, 0x1)
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:327 +0x5c
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal
        /Users/adityamaru/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:326 +0x2e8
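
For anyone else reading the trace: the [-8:] is consistent with a footer check slicing the trailing 8-byte table magic off a buffer that came back shorter than 8 bytes. A minimal sketch of that pattern (not the actual pebble source):

```go
package main

// SSTable footers end in an 8-byte magic string; the reader slices it off the
// tail of the buffer it read. If that buffer is shorter than 8 bytes (e.g. a
// zero-byte read), len(buf)-8 is negative and the slice expression panics.
const magicLen = 8

func trailingMagic(buf []byte) []byte {
	// Panics with "slice bounds out of range [-8:]" when len(buf) < magicLen.
	return buf[len(buf)-magicLen:]
}

func main() {
	_ = trailingMagic([]byte("0123456789abcdef")) // fine: returns the last 8 bytes
	_ = trailingMagic(nil)                        // panics: slice bounds out of range [-8:]
}
```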

The test shuts down a node during the backup to ensure the backup job is resilient to transient node failures. Once the backup job completes, we attempt to restore from the backup to check its correctness. During this restore, I have seen us hit the above panic. I've managed to grab the test logs for one such failure, and I am attempting to reproduce it so that I can grab the backup as well.

restorepanic.txt

Jira issue: CRDB-7092

adityamaru added the C-bug label May 1, 2021
adityamaru added this to Incoming in Storage via automation May 1, 2021
adityamaru added this to Triage in Disaster Recovery Backlog via automation May 1, 2021
mwang1026 moved this from Triage to Bug in Disaster Recovery Backlog May 3, 2021
adityamaru removed this from Incoming in Storage May 3, 2021
dt (Member) commented May 3, 2021

I suspect this is our virtual file-like reader that streams reads: Stat said the file had bytes, but then Read didn't return any. Maybe a userfile bug. I'll dig a bit.
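
A minimal sketch of that suspected failure mode, with hypothetical types standing in for the userfile-backed reader (not the real implementation): the size metadata reports a positive length, so any size checks pass, but the streamed read hands back zero bytes, and the subsequent footer-magic slice goes negative.

```go
package main

import (
	"fmt"
	"io"
)

// fakeStreamingFile is a hypothetical stand-in for a virtual, streaming file
// whose size metadata and streamed contents disagree.
type fakeStreamingFile struct{}

// Size reports that the file has bytes (as Stat did in the failure)...
func (fakeStreamingFile) Size() int64 { return 1 << 20 }

// ...but ReadAt hands back nothing, as if the stream were already exhausted.
func (fakeStreamingFile) ReadAt(p []byte, off int64) (int, error) { return 0, io.EOF }

func readFooterish(f fakeStreamingFile) {
	const footerLen = 53 // illustrative only, not pebble's real constant
	buf := make([]byte, footerLen)
	n, _ := f.ReadAt(buf, f.Size()-footerLen)
	buf = buf[:n]        // n == 0, so buf is empty
	_ = buf[len(buf)-8:] // panics: slice bounds out of range [-8:]
}

func main() {
	defer func() { fmt.Println("recovered:", recover()) }()
	readFooterish(fakeStreamingFile{})
}
```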

adityamaru (Contributor, Author) commented Mar 17, 2023

FWIW, this still happens every time I stress TestBackupWorkerFailure in my six-monthly attempt to unskip the test. I'm going to switch to nodelocal to see whether it's a userfile-specific issue.

adityamaru (Contributor, Author) commented Mar 28, 2023

This should be investigated using TestBackupWorkerFailure as the reproduction. It has been failing with this error since 2021. This might be related to #98964.

adityamaru (Contributor, Author) commented
This might be fixed by #106503. Try stressing the test after the change merges.

dt moved this from Bug to Backlog in Disaster Recovery Backlog Feb 13, 2024