Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pebble: Download can get stuck #3470

Closed
Tracked by #121166
RaduBerinde opened this issue Mar 28, 2024 · 1 comment · Fixed by #3483
Closed
Tracked by #121166

pebble: Download can get stuck #3470

RaduBerinde opened this issue Mar 28, 2024 · 1 comment · Fixed by #3483
Assignees
Projects

Comments

@RaduBerinde
Copy link
Member

Here are logs from a meta test instance where the Download op got stuck:

        // Mar 27 21:30:19.982 INFO: [JOB 39] compacted(copy) L0 [000038] (803B) Score=0.00 + L0 [] (0B) Score=0.00 -> L0 [000050] (803B), in 0.0s (0.0s total), output rate 8.5MB/s
        // Mar 27 21:30:19.982 INFO: [JOB 41] compacting(move) L5 [000045] (97B) Score=10.80 + L6 [] (0B) Score=0.20; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.982 INFO: [JOB 41] compacted(move) L5 [000045] (97B) Score=10.80 + L6 [] (0B) Score=0.20 -> L6 [000045] (97B), in 0.0s (0.0s total), output rate 13MB/s
        // Mar 27 21:30:19.982 INFO: [JOB 42] compacting(copy) L0 [000047] (803B) Score=0.00 + L0 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.982 INFO: [JOB 42] compacted(copy) L0 [000047] (803B) Score=0.00 + L0 [] (0B) Score=0.00 -> L0 [000051] (803B), in 0.0s (0.0s total), output rate 13MB/s
        // Mar 27 21:30:19.982 INFO: [JOB 43] compacting(copy) L0 [000040] (803B) Score=0.00 + L0 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.982 INFO: [JOB 43] compacted(copy) L0 [000040] (803B) Score=0.00 + L0 [] (0B) Score=0.00 -> L0 [000052] (803B), in 0.0s (0.0s total), output rate 20MB/s
        // Mar 27 21:30:19.982 INFO: [JOB 46] compacting(copy) L5 [000039] (803B) Score=0.00 + L5 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.983 INFO: [JOB 46] compacted(copy) L5 [000039] (803B) Score=0.00 + L5 [] (0B) Score=0.00 -> L5 [000053] (803B), in 0.0s (0.0s total), output rate 2.0MB/s
        // Mar 27 21:30:19.983 INFO: [JOB 47] compacting(copy) L6 [000030] (803B) Score=0.00 + L6 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.983 INFO: [JOB 47] compacted(copy) L6 [000030] (803B) Score=0.00 + L6 [] (0B) Score=0.00 -> L6 [000054] (803B), in 0.0s (0.0s total), output rate 1.8MB/s
        // Mar 27 21:30:19.983 INFO: [JOB 48] compacting(copy) L6 [000005] (803B) Score=0.00 + L6 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.983 INFO: [JOB 48] compacted(copy) L6 [000005] (803B) Score=0.00 + L6 [] (0B) Score=0.00 -> L6 [000055] (803B), in 0.0s (0.0s total), output rate 13MB/s
        // Mar 27 21:30:19.983 INFO: [JOB 49] compacting(move) L5 [000048] (803B) Score=9.74 + L6 [] (0B) Score=0.22; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.983 INFO: [JOB 49] compacted(move) L5 [000048] (803B) Score=9.74 + L6 [] (0B) Score=0.22 -> L6 [000048] (803B), in 0.0s (0.0s total), output rate 656MB/s
        // Mar 27 21:30:19.984 INFO: [JOB 51] compacting(move) L5 [000053] (803B) Score=4.17 + L6 [] (0B) Score=0.32; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 27 21:30:19.984 INFO: [JOB 51] compacted(move) L5 [000053] (803B) Score=4.17 + L6 [] (0B) Score=0.32 -> L6 [000053] (803B), in 0.0s (0.0s total), output rate 471MB/s

The key here is the "compaction cancelled" error. tryScheduleDownloadCompaction tries to find an external file that isn't compacting:

if objMeta.IsExternal() && !f.IsCompacting() {

If it finds no such file, it declares the download operation completed. However, if that compaction doesn't complete - like in the case above where it gets cancelled, we will never actually download the file and Download ends up waiting forever.

There is another problem with Download - if a download compaction is cancelled, we will pass the error up to the user instead of retrying.

@RaduBerinde
Copy link
Member Author

Another way this can happen (which is probably more common) is an external file undergoing a move compaction. We skip that file but it will still be external after the compaction.

        // Mar 28 05:57:44.473 INFO: [JOB 86] compacting(download,rewrite) L0 [001042] (889B) Score=0.00 + L0 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 28 05:57:44.473 INFO: Mar 28 05:57:44.473 [JOB 86] compacting: sstable created 001054
        // Mar 28 05:57:44.473 INFO: [JOB 86] compacted(download,rewrite) L0 [001042] (889B) Score=0.00 + L0 [] (0B) Score=0.00 -> L0 [001054] (857B), in 0.0s (0.0s total), output rate 18MB/s
        // Mar 28 05:57:44.473 INFO: [JOB 87] compacting(download,rewrite) L0 [001048] (889B) Score=0.00 + L0 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 28 05:57:44.473 INFO: Mar 28 05:57:44.473 [JOB 87] compacting: sstable created 001055
        // Mar 28 05:57:44.473 INFO: [JOB 87] MANIFEST created 001056
        // Mar 28 05:57:44.473 INFO: [JOB 87] compacted(download,rewrite) L0 [001048] (889B) Score=0.00 + L0 [] (0B) Score=0.00 -> L0 [001055] (857B), in 0.0s (0.0s total), output rate 10MB/s
        // Mar 28 05:57:44.473 INFO: [JOB 89] compacting(default) L5 [001052] (4.8KB) Score=1.31 + L6 [001003 000637 000391 000218 000376 000762 000887 000830 000984 000719] (19KB) Score=0.79; OverlappingRatio: Single 3.98, Multi 0.00
        // Mar 28 05:57:44.473 INFO: Mar 28 05:57:44.473 [JOB 89] compacting: sstable created 001057
        // Mar 28 05:57:44.474 INFO: [JOB 87] MANIFEST deleted 001039
        // Mar 28 05:57:44.474 INFO: [JOB 89] compacted(default) L5 [001052] (4.8KB) Score=1.31 + L6 [001003 000637 000391 000218 000376 000762 000887 000830 000984 000719] (19KB) Score=0.79 -> L6 [001057] (13KB), in 0.0s (0.0s total), output rate 30MB/s
        // Mar 28 05:57:44.479 INFO: [JOB 89] sstable deleted 001052
        // Mar 28 05:57:44.479 INFO: [JOB 90] compacting(move) L4 [001041] (889B) Score=151.45 + L5 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
        // Mar 28 05:57:44.479 INFO: [JOB 90] MANIFEST created 001058
        // Mar 28 05:57:44.479 INFO: [JOB 90] compacted(move) L4 [001041] (889B) Score=151.45 + L5 [] (0B) Score=0.00 -> L5 [001041] (889B), in 0.0s (0.0s total), output rate 226MB/s
        // Mar 28 05:57:44.479 INFO: [JOB 90] MANIFEST deleted 001049
        // Mar 28 05:57:45.117 INFO: [JOB 91] WAL created 001059
        // Mar 28 05:57:45.122 INFO: [JOB 92] flushing 1 memtable (450B) to L0
        // Mar 28 05:57:45.127 INFO: Mar 28 05:57:45.127 [JOB 92] flushing: sstable created 001060
        // Mar 28 05:57:45.136 INFO: [JOB 92] flushed 1 memtable (450B) to L0 [001060] (1.3KB), in 0.0s (0.0s total), output rate 166KB/s
        // Mar 28 05:57:45.139 INFO: [JOB 92] WAL deleted 001044

Things get stuck here. This is the LSM, with 001041 (which had a move compaction) still being external.

RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Mar 28, 2024
Disable download op until cockroachdb#3470 is fixed.
RaduBerinde added a commit that referenced this issue Mar 29, 2024
Disable download op until #3470 is fixed.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 1, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 1, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
@RaduBerinde RaduBerinde moved this from Incoming to In Progress (this milestone) in Storage Apr 1, 2024
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 1, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 3, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 3, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 4, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 4, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 4, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 4, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Apr 4, 2024
This commits changes the download operation to keep track of where we
are in the download process and provide more clear completion
guarantees. The mechanics are described in the comment for
`downloadSpanTask`.

We also add events/log messages around the download operation and
re-enable download ops in the metamorphic test.

Fixes cockroachdb#3470.
Storage automation moved this from In Progress (this milestone) to Done Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Storage
  
Done
Development

Successfully merging a pull request may close this issue.

1 participant