IMPORT: failure due to gc threshold subsequently fails to rollback #122351

Open
dt opened this issue Apr 15, 2024 · 3 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-testcluster: Issues found or occurred on a test cluster, i.e. a long-running internal cluster
P-3: Issues/test failures with no fix SLA
T-sql-queries: SQL Queries Team

Comments

@dt
Member

dt commented Apr 15, 2024

Observed on drt-ua2, running A0AF5664: the initial import failed after several days, and now appears stuck and unable to revert:

addsstable [/Tenant/3/Table/114/1/127876/7/-3001/1/0,/Tenant/3/Table/114/1/127877/5/-2150/11/0/NULL): batch timestamp 1713131887.795927192,0 must be after replica GC threshold 1713138461.824411737,0 | 7 | 591956281271723629 | {"reverting execution from '2024-04-15 04:31:34.478337' to '2024-04-15 04:31:38.13894' on 7 failed: rolling back IMPORT INTO in empty table via DeleteRange: delete range /Tenant/3/Table/114/1/190836/1/-2999/1 - /Tenant/3/Table/114/1/216277/8/-2823/6: replica unavailable: (n2,s2):5 unable to serve request to r76626:/Tenant/3/Table/114/1/{190836/1/-2999/1-214285/8/-3001/1} [(n2,s2):5, (n1,s1):2, (n3,s3):3VOTER_DEMOTING_LEARNER, (n5,s5):6VOTER_INCOMING, next=7, gen=17148, sticky=1713076548.036832953,0]: closed timestamp: 1713076086.298572324,0 (2024-04-14 06:28:06); raft status: {"id":"5","term":7,"vote":"5","commit":39,"lead":"5","raftState":"StateLeader","applied":39,"progress":{"3":{"match":66425,"next":66426,"state":"StateReplicate"},"5":{"match":66425,"next":66426,"state":"StateReplicate"},"6":{"match":0,"next":37,"state":"StateSnapshot"},"2":{"match":0,"next":37,"state":"StateSnapshot"}},"leadtransferee":"0"}: encountered poisoned latch /M{in-ax}@0,0","reverting execution from '2024-04-15 05:03:15.042559' to '2024-04-15 05:03:18.809365' on 2 failed: rolling back IMPORT INTO in empty table via DeleteRange: delete range /Tenant/3/Table/114/1/190836/1/-2999/1 - /Tenant/3/Table/114/1/216277/8/-2823/6: replica unavailable: (n2,s2):5 unable to serve request to r76626:/Tenant/3/Table/114/1/{190836/1/-2999/1-214285/8/-3001/1} [(n2,s2):5, (n1,s1):2, (n3,s3):3VOTER_DEMOTING_LEARNER, (n5,s5):6VOTER_INCOMING, next=7, gen=17148, sticky=1713076548.036832953,0]: closed timestamp: 1713076086.298572324,0 (2024-04-14 06:28:06); raft status: 
{"id":"5","term":7,"vote":"5","commit":39,"lead":"5","raftState":"StateLeader","applied":39,"progress":{"2":{"match":0,"next":37,"state":"StateSnapshot"},"3":{"match":66425,"next":66426,"state":"StateReplicate"},"5":{"match":66425,"next":66426,"state":"StateReplicate"},"6":{"match":0,"next":37,"state":"StateSnapshot"}},"leadtransferee":"0"}: encountered poisoned latch /M{in-ax}@0,0","reverting execution from '2024-04-15 06:07:14.455743' to '2024-04-15 06:07:20.81839' on 3 failed: rolling back IMPORT INTO in empty table via DeleteRange: delete range /Tenant/3/Table/114/1/189154/4/-3000/1 - /Tenant/3/Table/114/1/214594/5/-1359/5: replica unavailable: (n2,s2):5 unable to serve request to r76626:/Tenant/3/Table/114/1/{190836/1/-2999/1-214285/8/-3001/1} [(n2,s2):5, (n1,s1):2, (n3,s3):3VOTER_DEMOTING_LEARNER, (n5,s5):6VOTER_INCOMING, next=7, gen=17148, sticky=1713076548.036832953,0]: closed timestamp: 1713076086.298572324,0 (2024-04-14 06:28:06); raft status: {"id":"5","term":7,"vote":"5","commit":39,"lead":"5","raftState":"StateLeader","applied":39,"progress":{"5":{"match":66425,"next":66426,"state":"StateReplicate"},"6":{"match":0,"next":37,"state":"StateSnapshot"},"2":{"match":0,"next":37,"state":"StateSnapshot"},"3":{"match":66425,"next":66426,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /M{in-ax}@0,0"}

The DB console reports no unavailable ranges.
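
For readers decoding the first error, the arithmetic behind "batch timestamp ... must be after replica GC threshold ..." can be sketched as follows. This is a minimal illustration, assuming the timestamps are printed as `seconds.nanoseconds,logical`; the helper names `parse_hlc` and `to_utc` are made up for this example and are not CockroachDB APIs.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def parse_hlc(ts: str) -> tuple[int, int]:
    """Split 'seconds.nanos,logical' into (wall_nanos, logical).

    Assumes the fractional part is a decimal fraction of a second,
    as in the log output quoted above.
    """
    wall, logical = ts.split(",")
    seconds, nanos = wall.split(".")
    wall_nanos = int(seconds) * NANOS_PER_SECOND + int(nanos.ljust(9, "0"))
    return wall_nanos, int(logical)


def to_utc(wall_nanos: int) -> str:
    """Render the wall-clock part as a UTC timestamp (second precision)."""
    dt = datetime.fromtimestamp(wall_nanos // NANOS_PER_SECOND, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")


# Values copied from the addsstable error in this comment.
batch, _ = parse_hlc("1713131887.795927192,0")
threshold, _ = parse_hlc("1713138461.824411737,0")

gap_nanos = threshold - batch
print("batch timestamp:", to_utc(batch))
print("GC threshold:   ", to_utc(threshold))
print("threshold is ahead by", gap_nanos / NANOS_PER_SECOND, "seconds")
```

With the values from the log, the batch timestamp falls roughly 1 hour 50 minutes before the replica's GC threshold, which is why the AddSSTable request was rejected.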

Jira issue: CRDB-37824

@dt dt added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster labels Apr 15, 2024
@ajstorm ajstorm added T-sql-queries SQL Queries Team P-1 Issues/test failures with a fix SLA of 1 month labels Apr 17, 2024
@rytaft rytaft self-assigned this Apr 23, 2024
@rytaft rytaft assigned yuzefovich and unassigned rytaft May 6, 2024
@rytaft
Collaborator

rytaft commented May 6, 2024

@yuzefovich I didn't get a chance to look at this during my on-call rotation. If you get a chance, could you please take a look? Thank you!

@yuzefovich yuzefovich removed their assignment May 12, 2024
@michae2
Collaborator

michae2 commented May 21, 2024

(quoting @yuzefovich during triage)

We think a way to reproduce this is to:

  1. Set a short GC TTL on the range.
  2. Make the import take longer than the TTL.
  3. Cause the import to fail. It seems like that might hit this assertion.
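
A rough sketch of what those steps might look like as SQL, generated here from a small Python helper. Everything specific is an assumption for illustration: the table name, the CSV URL, the 60-second TTL, and the use of `CANCEL JOB` to force the rollback (the triage note does not say how the import should be made to fail). This has not been confirmed to reproduce the bug.

```python
def repro_statements(table: str, csv_url: str, ttl_seconds: int = 60) -> list[str]:
    """Build SQL for the hypothesized repro; all arguments are hypothetical."""
    return [
        # Step 1: shorten the GC TTL on the target table's ranges.
        f"ALTER TABLE {table} CONFIGURE ZONE USING gc.ttlseconds = {ttl_seconds};",
        # Step 2: start an import large enough to outlast the TTL.
        # `detached` returns the job ID immediately instead of blocking.
        f"IMPORT INTO {table} CSV DATA ('{csv_url}') WITH detached;",
        # Step 3 (assumption): once the import has run longer than the TTL,
        # force it into its rollback path. <job_id> is a placeholder for the
        # ID returned by the IMPORT statement.
        "CANCEL JOB <job_id>;",
    ]


# Hypothetical table and data location, for illustration only.
for stmt in repro_statements("import_target", "nodelocal://1/big.csv"):
    print(stmt)
```

The import may also fail on its own once the GC threshold passes its write timestamp, in which case the explicit cancel would be unnecessary.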

@DrewKimball
Collaborator

@dt does it seem possible that this issue is caused by #91151?

@yuzefovich yuzefovich added P-3 Issues/test failures with no fix SLA and removed P-1 Issues/test failures with a fix SLA of 1 month labels May 28, 2024
Projects
Status: Bugs to Fix
Development

No branches or pull requests

6 participants