Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: tables with abandoned schema change mutations with failed jobs cannot be dropped #57597

Closed
thoszhang opened this issue Dec 5, 2020 · 3 comments · Fixed by #57836
Closed
Assignees
Labels
A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@thoszhang
Copy link
Contributor

When a schema change job undergoes a failure while reverting, the corresponding mutation, if there is one, gets stuck on the table descriptor forever, and the job will end up as failed. In this state, it becomes impossible to drop the table or its parent database, because we'll attempt to mark the job as succeeded (under the assumption that it's an ongoing schema change), but this isn't possible with a failed job:

jobIDs := make(map[int64]struct{})
var id descpb.MutationID
for _, m := range tableDesc.Mutations {
if id != m.MutationID {
id = m.MutationID
jobID, err := getJobIDForMutationWithDescriptor(ctx, tableDesc, id)
if err != nil {
return err
}
jobIDs[jobID] = struct{}{}
}
}
for jobID := range jobIDs {
if err := p.ExecCfg().JobRegistry.Succeeded(ctx, p.txn, jobID); err != nil {
return errors.Wrapf(err,
"failed to mark job %d as as successful", errors.Safe(jobID))
}
delete(p.ExtendedEvalContext().SchemaChangeJobCache, tableDesc.ID)
}

The error that gets returned is incomprehensible:

root@127.0.0.1:64580/db> drop database defaultdb cascade;
ERROR: failed to mark job 612970272619069441 as as successful: job with status failed cannot be marked as succeeded
@thoszhang thoszhang added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-changes labels Dec 5, 2020
@thoszhang thoszhang added this to Triage in SQL Foundations via automation Dec 5, 2020
@thoszhang
Copy link
Contributor Author

We should just mark these jobs as failed. It makes no sense to mark them as succeeded if we forcibly stop them so we can drop the table.

@thoszhang
Copy link
Contributor Author

It turns out we can't mark these jobs as failed in general because it interferes with tables being dropped after other schema changes to the table have been queued earlier .

Ultimately, I don't think #56589 is the correct solution for handling those cases. It just bypasses updating the earlier job queued by the earlier schema change, but we still wait for the successful completion of that job. Deleting the earlier job entirely would work, but also just removing it from extraTxnState would work too, I think. (And it should be deleted from MutationJobs.)

So I think the right solution here is to mark these jobs as failed and delete them from MutationJobs, and if it was queued in the transaction, delete the job from extraTxnState and from MutationJobs. I'll try this out.

@thoszhang
Copy link
Contributor Author

This can also happen when truncating tables in 20.1, since we used to drop the table and create a new one. We saw this in a user cluster. TRUNCATE works fine for tables in this state in 20.2, which makes me suspicious about what happens to running schema change jobs with the new implementation, but that's a separate issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants