Fixed race when multiple clients attempt to connect to the same branch the first time a replica fetches it #5330
Conversation
…ranch on a read replica (maybe should be an error to create a local branch in the first place but we don't enforce that)
…n newly pulled branch)
…lt into zachmu/remote-ref-replication
This breaks a couple of the async push bats tests. I didn't change anything directly related to that, and adding sleeps makes it work again, which makes me think that clean shutdown logic has been broken for some time and we just passed some threshold where we are now losing the race.
It looks like graceful server shutdown on kill is just broken... there's no signal handling to invoke the shutdown logic, it just relies on a defer:

```go
sqlEngine, err := engine.NewSqlEngine(
	ctx,
	mrEnv,
	engine.FormatTabular,
	config,
)
if err != nil {
	return err, nil
}
defer sqlEngine.Close()
```

Edit: no, this works fine, confirmed with logging. Background threads are allegedly being canceled.
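For reference, a minimal sketch of what explicit signal handling invoking the shutdown path could look like. This is purely illustrative, not dolt's server code; `engineHandle` is a placeholder standing in for the engine above:

```go
// Illustrative sketch only: wire SIGINT/SIGTERM to the same cleanup a defer
// would perform, so a kill still runs the shutdown logic.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

type engineHandle struct{} // placeholder for the sql engine in the snippet above

func (e *engineHandle) Close() error {
	fmt.Println("closing engine")
	return nil
}

func main() {
	eng := &engineHandle{}
	defer eng.Close() // only runs on a normal return, not when the process is signaled

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)

	<-sigs      // block until SIGINT/SIGTERM arrives
	eng.Close() // run the cleanup explicitly before exiting
}
```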
I think the changes are not incorrect, but I don't understand how the edge cases are triggered, and I'm not seeing tests that try to trigger optimistic lock failures.
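A minimal, self-contained sketch of the kind of test being asked for: race several writers against a single compare-and-swap point and check that the retry path runs. The `casStore`, `tryUpdate`, and `update` names are hypothetical stand-ins that only mirror the shape of the manifest CAS and the OPTIMISTIC_RETRY loop; this is not dolt code.

```go
package retry

import (
	"errors"
	"sync"
	"sync/atomic"
	"testing"
)

var errOptimisticLockFailed = errors.New("optimistic lock failed")

// casStore mimics a manifest-like value updated via compare-and-swap.
type casStore struct {
	mu  sync.Mutex
	gen int
}

// tryUpdate fails when the caller's snapshot is stale, like a manifest CAS.
func (s *casStore) tryUpdate(seen int) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seen != s.gen {
		return s.gen, errOptimisticLockFailed
	}
	s.gen++
	return s.gen, nil
}

// update mirrors the retry loop: refresh the snapshot and try again whenever
// the CAS reports an optimistic lock failure.
func update(s *casStore, retries *int64) {
	seen := 0
	for {
		latest, err := s.tryUpdate(seen)
		if errors.Is(err, errOptimisticLockFailed) {
			atomic.AddInt64(retries, 1)
			seen = latest
			continue
		}
		return
	}
}

func TestConcurrentUpdatesHitRetryPath(t *testing.T) {
	s := &casStore{}
	var retries int64
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			update(s, &retries)
		}()
	}
	wg.Wait()
	// Every writer after the first must retry at least once, since the
	// generation advances under it, so the retry path is always exercised.
	if retries == 0 {
		t.Fatal("expected at least one optimistic lock retry")
	}
}
```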
```go
// loop on optimistic lock failures
OPTIMISTIC_RETRY:
	for {
```
Would it be equivalent to ditch the gotos and just `break` out of the for when we don't have a lock failure, and otherwise `continue`?
The problem is that `break` breaks out of the switch, not the loop. I'll see if I can eliminate the switch and make this easier to read.
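To make the tradeoff concrete, here is a small self-contained sketch (not the PR's code) showing why a bare `break` inside a switch only exits the switch, and how the labeled form compares to restructuring with if/else so plain `break`/`continue` apply to the loop:

```go
package main

import (
	"errors"
	"fmt"
)

var errLockFailed = errors.New("optimistic lock failed")

// withLabel mirrors the PR's shape: a bare break here would only exit the
// switch, so the loop is labeled and retried via continue RETRY.
func withLabel(attempts *int) {
RETRY:
	for {
		switch tryOnce(attempts) {
		case nil:
			break RETRY // without the label this would only leave the switch
		default:
			continue RETRY
		}
	}
}

// withoutLabel is the suggested alternative: drop the switch so a plain
// break/continue applies to the loop directly.
func withoutLabel(attempts *int) {
	for {
		if err := tryOnce(attempts); errors.Is(err, errLockFailed) {
			continue
		}
		break
	}
}

// tryOnce fails the first time to force exactly one retry.
func tryOnce(attempts *int) error {
	*attempts++
	if *attempts == 1 {
		return errLockFailed
	}
	return nil
}

func main() {
	a, b := 0, 0
	withLabel(&a)
	withoutLabel(&b)
	fmt.Println(a, b) // both variants take 2 attempts
}
```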
```go
if err != nil {
	return "", false, err
}

if branchExists {
	err = createLocalBranchFromRemoteTrackingBranch(ctx, srcDB.DbData(), ddb, branchName, remoteRef)
```
oof
```go
case ref.BranchRefType:
	err := rrd.createNewBranchFromRemote(ctx, remoteRef, trackingRef)
	if errors.Is(err, datas.ErrOptimisticLockFailed) {
		continue OPTIMISTIC_RETRY
```
Is there a different rrd for every client? I'm confused how two clients could trigger this create retry, since the limiter is supposed to collapse the parallel calls. Same for pullLocalBranch, but that one seems more reasonable. What are the concurrent events that trigger retries? A write overlapping a pull?
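For context on the "collapse the parallel calls" expectation, the standard analogue is golang.org/x/sync/singleflight, which dedupes concurrent calls for the same key. This is only an analogy for how a limiter of that kind behaves, not dolt's actual limiter implementation; the branch key and "pulled" result below are made up:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var mu sync.Mutex
	var fetches int
	var wg sync.WaitGroup

	// Many clients ask for the same branch at once; overlapping calls share
	// one execution of the fetch function instead of each running it.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _, _ = g.Do("refs/heads/feature", func() (interface{}, error) {
				mu.Lock()
				fetches++
				mu.Unlock()
				return "pulled", nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("fetches:", fetches) // far fewer than 10 when the calls overlap
}
```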
Any manifest update can cause an optimistic lock write failure. There are writes that take place outside a limiter run (specifically, during pullBranchesAndUpdateWorkingSet).
Actually the work in CreateLocalBranch needs similar protection for this reason, but I'll follow up on that.
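A compact illustration of that point, under the assumption that every writer goes through one manifest-style compare-and-swap: even if the limiter collapses all the create calls into one, a write that happens outside the limited path between the create's read and its CAS still forces a retry. The `manifest` type and names here are hypothetical, not dolt's types:

```go
package main

import (
	"errors"
	"fmt"
)

var errOptimisticLockFailed = errors.New("optimistic lock failed")

// manifest stands in for the single CAS point every writer goes through.
type manifest struct{ gen int }

func (m *manifest) compareAndSet(seen int) error {
	if seen != m.gen {
		return errOptimisticLockFailed
	}
	m.gen++
	return nil
}

func main() {
	m := &manifest{}

	// The collapsed/limited create path takes its snapshot of the manifest...
	seen := m.gen

	// ...but an unrelated write outside the limiter (e.g. a working-set
	// update during a pull) lands first and advances it.
	_ = m.compareAndSet(m.gen)

	// The create path's CAS now fails, which is what sends the real code
	// back around the OPTIMISTIC_RETRY loop.
	if err := m.compareAndSet(seen); errors.Is(err, errOptimisticLockFailed) {
		fmt.Println("create path must retry:", err)
	}
}
```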