-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve initial pre-push imports #1040
Conversation
Adds "ref" parameter extracted from git pre-push hook into the upload request. This way servers with branch access permissions can enforce them before large objects are transfered. Issue: #845
In push hook refs comes from user and points to local branches. Transporting them to server doesn't make sense.
Don't add empty "ref" parameter to the lfs-request object when its value is empty.
Instead of passing around additional string parameter with destination ref, we pass special struct called UploadMetadata. For now UploadMetadata contains only ref field but in the future more metadata like file name (or path) can be added.
Move "ref" property of batch request into "meta" object, this way all meta data have a common place to live instead of being spread all around the batch request.
Since we also want to have metadata included in fetch request the UploadMetadata struct is renamed to TransferMetadata. Patch that adds metadata to fetch request will follow.
Include meta.ref property in download request object, same as it is in the upload request. In case of download meta.ref contain name of currently check out local branch and can be only a suggestion to server implementation.
Normalize the value of meta.ref parameter of batch request object. Previously ref parameter format differs between "download" and "upload" objects. The "upload" always was using the fully qualified branch name (eg. refs/heads/master), when the "download" used short name (eg. master). Right now meta.ref will have consistent values. When "refs/heads/" prefix will be appended when it is not present. Special "HEAD" reference will be replaced with empty string, which will result in meta.ref parameter not being included in the request object.
This only addresses the issue when using the pre-push hook, it'll still be there when using |
@@ -137,7 +133,19 @@ func prePushRef(left, right string) { | |||
os.Exit(2) | |||
} | |||
} | |||
} | |||
|
|||
func filteredPointers(pointers []*lfs.WrappedPointer, filter lfs.StringSet) []*lfs.WrappedPointer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of building a brand new list, how about either doing this as a channel or accepting a callback function for unfiltered items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seemed a lot simpler. Though I am curious enough to try writing some benchmark code for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this? https://gist.github.com/technoweenie/34671a1296ba375b477a
$ go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkFilterByList-4 1000 1577368 ns/op 346271 B/op 127 allocs/op
BenchmarkFilterByCallback-4 1000 1532885 ns/op 264313 B/op 125 allocs/op
BenchmarkFilterByChan-4 200 9493396 ns/op 182958 B/op 129 allocs/op
ok _/Users/rick/p/go-filter-benchmark 6.305s
Still leaning towards the current approach. It's not much slower than using a callback, and has the simplest implementation. Unless there's some radical way to make the channel approach simpler and faster, I'm not even really considering it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant more like your callback example, channels as output and still lists as input; but I see the issue is that you need 2 iterations with shared reference data which is causing the channel way to be slow in parallel because of the mutex so it's no benefit over the callback; they'd have to be serialised to avoid the mutex, and it's more complex.
I do prefer the callback style to returning new lists but that's probably the grumpy old man in me seeing 30% more allocations and freaking out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using callbacks:
- pointers = filteredPointers(pointers, uploadedOids)
- pointers = filteredPointers(pointers, prePushCheckForMissingObjects(pointers))
+ filtered := make([]*lfs.WrappedPointer, 0, len(pointers))
+ cb := func(p *lfs.WrappedPointer) {
+ filtered = append(filtered, p)
+ }
+
+ filteredPointers(pointers, uploadedOids, cb)
+
+ serverOids := prePushCheckForMissingObjects(filtered)
+ filteredPointers(filtered, serverOids, cb)
This is more complex, and introduces a concurrency issue because the second filter call is looping through the same slice while the callback is appending to it. The original approach, though 30% more wasteful with allocations, is the best option IMO. Though, that's assuming filtering twice is a good idea. My intention is to filter out everything that has been uploaded from this same process before checking the server to see what it has. It'll result in fewer batch calls in large import pushes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
introduces a concurrency issue because the second filter call is looping through the same slice while the callback is appending to it
I don't think it does - the filteredPointers and prePushCheckForMissingObjects calls are completely synchronous/serial. But in any case, if you're happier with this style it's fine, I just can't help myself looking for allocation optimisations given my background ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that filteredPointers is used in multiple places, maybe promote it to the lfs package for re-use rather than keep it in the pre-push source file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW just wanted to acknowledge you're absolutely right about the concurrency thing. Temporary brain fart.
I'm working on an update that attempts to centralize the uploading logic across pre-push and push. I'll probably post that as a PR against this PR so we can review that, and discuss package options there.
Yup, that's on purpose. Trying to get feedback on the approach. Then we can take it to the fetch and push commands 🤘 |
Ah fair enough, approach is fine to me, very similar to what was done before just in a different place. I wondered if it could be generalised & centralised instead but I recognise that would probably have to be in the transfer queue which is just going to cause a whole lot of new race condition possibilities. This way is probably better just for sanity. |
Me too! It may or may not happen in this PR. I just wanted to get feedback on this before spending more time on it. I'll reply again when I'm ready for round 2 :) |
Okay, this works with |
Looks good 👍 I've seen that checkout failure that's red-flagging the Travis build before, I think it's unrelated to this PR. |
extract common pre-push and push uploading code
This conflicts with the scanner changes in master, so I'm going to redo it. |
When pushing a repository to a new host, Git LFS pushes each branch or tag individually. This leads to a lot of skipped objects. Assuming all branches are based on master, then there should be a lot of duplicate objects. Here's a push with a repository with 502 objects: 501 in master, and an extra in another branch:
With this PR: