
improve initial pre-push imports #1040

Closed

wants to merge 28 commits
Conversation

technoweenie
Contributor

When pushing a repository to a new host, Git LFS pushes each branch or tag individually, which leads to a lot of skipped objects. Assuming all branches are based on master, most of the objects are duplicates. Here's a push of a repository with 502 objects: 501 in master, plus one extra in another branch:

$ time git push 5 --all
Git LFS: (501 of 501 files) 1.85 KB / 1.85 KB
Git LFS: (0 of 501 files, 501 skipped) 0 B / 1.85 KB, 1.85 KB skipped
Git LFS: (1 of 502 files, 501 skipped) 3 B / 1.86 KB, 1.85 KB skipped
Git LFS: (0 of 501 files, 501 skipped) 0 B / 1.85 KB, 1.85 KB skipped
Counting objects: 516, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (511/511), done.
Writing objects: 100% (516/516), 70.52 KiB | 0 bytes/s, done.
Total 516 (delta 1), reused 512 (delta 0)
To https://git-server/technoweenie/symmetrical-octo-pancake
 * [new branch]      1 -> 1
 * [new branch]      2 -> 2
 * [new branch]      3 -> 3
 * [new branch]      master -> master
git push 5 --all  2.06s user 1.51s system 7% cpu 51.051 total

With this PR:

$ time git push 4 --all
Git LFS: (501 of 501 files) 1.85 KB / 1.85 KB
Git LFS: (1 of 1 files) 3 B / 3 B
Counting objects: 516, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (511/511), done.
Writing objects: 100% (516/516), 70.52 KiB | 0 bytes/s, done.
Total 516 (delta 1), reused 512 (delta 0)
To https://git-server/technoweenie/symmetrical-octo-funicular
 * [new branch]      1 -> 1
 * [new branch]      2 -> 2
 * [new branch]      3 -> 3
 * [new branch]      master -> master
git push 4 --all  1.62s user 0.89s system 5% cpu 45.280 total

dluksza and others added 15 commits February 3, 2016 12:17
Adds a "ref" parameter, extracted from the git pre-push hook, to the upload
request. This way servers with branch access permissions can enforce
them before large objects are transferred.

Issue: #845
In the push hook, refs come from the user and point to local
branches. Transporting them to the server doesn't make sense.
Don't add an empty "ref" parameter to the lfs-request object when its value
is empty.
Instead of passing around an additional string parameter with the destination
ref, we pass a special struct called UploadMetadata.

For now UploadMetadata contains only the ref field, but in the future more
metadata, like the file name (or path), can be added.
Move the "ref" property of the batch request into a "meta" object; this way all
metadata has a common place to live instead of being spread all around
the batch request.
Since we also want metadata included in the fetch request, the
UploadMetadata struct is renamed to TransferMetadata.

A patch that adds metadata to the fetch request will follow.
Include the meta.ref property in the download request object, the same as in
the upload request.

In the case of download, meta.ref contains the name of the currently checked
out local branch and can only be a suggestion to the server implementation.
Normalize the value of the meta.ref parameter of the batch request
object. Previously the ref parameter format differed between the "download" and
"upload" objects.

The "upload" object always used the fully qualified branch
name (e.g. refs/heads/master), while the "download" object used the short
name (e.g. master).

Now meta.ref has consistent values: the "refs/heads/"
prefix is appended when it is not present, and the special "HEAD" reference
is replaced with an empty string, which results in the meta.ref
parameter not being included in the request object.
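The normalization rules described in this commit can be sketched as follows. This is a hypothetical helper, not the actual Git LFS function; it also treats any existing "refs/" prefix as already qualified, which is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRef sketches the normalization described above: prepend
// "refs/heads/" when the ref is a short branch name, and map the
// special "HEAD" reference to an empty string so the meta.ref
// parameter is omitted from the request object.
func normalizeRef(ref string) string {
	if ref == "" || ref == "HEAD" {
		return ""
	}
	if strings.HasPrefix(ref, "refs/") {
		return ref // already fully qualified (assumption for this sketch)
	}
	return "refs/heads/" + ref
}

func main() {
	fmt.Println(normalizeRef("master"))            // refs/heads/master
	fmt.Println(normalizeRef("refs/heads/master")) // refs/heads/master
	fmt.Printf("%q\n", normalizeRef("HEAD"))       // ""
}
```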
@technoweenie mentioned this pull request Feb 24, 2016
@sinbad
Contributor

sinbad commented Feb 25, 2016

This only addresses the issue when using the pre-push hook, it'll still be there when using git lfs push ref1 ref2 ref3 or git lfs push --all. So actually my comment in #1038 that I edited to ref this one still stands right now.

@@ -137,7 +133,19 @@ func prePushRef(left, right string) {
os.Exit(2)
}
}
}

func filteredPointers(pointers []*lfs.WrappedPointer, filter lfs.StringSet) []*lfs.WrappedPointer {
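For context, the build-a-new-list approach behind this signature might look like the following standalone sketch. The types are simplified stand-ins for lfs.WrappedPointer and lfs.StringSet, and the drop-items-in-the-set semantics is an assumption based on how the function is used in this thread.

```go
package main

import "fmt"

// Simplified stand-ins for lfs.WrappedPointer and lfs.StringSet.
type WrappedPointer struct{ Oid string }
type StringSet map[string]struct{}

// filteredPointers returns the pointers whose OIDs are NOT in filter,
// allocating a fresh slice -- the "build a brand new list" approach
// discussed in this review.
func filteredPointers(pointers []*WrappedPointer, filter StringSet) []*WrappedPointer {
	out := make([]*WrappedPointer, 0, len(pointers))
	for _, p := range pointers {
		if _, skip := filter[p.Oid]; skip {
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	ptrs := []*WrappedPointer{{Oid: "a"}, {Oid: "b"}, {Oid: "c"}}
	uploadedOids := StringSet{"b": {}}
	for _, p := range filteredPointers(ptrs, uploadedOids) {
		fmt.Println(p.Oid) // prints "a" then "c"
	}
}
```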
Contributor

Instead of building a brand new list, how about either doing this as a channel or accepting a callback function for unfiltered items?

Contributor Author

This seemed a lot simpler. Though I am curious enough to try writing some benchmark code for it.

Contributor Author

How about this? https://gist.github.com/technoweenie/34671a1296ba375b477a

$ go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkFilterByList-4         1000       1577368 ns/op      346271 B/op        127 allocs/op
BenchmarkFilterByCallback-4     1000       1532885 ns/op      264313 B/op        125 allocs/op
BenchmarkFilterByChan-4          200       9493396 ns/op      182958 B/op        129 allocs/op
ok      _/Users/rick/p/go-filter-benchmark  6.305s

Still leaning towards the current approach. It's not much slower than using a callback, and has the simplest implementation. Unless there's some radical way to make the channel approach simpler and faster, I'm not even really considering it.

Contributor

I meant more like your callback example, channels as output and still lists as input; but I see the issue is that you need 2 iterations with shared reference data which is causing the channel way to be slow in parallel because of the mutex so it's no benefit over the callback; they'd have to be serialised to avoid the mutex, and it's more complex.

I do prefer the callback style to returning new lists but that's probably the grumpy old man in me seeing 30% more allocations and freaking out.

Contributor Author

I tried using callbacks:

-       pointers = filteredPointers(pointers, uploadedOids)
-       pointers = filteredPointers(pointers, prePushCheckForMissingObjects(pointers))
+       filtered := make([]*lfs.WrappedPointer, 0, len(pointers))
+       cb := func(p *lfs.WrappedPointer) {
+               filtered = append(filtered, p)
+       }
+
+       filteredPointers(pointers, uploadedOids, cb)
+
+       serverOids := prePushCheckForMissingObjects(filtered)
+       filteredPointers(filtered, serverOids, cb)

This is more complex, and introduces a concurrency issue because the second filter call is looping through the same slice while the callback is appending to it. The original approach, though 30% more wasteful with allocations, is the best option IMO. Though, that's assuming filtering twice is a good idea. My intention is to filter out everything that has been uploaded from this same process before checking the server to see what it has. It'll result in fewer batch calls in large import pushes.
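A self-contained version of this two-pass callback style (simplified types, hypothetical names) might look like the sketch below; it gives the second pass its own output slice and callback, rather than appending into the slice being ranged over.

```go
package main

import "fmt"

// WrappedPointer is a simplified stand-in for lfs.WrappedPointer.
type WrappedPointer struct{ Oid string }

// filterPointers invokes cb for each pointer whose OID is not in skip.
func filterPointers(pointers []*WrappedPointer, skip map[string]bool, cb func(*WrappedPointer)) {
	for _, p := range pointers {
		if !skip[p.Oid] {
			cb(p)
		}
	}
}

func main() {
	pointers := []*WrappedPointer{{"a"}, {"b"}, {"c"}, {"d"}}

	// Pass 1: drop everything already uploaded by this process.
	uploadedOids := map[string]bool{"a": true}
	filtered := make([]*WrappedPointer, 0, len(pointers))
	filterPointers(pointers, uploadedOids, func(p *WrappedPointer) {
		filtered = append(filtered, p)
	})

	// Pass 2: drop what the server already has, into a fresh slice
	// and callback (reusing the pass-1 callback would append to the
	// slice being ranged over and duplicate its contents).
	serverOids := map[string]bool{"c": true} // stand-in for a batch-API check
	final := make([]*WrappedPointer, 0, len(filtered))
	filterPointers(filtered, serverOids, func(p *WrappedPointer) {
		final = append(final, p)
	})

	for _, p := range final {
		fmt.Println(p.Oid) // prints "b" then "d"
	}
}
```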

Contributor

introduces a concurrency issue because the second filter call is looping through the same slice while the callback is appending to it

I don't think it does - the filteredPointers and prePushCheckForMissingObjects calls are completely synchronous/serial. But in any case, if you're happier with this style it's fine, I just can't help myself looking for allocation optimisations given my background ;)

Contributor

Now that filteredPointers is used in multiple places, maybe promote it to the lfs package for re-use rather than keep it in the pre-push source file?

Contributor Author

BTW just wanted to acknowledge you're absolutely right about the concurrency thing. Temporary brain fart.

I'm working on an update that attempts to centralize the uploading logic across pre-push and push. I'll probably post that as a PR against this PR so we can review that, and discuss package options there.

@technoweenie
Contributor Author

This only addresses the issue when using the pre-push hook

Yup, that's on purpose. Trying to get feedback on the approach. Then we can take it to the fetch and push commands 🤘

@sinbad
Contributor

sinbad commented Feb 25, 2016

Ah fair enough, approach is fine to me, very similar to what was done before just in a different place. I wondered if it could be generalised & centralised instead but I recognise that would probably have to be in the transfer queue which is just going to cause a whole lot of new race condition possibilities. This way is probably better just for sanity.

@technoweenie
Contributor Author

I wondered if it could be generalised & centralised

Me too! It may or may not happen in this PR. I just wanted to get feedback on this before spending more time on it. I'll reply again when I'm ready for round 2 :)

@technoweenie
Contributor Author

Okay, this works with push --all now. Basically, instead of scanning with lfs.ScanAllMode, it uses git.LocalRefs() and scans each ref separately. I think the behavior is now correct for both lfs push and lfs pre-push, though I'd like to take a stab at consolidating the similar push logic between the two commands.
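The per-ref scan with de-duplication described here can be sketched roughly as follows. The helper names are hypothetical (scanRef stands in for the LFS scanner, and the refs and OIDs are made up); the real code uses git.LocalRefs() and the scanner package.

```go
package main

import "fmt"

// scanRef stands in for the LFS scanner: it returns the OIDs of LFS
// objects reachable from a ref (hard-coded here for the sketch).
func scanRef(ref string) []string {
	objects := map[string][]string{
		"refs/heads/master": {"oid1", "oid2"},
		"refs/heads/topic":  {"oid1", "oid2", "oid3"}, // shares master's objects
	}
	return objects[ref]
}

// uniquePointers scans each local ref separately but queues every
// object only once, instead of re-checking duplicates per branch.
func uniquePointers(localRefs []string) []string {
	seen := make(map[string]bool)
	var upload []string
	for _, ref := range localRefs {
		for _, oid := range scanRef(ref) {
			if !seen[oid] {
				seen[oid] = true
				upload = append(upload, oid)
			}
		}
	}
	return upload
}

func main() {
	refs := []string{"refs/heads/master", "refs/heads/topic"}
	fmt.Println(uniquePointers(refs)) // [oid1 oid2 oid3]; nothing queued twice
}
```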

@sinbad
Contributor

sinbad commented Mar 11, 2016

Looks good 👍 I've seen that checkout failure that's red-flagging the Travis build before, I think it's unrelated to this PR.

extract common pre-push and push uploading code
@technoweenie
Contributor Author

This conflicts with the scanner changes in master, so I'm going to redo it.

@technoweenie technoweenie deleted the skip-pre-push-dupes branch April 6, 2016 18:39