
improve initial pre-push imports #1040

Closed

wants to merge 28 commits
Conversation

technoweenie
Contributor

When pushing a repository to a new host, Git LFS pushes each branch or tag individually, which leads to a lot of skipped objects. Assuming all branches are based on master, most of the objects are duplicates. Here's a push of a repository with 502 objects: 501 in master, plus one extra in another branch:

$ time git push 5 --all
Git LFS: (501 of 501 files) 1.85 KB / 1.85 KB
Git LFS: (0 of 501 files, 501 skipped) 0 B / 1.85 KB, 1.85 KB skipped
Git LFS: (1 of 502 files, 501 skipped) 3 B / 1.86 KB, 1.85 KB skipped
Git LFS: (0 of 501 files, 501 skipped) 0 B / 1.85 KB, 1.85 KB skipped
Counting objects: 516, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (511/511), done.
Writing objects: 100% (516/516), 70.52 KiB | 0 bytes/s, done.
Total 516 (delta 1), reused 512 (delta 0)
To https://git-server/technoweenie/symmetrical-octo-pancake
 * [new branch]      1 -> 1
 * [new branch]      2 -> 2
 * [new branch]      3 -> 3
 * [new branch]      master -> master
git push 5 --all  2.06s user 1.51s system 7% cpu 51.051 total

With this PR:

$ time git push 4 --all
Git LFS: (501 of 501 files) 1.85 KB / 1.85 KB
Git LFS: (1 of 1 files) 3 B / 3 B
Counting objects: 516, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (511/511), done.
Writing objects: 100% (516/516), 70.52 KiB | 0 bytes/s, done.
Total 516 (delta 1), reused 512 (delta 0)
To https://git-server/technoweenie/symmetrical-octo-funicular
 * [new branch]      1 -> 1
 * [new branch]      2 -> 2
 * [new branch]      3 -> 3
 * [new branch]      master -> master
git push 4 --all  1.62s user 0.89s system 5% cpu 45.280 total

dluksza and others added 15 commits February 3, 2016 12:17
Adds a "ref" parameter, extracted from the git pre-push hook, to the upload
request. This way servers with branch access permissions can enforce
them before large objects are transferred.

Issue: #845
In the push hook, refs come from the user and point to local
branches. Transporting them to the server doesn't make sense.
Don't add an empty "ref" parameter to the lfs-request object when its value
is empty.
Instead of passing around an additional string parameter with the destination
ref, we pass a special struct called UploadMetadata.

For now UploadMetadata contains only the ref field, but in the future more
metadata, like the file name (or path), can be added.
Move the "ref" property of the batch request into a "meta" object; this way all
metadata has a common place to live instead of being spread all around
the batch request.
Since we also want metadata included in the fetch request, the
UploadMetadata struct is renamed to TransferMetadata.

A patch that adds metadata to the fetch request will follow.
Include the meta.ref property in the download request object, the same as in
the upload request.

In the case of download, meta.ref contains the name of the currently checked
out local branch and can only be a suggestion to the server implementation.
Normalize the value of the meta.ref parameter of the batch request
object. Previously the ref parameter format differed between the "download" and
"upload" objects.

The "upload" object always used the fully qualified branch
name (e.g. refs/heads/master), while the "download" object used the short
name (e.g. master).

Now meta.ref has consistent values: the "refs/heads/"
prefix is appended when it is not present, and the special "HEAD" reference
is replaced with an empty string, which results in the meta.ref
parameter not being included in the request object.
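The normalization rules described in this commit can be sketched as follows. This is a hypothetical helper, not the actual Git LFS function; it also treats any existing "refs/" prefix as already qualified, which is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRef sketches the normalization described above: prepend
// "refs/heads/" when the ref is a short branch name, and map the
// special "HEAD" reference to an empty string so the meta.ref
// parameter is omitted from the request object.
func normalizeRef(ref string) string {
	if ref == "" || ref == "HEAD" {
		return ""
	}
	if strings.HasPrefix(ref, "refs/") {
		return ref // already fully qualified (assumption for this sketch)
	}
	return "refs/heads/" + ref
}

func main() {
	fmt.Println(normalizeRef("master"))            // refs/heads/master
	fmt.Println(normalizeRef("refs/heads/master")) // refs/heads/master
	fmt.Printf("%q\n", normalizeRef("HEAD"))       // ""
}
```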
@technoweenie mentioned this pull request Feb 24, 2016
@sinbad
Contributor

sinbad commented Feb 25, 2016

This only addresses the issue when using the pre-push hook, it'll still be there when using git lfs push ref1 ref2 ref3 or git lfs push --all. So actually my comment in #1038 that I edited to ref this one still stands right now.

@@ -137,7 +133,19 @@ func prePushRef(left, right string) {
os.Exit(2)
}
}
}

func filteredPointers(pointers []*lfs.WrappedPointer, filter lfs.StringSet) []*lfs.WrappedPointer {
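For context, the build-a-new-list approach behind this signature might look like the following standalone sketch. The types are simplified stand-ins for lfs.WrappedPointer and lfs.StringSet, and the drop-items-in-the-set semantics is an assumption based on how the function is used in this thread.

```go
package main

import "fmt"

// Simplified stand-ins for lfs.WrappedPointer and lfs.StringSet.
type WrappedPointer struct{ Oid string }
type StringSet map[string]struct{}

// filteredPointers returns the pointers whose OIDs are NOT in filter,
// allocating a fresh slice -- the "build a brand new list" approach
// discussed in this review.
func filteredPointers(pointers []*WrappedPointer, filter StringSet) []*WrappedPointer {
	out := make([]*WrappedPointer, 0, len(pointers))
	for _, p := range pointers {
		if _, skip := filter[p.Oid]; skip {
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	ptrs := []*WrappedPointer{{Oid: "a"}, {Oid: "b"}, {Oid: "c"}}
	uploadedOids := StringSet{"b": {}}
	for _, p := range filteredPointers(ptrs, uploadedOids) {
		fmt.Println(p.Oid) // prints "a" then "c"
	}
}
```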
Contributor

Instead of building a brand new list, how about either doing this as a channel or accepting a callback function for unfiltered items?

Contributor Author

This seemed a lot simpler. Though I am curious enough to try writing some benchmark code for it.

Contributor Author

How about this? https://gist.github.com/technoweenie/34671a1296ba375b477a

$ go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkFilterByList-4         1000       1577368 ns/op      346271 B/op        127 allocs/op
BenchmarkFilterByCallback-4     1000       1532885 ns/op      264313 B/op        125 allocs/op
BenchmarkFilterByChan-4          200       9493396 ns/op      182958 B/op        129 allocs/op
ok      _/Users/rick/p/go-filter-benchmark  6.305s

Still leaning towards the current approach. It's not much slower than using a callback, and has the simplest implementation. Unless there's some radical way to make the channel approach simpler and faster, I'm not even really considering it.

Contributor

I meant more like your callback example, channels as output and still lists as input; but I see the issue is that you need 2 iterations with shared reference data which is causing the channel way to be slow in parallel because of the mutex so it's no benefit over the callback; they'd have to be serialised to avoid the mutex, and it's more complex.

I do prefer the callback style to returning new lists but that's probably the grumpy old man in me seeing 30% more allocations and freaking out.

Contributor Author

I tried using callbacks:

-       pointers = filteredPointers(pointers, uploadedOids)
-       pointers = filteredPointers(pointers, prePushCheckForMissingObjects(pointers))
+       filtered := make([]*lfs.WrappedPointer, 0, len(pointers))
+       cb := func(p *lfs.WrappedPointer) {
+               filtered = append(filtered, p)
+       }
+
+       filteredPointers(pointers, uploadedOids, cb)
+
+       serverOids := prePushCheckForMissingObjects(filtered)
+       filteredPointers(filtered, serverOids, cb)

This is more complex, and introduces a concurrency issue because the second filter call is looping through the same slice while the callback is appending to it. The original approach, though 30% more wasteful with allocations, is the best option IMO. Though, that's assuming filtering twice is a good idea. My intention is to filter out everything that has been uploaded from this same process before checking the server to see what it has. It'll result in fewer batch calls in large import pushes.
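A self-contained version of this two-pass callback style (simplified types, hypothetical names) might look like the sketch below; it gives the second pass its own output slice and callback, rather than appending into the slice being ranged over.

```go
package main

import "fmt"

// WrappedPointer is a simplified stand-in for lfs.WrappedPointer.
type WrappedPointer struct{ Oid string }

// filterPointers invokes cb for each pointer whose OID is not in skip.
func filterPointers(pointers []*WrappedPointer, skip map[string]bool, cb func(*WrappedPointer)) {
	for _, p := range pointers {
		if !skip[p.Oid] {
			cb(p)
		}
	}
}

func main() {
	pointers := []*WrappedPointer{{"a"}, {"b"}, {"c"}, {"d"}}

	// Pass 1: drop everything already uploaded by this process.
	uploadedOids := map[string]bool{"a": true}
	filtered := make([]*WrappedPointer, 0, len(pointers))
	filterPointers(pointers, uploadedOids, func(p *WrappedPointer) {
		filtered = append(filtered, p)
	})

	// Pass 2: drop what the server already has, into a fresh slice
	// and callback (reusing the pass-1 callback would append to the
	// slice being ranged over and duplicate its contents).
	serverOids := map[string]bool{"c": true} // stand-in for a batch-API check
	final := make([]*WrappedPointer, 0, len(filtered))
	filterPointers(filtered, serverOids, func(p *WrappedPointer) {
		final = append(final, p)
	})

	for _, p := range final {
		fmt.Println(p.Oid) // prints "b" then "d"
	}
}
```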

Contributor

introduces a concurrency issue because the second filter call is looping through the same slice while the callback is appending to it

I don't think it does - the filteredPointers and prePushCheckForMissingObjects calls are completely synchronous/serial. But in any case, if you're happier with this style it's fine, I just can't help myself looking for allocation optimisations given my background ;)

Contributor

Now that filteredPointers is used in multiple places, maybe promote it to the lfs package for re-use rather than keep it in the pre-push source file?

Contributor Author

BTW just wanted to acknowledge you're absolutely right about the concurrency thing. Temporary brain fart.

I'm working on an update that attempts to centralize the uploading logic across pre-push and push. I'll probably post that as a PR against this PR so we can review that, and discuss package options there.

@technoweenie
Contributor Author

This only addresses the issue when using the pre-push hook

Yup, that's on purpose. Trying to get feedback on the approach. Then we can take it to the fetch and push commands 🤘

@sinbad
Contributor

sinbad commented Feb 25, 2016

Ah fair enough, approach is fine to me, very similar to what was done before just in a different place. I wondered if it could be generalised & centralised instead but I recognise that would probably have to be in the transfer queue which is just going to cause a whole lot of new race condition possibilities. This way is probably better just for sanity.

@technoweenie
Contributor Author

I wondered if it could be generalised & centralised

Me too! It may or may not happen in this PR. I just wanted to get feedback on this before spending more time on it. I'll reply again when I'm ready for round 2 :)

@technoweenie
Contributor Author

Okay, this works with push --all now. Basically, instead of scanning with lfs.ScanAllMode, it uses git.LocalRefs() and scans each ref separately. I think the behavior is now correct for both lfs push and lfs pre-push, though I'd like to take a stab at consolidating the similar push logic between the two commands.
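The per-ref scan with de-duplication described here can be sketched roughly as follows. The helper names are hypothetical (scanRef stands in for the LFS scanner, and the refs and OIDs are made up); the real code uses git.LocalRefs() and the scanner package.

```go
package main

import "fmt"

// scanRef stands in for the LFS scanner: it returns the OIDs of LFS
// objects reachable from a ref (hard-coded here for the sketch).
func scanRef(ref string) []string {
	objects := map[string][]string{
		"refs/heads/master": {"oid1", "oid2"},
		"refs/heads/topic":  {"oid1", "oid2", "oid3"}, // shares master's objects
	}
	return objects[ref]
}

// uniquePointers scans each local ref separately but queues every
// object only once, instead of re-checking duplicates per branch.
func uniquePointers(localRefs []string) []string {
	seen := make(map[string]bool)
	var upload []string
	for _, ref := range localRefs {
		for _, oid := range scanRef(ref) {
			if !seen[oid] {
				seen[oid] = true
				upload = append(upload, oid)
			}
		}
	}
	return upload
}

func main() {
	refs := []string{"refs/heads/master", "refs/heads/topic"}
	fmt.Println(uniquePointers(refs)) // [oid1 oid2 oid3]; nothing queued twice
}
```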

@sinbad
Contributor

sinbad commented Mar 11, 2016

Looks good 👍 I've seen that checkout failure that's red-flagging the Travis build before, I think it's unrelated to this PR.

extract common pre-push and push uploading code
@technoweenie
Contributor Author

This conflicts with the scanner changes in master, so I'm going to redo it.

@technoweenie technoweenie deleted the skip-pre-push-dupes branch April 6, 2016 18:39