Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve push command, kill the queue #104

Closed
technoweenie opened this issue Aug 17, 2014 · 12 comments
Closed

improve push command, kill the queue #104

technoweenie opened this issue Aug 17, 2014 · 12 comments

Comments

@technoweenie
Copy link
Contributor

It's time to beef up the queue.

  • Don't queue the same file twice.
  • If a queued file is not in the push, don't push it.
  • Add a command to gc the queue.
  • Look for other Git Media objects in the queue to push.

These features are too much for the queuedir package. It'd be awesome to find a good Go based k/v store that'll work cross compiled to every platform.

Fixes #82 and #99. /cc #102 (comment)

@rubyist
Copy link
Contributor

rubyist commented Aug 17, 2014

How much data are we talking? More than an in-memory map and writing to disk with gob?

@technoweenie
Copy link
Contributor Author

We may be good with just a simple key/value map. The key should be based on the filename. If a filename is queued again, we can just update that key's value. The last 3 items just depend on the ability to figure out what is being pushed, so we can look around for git media blobs.

@technoweenie
Copy link
Contributor Author

One other option: Change push so it requires the remote and branch (which should be provided by the pre-push hook. The pre-push hook gets this in STDIN:

{local ref}       {sha} {remote ref}       {remote sha}
refs/heads/master 67890 refs/heads/foreign 12345

We can get this easily with:

$ git ls-remote -h origin master
b336a0f59a945b53259856e94cfb2440bfd5ca4e    refs/heads/master

@mhagger
Copy link
Contributor

mhagger commented Aug 21, 2014

We can get this easily with:

$ git ls-remote -h origin master
b336a0f59a945b53259856e94cfb2440bfd5ca4e  refs/heads/master

Easy, yes, but don't forget that git ls-remote requires a round-trip to the upstream repo so it is a bit expensive.

@mhagger
Copy link
Contributor

mhagger commented Aug 21, 2014

It seems to me that instead of recording the filename, it would be more robust and easier to handle the SHA-1 of the blob representing the pointer file. If and only if that blob has to be pushed to the git server, then the corresponding git-media blob has to be pushed to the media server.

@technoweenie technoweenie changed the title add a smarter queue improve push command, kill the queue Sep 15, 2014
@rubyist rubyist self-assigned this Sep 16, 2014
@rubyist
Copy link
Contributor

rubyist commented Sep 16, 2014

Ok, I'm leveling up my git game here, want to check in to make sure I'm on the right track. Here's a scenario:

I've got 3 commits I want to push and sync, efd7, da5e, and 60a0. When I run git push origin master, the pre-commit hook will get: origin https://github.com/rubyist/mediarepo on its command line, and refs/heads/master efd7 refs/head/master 9b77 on its stdin. Cool.

Now, I want to figure out what objects it's about to push so I know what to sync. I know that I could do git rev-list --objects efd7 ^9b77 and get a list of objects. Is this an accurate representation of what it's about to push? I'm thinking you just trawl through that list, ignore anything that isn't a blob, and 💥 you've got your list of objects to sync.

Does that sound legit? Is there a better way to get that list?

@mhagger
Copy link
Contributor

mhagger commented Sep 16, 2014

s/pre-commit/pre-push/, as you obviously know.

git rev-list --objects efd7 ^9b77 will include all of the objects being pushed, but it can also include far more objects, namely objects that already exist on the remote because they are reachable from a different branch. For example, if I create a new commit and push it:

git checkout -b mybranch master
git push origin mybranch

then the hook script will get

refs/heads/mybranch efd7 refs/heads/mybranch 0000

and your rev-list command will list all of the commits on mybranch even though those commits are on master, and many of them are likely already present at origin.

(Actually, your command would fail because ^0000000000000000000000000000000000000000 is not a valid argument to git rev-list. So in this case the analogous command would be

git rev-parse efd7

. But you know what I mean.)

AFAIK there is no way to get at the information you want (namely, the list of objects being newly pushed to the remote) without either hacking git or duplicating git's negotiation with the remote server 😦 Essentially you would need to know the value of all of the remote server's references before your push.

Maybe @peff knows a trick.

@technoweenie
Copy link
Contributor Author

Essentially you would need to know the value of all of the remote server's references before your push.

This should work, right?

# Do we want pull request refs too?
$ git ls-remote  --heads --tags
From https://github.com/github/git-media
fe6bba2b69f7c18dbf274a9bb958b75f05a50468    refs/heads/ensureprefix
3dc9a647e8b6df76f4f3f5cf026262f09b78f819    refs/heads/master
d71891cb680f16246c236946f07fbe8e49f52365    refs/tags/v0.0.1
4843542e284497c1e3745eed63059f791604868e    refs/tags/v0.1.0
690c8da8487c76b3db7bb12a35d9bab84be484ba    refs/tags/v0.2.0
d5255d7be0d70996b8561a26998f2e0300a2710a    refs/tags/v0.2.1
26fe77e8f089272f44b6968a194b3e0ae4d9a578    refs/tags/v0.2.1-p1
a3f157876e6c30a4cb9ad385feebd33dad5e1b75    refs/tags/v0.2.2
3dc9a647e8b6df76f4f3f5cf026262f09b78f819    refs/tags/v0.2.3

If we're pushing a new branch, we'd want to find the nearest ref (probably master, but it'd depend on where the push originated from).

@rubyist: Having a "dry run" option would be useful here. Have git media push --dry-run list the media blobs it intends to push. Maybe a --debug option could show more detailed info about the git refs it used to come up with that list.

@rubyist
Copy link
Contributor

rubyist commented Sep 16, 2014

In the case of a new branch where git really is pushing all of the objects, we can only know whether or not we should sync a git media file by asking the git media server. The git media server should already prevent uploading a full blob it already knows about, so maybe it's OK to use the full list that rev-parse gives? We could perhaps add a remote-list like command of our own to avoid all the roundtrips that just 201 or whatever.

We'll definitely add a --dry-run option.

@peff
Copy link

peff commented Sep 17, 2014

There's not really a trick here. The closest simple thing is probably git rev-list --objects $all_the_things_you_are_pushing --not --remotes=$remote_you_are_pushing_to. That is slightly different than what git generates because we are using what's in refs/remotes (i.e., the cache from the last time we fetched), whereas git push will actually ask the upstream for all of its ref tips. So it falls down in a few cases:

  1. We might not be pushing to a named ref with tracking branches at all (e.g., git push $url).
  2. The tracking branches might not be up to date. Usually this means we have a subset of what the other side has, and we'll err on the side of thinking we have to send more objects. But if the other side has rewound a ref, we could err in the other direction.
  3. The other side may have refs not included in what we've fetched (e.g., refs/pull, or even .have tips from alternates repositories).

You can get closer with:

# Find out what the other side has, but omit any objects that we don't
# have ourselves. Give them a `^`-negation, and then use the
# result to limit our object list.
git ls-remote $remote |
cut -f1 |
git cat-file --batch-check |
perl -lne 'print "^$1" if /^([0-9a-f]{40})/' |
git rev-list --objects $stuff_you_are_pushing --stdin

That's still not quite right. The server side of a push may advertise slightly more refs than ls-remote will report, as it will advertise tips for alternates repos it has access to. We don't currently do that ourselves, but other servers might.

Git doesn't actually do any other fancy negotiation to find common refs during a push, like it does for a fetch. The assumption is that you're typically building on what the other side has, so one of their refs should be a subset of what you already have. That could change in the future, but I don't think anybody has ever proposed to make it so.

@mhagger
Copy link
Contributor

mhagger commented Sep 17, 2014

@peff, does the server also report peeled versions of tags?

It sounds like a git ls-remote --haves $remote command or something like that might be useful here.

Though really, it is a shame to have to run git ls-remote separately. It costs another round-trip, and possibly opens up race conditions between the pre-push script trying to second-guess what will be sent vs. the git push itself which knows exactly what it will send. For this purpose it would be better to have a hook that gets, on stdin, a list of all objects that are about to be pushed.

@peff
Copy link

peff commented Sep 17, 2014

@mhagger Yeah, I never noticed that, but we don't advertise peeled tags for pushes. Another issue: receive-pack does not necessarily advertise the same set of refs as upload-pack. Specifically, our servers do not advertise refs/pull for pushes (because it confuses --mirror pushers).

I agree that ls-remote --haves would be useful and would solve all of these problems, but that's going to mean a dependency on a newer version of git. If we're going that route, we might as well go with a separate sent-objects hook, which has all the advantages you noted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants