
Smarter push, no queues #122

Merged
merged 37 commits into master from smarterpush
Oct 1, 2014

Conversation

rubyist
Contributor

@rubyist rubyist commented Sep 18, 2014

Smarter Push

The first aspect of smarter push is removing the upload queue. Previously, whenever a git media file was cleaned (i.e. via git add) it was added to the upload queue. Then whenever git push was run, all queued files were synced. This was a problem because the git media files in the upload queue may not necessarily be in the commit(s) that are actually being pushed.

Instead of a queue, this PR keeps a database linking git's sha1 to the git media oid. This database is just a set of files under .git/media/objects, where the file name is the git sha1. This is similar in structure to git's own .git/objects database. The contents of each file contain the git media oid and the file name (if available). Future extensions could add more meta information to these files (e.g. which git media remotes it has been pushed to).
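A concrete sketch of the link database layout described above; the two-character fan-out and the exact key names here are assumptions based on this description, not the actual on-disk format:

```shell
# Illustrative link file under .git/media/objects; the fan-out mirrors
# git's own object database. The sha1, oid, and key names are made up.
git_sha1="aabbccddeeff00112233445566778899aabbccdd"   # git's sha1 of the pointer blob
media_oid="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

prefix=$(printf %s "$git_sha1" | cut -c1-2)    # fan out like .git/objects
rest=$(printf %s "$git_sha1" | cut -c3-)
link_file=".git/media/objects/$prefix/$rest"

mkdir -p ".git/media/objects/$prefix"
# Pointer-file style key/value content: the git media oid plus the
# file name when available.
printf 'oid %s\nname %s\n' "$media_oid" "x.dat" > "$link_file"
cat "$link_file"
```

Keeping the file name as a second key leaves room for the future extensions mentioned above (e.g. recording which remotes an object has been pushed to) without changing the parser.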

Cleaning

With this PR, files are run through the clean filter as normal, and upon success of everything, the git media link file will be created.

Smudging

The git media link files are also created from the smudge filter. This allows the database to be built on clone, and for new files that are synced down to be properly linked up.

Pushing

When git push is run, the pre-push hook gets information on its command line and via stdin about the refs being pushed and where they're being pushed to. We use this information to get a list of git objects that are being pushed and look those objects up in the git media links database. If the git object has a matching git media file, that file is pushed to the git media endpoint.
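A sketch of what the hook works with. The stdin line format is git's documented pre-push interface; the rev-list step is shown only as a comment since it needs a real repository, and the ref names and sha1s are made up:

```shell
# The pre-push hook receives the remote name and URL as arguments, and
# one line per ref on stdin:
#   <local ref> <local sha1> <remote ref> <remote sha1>
list_pushed_refs() {
    while read -r local_ref local_sha remote_ref remote_sha; do
        # A remote sha of all zeros means the ref is new on the remote.
        # In the real hook, the objects being pushed would come from:
        #   git rev-list --objects "$local_sha" --not "$remote_sha"
        # and each resulting sha1 is looked up under .git/media/objects.
        echo "would scan $remote_sha..$local_sha for $local_ref"
    done
}

printf '%s\n' "refs/heads/master 8332325 refs/heads/master bf76010" |
    list_pushed_refs
```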

There are cases where this will still attempt to push git media objects that the server already has. Before any object is PUT to the server, the client requests an OPTIONS to verify authentication. The server will now return a 200 status if it already has the file for the oid, and a 204 if it does not. In the event of a 200, the client will not send the PUT request.
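A minimal sketch of that client-side decision, assuming the 200/204 statuses described above; the helper name and the curl invocation in the comment are illustrative, not the actual client code:

```shell
# Decide whether to PUT based on the authenticated OPTIONS pre-flight.
should_upload() {
    # $1: HTTP status from the pre-flight, e.g. obtained with something like
    #   status=$(curl -s -o /dev/null -w '%{http_code}' -X OPTIONS "$url")
    case "$1" in
        200) return 1 ;;  # server already has the object; skip the PUT
        204) return 0 ;;  # server does not have it; send the PUT
        *)   return 1 ;;  # anything else: don't upload blindly
    esac
}

if should_upload 204; then echo "PUT object"; else echo "skip"; fi
```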

Updating git media config

This PR contains code that updates some git media configuration. First, the pre-push hook needs to be updated because the command line arguments were not being passed to git media.

Second, it will walk all of the git media objects under .git/media and create entries in the git media links database to ensure it's populated.

The update mechanism so far is pretty crude. I didn't want to go too far into it without some feedback, so I just got it to the point of working. I like the ideas @mhagger proposed below around adding a configuration version and using it to determine what to run, or how to respond to the user when something needs to be updated. This will likely be needed in the future if other changes are made.

Other goodies

I've added a couple of other things that were useful in building this out. There are now git media ls-files and git media push --dry-run commands. They work like so:

media scott$ git media ls-files
x.dat
y.dat
a/b.dat

This pretty much just traverses the git media link database and prints out everything it finds. A more authoritative version could be made where it's actually checking git's database for pointer files.

$ echo "file 1" > file1.dat
$ git add file1.dat
$ git commit -m "add file 1"
[master 8332325] add file 1
 1 file changed, 3 insertions(+)
 create mode 100644 file1.dat
$ echo "file 2" > file2.dat
$ git add file2.dat
$ git commit -m "add file 2"
[master bf76010] add file 2
 1 file changed, 3 insertions(+)
 create mode 100644 file2.dat
$ git media push --dry-run origin master
push file1.dat
push file2.dat

We have to fall back to running git media push rather than git push here because git does not pass these parameters down into the hook scripts.

Original PR

Here are some proposals for the first steps toward a smarter push.

First, we'll nuke the queue as we currently have it. We shouldn't need this.

When the clean filter runs, e.g. on git add <file>, we'll clean it as normal but instead of adding it to a sync queue, we'll get the sha1 that git calculates for the pointer file.

We'll create a file under .git/media/objects/aa/bbccddeeff00112233445566778899aabbcc. This file will contain the git media oid. So it's kind of like git's object db, but with pointers to git media objects instead of content.

When the pre-push hook runs, we can gather a list of objects git wants to send from git rev-list --objects. As discussed in #104 this list of objects may include files we've already pushed, so we need a way to trim that down.

We can trim the list by adding a query to the server api which, given a list of git media oids will return the list of objects it doesn't know about, similar to how git negotiates what it's going to send. (Are there security implications to this? It is authenticated and scoped to the repo, so I think it should be OK.)

Once we have that list, we sync those files and continue on as normal.

  • Create link file when cleaning
  • Update pre push hook
    • Get list of git objects push thinks it's going to send
    • Weed out anything that doesn't have a matching link file
    • Push as normal
  • Handle missing refs case (e.g. pushing a new branch)
  • Provide a git media queues replacement showing unsynced files
  • Provide a --dry-run option
  • Store filename in link file, if available. It gets displayed when pushing
  • Migrate any data that's in an upload queue then 💀 it
  • If a branch is being deleted, ensure no media objects are sent

@technoweenie
Contributor

We can trim the list by adding a query to the server api which, given a list of git media oids will return the list of objects it doesn't know about, similar to how git negotiates what it's going to send. (Are there security implications to this? It is authenticated and scoped to the repo, so I think it should be OK.)

There shouldn't be any harm in sending the file twice to the Git Media API (at least the way our server works). It should bail early if the file already exists. I don't think there are any security implications there if the file has already been uploaded for that repository network.

@rubyist
Contributor Author

rubyist commented Sep 18, 2014

Ah right on, I forgot that it bails early in that case. I was thinking the query API could save round trips, but maybe that tradeoff isn't worth the complexity.

@rubyist
Contributor Author

rubyist commented Sep 19, 2014

I think we are going to need the ability to query whether the media server has the media object or not. The PUT will always send the full payload. Instead of adding another query we could modify the OPTIONS pre-flight check to return a status indicating the server has or does not have the file, or perhaps a header in the response. It's authenticated at that point, so I think that should be OK.

@technoweenie
Contributor

I'd say HEAD is the proper method for this. Though OPTIONS is fine I suppose.

@rubyist
Contributor Author

rubyist commented Sep 19, 2014

Yeah, I'd agree that HEAD is more proper, and I'm really only suggesting OPTIONS because we're already making an OPTIONS call before the PUT.

@technoweenie
Contributor

Oh, that makes sense 👍 Be sure to update the API spec.

@mhagger
Contributor

mhagger commented Sep 25, 2014

The thing about the changes in this PR is that the update only needs to be run on git media repos that were cloned before this version of git media was installed. If you do a fresh clone with this version, the update will not need to be run.

For purposes like that, it is good to record a "config-version" or something similar that keeps track of the format the configuration was last updated to. Then the update script always knows what config-version it wants to bring the config to; if the configuration is already at that config-version, it just doesn't do anything. This also helps avoid clobbering changes that the user might have made to his/her configuration.

If you want to get fancy, you can even do semantic versioning major.minor[.patch] and give every release of git-media a global config version number (which needn't be updated between releases). Then

if config.major < software.major:
    # backwards-incompatible change to configuration; update required
    die("your repository's configuration must be updated to work with this version of git-media\n"
        "warning: after you do the update, you cannot use older git-media versions with this repo")
elif config.major > software.major:
    die("a new version of git-media must be installed to work with this repository")
elif config.minor < software.minor:
    # backwards-compatible configuration update
    update_configuration_automatically()
elif config.minor > software.minor:
    pass  # backwards-compatible change; software can run but new features might be missing
elif config.patch != software.patch:
    pass  # inconsequential differences in configuration format; no update available or needed
else:
    pass  # all version numbers agree

You'll notice that the patch number actually has no effect in this scheme, so it could be omitted. But it might sometimes be useful for debugging.
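This scheme can be sketched as a small, runnable shell check; the version values in the usage lines are hypothetical:

```shell
# Compare a repo's recorded config-version against the software's
# version, following the major/minor scheme outlined above.
check_config_version() {
    # $1: config version "major.minor", $2: software version "major.minor"
    cfg_major=${1%%.*}; cfg_minor=${1#*.}
    sw_major=${2%%.*};  sw_minor=${2#*.}
    if   [ "$cfg_major" -lt "$sw_major" ]; then echo "update required"
    elif [ "$cfg_major" -gt "$sw_major" ]; then echo "upgrade git-media"
    elif [ "$cfg_minor" -lt "$sw_minor" ]; then echo "auto-update config"
    elif [ "$cfg_minor" -gt "$sw_minor" ]; then echo "ok (newer config)"
    else echo "ok"
    fi
}

check_config_version "1.0" "1.2"   # backwards-compatible update available
check_config_version "2.0" "1.2"   # software too old for this repo
```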

@rubyist
Contributor Author

rubyist commented Sep 25, 2014

Yeah, I was just thinking about recording the config version in there. I think that's a good idea, and that flow you've outlined sounds good.

@rubyist rubyist changed the title [WIP] Smarter push, no queues [:eyes:] Smarter push, no queues Sep 25, 2014
@rubyist rubyist changed the title [:eyes:] Smarter push, no queues [Review] Smarter push, no queues Sep 25, 2014
@rubyist
Contributor Author

rubyist commented Sep 25, 2014

OK, I think this is ready to start getting some real 👀 on it. There are some automated tests I want to add around pushing in the meantime, but feel free to start tearing it apart!

@technoweenie
Contributor

Instead of a queue, this PR keeps a database linking git's sha1 to the git media oid.

So does this all break down if the database doesn't have everything? You mention that the database is built as files are cleaned and smudged. But only the HEAD versions of each file are smudged on a fresh clone/checkout.

Is this good enough to move repos around?

$ git clone https://github.com/FOO/BAR
$ git remote add other SOMEOTHERREMOTE
$ git config remote.other.media SOMEOTHERMEDIASERVER
$ git push other master

@technoweenie
Contributor

When does it update the pre-push hook?

$ git media init
Hook already exists: pre-push
git media initialized

$ cat .git/hooks/pre-push 
#!/bin/sh
git media push

Also, we should figure out a way to get the SHA into the git media version. I can't distinguish this from the master branch (but ls-files works so I know I'm on the right version).

EDIT:

Found git media update. Any thoughts on having just a single command to run regardless of whether you're installing for the first time or updating?

@rubyist
Contributor Author

rubyist commented Sep 25, 2014

I think a single command would be fine. We could probably also run a check before commands run, in a similar way to how the init function is run. I'm not sure whether such a thing should auto-update or just tell the user, though.

@rubyist
Contributor Author

rubyist commented Sep 25, 2014

I think the scenario you outlined would work fine for moving the repo around, if it uses the same git media server. However, I think that it currently would not work correctly if it's also switching git media servers, like you have. I don't think that would have worked with the old queues either. I haven't focused much on pushing to other git media servers in this PR, but some foundations are there.

I bet we only have HEAD versions of the media files in .git/media too in this case. We might need to do some more aggressive scanning for media files when pushing out to a new git media server.

@technoweenie
Contributor

Yes, our current queue system is unable to push a new repo to a new remote. I think this system is a big improvement. At the least, it should help with pushing different branches to different remotes, and dealing with changes that get squashed or reset away.

Maybe for v0.4 we can focus on a push command that takes it further and makes it possible to push entire repos and their media around.

@rubyist rubyist mentioned this pull request Sep 26, 2014
2 tasks
@mhagger
Contributor

mhagger commented Sep 26, 2014

I like this new proposal. I think it fixes a bunch of problems that were in the earlier design.

This database is just a set of files under .git/media/objects, where the file name is the git sha1. This is similar in structure to git's own .git/objects database. The contents of each file contain the git media oid and the file name (if available). Future extensions could add more meta information to these files (e.g. which git media remotes it has been pushed to).

I assume that these files will be retained even after the corresponding assets have been pushed?

What is the format of these files? Is it extensible? What does the current file parser do if the file contains information (i.e., written by a future git-media version) that it doesn't recognize?

Is there a way to populate this database aside from "smudge" and "clean"? For example, there could be a git media find-assets [--stdin | <rev-list-args>] command that scans the specified git objects for pointer files and records any that it finds in the database. (Maybe it would even make sense to do this when fetching new objects from upstream?)

In fact, let me brainstorm about some other plumbing commands that might be useful...

There could be another command git media download-assets [--stdin | <rev-list-args>] that ensures that any assets among the specified Git objects are available locally, and git media upload-assets [--stdin | <rev-list-args>] that uploads the corresponding assets to the server. (And the combination of download-assets and upload-assets would pretty much be all you needed to copy assets to a new server.) If there were a git media purge-assets [--local|--remote] [--stdin | [<sha1>|<media-id>]...], then (along with the commands above) users would have a rudimentary way to keep their repository from ballooning up with old media assets and/or to purge old assets from a media server.

git media ls-files
...
This pretty much just traverses the git media link database and prints out everything it finds.

Does it print everything in the database for all commits, or just for HEAD, or what?

It might be handy to have closer analogues of git ls-files and git ls-tree. For example, one could pass git media ls-tree arguments to select what Git tree(s) it should scan. Also, these commands could list the media-ids along with the filenames in the output, like the corresponding git commands, and perhaps optionally indicate which of the assets are present locally and/or present on the server.

Store filename in link file, if available. It gets displayed when pushing

This is fine as long as you remember that the same asset can have different filenames in different commits (e.g., if the file is renamed), or can even appear multiple times (under different filenames) in a single Git tree. As long as this filename is just for informational purposes it's probably OK to just show one filename that we happen to know.

Incidentally, git rev-list --objects already emits the filenames of blobs that it finds. If you are going to run that command anyway, then its output might be more up-to-date than what is stored in the link file.

Do you update the link files atomically, to avoid corruption if there are two processes running in the same git repo? (It would make the most sense to use git's standard lockfile scheme: create <filename>.lock and when it is ready rename it to <filename>.)
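A sketch of that lockfile scheme applied to a link file; the paths are illustrative, and a fuller implementation would also create the .lock file with O_EXCL so a second writer fails loudly instead of clobbering it:

```shell
# Write the full content to <filename>.lock, then rename, so readers
# never observe a partially written link file. The rename is atomic on
# POSIX filesystems when source and target are on the same filesystem.
atomic_write() {
    target=$1; content=$2
    printf '%s\n' "$content" > "$target.lock" &&
        mv "$target.lock" "$target"
}

mkdir -p .git/media/objects/aa
atomic_write .git/media/objects/aa/bbccdd "oid 0123abcd"
cat .git/media/objects/aa/bbccdd
```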

To what extent is it possible (or does it even make sense?) to run git-media in a --bare git repository? If it doesn't work, do you verify that the repo is non-bare before continuing? If it does work, would a --bare git repository on a server be a reasonable back end for a reference implementation of a git-media server?

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?

@rubyist
Contributor Author

rubyist commented Sep 26, 2014

I assume that these files will be retained even after the corresponding assets have been pushed?

Yes.

What is the format of these files? Is it extensible? What does the current file parser do if the file contains information (i.e., written by a future git-media version) that it doesn't recognize?

The format is the same one used by the pointer files. It currently only ensures that an oid is present; it'll happily parse and keep any new keys. I did it that way with the intention of being extensible in case we want to add something.

Is there a way to populate this database aside from "smudge" and "clean"? For example, there could be a git media find-assets [--stdin | <rev-list-args>] command that scans the specified git objects for pointer files and records any that it finds in the database. (Maybe it would even make sense to do this when fetching new objects from upstream?)

As it is now, the git media update command will create entries for all the git media blobs in .git/media. I do want to build something that looks through the git objects for pointer files and either downloads or prompts to download anything that's missing. I think the download, upload, and purge commands you outline are good approaches to that. I think we can do that as part of the next piece, which will build better support for multiple remotes, because we'll need those to do that.

Does it print everything in the database for all commits, or just for HEAD, or what?

Right now it's just everything in the database. That's about as sophisticated as I needed while building this, but I agree it should be a closer analog to git ls-files. I'll open an issue for that with these notes.

This is fine as long as you remember that the same asset can have different filenames in different commits (e.g., if the file is renamed), or can even appear multiple times (under different filenames) in a single Git tree. As long as this filename is just for informational purposes it's probably OK to just show one filename that we happen to know.

It is just for informational purposes. It currently gets the filename from what is passed to the clean and smudge filters. If we find that isn't good enough, we could build in something that works more like rev-list.

Do you update the link files atomically, to avoid corruption if there are two processes running in the same git repo? (It would make the most sense to use git's standard lockfile scheme: create <filename>.lock and when it is ready rename it to <filename>.)

That's a good point, I'll do that in this PR.

To what extent is it possible (or does it even make sense?) to run git-media in a --bare git repository? If it doesn't work, do you verify that the repo is non-bare before continuing? If it does work, would a --bare git repository on a server be a reasonable back end for a reference implementation of a git-media server?

I don't think we've really looked at working with bare repos. Does it make sense to do so? I haven't tried it so I'm not sure yet what does or doesn't currently work.

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?

You can currently get some meta information from the API given an OID; right now that's only the file size, I believe. You cannot currently get a list of all media files for a given repo, but that might be something we need to add with the better multiple remote support. The client does prevent re-uploading assets the server already has, but it might be more efficient to get a list from the server and save a bunch of round trips.

@technoweenie
Contributor

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding?

I'd like to avoid this if possible. The current server API is incredibly simple to implement. That said, if it makes sense to have optional API endpoints for various git media client commands that aren't required for normal operation, that's fine.

@mhagger
Contributor

mhagger commented Sep 29, 2014

@rubyist: Thanks for all the answers; they sound good 👍

To what extent is it possible (or does it even make sense?) to run git-media in a --bare git repository? If it doesn't work, do you verify that the repo is non-bare before continuing? If it does work, would a --bare git repository on a server be a reasonable back end for a reference implementation of a git-media server?

I don't think we've really looked at working with bare repos. Does it make sense to do so? I haven't tried it so I'm not sure yet what does or doesn't currently work.

All the repos on our servers are bare. So one obvious counter-question is: do we need to run git-media within them?

It would be nice for any git-media operations that only require the object database to work in bare repositories. For example, if you implement a command to scan commits for git-media pointer files, there is no reason for it to need a working tree.

And that should usually Just Work if you are careful not to hard-code directory names in git-media. For example, never use the literal string ".git/media/objects", but rather use $(git rev-parse --git-dir)/media/objects or its equivalent. Commands that need a working copy should verify explicitly that the current Git repository is not bare, using something like test "$(git rev-parse --is-bare-repository)" = "false". Your commands should also work if they are invoked in a subdirectory of the working tree by taking advantage of commands like git rev-parse --show-cdup, git rev-parse --show-prefix, and/or git rev-parse --show-toplevel. These commands respect things like the GIT_DIR and GIT_WORK_TREE environment variables, the --git-dir=<path> and --work-tree=<path> git command-line options, and the core.worktree configuration variable, so it is important that you use them directly (preferable!) or duplicate their logic carefully.

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?

You can currently get some meta information from the API given an OID, currently only the file size, I believe. You cannot currently get a list of all media files for a given repo, but that might be something we need to add with the better multiple remote support. The client does prevent re-uploading assets the server already has, but it might be more efficient to get a list from the server and save a bunch of round trips.

I was more thinking about how one could implement asset management via the server API. People would likely want to be able to list all assets they are paying for to decide which ones to delete. Even better would be to provide more metadata such as the object sizes, creation time, time of last download, etc.

I agree that this functionality is not needed for the first version, but if we are going to leave asset lifecycle management to our users (as suggested by @peff) then we will eventually need to provide them with enough information to do so. Perhaps this should be specified as a second "optional" part of the git-media API.

@rubyist rubyist mentioned this pull request Sep 29, 2014
@rubyist rubyist changed the title [Review] Smarter push, no queues Smarter push, no queues Oct 1, 2014
rubyist added a commit that referenced this pull request Oct 1, 2014
@rubyist rubyist merged commit 8a9d65a into master Oct 1, 2014
@rubyist rubyist deleted the smarterpush branch October 1, 2014 14:52
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Feb 20, 2024
The "git lfs ls-files" command was added in PR git-lfs#122 (technically,
the precursor "git media ls-files" command) and its test suite
converted to shell tests in PR git-lfs#336; however, the basic functionality
of the command has never had tests which confirm it handles files
in subdirectories appropriately.  (Some tests added in later PRs do
check files in subdirectories under specific conditions, such as with
the --exclude option or with non-ASCII characters in subdirectory
names.)

As we expect to expand this command's test suite in subsequent commits
in this PR, we first add a new test which simply confirms that the
normal output of the command, and the output with the --debug option,
perform as expected when files have been both added and removed within
a subdirectory.

We also add a test which confirms the same behaviour when the --json
option is used.  The use of a separate test for this option was
preferred in PR git-lfs#5007 when the --json option was first introduced (rather
than overloading the existing tests, as was done for the --debug option
when it was added in PR git-lfs#2540), so we follow that model here as well.