
Smarter push, no queues #122

Merged
merged 37 commits into master from smarterpush
Oct 1, 2014

Conversation

rubyist
Contributor

@rubyist rubyist commented Sep 18, 2014

Smarter Push

The first aspect of smarter push is removing the upload queue. Previously, whenever a git media file was cleaned (i.e. via git add) it was added to the upload queue. Then whenever git push was run, all queued files were synced. This was a problem because the git media files in the upload queue may not necessarily be in the commit(s) that are actually being pushed.

Instead of a queue, this PR keeps a database linking git's sha1 to the git media oid. This database is just a set of files under .git/media/objects, where the file name is the git sha1. This is similar in structure to git's own .git/objects database. The contents of each file contain the git media oid and the file name (if available). Future extensions could add more meta information to these files (e.g. which git media remotes it has been pushed to).
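A concrete sketch of the link database layout described above; the two-character fan-out and the exact key names here are assumptions based on this description, not the actual on-disk format:

```shell
# Illustrative link file under .git/media/objects; the fan-out mirrors
# git's own object database. The sha1, oid, and key names are made up.
git_sha1="aabbccddeeff00112233445566778899aabbccdd"   # git's sha1 of the pointer blob
media_oid="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

prefix=$(printf %s "$git_sha1" | cut -c1-2)    # fan out like .git/objects
rest=$(printf %s "$git_sha1" | cut -c3-)
link_file=".git/media/objects/$prefix/$rest"

mkdir -p ".git/media/objects/$prefix"
# Pointer-file style key/value content: the git media oid plus the
# file name when available.
printf 'oid %s\nname %s\n' "$media_oid" "x.dat" > "$link_file"
cat "$link_file"
```

Keeping the file name as a second key leaves room for the future extensions mentioned above (e.g. recording which remotes an object has been pushed to) without changing the parser.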

Cleaning

With this PR, files are run through the clean filter as normal, and upon success of everything, the git media link file will be created.

Smudging

The git media link files are also created from the smudge filter. This allows the database to be built on clone, and for new files that are synced down to be properly linked up.

Pushing

When git push is run, the pre-push hook gets information on its command line and via stdin about the refs being pushed and where they're being pushed to. We use this information to get a list of git objects that are being pushed and look those objects up in the git media links database. If the git object has a matching git media file, that file is pushed to the git media endpoint.
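A sketch of what the hook works with. The stdin line format is git's documented pre-push interface; the rev-list step is shown only as a comment since it needs a real repository, and the ref names and sha1s are made up:

```shell
# The pre-push hook receives the remote name and URL as arguments, and
# one line per ref on stdin:
#   <local ref> <local sha1> <remote ref> <remote sha1>
list_pushed_refs() {
    while read -r local_ref local_sha remote_ref remote_sha; do
        # A remote sha of all zeros means the ref is new on the remote.
        # In the real hook, the objects being pushed would come from:
        #   git rev-list --objects "$local_sha" --not "$remote_sha"
        # and each resulting sha1 is looked up under .git/media/objects.
        echo "would scan $remote_sha..$local_sha for $local_ref"
    done
}

printf '%s\n' "refs/heads/master 8332325 refs/heads/master bf76010" |
    list_pushed_refs
```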

There are cases where this will still attempt to push git media objects that the server already has. Before any object is PUT to the server, the client requests an OPTIONS to verify authentication. The server will now return a 200 status if it already has the file for the oid, and a 204 if it does not. In the event of a 200, the client will not send the PUT request.
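A minimal sketch of that client-side decision, assuming the 200/204 statuses described above; the helper name and the curl invocation in the comment are illustrative, not the actual client code:

```shell
# Decide whether to PUT based on the authenticated OPTIONS pre-flight.
should_upload() {
    # $1: HTTP status from the pre-flight, e.g. obtained with something like
    #   status=$(curl -s -o /dev/null -w '%{http_code}' -X OPTIONS "$url")
    case "$1" in
        200) return 1 ;;  # server already has the object; skip the PUT
        204) return 0 ;;  # server does not have it; send the PUT
        *)   return 1 ;;  # anything else: don't upload blindly
    esac
}

if should_upload 204; then echo "PUT object"; else echo "skip"; fi
```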

Updating git media config

This PR contains code that updates some git media configuration. First, the pre-push hook needs to be updated because the command line arguments were not being passed to git media.

Second, it will walk all of the git media objects under .git/media and create entries in the git media links database to ensure it's populated.

The update mechanism so far is pretty crude. I didn't want to go too far into it without some feedback, so I just got it to the point of working. I like the ideas @mhagger proposed below around adding a configuration version and using it to determine what to run, or how to respond to the user when something needs to be updated. This will likely be needed in the future if other changes are made.

Other goodies

I've added a couple of other things that were useful in building this out. There are now git media ls-files and git media push --dry-run commands. They work like so:

media scott$ git media ls-files
x.dat
y.dat
a/b.dat

This pretty much just traverses the git media link database and prints out everything it finds. A more authoritative version could be made where it's actually checking git's database for pointer files.

$ echo "file 1" > file1.dat
$ git add file1.dat
$ git commit -m "add file 1"
[master 8332325] add file 1
 1 file changed, 3 insertions(+)
 create mode 100644 file1.dat
$ echo "file 2" > file2.dat
$ git add file2.dat
$ git commit -m "add file 2"
[master bf76010] add file 2
 1 file changed, 3 insertions(+)
 create mode 100644 file2.dat
$ git media push --dry-run origin master
push file1.dat
push file2.dat

We have to fall back to running git media push rather than git push here because git does not pass these parameters down into the hook scripts.

Original PR

Here are some proposals for the first steps toward a smarter push.

First, we'll nuke the queue as we currently have it. We shouldn't need this.

When the clean filter runs, e.g. on git add <file>, we'll clean it as normal but instead of adding it to a sync queue, we'll get the sha1 that git calculates for the pointer file.

We'll create a file under .git/media/objects/aa/bbccddeeff00112233445566778899aabbcc. This file will contain the git media oid. So it's kind of like git's object db, but with pointers to git media objects instead of content.

When the pre-push hook runs, we can gather a list of objects git wants to send from git rev-list --objects. As discussed in #104 this list of objects may include files we've already pushed, so we need a way to trim that down.

We can trim the list by adding a query to the server api which, given a list of git media oids will return the list of objects it doesn't know about, similar to how git negotiates what it's going to send. (Are there security implications to this? It is authenticated and scoped to the repo, so I think it should be OK.)

Once we have that list, we sync those files and continue on as normal.

  • Create link file when cleaning
  • Update pre push hook
    • Get list of git objects push thinks it's going to send
    • Weed out anything that doesn't have a matching link file
    • Push as normal
  • Handle missing refs case (e.g. pushing a new branch)
  • Provide a git media queues replacement showing unsynced files
  • Provide a --dry-run option
  • Store filename in link file, if available. It gets displayed when pushing
  • Migrate any data that's in an upload queue then 💀 it
  • If a branch is being deleted, ensure no media objects are sent

@technoweenie
Contributor

We can trim the list by adding a query to the server api which, given a list of git media oids will return the list of objects it doesn't know about, similar to how git negotiates what it's going to send. (Are there security implications to this? It is authenticated and scoped to the repo, so I think it should be OK.)

There shouldn't be any harm in sending the file twice to the Git Media API (at least the way our server works). It should bail early if the file already exists. I don't think there are any security implications there if the file has already been uploaded for that repository network.

@rubyist
Contributor Author

rubyist commented Sep 18, 2014

Ah right on, I forgot that it bails early in that case. I was thinking the query API could save round trips, but maybe that tradeoff isn't worth the complexity.

@rubyist
Contributor Author

rubyist commented Sep 19, 2014

I think we are going to need the ability to query whether the media server has the media object or not. The PUT will always send the full payload. Instead of adding another query we could modify the OPTIONS pre-flight check to return a status indicating the server has or does not have the file, or perhaps a header in the response. It's authenticated at that point, so I think that should be OK.

@technoweenie
Contributor

I'd say HEAD is the proper method for this. Though OPTIONS is fine I suppose.

@rubyist
Contributor Author

rubyist commented Sep 19, 2014

Yeah, I'd agree that HEAD is more proper, and I'm really only suggesting OPTIONS because we're already making an OPTIONS call before the PUT.

@technoweenie
Contributor

Oh, that makes sense 👍 Be sure to update the API spec.

@mhagger
Contributor

mhagger commented Sep 25, 2014

The thing about the changes in this PR is that the update only needs to be run on git media repos that were cloned before this version of git media was installed. If you do a fresh clone with this version, the update will not need to be run.

For purposes like that, it is good to record a "config-version" or something similar that keeps track of the format the configuration was last updated to. Then the update script always knows what config-version it wants to bring the config to; if the configuration is already at that config-version, it just doesn't do anything. This also helps avoid clobbering changes that the user might have made to his/her configuration.

If you want to get fancy, you can even do semantic versioning major.minor[.patch] and give every release of git-media a global config version number (which needn't be updated between releases). Then

if config.major < software.major:
    # backwards-incompatible change to configuration; update required
    die("your repository's configuration must be updated to work with this version of git-media\n"
        "warning: after you do the update, you cannot use older git-media versions with this repo")
elif config.major > software.major:
    die("a new version of git-media must be installed to work with this repository")
elif config.minor < software.minor:
    # backwards-compatible configuration update
    update_configuration_automatically()
elif config.minor > software.minor:
    pass  # backwards-compatible change; software can run but new features might be missing
elif config.patch != software.patch:
    pass  # inconsequential differences in configuration format; no update available or needed
else:
    pass  # all version numbers agree

You'll notice that the patch number actually has no effect in this scheme, so it could be omitted. But it might sometimes be useful for debugging.
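This scheme can be sketched as a small, runnable shell check; the version values in the usage lines are hypothetical:

```shell
# Compare a repo's recorded config-version against the software's
# version, following the major/minor scheme outlined above.
check_config_version() {
    # $1: config version "major.minor", $2: software version "major.minor"
    cfg_major=${1%%.*}; cfg_minor=${1#*.}
    sw_major=${2%%.*};  sw_minor=${2#*.}
    if   [ "$cfg_major" -lt "$sw_major" ]; then echo "update required"
    elif [ "$cfg_major" -gt "$sw_major" ]; then echo "upgrade git-media"
    elif [ "$cfg_minor" -lt "$sw_minor" ]; then echo "auto-update config"
    elif [ "$cfg_minor" -gt "$sw_minor" ]; then echo "ok (newer config)"
    else echo "ok"
    fi
}

check_config_version "1.0" "1.2"   # backwards-compatible update available
check_config_version "2.0" "1.2"   # software too old for this repo
```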

@rubyist
Contributor Author

rubyist commented Sep 25, 2014

Yeah, I was just thinking about recording the config version in there. I think that's a good idea, and that flow you've outlined sounds good.

@rubyist rubyist changed the title [WIP] Smarter push, no queues [:eyes:] Smarter push, no queues Sep 25, 2014
@rubyist rubyist changed the title [:eyes:] Smarter push, no queues [Review] Smarter push, no queues Sep 25, 2014
@rubyist
Contributor Author

rubyist commented Sep 25, 2014

OK, I think this is ready to start getting some real 👀 on it. There are some automated tests I want to add around pushing in the meantime, but feel free to start tearing it apart!

@technoweenie
Contributor

Instead of a queue, this PR keeps a database linking git's sha1 to the git media oid.

So does this all break down if the database doesn't have everything? You mention that the database is built as files are cleaned and smudged. But only the HEAD versions of each file are smudged on a fresh clone/checkout.

Is this good enough to move repos around?

$ git clone https://github.com/FOO/BAR
$ git remote add other SOMEOTHERREMOTE
$ git config remote.other.media SOMEOTHERMEDIASERVER
$ git push other master

@technoweenie
Contributor

When does it update the pre-push hook?

$ git media init
Hook already exists: pre-push
git media initialized

$ cat .git/hooks/pre-push 
#!/bin/sh
git media push

Also, we should figure out a way to get the SHA into the git media version. I can't distinguish this from the master branch (but ls-files works so I know I'm on the right version).

EDIT:

Found git media update. Any thoughts on having just a single command to run regardless of whether you're installing for the first time or updating?

@rubyist
Contributor Author

rubyist commented Sep 25, 2014

I think a single command would be fine. We could probably also run a check before commands run, in a similar way to how the init function is run. I'm not sure whether such a thing should auto-update or just tell the user, though.

@rubyist
Contributor Author

rubyist commented Sep 25, 2014

I think the scenario you outlined would work fine for moving the repo around, if it uses the same git media server. However, I think that it currently would not work correctly if it's also switching git media servers, like you have. I don't think that would have worked with the old queues either. I haven't focused much on pushing to other git media servers in this PR, but some foundations are there.

I bet we only have HEAD versions of the media files in .git/media too in this case. We might need to do some more aggressive scanning for media files when pushing out to a new git media server.

@technoweenie
Contributor

Yes, our current queue system is unable to push a new repo to a new remote. I think this system is a big improvement. At the least, it should help with pushing different branches to different remotes, and dealing with changes that get squashed or reset away.

Maybe for v0.4 we can focus on a push command that takes it further and makes it possible to push entire repos and their media around.

@rubyist rubyist mentioned this pull request Sep 26, 2014
2 tasks
@mhagger
Contributor

mhagger commented Sep 26, 2014

I like this new proposal. I think it fixes a bunch of problems that were in the earlier design.

This database is just a set of files under .git/media/objects, where the file name is the git sha1. This is similar in structure to git's own .git/objects database. The contents of each file contain the git media oid and the file name (if available). Future extensions could add more meta information to these files (e.g. which git media remotes it has been pushed to).

I assume that these files will be retained even after the corresponding assets have been pushed?

What is the format of these files? Is it extensible? What does the current file parser do if the file contains information (i.e., written by a future git-media version) that it doesn't recognize?

Is there a way to populate this database aside from "smudge" and "clean"? For example, there could be a git media find-assets [--stdin | <rev-list-args>] command that scans the specified git objects for pointer files and records any that it finds in the database. (Maybe it would even make sense to do this when fetching new objects from upstream?)

In fact, let me brainstorm about some other plumbing commands that might be useful...

There could be another command git media download-assets [--stdin | <rev-list-args>] that ensures that any assets among the specified Git objects are available locally, and git media upload-assets [--stdin | <rev-list-args>] that uploads the corresponding assets to the server. (And the combination of download-assets and upload-assets would pretty much be all you needed to copy assets to a new server.) If there were a git media purge-assets [--local|--remote] [--stdin | [<sha1>|<media-id>]...], then (along with the commands above) users would have a rudimentary way to keep their repository from ballooning up with old media assets and/or to purge old assets from a media server.

git media ls-files
...
This pretty much just traverses the git media link database and prints out everything it finds.

Does it print everything in the database for all commits, or just for HEAD, or what?

It might be handy to have closer analogues of git ls-files and git ls-tree. For example, one could pass git media ls-tree arguments to select what Git tree(s) it should scan. Also, these commands could list the media-ids along with the filenames in the output, like the corresponding git commands, and perhaps optionally indicate which of the assets are present locally and/or present on the server.

Store filename in link file, if available. It gets displayed when pushing

This is fine as long as you remember that the same asset can have different filenames in different commits (e.g., if the file is renamed), or can even appear multiple times (under different filenames) in a single Git tree. As long as this filename is just for informational purposes it's probably OK to just show one filename that we happen to know.

Incidentally, git rev-list --objects already emits the filenames of blobs that it finds. If you are going to run that command anyway, then its output might be more up-to-date than what is stored in the link file.

Do you update the link files atomically, to avoid corruption if there are two processes running in the same git repo? (It would make the most sense to use git's standard lockfile scheme: create <filename>.lock and when it is ready rename it to <filename>.)
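A sketch of that lockfile scheme applied to a link file; the paths are illustrative, and a fuller implementation would also create the .lock file with O_EXCL so a second writer fails loudly instead of clobbering it:

```shell
# Write the full content to <filename>.lock, then rename, so readers
# never observe a partially written link file. The rename is atomic on
# POSIX filesystems when source and target are on the same filesystem.
atomic_write() {
    target=$1; content=$2
    printf '%s\n' "$content" > "$target.lock" &&
        mv "$target.lock" "$target"
}

mkdir -p .git/media/objects/aa
atomic_write .git/media/objects/aa/bbccdd "oid 0123abcd"
cat .git/media/objects/aa/bbccdd
```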

To what extent is it possible (or does it even make sense?) to run git-media in a --bare git repository? If it doesn't work, do you verify that the repo is non-bare before continuing? If it does work, would a --bare git repository on a server be a reasonable back end for a reference implementation of a git-media server?

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?

@rubyist
Contributor Author

rubyist commented Sep 26, 2014

I assume that these files will be retained even after the corresponding assets have been pushed?

Yes.

What is the format of these files? Is it extensible? What does the current file parser do if the file contains information (i.e., written by a future git-media version) that it doesn't recognize?

The format is the same one used by the pointer files. It currently only ensures that an oid is present; it'll happily parse and keep any new keys. I did it that way with the intention of being extensible in case we want to add something.

Is there a way to populate this database aside from "smudge" and "clean"? For example, there could be a git media find-assets [--stdin | <rev-list-args>] command that scans the specified git objects for pointer files and records any that it finds in the database. (Maybe it would even make sense to do this when fetching new objects from upstream?)

As it is now, the git media update command will create entries for all the git media blobs in .git/media. I do want to build something that looks through the git objects for pointer files and either downloads or prompts to download anything that's missing. I think the download, upload, and purge commands you outline are good approaches to that. I think we can do that as part of the next piece, which will build better support for multiple remotes, because we'll need those to do that.

Does it print everything in the database for all commits, or just for HEAD, or what?

Right now it's just everything in the database. That's about as sophisticated as I needed while building this, but I agree it should be a closer analog to git ls-files. I'll open an issue for that with these notes.

This is fine as long as you remember that the same asset can have different filenames in different commits (e.g., if the file is renamed), or can even appear multiple times (under different filenames) in a single Git tree. As long as this filename is just for informational purposes it's probably OK to just show one filename that we happen to know.

It is just for informational purposes. It currently gets the filename from what is passed to the clean and smudge filters. If we find that isn't good enough, we could build in something that works more like rev-list.

Do you update the link files atomically, to avoid corruption if there are two processes running in the same git repo? (It would make the most sense to use git's standard lockfile scheme: create <filename>.lock and when it is ready rename it to <filename>.)

That's a good point, I'll do that in this PR.

To what extent is it possible (or does it even make sense?) to run git-media in a --bare git repository? If it doesn't work, do you verify that the repo is non-bare before continuing? If it does work, would a --bare git repository on a server be a reasonable back end for a reference implementation of a git-media server?

I don't think we've really looked at working with bare repos. Does it make sense to do so? I haven't tried it so I'm not sure yet what does or doesn't currently work.

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?

You can currently get some meta information from the API given an OID; right now that's only the file size, I believe. You cannot currently get a list of all media files for a given repo, but that might be something we need to add with the better multiple remote support. The client does prevent re-uploading assets the server already has, but it might be more efficient to get a list from the server and save a bunch of round trips.

@technoweenie
Contributor

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding?

I'd like to avoid this if possible. The current server API is incredibly simple to implement. That said, if it makes sense to have optional API endpoints for various git media client commands that aren't required for normal operation, that's fine.

@mhagger
Contributor

mhagger commented Sep 29, 2014

@rubyist: Thanks for all the answers; they sound good 👍

To what extent is it possible (or does it even make sense?) to run git-media in a --bare git repository? If it doesn't work, do you verify that the repo is non-bare before continuing? If it does work, would a --bare git repository on a server be a reasonable back end for a reference implementation of a git-media server?

I don't think we've really looked at working with bare repos. Does it make sense to do so? I haven't tried it so I'm not sure yet what does or doesn't currently work.

All the repos on our servers are bare. So one obvious counter-question is: do we need to run git-media within them?

It would be nice for any git-media operations that only require the object database to work in bare repositories. For example, if you implement a command to scan commits for git-media pointer files, there is no reason for it to need a working tree.

And that should usually Just Work if you are careful not to hard-code directory names in git-media. For example, never use the literal string ".git/media/objects", but rather use $(git rev-parse --git-dir)/media/objects or its equivalent. Commands that need a working copy should verify explicitly that the current Git repository is not bare, using something like test "$(git rev-parse --is-bare-repository)" = "false". Your commands should also work if they are invoked in a subdirectory of the working tree by taking advantage of commands like git rev-parse --show-cdup, git rev-parse --show-prefix, and/or git rev-parse --show-toplevel. These commands respect things like the GIT_DIR and GIT_WORK_TREE environment variables, the --git-dir=<path> and --work-tree=<path> git command-line options, and the core.worktree configuration variable, so it is important that you use them directly (preferable!) or duplicate their logic carefully.

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?

You can currently get some meta information from the API given an OID, currently only the file size, I believe. You cannot currently get a list of all media files for a given repo, but that might be something we need to add with the better multiple remote support. The client does prevent re-uploading assets the server already has, but it might be more efficient to get a list from the server and save a bunch of round trips.

I was more thinking about how one could implement asset management via the server API. People would likely want to be able to list all assets they are paying for to decide which ones to delete. Even better would be to provide more metadata such as the object sizes, creation time, time of last download, etc.

I agree that this functionality is not needed for the first version, but if we are going to leave asset lifecycle management to our users (as suggested by @peff) then we will eventually need to provide them with enough information to do so. Perhaps this should be specified as a second "optional" part of the git-media API.

@rubyist rubyist mentioned this pull request Sep 29, 2014
@rubyist rubyist changed the title [Review] Smarter push, no queues Smarter push, no queues Oct 1, 2014
rubyist added a commit that referenced this pull request Oct 1, 2014
@rubyist rubyist merged commit 8a9d65a into master Oct 1, 2014
@rubyist rubyist deleted the smarterpush branch October 1, 2014 14:52
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Feb 20, 2024
The "git lfs ls-files" command was added in PR git-lfs#122 (technically,
the precursor "git media ls-files" command) and its test suite
converted to shell tests in PR git-lfs#336; however, the basic functionality
of the command has never had tests which confirm it handles files
in subdirectories appropriately.  (Some tests added in later PRs do
check files in subdirectories under specific conditions, such as with
the --exclude option or with non-ASCII characters in subdirectory
names.)

As we expect to expand this command's test suite in subsequent commits
in this PR, we first add a new test which simply confirms that the
normal output of the command, and the output with the --debug option,
perform as expected when files have been both added and removed within
a subdirectory.

We also add a test which confirms the same behaviour when the --json
option is used.  The use of a separate test for this option was
preferred in PR git-lfs#5007 when the --json option was first introduced (rather
than overloading the existing tests, as was done for the --debug option
when it was added in PR git-lfs#2540), so we follow that model here as well.