Smarter push, no queues #122
Conversation
There shouldn't be any harm in sending the file twice to the Git Media API (at least the way our server works). It should bail early if the file already exists. I don't think there are any security implications there if the file has already been uploaded for that repository network.
Ah right on, I forgot that it bails early in that case. I was thinking the query API could save round trips, but maybe that tradeoff isn't worth the complexity.
This only works when running `git media push`; I don't think we can get any args down to the pre-push hook from `git push`.
I think we are going to need the ability to query whether the media server has the media object or not. The …
I'd say `HEAD` is the proper method for this. Though `OPTIONS` is fine, I suppose.
Yeah, I'd agree that …
Oh, that makes sense 👍 Be sure to update the API spec.
If the subprocess then launches any commands itself, it will have no environment; e.g. `PATH`, `GOPATH`, etc. are all gone.
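For example, a minimal sketch (not the project's actual code) of launching a subprocess in Go while preserving the parent environment:

```go
// Sketch: preserve the parent environment when launching a subprocess.
// If cmd.Env is left nil the child inherits the parent's environment;
// if it is set explicitly, PATH, GOPATH, etc. vanish unless copied over.
package main

import (
	"os"
	"os/exec"
)

func runWithEnv(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Env = os.Environ() // start from the parent's environment, not an empty slice
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	_ = runWithEnv("git", "version")
}
```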
This will trawl through .git/media/objects and list out the files we know about.
…simpleExec for everything
For purposes like that it is good to record a "config-version" or something like that, which keeps track of the last time the configuration was updated to a new format. Then the update script always knows which config-version it wants to bring the config to. If the configuration is already at that config-version, it just doesn't do anything. This also helps avoid clobbering changes that the user might have made to his/her configuration. If you want to get fancy, you can even do semantic versioning.
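For example, a minimal sketch of that check, assuming a hypothetical `media.configversion` git config key (the key name and the migration steps are illustrative only, not part of this PR):

```go
// Sketch: only run config migrations when the recorded version is behind.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

const wantConfigVersion = 2 // the version this build knows how to migrate to

func currentConfigVersion() int {
	// Hypothetical key; treat "never set" (git config exits non-zero) as version 0.
	out, err := exec.Command("git", "config", "--get", "media.configversion").Output()
	if err != nil {
		return 0
	}
	v, _ := strconv.Atoi(strings.TrimSpace(string(out)))
	return v
}

func maybeUpdateConfig() error {
	if currentConfigVersion() >= wantConfigVersion {
		return nil // already up to date; don't clobber user changes
	}
	// ... run the actual migrations here (update hooks, rebuild link db, etc.) ...
	fmt.Println("updating git media configuration")
	return exec.Command("git", "config", "media.configversion",
		strconv.Itoa(wantConfigVersion)).Run()
}

func main() {
	if err := maybeUpdateConfig(); err != nil {
		fmt.Println("update failed:", err)
	}
}
```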
You notice that the …
Yeah, I was just thinking about recording the config version in there. I think that's a good idea, and that flow you've outlined sounds good.
OK, I think this is ready to start getting some real 👀 on it. There are some automated tests I want to add around pushing in the meantime, but feel free to start tearing it apart!
So does this all break down if the database doesn't have everything? You mention that the database is built as files are cleaned and smudged. But only the HEAD versions of each file are smudged on a fresh clone/checkout. Is this good enough to move repos around?
When does it update the pre-push hook?
Also, we should figure out a way to get the SHA into the git media version. I can't distinguish this from the master branch (but ls-files works so I know I'm on the right version). EDIT: Found …
I think a single command would be fine. We could probably also run a check before commands run, in a similar way to how the init function is run. Not sure if such a thing should auto update or tell the user, though.
I think the scenario you outlined would work fine for moving the repo around, if it uses the same git media server. However, I think that it currently would not work correctly if it's also switching git media servers, like you have. I don't think that would have worked with the old queues either. I haven't focused much on pushing to other git media servers in this PR, but some foundations are there. I bet we only have HEAD versions of the media files in .git/media too in this case. We might need to do some more aggressive scanning for media files when pushing out to a new git media server.
Yes, our current queue system is unable to push a new repo to a new remote. I think this system is a big improvement. At the least, it should help with pushing different branches to different remotes, and dealing with changes that get squashed or reset away. Maybe for v0.4 we can focus on a push command that takes it further and makes it possible to push entire repos and their media around. |
I like this new proposal. I think it fixes a bunch of problems that were in the earlier design.
I assume that these files will be retained even after the corresponding assets have been pushed?

What is the format of these files? Is it extensible? What does the current file parser do if the file contains information (i.e., written by a future …)?

Is there a way to populate this database aside from "smudge" and "clean"? For example, there could be a …

In fact, let me brainstorm about some other plumbing commands that might be useful... There could be another command …

Does it print everything in the database for all commits, or just for HEAD, or what? It might be handy to have closer analogues of …

This is fine as long as you remember that the same asset can have different filenames in different commits (e.g., if the file is renamed), or can even appear multiple times (under different filenames) in a single Git tree. As long as this filename is just for informational purposes it's probably OK to just show one filename that we happen to know. Incidentally, …

Do you update the link files atomically, to avoid corruption if there are two processes running in the same git repo? (It would make the most sense to use git's standard lockfile scheme: create …)

To what extent is it possible (or does it even make sense?) to run git-media in a bare repository?

Is it possible to ask a git-media server for a list of all of the media-ids of the assets that it is holding? Can you ask for other information (especially file sizes, maybe filenames and/or paths (though see caveat above)) without downloading the assets?
Yes.
The format is the same format used by the pointer files. It currently only ensures that an oid is present; it'll happily parse and add any new keys. I did it with the intention of being extensible in case we want to add something.
As it is now, the …
Right now it's just everything in the database. That's about as sophisticated as I needed while building this, but I agree it should be a closer analog to …
It is just for informational purposes. It currently gets the filename from what is passed to the clean and smudge filters. If we find that that isn't good enough, we could build in stuff that works more like …
That's a good point, I'll do that in this PR.
I don't think we've really looked at working with bare repos. Does it make sense to do so? I haven't tried it so I'm not sure yet what does or doesn't currently work.
You can currently get some meta information from the API given an OID; at the moment only the file size, I believe. You cannot currently get a list of all media files for a given repo, but that might be something we need to add with the better multiple remote support. The client does prevent re-uploading assets the server already has, but it might be more efficient to get a list from the server and save a bunch of round trips.
I'd like to avoid this if possible. The current server API is incredibly simple to implement. Though if it makes sense to have optional API endpoints for various git media client commands that are not required for normal operation, that's fine.
@rubyist: Thanks for all the answers; they sound good 👍
All the repos on our servers are bare. So one obvious counter-question is: do we need to run git-media within them? It would be nice for any git-media operations that only require the object database to work in bare repositories. For example, if you implement a command to scan commits for git-media pointer files, there is no reason for it to need a working tree. And that should usually Just Work if you are careful not to hard-code directory names in git-media. For example, never use the literal string …
I was more thinking about how one could implement asset management via the server API. People would likely want to be able to list all assets they are paying for to decide which ones to delete. Even better would be to provide more metadata such as the object sizes, creation time, time of last download, etc. I agree that this functionality is not needed for the first version, but if we are going to leave asset lifecycle management to our users (as suggested by @peff) then we will eventually need to provide them with enough information to do so. Perhaps this should be specified as a second "optional" part of the git-media API.
… waiting for stdin.
The "git lfs ls-files" command was added in PR git-lfs#122 (technically, the precursor "git media ls-files" command) and its test suite converted to shell tests in PR git-lfs#336; however, the basic functionality of the command has never had tests which confirm it handles files in subdirectories appropriately. (Some tests added in later PRs do check files in subdirectories under specific conditions, such as with the --exclude option or with non-ASCII characters in subdirectory names.)

As we expect to expand this command's test suite in subsequent commits in this PR, we first add a new test which simply confirms that the normal output of the command, and the output with the --debug option, perform as expected when files have been both added and removed within a subdirectory.

We also add a test which confirms the same behaviour when the --json option is used. The use of a separate test for this option was preferred in PR git-lfs#5007 when the --json option was first introduced (rather than overloading the existing tests, as was done for the --debug option when it was added in PR git-lfs#2540), so we follow that model here as well.
Smarter Push
The first aspect of smarter push is removing the upload queue. Previously, whenever a git media file was cleaned (i.e. via `git add`) it was added to the upload queue. Then whenever `git push` was run, all queued files were synced. This was a problem because the git media files in the upload queue may not necessarily be in the commit(s) that are actually being pushed.

Instead of a queue, this PR keeps a database linking git's sha1 to the git media oid. This database is just a set of files under `.git/media/objects`, where the file name is the git sha1. This is similar in structure to git's own `.git/objects` database. The contents of each file contain the git media oid and the file name (if available). Future extensions could add more meta information to these files (e.g. which git media remotes it has been pushed to).
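As a rough illustration (a sketch only; the two-character fan-out layout matches the original proposal below, but the key names and helper are assumptions, not necessarily the PR's exact format), a link file keyed by the pointer's git sha1 might be written like this:

```go
// Sketch: record a git-sha1 -> git-media-oid link file under .git/media/objects.
// The "oid"/"filename" key names are illustrative assumptions; the fan-out
// directory (first two hex chars) follows the layout described in this PR.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func writeLinkFile(gitDir, gitSha1, mediaOid, filename string) error {
	dir := filepath.Join(gitDir, "media", "objects", gitSha1[:2])
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	contents := fmt.Sprintf("oid %s\nfilename %s\n", mediaOid, filename)
	// A real implementation would presumably write atomically (temp file +
	// rename, or git's lockfile scheme), per the discussion above.
	return os.WriteFile(filepath.Join(dir, gitSha1[2:]), []byte(contents), 0o644)
}

func main() {
	// Hypothetical values, for illustration only.
	err := writeLinkFile(".git",
		"aabbccddeeff00112233445566778899aabbccdd",
		"4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393",
		"movie.mp4")
	if err != nil {
		fmt.Println("error:", err)
	}
}
```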
Cleaning

With this PR, files are run through the clean filter as normal, and upon success of everything, the git media link file will be created.
Smudging
The git media link files are also created from the smudge filter. This allows the database to be built on clone, and for new files that are synced down to be properly linked up.
Pushing
When `git push` is run, the `pre-push` hook gets information on its command line and via stdin about the refs being pushed and where they're being pushed to. We use this information to get a list of git objects that are being pushed and look those objects up in the git media links database. If the git object has a matching git media file, that file is pushed to the git media endpoint.

There are cases where this will still attempt to push git media objects that the server already has. Before any object is `PUT` to the server, the client requests an `OPTIONS` to verify authentication. The server will now return a 200 status if it already has the file for the oid, and a 204 if it does not. In the event of a 200, the client will not send the `PUT` request.
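A minimal sketch of that check (the object URL layout and the absence of auth headers here are assumptions for illustration; only the OPTIONS 200-vs-204 convention comes from this PR):

```go
// Sketch: skip the upload when the server already has the object.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func pushObject(endpoint, oid string, data []byte) error {
	objectURL := endpoint + "/objects/" + oid // hypothetical URL layout

	req, err := http.NewRequest(http.MethodOptions, objectURL, nil)
	if err != nil {
		return err
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	res.Body.Close()

	if res.StatusCode == http.StatusOK {
		// 200: the server already has this oid, so don't send the PUT.
		return nil
	}
	// 204: the server does not have it yet; upload the content.
	put, err := http.NewRequest(http.MethodPut, objectURL, bytes.NewReader(data))
	if err != nil {
		return err
	}
	putRes, err := http.DefaultClient.Do(put)
	if err != nil {
		return err
	}
	defer putRes.Body.Close()
	if putRes.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %d uploading %s", putRes.StatusCode, oid)
	}
	return nil
}

func main() {
	if err := pushObject("https://example.com/mediaapi", "4d7a2146...", []byte("contents")); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

One nice property of reusing the existing `OPTIONS` round trip is that no new query endpoint is required, which fits the stated goal of keeping the server API simple.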
Updating git media config
This PR contains code that updates some git media configuration. First, the `pre-push` hook needs to be updated because the command line arguments were not being passed to git media.

Second, it will walk all of the git media objects under `.git/media` and create entries in the git media links database to ensure it's populated.

The update mechanism so far is pretty crude. I didn't want to go too far into it without some feedback, so I just got it to the point of working. I'm liking the ideas @mhagger proposed below around adding a configuration version and using that to help determine what to run or how to respond to the user in the event that something needs updating. This will likely be needed in the future if other changes are made.
Other goodies
I've added a couple other things that were useful in building this out. There is now a `git media ls-files` and a `git media push --dry-run`. They work as such:

This pretty much just traverses the git media link database and prints out everything it finds. A more authoritative version could be made where it's actually checking git's database for pointer files.
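For illustration, a naive traversal along the lines described (a sketch assuming the hypothetical link-file keys used earlier, not the command's actual implementation):

```go
// Sketch: a naive "ls-files" that walks the link database and prints what it finds.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := filepath.Join(".git", "media", "objects")
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return nil
		}
		var oid, filename string
		for _, line := range strings.Split(string(data), "\n") {
			if strings.HasPrefix(line, "oid ") {
				oid = strings.TrimPrefix(line, "oid ")
			} else if strings.HasPrefix(line, "filename ") {
				filename = strings.TrimPrefix(line, "filename ")
			}
		}
		fmt.Printf("%s %s\n", oid, filename)
		return nil
	})
}
```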
We have to drop to running `git media push` over `git push` because git does not pass the parameters down into the hook scripts.

Original PR
Here are some proposals for the first steps toward a smarter push.
First, we'll nuke the queue as we currently have it. We shouldn't need this.
When the clean filter runs, e.g. on `git add <file>`, we'll clean it as normal but instead of adding it to a sync queue, we'll get the sha1 that git calculates for the pointer file.

We'll create a file under `.git/media/objects/aa/bbccddeeff00112233445566778899aabbcc`. This file will contain the git media oid. So it's kind of like git's object db, but with pointers to git media objects instead of content.

When the pre-push hook runs, we can gather a list of objects git wants to send from `git rev-list --objects`. As discussed in #104 this list of objects may include files we've already pushed, so we need a way to trim that down.

We can trim the list by adding a query to the server api which, given a list of git media oids, will return the list of objects it doesn't know about, similar to how git negotiates what it's going to send. (Are there security implications to this? It is authenticated and scoped to the repo, so I think it should be OK.)
Once we have that list, we sync those files and continue on as normal.
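A rough sketch of that pre-push flow, assuming the link-database layout sketched earlier (the lookup logic and the rev-list range handling are illustrative, not this PR's actual implementation; the stdin line format is the standard one git gives pre-push hooks):

```go
// Sketch: a pre-push flow that finds git-media objects to upload.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		// Each line: <local ref> <local sha1> <remote ref> <remote sha1>
		parts := strings.Fields(scanner.Text())
		if len(parts) != 4 {
			continue
		}
		localSha, remoteSha := parts[1], parts[3]

		// Objects that would be pushed for this ref update.
		args := []string{"rev-list", "--objects", localSha}
		if remoteSha != strings.Repeat("0", 40) { // remote ref already exists
			args = append(args, "^"+remoteSha)
		}
		out, err := exec.Command("git", args...).Output()
		if err != nil {
			continue
		}
		for _, line := range strings.Split(string(out), "\n") {
			cols := strings.Fields(line)
			if len(cols) == 0 || len(cols[0]) != 40 {
				continue
			}
			sha := cols[0]
			// If a link file exists for this git object, its media object
			// is a candidate for upload to the git media endpoint.
			link := filepath.Join(".git", "media", "objects", sha[:2], sha[2:])
			if _, err := os.Stat(link); err == nil {
				fmt.Println("would push media object linked from", sha)
			}
		}
	}
}
```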
- `git media queues` replacement showing unsynced files
- `--dry-run` option