Proposal: Drop smudge in favor of post-checkout hook #616
Comments
Are post-checkout hooks invoked with actions like reset?
@whoisj I think so, but that's definitely a use case we'd need to check.
I've started an implementation which I'm tracking over at ttaylorr#1 😄
Unreal Engine uses a post-checkout hook for a very similar use case: they run the GitDependencies.exe tool (written by Epic Games) to download large binary files. I asked them about their git-lfs plans: https://answers.unrealengine.com/questions/293214/gitdependenciesexe-vs-git-lfs.html
Finally got a chance to sit down with this, and there are a few issues I'd like to address. The idea was to use the post-checkout hook in place of the smudge filter. Once I implemented this, nearly a third of integration tests were failing. I discovered the culprit is usually cloning a bare repo, and then pulling down commits from the remote later. This edge case does not invoke the post-checkout hook. @whoisj brings up another issue: whether the hook runs for actions like reset. To me, these two cases together are significant for users of Git LFS. I'd like to discuss whether or not a post-checkout-based approach is still worth pursuing.
Thanks for checking this out! Let's start small:
My pleasure! I think taking that route would be a solid move.
I looked at hooks as an alternative to filters a while ago (before I started contributing to git-lfs), and found they were too full of holes on the client side; they were clearly envisioned mostly for server-side processing. I was hoping you might find something I missed, but I'm not surprised you've come to this conclusion. I agree the smudge filters should remain, as much as they're not ideal. Bear in mind that people using

I'm in two minds about post-checkout: I think it might be harder to explain to people which conditions it covers and which it doesn't, rather than simply giving people the option to turn off automatic lfs pointer->content conversion entirely and saying they have to run `git lfs pull` themselves.
A while back I started to play around with this idea, but didn't get very far. It might be a bad idea, so if you think it is or prove it is, I'm fine with that 😄 The smudge filters could start and feed a daemon, similar to the way the credential cache daemon works. It could go something like this:
The smudge filter can have logic on whether to do this or behave normally, depending on configuration. It could perhaps also make this choice based on object sizes, letting small objects operate normally and feeding large ones to a daemon; I don't know whether that would be of any benefit or not.

The immediate caveat with this approach is that the user feedback is poor. If the filters don't wait for the daemon to finish, the repo will be in a weird, changing state until it finishes. There also may be a problem updating the index while the clone is running. I don't know for sure if git locks the index for the entire operation, or per file. I think we can work around that, though; the smudge filter that starts the daemon ought to be able to look at

I'll admit that's fairly complicated. There may still be index updating issues and this approach may not work at all, but I thought I'd throw it out there anyway. I think it's possible, but the cost may not be worth it.
I think this is a good idea but that the unpredictable parallel behaviour will be a bitch. What files git locks while it's calling all the smudge filters is certainly one issue, but another is simply the daemon fighting with the smudge filter (one writing the pointer data, the other writing the real data) if the retrieval is fast enough. Then there's just the inevitable nastiness of using lock files & daemons on Windows (cue all sorts of crazy problems from slight delays in locks being visibly released by the OS, leading to retries and horrid edge cases).

It would be nice if we could take the idea of a background aggregation tool but make it more predictable, say by having the entire git command book-ended so that the daemon starts at the beginning, and finishes up at the end of all the smudge filters, at that point updating the working copy and the index in bulk when it knows everything else is out of the way (but being able to do anything else in parallel). I can't see a way of doing that reliably without fully wrapping the git commands, though.

But then again, maybe that would be ok as an option for those wanting higher perf? Each wrapper command could just use goroutines for parallelism and then call out to git for everything else; all state would be contained & no need for lock files, daemons & timeouts that I'm pretty sure will bite us in the arse in random ways.
Wrapping git would make many things we do much easier. I like that we've avoided it to this point, but a higher-perf clone/checkout wrapper is probably the easiest way to go. The wrapper command doesn't need to be complex: it can set an env variable which tells the smudge filters to act as a passthrough, then run the underlying git command.
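To make the wrapper idea concrete, here is a minimal sketch, assuming an environment variable (`GIT_LFS_SKIP_SMUDGE` here) that turns the smudge filter into a passthrough; the function name is made up:

```shell
#!/bin/sh
# Hypothetical clone wrapper: clone with smudge downloads disabled, then
# fetch every LFS object in one batched, concurrent pass.
lfs_clone() {
  url=$1; dir=$2
  # The smudge filters see this variable and leave raw pointer files in
  # place, so the clone itself stays fast.
  GIT_LFS_SKIP_SMUDGE=1 git clone "$url" "$dir" &&
  # One `git lfs pull` then uses the batch API + concurrent transfers.
  git -C "$dir" lfs pull
}
```

Usage would be `lfs_clone https://github.com/user/repo.git repo`; all state stays inside the one wrapper process, with no daemon or lock files.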
Yeah, I think that's the pragmatic option. Although we could probably still make a start on the required work as the smudge filters are running; either on-demand, by unambiguous output messages in stderr that we see from the wrapper (and not reflecting that back to the real stderr), or even by pre-empting the process entirely and starting work that's read-only on the core git data (though not necessarily lfs data) in parallel at the beginning of the wrapper - since in a
I'm with @sinbad on this one; a background worker daemon would be awesome, but super tricky to reason about. I'm sure there are countless ways that a user could mess up that process, and having to recover from all of them would be very annoying. I'm not opposed to wrapping more git commands for users that want/need higher perf out of Git LFS. I think we should still prefer the unwrapped commands, but as long as we're explicit about that and the wrapped commands have well-documented behavior, I think we'd be cool.
Agreed on all points. FWIW: one of my original plans over a year ago included adding support through hub. This was before I knew about the
Sorry if this has already been answered, but in the case of smudge where you're restricted to one file at a time, why not add parallelism within each individual file download by doing chunked/segmented downloading? This is how download managers work and it helps increase download speeds tremendously.
Just to clarify, I'm suggesting that during smudge, you open several connections when downloading each single S3 object and use the Range field in your request headers to grab different segments of that individual file, then assemble these segments on the client side. See this link for details.

Having to run separate commands or use wrapper commands to enable batch downloads is a bit onerous and error-prone, so it would be nice if you considered an option like this that could still fit within the one-file-at-a-time smudge filter paradigm and yet have good download performance. You can also have some kind of basic heuristic to decide when to use multiple connections per file, so you don't end up spinning up 8 connections to download a 50KB file or something (which could take longer than a single connection).
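The segmentation described above is mostly range arithmetic. A sketch, assuming the storage server honors HTTP Range requests; the URL and sizes are hypothetical, and the function prints the curl commands rather than executing them so the logic is visible without a network:

```shell
# Print one Range-request command per segment of a single object.
segmented_fetch() {
  url=$1; size=$2; parts=$3
  chunk=$(( (size + parts - 1) / parts ))   # ceiling division
  i=0
  while [ "$i" -lt "$parts" ]; do
    start=$(( i * chunk ))
    end=$(( start + chunk - 1 ))
    [ "$end" -ge "$size" ] && end=$(( size - 1 ))   # clamp final segment
    # In a real client each of these would run in the background ('&'),
    # then the parts would be reassembled in order with cat.
    echo "curl -s -r $start-$end -o part.$i $url"
    i=$(( i + 1 ))
  done
}
```

For example, `segmented_fetch https://example-bucket.s3.amazonaws.com/obj 100000000 4` yields four non-overlapping quarter-size ranges.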
This is a completely separate idea from the segmented downloading: Smudge filter starts
Can someone please explain why this wouldn't work? It seems pretty obvious, so I'm assuming I'm just missing something. I still think range requests / segmented downloads are a better overall option, because they improve download performance even if you're grabbing only a single large file, whereas file-level parallelism doesn't offer an improvement for that case.
Thanks for the suggestion. The challenge is that Git LFS is not a long running system that keeps state. Each smudge call is a separate process with no knowledge of other processes. It doesn't know if it's smudging the first or the last file. So even if a single smudge can download a single file in parallel, it's still adding overhead by calling the LFS API and spinning up new processes for each file. That said, it may be a good solution. I'd love to see results if you want to do some experiments.
Sure, I created this repo to benchmark things (40x10MB and 40x1MB files tracked by lfs, and 3801 C and header files from the linux kernel tracked normally by git; the LFS-tracked files were created by dd'ing /dev/urandom).

First, I test how long LFS takes when the smudge is doing the downloading one file at a time:
SmudgeWithDownload:

Next I delete my old clone and run:

SmudgeWithoutDownload:

Assuming nothing about my methodology is off, for this particular repo and my connection, download time is overwhelmingly the bottleneck and the thing that needs to be parallelized. I think multi-part downloading of single files during smudge could therefore be a huge win in many cases.

By the way, I'm on linux. On windows, fork() and particularly file IO is just slower, so I expect everything will be worse overall.
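The benchmark commands themselves were lost in the scrape; a plausible reconstruction of the two cases, based on the surrounding description (the repo URL is made up, and the timed step in both cases is the `git checkout master`):

```shell
# Case 1 ("SmudgeWithDownload"): the smudge filter downloads each LFS
# object serially while the checkout runs.
run_smudge_with_download() {
  git clone -n "$1" bench && cd bench
  git checkout master        # time this step
}

# Case 2 ("SmudgeWithoutDownload"): prefetch all objects concurrently,
# then let smudge read from the local object cache.
run_smudge_without_download() {
  git clone -n "$1" bench && cd bench
  git lfs fetch              # batched, concurrent download
  git checkout master        # time this step
}
```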
How long did the `git lfs fetch` step take?
Updated my post with my git lfs fetch time. Note I'm not comparing git clone times at all whatsoever. If you check my commands, I only time the "git checkout master" commands.
The git lfs fetch time is not a part of my timings either -- I've only timed "git checkout master" times -- in the first case without having done "lfs fetch", in the latter case after already having done "lfs fetch".
I think you came to the wrong conclusions. Smudge filters are stateless and run once per file (one file, one filter process). For each file, the filter makes:

- a first HTTPS request to the GitHub API server (to get the real file location);
- a second HTTPS request to the storage server (to download the content).

These requests are executed sequentially. The HTTPS/SSH handshake is a very expensive operation and depends on the network round-trip time. The smudge filter is unusably slow even on a local network (I wrote about the same problem in #376). It would be nice if Git could batch these operations across files, but that requires API changes inside Git.
I understand the single file / single process nature of smudge, but I wasn't aware that LFS has such a long sequential handshake sequence for every file, which will indeed hurt badly. Can someone give some pointers to documents/code explaining why the "first https request for github api server (get real file location)" that bozaro mentioned is necessary? Why can't a stable base URL be agreed upon and used for all requests, with the object part of the URL based on the file hash, which the git client will already have knowledge of? Does S3 not allow for object naming or something?
To understand the scale of the problem with handshakes, I cloned my micro repository (https://github.com/bozaro/test) with a single lfs file:
Yep, it's a huge problem. I know windows is super bad too, because even forking can be expensive there. Imo, the git api has needed work to better handle this problem for a long time. If that can happen, it's ideal.

Going back to the above dialog between @sinbad and @rubyist: a daemon, launched at the first smudge, that identifies what ref it's checking out, builds a list of the lfs objects, then gathers all of the direct object download URLs in parallel (and/or perhaps via the batch api) might be nice, so that the ongoing smudge processes can immediately have access to those download URLs rather than needing to do the whole sequential handshake on every smudge. The smudge processes could still be responsible for writing out the smudged large files, so that you don't have to deal with any weird locking issues from the daemon fighting the smudge processes for the ability to write/read certain files.

Taking that a step further, the daemon could also maintain an https connection to S3 and act as a local proxy for the smudge processes, so that those smudge processes don't have to constantly set up new https connections.
Basically, the LFS API is the stable base URL. This way we don't have to tie into any specific back-end storage, or dictate how the back-end storage might work.

There's another logging mode people looking at this kind of thing might be interested in. You can set the env var

Or this for an upload (5 files, batched here):

This gives you specific request/response header and body sizes and times for api and storage operations. The caveat is that it's process based, so a smudge-based clone will leave 1 log per file; you'd want to aggregate those for analysis.
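For reference, the batch exchange mentioned throughout the thread is a single JSON POST per group of objects. A sketch of the payload, assuming the published Git LFS batch API shape (the oid and size here are made up, and `batch_payload` is a hypothetical helper, not part of git-lfs):

```shell
# Build the JSON body for a batch "download" request. A single POST of
# this payload to <endpoint>/objects/batch, with the
# application/vnd.git-lfs+json media type, returns a download href per
# object in one round trip -- avoiding a per-file handshake.
batch_payload() {
  oid=$1; size=$2
  printf '{"operation": "download", "objects": [{"oid": "%s", "size": %s}]}' \
    "$oid" "$size"
}
```

A real client would send it with something like `curl -X POST -H "Content-Type: application/vnd.git-lfs+json" -d "$(batch_payload <oid> <size>)" <endpoint>/objects/batch`.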
I think an implicit daemon will create a lot of problems (for example, it will block the removal of directories on Windows). Losing an explicit way to know when retrieval of data from the server has completed is unacceptable (mainly for build farms and build scripts). Because of this, the daemon only reduces the severity of the handshake problem, but it cannot solve the problem of sequential queries. It is also not clear how the daemon would interact with a filter:

I think it is necessary to solve the problem by updating the Git API. The rest of the options seem too complicated to operate.
Yeah, ideally such things would be more of a stopgap until git apis are changed.
👍
+1
+1 plz fix
Does anyone have a working post-checkout hook solution to this problem, as a stopgap until a more robust solution is implemented?
@my-digital-decay Yes, you can define the env var
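The variable referenced above is presumably `GIT_LFS_SKIP_SMUDGE` (the documented way to make the smudge filter write raw pointers). A hypothetical stopgap hook to pair with it, saved as `.git/hooks/post-checkout` and made executable, might look like:

```shell
#!/bin/sh
# Hypothetical post-checkout hook. Git invokes it with:
#   $1 = previous HEAD, $2 = new HEAD, $3 = 1 for a branch checkout.
# With smudge-time downloads disabled, one batched `git lfs pull`
# replaces all pointers for the new working tree.
post_checkout() {
  prev=$1; new=$2; branch_flag=$3
  if [ "$prev" != "$new" ]; then   # only when HEAD actually moved
    git lfs pull
  fi
}
# In the real hook file, end with:  post_checkout "$@"
```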
Thanks! I'll give that a try.
@sinbad, can you walk me through how to do this on a new clone? I tried:

That seemed to clone without engaging the lfs downloads. Then I tried:

Git lfs then downloaded the files as one batch, but then errored out. I'm guessing this is because the HEAD wasn't checked out by git clone -n...
@WestonThayer Assuming you're using the latest git-lfs, once you've done the
@ajohnson23 thank you, I think I have my head wrapped around it now. That looked like it downloaded all my lfs files sequentially. Is there a way to parallelize at least part of it?
@WestonThayer Add the following to your .gitconfig, and tweak `concurrenttransfers` to your liking:
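The snippet itself was lost in the scrape; it presumably set the `lfs.concurrenttransfers` key, for example:

```ini
[lfs]
    # Number of parallel object transfers used by fetch/pull.
    # The value 8 is an illustrative example, not a recommendation.
    concurrenttransfers = 8
```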
The default is 3.

Personally I prefer to leave the smudge filter enabled globally so that switching branch automatically updates my working copy, and only disable it at clone time to speed things up. To do that you just need to set the environment variable `GIT_LFS_SKIP_SMUDGE=1`.
Thanks I get it now. I posted a summary for others.
The improvements of the batch API and the updated `fetch` command are wasted on the Git smudge filters. Each checked-out file has to invoke `smudge`, which only operates on a single file at a time. There's no parallelism here (unless Git adds it).

The `post-checkout` hook runs after a `git checkout`, after the work tree has been updated. The hook can then call `git lfs pull`, which downloads the LFS objects for the current working directory and replaces the pointers. This means it can make full use of the batch API and concurrent transfers.

Some downsides:

- Fresh clones won't have the `post-checkout` hook. We could solve this with a `git lfs clone` wrapper.
- What about users with existing `post-checkout` scripts? I know that the Unreal Engine uses one, for example. Git needs a way for multiple tools to add their hooks in harmony.