
Proposal: Drop smudge in favor of post-checkout hook #616

Closed
technoweenie opened this issue Aug 28, 2015 · 42 comments

@technoweenie
Contributor

The improvements from the batch API and the updated fetch command are wasted on Git's smudge filters. Every checked-out file invokes the smudge filter, which operates on a single file at a time. There's no parallelism here (unless Git adds it).

The post-checkout hook runs after git checkout, once the work tree has been updated. The hook can then call git lfs pull, which downloads the LFS objects for the current working directory and replaces the pointers. This means it can make full use of the batch API and concurrent transfers.

Some downsides:

  1. Cloning is more difficult. I don't know of a way to clone with a post-checkout hook. We could solve this with a git lfs clone wrapper.
  2. What happens if users have their own post-checkout scripts? I know that the Unreal Engine uses one, for example. Git needs a way for multiple tools to add their hooks in harmony.
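
For reference, a minimal sketch of what such a hook might look like (Git passes the previous HEAD, the new HEAD, and a branch-checkout flag to post-checkout; restricting the pull to branch checkouts is an assumption of this sketch, not settled behavior):

```shell
#!/bin/sh
# .git/hooks/post-checkout receives three arguments from Git:
#   $1 = previous HEAD, $2 = new HEAD,
#   $3 = 1 for a branch checkout, 0 for a file checkout
prev_head="$1"
new_head="$2"
is_branch_checkout="$3"

# Only download LFS content on branch checkouts; this uses the batch API
# and concurrent transfers instead of one smudge process per file.
if [ "$is_branch_checkout" = "1" ]; then
    git lfs pull
fi
```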
@whoisj

whoisj commented Aug 28, 2015

Are post-checkout hooks invoked with actions like reset?

@technoweenie
Contributor Author

@whoisj I think so, but that's definitely a use case we'd need to check.

@ttaylorr
Contributor

I've started an implementation which I'm tracking over at ttaylorr#1 😄.

@bozaro
Contributor

bozaro commented Aug 29, 2015

Unreal Engine uses a post-checkout hook for a very similar use case: it runs the GitDependencies.exe tool (written by Epic Games) to download large binary files.

I asked them about their git-lfs plans: https://answers.unrealengine.com/questions/293214/gitdependenciesexe-vs-git-lfs.html

@ttaylorr
Contributor

ttaylorr commented Sep 2, 2015

Finally got a chance to sit down with this, and there are a few issues I'd like to address. The idea was to use the post-checkout hook as a means to trigger git lfs pull, which would in effect run the smudge filter in parallel.

Once I implemented this, nearly a third of integration tests were failing. I discovered the culprit is usually cloning a bare repo, and then pulling down commits from the remote later. This edge case does not invoke the post-checkout hook. To me, this seems like a correct implementation of post-checkout since going from no commits to some commits doesn't involve a checkout. I am not immediately sure how to solve this problem beyond forcing users to run git lfs pull in that scenario.

@whoisj brings up another issue: resets in Git also do not trigger a post-checkout.

To me, these two cases together are significant for users of Git LFS. I'd like to discuss whether or not a post-checkout hook is the ideal way to implement this, and see if there are any other options.

@technoweenie
Contributor Author

Thanks for checking this out! Let's start small:

  1. Keep the current smudge filter as the default.
  2. You had the idea in chat of passing a flag to git lfs init to swap lfs download modes: smudge filter vs post-checkout.
  3. Be upfront about all the caveats that we know about so users can make an informed decision if smudge filters take too long.

@ttaylorr
Contributor

ttaylorr commented Sep 3, 2015

My pleasure! I think taking that route would be a solid move.

@sinbad
Contributor

sinbad commented Sep 3, 2015

I looked at hooks as an alternative to filters a while ago (before I started contributing to git-lfs) and found they were too full of holes on the client side; they were clearly envisioned mostly for server-side processing. I was hoping you might find something I missed, but I'm not surprised you've come to this conclusion. I agree the smudge filters should remain, as much as that's not ideal.

Bear in mind that people using git lfs fetch --recent will avoid some of the waiting in future, since there's a better chance they will already have downloaded objects by the time they check out / pull, particularly if they regularly run git fetch on its own (and some tools, like SourceTree, have an option to run fetch periodically in the background).

I'm in two minds about post-checkout. I think it might be harder to explain to people which conditions it covers and which it doesn't, rather than simply giving people the option to turn off automatic lfs pointer->content conversion entirely and saying they always have to run git lfs checkout themselves.

@rubyist
Contributor

rubyist commented Sep 3, 2015

A while back I started to play around with this idea, but didn't get very far. It might be a bad idea, so if you think it is or prove it is, I'm fine with that 😄

The smudge filters could start and feed a daemon, similar to the way the credential cache daemon works. It could go something like this:

  • Smudge filter starts
    • Start daemon if it has not been started (maybe check for a lock file in .git/lfs or something)
    • Feed oid/pointer data to the daemon
    • Output the pointer data
  • Daemon runs
    • Receive input from a smudge filter
    • Download file, update git index (possible batching)
    • Exit after n seconds of inactivity (no active transfers, nothing received from filter)

The smudge filter can have logic on whether to do this or behave normally, depending on configuration. It could perhaps also make this choice based on object sizes, letting small objects operate normally and feeding large ones to the daemon. I don't know whether that would be of any benefit.

The immediate caveat with this approach is that the user feedback is poor. If the filters don't wait for the daemon to finish the repo will be in a weird, changing state until it finishes. There also may be a problem updating the index while the clone is running. I don't know for sure if git locks the index for the entire operation, or per file.

I think we can work around that, though. The smudge filter that starts the daemon ought to be able to look at .git/HEAD (or something, that stuff should all be there by that time) and know what ref it's checking out. It should be able to scan for lfs objects to know how many it's going to check out, and start up the daemon with those expectations. The daemon could then give better feedback and know when it's done. It could even tell each smudge command which number it is, so the last smudge filter to run could wait for the daemon to exit.

I'll admit that's fairly complicated. There may still be index updating issues and this approach may not work at all, but I thought I'd throw it out there anyway. I think it's possible, but the cost may not be worth it.

@sinbad
Contributor

sinbad commented Sep 3, 2015

I think this is a good idea but that the unpredictable parallel behaviour will be a bitch. What files git locks while it's calling all the smudge filters is certainly one issue, but another is simply the daemon fighting with the smudge filter (one writing the pointer data, the other writing the real data) if the retrieval is fast enough. Then there's just the inevitable nastiness of using lock files & daemons on Windows (cue all sorts of crazy problems from slight delays in locks being visibly released by the OS, leading to retries and horrid edge cases).

It would be nice if we could take the idea of a background aggregation tool but make it more predictable, say by book-ending the entire git command so that the daemon starts at the beginning and finishes up at the end of all the smudge filters, at that point updating the working copy and the index in bulk once it knows everything else is out of the way (while still being able to do anything else in parallel). I can't see a way of doing that reliably without fully wrapping the git commands, though. But then again, maybe that would be OK as an option for those wanting higher perf? Each wrapper command could just use goroutines for parallelism and then call out to git for everything else; all state would be contained, with no need for the lock files, daemons & timeouts that I'm pretty sure will bite us in the arse in random ways.

@rubyist
Contributor

rubyist commented Sep 3, 2015

Wrapping git would make many things we do much easier. I like that we've avoided it to this point, but a higher perf clone/checkout wrapper is probably the easiest way to go. The wrapper command doesn't need to be complex, it can set an env variable which tells the smudge filters to act as a passthrough then run lfs fetch and friends after git runs. It shouldn't need any parallelism or daemons.
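
A sketch of what that wrapper might look like, as a shell function (the GIT_LFS_SKIP_SMUDGE passthrough variable and the naive URL-to-directory parsing are illustrative; a real wrapper would need proper option handling):

```shell
# Hypothetical clone wrapper: make the smudge filter a passthrough,
# let git clone run, then batch-download the LFS content afterwards.
lfs_clone() {
    url="$1"
    # Pointers are written to the work tree untouched during the clone.
    GIT_LFS_SKIP_SMUDGE=1 git clone "$url" || return 1
    # Naive target-directory guess from the URL.
    dir=$(basename "$url" .git)
    # Fetch objects concurrently via the batch API and replace the pointers.
    ( cd "$dir" && git lfs pull )
}
```

Because all the state lives in one wrapper process, no daemon, lock file, or timeout is needed.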

@sinbad
Contributor

sinbad commented Sep 3, 2015

Yeah, I think that's the pragmatic option. Although we could probably still make a start on the required work while the smudge filters are running: either on demand, via unambiguous output messages on stderr that the wrapper watches for (without reflecting them back to the real stderr), or even by pre-empting the process entirely and starting work that's read-only on the core git data (though not necessarily lfs data) in parallel at the beginning of the wrapper. In a git checkout wrapper we already have all the information we need to start the git lfs fetch before git even starts; only the git lfs checkout has to be done at the end.

@ttaylorr
Contributor

ttaylorr commented Sep 3, 2015

I'm with @sinbad on this one; a background worker daemon would be awesome, but super tricky to reason about. I'm sure there are countless ways a user could mess up that process, and having to recover from all of them would be very annoying.

I'm not opposed to wrapping more git commands for users that want/need higher perf out of Git LFS. I think we should still prefer the unwrapped commands, but as long as we're explicit about that and the wrapped commands have well-documented behavior, I think we'd be cool.

@technoweenie
Contributor Author

Agreed on all points. FWIW: one of my original plans over a year ago included adding support through hub. This was before I knew about the pre-push hook though. I was worried about adding an extra step to all pushes.

@AndrewJDR
Contributor

Sorry if this has already been answered, but in the case of smudge where you're restricted to one file at a time, why not add parallelism within each individual file download by doing chunked/segmented downloading? This is how download managers work and it helps increase download speeds tremendously.

@AndrewJDR
Contributor

Just to clarify, I'm suggesting that during smudge, you open several connections when downloading each single S3 object and use the Range field in your request headers to grab different segments of that individual file, then assemble these segments on the client side. See this link for details.

Having to run separate commands or use wrapper commands to enable batch downloads is a bit onerous and error prone, so it would be nice if you considered an option like this that could still fit within the one-file-at-a-time smudge filter paradigm and yet had good download performance.

You can also have some kind of basic heuristic to decide when to use multiple connections per file, so you don't end up spinning up 8 connections to download a 50KB file or something (could take longer than a single connection?)
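
A rough sketch of that segmented approach (illustrative only: it assumes the server honors Range headers, as S3 does, and a real implementation would need the object's size up front plus retries and integrity checks):

```shell
# segmented_fetch URL SIZE PARTS OUT — download URL in PARTS parallel
# ranged requests and reassemble them into OUT.
segmented_fetch() {
    url="$1"; size="$2"; parts="$3"; out="$4"
    chunk=$(( (size + parts - 1) / parts ))

    i=0
    while [ "$i" -lt "$parts" ]; do
        start=$(( i * chunk ))
        end=$(( start + chunk - 1 ))
        [ "$end" -ge "$size" ] && end=$(( size - 1 ))
        # Each segment downloads concurrently in the background.
        curl -s -r "$start-$end" -o "$out.part$i" "$url" &
        i=$(( i + 1 ))
    done
    wait

    # Concatenate the segments in order.
    : > "$out"
    i=0
    while [ "$i" -lt "$parts" ]; do
        cat "$out.part$i" >> "$out" && rm -f "$out.part$i"
        i=$(( i + 1 ))
    done
}
```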

@AndrewJDR
Contributor

This is a completely separate idea from the segmented downloading:

Smudge filter starts

  • Just before receiving stdin (or after receiving just enough of it to know it's a git-lfs pointer?) but before writing to stdout, it checks whether there are any objects that need to be downloaded and stored in the .git/lfs/object cache (via git ls-files or equivalent which doesn't require a lock on the index).
  • If objects do need downloading and caching, do that right now in parallel. git will (hopefully?) just wait for us while we finish that.
  • Once the download/caching of the objects is done, (bear in mind we still haven't processed our first file through the smudge filter at this point) read the rest of stdin and via stdout give git the smudged data that it has been waiting for (from cache).
  • The smudge filter will continue to run on other files. Because we have already downloaded and cached all objects, the smudge filter can provide the data immediately from the .git/lfs/objects cache.

Can someone please explain why this wouldn't work? It seems pretty obvious, so I'm assuming I'm just missing something.

I still think range requests / segmented downloads are a better overall option, because they improve download performance even if you're grabbing only a single large file, whereas file-level parallelism doesn't offer an improvement for that case.

@technoweenie
Contributor Author

Thanks for the suggestion. The challenge is that Git LFS is not a long running system that keeps state. Each smudge call is a separate process with no knowledge of other processes. It doesn't know if it's smudging the first or the last file. So even if a single smudge can download a single file in parallel, it's still adding overhead by calling the LFS API and spinning up new processes for each file.

That said, it may be a good solution. I'd love to see results if you want to do some experiments.

@AndrewJDR
Contributor

Sure, I created this repo to benchmark things (40x10MB and 40x1MB files tracked by lfs, and 3801 C and header files from the linux kernel tracked normally by git. The LFS tracked files were created by dd'ing /dev/urandom):
https://github.com/ajohnson23/git-lfs-test-repo.git

First, I test how long LFS takes when the smudge is doing the downloading one file at a time:

# -n means do not run checkout, so no smudge.
# This is just a plain old git checkout so we're not interested in how long this takes:
git clone -n https://github.com/ajohnson23/git-lfs-test-repo.git
# checkout runs smudge, which downloads the LFS tracked files one at a time:
cd git-lfs-test-repo && time git checkout master

SmudgeWithDownload:
real 12m48.320s
user 0m11.524s
sys 0m4.712s

Next I delete my old clone and run:

# Same as before:
git clone -n https://github.com/ajohnson23/git-lfs-test-repo.git
# Downloads all the LFS tracked files and caches them into .git/lfs/objects
# This is NOT what we're measuring here, but it does give us an idea of how
# fast downloads can be when they're optimized - it takes about 40sec on my
# system (batch enabled)
cd git-lfs-test-repo && git lfs fetch --all
# SmudgeOnlyTime: Still running the smudge filter,
# so still forking LFS for every file, but instead of downloading as before
# it just copies from .git/lfs/objects:
time git checkout master

SmudgeWithoutDownload:
real 0m0.830s
user 0m0.152s
sys 0m0.448s

Assuming nothing about my methodology is off, for this particular repo and my connection, download time is overwhelmingly the bottleneck and the thing that needs to be parallelized. I think multi-part downloading of single files during smudge could therefore be a huge win in many cases.

By the way, I'm on linux. On windows fork() and particularly file IO is just slower, so I expect everything will be worse overall.

@technoweenie
Contributor Author

How long did the git lfs fetch take? Also, comparing a git lfs fetch --all is not an equal comparison to a plain git clone. The plain git clone will only download what's being checked out. So compare it to git lfs fetch instead.

@AndrewJDR
Contributor

Updated my post with my git lfs fetch time.

Note I'm not comparing git clone times at all whatsoever. If you check my commands, I only time the "git checkout master" commands.

@AndrewJDR
Contributor

The git lfs fetch time is not a part of my timings either -- I've only timed "git checkout master" times -- in the first case without having done "lfs fetch", in the latter case, after already having done "lfs fetch"

@bozaro
Contributor

bozaro commented Oct 6, 2015

@ajohnson23

I think multi-part downloading of single files during smudge could therefore be a huge win in many cases.

I think you came to the wrong conclusions.

Smudge filters are stateless and make, for each file (one file, one filter process):

  • an ssh request (for repositories accessed over ssh, though not in your case) to run git-lfs-authenticate;
  • a first https request to the github api server (to get the real file location);
  • a second https request to download the file from the CDN.

These requests are executed sequentially. The HTTPS/SSH handshake is a very expensive operation and depends on the network round-trip time.

The smudge filter is unusably slow even on a local network (I wrote about the same problem in #376).

It would be nice if Git could:

  • use one process for multiple files (looks like CGI vs FastCGI);
  • run multiple git-lfs filters in parallel mode.

But that requires API changes inside Git.

@AndrewJDR
Contributor

I understand the single file / single process nature of smudge, but I wasn't aware that LFS has such a long sequential handshake sequence for every file, which will indeed hurt badly.

Can someone give some pointers to documents/code explaining why "first https request for github api server (get real file location);" that bozaro mentioned is necessary? Why can't a stable base URL be agreed upon and used for all requests with the object part of the URL based on the file hash which the git client will already have knowledge of? Does S3 not allow for object naming or something?

@bozaro
Contributor

bozaro commented Oct 6, 2015

Can someone give some pointers to documents/code explaining

See: https://github.com/github/git-lfs/tree/master/docs/api
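
For reference, the download request those docs describe is roughly as follows. This helper only builds the JSON body; the oid/size values and the endpoint host in the comment are illustrative, not real:

```shell
# Build the JSON body for a batch download request as described in the
# linked API docs; oid and size come from the LFS pointer file.
batch_payload() {
    oid="$1"; size="$2"
    printf '{"operation":"download","objects":[{"oid":"%s","size":%s}]}' \
        "$oid" "$size"
}

# It would then be POSTed to the batch endpoint, e.g. (illustrative host):
#   curl -X POST -H 'Content-Type: application/vnd.git-lfs+json' \
#        -d "$(batch_payload "$oid" "$size")" \
#        https://example.com/repo.git/info/lfs/objects/batch
# The response maps each object to a download URL, which is why each file
# costs an API round-trip before the actual transfer.
```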

@bozaro
Contributor

bozaro commented Oct 6, 2015

To understand the scale of the problem with handshakes, I cloned my micro repository (https://github.com/bozaro/test) with a single lfs file:

time git lfs fetch
Fetching master
(1 of 1 files) 33.64 KB / 40.27 KB

real    0m1.779s
user    0m0.032s
sys 0m0.008s

@AndrewJDR
Contributor

Yep, it's a huge problem. I know windows is super bad too, because even forking can be expensive there. Imo, the git api has needed work to better handle this problem for a long time. If that can happen, it's ideal.

Going back to the dialog between @sinbad and @rubyist above: a daemon launched at the first smudge could identify what ref is being checked out, build a list of the lfs objects, and then gather all of the direct object download URLs in parallel (and/or via the batch API), so that the ongoing smudge processes immediately have access to those URLs rather than doing the whole sequential handshake on every smudge. The smudge processes could still be responsible for writing out the smudged large files, so you don't have to deal with any weird locking issues from the daemon fighting the smudge processes for the ability to write/read certain files.

Taking that a step further, the daemon could also maintain a https connection to S3 and act as a local proxy for the smudge processes so that those smudge processes don't have to constantly set up a new https connection.

@rubyist
Contributor

rubyist commented Oct 6, 2015

Can someone give some pointers to documents/code explaining why "first https request for github api server (get real file location);" that bozaro mentioned is necessary? Why can't a stable base URL be agreed upon and used for all requests with the object part of the URL based on the file hash which the git client will already have knowledge of?

Basically, the LFS API is the stable base URL. This way we don't have to tie into any specific back end storage, or dictate how the back end storage might work.

There's another logging mode people looking at this kind of thing might be interested in. You can set the env var GIT_LOG_STATS=1 for a command and it'll dump some specifics about HTTP operations to logs under .git/lfs/objects/logs/http. They'll look something like this for a download:

concurrent=3 batch=true time=1444127717 version=1.0.0
key=lfs.api.download reqheader=264 reqbody=0 resheader=121 resbody=337 restime=678553523 status=200 url=http://localhost:8080/objects/b1129123471803dc4f81c9c5a6c410061c5b1c33c736cf177619160e99f96a39
key=lfs.data.download reqheader=244 reqbody=0 resheader=117 resbody=33 restime=2646905 status=200 url=http://localhost:8080/objects/b1129123471803dc4f81c9c5a6c410061c5b1c33c736cf177619160e99f96a39

Or this for an upload (5 files, batched here):

concurrent=3 batch=true time=1444127871 version=1.0.0
key=lfs.api.batch reqheader=265 reqbody=459 resheader=122 resbody=1689 restime=670966601 status=200 url=http://localhost:8080/objects/batch
key=lfs.data.upload reqheader=284 reqbody=33 resheader=17 resbody=0 restime=2199957 status=200 url=http://localhost:8080/objects/cec8b8184af7b53f58e94b86f75162a29316f9f948a284cd2052eb6441dcc4a0
key=lfs.data.upload reqheader=284 reqbody=33 resheader=17 resbody=0 restime=2761676 status=200 url=http://localhost:8080/objects/bcc6fd9f4734af7917a1779990d20141c19f0bccef20ecfa383fdea76463e792
key=lfs.data.upload reqheader=284 reqbody=33 resheader=17 resbody=0 restime=3975767 status=200 url=http://localhost:8080/objects/0ec730dbfdf697246668c09b9bf74432ca68a44b43080dad4d113fd01649d2ed
key=lfs.data.upload reqheader=284 reqbody=33 resheader=17 resbody=0 restime=1834502 status=200 url=http://localhost:8080/objects/abc4d932160acbac152940b8893cfe41cd772a0709be203a9cb1b93cdd7a52b4
key=lfs.data.upload reqheader=284 reqbody=33 resheader=17 resbody=0 restime=1934129 status=200 url=http://localhost:8080/objects/360dff6f03c6ee9b94e4d139ae0294fd6f3cf10d727b5518296615f9817a57d7

This gives you specific request/response header and body sizes and times for api and storage operations. The caveat is that it's process based so a smudge-based clone will leave 1 log per file, so you'd want to aggregate those for analysis.
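
Since those logs are one file per process, a quick way to aggregate them might look like this (field positions taken from the sample lines above; the path and output format are illustrative, not an official tool):

```shell
# Sum request counts and response-body bytes per key across all logs.
# Sample line: key=… reqheader=… reqbody=… resheader=… resbody=… restime=… …
cat .git/lfs/objects/logs/http/* | awk '
    /^key=/ {
        split($1, k, "="); split($5, b, "=")
        n[k[2]]++; bytes[k[2]] += b[2]
    }
    END {
        for (x in n) printf "%s requests=%d resbody_bytes=%d\n", x, n[x], bytes[x]
    }'
```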

@bozaro
Contributor

bozaro commented Oct 6, 2015

I think the implicit daemon will create a lot of problems (for example, it will block the removal of directories on Windows).

Losing an explicit point at which the process of obtaining data from the server completes is unacceptable (mainly for build farms and build scripts). Because of this, the daemon only reduces the severity of the handshake problem, but cannot solve the problem of sequential queries.

It is also not clear how the daemon would communicate with a filter:

  • TCP is not safe in a multi-user environment;
  • file sockets are platform-specific.

I think it is necessary to solve the problem by updating the Git API. The rest of the options seem too complicated to operate.

@AndrewJDR
Contributor

Yeah, ideally such things would be more of a stopgap until git apis are changed.
One issue is that this problem is far worse on windows, and the core git devs simply do not historically care about windows. I therefore do not have high hopes for the git api being changed anytime soon to accommodate this. This is why I'm proposing more drastic hacks/workarounds for this problem than I would normally entertain.
Oh well, I will just skip smudge on clone + git lfs fetch for now and hope the apis get changed one day.

@gurugray
Contributor

👍

@shuhrat

shuhrat commented Nov 24, 2015

+1

@flyingrobots

+1 plz fix

@my-digital-decay

Does anyone have a working post-hook solution to this problem as a stop gap until a more robust solution is implemented?

@sinbad
Contributor

sinbad commented Dec 14, 2015

@my-digital-decay Yes, you can define the env var GIT_LFS_SKIP_SMUDGE=1 and then just use git lfs pull (or use fetch and checkout separately if you'd prefer) to download. You can also make this setting permanent by running git lfs install --skip-smudge.
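
Concretely, the one-off version of that workflow might look like this (the repository URL is illustrative):

```shell
# Clone without converting pointers (they stay as small pointer files) ...
GIT_LFS_SKIP_SMUDGE=1 git clone https://example.com/repo.git
cd repo

# ... then download everything for the current checkout in one batched,
# concurrent pass and replace the pointers in the work tree.
git lfs pull

# To make skipping the smudge filter permanent instead of per-command:
# git lfs install --skip-smudge
```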

@my-digital-decay

Thanks! I'll give that a try.

@WestonThayer

@sinbad, can you walk me through how to do this on a new clone? I tried:

git clone -n https://myrepo.git

That seemed to clone without engaging the lfs downloads. Then I tried:

git lfs install --skip-smudge
git lfs pull

Git lfs then downloaded the files as one batch, but then errored out. I'm guessing this is because the HEAD wasn't checked out by git clone -n...

@AndrewJDR
Contributor

@WestonThayer Assuming you're using the latest git-lfs, once you've done git lfs install --skip-smudge, you can just git clone https://example.git and run git lfs pull right after. No need to use -n with the clone.

git lfs install --skip-smudge adds a setting to your ~/.gitconfig which git lfs checks during the git clone, so that's why this works.

@WestonThayer

@ajohnson23 thank you, I think I have my head wrapped around it now. That looked like it downloaded all my lfs files sequentially. Is there a way to parallelize at least part of it?

@AndrewJDR
Contributor

@WestonThayer Add the following to your .gitconfig, and tweak concurrenttransfers to your liking:

[lfs]
  batch = true
  concurrenttransfers = 6

@sinbad
Contributor

sinbad commented Jan 29, 2016

The default is concurrenttransfers = 3 so it should have been parallel so long as the smudge filter wasn't enabled.

Personally I prefer to leave the smudge filter enabled globally so that switching branch automatically updates my working copy, and only disable it at clone time to speed things up. To do that you just need to set the environment variable GIT_LFS_SKIP_SMUDGE=1 before git clone to have the same effect as git lfs install --skip-smudge, but just for the clone command. I don't like to have to remember to git lfs pull every time I switch branch.

@WestonThayer

Thanks I get it now. I posted a summary for others.
