
Add symbolic-link backend #41

Open
KeithHanlan opened this issue Mar 31, 2015 · 14 comments

Comments

@KeithHanlan

For performance reasons, the fastest way to make big fat binary files available in a repo would be to simply update symbolic links. This can already be done with native git; one can create a symbolic link that points outside the repo. Checking out new baselines remains very fast regardless of the size of the binary.

However, git's support for symbolic links does nothing to guarantee the integrity of the file being referenced. Since git-fat records the SHA-1 of the fat file, this could be adapted to enhance the use of symbolic links for external files (fat or otherwise).
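The idea can be sketched in a few shell commands. Everything below is illustrative: the "store" and "work" directories are stand-ins, not git-fat paths.

```shell
#!/bin/sh
set -e
# Sketch: a symlink in the working tree points at a large file stored
# outside the repo, and a recorded SHA-1 guards the target's integrity.
store=$(mktemp -d)                      # stand-in for an external object store
work=$(mktemp -d)                       # stand-in for the repo working tree

printf 'big binary payload' > "$store/blob"
sha=$(sha1sum "$store/blob" | cut -d' ' -f1)   # recorded at commit time

ln -s "$store/blob" "$work/asset.bin"   # what git would actually track

# Later, verify the link target against the recorded SHA-1.
actual=$(sha1sum "$(readlink "$work/asset.bin")" | cut -d' ' -f1)
[ "$actual" = "$sha" ] && echo "integrity ok"
```

Checkout stays fast because git only writes the link; the SHA-1 check at the end is the piece git itself does not provide.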

Other back-ends could still be configured as fallbacks for remote users.

@cztomczak

Symbolic links are not supported by Git on Windows.

@abraithwaite

It might work as a backend (between .git/fat/objects and a different directory), but it wouldn't work as you're expecting it to, @KeithHanlan.

See #33 for more discussion and an explanation as to the technical reason why we cannot use symlinks in the repo.

@KeithHanlan
Author

It would still be a huge performance improvement for environments which don't care about Windows. For example, we have some regression test cases which rely on 600MB compressed tarballs. Currently, we just rely on those files remaining where they are but there is no guarantee of their continued existence nor of their integrity. It would be nice to store them as git-fat files but even with rsync, the performance penalty would be too great. I just don't want those files in my repo working directory.

One of my repos already stores binary files (link libraries) offboard with symbolic links. The destination filenames are constructed from a sequence number (for convenience) and the SHA-1. This works but requires an awkward submission procedure. Gitattributes cleaning and smudging would make for a much more transparent solution. It seems only natural to adapt git-fat to this purpose, even if it is limited to Unix environments.
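Wiring this up through gitattributes might look roughly like the following. The filter name `fatlink` and the `git-fat-link-*` commands are made-up names for illustration, not existing git-fat commands:

```shell
#!/bin/sh
set -e
# Hypothetical clean/smudge wiring for a symlink-aware filter.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo

# Register the (hypothetical) filter commands for this repo.
git config filter.fatlink.clean  'git-fat-link-clean %f'
git config filter.fatlink.smudge 'git-fat-link-smudge %f'

# Route the big files through the filter.
printf '*.tar.gz filter=fatlink\n' > .gitattributes
```

This mirrors how git-fat itself hooks in; only the filter driver's behavior would differ.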

@justinclift

@KeithHanlan What do you think about the technical problems in #33? Any idea how to work around the ones for non-Windows systems? 😄

@KeithHanlan
Author

@justinclift, I'm still wrapping my mind around the comments made by @mzemb and @abraithwaite in #33. Those are good points that I hadn't considered. Originally, I imagined that one could use symbolic links if the remote is reachable and then use rsync or http as a fallback.

So, thinking further, I still think there is value in using git-fat as a front end to help manage symbolic links, but "link" would not simply be an alternative backend; some special rules would be required. Instead of using special files with '#$# git-fat' prefixes, I think we would store the symbolic links normally.

The added value of git-fat would be to facilitate the push (clean) and creation of the symbolic link. The pull (smudge) would be a null operation.
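A minimal sketch of that flow, with all paths and names hypothetical: the clean side moves the file into an external store keyed by its SHA-1 and leaves a symlink behind; the smudge side does nothing, because the symlink itself is what git checks out.

```shell
#!/bin/sh
set -e
# "Clean" side (hypothetical): stash the real file under its SHA-1 and
# replace it in the working tree with a symlink.
store=$(mktemp -d)
work=$(mktemp -d)
printf 'payload' > "$work/lib.a"

sha=$(sha1sum "$work/lib.a" | cut -d' ' -f1)
mv "$work/lib.a" "$store/$sha"
ln -s "$store/$sha" "$work/lib.a"

# "Smudge" side is a no-op: git tracks and restores the symlink as-is.
[ -L "$work/lib.a" ] && echo "linked to $sha"
```

Reads through the link still resolve to the stored content, so nothing in the working tree grows with the size of the binaries.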

Ultimately, in our environment, I hope to enhance our pre-push hooks to use git-fat so that its use is more transparent. In conjunction with Antonio's changes in #40, the use of git-fat could be made completely transparent.

At the end of the day, we have dependencies on large (sometimes extremely large) binary files and git-fat offers an attractive way to use git's commit tree to maintain the integrity of those dependencies.

@AndrewJDR

This is an interesting topic.

Git-annex uses a symlink approach akin to what you've described, but doesn't use smudge/clean to accomplish it. The author of git-annex wanted to switch to using smudge/clean, but it didn't work out well. Here is his writeup on his attempts:
https://git-annex.branchable.com/todo/smudge/

Ultimately I think this would be a great thing, but it comes with challenges, some of which are probably insurmountable without changing git itself, maybe even significantly. Changing git itself to handle large files better through smudge/clean could be a worthy goal, and if I someday get the free time to look into it thoroughly, I want to do so. If anyone here is also a C programmer and wants to chat more about it, let me know -- I have an old email exchange with the git maintainer about it that may be of interest.

@abraithwaite

Indeed. I'll let you in on a little secret: I don't really think git-fat is a great solution to this problem. Facebook has in fact worked to solve this problem themselves, only with Mercurial instead of git.

The strategy they use is lazy loading:

This extension changes the clone and pull commands to download only the commit metadata, while omitting all file changes that account for the bulk of the download. When a user performs an operation that needs the contents of files (such as checkout), we download the file contents on demand using Facebook's existing memcache infrastructure. This allows clone and pull to be fast no matter how much history has changed, while only adding a slight overhead to checkout.

Which is the same strategy I would use to patch git, if I were to. The major difference is that instead of downloading only the commit metadata, I'd make git download all objects under a certain size threshold, and then fetch the rest on demand when you run a checkout. The problem, of course, with both strategies is what happens when you're offline. Handling cases where files are missing becomes an issue for all the various commands. Without handling those cases, it would just look like you're always missing those files if you try to change branches while offline (which might be okay, as long as it's clear to the user that that's the case).
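The size-threshold idea reduces to a simple predicate. The sketch below only illustrates the classification step, with a made-up 1 MiB cutoff; nothing here is real git behavior.

```shell
#!/bin/sh
# Hypothetical policy: objects at or under the threshold are fetched
# eagerly at clone/pull time; larger ones are deferred until checkout.
THRESHOLD=$((1024 * 1024))   # illustrative 1 MiB cutoff

fetch_eagerly() {
  # $1 = object size in bytes; succeeds if the object should be
  # downloaded up front rather than on demand.
  [ "$1" -le "$THRESHOLD" ]
}

for size in 512 4096 5242880; do
  if fetch_eagerly "$size"; then
    echo "eager:    $size bytes"
  else
    echo "deferred: $size bytes"
  fi
done
```

The hard part, as noted above, is not this predicate but teaching every command to cope gracefully with the deferred objects when offline.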

Some of the work has already been done: with git 1.9 you can push from a shallow clone, but there's still more to do if you want the full history while selectively excluding the large files in some cases.

Edit: To add, I still think this is an okay interim solution, but I've long known, since I started working on this, that the problem is deeper than a git plugin. I've just never admitted it.

@AndrewJDR

It's a bit of a shame really. Git does so many things nicely, but it's still no good for projects where you just need to toss a bunch of mixed binary / source code assets in there with minimum fuss. I'd really love to go in there and handle it, but man... time.

@AndrewJDR

And I agree that a lazy loading approach is probably the right one. Now to get it done and into git master! :)

@abraithwaite

It's a bit of a shame really. Git does so many things nicely, but it's still no good for projects where you just need to toss a bunch of mixed binary / source code assets in there with minimum fuss

I don't think it's a shame at all; git is young compared to other projects used as widely as it is. That, and git is really just a data-structure specification. You could write any client on top of it and things would still work fine. But yes, ultimately the problem is time. I would estimate the work that needs to be done at at least 120 hours. Not many people have that much free time, so someone would need an employer to sponsor it or would have to convince the maintainers that it's worth their time to implement. :-P

@AndrewJDR

Maybe shame was the wrong word. What's a good word for that feeling when something is very close to being the ideal setup but falls just short of it? ;)

I think some gaming companies could really benefit from it. They need to check in art assets alongside their source code with a minimum of fuss. The ones I've talked to are still stuck on Perforce (or svn), largely for the better binary handling, but both of those have branching models inferior to git (imo), so they miss out on that.

If you think you or someone you know might take it on as a sponsored job, let me know -- I will keep my ears and eyes open for opportunities and be in touch if something comes up.

@AndrewJDR

Aaaaannd now GitHub announced Git LFS.

It looks like it uses the same approach as git-fat (smudge/clean + checksums), so despite it using Go, I really don't expect any huge performance difference compared to git-fat. I'll be pleasantly surprised if it's better, though!

If it does have the same limitations, and folks start seeing major slowdowns for 'git status' (especially on Windows), we may see more attention paid to the root issue.

@abraithwaite

Yep. However, it doesn't support multiple backends yet. :-) 👍
I opened up git-lfs/git-lfs#193 for them though, so they know it's valued. I do believe performance will be significantly better, though. The time it takes to spin up a Python process is actually quite high compared to the amount of time it actually spends doing things.

It does still have the same limitations, though, and I still believe this should be implemented in git itself. Hearing about this release is really motivating me to dig deeper. 🐇

@AndrewJDR

I'm interested too! When I get some more time, I'll have to grab the source code for the client and reference server and give it a spin.

I had tried some of the Python -> .exe tools (Cython, Numba, PyPy, etc.) a while back with git-fat and didn't see much of an improvement.

I suppose this is as expected, since most of them probably just embed the Python interpreter into the .exe and suffer the same startup penalty. That said, at the time I thought Numba built straight to native code using LLVM (i.e. no full interpreter in the resulting exe), so I expected better from it. But now I'm not so sure that's what it actually does. So that was inconclusive at the end of the day.
