
Add symbolic-link backend #41

Open
KeithHanlan opened this issue Mar 31, 2015 · 14 comments

Comments

@KeithHanlan

For performance reasons, the fastest way to make big fat binary files available in a repo would be to simply update symbolic links. This can already be done with native git; one can create a symbolic link that points outside the repo. Checking out new baselines remains very fast regardless of the size of the binary.

However, git's support for symbolic links does nothing to guarantee the integrity of the file being referenced. Since git-fat records the SHA-1 of the fat file, this could be adapted to enhance the use of symbolic links for external files (fat or otherwise).
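The idea can be sketched in a few shell commands. Everything below is illustrative: the "store" and "work" directories are stand-ins, not git-fat paths.

```shell
#!/bin/sh
set -e
# Sketch: a symlink in the working tree points at a large file stored
# outside the repo, and a recorded SHA-1 guards the target's integrity.
store=$(mktemp -d)                      # stand-in for an external object store
work=$(mktemp -d)                       # stand-in for the repo working tree

printf 'big binary payload' > "$store/blob"
sha=$(sha1sum "$store/blob" | cut -d' ' -f1)   # recorded at commit time

ln -s "$store/blob" "$work/asset.bin"   # what git would actually track

# Later, verify the link target against the recorded SHA-1.
actual=$(sha1sum "$(readlink "$work/asset.bin")" | cut -d' ' -f1)
[ "$actual" = "$sha" ] && echo "integrity ok"
```

Checkout stays fast because git only writes the link; the SHA-1 check at the end is the piece git itself does not provide.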

Other back-ends could still be configured as fallbacks for remote users.

@cztomczak

Symbolic links are not supported by Git on Windows.

@abraithwaite

It might work as a backend (between .git/fat/objects and a different directory), but it wouldn't work as you're expecting it to, @KeithHanlan.

See #33 for more discussion and an explanation as to the technical reason why we cannot use symlinks in the repo.

@KeithHanlan
Author

It would still be a huge performance improvement for environments which don't care about Windows. For example, we have some regression test cases which rely on 600MB compressed tarballs. Currently, we just rely on those files remaining where they are but there is no guarantee of their continued existence nor of their integrity. It would be nice to store them as git-fat files but even with rsync, the performance penalty would be too great. I just don't want those files in my repo working directory.

One of my repos already stores binary files (link libraries) offboard with symbolic links. The destination filenames are constructed from a sequence number (for convenience) and the SHA-1. This works but requires an awkward submission procedure. Gitattributes cleaning and smudging would make for a much more transparent solution. It seems only natural to adapt git-fat to this purpose, even if it is limited to Unix environments.
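Wiring this up through gitattributes might look roughly like the following. The filter name `fatlink` and the `git-fat-link-*` commands are made-up names for illustration, not existing git-fat commands:

```shell
#!/bin/sh
set -e
# Hypothetical clean/smudge wiring for a symlink-aware filter.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo

# Register the (hypothetical) filter commands for this repo.
git config filter.fatlink.clean  'git-fat-link-clean %f'
git config filter.fatlink.smudge 'git-fat-link-smudge %f'

# Route the big files through the filter.
printf '*.tar.gz filter=fatlink\n' > .gitattributes
```

This mirrors how git-fat itself hooks in; only the filter driver's behavior would differ.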

@justinclift

@KeithHanlan What do you think about the technical problems in #33? Any idea how to work around the ones for non-Windows systems? 😄

@KeithHanlan
Author

@justinclift, I'm still wrapping my mind around the comments made by @mzemb and @abraithwaite in #33. Those are good points that I hadn't considered. Originally, I imagined that one could use symbolic links if the remote is reachable and then use rsync or http as a fallback.

So, thinking further, I still think there is value in using git-fat as a front end to help manage symbolic links, but "link" would not simply be an alternative backend; some special rules would be required. Instead of using special files with '#$# git-fat' prefixes, I think we would store the symbolic links normally.

The added value of git-fat would be to facilitate the push (clean) and creation of the symbolic link. The pull (smudge) would be a null operation.
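A minimal sketch of that flow, with all paths and names hypothetical: the clean side moves the file into an external store keyed by its SHA-1 and leaves a symlink behind; the smudge side does nothing, because the symlink itself is what git checks out.

```shell
#!/bin/sh
set -e
# "Clean" side (hypothetical): stash the real file under its SHA-1 and
# replace it in the working tree with a symlink.
store=$(mktemp -d)
work=$(mktemp -d)
printf 'payload' > "$work/lib.a"

sha=$(sha1sum "$work/lib.a" | cut -d' ' -f1)
mv "$work/lib.a" "$store/$sha"
ln -s "$store/$sha" "$work/lib.a"

# "Smudge" side is a no-op: git tracks and restores the symlink as-is.
[ -L "$work/lib.a" ] && echo "linked to $sha"
```

Reads through the link still resolve to the stored content, so nothing in the working tree grows with the size of the binaries.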

Ultimately, in our environment, I hope to enhance our pre-push hooks to use git-fat so that its use is more transparent. In conjunction with Antonio's changes in #40, the use of git-fat could be made completely transparent.

At the end of the day, we have dependencies on large (sometimes extremely large) binary files and git-fat offers an attractive way to use git's commit tree to maintain the integrity of those dependencies.

@AndrewJDR

This is an interesting topic.

Git-annex uses a symlink approach akin to what you've described, but doesn't use smudge/clean to accomplish it. The author of git-annex wanted to switch to using smudge/clean, but it didn't work out well. Here is his writeup on his attempts:
https://git-annex.branchable.com/todo/smudge/

Ultimately I think this would be a great thing, but it comes with challenges, some of which are probably insurmountable without changing git itself, maybe even significantly. Changing git itself to handle large files better through smudge/clean could be a worthy goal, and if I someday get the free time to look into it thoroughly, I want to do so. If anyone here is also a C programmer and wants to chat more about it, let me know -- I have an old email exchange with the git maintainer about it that may be of interest.

@abraithwaite

Indeed. I'll let you in on a little secret: I don't really think git-fat is a great solution to this problem. Facebook has in fact worked to solve this problem themselves, only with Mercurial instead of git.

The strategy they use is lazy loading:

This extension changes the clone and pull commands to download only the commit metadata, while omitting all file changes that account for the bulk of the download. When a user performs an operation that needs the contents of files (such as checkout), we download the file contents on demand using Facebook's existing memcache infrastructure. This allows clone and pull to be fast no matter how much history has changed, while only adding a slight overhead to checkout.

Which is the same strategy I would use to patch git, if I were to. The major difference is that instead of downloading only the commit metadata, I'd make git download all objects under a certain size threshold, and then fetch the rest on demand when you run a checkout. The problem, of course, with both strategies is what happens when you're offline. Handling cases where files are missing becomes an issue for all the various commands. Without handling those cases, it would just look like you're always missing those files if you try to change branches while offline (which might be okay, as long as it's clear to the user that that's the case).
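The size-threshold idea reduces to a simple predicate. The sketch below only illustrates the classification step, with a made-up 1 MiB cutoff; nothing here is real git behavior.

```shell
#!/bin/sh
# Hypothetical policy: objects at or under the threshold are fetched
# eagerly at clone/pull time; larger ones are deferred until checkout.
THRESHOLD=$((1024 * 1024))   # illustrative 1 MiB cutoff

fetch_eagerly() {
  # $1 = object size in bytes; succeeds if the object should be
  # downloaded up front rather than on demand.
  [ "$1" -le "$THRESHOLD" ]
}

for size in 512 4096 5242880; do
  if fetch_eagerly "$size"; then
    echo "eager:    $size bytes"
  else
    echo "deferred: $size bytes"
  fi
done
```

The hard part, as noted above, is not this predicate but teaching every command to cope gracefully with the deferred objects when offline.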

Some of the work has already been done: with git 1.9 you can push from a shallow clone, but there's still more to do if you want the full history while selectively excluding the large files in some cases.

Edit: To add, I still think this is an okay interim solution, but I've long known, since I started working on this, that the problem is deeper than a git plugin. I've just never admitted it.

@AndrewJDR

It's a bit of a shame really. Git does so many things nicely, but it's still no good for projects where you just need to toss a bunch of mixed binary / source code assets in there with minimum fuss. I'd really love to go in there and handle it, but man... time.

@AndrewJDR

And I agree that a lazy loading approach is probably the right one. Now to get it done and into git master! :)

@abraithwaite

It's a bit of a shame really. Git does so many things nicely, but it's still no good for projects where you just need to toss a bunch of mixed binary / source code assets in there with minimum fuss

I don't think it's a shame at all; git is young compared to other projects used as widely as it is. That, and git is really just a data-structure specification. You could write any client on top of it and things would still work fine. But yes, ultimately the problem is time. I would estimate the work that needs to be done at at least 120 hours. Not many people have that much free time, so someone would need an employer to sponsor it or would have to convince the maintainers that it's worth their time to implement. :-P

@AndrewJDR

Maybe shame was the wrong word. What's a good word for that feeling when something is very close to being the ideal setup but falls just short of it? ;)

I think some gaming companies could really benefit from it. They need to check in art assets alongside their source code with a minimum of fuss. The ones I've talked to are still stuck on Perforce (or svn), largely for the better binary handling, but both of those have branching models inferior to git (imo), so they miss out on that.

If you think you or someone you know might take it on as a sponsored job, let me know -- I will keep my ears and eyes open for opportunities and be in touch if something comes up.

@AndrewJDR

Aaaaannd now GitHub announced Git LFS.

It looks like it uses the same approach as git-fat (smudge/clean + checksums), so despite it using Go, I really don't expect any huge performance difference compared to git-fat. I'll be pleasantly surprised if it's better, though!

If it does have the same limitations, and folks start seeing major slowdowns for 'git status' (especially on Windows), we may see more attention paid to the root issue.

@abraithwaite

Yep. However, it doesn't support multiple backends yet. :-) 👍
I opened up git-lfs/git-lfs#193 for them though, so they know it's valued. I do believe performance will be significantly better, though. The time it takes to spin up a Python process is actually quite high compared to the amount of time it actually spends doing things.

It does still have the same limitations, though, and I still believe this should be implemented in git itself. Hearing about this release is really motivating me to dig deeper. 🐇

@AndrewJDR

I'm interested too! When I get some more time, I'll have to grab the source code for the client and reference server and give it a spin.

I had tried some of the Python -> .exe tools (Cython, Numba, PyPy, etc.) a while back with git-fat and didn't see much of an improvement.

I suppose this is as expected, since most of them probably just embed the Python interpreter into the .exe and suffer the same startup penalty. That said, at the time I thought Numba built straight to native code using LLVM (i.e. no full interpreter in the resulting exe), so I expected better from it. But now I'm not so sure that's what it actually does. So that was inconclusive at the end of the day.
