WIP: long running clean/smudge filter protocol #1382

larsxschneider · 2016-07-18T13:38:32Z

This is my first WIP version of a Git/Git-LFS stream filter. Please keep in
mind that I have little Go knowledge and experience. Therefore
I would be happy to receive a very strict review to improve my Go
skills 😄 👍

What is the problem with Git LFS?

Git LFS is an application that is executed via Git clean/smudge filter.
The process invocation of these filters requires noticeable time (especially
on Windows). An individual filter process is required for every single file
that Git touches during its operations (e.g. checkout etc).

Proposed solution

Instead of a single Git LFS process per file, I propose a single Git LFS
process per Git invocation. That means Git invokes the filter process
(e.g. Git LFS) only once and then continuously talks to the same filter
process via pipes.

You can find the corresponding WIP Git core implementation here:
https://github.com/larsxschneider/git/tree/filter-stream

Performance tests

I executed both test runs on a 2.5 GHz Intel Core i7 with SSD and OS X.
A test run is the consecutive execution of four Git commands:

1. clone the repo
2. check out the "removed-files" branch
3. timed: check out the "master" branch
4. timed: check out the "removed-files" branch

Test command:

set -x; git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo; cd repo; git checkout removed-files; time git checkout master; time git checkout removed-files

I compiled Git with the following flags:

NO_OPENSSL=YesPlease APPLE_COMMON_CRYPTO=YesPlease NO_GETTEXT=YesPlease make -j 8

TEST RUN A -- Default Git 2.9 (ab7797d) and Git LFS 1.2.1

+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.77 MiB/s, done.
Checking connectivity... done.
Checking out files: 100% (15001/15001), done.
Git LFS: (15000 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    6m2.979s
user    2m39.066s
sys 2m41.610s
+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m1.310s
user    0m0.385s
sys 0m0.881s

TEST RUN B -- Git and Git LFS with stream filter support

+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.30 MiB/s, done.
Checking connectivity... done.
Git LFS: (15000 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    0m2.528s
user    0m0.209s
sys 0m1.602s
+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m2.280s
user    0m0.066s
sys 0m0.637s

Results

Default Git:                                6m2.979s + 0m1.310s = 364s
Git and Git LFS with stream filter support: 0m2.528s + 0m2.280s =   5s

The Git stream filter solution is more than ✨ 70x faster ✨ when switching branches
on my local machine with a test repository containing 12,000 Git LFS files.
Based on my previous experience with Git LFS clone I expect even more
dramatic results on Windows.

Next Steps

Make Travis-CI tests pass (does anyone have an idea what is wrong with the "clone with submodules" test 1a24a7c ?)
Make the pipe protocol more robust against errors (e.g. by adding ACK messages). Do you have other suggestions for the protocol?
Clean up the Git-core patch and propose it to the mailing list (/cc heads up @peff).
Clean up code duplication in command_smudge.go/command_clean.go and command_filter.go.

Questions

Would you be OK with this approach in general? /cc @technoweenie @sinbad @ttaylorr
How should I handle integration tests? Git LFS would need to support and test both protocols ("per file" and "filter stream"). I was thinking about running all integration tests twice with different Git filter configs.

Thanks,
Lars

larsxschneider · 2016-07-18T13:42:11Z

.travis.yml

+    esac;
+    make install;
+    export PATH=$PWD:$PATH;
+    cd ..;


These Travis-CI changes won't be in the final patch, of course.

sinbad · 2016-07-18T16:34:24Z

This is going to be great, it will remove the need for more lfs-specific commands and eventually could lead to us deprecating git lfs clone for recent git versions.

I've started a few comments but haven't finished reviewing yet, will pick up again tomorrow.

ttaylorr · 2016-07-18T20:43:46Z

commands/command_filter.go

+	for {
+		buf := make([]byte, 4)
+		readBytes, err := reader.Read(buf)
+		if readBytes == 0 {


Are you checking whether or not you're at the end of stdin? Checking this may be more appropriate:

if _, err := reader.Read(buf); err == io.EOF {
	return
}
// <snip>

ttaylorr · 2016-07-18T23:04:17Z

I like this approach. This will definitely be a huge speed gain for what are currently pretty slow operations under certain scenarios.

I made a few comments on some of the code above, mainly boiling down to a few suggestions:

An InputDataHdr struct
Using binary.Read instead of buffering and calling binary.LittleEndian.Uint32(...)
A Processor type

I think that the header/data idea is a good one for this protocol. I am wondering, however, about the benefits of implementing a multiplexed chunked protocol. Would it make sense to interleave the data that was being sent across these file descriptors? I am not entirely sure.

On one hand, this would allow the clean and smudge filters to start processing one file before they had finished others, which would enable increased parallelism.

On the other hand, perhaps this sort of optimization is not necessary. One concern with this approach is the additional complexity that would be incurred by this sort of muxing. I am thinking in particular of the approach that is implemented in the RTMP protocol, which is certainly complex. The relevant parts of the documentation can be found here, and a reference implementation that I wrote in Go can be found here (docs). There is far more going on in that chunk package than we would implement in LFS, but it'd still be more than we have now.

Your thoughts?

larsxschneider · 2016-07-20T12:40:23Z

@sinbad and @ttaylorr: Thanks a lot for your feedback!

Re: Multiplexing
Multiplexing would considerably complicate the protocol and is therefore
more error prone. Plus, with the Git clean/smudge interface that we have
today, the parallelism wouldn't buy us anything as Git processes the
files sequentially anyway. However, I have some vague ideas how we could
approach that. Therefore, I plan to define a "protocol version" field
that allows Git to support different filter protocols in the future.

ttaylorr · 2016-07-20T16:01:32Z

commands/command_filter.go

+	if fileNameLen > 0 {
+		buf := make([]byte, fileNameLen)
+		readLen, err := r.Read(buf)
+		if err != nil || readLen != int(fileNameLen) {


Hmm, I would split out these two cases individually. Returning the error if it's != nil would be the first, and the unexpected EOF would be the second. This makes things a little clearer since we can get an "error" when the read succeeded, just read less than we wanted. I'm thinking:

if readLen, err := r.Read(buf); err != nil {
	return errutil.Errorf(err, "Unexpected error")
} else if readLen != int(fileNameLen) {
	return fmt.Errorf("unexpected EOF when reading file (got %d, wanted %d)", readLen, fileNameLen)
}

ttaylorr · 2016-07-21T15:37:08Z

git/git.go

@@ -733,15 +733,18 @@ func CloneWithoutFilters(flags CloneFlags, args []string) error {
 	// not working. You can get around that with https://github.com/kr/pty but that
 	// causes difficult issues with passing through Stdin for login prompts
 	// This way is simpler & more practical.
-	filterOverride := ""
+	filterDriverOverride := ""


Empty variable initializations are typically done in Go using the var block. I would write this like so:

var (
	filterDriverOverride bool
	smudgeFilterOverride string
)

and then use the appropriate fmt verbs to turn the above string and bool into cmdargs down below 😄

larsxschneider · 2016-07-21

Thanks! 👍 9aebb9c

Although "false" is actually a string because I am referring to the bash built in:
http://tldp.org/LDP/abs/html/internal.html

false: A command that returns an unsuccessful exit status, but does nothing else.

larsxschneider · 2016-07-21T22:20:38Z

relevant to Git LFS in general:
I am writing tests for the Git core side of the protocol and I discovered that Git calls clean way more than necessary: http://thread.gmane.org/gmane.comp.version-control.git/300028

rubyist · 2016-07-22T14:17:04Z

commands/command_filter.go

+	lfs.InstallHooks(false)
+
+	reader := bufio.NewReader(os.Stdin)
+	writer := bufio.NewWriter(os.Stdout)


stdio buffering should already be handled by the kernel, is there a reason we're buffering stdout here?

rubyist · 2016-07-22T15:45:36Z

It'll be pretty awesome if this makes it into git! I have two primary critiques so far. I think there's too much use of Panic() in deeper parts of the code, and the protocol parsing is happening in multiple places. I don't think the InputFileHdr is quite the right abstraction and the Read method on it is a little awkward. As it is, this parsing code would be very difficult to unit test.

I think something similar to bufio.Scanner would look pretty clean here, leaving us with something like this:

func filterCommand() {
    // ...

    scanner := NewObjectScanner(os.Stdin)
    for scanner.Scan() {
        obj := scanner.Object()
        r := obj.Reader()

        // ...

        switch obj.Command {
        case cmdClean:
            clean(r, obj.Name)
        case cmdSmudge:
            smudge(r, obj.Name)
        }
    }

    if err := scanner.Err(); err != nil {
        // ...
    }

    // write output
}

Here's a gist with a quick pass at an implementation.

ttaylorr · 2016-07-22T15:48:11Z

Oh snap, that's way better. I like the idea of an object scanner, that puts the parsing implementation in the right place I think. Still not sold on the switch block, there may be some other abstraction that we can reach for, but I think this is awesome.

Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382

This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output. The full protocol is explained in detail in `Documentation/gitattributes.txt`.

A few key decisions:

* The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1.
* Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error") is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response.
* All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future.

Helped-by: Martin-Louis Bright <mlbright@gmail.com>
Reviewed-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

larsxschneider · 2016-10-22T17:01:48Z

Junio merged the Git core patch series required for this PR to next ( git/git@ffd0de0 ) and it is scheduled for master ( http://public-inbox.org/git/xmqqk2d2ein7.fsf@gitster.mtv.corp.google.com/ ) 🎉 🎉 🎉 .
I'll update this PR with a working version soon, so that we can discuss how to use the new Git filter protocol feature in the most efficient way.

sinbad · 2016-10-24T10:07:16Z

Woooo, amazing work Lars. 🤘

ttaylorr · 2016-10-24T14:54:16Z

@larsxschneider great to hear. I'm looking forward to seeing your up-to-date version of this PR once it's ready. I'll be around to provide any help or answer any questions that you might have.

technoweenie · 2016-10-24T15:47:40Z

@larsxschneider Congrats, great work!

larsxschneider · 2016-10-24T21:03:29Z

I didn't have much time today and I won't have too much time in the next 1.5 weeks. That's why I am pushing a very rough version here. A few thoughts:

1. The implementation as it is right now would smudge files from the Git LFS cache or download files if they are not in the cache. That works but we cannot take advantage of parallel downloads. Ideally I would extend the filter protocol in a way that a filter can tell Git "I can't process this file just yet, ask me later again". However, this wouldn't be straightforward in Git core and the filter protocol change was already pretty large. Until the filter protocol is improved that way we could use a trick, though. If Git asks Git LFS for a file that is not in cache, then Git LFS could return the pointer file as-is and download the file in the background (maybe even batched with other files). When Git shuts down, then it sends an EOF to Git LFS via pipe and waits until Git LFS terminates. In this step Git LFS could finish the downloads and replace the pointer files with the actual content. This should give us git lfs clone-like speed and we could get rid of the wrapper commands. Credit: I think @sinbad posted that idea somewhere but I can't find the link.
2. ReadRequest reads the entire content into a []byte which is then passed to the clean/smudge functions. I would rather pass a Reader to clean/smudge.
3. clean/smudge returns a []byte which is then written to the pipe in PktLine format using WriteResponse. I would rather pass a Writer to clean/smudge.
4. clean/smudge functions in command_filter.go are pretty much the same as the ones in command_clean.go and command_smudge.go. I want to reuse code here.
5. The process filter takes precedence over clean/smudge filter. That's nice as it makes the filter update for the user very easy. However, I would like to ensure in the tests that Git versions > 2.10 actually use the process filter. I am thinking in the direction of an env variable (see GIT_LFS_USE_LEGACY_FILTER) to control that. Any good ideas on how to do that in a nice way?
6. Unit tests are missing for git/git_filter_protocol.go
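
The "trick" from the first point above could look roughly like the following. This is a sketch under stated assumptions, not Git LFS code: `deferredSmudger`, the in-memory `cache` type, and the `fetch` and `replace` callbacks are all hypothetical stand-ins for the local object store, a batched download, and rewriting a working-tree file.

```go
package main

import (
	"fmt"
	"sort"
)

// cache is a stand-in for the local LFS object store (hypothetical).
type cache map[string][]byte

// deferredSmudger answers cache hits immediately and remembers misses
// so they can be downloaded in one batch later.
type deferredSmudger struct {
	cache   cache
	pending map[string]string // working-tree path -> object ID
}

// Smudge returns the real content on a cache hit. On a miss it records
// the file and hands the pointer back unchanged.
func (s *deferredSmudger) Smudge(path, oid string, pointer []byte) []byte {
	if content, ok := s.cache[oid]; ok {
		return content
	}
	s.pending[path] = oid
	return pointer
}

// Finish is meant to run after Git has closed the pipe. It fetches all
// missing objects in one batch and replaces the pointer files.
func (s *deferredSmudger) Finish(
	fetch func(oids []string) (map[string][]byte, error),
	replace func(path string, content []byte) error,
) error {
	if len(s.pending) == 0 {
		return nil
	}
	seen := make(map[string]bool)
	var oids []string
	for _, oid := range s.pending {
		if !seen[oid] {
			seen[oid] = true
			oids = append(oids, oid)
		}
	}
	sort.Strings(oids) // deterministic batch order
	objects, err := fetch(oids)
	if err != nil {
		return err
	}
	for path, oid := range s.pending {
		content, ok := objects[oid]
		if !ok {
			return fmt.Errorf("object %s for %s was not fetched", oid, path)
		}
		if err := replace(path, content); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &deferredSmudger{
		cache:   cache{"oid-1": []byte("cached content")},
		pending: make(map[string]string),
	}
	pointer := []byte("pointer file")
	fmt.Printf("%s\n", s.Smudge("a.dat", "oid-1", pointer))
	fmt.Printf("%s\n", s.Smudge("b.dat", "oid-2", pointer))

	err := s.Finish(
		func(oids []string) (map[string][]byte, error) {
			return map[string][]byte{"oid-2": []byte("downloaded content")}, nil
		},
		func(path string, content []byte) error {
			fmt.Printf("replace %s with %q\n", path, content)
			return nil
		},
	)
	if err != nil {
		panic(err)
	}
}
```

The sketch leaves open what happens to the stat information Git records for the pointer file before it is replaced.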

larsxschneider · 2016-10-24T21:11:37Z

git/git_filter_protocol.go

+	for _, pair := range requestList {
+		v := strings.Split(pair, "=")
+		requestMap[v[0]] = v[1]
+	}


@peff I finally grokked your "parse as dictionary" comment completely. As you have already recognized, it parses this nicely:

packet: git> command=smudge
packet: git> pathname=path/testfile.dat

But not this:

packet: git> capability=clean
packet: git> capability=smudge

... I wonder what you think about this?

packet: git> clean=capability
packet: git> smudge=capability

However, it's probably too late to change it. I guess in the end it is not that important and I don't want to annoy Junio with this kind of late change now that the code is in next.
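
On the parsing side, repeated keys can also be handled without changing the wire format by collecting values per key. The sketch below is an illustration with a made-up `parseRequest` helper, not the code in this PR. It also splits only at the first `=`, so a pathname that itself contains `=` stays intact, and it reports a line without `=` as an error instead of indexing past the end of a slice.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRequest splits "key=value" lines at the first "=" and collects
// every value seen for a key, so repeated keys are not overwritten.
func parseRequest(lines []string) (map[string][]string, error) {
	req := make(map[string][]string)
	for _, line := range lines {
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		req[key] = append(req[key], value)
	}
	return req, nil
}

func main() {
	req, err := parseRequest([]string{
		"capability=clean",
		"capability=smudge",
		"pathname=path/testfile.dat",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req["capability"], req["pathname"])
}
```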

peff · 2016-10-25

Well, it's always nice to be vindicated eventually. ;)

If you feel there's room for improvement in the protocol, I don't think being in next is too late. You are welcome to submit patches on top, and nothing is cemented until the feature is in a released version of git.

It's up to you whether you think the change is worth making on top.

peff · 2016-10-25

It's up to you whether you think the change is worth making on top.

To clarify: this is whether you think it's worth your time in dealing with the patch and any additional review.

From the maintainer's perspective, I think "I implemented a protocol, and now that it is close to cemented, I was fleshing out the other end of the protocol and realized there are some deficiencies" seems like a perfectly good reason to add more patches to the original topic.

technoweenie · 2016-10-25T19:00:28Z

@larsxschneider 👍 on your 6 points. Regarding the first one:

When Git shuts down, then it sends an EOF to Git LFS via pipe and waits until Git LFS terminates.

How is this handled in the filter? Would it be after the for loop exits, just before filterCommand() returns?

technoweenie

I was going to do a full review, but I think Taylor or I will start tackling items 2-6 in your list. I want us to start with the protocol tests first, make some of the interface changes, and then end up with a working version. I think adding the bg downloading should be done in a separate PR after this is a functioning filter.

So, enjoy my single, totally super important review comment

technoweenie · 2016-10-25T18:50:22Z

git/git_filter_protocol.go

+)
+
+// Private function copied from "github.com/xeipuuv/gojsonschema/utils.go"
+// TODO: Is there a way to reuse this?


Nope, it's private in gojsonschema too. A little copied code isn't too bad :)

ttaylorr · 2016-10-25T21:14:56Z

Unit tests are missing for git/git_filter_protocol.go

I'm working on this right now and will post a PR to merge into this one as soon as I have the tests fleshed out! 🤘

sinbad · 2016-10-26T10:44:59Z

Short of time this week but I totally agree with your "trick" in point 1 (I think we discussed it before as you said) - just smudging to the pointer in serial and doing the actual download in batch/parallel in the background, gated on the termination at the end seems like the best approach. The one Q I had outstanding about that is whether the stat info in the index would then be out of date and would need a git update-index to avoid files being displayed as modified.

larsxschneider · 2016-10-26T21:45:42Z

@sinbad I haven't worked with git update-index, yet. Here is what would need to happen:

Files that Git LFS does not have in cache would end up as LFS pointer files in the working tree
After Git is done, Git LFS exchanges the pointer files with the actual content. I guess then we need to tell Git that the new content is OK and the working tree clean. That's what git update-index is for ?!

sinbad · 2016-10-27T11:14:09Z

@larsxschneider yeah what I'm not sure about is exactly when Git updates the index for a file; it must be after the smudge filter is run so the size & date are correct, but I don't know if it does it immediately after calling the filter, or at the end of the entire checkout. If it can do it at any time before LFS replaces the pointer file with the real content then the stat in the index will be out of date and probably a git update-index will be needed to avoid it appearing to be modified.

larsxschneider · 2016-11-08T10:47:38Z

I am closing this PR because the work continues in #1617

larsxschneider reviewed Jul 18, 2016

ttaylorr reviewed Jul 18, 2016

ttaylorr reviewed Jul 20, 2016

larsxschneider force-pushed the filter-stream branch from 63478dc to e4bc84d Compare July 21, 2016 11:29

ttaylorr reviewed Jul 21, 2016

rubyist reviewed Jul 22, 2016

wip

46a8ee5

larsxschneider force-pushed the filter-stream branch from 0b1ccbd to 46a8ee5 Compare October 24, 2016 21:00

larsxschneider commented Oct 24, 2016

technoweenie reviewed Oct 25, 2016

ttaylorr mentioned this pull request Oct 26, 2016

test: add an integration test for checkouts using filter protocol larsxschneider/git-lfs#1

Closed

This was referenced Nov 2, 2016

Filter Protocol Support #1617

Merged

filter_stream: Make ObjectScanner behave like a Scanner #1620

Merged

larsxschneider closed this Nov 8, 2016

technoweenie mentioned this pull request Nov 9, 2016

LFS Roadmap: v1.5.0 -> v2.0 #1632

Closed

This was referenced Nov 22, 2016

Working on integrating git-lfs with Git through external object database support in Git #1702

Closed

What is smudge and why does it fail all the time randomly? #1720

Closed

Locking part 1: locking package #1625

Closed

This pull request was closed.
