-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: long running clean/smudge filter protocol #1382
Conversation
esac; | ||
make install; | ||
export PATH=$PWD:$PATH; | ||
cd ..; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These Travis-CI changes won't be in the final patch, of course.
This is going to be great, it will remove the need for more lfs-specific commands and eventually could lead to us deprecating I've started a few comments but haven't finished reviewing yet, will pick up again tomorrow. |
for { | ||
buf := make([]byte, 4) | ||
readBytes, err := reader.Read(buf) | ||
if readBytes == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you checking whether or not you're at the end of stdin
? Checking this may be more appropriate:
if _, err := reader.Read(buf); err == io.EOF {
return
}
// <snip>
I like this approach. This will definitely be a huge speed gain for what are currently pretty slow operations under certain scenarios. I made a few comments on some the code above, mainly boiling down to a few suggestions:
I think that the header/data idea is a good one for this protocol. I am wondering, however, about the benefits of implementing a multiplexed chunked protocol. Would it make sense to interleave the data that was being sent across these file descriptors? I am not entirely sure. On one hand, this would allow the clean and smudge filters to start processing one file before they had finished others, which would enable increased parallelism. On the other hand, perhaps this sort of optimization is not necessary. One concern with this approach is the additional complexity that would be incurred by this sort of muxing. I am thinking in particular of the approach that is implemented in the RTMP protocol, which is certainly complex. The relevant parts of the documentation can be found here, and a reference implementation that I wrote in Go can be found here (docs). There is far more going on in that Your thoughts? |
@sinbad and @ttaylorr: Thanks a lot for your feedback! Re: Multiplexing |
if fileNameLen > 0 { | ||
buf := make([]byte, fileNameLen) | ||
readLen, err := r.Read(buf) | ||
if err != nil || readLen != int(fileNameLen) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I would split out these two cases individually. Returning the error if it's != nil
would be the first, and the unexpected EOF would be the second. This makes things a little clearer since we can get an "error" when the read succeeded, just read less than we wanted. I'm thinking:
if readLen, err := r.Read(buf); err != nil {
return errutil.Errorf(err, "Unexpected error")
} else if readLEn != int(fileNameLen) {
return fmt.Errorf("unexpected EOF when reading file (got %d, wanted %d)", readLen, fileNameLen)
}
63478dc
to
e4bc84d
Compare
@@ -733,15 +733,18 @@ func CloneWithoutFilters(flags CloneFlags, args []string) error { | |||
// not working. You can get around that with https://github.com/kr/pty but that | |||
// causes difficult issues with passing through Stdin for login prompts | |||
// This way is simpler & more practical. | |||
filterOverride := "" | |||
filterDriverOverride := "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty variable initializations are typically done in go using the var
block. I would write this like so:
var (
filterDriverOverride bool
smudgeFilterOverride string
)
and then use the appropriate fmt
verbs to turn the above string
and bool
into cmdargs
down below 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 👍 9aebb9c
Although "false" is actually a string because I am referring to the bash built in:
http://tldp.org/LDP/abs/html/internal.html
false: A command that returns an unsuccessful exit status, but does nothing else.
relevant to Git LFS in general: |
lfs.InstallHooks(false) | ||
|
||
reader := bufio.NewReader(os.Stdin) | ||
writer := bufio.NewWriter(os.Stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stdio buffering should already be handled by the kernel, is there a reason we're buffering stdout here?
It'll be pretty awesome if this makes it into git! I have two primary critiques so far. I think there's too much use of I think something similar to func filterCommand() {
// ...
scanner := NewObjectScanner(os.Stdin)
for scanner.Scan() {
obj := scanner.Object()
r := obj.Reader()
// ...
switch obj.Command {
case cmdClean:
clean(r, obj.Name)
case cmdSmudge:
smudge(r, obj.Name)
}
}
if err := scanner.Err(); err != nil {
// ...
}
// write output
} Here's a gist with a quick pass at an implementation. |
Oh snap, that's way better. I like the idea of an object scanner, that puts the parsing implementation in the right place I think. Still not sold on the |
Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time. In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382 This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output. The full protocol is explained in detail in `Documentation/gitattributes.txt`. A few key decisions: * The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1. * Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user. * The status of a filter operation (e.g. "success" or "error) is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response. * All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future. Helped-by: Martin-Louis Bright <mlbright@gmail.com> Reviewed-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time. In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382 This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output. The full protocol is explained in detail in `Documentation/gitattributes.txt`. A few key decisions: * The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1. * Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user. * The status of a filter operation (e.g. "success" or "error) is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response. * All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future. Helped-by: Martin-Louis Bright <mlbright@gmail.com> Reviewed-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time. In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382 This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output. The full protocol is explained in detail in `Documentation/gitattributes.txt`. A few key decisions: * The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1. * Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user. * The status of a filter operation (e.g. "success" or "error) is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response. * All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future. Helped-by: Martin-Louis Bright <mlbright@gmail.com> Reviewed-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time. In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382 This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output. The full protocol is explained in detail in `Documentation/gitattributes.txt`. A few key decisions: * The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1. * Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user. * The status of a filter operation (e.g. "success" or "error) is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response. * All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future. Helped-by: Martin-Louis Bright <mlbright@gmail.com> Reviewed-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time. In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382 This patch adds the `filter.<driver>.process` string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output. The full protocol is explained in detail in `Documentation/gitattributes.txt`. A few key decisions: * The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1. * Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user. * The status of a filter operation (e.g. "success" or "error) is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response. * All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future. Helped-by: Martin-Louis Bright <mlbright@gmail.com> Reviewed-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio merged the Git core patch series required for this PR to |
Woooo, amazing work Lars. 🤘 |
@larsxschneider great to hear. I'm looking forward to seeing your up-to-date version of this PR once it's ready. I'll be around to provide any help or answer any questions that you might have. |
@larsxschneider Congrats, great work! |
0b1ccbd
to
46a8ee5
Compare
I didn't had much time today and I won't have too much time in the next 1.5 weeks. That's why I push a very rough versions here. A few thoughts:
|
for _, pair := range requestList { | ||
v := strings.Split(pair, "=") | ||
requestMap[v[0]] = v[1] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peff I finally grokked your "parse as dictionary" comment completely. As you have already recognized it parses this nicely:
packet: git> command=smudge
packet: git> pathname=path/testfile.dat
But not this:
packet: git> capability=clean
packet: git> capability=smudge
... I wonder what you think about this?
packet: git> clean=capability
packet: git> smudge=capability
However, it's probably too late to change it. I guess in the end it is not that important and I don't want to annoy Junio with this kind of late change now that the code is in next
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's always nice to be vindicated eventually. ;)
If you feel there's room for improvement in the protocol, I don't think being in next
is too late. You are welcome to submit patches on top, and nothing is cemented until the feature is in a released version of git.
It's up to you whether you think the change is worth making on top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's up to you whether you think the change is worth making on top.
To clarify: this is whether you think it's worth your time in dealing with the patch and any additional review.
From the maintainer's perspective, I think "I implemented a protocol, and now that it is close to cemented, I was fleshing out the other end of the protocol and realized there are some deficiencies" seems like a perfectly good reason to add more patches to the original topic.
@larsxschneider 👍 on your 6 points. Regarding the first one:
How is this handled in the filter? Would it be after the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to do a full review, but I think Taylor or I will start tackling items 2-6 in your list. I want us to start with the protocol tests first, make some of the interface changes, and then end up with a working version. I think adding the bg downloading should be done in a separate PR after this is a functioning filter.
So, enjoy my single, totally super important review comment
) | ||
|
||
// Private function copied from "github.com/xeipuuv/gojsonschema/utils.go" | ||
// TODO: Is there a way to reuse this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it's private in gojsonschema
too. A little copied code isn't too bad :)
I'm working on this right now and will post a PR to merge into this one as soon as I have the tests fleshed out! 🤘 |
Short of time this week but I totally agree with your "trick" in point 1 (I think we discussed it before as you said) - just smudging to the pointer in serial and doing the actual download in batch/parallel in the background, gated on the termination at the end seems like the best approach. The one Q I had outstanding about that is whether the stat info in the index would then be out of date and would need a |
@sinbad I haven't worked with
|
@larsxschneider yeah what I'm not sure about is exactly when Git updates the index for a file; it must be after the smudge filter is run so the size & date are correct, but I don't know if it does it immediately after calling the filter, or at the end of the entire checkout. If it can do it at any time before LFS replaces the pointer file with the real content then the |
I close this PR as the work is continued in #1617 |
This is my first WIP version of a Git/Git-LFS stream filter. Please keep in
mind that I have little go-lang knowledge and experience. Therefore
I would be happy to receive a very strict review to improve my go-lang
skills 😄 👍
What is the problem with Git LFS?
Git LFS is an application that is executed via Git clean/smudge filter.
The process invocation of these filters requires noticeable time (especially
on Windows). An individual filter process is required for every single file
that Git touches during its operations (e.g.
checkout
etc).Proposed solution
Instead of a single Git LFS process per file, I propose a single Git LFS
process per Git invocation. That means Git invokes the filter process
(e.g. Git LFS) only once and then continuously talks to the same filter
process via a pipes.
You can find the corressponding WIP Git core implementation here:
https://github.com/larsxschneider/git/tree/filter-stream
Performance tests
I executed both test runs on a 2,5 GHz Intel Core i7 with SSD and OS X.
A test run is the consecutive execution of four Git commands:
Test command:
I compiled Git with the following flags:
TEST RUN A -- Default Git 2.9 (ab7797d) and Git LFS 1.2.1
TEST RUN B -- Git and Git LFS with stream filter support
Results
The Git stream filter solution is almost ✨ 70x faster ✨ when switching branches
on my local machine with a test repository containing 12,000 Git LFS files.
Based on my previous experience with Git LFS clone I expect even more
dramatic results on Windows.
Next Steps
command_smudge.go
/command_clean.go
andcommand_filter.go
.Questions
Thanks,
Lars