New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
git/{odb,githistory}: don't write unchanged objects #2541
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a great optimization. I have 1 question about the *Blob
implementation though.
} | ||
|
||
if b != nil { | ||
return b.Contents == other.Contents && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure comparing io.Reader
is the right solution here. I'd expect a test like this to pass:
func TestBlobEqualReturnsTrueWithUnchangedButReadContents(t *testing.T) {
c1 := strings.NewReader("Hello, world!")
c2 := strings.NewReader("Hello, world!")
b1 := &Blob{Size: int64(c1.Len()), Contents: c1}
b2 := &Blob{Size: int64(c2.Len()), Contents: c2}
assert.True(t, b1.Equal(b2))
}
Based on the way that the blob rewrite function usually returns a new *Blob
, I think this will always fail the Equal()
check, unless both Contents
are pointers to the same reader.
git-lfs/commands/command_migrate_import.go
Lines 31 to 49 in 7aeeec2
BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) { | |
if filepath.Base(path) == ".gitattributes" { | |
return b, nil | |
} | |
var buf bytes.Buffer | |
if _, err := clean(&buf, b.Contents, path, b.Size); err != nil { | |
return nil, err | |
} | |
if ext := filepath.Ext(path); len(ext) > 0 { | |
exts.Add(fmt.Sprintf("*%s filter=lfs diff=lfs merge=lfs -text", ext)) | |
} | |
return &odb.Blob{ | |
Contents: &buf, Size: int64(buf.Len()), | |
}, nil | |
}, |
I suppose this works out in practice, since any call to BlobFn
means that blob is definitely being converted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question -- you're absolutely right. Two io.Reader
's that yield the same content but are instantiated differently will fail the equality check.
That's fine in practice here, since as you pointed out, all calls to the BlobFn
mean the blob is definitely being converted. That said, there are cases in the 'info' migrator where the contents of the blob does not change:
git-lfs/commands/command_migrate_info.go
Lines 70 to 91 in cdd0cb7
BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) { | |
ext := fmt.Sprintf("*%s", filepath.Ext(path)) | |
if len(ext) > 1 { | |
entry := exts[ext] | |
if entry == nil { | |
entry = &MigrateInfoEntry{Qualifier: ext} | |
} | |
entry.Total++ | |
entry.BytesTotal += b.Size | |
if b.Size > int64(migrateInfoAbove) { | |
entry.TotalAbove++ | |
entry.BytesAbove += b.Size | |
} | |
exts[ext] = entry | |
} | |
return b, nil | |
}, |
A more complete implementation might involve some buffering elsewhere, and keeping track of the SHA as it changes, but I think that this will do for now.
This pull request teaches a typesafe
Equal()
to*Blob
's,*Commit
's, and*Tree
's, and uses that function in thegit/githistory
package to avoid writing unchanged objects to disk.Previously, the
git lfs migrate
command would write the contents of every object that it examined in the migration, regardless of whether that object's contents changed. This is sub-optimal, because it involves not only marshaling the contents of unchanged objects to a buffer and performing an expensive save, but also because it effectively has the behavior of unpacking all objects in a repository. That's no good.To address this, this pull request opts for the following option that I originally proposed in #2359:
We can detect whether objects already exist in the object database in either a loose or packed format by determining whether or not the contents of the blob/tree/commit changed when rewriting it. In order to open the object, it must exist, so therefore, if it is unchanged, the object is already saved to the database.
By defining
(t *<T>) Equal(<T> other)
on each object type in packagegit/odb
, and comparing the "rewritten" instances of each object against the original, we can avoid writing objects to the database (which involves several filesystem-level operations,open()
,write()
, a syscall to move, etc.) and instead return early, avoiding all of the previously mentioned operations.This yields a significant performance boost. When examining the
git-lfs/git-lfs
repository prior to this change, it takes about 38.12 seconds to examing all ~6,000 commits:Including this change, however, the operation takes only 18.14 seconds:
Closes: #2359.
/cc @git-lfs/core