git/{odb,githistory}: don't write unchanged objects #2541

Merged
merged 6 commits into master from migrate-no-write-unchanged on Sep 7, 2017

Conversation

@ttaylorr (Contributor) commented Aug 30, 2017

This pull request adds a typesafe Equal() method to *Blob, *Commit, and *Tree, and uses it in the git/githistory package to avoid writing unchanged objects to disk.

Previously, the git lfs migrate command would write the contents of every object that it examined during the migration, regardless of whether that object's contents had changed. This is sub-optimal: it not only marshals the contents of unchanged objects to a buffer and performs an expensive save, but it also effectively unpacks every object in the repository. That's no good.

To address this, this pull request opts for the following option that I originally proposed in #2359:

Teach either:

  1. [...]
  2. Packfile APIs, in order to either a) write packed objects, or b) skip writing objects that already exist in the database (loose or packed).

We can detect whether an object already exists in the object database, in either loose or packed format, by determining whether the contents of the blob/tree/commit changed during the rewrite. To be opened for rewriting, the object must already exist; therefore, if its contents are unchanged, it is already saved to the database.

By defining (t *<T>) Equal(other *<T>) on each object type in package git/odb, and comparing the "rewritten" instance of each object against the original, we can return early for unchanged objects and skip writing them to the database, which would otherwise involve several filesystem-level operations (open(), write(), a syscall to move the file into place, and so on).
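
To illustrate the idea, here is a minimal, self-contained sketch. The Blob type and writeIfChanged helper below are simplified stand-ins for illustration, not the actual git/odb API (in particular, the real *Blob holds an io.Reader rather than a byte slice):

package main

import (
	"bytes"
	"fmt"
)

// Blob is a simplified stand-in for git/odb's *Blob.
type Blob struct {
	Size     int64
	Contents []byte
}

// Equal reports whether two blobs would serialize to identical objects.
func (b *Blob) Equal(other *Blob) bool {
	if (b == nil) != (other == nil) {
		return false
	}
	if b == nil {
		return true
	}
	return b.Size == other.Size && bytes.Equal(b.Contents, other.Contents)
}

// writeIfChanged skips the expensive save when the rewritten object equals
// the original: an unchanged object was just read from the database, so it
// must already exist there.
func writeIfChanged(original, rewritten *Blob, write func(*Blob) error) error {
	if rewritten.Equal(original) {
		return nil // unchanged: no marshal, open(), write(), or move
	}
	return write(rewritten)
}

func main() {
	orig := &Blob{Size: 5, Contents: []byte("hello")}
	same := &Blob{Size: 5, Contents: []byte("hello")}

	err := writeIfChanged(orig, same, func(*Blob) error {
		return fmt.Errorf("write should be skipped for unchanged blobs")
	})
	fmt.Println(err) // <nil>
}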

This yields a significant performance boost. Prior to this change, it takes about 38.12 seconds to examine all ~6,000 commits in the git-lfs/git-lfs repository:

~/g/git-lfs (master) $ time git lfs migrate info
migrate: Sorting commits: ..., done
migrate: Rewriting commits: 100% (5840/5840), done
# ...
git lfs migrate info  36.30s user 19.80s system 147% cpu 38.127 total

With this change, however, the operation takes only 18.14 seconds:

~/g/git-lfs (master) $ time git lfs migrate info
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (5840/5840), done
# ...
git lfs migrate info  23.74s user 5.71s system 162% cpu 18.144 total

Closes: #2359.


/cc @git-lfs/core

@ttaylorr added this to the v2.3.0 milestone on Aug 30, 2017
@technoweenie (Contributor) left a comment:


This looks like a great optimization. I have one question about the *Blob implementation, though.

}

if b != nil {
return b.Contents == other.Contents &&

I'm not sure comparing io.Reader is the right solution here. I'd expect a test like this to pass:

func TestBlobEqualReturnsTrueWithUnchangedButReadContents(t *testing.T) {
	c1 := strings.NewReader("Hello, world!")
	c2 := strings.NewReader("Hello, world!")

	b1 := &Blob{Size: int64(c1.Len()), Contents: c1}
	b2 := &Blob{Size: int64(c2.Len()), Contents: c2}

	assert.True(t, b1.Equal(b2))
}

Based on the way that the blob rewrite function usually returns a new *Blob, I think this will always fail the Equal() check, unless both Contents are pointers to the same reader.

BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) {
	if filepath.Base(path) == ".gitattributes" {
		return b, nil
	}

	var buf bytes.Buffer

	if _, err := clean(&buf, b.Contents, path, b.Size); err != nil {
		return nil, err
	}

	if ext := filepath.Ext(path); len(ext) > 0 {
		exts.Add(fmt.Sprintf("*%s filter=lfs diff=lfs merge=lfs -text", ext))
	}

	return &odb.Blob{
		Contents: &buf, Size: int64(buf.Len()),
	}, nil
},

I suppose this works out in practice, since any call to BlobFn means that blob is definitely being converted.
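
(To make the reader-identity point concrete: comparing io.Reader values compares the interface's dynamic type and pointer, not the bytes the readers would yield. A standalone check, not code from this pull request:)

package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	c1 := strings.NewReader("Hello, world!")
	c2 := strings.NewReader("Hello, world!")

	// Interface equality compares dynamic type and pointer identity,
	// not contents.
	fmt.Println(io.Reader(c1) == io.Reader(c2)) // false: distinct readers
	fmt.Println(io.Reader(c1) == io.Reader(c1)) // true: same reader
}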

@ttaylorr (Contributor, Author) replied:

Great question -- you're absolutely right. Two io.Readers that yield the same content but are instantiated separately will fail the equality check.

That's fine in practice here since, as you pointed out, all calls to the BlobFn mean the blob is definitely being converted. That said, there are cases in the 'info' migrator where the contents of the blob do not change:

BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) {
	ext := fmt.Sprintf("*%s", filepath.Ext(path))

	if len(ext) > 1 {
		entry := exts[ext]
		if entry == nil {
			entry = &MigrateInfoEntry{Qualifier: ext}
		}

		entry.Total++
		entry.BytesTotal += b.Size

		if b.Size > int64(migrateInfoAbove) {
			entry.TotalAbove++
			entry.BytesAbove += b.Size
		}

		exts[ext] = entry
	}
	return b, nil
},

A more complete implementation might involve some buffering elsewhere, and keeping track of the SHA as it changes, but I think that this will do for now.
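
(As a rough sketch of the "keep track of the SHA" idea: hashing the serialized contents would let two blobs be compared by object ID regardless of which reader they came from. blobOID below is a hypothetical helper for illustration only, not part of this change:)

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobOID computes the Git object ID for a blob's contents, i.e. SHA-1 over
// "blob <size>\x00<contents>". Hypothetical helper, for illustration only.
func blobOID(contents []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(contents))
	h.Write(contents)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := []byte("Hello, world!")
	b := []byte("Hello, world!")

	// Identical contents hash to the same object ID, even though two
	// separate readers over them would never compare equal.
	fmt.Println(blobOID(a) == blobOID(b)) // true
}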

@ttaylorr merged commit a2981c1 into master on Sep 7, 2017
@ttaylorr deleted the migrate-no-write-unchanged branch on September 7, 2017 at 20:47