git/{odb,githistory}: don't write unchanged objects #2541

Merged
merged 6 commits into master from migrate-no-write-unchanged on Sep 7, 2017

Conversation

@ttaylorr (Contributor) commented Aug 30, 2017

This pull request adds a typesafe Equal() method to *Blob, *Commit, and *Tree, and uses it in the git/githistory package to avoid writing unchanged objects to disk.

Previously, the git lfs migrate command would write the contents of every object that it examined during the migration, regardless of whether that object's contents had changed. This is sub-optimal: it not only marshals the contents of unchanged objects to a buffer and performs an expensive save, but it also effectively unpacks every object in the repository. That's no good.

To address this, this pull request opts for the following option that I originally proposed in #2359:

Teach either:

  1. [...]
  2. Packfile APIs, in order to either a) write packed objects, or b) skip writing objects that already exist in the database (loose or packed).

We can detect whether an object already exists in the object database, in either loose or packed format, by determining whether the contents of the blob/tree/commit changed during the rewrite. To be opened for rewriting, the object must already exist; therefore, if its contents are unchanged, it is already saved to the database.

By defining (t *<T>) Equal(other *<T>) on each object type in package git/odb, and comparing the "rewritten" instance of each object against the original, we can return early for unchanged objects and skip writing them to the database, which would otherwise involve several filesystem-level operations (open(), write(), a syscall to move the file into place, and so on).
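
To illustrate the idea, here is a minimal, self-contained sketch. The Blob type and writeIfChanged helper below are simplified stand-ins for illustration, not the actual git/odb API (in particular, the real *Blob holds an io.Reader rather than a byte slice):

package main

import (
	"bytes"
	"fmt"
)

// Blob is a simplified stand-in for git/odb's *Blob.
type Blob struct {
	Size     int64
	Contents []byte
}

// Equal reports whether two blobs would serialize to identical objects.
func (b *Blob) Equal(other *Blob) bool {
	if (b == nil) != (other == nil) {
		return false
	}
	if b == nil {
		return true
	}
	return b.Size == other.Size && bytes.Equal(b.Contents, other.Contents)
}

// writeIfChanged skips the expensive save when the rewritten object equals
// the original: an unchanged object was just read from the database, so it
// must already exist there.
func writeIfChanged(original, rewritten *Blob, write func(*Blob) error) error {
	if rewritten.Equal(original) {
		return nil // unchanged: no marshal, open(), write(), or move
	}
	return write(rewritten)
}

func main() {
	orig := &Blob{Size: 5, Contents: []byte("hello")}
	same := &Blob{Size: 5, Contents: []byte("hello")}

	err := writeIfChanged(orig, same, func(*Blob) error {
		return fmt.Errorf("write should be skipped for unchanged blobs")
	})
	fmt.Println(err) // <nil>
}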

This yields a significant performance boost. Prior to this change, it takes about 38.12 seconds to examine all ~6,000 commits in the git-lfs/git-lfs repository:

~/g/git-lfs (master) $ time git lfs migrate info
migrate: Sorting commits: ..., done
migrate: Rewriting commits: 100% (5840/5840), done
# ...
git lfs migrate info  36.30s user 19.80s system 147% cpu 38.127 total

With this change, however, the operation takes only 18.14 seconds:

~/g/git-lfs (master) $ time git lfs migrate info
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (5840/5840), done
# ...
git lfs migrate info  23.74s user 5.71s system 162% cpu 18.144 total

Closes: #2359.


/cc @git-lfs/core

@ttaylorr added this to the v2.3.0 milestone on Aug 30, 2017
@technoweenie (Contributor) left a comment:


This looks like a great optimization. I have one question about the *Blob implementation, though.

}

if b != nil {
return b.Contents == other.Contents &&

I'm not sure comparing io.Reader is the right solution here. I'd expect a test like this to pass:

func TestBlobEqualReturnsTrueWithUnchangedButReadContents(t *testing.T) {
	c1 := strings.NewReader("Hello, world!")
	c2 := strings.NewReader("Hello, world!")

	b1 := &Blob{Size: int64(c1.Len()), Contents: c1}
	b2 := &Blob{Size: int64(c2.Len()), Contents: c2}

	assert.True(t, b1.Equal(b2))
}

Based on the way that the blob rewrite function usually returns a new *Blob, I think this will always fail the Equal() check, unless both Contents are pointers to the same reader.

BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) {
	if filepath.Base(path) == ".gitattributes" {
		return b, nil
	}

	var buf bytes.Buffer

	if _, err := clean(&buf, b.Contents, path, b.Size); err != nil {
		return nil, err
	}

	if ext := filepath.Ext(path); len(ext) > 0 {
		exts.Add(fmt.Sprintf("*%s filter=lfs diff=lfs merge=lfs -text", ext))
	}

	return &odb.Blob{
		Contents: &buf, Size: int64(buf.Len()),
	}, nil
},

I suppose this works out in practice, since any call to BlobFn means that blob is definitely being converted.
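
(To make the reader-identity point concrete: comparing io.Reader values compares the interface's dynamic type and pointer, not the bytes the readers would yield. A standalone check, not code from this pull request:)

package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	c1 := strings.NewReader("Hello, world!")
	c2 := strings.NewReader("Hello, world!")

	// Interface equality compares dynamic type and pointer identity,
	// not contents.
	fmt.Println(io.Reader(c1) == io.Reader(c2)) // false: distinct readers
	fmt.Println(io.Reader(c1) == io.Reader(c1)) // true: same reader
}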

@ttaylorr (Contributor, Author) replied:

Great question -- you're absolutely right. Two io.Readers that yield the same content but are instantiated separately will fail the equality check.

That's fine in practice here since, as you pointed out, all calls to the BlobFn mean the blob is definitely being converted. That said, there are cases in the 'info' migrator where the contents of the blob do not change:

BlobFn: func(path string, b *odb.Blob) (*odb.Blob, error) {
	ext := fmt.Sprintf("*%s", filepath.Ext(path))

	if len(ext) > 1 {
		entry := exts[ext]
		if entry == nil {
			entry = &MigrateInfoEntry{Qualifier: ext}
		}

		entry.Total++
		entry.BytesTotal += b.Size

		if b.Size > int64(migrateInfoAbove) {
			entry.TotalAbove++
			entry.BytesAbove += b.Size
		}

		exts[ext] = entry
	}
	return b, nil
},

A more complete implementation might involve some buffering elsewhere, and keeping track of the SHA as it changes, but I think that this will do for now.
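
(As a rough sketch of the "keep track of the SHA" idea: hashing the serialized contents would let two blobs be compared by object ID regardless of which reader they came from. blobOID below is a hypothetical helper for illustration only, not part of this change:)

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobOID computes the Git object ID for a blob's contents, i.e. SHA-1 over
// "blob <size>\x00<contents>". Hypothetical helper, for illustration only.
func blobOID(contents []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(contents))
	h.Write(contents)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := []byte("Hello, world!")
	b := []byte("Hello, world!")

	// Identical contents hash to the same object ID, even though two
	// separate readers over them would never compare equal.
	fmt.Println(blobOID(a) == blobOID(b)) // true
}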

@ttaylorr merged commit a2981c1 into master on Sep 7, 2017
@ttaylorr deleted the migrate-no-write-unchanged branch on September 7, 2017 at 20:47