sync slow with large number of commits #2233

Open
rafael-atticlabs opened this Issue Aug 2, 2016 · 12 comments

Comments

Projects
None yet
3 participants
@rafael-atticlabs
Contributor

rafael-atticlabs commented Aug 2, 2016

The nomfs repo that @ahl linked to in #2220 has 20K+ commits. Syncing each of these commits is currently serialized because we need to each commit chunk to discover the ref of the previous one. We may need to add a server endpoint to do some of this work on the low-latency side of the connection

@aboodman aboodman changed the title from syncing datasets with large number of commits takes a long time to syncing slow with large number of commits Nov 8, 2016

@aboodman aboodman changed the title from syncing slow with large number of commits to sync slow with large number of commits Nov 8, 2016

@aboodman aboodman added the Format label Nov 8, 2016

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Nov 8, 2016

Contributor

Added format label because this probably implies changing the schema of Commit which may as well be a format change.

Contributor

aboodman commented Nov 8, 2016

Added format label because this probably implies changing the schema of Commit which may as well be a format change.

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 2, 2016

Contributor

Here is a simple and concrete proposal for how to fix this:

// There are two fairly distinct use cases for a Commit struct
//
// 1. Working with the current value programmatically
// 2. Exploring history (e.g., for sync or noms log)
//
// Optimizing these two use cases requires conflicting strategies. For use case (1),
// we want to minimize serial fetches before we get to the actual user data. So we
// need to use as much of possible of the root 4k chunk for user data and get
// everything else (metadata, history) out-of-line.
//
// For use case (2), we want to get as much of the history graph as possible in each
// chunk. In particular, we want to get multiple levels of the graph in each chunk
// because typically the history graph is tall and narrow. So we need the user data
// and metadata out-of-line, and the parents (and their parents, and so on) inline.
//
// To address this conflict, this design splits Commit into two different structs. The
// new Commit struct is optimized for use case (1) and the HistoryEntry struct is
// optimized for use case (2).

struct Commit {
	// meta and history out-of-line so that as much of the user data as possible
	// ends up in root chunk.
	meta?: Ref<Struct>
	history: Ref<Set<HistoryEntry>>
	value: Value
}

struct HistoryEntry {
	// commit is out-of-line, but history is inline. Thus each chunk has only nodes
	// of history graph in it, therefore you get about 4k/20 ~= 200 nodes of the
	// graph in each chunk. commit details for each node can then be fetched in
	// parallel.
	commit: Ref<Commit>
	history: Set<HistoryEntry>
}
Contributor

aboodman commented Dec 2, 2016

Here is a simple and concrete proposal for how to fix this:

// There are two fairly distinct use cases for a Commit struct
//
// 1. Working with the current value programmatically
// 2. Exploring history (e.g., for sync or noms log)
//
// Optimizing these two use cases requires conflicting strategies. For use case (1),
// we want to minimize serial fetches before we get to the actual user data. So we
// need to use as much of possible of the root 4k chunk for user data and get
// everything else (metadata, history) out-of-line.
//
// For use case (2), we want to get as much of the history graph as possible in each
// chunk. In particular, we want to get multiple levels of the graph in each chunk
// because typically the history graph is tall and narrow. So we need the user data
// and metadata out-of-line, and the parents (and their parents, and so on) inline.
//
// To address this conflict, this design splits Commit into two different structs. The
// new Commit struct is optimized for use case (1) and the HistoryEntry struct is
// optimized for use case (2).

struct Commit {
	// meta and history out-of-line so that as much of the user data as possible
	// ends up in root chunk.
	meta?: Ref<Struct>
	history: Ref<Set<HistoryEntry>>
	value: Value
}

struct HistoryEntry {
	// commit is out-of-line, but history is inline. Thus each chunk has only nodes
	// of history graph in it, therefore you get about 4k/20 ~= 200 nodes of the
	// graph in each chunk. commit details for each node can then be fetched in
	// parallel.
	commit: Ref<Commit>
	history: Set<HistoryEntry>
}
@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Dec 2, 2016

Contributor

Is each HistoryEntry.history the set (i.e. usually just one) of immediate predecessors for the referenced commit? If not, how is the commit DAG navigated?

Contributor

rafael-atticlabs commented Dec 2, 2016

Is each HistoryEntry.history the set (i.e. usually just one) of immediate predecessors for the referenced commit? If not, how is the commit DAG navigated?

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 2, 2016

Contributor
Contributor

aboodman commented Dec 2, 2016

@arv

This comment has been minimized.

Show comment
Hide comment
@arv

arv Dec 2, 2016

Contributor

Seems promising to me

Contributor

arv commented Dec 2, 2016

Seems promising to me

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Dec 2, 2016

Contributor

Right now a collection with 1 item will always inline item (it effectively can't chunk), so the common case of a lineage of commits each with one parent wouldn't ever chunk.

We could change chunking behavior so that a collection of one element would chunk.

Contributor

rafael-atticlabs commented Dec 2, 2016

Right now a collection with 1 item will always inline item (it effectively can't chunk), so the common case of a lineage of commits each with one parent wouldn't ever chunk.

We could change chunking behavior so that a collection of one element would chunk.

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 2, 2016

Contributor

We could change chunking behavior so that a collection of one element would chunk.

seems reasonable to me

Contributor

aboodman commented Dec 2, 2016

We could change chunking behavior so that a collection of one element would chunk.

seems reasonable to me

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 2, 2016

Contributor

side note: I sorta feel like we should redefine "format change" to mean "if we make this change and you have an old noms codebase, it can't successfully talk to us anymore" - regardless of whether the format was actually changed.

Contributor

aboodman commented Dec 2, 2016

side note: I sorta feel like we should redefine "format change" to mean "if we make this change and you have an old noms codebase, it can't successfully talk to us anymore" - regardless of whether the format was actually changed.

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Dec 2, 2016

Contributor

Another possible solution:

struct Commit {
  meta?: Ref<Struct>
  parents: Set<Ref<Commit>>
  history: Ref<Map<Number, Set<Ref<Commit>>
}

The addition here is the reference to a new map which contains the full set of refs to commits prior to this one. The mapping is height(Ref<Commit>) -> Set<Ref<Commits with that height>>

This is useful for sync because the common thing to want is to quickly enumerate all commits between two points (a) -> (b). height(b) will always be > height(a), so just enumerating all commit refs with heights in that range will retrieve all of the commits you care about (it might return too many if there are branches which merge together, but it would frequently be exactly right.

Also, I think we could implement this without a noms version change. it's just an additional field on commit which sync, log or anything else can use if it's present to rapidly fetch a set of commits.

Lastly, for common cases of simple lineages of commits, the chunks in the map will dedup frequently because entries are effectively only ever added to the end of the map sequence.

Contributor

rafael-atticlabs commented Dec 2, 2016

Another possible solution:

struct Commit {
  meta?: Ref<Struct>
  parents: Set<Ref<Commit>>
  history: Ref<Map<Number, Set<Ref<Commit>>
}

The addition here is the reference to a new map which contains the full set of refs to commits prior to this one. The mapping is height(Ref<Commit>) -> Set<Ref<Commits with that height>>

This is useful for sync because the common thing to want is to quickly enumerate all commits between two points (a) -> (b). height(b) will always be > height(a), so just enumerating all commit refs with heights in that range will retrieve all of the commits you care about (it might return too many if there are branches which merge together, but it would frequently be exactly right.

Also, I think we could implement this without a noms version change. it's just an additional field on commit which sync, log or anything else can use if it's present to rapidly fetch a set of commits.

Lastly, for common cases of simple lineages of commits, the chunks in the map will dedup frequently because entries are effectively only ever added to the end of the map sequence.

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 2, 2016

Contributor

This is a pretty good idea for a quick workaround that doesn't require a format change. Cool. I guess whether to do this depends on how complicated it is to maintain the fork in the code that uses this if it is available or falls back to not using it.

Contributor

aboodman commented Dec 2, 2016

This is a pretty good idea for a quick workaround that doesn't require a format change. Cool. I guess whether to do this depends on how complicated it is to maintain the fork in the code that uses this if it is available or falls back to not using it.

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 2, 2016

Contributor

It's also worth remembering that your solution is asymptotically faster (and probably faster in practice) for the specific problem of sync, because it changes the problem from graph traversal to just an index scan.

Contributor

aboodman commented Dec 2, 2016

It's also worth remembering that your solution is asymptotically faster (and probably faster in practice) for the specific problem of sync, because it changes the problem from graph traversal to just an index scan.

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Oct 10, 2017

Contributor

New idea. Use a skip list. I.e. imagine using a prolly tree as an index. Observe that we only ever append to the end. If we use a prolly tree, we'll end up writing log n new chunks and reading log n chunks in order to mutate for each commit.

Using a skip list, means that we'll read (fewer than -on average- I think) log n chunks, but we write only the existing commit chunks.

We should be able to pick the level probability to have the same effective fan out as a prolly tree.

Contributor

rafael-atticlabs commented Oct 10, 2017

New idea. Use a skip list. I.e. imagine using a prolly tree as an index. Observe that we only ever append to the end. If we use a prolly tree, we'll end up writing log n new chunks and reading log n chunks in order to mutate for each commit.

Using a skip list, means that we'll read (fewer than -on average- I think) log n chunks, but we write only the existing commit chunks.

We should be able to pick the level probability to have the same effective fan out as a prolly tree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment