
zigzag: cache optimizations #465

Merged: 1 commit into master on Mar 13, 2019

Conversation

@schomatis (Contributor) commented Feb 2, 2019

No description provided.

@schomatis (Contributor Author)

Baseline benchmark:

replication_time: 2.594320218s, target: stats, place: filecoin-proofs/examples/zigzag.rs:170 zigzag, root: filecoin-proofs
replication_time/byte: 2.473µs, target: stats, place: filecoin-proofs/examples/zigzag.rs:171 zigzag, root: filecoin-proofs
replication_time/GiB: 2656.583903231s, target: stats, place: filecoin-proofs/examples/zigzag.rs:176 zigzag, root: filecoin-proofs

@schomatis (Contributor Author)

Need to change the entire parents API to make it mutable so it can keep state.

@dignifiedquire (Contributor)

Need to change the entire parents API to make it mutable so it can keep state.

I don't think you need to, take a look at this: https://doc.rust-lang.org/book/ch15-05-interior-mutability.html
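
For illustration, a minimal sketch of that interior-mutability pattern applied to a parents cache, using `RefCell` (single-threaded only); the `Graph` struct and its methods here are placeholders, not the actual ZigZag API:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical stand-in for the graph: the cache can be mutated through a
// shared reference, so `parents()` keeps its `&self` signature.
struct Graph {
    expansion_degree: usize,
    parents_cache: RefCell<HashMap<usize, Vec<usize>>>,
}

impl Graph {
    // Stand-in for the real (expensive) Feistel-based computation.
    fn compute_parents(&self, node: usize) -> Vec<usize> {
        (0..self.expansion_degree).map(|i| node + i).collect()
    }

    fn parents(&self, node: usize) -> Vec<usize> {
        // Note: RefCell is not thread-safe; across threads RwLock or Mutex
        // would be needed instead (which is what came up later in this PR).
        self.parents_cache
            .borrow_mut()
            .entry(node)
            .or_insert_with(|| self.compute_parents(node))
            .clone()
    }
}

fn main() {
    let g = Graph {
        expansion_degree: 8,
        parents_cache: RefCell::new(HashMap::new()),
    };
    assert_eq!(g.parents(3), g.parents(3)); // second call hits the cache
}
```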

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from f68adac to 465c48a Compare February 4, 2019 17:10
@schomatis (Contributor Author)

Thanks for that reference @dignifiedquire! I implemented the RefCell solution, but in the end I got an error because ZigZag seems to be used across different threads, so I'm trying RwLock instead (which is currently not working due to another error I need to figure out). Does that make sense to you?

@schomatis (Contributor Author)

The current error is due to the fact that the RwLock I introduced doesn't implement the Eq and Clone traits that the ZigZag trait requires. Should I write the explicit implementations that would need to handle the RwLock?
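
If explicit impls end up being the way to go, this is roughly the shape they could take, sketched on a made-up struct rather than the real ZigZagGraph; whether the cache should participate in equality and cloning at all is the real design decision:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type ParentCache = HashMap<usize, Vec<usize>>;

// Hypothetical stand-in for the graph struct.
struct Graph {
    expansion_degree: usize,
    parents_cache: RwLock<ParentCache>,
}

impl Clone for Graph {
    fn clone(&self) -> Self {
        Graph {
            expansion_degree: self.expansion_degree,
            // Either copy the cached contents (requires taking the read
            // lock) or start the clone with an empty cache.
            parents_cache: RwLock::new(self.parents_cache.read().unwrap().clone()),
        }
    }
}

impl PartialEq for Graph {
    fn eq(&self, other: &Self) -> bool {
        // The cache is derived data, so it can reasonably be ignored here.
        self.expansion_degree == other.expansion_degree
    }
}

impl Eq for Graph {}

fn main() {
    let g = Graph {
        expansion_degree: 8,
        parents_cache: RwLock::new(HashMap::new()),
    };
    let h = g.clone();
    assert!(g == h);
}
```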

@dignifiedquire (Contributor)

dignifiedquire commented Feb 4, 2019 via email

@schomatis (Contributor Author)

I believe the common solution to refcell across threads is to use Arc<RefCell>

Arc<RefCell<T>> seems to go back to the `cannot be shared between threads safely` error (see 26069e7). I'll keep digging.
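
For context, `RefCell` is `!Sync`, so wrapping it in `Arc` still doesn't make it shareable across threads; `Arc<RwLock<T>>` (or `Arc<Mutex<T>>`) is the usual thread-safe counterpart. A standalone illustration, not project code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    // Arc<RefCell<_>> would not compile here because RefCell is !Sync;
    // RwLock provides the same interior mutability in a thread-safe way.
    let cache: Arc<RwLock<HashMap<usize, Vec<usize>>>> =
        Arc::new(RwLock::new(HashMap::new()));

    let handles: Vec<_> = (0..4usize)
        .map(|i| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                cache.write().unwrap().insert(i, vec![i, i + 1]);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    assert_eq!(cache.read().unwrap().len(), 4);
}
```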

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from 26069e7 to 2ab0798 Compare February 4, 2019 18:22
@dignifiedquire (Contributor) left a comment

Using RwLock and implementing PartialEq and Clone manually seems like the right call to me. It would be nice to use an LRU cache instead of a HashMap, to avoid unbounded growth, but I guess that can be done later.

@@ -223,7 +231,13 @@ where

#[inline]
fn expanded_parents(&self, node: usize) -> Vec<usize> {
(0..self.expansion_degree)

let parents_cache = self.parents_cache().read().unwrap();
Contributor

This lock should be inside a block like this:

{
  let parents_cache = ...
}

Otherwise the lock is held for too long; the same goes for the write lock below.

Contributor Author

yeap, just found that out in my local test

Contributor Author

I was actually deadlocking the second lock by never releasing the first one.
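
A minimal sketch of the scoping fix being discussed (only the cache type is assumed here): the read guard has to be dropped before the write lock is taken, otherwise the same thread can deadlock against itself.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

fn expanded_parents(cache: &RwLock<HashMap<usize, Vec<usize>>>, node: usize) -> Vec<usize> {
    // Scope the read lock so the guard is dropped before we try to write.
    {
        let read_guard = cache.read().unwrap();
        if let Some(parents) = read_guard.get(&node) {
            return parents.clone();
        }
    } // read lock released here

    // Stand-in for the real parents computation.
    let parents: Vec<usize> = vec![node.wrapping_sub(1), node + 1];

    // Safe to take the write lock now; holding the read guard across this
    // call is what caused the deadlock described above.
    cache.write().unwrap().insert(node, parents.clone());
    parents
}

fn main() {
    let cache = RwLock::new(HashMap::new());
    assert_eq!(expanded_parents(&cache, 5), expanded_parents(&cache, 5));
}
```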


let parents_cache = self.parents_cache().read().unwrap();
if (*parents_cache).contains_key(&node) {
    return (*parents_cache)[&node].clone();
Contributor

We should change the signature to return a &[u8] instead; cloning vectors is expensive and reduces the usefulness of the cache.

Contributor Author

There seem to be some issues suggesting that this has already been fixed, e.g.:

rust-lang/rust#13472
rust-lang/rust#11015
rust-lang/rust#13539

WDYT?

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from 2ab0798 to dc7906a Compare February 4, 2019 18:36
@schomatis (Contributor Author)

Made a temporary implementation of the Eq and Clone traits (which ignores the cache, to change the current implementation as little as possible, since I'm not sure under which circumstances the graph is cloned), and I can now build the RwLock implementation (still not sure if this is the structure I should be using, though).

// TODO(dig): We should change the signature to return a &[u8] instead,
// cloning vectors is expensive, and reduces the usefulness of the cache.
}
// Release the read lock (a write one will be taken later).
Contributor

We can, but that would slow things down as it would always block (RwLock is single-writer, many-reader).

@schomatis (Contributor Author)

replication_time: 2.811180136s, target: stats, place: filecoin-proofs/examples/zigzag.rs:170 zigzag, root: filecoin-proofs
replication_time/byte: 2.68µs, target: stats, place: filecoin-proofs/examples/zigzag.rs:171 zigzag, root: filecoin-proofs
replication_time/GiB: 2878.648459263s, target: stats, place: filecoin-proofs/examples/zigzag.rs:176 zigzag, root: filecoin-proofs

Pretty much the same time as without the cache (actually this is a bit slower), so I'll take a closer look at the implementation to check whether I'm actually caching the results correctly (I'll try to add a test for it). Until then I won't worry too much about the RwLock use (I don't think the lock mechanism is cancelling out the performance improvement of the cache, but I don't know how ZigZag is used across threads).

(Thanks for the help @dignifiedquire.)

@porcuquine (Collaborator)

It's fairly likely that caching parents won't turn out to be a worthwhile tradeoff, so it would be fine to abandon this once you confirm that you've measured what you think. If you find that caching the Feistel computations (or expansion_parents) is slower than not doing so, that would be unexpected and worth investigating.

@schomatis (Contributor Author)

It's fairly likely that caching parents won't turn out to be a worthwhile tradeoff, so it would be fine to abandon this once you confirm that you've measured what you think. If you find that caching the Feistel computations (or expansion_parents) is slower than not doing so, that would be unexpected and worth investigating.

Good to know. My current obstacles right now are independent of what and where we'll be caching, mainly:

  1. (Probably a bug in my current implementation:) I need to prove that caching anything (everything) is in fact making a difference, i.e., that we're getting some speed improvement even at the cost of a bigger memory footprint (which I'm not even tracking at the moment).

  2. I'm trying not to change the parents API but still mutate the ZigZag structure. I couldn't use the RefCell pattern because the method is being called from different threads, so I'm giving RwLock a try (I'm not paying much attention to this though; I just want to be able to test the caching without modifying big parts of the code).

@schomatis (Contributor Author)

so I'll take a closer look at the implementation

Note to self: to simplify the initial Clone implementation I'm not copying the internal cache; this might have an important impact on performance (I should copy it by default at this point, until I see any improvement).

@schomatis (Contributor Author)

It turns out that the cache in expanded_parents is being used only 10 times out of a total number of 32K nodes (see bbc7329 and its bench output), at least in this benchmark example.

What was the motivation behind including a cache in the first place? In which scenario can we get a significant number of cache hits to actually test its performance impact?

/cc @porcuquine

@schomatis (Contributor Author)

(Copying the cache when cloning the structure only increased the cache hit number to 15.)

@schomatis (Contributor Author)

Share both caches.

The caches need to be distinguished (even for the same type of graph) between forward and reverse; the parents are not the same.

Most of the ZigZag graphs are created from their zigzag() counterpart (transforming forward to reverse and vice versa), which means we can't pass the cache of one graph to another, because it would be inverted and wouldn't be useful to the new graph.

As a starting point, every ZigZag graph will hold both caches, even if it only uses one of them throughout its lifetime; it retains the other one because it will be needed by the next zigzag().

This already reduces the cache misses to only the encoding of the first layer (which was the desired effect) and provides a performance increase of 20%.

replication_time: 2.049265745s, target: stats, place: filecoin-proofs/examples/zigzag.rs:170 zigzag, root: filecoin-proofs
replication_time/byte: 1.953µs, target: stats, place: filecoin-proofs/examples/zigzag.rs:171 zigzag, root: filecoin-proofs
replication_time/GiB: 2098.448122879s, target: stats, place: filecoin-proofs/examples/zigzag.rs:176 zigzag, root: filecoin-proofs

With an actual working cache that proves its usefulness we can iterate from here evaluating the different trade-offs.
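
A rough sketch of the arrangement described above; the struct and method names are illustrative rather than the actual implementation, the point being that zigzag() flips direction but keeps handing both caches along:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ParentCache = HashMap<usize, Vec<usize>>;

// Both directions' caches travel together; `zigzag()` only flips which one
// the graph reads from.
struct SharedCaches {
    forward: RwLock<ParentCache>,
    reversed: RwLock<ParentCache>,
}

struct Graph {
    reversed: bool,
    caches: Arc<SharedCaches>,
}

impl Graph {
    fn new() -> Self {
        Graph {
            reversed: false,
            caches: Arc::new(SharedCaches {
                forward: RwLock::new(HashMap::new()),
                reversed: RwLock::new(HashMap::new()),
            }),
        }
    }

    // The counterpart graph inverts direction but keeps the same Arc, so
    // the cache built for one direction is still there for later layers
    // that run in that same direction.
    fn zigzag(&self) -> Self {
        Graph {
            reversed: !self.reversed,
            caches: Arc::clone(&self.caches),
        }
    }

    fn current_cache(&self) -> &RwLock<ParentCache> {
        if self.reversed {
            &self.caches.reversed
        } else {
            &self.caches.forward
        }
    }
}

fn main() {
    let forward = Graph::new();
    let reversed = forward.zigzag();

    // Entries written while working in one direction survive the next
    // zigzag() because both graphs share the same Arc.
    reversed.current_cache().write().unwrap().insert(0, vec![1, 2]);
    let later_reversed = forward.zigzag();
    assert_eq!(later_reversed.current_cache().read().unwrap().len(), 1);
}
```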

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from 0f1393e to b524f5a Compare February 8, 2019 03:12
@schomatis (Contributor Author)

@porcuquine Ready for review.

This is not the optimal/final solution. It's just a low-impact cache (provided MAX_CACHE_SIZE is set to an acceptable value) that provides a benchmarked time optimization. The objective is to lock this down so we have a concrete implementation to iterate from while discussing possible trade-offs.

Strangely, this is performing much better than the previous implementation, which was conceptually the same; the only concrete difference was allocating the size of the caches up front. (Since I can't really explain this, let's keep assuming for now the conservative 20% speed improvement mentioned before and not this 35%.)

replication_time: 1.668175459s, target: stats, place: filecoin-proofs/examples/zigzag.rs:170 zigzag, root: filecoin-proofs
replication_time/byte: 1.59µs, target: stats, place: filecoin-proofs/examples/zigzag.rs:171 zigzag, root: filecoin-proofs
replication_time/GiB: 1708.211670015s, target: stats, place: filecoin-proofs/examples/zigzag.rs:176 zigzag, root: filecoin-proofs

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from b524f5a to 274a52d Compare February 8, 2019 03:24
@schomatis (Contributor Author)

(Dropping the CircleCI benchmark now.)

@schomatis schomatis changed the title [WIP] zigzag: cache optimizations zigzag: cache optimizations Feb 8, 2019
@schomatis (Contributor Author)

the only concrete difference was allocating the size of the caches up front.

Actually, taking a look at the HashMap resize logic in

https://github.com/rust-lang/rust/blob/d1731801163df1d3a8d4ddfa68adac2ec833ef7f/src/libstd/collections/hash/map.rs#L941-L959

it may be more expensive than I originally thought.
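
If that resize cost matters, the map can be created with its final capacity up front so inserts never trigger a resize; a small standalone illustration (the entry count is made up):

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical entry count, e.g. one entry per node in the benchmark.
    let cache_entries = 32 * 1024;

    // `with_capacity` reserves room for at least `cache_entries` keys, so
    // filling the cache never goes through the resize path linked above.
    let mut cache: HashMap<usize, Vec<usize>> = HashMap::with_capacity(cache_entries);

    for node in 0..cache_entries {
        cache.insert(node, vec![node]);
    }
    assert!(cache.capacity() >= cache_entries);
}
```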

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from 274a52d to a9d50f9 Compare February 8, 2019 07:48
@schomatis (Contributor Author)

(Fix usize size estimation.)

@porcuquine (Collaborator)

porcuquine commented Feb 8, 2019

I tried running this and got a message about cache size:

➜  rust-proofs git:(feat/zigzag/cache-optimizations) ✗ ./target/release/examples/zigzag --size 1048576 --m 5 --expansion 8 --no-bench
Feb 08 13:00:42.410 INFO hasher: pedersen, target: config, place: filecoin-proofs/examples/zigzag.rs:408 zigzag, root: filecoin-proofs
Feb 08 13:00:42.410 INFO data size: 1 GB, target: config, place: filecoin-proofs/examples/zigzag.rs:106 zigzag, root: filecoin-proofs
...
Feb 08 13:01:14.022 INFO running setup, place: filecoin-proofs/examples/zigzag.rs:138 zigzag, root: filecoin-proofs
Feb 08 13:01:14.022 INFO using a cache smaller (81920) than the number of nodes (33554432), place: storage-proofs/src/zigzag_graph.rs:107 storage_proofs::zigzag_graph, root: storage-proofs

Is that by design?

[Okay, I see that it is.]

pub type ParentCache = HashMap<usize, Vec<usize>>;
// TODO: Using `usize` as it's the dominant type throughout the
// code, but it should be reconciled with the underlying `u32`
// used in Feistel.
Collaborator

u32 will be fine up to a point, but I don't think it needs to be matched to the Feistel implementation. Graph nodes are 32 bytes (2^5), so u32 will let us handle sectors of up to 2^5 bytes * 2^32 = 2^37 bytes = 128 GiB. We may eventually want/need to support larger sectors — in which case we would need larger index representations. Since 64 bits would be wasteful, maybe we should just be using the smallest number of bytes that will hold all the indexes we need for the graph in question.

Contributor Author

but I don't think it needs to be matched to the Feistel implementation.

My concern is that at the moment I haven't seen any check restricting the number of nodes (although I might have missed it in other related files). I seem to be able to pass any value to zigzag --size 100000000000 (I haven't waited for the generation of fake data to see if execution actually continues), and a usize in the code also gives the impression that any value is possible, when we're actually coercing it to u32 later (so any value above that range would seem to violate the ZigZag semantics).

Collaborator

I agree that we are implicitly limited by Feistel. Here's what I suggest:

Let's put in an explicit check on the number of nodes allowed in a ZigZagGraph. As we move forward, we are going to need to work to be able to support larger and larger sector sizes. 128GiB is still out of range, so we don't need to solve the problem yet. Once we can otherwise handle such large sectors, we can extend our use of Feistel to accommodate that need.
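
A sketch of the kind of explicit check being suggested; the function is hypothetical and doesn't match the real constructor, it just makes the u32/Feistel limit fail loudly instead of silently truncating:

```rust
// Hypothetical guard; the real ZigZagGraph constructor looks different.
// Feistel currently works on u32 indexes: 2^32 nodes of 2^5 bytes each is
// a 2^37-byte (128 GiB) sector, so anything larger no longer fits in the
// index type.
fn check_nodes_fit_feistel(nodes: usize) -> Result<(), String> {
    if nodes > u32::MAX as usize {
        return Err(format!(
            "graph of {} nodes exceeds the u32 index space used by Feistel",
            nodes
        ));
    }
    Ok(())
}

fn main() {
    // 2^25 nodes (a 1 GiB sector of 32-byte nodes) is comfortably in range.
    assert!(check_nodes_fit_feistel(1 << 25).is_ok());
}
```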

// ZigZagGraph will hold two different (but related) `ParentCache`,
// the first one for the `forward` direction and the second one
// for the `reversed`.
pub type TwoWayParentCache = Vec<ParentCache>;
Collaborator

Since it's exactly two, you might also use a pair (ParentCache, ParentCache).

Collaborator

Alternately, you might consider — instead of two caches — storing a pair (or struct) holding both the 'forward' and 'backward' parents for each node. If your data structure is a BTreeMap of such pairs, this might be faster and/or smaller (I don't think there's physical overhead to such a static tuple) than two trees. Locality will also differ, so it might be something to play with when tweaking.

Contributor Author

Since it's exactly two, you might also use a pair (ParentCache, ParentCache).

Yes, this seems more natural.

Contributor Author

Tuples don't seem to allow dynamic indexing like tuple.index (they need the literal number), so I'll change the Vec to a fixed-length array instead.
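
For the record, a small sketch of the array variant (the direction flag is illustrative): a [ParentCache; 2] stays fixed-size like a tuple but can be indexed with a computed value.

```rust
use std::collections::HashMap;

type ParentCache = HashMap<usize, Vec<usize>>;

// 0 = forward, 1 = reversed; an array allows `caches[direction]` with a
// runtime value, whereas a tuple only supports the literal `.0` / `.1`.
type TwoWayParentCache = [ParentCache; 2];

fn cache_for(caches: &TwoWayParentCache, reversed: bool) -> &ParentCache {
    &caches[reversed as usize]
}

fn main() {
    let caches: TwoWayParentCache = [HashMap::new(), HashMap::new()];
    assert!(cache_for(&caches, true).is_empty());
}
```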

// TODO: Evaluate decoupling the two caches in different `RwLock` to reduce
// contention. At the moment they are joined under the same lock for simplicity
// since `transform_and_replicate_layers` even in the parallel case seems to
// generate the parents (`vde::encode`) of the different layers sequentially.
Collaborator

I think ZigZag almost gives a guarantee of serial access to subsequent layers when encoding. You can't begin to encode the next layer (in the opposite direction) until having finished with the current one. This could be fudged a little, and since the graph is reused for a period of time, one could analyze and perhaps come up with an ordering that violated this assumption, though.

However, multiple simultaneous replication processes certainly will want to have access. Multiple readers should be fine with RwLock though (right?), so I assume you're just talking about initially populating the cache. We probably don't need to hyper-optimize that.

Contributor Author

Multiple readers should be fine with RwLock though (right?)

Yes.

// This is not an LRU cache, it holds the first `cache_entries` of the total
// possible `base_graph.size()` (the assumption here is that we either request
// all entries sequentially when encoding or any random entry once when proving
// or verifying, but there's no locality to take advantage of so keep the logic
Collaborator

I think that assumption is accurate.

// it would allow to batch parents calculations with that single lock. Also,
// since there is a reciprocity between forward and reversed parents,
// we would only need to compute the parents in one direction and with
// that fill both caches.
Collaborator

Good observation. If you populate the forward and backward cache on the first pass, you can cut the Feistel calls in half and make full use of each.
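
A hedged sketch of that idea. The exact relation between forward and reversed indexes is assumed here to be a simple size - 1 - i inversion; treat `invert` as a placeholder for whatever mapping the graph actually uses.

```rust
use std::collections::HashMap;

type ParentCache = HashMap<usize, Vec<usize>>;

// Placeholder for the real forward<->reversed index relation; the actual
// mapping lives in the graph implementation and may differ from this.
fn invert(size: usize, i: usize) -> usize {
    size - 1 - i
}

// Compute the expanded parents for the forward direction once (one set of
// Feistel calls) and fill the reversed-direction entry from the result.
fn cache_both_directions(
    size: usize,
    node: usize,
    forward_parents: Vec<usize>,
    forward_cache: &mut ParentCache,
    reversed_cache: &mut ParentCache,
) {
    let reversed_parents: Vec<usize> =
        forward_parents.iter().map(|&p| invert(size, p)).collect();

    reversed_cache.insert(invert(size, node), reversed_parents);
    forward_cache.insert(node, forward_parents);
}

fn main() {
    let (mut fwd, mut rev) = (ParentCache::new(), ParentCache::new());
    cache_both_directions(8, 3, vec![0, 1], &mut fwd, &mut rev);
    assert_eq!(rev.get(&4), Some(&vec![7, 6]));
}
```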

// TODO: Arbitrarily chosen for tests.

// Cache of node's parents.
pub type ParentCache = HashMap<usize, Vec<usize>>;
Collaborator

You may want to consider using a BTreeMap. I believe it will be more compact and faster to iterate through sequentially (either direction).
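
For comparison, the swap is mostly mechanical since the cache only needs insert/get/contains_key; whether BTreeMap actually wins here would need benchmarking.

```rust
use std::collections::BTreeMap;

// Same interface as the HashMap-based ParentCache for the operations the
// cache actually uses, but keys are stored in order, which can help when
// entries are filled and read sequentially.
type ParentCache = BTreeMap<usize, Vec<usize>>;

fn main() {
    let mut cache: ParentCache = BTreeMap::new();
    cache.insert(2, vec![0, 1]);
    cache.insert(1, vec![0]);

    // Iteration is in key order, unlike HashMap.
    let keys: Vec<usize> = cache.keys().copied().collect();
    assert_eq!(keys, vec![1, 2]);
}
```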

Contributor Author

Depending on how we end up designing it, I'm considering pre-allocating all of it and leaving just a [u8], but that's something that should be discussed in the issue.

@schomatis (Contributor Author)

@porcuquine Thanks for the thorough review, I think most of what's discussed here should actually be moved to the original issue (see #455 (comment)) to finish up delineating the design of the cache. The purpose of this PR is just to have a basic implementation with a low memory footprint to help move the design discussion forward.

Besides a minor change I'll make to the cache tuple, what do you think needs to be changed here now (instead of postponing it for the design discussion in the issue) to land this PR? I think we should set MAX_CACHE_SIZE to a value that has little impact at the moment (even at the cost of a poor performance improvement, since this is not the final version of the cache). Is there something else?

@schomatis (Contributor Author)

I tried running this and got a message about cache size:

We can make that a debug! or remove it altogether if you prefer. I just wanted to give it some visibility, since this PR only achieves a considerable performance improvement if the cache is big enough for the number of sectors we replicate, so that's something to keep in mind while using it.

@schomatis (Contributor Author)

If I understand correctly, the idea is to hold off on merging this until a configuration API is in place (which I think is captured in #501) that would help adjust the knobs of this cache.

@porcuquine (Collaborator)

That is correct. Please coordinate with @sidke about ETA on that feature.

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch 4 times, most recently from d1a0196 to ce33e38 Compare March 6, 2019 15:34
@schomatis (Contributor Author)

@porcuquine rebased and unlimited, go wild 🏃‍♂️

@porcuquine (Collaborator)

Thank you.

@schomatis (Contributor Author)

@sidke delivered so it's my turn to push this forward (next week).

@schomatis (Contributor Author)

@porcuquine heads-up, this PR will be changing in the following days (so make sure to cherry-pick what you need before that).

@schomatis (Contributor Author)

Depends on (and will adapt to) #539.

@porcuquine (Collaborator)

@schomatis Now that #539 has merged, I think you are clear to lightly adapt and finally get this one merged. Thank you for your patience.

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch 4 times, most recently from 3e41d91 to 4175ba3 Compare March 13, 2019 16:40
@schomatis (Contributor Author)

@porcuquine Adapted to the new config, ready for review.

@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from 4175ba3 to 9c2b2b7 Compare March 13, 2019 16:41
@schomatis schomatis force-pushed the feat/zigzag/cache-optimizations branch from 9c2b2b7 to e12f5c4 Compare March 13, 2019 16:51
@schomatis (Contributor Author)

(Rebasing.)

// If we can't find `MAXIMIZE_CACHING`, assume the conservative
// option of no cache.
};

Collaborator

This looks good for now. We will probably move this logic into another layer later, but putting it at the point of use seems optimal for present purposes.

@schomatis schomatis merged commit bb5a96e into master Mar 13, 2019
@schomatis schomatis deleted the feat/zigzag/cache-optimizations branch March 13, 2019 17:07