
disk-backed merkle trees #481

Closed
porcuquine opened this issue Feb 15, 2019 · 13 comments
Labels: A-storage-proofs, enhancement (New feature or request), P1 (Medium priority), perf-memory

Comments

porcuquine (Collaborator) commented Feb 15, 2019

Original write-up

Description

We devote a lot of memory to holding onto merkle trees during replication. We should eliminate this requirement, which limits the sector size we can replicate. The trees should live on disk; they don't need to be in memory between when they are generated and when they are used to create the proof. The root of each tree will be needed to calculate comm_r_star, so it should probably be extracted and held before flushing to disk, though.

Whether the best solution is to construct the tree in mmapped data or simply to dump it to and load it from a file is, for now, a question of expedience.

Acceptance criteria

Ideally, we would see memory profiling evidence before and after.

Alternatively, empirical evidence would do: replication of a sector size that fails for lack of memory before the change and succeeds after it.

Risks + pitfalls

As we consider generalizing our approach to (partial and complete) merkle tree caching, and to sharing the mechanism between PoRep and PoSt, the design space becomes somewhat dense. Don't get caught up in trying to solve future problems, but do bear in mind the design trajectory, to the extent you're familiar with it.

Where to begin

Merkle tree generation: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/drgraph.rs#L26-L28

We have all the trees together here: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/layered_drgporep.rs#L427-L444

We use all the trees when proving here: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/layered_drgporep.rs#L237-L240

What we hold onto and pass around in aux should not require a full merkle tree's worth of memory, except when the relevant layer's tree is being used to generate the proof: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/layered_drgporep.rs#L259-L261
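
For illustration only (not existing code; TreeHandle, its load method, and the 32-byte root size are hypothetical), aux could hold something as small as a root plus a file location per layer, mapping the full tree back in only while that layer's proof is being generated:

use std::fs::OpenOptions;
use std::path::PathBuf;

use memmap::MmapMut;

/// Hypothetical per-layer entry for aux: a few dozen bytes instead of a
/// full merkle tree held in memory.
struct TreeHandle {
    /// Root extracted before the tree is flushed to disk, since it is
    /// still needed to compute comm_r_star.
    root: [u8; 32],
    /// Where the flattened tree lives on disk.
    path: PathBuf,
}

impl TreeHandle {
    /// Map the tree back into memory only for the duration of proving;
    /// dropping the returned mapping releases it again.
    fn load(&self) -> std::io::Result<MmapMut> {
        let file = OpenOptions::new().read(true).write(true).open(&self.path)?;
        unsafe { MmapMut::map_mut(&file) }
    }
}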


Work Log

Most of the current work is devoted to setting up the tools to automate the memory profiling of the zigzag example (the application of the mmap itself doesn't seem to present much of a problem).

Memory profiling

Building a full profiler script that allows me to use gperftools with the zigzag example. PR #487 adds the heap profiler (0004889), already supported in rust-gperftools. This is working, although I still haven't studied the profile output in depth.

Trying to automate the previous profiling mechanism into rust-proofs-infra to have a reliable working environment on which to run the benchmarks: https://github.com/filecoin-project/rust-proofs-infra/issues/17.

mmap

Looking at d5bbcd7 as an example. The file has been removed (even though other examples of it exist throughout the code). I'm not sure it's worth abstracting it again; I've been told @laser might be working on it, so for the moment I'll just try something simple, like what's already done in:

// Writes n random field elements to an anonymous temp file, then maps it.
fn file_backed_mmap_from_random_bytes(n: usize) -> MmapMut {
    let rng = &mut XorShiftRng::from_seed([0x3dbe6259, 0x8d313d76, 0x3237db17, 0xe5bc0654]);
    let mut tmpfile: File = tempfile::tempfile().unwrap();

    for _ in 0..n {
        tmpfile
            .write_all(&fr_into_bytes::<Bls12>(&rng.gen()))
            .unwrap();
    }

    unsafe { MmapOptions::new().map_mut(&tmpfile).unwrap() }
}

So what I'm implementing at the moment looks like ac21c56.

TODO

schomatis (Contributor) commented Feb 23, 2019

WIP: Making two docker images,

porcuquine (Collaborator, Author) commented Feb 24, 2019

As discussed offline, here's some more detail about how I think this should probably work.

  • D1: Create a disk-backed merkle tree struct/type which allows long-term persistence -- just a wrapper around an ordinary tree like we already use, for creation and storage of the trees. Keep using the existing tree type as we do now. (A rough sketch follows this list.)
  • D2: This implies we should be able to instantiate such a tree from a file path. The path would be a field in the wrapper struct.
  • D3: Path should be optional, in which case a temp file should be used.
  • D4: Our merkle trees are represented as contiguous blocks of memory, with the original (unhashed) data as the first chunk, followed by the hashes of each subsequent row.
  • D5: Tree should be mmapped such that this entire structure is held by a single mapped buffer.
  • D6: When the raw data already exists as an mmapped buffer, we could grow this buffer to create the corresponding merkle tree. We already use our trees to double as representations of the raw data in this way.
  • D7: In zigzag replication, this might be complicated by the fact that we replicate in place. Since that is the case, it might make sense to perform a simple disk copy of the data then instantiate a disk-backed merkle tree from the new file.
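
To make D1/D2/D4/D5 a bit more concrete, here is a rough sketch; the type and method names are hypothetical, a 32-byte node size is assumed, and the root is taken to sit at the very end of the flattened file. This is not a final design, just an illustration of the shape of the wrapper:

use std::fs::OpenOptions;
use std::path::PathBuf;

use memmap::MmapMut;

/// Hypothetical wrapper around the existing tree type (D1): the on-disk
/// file holds the raw data chunk followed by every hash row (D4), and the
/// whole structure sits behind a single mapping (D5).
pub struct DiskBackedTree {
    /// On-disk location of the flattened tree (D2); with D3 this would
    /// become an Option with a temp-file fallback.
    path: PathBuf,
    /// One contiguous mapping over data + hash rows.
    store: MmapMut,
    /// Kept out of the mapping so comm_r_star can be computed cheaply.
    root: [u8; 32],
}

impl DiskBackedTree {
    /// Re-instantiate a tree from a previously written file (D2).
    /// Assumes 32-byte nodes with the root stored as the last node.
    pub fn from_path(path: PathBuf) -> std::io::Result<Self> {
        let file = OpenOptions::new().read(true).write(true).open(&path)?;
        let store = unsafe { MmapMut::map_mut(&file)? };
        let mut root = [0u8; 32];
        root.copy_from_slice(&store[store.len() - 32..]);
        Ok(DiskBackedTree { path, store, root })
    }
}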

schomatis (Contributor)

@porcuquine Some questions (Q) and observations (O; if you agree, you don't need to answer those) about the previous design (D) considerations:

  • Q1: Re D1, is something like what was already implemented in d5bbcd7 what you had in mind?

fn open_existing(path: &PathBuf) -> Result<MmapMut> {
    let file = OpenOptions::new().read(true).write(true).open(&path)?;
    let res = unsafe { MmapMut::map_mut(&file)? };
    Ok(res)
}

fn open_empty(path: &PathBuf, len: u64) -> Result<MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(&path)?;
    file.set_len(len)?;
    let res = unsafe { MmapMut::map_mut(&file)? };
    Ok(res)
}

pub struct MemmapWriter<'a, T: PoRep<'a>> {
    public_params: T::PublicParams,
    master: RefCell<MmapMut>,
    prover_id: &'a [u8],
}

(But instead of wrapping PublicParams we'd wrap over MerkleTree.)

  • Q2: Re D2, who manages the paths? Would those files be accessible across different runs? What's the rationale for backing them with files, beyond the mmapping we already do?

  • O3: I would delay this path implementation until after the anonymous one is working and I can already prove that mmap is serving its purpose.

  • O4: Rephrasing D4 to see if I understand it correctly, the data present in the leaves is the "first chunk", and the hashes of each subsequent level ("subsequent row") follow it.

  • O5: Re D6, it makes sense to combine the two, but at the moment the MerkleTree (a dependency) allocates its own (new) space every time it is built, so it doesn't seem we can tell it to use our already-allocated mmapped space.

  • Q6: Re D7, why do we want to make a copy at the disk level instead of just copying memory (maybe related to Q2)?

porcuquine (Collaborator, Author) commented Feb 25, 2019

@schomatis

  • Q1: I think so. The devil is in the details, but it would be something like that.

  • Q2: To be clear, the main reason for introducing paths is forward-looking, for use in future work regarding longer-term merkle-tree caching. I mention it here mainly so work done now will be consistent with that eventual extension. That having been said, the final tree (corresponding to the overall comm_r) would be worth leaving on disk for the caller to manage as he sees fit.

  • Q3: No problem at all.

  • Q4: Yes.

  • Q5: We can take this in stages, but we will probably develop the merkle tree implementation (currently forked) independently so we can extend it to suit our needs. So if getting this right in the end requires also extending that library, this is an option. The best way to stage this is probably to first make whatever changes are possible here. Then, if necessary, we can extend the library in preparation for a second round of changes here (which consume the external enhancements).

  • Q6: I suggested that only on the assumption that, when the data already exists on disk, copying at the file-system level will maximize efficiency and minimize our process's memory consumption. If those assumptions turn out not to hold, there is no other reason.

schomatis (Contributor)

@porcuquine A few follow-ups:

  • Q7: Trying to expand on Q2: it makes sense to have that in mind, but my concrete question at the moment is, if I introduce something like fn open_empty(path: &PathBuf, len: u64) (from the above example), where do I get the path from? At a minimum I need a directory in which to store these kinds of files (the names of the files themselves can be derived from the layer and other parameters). Is there a designated place like /tmp/filecoin-proof-cache? And if so, where would we define that path? (Basically, I just need something concrete to put in the code.)

  • O8: Re Q5, I agree we should first fix whatever we can here, but most of the memory usage seems to come from merkle tree generation. As you said, we copy the data slice (as data_copy) and convert it into an iterator; that copy isn't extended to hold the entire merkle tree (becoming the "first chunk" of D4) -- rather, we just discard it and allocate twice that size from scratch.
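
To illustrate the in-place construction O8 alludes to, a very rough sketch: instead of handing the data iterator to MerkleTree (which allocates a fresh buffer), the parent rows are written directly after the leaves inside one pre-sized buffer, which could itself be a file-backed MmapMut. hash_pair and the 32-byte node size are placeholders, not the project's actual hasher:

const NODE_SIZE: usize = 32;

// Placeholder combining function; the real code would use the tree's
// configured hash algorithm.
fn hash_pair(left: &[u8], right: &[u8]) -> [u8; NODE_SIZE] {
    let mut out = [0u8; NODE_SIZE];
    for i in 0..NODE_SIZE {
        out[i] = left[i] ^ right[i];
    }
    out
}

// `buf` holds `n_leaves` leaf nodes at the front and has room for every
// parent row behind them (another `n_leaves - 1` nodes for a binary tree),
// so no separate tree allocation is needed.
fn build_rows_in_place(buf: &mut [u8], n_leaves: usize) {
    let mut row_start = 0;                // byte offset of the row being read
    let mut row_len = n_leaves;           // nodes in that row
    let mut write = n_leaves * NODE_SIZE; // byte offset of the row being written
    while row_len > 1 {
        for i in 0..row_len / 2 {
            let l = row_start + 2 * i * NODE_SIZE;
            let parent = hash_pair(
                &buf[l..l + NODE_SIZE],
                &buf[l + NODE_SIZE..l + 2 * NODE_SIZE],
            );
            buf[write..write + NODE_SIZE].copy_from_slice(&parent);
            write += NODE_SIZE;
        }
        row_start += row_len * NODE_SIZE;
        row_len /= 2;
    }
}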

porcuquine (Collaborator, Author)

  • Q7: I was thinking you could take an Option and use a temp file if None is passed.
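
A minimal sketch of that Option/temp-file idea, reusing the open_empty shape from the earlier snippet (the name open_store is illustrative, not existing code):

use std::fs::OpenOptions;
use std::path::Path;

use memmap::{MmapMut, MmapOptions};

/// Back the tree with the given file if a path was supplied, otherwise
/// fall back to an anonymous temp file that disappears with the process.
fn open_store(path: Option<&Path>, len: u64) -> std::io::Result<MmapMut> {
    let file = match path {
        Some(p) => OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(p)?,
        None => tempfile::tempfile()?,
    };
    file.set_len(len)?;
    unsafe { MmapOptions::new().map_mut(&file) }
}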

Regarding Q8 or O8 above, I'm not sure if that is a question or observation. If question, I don't follow. If observation, carry on.

schomatis (Contributor)

Not sure if I completely follow the path solution but I have enough to move forward and we can revisit it over a concrete implementation.

schomatis (Contributor)

The solution for backing merkle trees with mmap is more elaborate than I originally thought and will be continued in filecoin-project/merkletree#8; this issue is blocked until that is done (there are other avenues to pursue here, but that one seems like the highest priority at the moment).

schomatis added the P1 (Medium priority) label on Mar 11, 2019
schomatis (Contributor) commented Mar 11, 2019

(Of the P1s I'm assigned, this is the most important at the moment.)

dignifiedquire (Contributor)

@schomatis @porcuquine I believe most of this work is done; can we close this issue, then?

schomatis (Contributor)

Not sure what the full scope of this issue is; I'll defer to @porcuquine on this one.

porcuquine (Collaborator, Author)

I'm not completely up-to-date on what has been shipped, but I believe the spirit of this issue has been fulfilled. My only question is around how 'finished' the work is. If we can now replicate with less memory than before by flipping a configuration switch, then I think this work is done. If not, then there are some hanging threads which should be accounted for. In that case, I'm still fine with closing this, as long as we have follow-up issues covering whatever remains. If the needs are unclear, let's discuss -- mostly just need to make sure the work is fully integrated and usable.

schomatis (Contributor)

If we can now replicate with less memory than before by flipping a configuration switch, then I think this work is done.

I think we can wrap this up once #655 is finished (I'll mark that issue as a closer for this one); at that point we'll be able to replicate with a max RSS of 3 sector sizes, down from the original 5 or even more.
