
disk-backed merkle trees #481

Closed
porcuquine opened this issue Feb 15, 2019 · 13 comments
Labels: A-storage-proofs, enhancement (New feature or request), P1 (Medium priority), perf-memory

Comments

porcuquine (Collaborator) commented Feb 15, 2019

Original write-up

Description

We devote a lot of memory to holding onto merkle trees during replication. We should eliminate this requirement, which limits the sector size we can replicate. The trees should live on disk; they don't need to be in memory between when they are generated and when they are used to create the proof. The root of each tree will be needed to calculate comm_r_star, so it should probably be extracted and held before flushing to disk, though.

Whether the best solution is to construct the tree in mmapped data or simply to dump it to and load it from a file is, for now, a question of expedience.

Acceptance criteria

Ideally, we would see memory profiling evidence before and after.

Alternatively, empirical evidence would do: replication of a sector size that fails for lack of memory before the change and succeeds after it.

Risks + pitfalls

As we consider generalizing our approach to (partial and complete) merkle tree caching, and to sharing the mechanism between PoRep and PoSt, the design space becomes somewhat dense. Don't get caught up in trying to solve future problems, but do bear in mind the design trajectory, to the extent you're familiar with it.

Where to begin

Merkle tree generation: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/drgraph.rs#L26-L28

We have all the trees together here: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/layered_drgporep.rs#L427-L444

We use all the trees when proving here: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/layered_drgporep.rs#L237-L240

What we hold onto and pass around in aux should not require a full merkle tree's worth of memory, except when the relevant layer's tree is being used to generate the proof: https://github.com/filecoin-project/rust-proofs/blob/master/storage-proofs/src/layered_drgporep.rs#L259-L261
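
For illustration only (not existing code; TreeHandle, its load method, and the 32-byte root size are hypothetical), aux could hold something as small as a root plus a file location per layer, mapping the full tree back in only while that layer's proof is being generated:

use std::fs::OpenOptions;
use std::path::PathBuf;

use memmap::MmapMut;

/// Hypothetical per-layer entry for aux: a few dozen bytes instead of a
/// full merkle tree held in memory.
struct TreeHandle {
    /// Root extracted before the tree is flushed to disk, since it is
    /// still needed to compute comm_r_star.
    root: [u8; 32],
    /// Where the flattened tree lives on disk.
    path: PathBuf,
}

impl TreeHandle {
    /// Map the tree back into memory only for the duration of proving;
    /// dropping the returned mapping releases it again.
    fn load(&self) -> std::io::Result<MmapMut> {
        let file = OpenOptions::new().read(true).write(true).open(&self.path)?;
        unsafe { MmapMut::map_mut(&file) }
    }
}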


Work Log

Most of the current work is devoted to setting up the tools to automate the memory profiling of the zigzag example (the application of the mmap itself doesn't seem to present much of a problem).

Memory profiling

Building a full profiler script that allows me to use gperftools with the zigzag example. PR #487 adds the heap profiler (0004889), already supported in rust-gperftools. This is working, although I still haven't studied the profile output in depth.

Trying to automate the previous profiling mechanism into rust-proofs-infra to have a reliable working environment on which to run the benchmarks: https://github.com/filecoin-project/rust-proofs-infra/issues/17.

mmap

Looking at d5bbcd7 as an example. The file has been removed (even though other examples of it exist throughout the code). I'm not sure it's worth abstracting it again; I've been told @laser might be working on it, so for the moment I'll just try something simple, like what's already done in:

// Writes n random field elements to an anonymous temp file, then maps it.
fn file_backed_mmap_from_random_bytes(n: usize) -> MmapMut {
    let rng = &mut XorShiftRng::from_seed([0x3dbe6259, 0x8d313d76, 0x3237db17, 0xe5bc0654]);
    let mut tmpfile: File = tempfile::tempfile().unwrap();

    for _ in 0..n {
        tmpfile
            .write_all(&fr_into_bytes::<Bls12>(&rng.gen()))
            .unwrap();
    }

    unsafe { MmapOptions::new().map_mut(&tmpfile).unwrap() }
}

So what I'm implementing at the moment looks like ac21c56.

TODO

schomatis (Contributor) commented Feb 23, 2019

WIP: Making two docker images,

porcuquine (Collaborator, Author) commented Feb 24, 2019

As discussed offline, here's some more detail about how I think this should probably work.

  • D1: Create a disk-backed merkle tree struct/type which allows long-term persistence -- just a wrapper around an ordinary tree like we already use, for creation and storage of the trees. Keep using the existing tree type as we do now. (A rough sketch follows this list.)
  • D2: This implies we should be able to instantiate such a tree from a file path. The path would be a field in the wrapper struct.
  • D3: Path should be optional, in which case a temp file should be used.
  • D4: Our merkle trees are represented as contiguous blocks of memory, with the original (unhashed) data as the first chunk, followed by the hashes of each subsequent row.
  • D5: Tree should be mmapped such that this entire structure is held by a single mapped buffer.
  • D6: When the raw data already exists as an mmapped buffer, we could grow this buffer to create the corresponding merkle tree. We already use our trees to double as representations of the raw data in this way.
  • D7: In zigzag replication, this might be complicated by the fact that we replicate in place. Since that is the case, it might make sense to perform a simple disk copy of the data then instantiate a disk-backed merkle tree from the new file.
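
To make D1/D2/D4/D5 a bit more concrete, here is a rough sketch; the type and method names are hypothetical, a 32-byte node size is assumed, and the root is taken to sit at the very end of the flattened file. This is not a final design, just an illustration of the shape of the wrapper:

use std::fs::OpenOptions;
use std::path::PathBuf;

use memmap::MmapMut;

/// Hypothetical wrapper around the existing tree type (D1): the on-disk
/// file holds the raw data chunk followed by every hash row (D4), and the
/// whole structure sits behind a single mapping (D5).
pub struct DiskBackedTree {
    /// On-disk location of the flattened tree (D2); with D3 this would
    /// become an Option with a temp-file fallback.
    path: PathBuf,
    /// One contiguous mapping over data + hash rows.
    store: MmapMut,
    /// Kept out of the mapping so comm_r_star can be computed cheaply.
    root: [u8; 32],
}

impl DiskBackedTree {
    /// Re-instantiate a tree from a previously written file (D2).
    /// Assumes 32-byte nodes with the root stored as the last node.
    pub fn from_path(path: PathBuf) -> std::io::Result<Self> {
        let file = OpenOptions::new().read(true).write(true).open(&path)?;
        let store = unsafe { MmapMut::map_mut(&file)? };
        let mut root = [0u8; 32];
        root.copy_from_slice(&store[store.len() - 32..]);
        Ok(DiskBackedTree { path, store, root })
    }
}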

schomatis (Contributor)

@porcuquine Some questions (Q) and observations (O; if you agree, you don't need to answer those) about the previous design (D) considerations:

  • Q1: Re D1, is something like what was already implemented in d5bbcd7 what you had in mind?

fn open_existing(path: &PathBuf) -> Result<MmapMut> {
    let file = OpenOptions::new().read(true).write(true).open(&path)?;
    let res = unsafe { MmapMut::map_mut(&file)? };
    Ok(res)
}

fn open_empty(path: &PathBuf, len: u64) -> Result<MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(&path)?;
    file.set_len(len)?;
    let res = unsafe { MmapMut::map_mut(&file)? };
    Ok(res)
}

pub struct MemmapWriter<'a, T: PoRep<'a>> {
    public_params: T::PublicParams,
    master: RefCell<MmapMut>,
    prover_id: &'a [u8],
}

(But instead of wrapping PublicParams we'd wrap over MerkleTree.)

  • Q2: Re D2, who manages the paths? Would those files be accessible across different runs? What's the rationale for backing them with files, beyond the mmapping we already do?

  • O3: I would delay this path implementation until after the anonymous one is working and I can already prove that mmap is serving its purpose.

  • O4: Rephrasing D4 to see if I understand it correctly, the data present in the leaves is the "first chunk", and the hashes of each subsequent level ("subsequent row") follow it.

  • O5: Re D6, it makes sense to combine the two, but at the moment the MerkleTree (a dependency) allocates its own (new) space every time it is built, so it doesn't seem we can tell it to use our already-allocated mmapped space.

  • Q6: Re D7, why do we want to make a copy at the disk level instead of just copying memory (maybe related to Q2)?

porcuquine (Collaborator, Author) commented Feb 25, 2019

@schomatis

  • Q1: I think so. The devil is in the details, but it would be something like that.

  • Q2: To be clear, the main reason for introducing paths is forward-looking, for use in future work regarding longer-term merkle-tree caching. I mention it here mainly so work done now will be consistent with that eventual extension. That having been said, the final tree (corresponding to the overall comm_r) would be worth leaving on disk for the caller to manage as he sees fit.

  • Q3: No problem at all.

  • Q4: Yes.

  • Q5: We can take this in stages, but we will probably develop the merkle tree implementation (currently forked) independently so we can extend it to suit our needs. So if getting this right in the end requires also extending that library, this is an option. The best way to stage this is probably to first make whatever changes are possible here. Then, if necessary, we can extend the library in preparation for a second round of changes here (which consume the external enhancements).

  • Q6: I suggested that only on the assumption that, when the data already exists on disk, copying at the file-system level will maximize efficiency and minimize our process's memory consumption. If those assumptions turn out not to hold, there is no other reason.

schomatis (Contributor)

@porcuquine A few follow-ups:

  • Q7: Trying to expand on Q2: it makes sense to have that in mind, but my concrete question at the moment is, if I introduce something like fn open_empty(path: &PathBuf, len: u64) (from the above example), where do I get the path from? At a minimum I need a directory in which to store these kinds of files (the names of the files themselves can be derived from the layer and other parameters). Is there a designated place like /tmp/filecoin-proof-cache? And if so, where would we define that path? (Basically, I just need something concrete to put in the code.)

  • O8: Re Q5, I agree we should first fix whatever we can here, but most of the memory usage seems to come from merkle tree generation. As you said, we copy the data slice (as data_copy) and convert it into an iterator; that copy isn't extended to hold the entire merkle tree (becoming the "first chunk" of D4) -- rather, we just discard it and allocate twice that size from scratch.
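
To illustrate the in-place construction O8 alludes to, a very rough sketch: instead of handing the data iterator to MerkleTree (which allocates a fresh buffer), the parent rows are written directly after the leaves inside one pre-sized buffer, which could itself be a file-backed MmapMut. hash_pair and the 32-byte node size are placeholders, not the project's actual hasher:

const NODE_SIZE: usize = 32;

// Placeholder combining function; the real code would use the tree's
// configured hash algorithm.
fn hash_pair(left: &[u8], right: &[u8]) -> [u8; NODE_SIZE] {
    let mut out = [0u8; NODE_SIZE];
    for i in 0..NODE_SIZE {
        out[i] = left[i] ^ right[i];
    }
    out
}

// `buf` holds `n_leaves` leaf nodes at the front and has room for every
// parent row behind them (another `n_leaves - 1` nodes for a binary tree),
// so no separate tree allocation is needed.
fn build_rows_in_place(buf: &mut [u8], n_leaves: usize) {
    let mut row_start = 0;                // byte offset of the row being read
    let mut row_len = n_leaves;           // nodes in that row
    let mut write = n_leaves * NODE_SIZE; // byte offset of the row being written
    while row_len > 1 {
        for i in 0..row_len / 2 {
            let l = row_start + 2 * i * NODE_SIZE;
            let parent = hash_pair(
                &buf[l..l + NODE_SIZE],
                &buf[l + NODE_SIZE..l + 2 * NODE_SIZE],
            );
            buf[write..write + NODE_SIZE].copy_from_slice(&parent);
            write += NODE_SIZE;
        }
        row_start += row_len * NODE_SIZE;
        row_len /= 2;
    }
}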

porcuquine (Collaborator, Author)

  • Q7: I was thinking you could take an Option and use a temp file if None is passed.
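
A minimal sketch of that Option/temp-file idea, reusing the open_empty shape from the earlier snippet (the name open_store is illustrative, not existing code):

use std::fs::OpenOptions;
use std::path::Path;

use memmap::{MmapMut, MmapOptions};

/// Back the tree with the given file if a path was supplied, otherwise
/// fall back to an anonymous temp file that disappears with the process.
fn open_store(path: Option<&Path>, len: u64) -> std::io::Result<MmapMut> {
    let file = match path {
        Some(p) => OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(p)?,
        None => tempfile::tempfile()?,
    };
    file.set_len(len)?;
    unsafe { MmapOptions::new().map_mut(&file) }
}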

Regarding Q8 or O8 above, I'm not sure if that is a question or observation. If question, I don't follow. If observation, carry on.

schomatis (Contributor)

Not sure if I completely follow the path solution but I have enough to move forward and we can revisit it over a concrete implementation.

schomatis (Contributor)

The solution for backing merkle trees with mmap is more elaborate than I originally thought and will be continued in filecoin-project/merkletree#8; this issue is blocked until that is done (there are other avenues to pursue here, but that one seems like the highest priority at the moment).

schomatis added the P1 (Medium priority) label on Mar 11, 2019
schomatis (Contributor) commented Mar 11, 2019

(Of the P1s I'm assigned, this is the most important at the moment.)

dignifiedquire (Contributor)

@schomatis @porcuquine I believe most of this work is done; can we close this issue, then?

schomatis (Contributor)

Not sure what the full scope of this issue is; I'll defer to @porcuquine on this one.

porcuquine (Collaborator, Author)

I'm not completely up-to-date on what has been shipped, but I believe the spirit of this issue has been fulfilled. My only question is around how 'finished' the work is. If we can now replicate with less memory than before by flipping a configuration switch, then I think this work is done. If not, then there are some hanging threads which should be accounted for. In that case, I'm still fine with closing this, as long as we have follow-up issues covering whatever remains. If the needs are unclear, let's discuss -- mostly just need to make sure the work is fully integrated and usable.

schomatis (Contributor)

If we can now replicate with less memory than before by flipping a configuration switch, then I think this work is done.

I think we can wrap this up once #655 is finished (I'll mark that issue as a closer for this one); at that point we'll be able to replicate with a max RSS of 3 sector sizes, down from the original 5 or even more.
