Running window post locks with many 512MB sectors #5446

Closed
travisperson opened this issue Jan 27, 2021 · 4 comments

travisperson commented Jan 27, 2021

I set up a network with 3 miners, each with 1024 sectors (512MB sector size). Because 512MB sectors are limited to 2 sectors per partition, this results in 10-11 partitions per window.

Right now the miners lock up every once in a while when they try to run a window post and require a restart to get the chain to progress forward again.

The current impact of this issue is that we can't set up networks with 512MB miners.

More logs (includes goroutines): https://gist.github.com/travisperson/1712a7e5a2caa3472b8724ead455fc0c

{"level":"warn","ts":"2021-01-26T20:13:46.610Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.612Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.613Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.614Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.616Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.617Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.618Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.619Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.621Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.622Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"warn","ts":"2021-01-26T20:13:46.623Z","logger":"storageminer","caller":"storage/wdpost_run.go:225","msg":"Checked sectors","checked":2,"good":2}
{"level":"info","ts":"2021-01-26T20:13:46.624Z","logger":"storageminer","caller":"storage/wdpost_run.go:578","msg":"running window post","chain-random":"bRsEWj1FKx1INOi+W92tcOiU77IWyJeoNR+pSxZz+nI=","deadline":{"CurrentEpoch":2458,"PeriodStart":658,"Index":30,"Open":2458,"Close":2518,"Challenge":2438,"FaultCutoff":2388,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70},"height":"2458","skipped":0}
Jan 26 20:13:46 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:46.928+0000","logger":"storage_proofs_core::compound_proof","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-core-5.4.0/src/compound_proof.rs:86","msg":"vanilla_proofs:finish"}
Jan 26 20:13:46 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:46.971+0000","logger":"storage_proofs_core::compound_proof","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/storage-proofs-core-5.4.0/src/compound_proof.rs:92","msg":"snark_proof:start"}
Jan 26 20:13:46 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:46.972+0000","logger":"bellperson::groth16::prover","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.1/src/groth16/prover.rs:274","msg":"Bellperson 0.12.1 is being used!"}
Jan 26 20:13:47 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"info","ts":"2021-01-26T20:13:47.663+0000","logger":"bellperson::groth16::prover","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.1/src/groth16/prover.rs:309","msg":"starting proof timer"}
Jan 26 20:13:47 preminer-0.interop.fildev.network lotus-miner[8878]: {"level":"debug","ts":"2021-01-26T20:13:47.663+0000","logger":"bellperson::gpu::locks","caller":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.12.1/src/gpu/locks.rs:40","msg":"Acquiring priority lock..."}
ubuntu@preminer-0:~$ ls -alh /tmp
-rw-r--r--  1 fc   fc      0 Jan 25 22:27 bellman.gpu.lock
-rw-r--r--  1 fc   fc      0 Jan 26 20:13 bellman.priority.lock

f8-ptrk commented Jan 28, 2021

I have the same issue on calibration net.

I described my symptoms here:

https://filecoinproject.slack.com/archives/C01D42NNLMS/p1611706395036000

I was pretty sure this only occurs while actually sealing, but I might be wrong.

@travisperson

Some more information:

I set up a minimal network in k8s and ran into this issue immediately after starting it.

The problem is pretty consistent and occurs during winning posts in Docker. However, it doesn't appear to happen during lotus-bench sealing.

When starting a network I see two entries for "Acquiring priority lock..." with no "Priority lock released!" between them, e.g.:

2021-02-03T20:53:16.642 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish
2021-02-03T20:53:16.650 DEBUG merkletree::merkle > generated partial_tree of row_count 3 and len 73 with 8 branches for proof at 39
2021-02-03T20:53:16.650 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish
2021-02-03T20:53:16.656 INFO storage_proofs_core::compound_proof > snark_proof:start
2021-02-03T20:53:16.657 INFO bellperson::groth16::prover > Bellperson 0.12.3 is being used!
2021-02-03T20:53:16.663 INFO storage_proofs_core::compound_proof > snark_proof:start
2021-02-03T20:53:16.663 INFO bellperson::groth16::prover > Bellperson 0.12.3 is being used!
2021-02-03T20:53:16.945 INFO bellperson::groth16::prover > starting proof timer
2021-02-03T20:53:16.945 DEBUG bellperson::gpu::locks > Acquiring priority lock...
2021-02-03T20:53:16.945 DEBUG bellperson::gpu::locks > Priority lock acquired!
2021-02-03T20:53:16.966 INFO bellperson::groth16::prover > starting proof timer
2021-02-03T20:53:16.966 DEBUG bellperson::gpu::locks > Acquiring priority lock...

Full logs: https://gist.github.com/travisperson/e4cee85d94e47fcf537ce38b75517122
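
To spot this pattern in a longer log, a small scanner can count lock requests that were never granted. The sketch below is illustrative only: the function name is hypothetical, and it assumes the three message strings quoted in this thread ("Acquiring priority lock...", "Priority lock acquired!", "Priority lock released!") are the only lock events of interest.

```python
import re

# Lock messages as quoted in this thread; anything else is ignored.
LOCK_EVENT = re.compile(
    r"(Acquiring priority lock\.\.\.|Priority lock acquired!|Priority lock released!)"
)

def pending_priority_lock_requests(lines):
    """Return (waiting, holders) after replaying the log.

    waiting: requests logged as "Acquiring" that were never "acquired".
    holders: locks logged as "acquired" that were never "released".
    """
    waiting = 0
    holders = 0
    for line in lines:
        match = LOCK_EVENT.search(line)
        if match is None:
            continue
        event = match.group(1)
        if event.startswith("Acquiring"):
            waiting += 1
        elif event.endswith("acquired!"):
            # Clamp at zero in case the log starts mid-stream.
            waiting = max(0, waiting - 1)
            holders += 1
        else:
            holders = max(0, holders - 1)
    return waiting, holders

# Lines copied from the excerpt above.
sample = """\
2021-02-03T20:53:16.945 INFO bellperson::groth16::prover > starting proof timer
2021-02-03T20:53:16.945 DEBUG bellperson::gpu::locks > Acquiring priority lock...
2021-02-03T20:53:16.945 DEBUG bellperson::gpu::locks > Priority lock acquired!
2021-02-03T20:53:16.966 INFO bellperson::groth16::prover > starting proof timer
2021-02-03T20:53:16.966 DEBUG bellperson::gpu::locks > Acquiring priority lock...
"""

print(pending_priority_lock_requests(sample.splitlines()))  # (1, 1)
```

A result of (1, 1) means one proof holds the priority lock while a second is still waiting for it, which matches the hang described here.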

The code that appears to hang is the lock acquisition in bellperson:
https://github.com/filecoin-project/bellperson/blob/d2e88544efb9876bc96c3d7e26a527943c0fdb68/src/gpu/locks.rs#L40-L43
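
As a rough illustration of how an exclusive file lock can stall a second acquirer inside the same process, here is a minimal sketch using Python's fcntl.flock on Linux. This is not bellperson's code, and it assumes the lock file is guarded by an flock-style advisory lock, which this thread does not confirm. The helper name and the lock path are hypothetical.

```python
import fcntl
import os
import tempfile

def try_exclusive_lock(path):
    """Open path and try to take an exclusive flock without blocking.

    Returns the open file on success, or None if the lock is already held.
    """
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f

# Hypothetical lock file in a scratch directory, not the real /tmp lock.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.priority.lock")

first = try_exclusive_lock(lock_path)   # first proof takes the lock
second = try_exclusive_lock(lock_path)  # second open() in the same process
print(first is not None, second is None)  # True True

first.close()                           # closing the file releases the lock
third = try_exclusive_lock(lock_path)   # now the lock can be taken again
print(third is not None)                # True
```

Without LOCK_NB the second call would block until the first holder releases the lock, so if the holder itself is waiting on the second caller, neither makes progress.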


f8-ptrk commented Feb 3, 2021

I see this issue on machines without a GPU, but without the "no GPU" flag set. We might actually have two different issues here with the same symptoms.

@travisperson

This issue has been resolved.

The locking issue is addressed in bellperson; see filecoin-project/rust-fil-proofs#1380 for more details.

Additionally, there is a maximum of 5 partitions per deadline. With 512MB sectors this is pretty easy to hit because they are limited to 2 sectors per partition.
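
To make the arithmetic concrete, here is a small sketch using only figures from this thread: 2 sectors per partition, the limit of 5 partitions per deadline, and the 48 deadlines per proving period (WPoStPeriodDeadlines) shown in the first log excerpt. It assumes sectors are spread evenly across deadlines, and the function names are hypothetical.

```python
import math

SECTORS_PER_PARTITION = 2        # 512MB sectors, per this thread
DEADLINES_PER_PERIOD = 48        # WPoStPeriodDeadlines from the log above
MAX_PARTITIONS_PER_DEADLINE = 5  # limit stated in this comment

def mean_partitions_per_deadline(sectors):
    """Average partitions per deadline, assuming an even spread (an assumption)."""
    partitions = math.ceil(sectors / SECTORS_PER_PARTITION)
    return partitions / DEADLINES_PER_PERIOD

def max_sectors_within_limit():
    """Largest sector count that keeps every deadline at or under the limit."""
    return MAX_PARTITIONS_PER_DEADLINE * SECTORS_PER_PARTITION * DEADLINES_PER_PERIOD

print(round(mean_partitions_per_deadline(1024), 2))  # 10.67
print(max_sectors_within_limit())                    # 480
```

Under these assumptions a miner with 1024 sectors averages about 10.67 partitions per deadline, roughly double the limit, while anything up to 480 sectors stays within it.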
