GPU preemption failure #291

Elhorses · 2022-11-25T07:04:28Z

When the wdpost and winningpost calculations run at the same time, winningpost fails to preempt the GPU, even though the priority of winningpost is true and that of wdpost is false. The winningpost computation then times out.

vmx · 2022-11-25T11:30:32Z

Could you please provide some logs with log level debug, or even better trace (by setting RUST_LOG=trace)?

Do you have some way to reproduce the problem?

Elhorses · 2022-11-27T03:21:49Z

> Could you please provide some logs with log level debug, or even better trace (by setting RUST_LOG=trace)?
>
> Do you have some way to reproduce the problem?

2022-11-24T23:07:17.607 INFO storage_proofs_core::compound_proof > snark_proof:start
2022-11-24T23:07:17.750 INFO bellperson::groth16::prover > Bellperson 0.22.0 is being used!
2022-11-24T23:07:40.287Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 12, "forRound": 2367496, "baseEpoch": 2367495, "baseDeltaSeconds": 10, "nullRounds": 0, "lateStart": false, "beaconEpoch": 2463341, "lookbackEpochs": 900, "networkPowerAtLookback": "21798573920164216832", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:07:55.019 INFO bellperson::groth16::prover > synthesis time: 37.268746346s
2022-11-24T23:07:55.019 INFO bellperson::groth16::prover > starting proof timer
2022-11-24T23:07:59.294 INFO bellperson::gpu::locks > GPU is available for FFT!
2022-11-24T23:07:59.295 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:07:59.317 INFO ec_gpu_gen::fft > FFT: 1 working device(s) selected.
2022-11-24T23:07:59.318 INFO ec_gpu_gen::fft > FFT: Device 0: GeForce RTX 3090
2022-11-24T23:07:59.318 INFO bellperson::gpu::locks > GPU FFT kernel instantiated!
2022-11-24T23:08:10.074Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 51, "forRound": 2367497, "baseEpoch": 2367495, "baseDeltaSeconds": 40, "nullRounds": 1, "lateStart": false, "beaconEpoch": 2463342, "lookbackEpochs": 900, "networkPowerAtLookback": "21798574221660848128", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:08:17.474Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:08:27.245 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-11-24T23:08:27.245 INFO bellperson::gpu::multiexp > Multiexp: CPU utilization: 0.
2022-11-24T23:08:27.246 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:08:27.248 INFO ec_gpu_gen::multiexp > Multiexp: 1 working device(s) selected.
2022-11-24T23:08:27.248 INFO ec_gpu_gen::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2022-11-24T23:08:27.248 INFO bellperson::gpu::locks > GPU Multiexp kernel instantiated!
2022-11-24T23:08:40.246Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 17, "forRound": 2367498, "baseEpoch": 2367497, "baseDeltaSeconds": 10, "nullRounds": 0, "lateStart": false, "beaconEpoch": 2463343, "lookbackEpochs": 900, "networkPowerAtLookback": "21798572294495338496", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:09:10.044Z INFO miner miner/miner.go:590 round winner, will mine new block, for {"height": "2367499"}
2022-11-24T23:09:10.045Z INFO storageminer storage/winning_prover.go:70 Computing WinningPoSt ;[{SealProof:9 SectorNumber:152313 SectorKey: SealedCID:bagboea4b5abcadqi47tmsbyg24t463o4u2nkb5dbhzn24ndpv7mb7ywjosqc4e27}]; [114 34 174 18 207 210 176 171 229 24 20 68 99 184 137 67 14 147 227 93 13 156 156 207 168 200 156 1 88 161 142 198]
2022-11-24T23:09:10.045Z INFO advmgr sealer/manager_post.go:23 GenerateWinningPoSt run at lotus-miner
2022-11-24T23:09:10.054 INFO filecoin_proofs::api::winning_post > generate_winning_post:start
2022-11-24T23:09:10.065 INFO filecoin_proofs::caches > trying parameters memory cache for: WINNING_POST[68719476736]
2022-11-24T23:09:10.065 INFO filecoin_proofs::caches > found params in memory cache for WINNING_POST[68719476736]
2022-11-24T23:09:10.191 INFO storage_proofs_core::compound_proof > vanilla_proofs:start
2022-11-24T23:09:10.488 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish
2022-11-24T23:09:10.493 INFO storage_proofs_core::compound_proof > snark_proof:start
2022-11-24T23:09:10.494 INFO bellperson::groth16::prover > Bellperson 0.22.0 is being used!
2022-11-24T23:09:10.594 INFO bellperson::groth16::prover > synthesis time: 100.372952ms
2022-11-24T23:09:10.594 INFO bellperson::groth16::prover > starting proof timer
2022-11-24T23:09:10.610 INFO bellperson::gpu::locks > GPU is available for FFT!
2022-11-24T23:09:10.610 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:09:11.044 INFO ec_gpu_gen::fft > FFT: 1 working device(s) selected.
2022-11-24T23:09:11.044 INFO ec_gpu_gen::fft > FFT: Device 0: GeForce RTX 3090
2022-11-24T23:09:11.044 INFO bellperson::gpu::locks > GPU FFT kernel instantiated!
2022-11-24T23:09:17.476Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:10:17.477Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:11:00.420 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-11-24T23:11:00.420 INFO bellperson::gpu::multiexp > Multiexp: CPU utilization: 0.
2022-11-24T23:11:00.420 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:11:00.421 INFO bellperson::groth16::prover > prover time: 185.401883711s
2022-11-24T23:11:01.763 INFO storage_proofs_core::compound_proof > snark_proof:finish
2022-11-24T23:11:01.763 INFO filecoin_proofs::api::window_post > generate_window_post:finish
2022-11-24T23:11:01.764 INFO ec_gpu_gen::multiexp > Multiexp: 1 working device(s) selected.
2022-11-24T23:11:01.764 INFO ec_gpu_gen::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2022-11-24T23:11:01.764 INFO bellperson::gpu::locks > GPU Multiexp kernel instantiated!
2022-11-24T23:11:02.383 INFO bellperson::groth16::prover > prover time: 111.788384309s
2022-11-24T23:11:02.387 INFO storage_proofs_core::compound_proof > snark_proof:finish
2022-11-24T23:11:02.387 INFO filecoin_proofs::api::winning_post > generate_winning_post:finish
2022-11-24T23:11:02.388Z INFO storageminer storage/winning_prover.go:77 GenerateWinningPoSt took 1m52.342815192s
2022-11-24T23:11:04.035Z INFO wdpost wdpost/wdpost_run.go:732 computing window post {"batch": 0, "elapsed": 301.29480264, "skip": 0, "err": null}
2022-11-24T23:11:04.047 INFO filecoin_proofs::api::window_post > verify_window_post:start
2022-11-24T23:11:04.047 INFO filecoin_proofs::caches > trying parameters memory cache for: WINDOW_POST[68719476736]-verifying-key
2022-11-24T23:11:04.047 INFO filecoin_proofs::caches > found params in memory cache for WINDOW_POST[68719476736]-verifying-key
2022-11-24T23:11:04.081 INFO filecoin_proofs::api::window_post > verify_window_post:finish
2022-11-24T23:11:06.994Z INFO miner miner/miner.go:645 mined new block {"cid": "bafy2bzaceaukxlerk4p4rjtq6x6yfdn764vjmtbkmr2a7wunn3mnaqov2if2a", "height": 2367499, "miner": "f0502198", "parents": ["f0230861","f01886704","f01702940","f01171513","f01852363","f01680940","f089180","f01926802"], "parentTipset": "{bafy2bzaceduwoeqccvx34e6qosnrzjteg4qelshpxnnmdzqcl33axfjyng3qy,bafy2bzaceabyjgwvovms7vdayfuhcqfhyv4ufbvgtnzj66dt7ksdyakzky7ms,bafy2bzacebnwfa24777iglbvuebudz3ockjqvu3adoqmwqd52dtcbgevmeaqo,bafy2bzacea4b5n6olk6khgla6ntji3oacebljbfywakph6jv2buev2ygbterm,bafy2bzacebl76ealt5wv2nzmxrzbb3saoynuvc3brfwuldltbdiyb2kwu42mi,bafy2bzacec3qgh53icxqezayboosxobtoznilaemn47qqqb5hre7corkqyns6,bafy2bzacedyn2wahnuyaaahtv5nmoub2orhwsfob7ovk5gxn6lqfkcynnmby4,bafy2bzaceasooknnkcphdfsuigvbeet3dzrzhmmcx4v6g335liahl3su73ywm}", "took": 116.967937239}
2022-11-24T23:11:06.994Z WARN miner miner/miner.go:647 CAUTION: block production took longer than the block delay. Your computer may not be fast enough to keep up {"tPowercheck ": 0.016822184, "tTicket ": 0.0015335, "tSeed ": 0.00000244, "tProof ": 112.343196209, "tPending ": 4.507330629, "tCreateBlock ": 0.099052277}

vmx · 2022-11-28T11:34:12Z

From the log messages it's hard to tell which lines come from which process/thread. It could well be that the WinningPoSt one got priority. Why are you sure it didn't?

Are you able to reproduce the issue? Are you compiling the Rust parts from source? I'm asking because, if you are, I might be able to provide you with a version that also logs the thread ID, so that we can distinguish them.
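To illustrate the idea of logging the thread ID, here is a minimal sketch using only the Rust standard library. The helper `tag_with_thread` and the job names are hypothetical and are not part of bellperson. It only shows how prefixing each message with `std::thread::current().id()` makes interleaved lines from two provers distinguishable.

```rust
use std::thread;

/// Hypothetical helper: prefix a log message with the ID of the calling thread.
fn tag_with_thread(msg: &str) -> String {
    format!("[{:?}] {}", thread::current().id(), msg)
}

fn main() {
    // Two threads emit the same message; the thread ID tells them apart.
    let handles: Vec<_> = vec!["winning_post", "window_post"]
        .into_iter()
        .map(|name| {
            thread::spawn(move || {
                tag_with_thread(&format!("{}: GPU is available for FFT!", name))
            })
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```

With such a prefix, the two `snark_proof:start` / `prover time` sequences in the log above could be attributed to their respective proofs.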

Elhorses · 2022-11-28T13:29:07Z

You can run `cargo test test_parallel_prover --features "cuda" -- --nocapture` with v0.21.0 and v0.22.0 and then compare the Rust DEBUG logs. With v0.21.0 we get "[2022-11-28T13:26:12Z WARN bellperson::gpu::locks] GPU acquired by a high priority process! Freeing up Multiexp kernels..." whenever a conflict happens, but v0.22.0 never logs this message. As for my lotus-miner: when the wdpost and winningpost calculations run at the same time, winningpost fails to preempt the GPU, even though the priority of winningpost is true and that of wdpost is false, and the winningpost computation then times out.
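The comparison between the two versions can be made mechanical by counting how often the preemption warning appears in the captured test output. The following is a small sketch under that assumption. The function `count_preemptions` is a hypothetical helper, and the message text is copied from the warning quoted above.

```rust
/// Warning text logged when a low priority kernel is aborted
/// (copied from the v0.21.0 log line quoted in this thread).
const PREEMPT_MSG: &str = "GPU acquired by a high priority process!";

/// Hypothetical helper: count the log lines that report a preemption.
fn count_preemptions(log: &str) -> usize {
    log.lines().filter(|line| line.contains(PREEMPT_MSG)).count()
}

fn main() {
    // Sample input built from log lines that appear in this thread.
    let sample = "\
[2022-11-28T13:26:12Z WARN bellperson::gpu::locks] GPU acquired by a high priority process! Freeing up Multiexp kernels...
2022-11-24T23:07:59.294 INFO bellperson::gpu::locks > GPU is available for FFT!
";
    println!("preemptions seen: {}", count_preemptions(sample));
}
```

A count of zero on a run where both provers competed for the GPU would match the behaviour reported for v0.22.0.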

vmx · 2022-11-30T11:09:08Z

Thanks @Elhorses for providing the command to run. I think I can reproduce it, I'm having a look.

Elhorses · 2022-11-30T11:33:03Z

> Thanks @Elhorses for providing the command to run. I think I can reproduce it, I'm having a look.

OK, thank you!
I have solved the problem, you can look at https://github.com/Elhorses/bellperson/tree/v0.22.0, commit: , and now my lotus-miner is working fine.

vmx · 2022-11-30T11:34:30Z

Thanks, that'll save me a lot of time!

Due to refactorings, the `PriorityLock::should_break()` logic was quite confusing and used the wrong way. Make it work correctly while simplifying the logic. This commit also removes `PriorityLock` from the public API as it isn't really needed. Tweak some values in the parallel prover test, so that aborting a low priority kernel from running on the GPU happens more frequently. Fixes #291.
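As a rough illustration of what a `should_break()` style check is meant to achieve, here is a simplified, single-process sketch. It is not bellperson's implementation: `PriorityFlag`, `run_job`, `Outcome` and the `preempt_at` parameter are all hypothetical names, and an `AtomicBool` stands in for whatever cross-process lock the library actually uses. The point is only that a low priority job checks the flag between chunks of work and gives up the GPU when a high priority job is waiting, while a high priority job never aborts.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for a priority lock shared between jobs.
struct PriorityFlag {
    high_priority_waiting: AtomicBool,
}

impl PriorityFlag {
    fn new() -> Self {
        PriorityFlag { high_priority_waiting: AtomicBool::new(false) }
    }

    /// Called by a high priority job to announce that it wants the GPU.
    fn request(&self) {
        self.high_priority_waiting.store(true, Ordering::SeqCst);
    }

    /// Only jobs without priority should stop, and only if someone is waiting.
    fn should_break(&self, job_has_priority: bool) -> bool {
        !job_has_priority && self.high_priority_waiting.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Finished(u64),
    Aborted { chunks_done: usize },
}

/// Process `chunks` one by one, checking for preemption before each chunk.
/// `preempt_at` simulates a high priority request arriving before that chunk.
fn run_job(flag: &PriorityFlag, priority: bool, chunks: &[u64], preempt_at: Option<usize>) -> Outcome {
    let mut sum = 0u64;
    for (i, chunk) in chunks.iter().enumerate() {
        if preempt_at == Some(i) {
            flag.request();
        }
        if flag.should_break(priority) {
            return Outcome::Aborted { chunks_done: i };
        }
        sum += chunk;
    }
    Outcome::Finished(sum)
}

fn main() {
    let flag = PriorityFlag::new();
    // A low priority job is interrupted once the request arrives.
    println!("{:?}", run_job(&flag, false, &[1, 2, 3, 4], Some(2)));
    // A high priority job ignores the flag and runs to completion.
    println!("{:?}", run_job(&flag, true, &[1, 2, 3, 4], None));
}
```

In this sketch the check sits inside the work loop, so the abort takes effect at chunk boundaries rather than only when a job starts.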

vmx · 2022-11-30T18:39:54Z

@Elhorses here's my version of a fix: #293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

Elhorses · 2022-12-01T01:45:41Z

> @Elhorses here's my version of a fix: #293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

OK, thanks for your help, I'll use it.

Due to refactorings, the `PriorityLock::should_break()` logic was quite confusing and used the wrong way. Make it work correctly while simplifying the logic. This commit also removes `PriorityLock` from the public API as it isn't really needed. Tweak some values in the parallel prover test, so that aborting a low priority kernel from running on the GPU happens more frequently. It needs a newer version of `ec-gpu-gen`, else it would cause panics (which are not fatal, as they happen within a thread, though they still show up in the logs). Fixes #291.

Elhorses · 2022-12-01T09:26:43Z

> @Elhorses here's my version of a fix: #293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

Hello, can we use bellperson on AMD GPUs?

vmx · 2022-12-01T10:57:40Z

The OpenCL version should run on AMD GPUs. If it doesn't, it's a bug. Please report if you run into problems.

Elhorses · 2022-12-01T11:06:28Z

> The OpenCL version should run on AMD GPUs. If it doesn't, it's a bug. Please report if you run into problems.

OK, thanks for your help!


vmx self-assigned this Nov 28, 2022

vmx mentioned this issue Nov 30, 2022

fix: make priority locking work again #293

Merged

vmx closed this as completed in #293 Dec 5, 2022
