GPU preemption failure #291

Closed
Elhorses opened this issue Nov 25, 2022 · 12 comments · Fixed by #293

@Elhorses

When the WindowPoSt (wdpost) and WinningPoSt computations run at the same time, WinningPoSt fails to preempt the GPU even though its priority is true and WindowPoSt's is false. As a result, the WinningPoSt computation times out.

@vmx

vmx commented Nov 25, 2022

Could you please provide some logs with log level debug, or even better trace (by setting RUST_LOG=trace)?

Do you have some way to reproduce the problem?

@Elhorses
Author

Could you please provide some logs with log level debug, or even better trace (by setting RUST_LOG=trace)?

Do you have some way to reproduce the problem?

2022-11-24T23:07:17.607 INFO storage_proofs_core::compound_proof > snark_proof:start
2022-11-24T23:07:17.750 INFO bellperson::groth16::prover > Bellperson 0.22.0 is being used!
2022-11-24T23:07:40.287Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 12, "forRound": 2367496, "baseEpoch": 2367495, "baseDeltaSeconds": 10, "nullRounds": 0, "lateStart": false, "beaconEpoch": 2463341, "lookbackEpochs": 900, "networkPowerAtLookback": "21798573920164216832", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:07:55.019 INFO bellperson::groth16::prover > synthesis time: 37.268746346s
2022-11-24T23:07:55.019 INFO bellperson::groth16::prover > starting proof timer
2022-11-24T23:07:59.294 INFO bellperson::gpu::locks > GPU is available for FFT!
2022-11-24T23:07:59.295 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:07:59.317 INFO ec_gpu_gen::fft > FFT: 1 working device(s) selected.
2022-11-24T23:07:59.318 INFO ec_gpu_gen::fft > FFT: Device 0: GeForce RTX 3090
2022-11-24T23:07:59.318 INFO bellperson::gpu::locks > GPU FFT kernel instantiated!
2022-11-24T23:08:10.074Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 51, "forRound": 2367497, "baseEpoch": 2367495, "baseDeltaSeconds": 40, "nullRounds": 1, "lateStart": false, "beaconEpoch": 2463342, "lookbackEpochs": 900, "networkPowerAtLookback": "21798574221660848128", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:08:17.474Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:08:27.245 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-11-24T23:08:27.245 INFO bellperson::gpu::multiexp > Multiexp: CPU utilization: 0.
2022-11-24T23:08:27.246 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:08:27.248 INFO ec_gpu_gen::multiexp > Multiexp: 1 working device(s) selected.
2022-11-24T23:08:27.248 INFO ec_gpu_gen::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2022-11-24T23:08:27.248 INFO bellperson::gpu::locks > GPU Multiexp kernel instantiated!
2022-11-24T23:08:40.246Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 17, "forRound": 2367498, "baseEpoch": 2367497, "baseDeltaSeconds": 10, "nullRounds": 0, "lateStart": false, "beaconEpoch": 2463343, "lookbackEpochs": 900, "networkPowerAtLookback": "21798572294495338496", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:09:10.044Z INFO miner miner/miner.go:590 round winner, will mine new block, for {"height": "2367499"}
2022-11-24T23:09:10.045Z INFO storageminer storage/winning_prover.go:70 Computing WinningPoSt ;[{SealProof:9 SectorNumber:152313 SectorKey: SealedCID:bagboea4b5abcadqi47tmsbyg24t463o4u2nkb5dbhzn24ndpv7mb7ywjosqc4e27}]; [114 34 174 18 207 210 176 171 229 24 20 68 99 184 137 67 14 147 227 93 13 156 156 207 168 200 156 1 88 161 142 198]
2022-11-24T23:09:10.045Z INFO advmgr sealer/manager_post.go:23 GenerateWinningPoSt run at lotus-miner
2022-11-24T23:09:10.054 INFO filecoin_proofs::api::winning_post > generate_winning_post:start
2022-11-24T23:09:10.065 INFO filecoin_proofs::caches > trying parameters memory cache for: WINNING_POST[68719476736]
2022-11-24T23:09:10.065 INFO filecoin_proofs::caches > found params in memory cache for WINNING_POST[68719476736]
2022-11-24T23:09:10.191 INFO storage_proofs_core::compound_proof > vanilla_proofs:start
2022-11-24T23:09:10.488 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish
2022-11-24T23:09:10.493 INFO storage_proofs_core::compound_proof > snark_proof:start
2022-11-24T23:09:10.494 INFO bellperson::groth16::prover > Bellperson 0.22.0 is being used!
2022-11-24T23:09:10.594 INFO bellperson::groth16::prover > synthesis time: 100.372952ms
2022-11-24T23:09:10.594 INFO bellperson::groth16::prover > starting proof timer
2022-11-24T23:09:10.610 INFO bellperson::gpu::locks > GPU is available for FFT!
2022-11-24T23:09:10.610 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:09:11.044 INFO ec_gpu_gen::fft > FFT: 1 working device(s) selected.
2022-11-24T23:09:11.044 INFO ec_gpu_gen::fft > FFT: Device 0: GeForce RTX 3090
2022-11-24T23:09:11.044 INFO bellperson::gpu::locks > GPU FFT kernel instantiated!
2022-11-24T23:09:17.476Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:10:17.477Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:11:00.420 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-11-24T23:11:00.420 INFO bellperson::gpu::multiexp > Multiexp: CPU utilization: 0.
2022-11-24T23:11:00.420 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:11:00.421 INFO bellperson::groth16::prover > prover time: 185.401883711s
2022-11-24T23:11:01.763 INFO storage_proofs_core::compound_proof > snark_proof:finish
2022-11-24T23:11:01.763 INFO filecoin_proofs::api::window_post > generate_window_post:finish
2022-11-24T23:11:01.764 INFO ec_gpu_gen::multiexp > Multiexp: 1 working device(s) selected.
2022-11-24T23:11:01.764 INFO ec_gpu_gen::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2022-11-24T23:11:01.764 INFO bellperson::gpu::locks > GPU Multiexp kernel instantiated!
2022-11-24T23:11:02.383 INFO bellperson::groth16::prover > prover time: 111.788384309s
2022-11-24T23:11:02.387 INFO storage_proofs_core::compound_proof > snark_proof:finish
2022-11-24T23:11:02.387 INFO filecoin_proofs::api::winning_post > generate_winning_post:finish
2022-11-24T23:11:02.388Z INFO storageminer storage/winning_prover.go:77 GenerateWinningPoSt took 1m52.342815192s
2022-11-24T23:11:04.035Z INFO wdpost wdpost/wdpost_run.go:732 computing window post {"batch": 0, "elapsed": 301.29480264, "skip": 0, "err": null}
2022-11-24T23:11:04.047 INFO filecoin_proofs::api::window_post > verify_window_post:start
2022-11-24T23:11:04.047 INFO filecoin_proofs::caches > trying parameters memory cache for: WINDOW_POST[68719476736]-verifying-key
2022-11-24T23:11:04.047 INFO filecoin_proofs::caches > found params in memory cache for WINDOW_POST[68719476736]-verifying-key
2022-11-24T23:11:04.081 INFO filecoin_proofs::api::window_post > verify_window_post:finish
2022-11-24T23:11:06.994Z INFO miner miner/miner.go:645 mined new block {"cid": "bafy2bzaceaukxlerk4p4rjtq6x6yfdn764vjmtbkmr2a7wunn3mnaqov2if2a", "height": 2367499, "miner": "f0502198", "parents": ["f0230861","f01886704","f01702940","f01171513","f01852363","f01680940","f089180","f01926802"], "parentTipset": "{bafy2bzaceduwoeqccvx34e6qosnrzjteg4qelshpxnnmdzqcl33axfjyng3qy,bafy2bzaceabyjgwvovms7vdayfuhcqfhyv4ufbvgtnzj66dt7ksdyakzky7ms,bafy2bzacebnwfa24777iglbvuebudz3ockjqvu3adoqmwqd52dtcbgevmeaqo,bafy2bzacea4b5n6olk6khgla6ntji3oacebljbfywakph6jv2buev2ygbterm,bafy2bzacebl76ealt5wv2nzmxrzbb3saoynuvc3brfwuldltbdiyb2kwu42mi,bafy2bzacec3qgh53icxqezayboosxobtoznilaemn47qqqb5hre7corkqyns6,bafy2bzacedyn2wahnuyaaahtv5nmoub2orhwsfob7ovk5gxn6lqfkcynnmby4,bafy2bzaceasooknnkcphdfsuigvbeet3dzrzhmmcx4v6g335liahl3su73ywm}", "took": 116.967937239}
2022-11-24T23:11:06.994Z WARN miner miner/miner.go:647 CAUTION: block production took longer than the block delay. Your computer may not be fast enough to keep up {"tPowercheck ": 0.016822184, "tTicket ": 0.0015335, "tSeed ": 0.00000244, "tProof ": 112.343196209, "tPending ": 4.507330629, "tCreateBlock ": 0.099052277}

@vmx

vmx commented Nov 28, 2022

From the log messages it's hard to tell which lines come from which process/thread. It could well be that the WinningPoSt one got priority. Why are you sure it didn't?

Are you able to reproduce the issue? Are you compiling the Rust parts from source? I'm asking because, if you are, I might be able to provide you with a version that also logs the thread ID, so that we can distinguish them.
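
For illustration, tagging each log line with the emitting thread could look roughly like the sketch below. It uses only the Rust standard library; the `tagged` helper and the thread names `window_post` and `winning_post` are hypothetical and are not taken from bellperson or its logger.

```rust
use std::thread;

/// Hypothetical helper: prefixes a message with the current thread's ID and name.
fn tagged(msg: &str) -> String {
    let t = thread::current();
    format!("[{:?} {}] {}", t.id(), t.name().unwrap_or("unnamed"), msg)
}

fn main() {
    // Two named threads emit the same message; the prefix tells them apart.
    let handles: Vec<_> = vec!["window_post", "winning_post"]
        .into_iter()
        .map(|name| {
            thread::Builder::new()
                .name(name.to_string())
                .spawn(|| tagged("GPU is available for FFT!"))
                .expect("failed to spawn thread")
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().expect("thread panicked"));
    }
}
```

With such a prefix, interleaved lines like the two "GPU is available for FFT!" entries in the log above could be attributed to the proof that produced them.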

@Elhorses
Author

You can run `cargo test test_parallel_prover --features "cuda" -- --nocapture` with v0.21.0 and v0.22.0 and then compare the Rust DEBUG logs. With v0.21.0, when a conflict happens we get "[2022-11-28T13:26:12Z WARN bellperson::gpu::locks] GPU acquired by a high priority process! Freeing up Multiexp kernels...", but v0.22.0 never prints this message. As for my lotus-miner: when the WindowPoSt and WinningPoSt computations run at the same time, WinningPoSt fails to preempt the GPU even though its priority is true and WindowPoSt's is false, and the WinningPoSt computation then times out.
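
The expected behavior described here (a low-priority job giving up the GPU when a high-priority one arrives) can be sketched as follows. This is a minimal, single-process illustration and not bellperson's implementation: the `AtomicBool` flag, `Outcome`, and `low_priority_job` are hypothetical names, and the flag merely stands in for whatever cross-process signal the real locks use.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Result of a low-priority job in this sketch.
#[derive(Debug, PartialEq)]
enum Outcome {
    Finished(u64),
    Preempted { chunks_done: usize },
}

/// Hypothetical low-priority worker: sums the input chunk by chunk and
/// checks before each chunk whether a high-priority job is waiting.
fn low_priority_job(chunks: &[Vec<u64>], high_priority_waiting: &AtomicBool) -> Outcome {
    let mut total = 0u64;
    for (i, chunk) in chunks.iter().enumerate() {
        if high_priority_waiting.load(Ordering::SeqCst) {
            // Give up the device so the high-priority job can run.
            return Outcome::Preempted { chunks_done: i };
        }
        total += chunk.iter().sum::<u64>();
    }
    Outcome::Finished(total)
}

fn main() {
    let chunks = vec![vec![1, 2], vec![3, 4]];
    let flag = AtomicBool::new(false);

    // No high-priority job waiting: the work runs to completion.
    println!("{:?}", low_priority_job(&chunks, &flag));

    // A high-priority job announces itself: the work stops early.
    flag.store(true, Ordering::SeqCst);
    println!("{:?}", low_priority_job(&chunks, &flag));
}
```

The point of the check inside the loop is that preemption is cooperative: if the low-priority side never observes the signal, the high-priority job simply waits, which matches the symptom reported with v0.22.0.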

vmx self-assigned this Nov 28, 2022
@vmx

vmx commented Nov 30, 2022

Thanks @Elhorses for providing the command to run. I think I can reproduce it, I'm having a look.

@Elhorses
Author

Thanks @Elhorses for providing the command to run. I think I can reproduce it, I'm having a look.

OK, thank you!
I have solved the problem; you can look at https://github.com/Elhorses/bellperson/tree/v0.22.0, commit: , and my lotus-miner is now working fine.

@vmx

vmx commented Nov 30, 2022

Thanks, that'll save me a lot of time!

vmx added a commit that referenced this issue Nov 30, 2022
Due to refactorings, the `PriorityLock::should_break()` logic was
quite confusing and used the wrong way. Make it work correctly
while simplifying the logic.

This commit also removes `PriorityLock` from the public API as it
isn't really needed.

Tweak some values in the parallel prover test, so that aborting a
low priority kernel from running on the GPU happens more frequently.

Fixes #291.
@vmx

vmx commented Nov 30, 2022

@Elhorses here's my version of a fix: #293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

@Elhorses
Author

Elhorses commented Dec 1, 2022

@Elhorses here's my version of a fix: #293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

OK, thanks for your help, I'll use it.

vmx added a commit that referenced this issue Dec 1, 2022
Due to refactorings, the `PriorityLock::should_break()` logic was
quite confusing and used the wrong way. Make it work correctly
while simplifying the logic.

This commit also removes `PriorityLock` from the public API as it
isn't really needed.

Tweak some values in the parallel prover test, so that aborting a
low priority kernel from running on the GPU happens more frequently.

It needs a newer version of `ec-gpu-gen`, else it would cause panics
(which are not fatal, as they happen within a thread. Though, they
still show up in the logs).

Fixes #291.
@Elhorses
Author

Elhorses commented Dec 1, 2022

@Elhorses here's my version of a fix: #293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

Hello, can we use bellperson on AMD GPUs?

@vmx

vmx commented Dec 1, 2022

The OpenCL version should run on AMD GPUs. If it doesn't, it's a bug. Please report if you run into problems.

@Elhorses
Author

Elhorses commented Dec 1, 2022

The OpenCL version should run on AMD GPUs. If it doesn't, it's a bug. Please report if you run into problems.

OK, thanks for your help!

vmx closed this as completed in #293 Dec 5, 2022
vmx added a commit that referenced this issue Dec 5, 2022