New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Parallel switching from GPU to CPU #35
Conversation
src/groth16/prover.rs
Outdated
&mut multiexp_kern, | ||
); | ||
// Keep checking between multiexp | ||
#[cfg(feature = "gpu")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of repetition here, can you extract into either a macro or a function please
two questions
|
Suppose Higher priority B takes over the GPU from A, and then another high priority process C is called, C should take over B and B would then switch to CPU for the remainder of its proof alongside A. This PR assumes that there is only one high priority proof at a time.
Currently the lower priority prover will continue to use the CPU until it finishes its proof. Switching back to GPU is more involved but I can put some thought into that. |
I think it is fine for a first version, but we should have it switch back to gpu ideally if it is free |
src/groth16/prover.rs
Outdated
#[cfg(feature = "gpu")] | ||
macro_rules! check_for_higher_prio { | ||
() => { | ||
match gpu::gpu_is_not_acquired() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: this match clause could also be written as:
gpu::gpu_is_not_acquired().unwrap_or(false)
Which might make your code simpler.
src/groth16/prover.rs
Outdated
@@ -16,6 +16,22 @@ use crate::multicore::Worker; | |||
use crate::multiexp::{gpu_multiexp_supported, multiexp, DensityTracker, FullDensity}; | |||
use crate::{Circuit, ConstraintSystem, Index, LinearCombination, SynthesisError, Variable}; | |||
|
|||
// We check to see if another higher priority process needs to use | |||
// the GPU for each multiexp | |||
macro_rules! check_for_higher_prio { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shoould have cfg(not(feature..))
Proofs can be forced to run on the CPU instead of the GPU. This is used to run some proofs with higher priority. The `gpu-cpu-test` tool tests if this actually works. It spawns two threads which run proofs. Those get killed after 5 minutes of running. The overall test runs longer as some input data needs to be generated. By default one thread will always be prioritized to run on the GPU. The other one might be moved to the CPU. When running: cd fil-proofs-tooling RUST_LOG=trace cargo run --release --bin gpu-cpu-test The return values of the tests might look like: 2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 } 2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 } Clearly the high priority thread got more work done. When running without the "GPU stealing" feature, where one tread demands to run on the GPU: RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false The return values indicate that both threads got the same amount of time on the GPU: 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 } 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 } This PR depends on filecoin-project/bellperson#35.
Proofs can be forced to run on the CPU instead of the GPU. This is used to run some proofs with higher priority. The `gpu-cpu-test` tool tests if this actually works. It spawns two threads which run proofs. Those get killed after 5 minutes of running. The overall test runs longer as some input data needs to be generated. By default one thread will always be prioritized to run on the GPU. The other one might be moved to the CPU. When running: cd fil-proofs-tooling RUST_LOG=trace cargo run --release --bin gpu-cpu-test The return values of the tests might look like: 2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 } 2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 } Clearly the high priority thread got more work done. When running without the "GPU stealing" feature, where one tread demands to run on the GPU: RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false The return values indicate that both threads got the same amount of time on the GPU: 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 } 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 } This PR depends on filecoin-project/bellperson#35.
@nginnever I went through this code today, looks good to me in general, but some parts are missing I think: |
we need to test both cases |
Proofs can be forced to run on the CPU instead of the GPU. This is used to run some proofs with higher priority. The `gpu-cpu-test` tool tests if this actually works. It spawns two threads which run proofs. Those get killed after 5 minutes of running. The overall test runs longer as some input data needs to be generated. By default one thread will always be prioritized to run on the GPU. The other one might be moved to the CPU. When running: cd fil-proofs-tooling RUST_LOG=trace cargo run --release --bin gpu-cpu-test The return values of the tests might look like: 2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 } 2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 } Clearly the high priority thread got more work done. When running without the "GPU stealing" feature, where one tread demands to run on the GPU: RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false The return values indicate that both threads got the same amount of time on the GPU: 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 } 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 } This PR depends on filecoin-project/bellperson#35.
Better to drop everything. New updates coming as we discussed in discord.
That's why you drop the acquire lock before starting the proof as the higher prio https://github.com/finalitylabs/bellman/blob/fil-lock/tests/gpu_provers.rs#L154-L156 |
Could you please rebase that one on master? I tried it at https://github.com/vmx/bellman/tree/gpu-lock but somehow it doesn't do what it should do in my tests anymore. I guess you know better rebasing it properly :) |
src/groth16/prover.rs
Outdated
a, | ||
&mut multiexp_kern, | ||
) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not add this logic to fn multiexp
instead, that way you don't have to duplicate it in here
|
Changes incoming, wrapping the multiexp and fft kernel in a lock structure to leave prover.rs more untouched and begin introducing better control of when the kernel should be canceled and moved back to CPU (i.e. rather than waiting for the entire multiexp round to finish). |
…eleased) and just wait for it
src/domain.rs
Outdated
Err(e) => { | ||
warn!("GPU FFT failed! Falling back to CPU... Error: {}", e); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can write
gpu_fft(k, a, omega, log_n).unwrap_or_else(|err| warn!("..."));
src/domain.rs
Outdated
log_n: u32, | ||
) -> gpu::GPUResult<()> { | ||
if let Some(ref mut k) = kern { | ||
match gpu_mul_by_field(k, a, minv, log_n) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unwrap_or_else
as above
if self.kernel.is_some() { | ||
warn!("GPU acquired by a high priority process! Freeing up kernels..."); | ||
self.kernel = None; // This would drop kernel and free up the GPU | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be written a bit nicer
if let Some(_kernel) = self.kernel.take() {
warn!("...");
}
use log::info; | ||
use std::fs::File; | ||
|
||
const GPU_LOCK_NAME: &str = "/tmp/bellman.gpu.lock"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably use https://doc.rust-lang.org/std/env/fn.temp_dir.html instead of hard coding /tmp
} | ||
} | ||
|
||
const PRIORITY_LOCK_NAME: &str = "/tmp/bellman.priority.lock"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
return Box::new(pool.compute(move || Ok(p))); | ||
} | ||
Err(e) => { | ||
warn!("GPU Multiexp failed! Falling back to CPU... Error: {}", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this return an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dignifiedquire Depends on what you would expect it to do in case of a GPU failure. Fallback to CPU or return an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right
Proofs can be forced to run on the CPU instead of the GPU. This is used to run some proofs with higher priority. The `gpu-cpu-test` tool tests if this actually works. It spawns two threads which run proofs. Those get killed after 5 minutes of running. The overall test runs longer as some input data needs to be generated. By default one thread will always be prioritized to run on the GPU. The other one might be moved to the CPU. When running: cd fil-proofs-tooling RUST_LOG=trace cargo run --release --bin gpu-cpu-test The return values of the tests might look like: 2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 } 2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 } Clearly the high priority thread got more work done. When running without the "GPU stealing" feature, where one tread demands to run on the GPU: RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false The return values indicate that both threads got the same amount of time on the GPU: 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 } 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 } This PR depends on filecoin-project/bellperson#35.
Proofs can be forced to run on the CPU instead of the GPU. This is used to run some proofs with higher priority. The `gpu-cpu-test` tool tests if this actually works. It spawns two threads which run proofs. Those get killed after 5 minutes of running. The overall test runs longer as some input data needs to be generated. By default one thread will always be prioritized to run on the GPU. The other one might be moved to the CPU. When running: cd fil-proofs-tooling RUST_LOG=trace cargo run --release --bin gpu-cpu-test The return values of the tests might look like: 2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 } 2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 } Clearly the high priority thread got more work done. When running without the "GPU stealing" feature, where one tread demands to run on the GPU: RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false The return values indicate that both threads got the same amount of time on the GPU: 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 } 2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 } This PR depends on filecoin-project/bellperson#35.
@nginnever I currently try to test this PR. I can't get the locking working. I suspect that my code is wrong. Here is the main part, can you please check if I understood it wrong how the API should be used. // This is the higher priority proof, get it on the GPU even if there is one running
// already there
if gpu_stealing {
let mut prio_lock = PriorityLock::new();
info!("Trying to acquire Priority lock");
prio_lock.lock();
// Run the actual proof
election_post::do_generate_post(&priv_replica_infos, &candidates);
}
// No locking
else {
// Run the actual proof
debug!("Do not try to acquire the priority lock");
election_post::do_generate_post(&priv_replica_infos, &candidates);
} The code above is called in a loop. It is executed by two threads. In one thread But what I see is that the proofing just runs sequentially one after another (i.e. proof in thread 1, then proof thread 2, thread 1 again, thread 2 again). It's the same order as when I run both threads with |
@vmx your usage looks correct. Could you try running with For FFT being acquired by higher prio
And if the multiexp is acquired you should see
We do know that the higher prio process will "jump in" and force the lower to switch off the gpu. I am doing some double checking to be sure that the lower prio can then continue on the cpu without waiting for the gpu to finish (it was but a change recently may have broken that). If that is the case then we would see what you are reporting. EDIT: Upon close inspection it looks like everything is working as expected in our tests. |
Continued here: #58 |
This PR extends the file lock to allow for two processes to designate a higher and lower priority. The higher priority task can force the lower priority task into switching from GPU prover to CPU prover between multiexp (the heaviest part of computation) rounds. This PR contains two file structures. A prover lock file and an acquire GPU flag file. This is a fairly large change to the bellman lib so I will give a detailed example.
i.e.
Process A starts a sector sealing proof that blocks the GPU/prover for ~300-600s.
Process B needs to create a smaller PoST proof in a short period of time and would like to use the GPU to do so concurrently.
Before B starts the circuit proof call
bellman::groth16::create_proof(c, ¶ms, r, s)
it will first use a newbellman::gpu
feature to send a signal to A that it would like it to switch to CPU so that B may use the GPU at the same time.B checks if another process is creating a proof...
B creates an acquire file flag and loops for a short period of time until A notices this flag and releases their prover lock. Currently this is done between the 8 multiexp rounds that A is processing, but in later PRs will move into the lower multiexp rounds per chunk size and then into the kernel itself if less time is required for A to move over to CPU.
When A acquires the flag between multiexps it will then use a CPU version of the multiexp from that point on. B is now free to start
create_proof
that will use the GPU prover.