Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Parallel switching from GPU to CPU #35

Closed
wants to merge 56 commits into from

Conversation

nginnever
Copy link

This PR extends the file lock to allow for two processes to designate a higher and lower priority. The higher priority task can force the lower priority task into switching from GPU prover to CPU prover between multiexp (the heaviest part of computation) rounds. This PR contains two file structures. A prover lock file and an acquire GPU flag file. This is a fairly large change to the bellman lib so I will give a detailed example.

i.e.

Process A starts a sector sealing proof that blocks the GPU/prover for ~300-600s.

Process B needs to create a smaller PoST proof in a short period of time and would like to use the GPU to do so concurrently.

Before B starts the circuit proof call bellman::groth16::create_proof(c, &params, r, s) it will first use a new bellman::gpu feature to send a signal to A that it would like it to switch to CPU so that B may use the GPU at the same time.

B checks if another process is creating a proof...

let check = match gpu::gpu_is_available() {
    Ok(n) => n,
    Err(err) => false,
};

B creates an acquire file flag and loops for a short period of time until A notices this flag and releases their prover lock. Currently this is done between the 8 multiexp rounds that A is processing, but in later PRs will move into the lower multiexp rounds per chunk size and then into the kernel itself if less time is required for A to move over to CPU.

if check != true { 
    info!("GPU is NOT Available! Attempting to acuire the GPU...");
    let a_lock = Some(gpu::acquire_gpu().unwrap());

    // We need to drop the acquire lock as soon as the lower prio 
    // process has freed the main lock so that the higher uses GPU
    loop {
        //info!("checking to see if lower prio process has freed GPU");
        let available = match gpu::gpu_is_available() {
            Ok(n) => n,
            Err(err) => false,
        };
        if available {
            info!("GPU free from lower prio process. Dropping acquire gpu file lock from switching process...");
            gpu::drop_acquire_lock(a_lock.unwrap());
            break;
        };
        continue;       
    }
};

When A acquires the flag between multiexps it will then use a CPU version of the multiexp from that point on. B is now free to start create_proof that will use the GPU prover.

&mut multiexp_kern,
);
// Keep checking between multiexp
#[cfg(feature = "gpu")]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of repetition here, can you extract into either a macro or a function please

@dignifiedquire
Copy link

two questions

  • what happens if the high priority item gets called multiple times?
  • what happens if a high priority item finishes?

@nginnever nginnever changed the title Parallel switching from GPU to CPU WIP Parallel switching from GPU to CPU Dec 9, 2019
@nginnever
Copy link
Author

nginnever commented Dec 10, 2019

  • what happens if the high priority item gets called multiple times?

Suppose Higher priority B takes over the GPU from A, and then another high priority process C is called, C should take over B and B would then switch to CPU for the remainder of its proof alongside A. This PR assumes that there is only one high priority proof at a time.

  • what happens if a high priority item finishes?

Currently the lower priority prover will continue to use the CPU until it finishes its proof. Switching back to GPU is more involved but I can put some thought into that.

@dignifiedquire
Copy link

Currently the lower priority prover will continue to use the CPU until it finishes its proof. Switching back to GPU is more involved but I can put some thought into that.

I think it is fine for a first version, but we should have it switch back to gpu ideally if it is free

#[cfg(feature = "gpu")]
macro_rules! check_for_higher_prio {
() => {
match gpu::gpu_is_not_acquired() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: this match clause could also be written as:

gpu::gpu_is_not_acquired().unwrap_or(false)

Which might make your code simpler.

@@ -16,6 +16,22 @@ use crate::multicore::Worker;
use crate::multiexp::{gpu_multiexp_supported, multiexp, DensityTracker, FullDensity};
use crate::{Circuit, ConstraintSystem, Index, LinearCombination, SynthesisError, Variable};

// We check to see if another higher priority process needs to use
// the GPU for each multiexp
macro_rules! check_for_higher_prio {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shoould have cfg(not(feature..))

vmx added a commit to vmx/rust-fil-proofs that referenced this pull request Dec 11, 2019
Proofs can be forced to run on the CPU instead of the GPU. This is used to
run some proofs with higher priority.

The `gpu-cpu-test` tool tests if this actually works. It spawns two threads
which run proofs. Those get killed after 5 minutes of running. The overall
test runs longer as some input data needs to be generated. By default one
thread will always be prioritized to run on the GPU. The other one might be
moved to the CPU.

When running:

    cd fil-proofs-tooling
    RUST_LOG=trace cargo run --release --bin gpu-cpu-test

The return values of the tests might look like:

    2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 }
    2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 }

Clearly the high priority thread got more work done.

When running without the "GPU stealing" feature, where one tread demands to
run on the GPU:

    RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false

The return values indicate that both threads got the same amount of time on the GPU:

    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 }
    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 }

This PR depends on filecoin-project/bellperson#35.
vmx added a commit to filecoin-project/rust-fil-proofs that referenced this pull request Dec 11, 2019
Proofs can be forced to run on the CPU instead of the GPU. This is used to
run some proofs with higher priority.

The `gpu-cpu-test` tool tests if this actually works. It spawns two threads
which run proofs. Those get killed after 5 minutes of running. The overall
test runs longer as some input data needs to be generated. By default one
thread will always be prioritized to run on the GPU. The other one might be
moved to the CPU.

When running:

    cd fil-proofs-tooling
    RUST_LOG=trace cargo run --release --bin gpu-cpu-test

The return values of the tests might look like:

    2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 }
    2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 }

Clearly the high priority thread got more work done.

When running without the "GPU stealing" feature, where one tread demands to
run on the GPU:

    RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false

The return values indicate that both threads got the same amount of time on the GPU:

    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 }
    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 }

This PR depends on filecoin-project/bellperson#35.
@keyvank
Copy link

keyvank commented Dec 12, 2019

@nginnever I went through this code today, looks good to me in general, but some parts are missing I think:
1- This needs to be tested when provers are running as independent processes, not threads, right? @dignifiedquire
2- I think you have to drop and free fft_kern and multiexp_kern after receiving the acquire signal or they will not free up the memory for high prio process.
3- Say you are the high prio process and you call the acquire_gpu() function, now check_for_higher_prio!() macro would always return true, even for you that are the one who has acquired the gpu, so it would switch back to cpu along the way and not use gpu at all. Is this handled somewhere else?

@dignifiedquire
Copy link

1- This needs to be tested when provers are running as independent processes, not threads, right? @dignifiedquire

we need to test both cases

vmx added a commit to vmx/rust-fil-proofs that referenced this pull request Dec 12, 2019
Proofs can be forced to run on the CPU instead of the GPU. This is used to
run some proofs with higher priority.

The `gpu-cpu-test` tool tests if this actually works. It spawns two threads
which run proofs. Those get killed after 5 minutes of running. The overall
test runs longer as some input data needs to be generated. By default one
thread will always be prioritized to run on the GPU. The other one might be
moved to the CPU.

When running:

    cd fil-proofs-tooling
    RUST_LOG=trace cargo run --release --bin gpu-cpu-test

The return values of the tests might look like:

    2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 }
    2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 }

Clearly the high priority thread got more work done.

When running without the "GPU stealing" feature, where one tread demands to
run on the GPU:

    RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false

The return values indicate that both threads got the same amount of time on the GPU:

    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 }
    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 }

This PR depends on filecoin-project/bellperson#35.
@nginnever
Copy link
Author

nginnever commented Dec 12, 2019

2- I think you have to drop and free fft_kern and multiexp_kern after receiving the acquire signal or they will not free up the memory for high prio process.

Better to drop everything. New updates coming as we discussed in discord.

3- Say you are the high prio process and you call the acquire_gpu() function, now check_for_higher_prio!() macro would always return true, even for you that are the one who has acquired the gpu, so it would switch back to cpu along the way and not use gpu at all. Is this handled somewhere else?

That's why you drop the acquire lock before starting the proof as the higher prio https://github.com/finalitylabs/bellman/blob/fil-lock/tests/gpu_provers.rs#L154-L156

@vmx
Copy link

vmx commented Dec 16, 2019

Could you please rebase that one on master? I tried it at https://github.com/vmx/bellman/tree/gpu-lock but somehow it doesn't do what it should do in my tests anymore. I guess you know better rebasing it properly :)

a,
&mut multiexp_kern,
)
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add this logic to fn multiexp instead, that way you don't have to duplicate it in here

@dignifiedquire
Copy link

  • rustfmt is unhappy
  • conflicts with master

@nginnever
Copy link
Author

Changes incoming, wrapping the multiexp and fft kernel in a lock structure to leave prover.rs more untouched and begin introducing better control of when the kernel should be canceled and moved back to CPU (i.e. rather than waiting for the entire multiexp round to finish).

@nginnever nginnever changed the title WIP Parallel switching from GPU to CPU feat: Parallel switching from GPU to CPU Jan 9, 2020
src/domain.rs Outdated
Err(e) => {
warn!("GPU FFT failed! Falling back to CPU... Error: {}", e);
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can write

gpu_fft(k, a, omega, log_n).unwrap_or_else(|err| warn!("..."));

src/domain.rs Outdated
log_n: u32,
) -> gpu::GPUResult<()> {
if let Some(ref mut k) = kern {
match gpu_mul_by_field(k, a, minv, log_n) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unwrap_or_else as above

if self.kernel.is_some() {
warn!("GPU acquired by a high priority process! Freeing up kernels...");
self.kernel = None; // This would drop kernel and free up the GPU
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be written a bit nicer

if let Some(_kernel) = self.kernel.take() {
  warn!("...");
}

use log::info;
use std::fs::File;

const GPU_LOCK_NAME: &str = "/tmp/bellman.gpu.lock";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use https://doc.rust-lang.org/std/env/fn.temp_dir.html instead of hard coding /tmp

}
}

const PRIORITY_LOCK_NAME: &str = "/tmp/bellman.priority.lock";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

return Box::new(pool.compute(move || Ok(p)));
}
Err(e) => {
warn!("GPU Multiexp failed! Falling back to CPU... Error: {}", e);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this return an error?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dignifiedquire Depends on what you would expect it to do in case of a GPU failure. Fallback to CPU or return an error?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right

vmx added a commit to vmx/rust-fil-proofs that referenced this pull request Jan 14, 2020
Proofs can be forced to run on the CPU instead of the GPU. This is used to
run some proofs with higher priority.

The `gpu-cpu-test` tool tests if this actually works. It spawns two threads
which run proofs. Those get killed after 5 minutes of running. The overall
test runs longer as some input data needs to be generated. By default one
thread will always be prioritized to run on the GPU. The other one might be
moved to the CPU.

When running:

    cd fil-proofs-tooling
    RUST_LOG=trace cargo run --release --bin gpu-cpu-test

The return values of the tests might look like:

    2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 }
    2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 }

Clearly the high priority thread got more work done.

When running without the "GPU stealing" feature, where one tread demands to
run on the GPU:

    RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false

The return values indicate that both threads got the same amount of time on the GPU:

    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 }
    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 }

This PR depends on filecoin-project/bellperson#35.
vmx added a commit to vmx/rust-fil-proofs that referenced this pull request Jan 14, 2020
Proofs can be forced to run on the CPU instead of the GPU. This is used to
run some proofs with higher priority.

The `gpu-cpu-test` tool tests if this actually works. It spawns two threads
which run proofs. Those get killed after 5 minutes of running. The overall
test runs longer as some input data needs to be generated. By default one
thread will always be prioritized to run on the GPU. The other one might be
moved to the CPU.

When running:

    cd fil-proofs-tooling
    RUST_LOG=trace cargo run --release --bin gpu-cpu-test

The return values of the tests might look like:

    2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 }
    2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 }

Clearly the high priority thread got more work done.

When running without the "GPU stealing" feature, where one tread demands to
run on the GPU:

    RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false

The return values indicate that both threads got the same amount of time on the GPU:

    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 }
    2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 }

This PR depends on filecoin-project/bellperson#35.
@vmx
Copy link

vmx commented Jan 15, 2020

@nginnever I currently try to test this PR. I can't get the locking working. I suspect that my code is wrong. Here is the main part, can you please check if I understood it wrong how the API should be used.

        // This is the higher priority proof, get it on the GPU even if there is one running
        // already there
        if gpu_stealing {
            let mut prio_lock = PriorityLock::new();
            info!("Trying to acquire Priority lock");
            prio_lock.lock();

            // Run the actual proof
            election_post::do_generate_post(&priv_replica_infos, &candidates);
        }
        // No locking
        else {
            // Run the actual proof
            debug!("Do not try to acquire the priority lock");
            election_post::do_generate_post(&priv_replica_infos, &candidates);
        }

The code above is called in a loop. It is executed by two threads. In one thread gpu_stealing is true in the other one it's false. I would expect that the gpu_stealing == false part runs on the CPU, hence in total there will be less iterations on this thread, than with the thread running with gpu_stealing == true.

But what I see is that the proofing just runs sequentially one after another (i.e. proof in thread 1, then proof thread 2, thread 1 again, thread 2 again). It's the same order as when I run both threads with gpu_stealing == false.

@nginnever
Copy link
Author

nginnever commented Jan 15, 2020

@vmx your usage looks correct. Could you try running with RUST_LOG=info and see if there are any warning logs like this...

For FFT being acquired by higher prio

FFT GPU acquired by some other process! Freeing up kernel...

And if the multiexp is acquired you should see

Multiexp GPU acquired by some other process! Freeing up kernel...

We do know that the higher prio process will "jump in" and force the lower to switch off the gpu. I am doing some double checking to be sure that the lower prio can then continue on the cpu without waiting for the gpu to finish (it was but a change recently may have broken that). If that is the case then we would see what you are reporting.

EDIT: Upon close inspection it looks like everything is working as expected in our tests.

@keyvank
Copy link

keyvank commented Jan 24, 2020

Continued here: #58

@keyvank keyvank closed this Jan 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants