feat: Parallel switching from GPU to CPU #35

nginnever · 2019-12-09T02:52:16Z

This PR extends the file lock so that two processes can be designated as higher and lower priority. The higher-priority task can force the lower-priority task to switch from the GPU prover to the CPU prover between multiexp rounds (multiexp is the heaviest part of the computation). The PR introduces two lock files: a prover lock file and an acquire-GPU flag file. This is a fairly large change to the bellman lib, so I will give a detailed example.

For example:

Process A starts a sector sealing proof that blocks the GPU/prover for ~300-600s.

Process B needs to create a smaller PoST proof in a short period of time and would like to use the GPU to do so concurrently.

Before B starts the circuit proof call bellman::groth16::create_proof(c, &params, r, s), it first uses a new bellman::gpu feature to signal A that it should switch to the CPU, so that B can use the GPU at the same time.

B checks if another process is creating a proof...

let check = match gpu::gpu_is_available() {
    Ok(n) => n,
    // Treat a failed check as "GPU not available".
    Err(_) => false,
};

B creates an acquire flag file and loops for a short time until A notices the flag and releases its prover lock. Currently A checks for the flag between the 8 multiexp rounds it is processing; later PRs will move the check into the lower multiexp rounds per chunk size, and then into the kernel itself, if A needs to move over to the CPU more quickly.

if !check {
    info!("GPU is NOT available! Attempting to acquire the GPU...");
    let a_lock = gpu::acquire_gpu().unwrap();

    // We need to drop the acquire lock as soon as the lower prio
    // process has freed the main lock, so that the higher prio
    // process can use the GPU.
    loop {
        let available = match gpu::gpu_is_available() {
            Ok(n) => n,
            // Treat a failed check as "GPU not available" and keep polling.
            Err(_) => false,
        };
        if available {
            info!("GPU free from lower prio process. Dropping acquire gpu file lock from switching process...");
            gpu::drop_acquire_lock(a_lock);
            break;
        }
    }
}

When A notices the flag between multiexps, it uses the CPU version of multiexp from that point on. B is then free to start create_proof, which will use the GPU prover.
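The handshake described above can be sketched with nothing but the standard library. This is only an illustration of the idea, not the code in this PR: the flag file name, the `Backend` enum, and the functions `flag_path`, `raise_flag`, `lower_flag`, and `run_rounds` are all hypothetical, and the prover lock itself is left out. The lower-priority side checks for the flag before each round and stays on the CPU once it has seen it.

```rust
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};

/// Which implementation a (simulated) multiexp round ran on.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    Gpu,
    Cpu,
}

/// Hypothetical location of the "acquire GPU" flag file (an assumption of this sketch).
pub fn flag_path(dir: &Path) -> PathBuf {
    dir.join("sketch.acquire.flag")
}

/// Higher-priority side: raise the flag by creating the file.
pub fn raise_flag(dir: &Path) -> io::Result<()> {
    File::create(flag_path(dir)).map(|_| ())
}

/// Higher-priority side: lower the flag again before starting its own proof.
pub fn lower_flag(dir: &Path) -> io::Result<()> {
    fs::remove_file(flag_path(dir))
}

/// Lower-priority side: run `rounds` rounds, checking the flag before each one.
/// Once the flag has been seen, all remaining rounds stay on the CPU.
/// `before_round` lets a caller simulate another process acting between rounds.
pub fn run_rounds<F: FnMut(usize)>(dir: &Path, rounds: usize, mut before_round: F) -> Vec<Backend> {
    let mut on_gpu = true;
    let mut used = Vec::with_capacity(rounds);
    for round in 0..rounds {
        before_round(round);
        if on_gpu && flag_path(dir).exists() {
            // Give up the GPU for the rest of this proof.
            on_gpu = false;
        }
        used.push(if on_gpu { Backend::Gpu } else { Backend::Cpu });
    }
    used
}
```

Calling `run_rounds(dir, 8, ...)` with a closure that calls `raise_flag` before round 3 yields three GPU rounds followed by five CPU rounds. In this sketch the lower-priority side never returns to the GPU, even after `lower_flag` is called.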

dignifiedquire · 2019-12-09T14:25:34Z

src/groth16/prover.rs

-        &mut multiexp_kern,
-    );
+    // Keep checking between multiexp
+    #[cfg(feature = "gpu")]


There is a lot of repetition here. Can you extract it into either a macro or a function, please?

dignifiedquire · 2019-12-09T14:27:00Z

two questions

what happens if the high priority item gets called multiple times?
what happens if a high priority item finishes?

nginnever · 2019-12-10T06:54:22Z

what happens if the high priority item gets called multiple times?

Suppose higher-priority B takes over the GPU from A, and then another high-priority process C is called. C should take over from B, and B would then switch to the CPU for the remainder of its proof, alongside A. However, this PR assumes that there is only one high-priority proof at a time.

what happens if a high priority item finishes?

Currently the lower priority prover will continue to use the CPU until it finishes its proof. Switching back to GPU is more involved but I can put some thought into that.

dignifiedquire · 2019-12-10T11:30:40Z

Currently the lower priority prover will continue to use the CPU until it finishes its proof. Switching back to GPU is more involved but I can put some thought into that.

I think it is fine for a first version, but ideally it should switch back to the GPU if it is free.

vmx · 2019-12-10T17:52:05Z

src/groth16/prover.rs

+#[cfg(feature = "gpu")]
+macro_rules! check_for_higher_prio {
+  () => {
+      match gpu::gpu_is_not_acquired() {


FYI: this match clause could also be written as:

gpu::gpu_is_not_acquired().unwrap_or(false)

Which might make your code simpler.
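A small self-contained comparison of the two forms, for reference. The `Result<bool, String>` type is an assumption standing in for whatever `gpu::gpu_is_not_acquired()` returns; `with_match` and `with_unwrap_or` are hypothetical names.

```rust
/// The explicit form: map `Ok(n)` to `n` and any error to `false`.
pub fn with_match(r: Result<bool, String>) -> bool {
    match r {
        Ok(n) => n,
        Err(_) => false,
    }
}

/// The compact form: `Result::unwrap_or` returns the `Ok` value,
/// or the supplied default when the result is an `Err`.
pub fn with_unwrap_or(r: Result<bool, String>) -> bool {
    r.unwrap_or(false)
}
```

Both functions agree on every input. The only thing lost in either form is the error value itself, so a caller that wants to log it needs `unwrap_or_else` or an explicit `match` arm.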

dignifiedquire · 2019-12-11T00:05:16Z

src/groth16/prover.rs

@@ -16,6 +16,22 @@ use crate::multicore::Worker;
 use crate::multiexp::{gpu_multiexp_supported, multiexp, DensityTracker, FullDensity};
 use crate::{Circuit, ConstraintSystem, Index, LinearCombination, SynthesisError, Variable};

+// We check to see if another higher priority process needs to use 
+// the GPU for each multiexp
+macro_rules! check_for_higher_prio {


this should have cfg(not(feature..))
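One way to read this suggestion is to define the macro twice, once per feature state, so call sites need no `#[cfg]` of their own. The sketch below is an assumption about the intent, not the code from the PR: the macro name `higher_prio_waiting!`, the function `should_switch_to_cpu`, and the `Result<bool, _>` return type of `gpu::gpu_is_not_acquired()` are all hypothetical. The `gpu` branch only compiles in a crate that actually has a `gpu` module and a `gpu` feature.

```rust
// With the `gpu` feature enabled, ask the lock module whether a higher
// priority process wants the GPU. The return type of
// `gpu::gpu_is_not_acquired()` is assumed to be `Result<bool, _>`.
#[cfg(feature = "gpu")]
macro_rules! higher_prio_waiting {
    () => {
        !gpu::gpu_is_not_acquired().unwrap_or(true)
    };
}

// Without the `gpu` feature there is no GPU to yield, so the check is a
// constant and the call sites stay identical in both builds.
#[cfg(not(feature = "gpu"))]
macro_rules! higher_prio_waiting {
    () => {
        false
    };
}

/// Hypothetical call site: decides whether the next round should run on the CPU.
pub fn should_switch_to_cpu() -> bool {
    higher_prio_waiting!()
}
```

Built without the `gpu` feature, `should_switch_to_cpu()` always returns `false`.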

Proofs can be forced to run on the CPU instead of the GPU. This is used to run some proofs with higher priority. The `gpu-cpu-test` tool tests if this actually works. It spawns two threads which run proofs. Those get killed after 5 minutes of running. The overall test runs longer as some input data needs to be generated. By default one thread will always be prioritized to run on the GPU. The other one might be moved to the CPU.

When running:

cd fil-proofs-tooling
RUST_LOG=trace cargo run --release --bin gpu-cpu-test

The return values of the tests might look like:

2019-12-11T18:13:19.518 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 303.366942495s, iterations: 28 }
2019-12-11T18:13:25.981 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 309.829930518s, iterations: 15 }

Clearly the high priority thread got more work done. When running without the "GPU stealing" feature, where one thread demands to run on the GPU:

RUST_LOG=trace cargo run --release --bin gpu-cpu-test -- --gpu-stealing false

The return values indicate that both threads got the same amount of time on the GPU:

2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread HighPrio info: RunInfo { elapsed: 307.388469955s, iterations: 23 }
2019-12-11T18:30:16.868 main INFO gpu_cpu_test > Thread LowPrio info: RunInfo { elapsed: 300.893010419s, iterations: 22 }

This PR depends on filecoin-project/bellperson#35.

keyvank · 2019-12-12T12:25:13Z

@nginnever I went through this code today. It looks good to me in general, but I think some parts are missing:
1- This needs to be tested when provers are running as independent processes, not threads, right? @dignifiedquire
2- I think you have to drop and free fft_kern and multiexp_kern after receiving the acquire signal, or they will not free up the memory for the high prio process.
3- Say you are the high prio process and you call the acquire_gpu() function. Now the check_for_higher_prio!() macro would always return true, even for you, the one who has acquired the GPU, so you would switch back to the CPU along the way and not use the GPU at all. Is this handled somewhere else?

dignifiedquire · 2019-12-12T13:59:06Z

1- This needs to be tested when provers are running as independent processes, not threads, right? @dignifiedquire

we need to test both cases

nginnever · 2019-12-12T20:28:02Z

2- I think you have to drop and free fft_kern and multiexp_kern after receiving the acquire signal, or they will not free up the memory for the high prio process.

Better to drop everything. New updates coming, as we discussed in Discord.

3- Say you are the high prio process and you call the acquire_gpu() function. Now the check_for_higher_prio!() macro would always return true, even for you, the one who has acquired the GPU, so you would switch back to the CPU along the way and not use the GPU at all. Is this handled somewhere else?

That's why, as the higher prio process, you drop the acquire lock before starting the proof: https://github.com/finalitylabs/bellman/blob/fil-lock/tests/gpu_provers.rs#L154-L156

vmx · 2019-12-16T16:07:52Z

Could you please rebase that one on master? I tried it at https://github.com/vmx/bellman/tree/gpu-lock, but somehow it no longer does what it should in my tests. I guess you know better how to rebase it properly :)

dignifiedquire · 2019-12-17T12:35:19Z

src/groth16/prover.rs

+            a,
+            &mut multiexp_kern,
+        )
+    };


Why not add this logic to fn multiexp instead? That way you don't have to duplicate it in here.

dignifiedquire · 2019-12-17T12:36:34Z

rustfmt is unhappy
conflicts with master

nginnever · 2019-12-18T06:45:08Z

Changes incoming: wrapping the multiexp and FFT kernels in a lock structure, to leave prover.rs mostly untouched and to begin introducing better control over when the kernel should be canceled and moved back to the CPU (i.e. rather than waiting for the entire multiexp round to finish).

dignifiedquire · 2020-01-13T21:14:40Z

src/domain.rs

+            Err(e) => {
+                warn!("GPU FFT failed! Falling back to CPU... Error: {}", e);
+            }
+        }


you can write

gpu_fft(k, a, omega, log_n).unwrap_or_else(|err| warn!("..."));
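To illustrate why this works: when the `Ok` type is `()`, a closure that only logs also evaluates to `()`, so `unwrap_or_else` can replace the whole `match`. The sketch below uses stand-ins throughout: `fake_gpu_fft` replaces the real GPU call, and a `Vec<String>` replaces the `warn!` logger so the behaviour can be inspected. Both are assumptions made for the example.

```rust
/// Stand-in for a GPU routine; fails when `fail` is true.
pub fn fake_gpu_fft(fail: bool) -> Result<(), String> {
    if fail {
        Err("kernel launch failed".to_string())
    } else {
        Ok(())
    }
}

/// Verbose form: match on the result and record the error.
pub fn run_with_match(fail: bool, log: &mut Vec<String>) {
    match fake_gpu_fft(fail) {
        Ok(()) => {}
        Err(e) => {
            log.push(format!("GPU FFT failed! Falling back to CPU... Error: {}", e));
        }
    }
}

/// Compact form: the closure runs only on `Err` and must return the `Ok`
/// type, which is `()` here, exactly what `Vec::push` returns.
pub fn run_with_unwrap_or_else(fail: bool, log: &mut Vec<String>) {
    fake_gpu_fft(fail)
        .unwrap_or_else(|e| log.push(format!("GPU FFT failed! Falling back to CPU... Error: {}", e)));
}
```

Note that both forms swallow the error after recording it, so the caller cannot tell from the return value whether the GPU path succeeded.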

dignifiedquire · 2020-01-13T21:15:25Z

src/domain.rs

+    log_n: u32,
+) -> gpu::GPUResult<()> {
+    if let Some(ref mut k) = kern {
+        match gpu_mul_by_field(k, a, minv, log_n) {


unwrap_or_else as above

dignifiedquire · 2020-01-13T21:17:03Z

src/domain.rs

+                if self.kernel.is_some() {
+                    warn!("GPU acquired by a high priority process! Freeing up kernels...");
+                    self.kernel = None; // This would drop kernel and free up the GPU
+                }


can be written a bit nicer

if let Some(_kernel) = self.kernel.take() { warn!("..."); }
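A minimal sketch of the `Option::take` pattern, assuming stand-in types: `Kernel`, `LockedKernel`, `free_if_held`, and the `DROPS` counter are hypothetical and exist only to make the drop observable. `take()` replaces the field with `None` and hands back the old value, which is then dropped at the end of the `if let` block.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many stand-in kernels have been dropped (demonstration only).
pub static DROPS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a GPU kernel handle; dropping it would release the device.
pub struct Kernel;

impl Drop for Kernel {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

pub struct LockedKernel {
    kernel: Option<Kernel>,
}

impl LockedKernel {
    pub fn new() -> Self {
        LockedKernel { kernel: Some(Kernel) }
    }

    pub fn has_kernel(&self) -> bool {
        self.kernel.is_some()
    }

    /// Releases the kernel if one is held. Returns true if something was freed.
    pub fn free_if_held(&mut self) -> bool {
        if let Some(_kernel) = self.kernel.take() {
            // `eprintln!` stands in for `warn!` in this sketch.
            eprintln!("GPU acquired by a high priority process! Freeing up kernels...");
            true // `_kernel` is dropped when this block ends
        } else {
            false
        }
    }
}
```

Calling `free_if_held` a second time is a no-op, because the first call already left `None` in the field.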

dignifiedquire · 2020-01-13T21:18:25Z

src/gpu/locks.rs

+use log::info;
+use std::fs::File;
+
+const GPU_LOCK_NAME: &str = "/tmp/bellman.gpu.lock";


This should probably use https://doc.rust-lang.org/std/env/fn.temp_dir.html instead of hard coding /tmp
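A sketch of what that could look like, assuming the file names are kept and only the directory changes. The constant and function names here (`GPU_LOCK_FILE`, `PRIORITY_LOCK_FILE`, `gpu_lock_path`, `priority_lock_path`) are hypothetical. Because `std::env::temp_dir()` is a function call, the full path can no longer be a `const &str` and is built at runtime instead.

```rust
use std::env;
use std::path::PathBuf;

// Only the file names stay constant; the directory comes from the platform.
const GPU_LOCK_FILE: &str = "bellman.gpu.lock";
const PRIORITY_LOCK_FILE: &str = "bellman.priority.lock";

/// Path of the GPU lock file inside the platform's temporary directory.
pub fn gpu_lock_path() -> PathBuf {
    env::temp_dir().join(GPU_LOCK_FILE)
}

/// Path of the priority lock file inside the platform's temporary directory.
pub fn priority_lock_path() -> PathBuf {
    env::temp_dir().join(PRIORITY_LOCK_FILE)
}
```

One design point to keep in mind: on Unix, `temp_dir()` honours the `TMPDIR` environment variable, so two processes only share the lock if they agree on that directory.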

dignifiedquire · 2020-01-13T21:19:18Z

src/gpu/locks.rs

+    }
+}
+
+const PRIORITY_LOCK_NAME: &str = "/tmp/bellman.priority.lock";


same as above

dignifiedquire · 2020-01-13T21:21:09Z

src/multiexp.rs

+                return Box::new(pool.compute(move || Ok(p)));
+            }
+            Err(e) => {
+                warn!("GPU Multiexp failed! Falling back to CPU... Error: {}", e);


shouldn't this return an error?

@dignifiedquire Depends on what you would expect it to do in case of a GPU failure: fall back to CPU or return an error?
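The two options in this question can be made explicit as a policy value. This is a hypothetical sketch for discussion, not what the PR implements: `OnGpuFailure`, `multiexp_sketch`, the `u64` result, and the `String` error are all stand-ins.

```rust
/// What to do when the GPU path fails (hypothetical policy type).
#[derive(Debug, PartialEq)]
pub enum OnGpuFailure {
    FallBackToCpu,
    ReturnError,
}

/// Takes the outcome of a GPU attempt plus a CPU implementation, and applies
/// the chosen policy. The CPU closure only runs if it is actually needed.
pub fn multiexp_sketch(
    gpu_result: Result<u64, String>,
    cpu: impl FnOnce() -> u64,
    policy: OnGpuFailure,
) -> Result<u64, String> {
    match gpu_result {
        Ok(v) => Ok(v),
        Err(e) => match policy {
            // Hide the GPU failure from the caller and compute on the CPU.
            OnGpuFailure::FallBackToCpu => Ok(cpu()),
            // Surface the failure and let the caller decide what to do.
            OnGpuFailure::ReturnError => Err(e),
        },
    }
}
```

Falling back keeps the proof going at the cost of hiding a hardware problem; returning the error does the opposite.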

vmx · 2020-01-15T15:33:54Z

@nginnever I am currently trying to test this PR, but I can't get the locking to work. I suspect that my code is wrong. Here is the main part; can you please check whether I misunderstood how the API should be used?

        // This is the higher priority proof, get it on the GPU even if there is one running
        // already there
        if gpu_stealing {
            let mut prio_lock = PriorityLock::new();
            info!("Trying to acquire Priority lock");
            prio_lock.lock();

            // Run the actual proof
            election_post::do_generate_post(&priv_replica_infos, &candidates);
        }
        // No locking
        else {
            // Run the actual proof
            debug!("Do not try to acquire the priority lock");
            election_post::do_generate_post(&priv_replica_infos, &candidates);
        }

The code above is called in a loop. It is executed by two threads. In one thread gpu_stealing is true, in the other one it's false. I would expect the gpu_stealing == false part to run on the CPU, so in total there should be fewer iterations on this thread than on the thread running with gpu_stealing == true.

But what I see is that the proving just runs sequentially, one proof after another (i.e. proof in thread 1, then proof in thread 2, thread 1 again, thread 2 again). It's the same order as when I run both threads with gpu_stealing == false.

nginnever · 2020-01-15T20:44:52Z

@vmx your usage looks correct. Could you try running with RUST_LOG=info and see if there are any warning logs like this...

For FFT being acquired by higher prio

FFT GPU acquired by some other process! Freeing up kernel...

And if the multiexp is acquired you should see

Multiexp GPU acquired by some other process! Freeing up kernel...

We do know that the higher prio process will "jump in" and force the lower prio process to switch off the GPU. I am double-checking that the lower prio process can then continue on the CPU without waiting for the GPU to finish (it could, but a recent change may have broken that). If it is broken, we would see what you are reporting.

EDIT: Upon closer inspection, it looks like everything is working as expected in our tests.

keyvank · 2020-01-24T21:55:31Z

Continued here: #58

dignifiedquire reviewed Dec 9, 2019

nginnever changed the title ~~Parallel switching from GPU to CPU~~ WIP Parallel switching from GPU to CPU Dec 9, 2019

vmx reviewed Dec 10, 2019

dignifiedquire reviewed Dec 11, 2019

vmx mentioned this pull request Dec 11, 2019

feat: add gpu-cpu mover test tool filecoin-project/rust-fil-proofs#993

Closed

dignifiedquire reviewed Dec 17, 2019

nginnever added 3 commits December 17, 2019 21:54

Add gpu notes to readme

a2d753b

Add requirements

35830f6

Format readme

2f73dfb

fmt changes

2920972

nginnever force-pushed the fil-lock branch from c71045c to 83150ad Compare December 18, 2019 18:40

keyvank and others added 7 commits December 29, 2019 22:55

remove correcteness checks

4f11590

move kernel creation logs to get_fft/multiexp_kernel functions

7a7be82

fix readme conflict

c0236f2

add parallel integration test

8103827

ask if gpu is available without blocking process

aa2a9e8

can't move file lock for conditional unlock

e6c4fde

locks aren't dropping properly now

cfa7d91

keyvank and others added 10 commits January 6, 2020 00:09

might be better not to check if gpu is available (when prio lock is released) and just wait for it

0df07d4

start high prio in the middle of the low prio multiexps

cf5de63

remove unnecessary imports

becaa25

fix warnings

e02be48

formatting

92ea9b4

better logging

2eab10e

make fs2 optional and fix feature flags

c07848f

remove env_logger dep

2b1bbd1

lets change pin ff in a separate PR

e64ac88

same format as before

d7ca914

nginnever changed the title ~~WIP Parallel switching from GPU to CPU~~ feat: Parallel switching from GPU to CPU Jan 9, 2020

dignifiedquire reviewed Jan 13, 2020

nginnever force-pushed the fil-lock branch from b3a91f9 to d7ca914 Compare January 15, 2020 23:57

keyvank added 4 commits January 16, 2020 16:07

Merge branch 'master' into fil-lock

eef15b6

Revert "fall back to cpu in case of fft failures"

6ae4617

This reverts commit 7982fcf.

cargo fmt

8cb0e47

nicer dropping

dad1197

keyvank mentioned this pull request Jan 16, 2020

feat(gpu): priority GPU/CPU switching mechanism #50

Closed

keyvank mentioned this pull request Jan 24, 2020

feat(gpu): priority GPU/CPU switching mechanism #58

Merged

keyvank closed this Jan 24, 2020
