# Multi-QPU (nvidia-mqpu)

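The `nvidia-mqpu` target distributes separate quantum circuits to individual GPUs on a single host machine, so that each GPU acts as one QPU. In the examples below it is selected with `cudaq.set_target("nvidia", option="mqpu")`.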

![mqpu](./circuit-mqpu.png)

## Example with `sample` algorithmic primitives

In [4]:
import cudaq

cudaq.set_target("nvidia", option="mqpu")

target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)


@cudaq.kernel
def test(i: int):
    num_qubits = 3
    qubits = cudaq.qvector(num_qubits)
    # Rotate every qubit by an angle that grows with the circuit index.
    ry(0.1 * i, qubits)
    mz(qubits)


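# Launch 10 circuits asynchronously, assigning them to the QPUs round-robin.
count_futures = []
for i in range(10):
    qpu_id = i % num_qpus
    print("qpu_id", qpu_id)
    result = cudaq.sample_async(test, i, shots_count=10000, qpu_id=qpu_id)
    count_futures.append(result)

# Wait for each result in submission order and print its counts.
for future in count_futures:
    counts = future.get()
    print(counts)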

Number of QPUs: 4
qpu_id 0
qpu_id 1
qpu_id 2
qpu_id 3
qpu_id 0
qpu_id 1
qpu_id 2
qpu_id 3
qpu_id 0
qpu_id 1
{ 000:10000 }

{ 000:9928 010:16 100:23 001:33 }

{ 000:9713 010:98 100:80 110:2 001:103 101:3 011:1 }

{ 000:9385 010:195 100:192 110:7 001:206 101:7 011:8 }

{ 000:8796 010:372 111:1 100:389 110:13 001:396 101:20 011:13 }

{ 000:8205 010:573 111:2 100:544 110:44 001:558 101:37 011:37 }

{ 000:7597 010:747 111:8 100:734 110:77 001:702 101:71 011:64 }

{ 000:6903 010:937 111:21 100:873 110:115 001:910 101:125 011:116 }

{ 000:6130 010:1087 111:30 100:1059 110:209 001:1107 101:190 011:188 }

{ 000:5369 010:1191 111:62 100:1238 110:287 001:1269 101:299 011:285 }
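
The counts above can be sanity-checked analytically. The kernel applies `ry(0.1 * i)` to each of the three qubits independently, so each qubit reads 0 with probability $\cos^2(0.05\,i)$ and the bitstring `000` appears with probability $\cos^6(0.05\,i)$. The sketch below evaluates that formula; the helper name `prob_all_zero` is hypothetical and not part of CUDA-Q.

```python
import math


def prob_all_zero(i: int, num_qubits: int = 3) -> float:
    """Probability of the all-zero bitstring after ry(0.1 * i) on every qubit."""
    theta = 0.1 * i
    # ry(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, so each qubit
    # is measured as 0 with probability cos^2(theta/2).
    return math.cos(theta / 2) ** (2 * num_qubits)


# Expected number of `000` outcomes in 10000 shots for a few circuit indices.
for i in (0, 1, 9):
    print(i, round(10000 * prob_all_zero(i)))
```

For `i = 1` and `i = 9` this gives roughly 9925 and 5330 expected `000` outcomes, close to the sampled 9928 and 5369 once shot noise is taken into account.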



## Example with `observe` algorithmic primitives

In [9]:
import numpy as np

import cudaq
from cudaq import spin

cudaq.set_target("nvidia", option="mqpu")
target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)

num_qubits = 10
sample_count = 640

ham = spin.z(0)

parameter_count = num_qubits

@cudaq.kernel
def kernel_rx(theta: list[float]):
    qubits = cudaq.qvector(num_qubits)

    for i in range(num_qubits):
        rx(theta[i], qubits[i])

# Below we run a circuit for `sample_count` (640) different sets of input parameters.
parameters = np.random.default_rng(15).uniform(
    low=0, high=1, size=(sample_count, parameter_count)
)

# Multi-GPU
# We split the parameter sets into `num_qpus` equal batches, one per available GPU.
# Note that `np.split` requires `sample_count` to be divisible by `num_qpus`.
xi = np.split(parameters, num_qpus)

print("We have", parameter_count, "parameters which we would like to execute")

print(
    "We split this into",
    len(xi),
    "batches of",
    ", ".join(str(xi[i].shape[0]) for i in range(num_qpus))
)

print("Shape after splitting", xi[0].shape)


asyncresults = []
for qpu_id in range(num_qpus):
    for i in range(xi[qpu_id].shape[0]):
        result = cudaq.observe_async(kernel_rx, ham, xi[qpu_id][i, :], qpu_id=qpu_id)
        asyncresults.append(result)

print("Energies from multi-GPUs")
for result in asyncresults:
    observe_result = result.get()
    expectation = observe_result.expectation()
    # print(expectation)

Number of QPUs: 4
We have 10 parameters which we would like to execute
We split this into 4 batches of 160, 160, 160, 160
Shape after splitting (160, 10)
Energies from multi-GPUs
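
Because `ham` is `spin.z(0)` and `kernel_rx` applies only single-qubit `rx` rotations, the expectation value depends on the first parameter alone: $\langle Z_0 \rangle = \cos^2(\theta_0/2) - \sin^2(\theta_0/2) = \cos\theta_0$. The sketch below computes this reference value so it can be compared with the values returned by `observe_async`. The helper `expected_z0` and the example row are hypothetical and used only for illustration.

```python
import math
from typing import Sequence


def expected_z0(theta: Sequence[float]) -> float:
    """Analytic <Z_0> for kernel_rx: rx(theta[0]) applied to qubit 0 in |0>."""
    # rx(t)|0> has P(0) = cos^2(t/2) and P(1) = sin^2(t/2),
    # so <Z> = P(0) - P(1) = cos(t). The other qubits do not enter.
    return math.cos(theta[0])


# Hypothetical parameter row with 10 entries, mirroring one row of `parameters`.
row = [0.3] + [0.0] * 9
print(expected_z0(row))
```

In the notebook, `expected_z0(xi[qpu_id][i, :])` could be compared against `result.get().expectation()`. The two should agree up to the simulator's floating-point precision, assuming the expectation value is computed exactly rather than estimated from shots.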


## Batch the spin Hamiltonian terms

<img src="./ham-batch.png" width="640"/>

In [5]:
import timeit

import cudaq
from cudaq import spin

cudaq.set_target("nvidia", option="mqpu")


qubit_count = 22
term_count = 100000


@cudaq.kernel
def batch_ham():
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])


# We create a random Hamiltonian
hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count)

# The observe call computes the expectation value of the Hamiltonian with respect to the state prepared by the specified kernel.
start_time = timeit.default_timer()
# Single node, single GPU.
result = cudaq.observe(batch_ham, hamiltonian).expectation()
end_time = timeit.default_timer()
print("Elapsed time (s) for single-GPU: ", end_time - start_time)

# If multiple GPUs/QPUs are available, we can parallelize the workflow by adding an `execution` argument to the observe call.

start_time = timeit.default_timer()
# Single node, multi-GPU.
result = cudaq.observe(
    batch_ham, hamiltonian, execution=cudaq.parallel.thread
).expectation()
end_time = timeit.default_timer()
print("Elapsed time (s) for multi-GPU: ", end_time - start_time)


# Multi-node, multi-GPU (if enabled, launch with: mpirun -np <n> python filename.py).
# cudaq.mpi.initialize()
# start_time = timeit.default_timer()
# result = cudaq.observe(batch_ham, hamiltonian, execution=cudaq.parallel.mpi).expectation()
# end_time = timeit.default_timer()
# print("Elapsed time (s) for multi-GPU with mpi: ", end_time - start_time)
# cudaq.mpi.finalize()

Elapsed time (s) for single-GPU:  2.703693811025005
Elapsed time (s) for multi-GPU:  1.426821483997628
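
The speed-up comes from the linearity of the expectation value: $\langle H \rangle = \sum_k c_k \langle P_k \rangle$, so the terms can be dealt out to workers, evaluated independently, and summed at the end. The following is a conceptual sketch of that idea in plain Python, not a description of how CUDA-Q implements `cudaq.parallel.thread`. The names `batched_expectation`, `term_expectation`, and the toy terms are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# (coefficient, Pauli word): a toy stand-in for one Hamiltonian term.
Term = Tuple[float, str]


def batched_expectation(
    terms: Sequence[Term],
    term_expectation: Callable[[str], float],
    num_workers: int,
) -> float:
    """Sum coefficient * <term> over batches evaluated by separate workers."""
    # Deal the terms out round-robin so every worker gets a similar share.
    batches: List[Sequence[Term]] = [terms[w::num_workers] for w in range(num_workers)]

    def evaluate(batch: Sequence[Term]) -> float:
        return sum(coeff * term_expectation(word) for coeff, word in batch)

    # The batches are independent, so their partial sums can run concurrently.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(evaluate, batches))
    return sum(partial_sums)


def toy_expectation(word: str) -> float:
    # Toy rule for illustration only: words built from I and Z give 1, others 0.
    return 1.0 if set(word) <= {"I", "Z"} else 0.0


toy_terms = [(0.5, "ZZ"), (0.25, "XX"), (-1.0, "ZI"), (2.0, "IZ")]
print(batched_expectation(toy_terms, toy_expectation, num_workers=2))
```

The result does not depend on how many workers the terms are split across; only the wall-clock time changes.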


# Multi-GPU (nvidia-mgpu)

The `nvidia-mgpu` backend runs a single large quantum circuit whose state vector is distributed across multiple GPUs.
- An $n$-qubit quantum state has $2^n$ complex amplitudes, each of which requires 8 bytes of memory in single precision. Hence the total memory required to store an $n$-qubit state is $8 \times 2^n$ bytes. For $n=30$ qubits this is roughly $8.6$ GB, but for $n=40$ it grows exponentially to roughly $8800$ GB.
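
The memory estimate above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the default of 8 bytes per amplitude assumes single precision (pass 16 for double precision), and the estimate ignores any workspace the simulator needs beyond the state vector itself.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 8) -> int:
    """Memory needed to hold 2**num_qubits complex amplitudes."""
    return bytes_per_amplitude * 2**num_qubits


def max_qubits(memory_bytes: int, bytes_per_amplitude: int = 8) -> int:
    """Largest n whose state vector fits in memory_bytes (an upper bound)."""
    amplitudes = memory_bytes // bytes_per_amplitude
    # bit_length() - 1 equals floor(log2(amplitudes)) in exact integer arithmetic.
    return amplitudes.bit_length() - 1


for n in (30, 40):
    print(n, "qubits:", state_vector_bytes(n) / 1e9, "GB")

# Hypothetical device with 16 GiB of memory, chosen only for illustration.
print(max_qubits(16 * 2**30))
```

Each additional qubit doubles the requirement, which is why a state that no longer fits on one GPU has to be spread across several.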

## Example: GHZ state

```python
# mpirun -np 4 python <fname> --target nvidia --target-option mgpu

import cudaq

cudaq.mpi.initialize()

@cudaq.kernel
def kernel(qubit_count: int):
    # Allocate our qubits.
    qvector = cudaq.qvector(qubit_count)
    # Place the first qubit in the superposition state.
    h(qvector[0])
    # Loop through the allocated qubits and apply controlled-X,
    # or CNOT, operations between them.
    for qubit in range(qubit_count - 1):
        x.ctrl(qvector[qubit], qvector[qubit + 1])
    # Measure the qubits.
    mz(qvector)

qubit_count = 30
# print("Preparing GHZ state for", qubit_count, "qubits.")
counts = cudaq.sample(kernel, qubit_count)

if cudaq.mpi.rank() == 0:
    print(counts)

cudaq.mpi.finalize()
```

In [7]:
!mpirun -np 4 python ghz.py --target nvidia --target-option mgpu

{ 000000000000000000000000000000:513 111111111111111111111111111111:487 }



## Batch job

Prepare a job script, for example `script.sh`:
```bash
#!/bin/bash

#$ -l rt_F=1
#$ -l h_rt=0:00:10
#$ -j y
#$ -cwd

source /etc/profile.d/modules.sh
module load singularitypro

singularity exec --nv cuda-quantum_0.8.0.sif python ghz.py --target nvidia --target-option mgpu,fp32
```

Pull the container image and submit the job script:
```sh
SINGULARITY_TMPDIR=$SGE_LOCALDIR singularity pull docker://nvcr.io/nvidia/quantum/cuda-quantum:0.8.0
qsub -g grpname script.sh
```

<div class="alert alert-block alert-success"> 

## Exercise

- What is the maximum number of qubits that can be simulated by one GPU?
- What is the maximum number of qubits that can be simulated by one node?
</div>