# Multi-QPU (nvidia-mqpu)

The `nvidia-mqpu` target distributes separate quantum circuits to individual GPUs on a single host machine, treating each GPU as its own virtual QPU.
![mqpu](./circuit-mqpu.png)

## Example with the `sample` algorithmic primitive

In [1]:
import cudaq

cudaq.set_target("nvidia-mqpu")

target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)


@cudaq.kernel
def test(i: int):
    num_qubits = 3
    qubits = cudaq.qvector(num_qubits)
    ry(0.1 * i, qubits)
    mz(qubits)


count_futures = []
for i in range(10):
    qpu_id = i % num_qpus
    count_futures.append(cudaq.sample_async(test, i, shots_count=10000, qpu_id=qpu_id))

for counts in count_futures:
    print(counts.get())

Number of QPUs: 5
{ 000:10000 }

{ 000:9925 010:20 100:23 001:32 }

{ 000:9697 010:94 111:1 100:108 110:1 001:97 101:1 011:1 }

{ 000:9354 010:211 100:223 110:8 001:201 011:3 }

{ 000:8807 010:386 111:1 100:396 110:9 001:367 101:21 011:13 }

{ 000:8232 010:538 111:1 100:573 110:37 001:552 101:37 011:30 }

{ 000:7621 010:767 111:7 100:691 110:78 001:703 101:57 011:76 }

{ 000:6852 010:918 111:9 100:945 110:123 001:904 101:120 011:129 }

{ 000:6084 010:1101 111:38 100:1093 110:204 001:1075 101:207 011:198 }

{ 000:5262 010:1262 111:71 100:1230 110:305 001:1265 101:286 011:319 }
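
As a rough sanity check on the counts above, the expected number of all-zero outcomes can be estimated by hand. This is a minimal sketch in plain Python, assuming `ry(0.1 * i, qubits)` rotates each of the three qubits independently; the helper name `expected_all_zero_count` is hypothetical and not part of CUDA-Q.

```python
import math


def expected_all_zero_count(i, num_qubits=3, shots=10000):
    """Expected number of all-zero bitstrings for the `test` kernel above."""
    theta = 0.1 * i
    # ry(theta) applied to |0> leaves the qubit in |0> with probability cos^2(theta/2).
    p_zero = math.cos(theta / 2) ** 2
    # Assumption: the qubits are rotated independently, so the probabilities multiply.
    return shots * p_zero**num_qubits


for i in range(10):
    print(i, round(expected_all_zero_count(i)))
```

The sampled `000` counts should scatter around these values by roughly the binomial standard deviation, so small differences from run to run are expected.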



## Example with the `observe` algorithmic primitive

In [2]:
import numpy as np

import cudaq
from cudaq import spin

np.random.seed(1)

cudaq.set_target("nvidia-mqpu")
target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)

num_qubits = 10
sample_count = 500

ham = spin.z(0)

parameter_count = num_qubits

@cudaq.kernel
def kernel_rx(theta: list[float]):
    qubits = cudaq.qvector(num_qubits)

    for i in range(num_qubits):
        rx(theta[i], qubits[i])

# Below we run a circuit for 500 different input parameters.
parameters = np.random.default_rng(15).uniform(
    low=0, high=1, size=(sample_count, parameter_count)
)

# Multi-GPU

# We split our parameters into `num_qpus` arrays since we have `num_qpus` GPUs available.
xi = np.split(parameters, num_qpus)

print("We have", parameter_count, "parameters which we would like to execute")

print(
    "We split this into",
    len(xi),
    "batches of",
    ", ".join(str(xi[i].shape[0]) for i in range(num_qpus))
)

print("Shape after splitting", xi[0].shape)


asyncresults = []
for i in range(num_qpus):
    for j in range(xi[i].shape[0]):
        qpu_id = i * num_qpus // len(xi)
        asyncresults.append(
            cudaq.observe_async(kernel_rx, ham, xi[i][j, :], qpu_id=qpu_id)
        )

print("Energies from multi-GPUs")
for result in asyncresults:
    observe_result = result.get()
    expectation = observe_result.expectation()
    # print(expectation)

Number of QPUs: 5
We have 10 parameters which we would like to execute
We split this into 5 batches of 100, 100, 100, 100, 100
Shape after splitting (100, 10)
Energies from multi-GPUs
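
Note that `np.split` raises an error when the number of parameter sets is not evenly divisible by the number of QPUs; `np.array_split` is the NumPy variant that tolerates uneven splits. The sketch below shows the same idea in plain Python. The helper `split_into_batches` is a hypothetical name introduced only for illustration.

```python
def split_into_batches(items, num_batches):
    """Split `items` into `num_batches` contiguous, nearly equal batches."""
    base, extra = divmod(len(items), num_batches)
    batches, start = [], 0
    for b in range(num_batches):
        # The first `extra` batches each take one additional item.
        size = base + (1 if b < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


# 502 parameter sets do not divide evenly across 5 QPUs.
parameter_sets = list(range(502))
batches = split_into_batches(parameter_sets, 5)
print([len(b) for b in batches])
```

Each batch can then be submitted with the `qpu_id` of the batch index, exactly as in the loop above.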


## Batch the spin Hamiltonian terms

![img](./ham-batch.png)

In [3]:
import timeit

import cudaq
from cudaq import spin

cudaq.set_target("nvidia-mqpu")

# cudaq.mpi.initialize()

qubit_count = 22
term_count = 100000


@cudaq.kernel
def batch_ham():
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])


# We create a random Hamiltonian
hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count)

# The observe call allows us to calculate the expectation value of the Hamiltonian with respect to a specified kernel.

start_time = timeit.default_timer()
# Single node, single GPU.
result = cudaq.observe(batch_ham, hamiltonian).expectation()
end_time = timeit.default_timer()
print("Elapsed time (s) for single-GPU: ", end_time - start_time)

# If multiple GPUs/QPUs are available, we can parallelize the workflow by adding the `execution` argument to the observe call.

start_time = timeit.default_timer()
# Single node, multi-GPU.
result = cudaq.observe(
    batch_ham, hamiltonian, execution=cudaq.parallel.thread
).expectation()
end_time = timeit.default_timer()
print("Elapsed time (s) for multi-GPU: ", end_time - start_time)


# Multi-node, multi-GPU. (To use this, uncomment the MPI lines and launch with `mpirun -np n filename.py`.)
# result = cudaq.observe(batch_ham, hamiltonian, execution=cudaq.parallel.mpi).expectation()

# cudaq.mpi.finalize()

Elapsed time (s) for single-GPU:  2.302815204486251
Elapsed time (s) for multi-GPU:  4.726499784737825
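
Batching works because the expectation value is linear in the Hamiltonian terms: each worker can evaluate a subset of terms and the partial results are summed. The following is a toy sketch of that idea in plain Python, not the CUDA-Q implementation. It assumes a GHZ state (which is what `batch_ham` prepares) and restricts the terms to strings of `I` and `Z`, for which the expectation is 1 when the number of `Z`s is even and 0 otherwise. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def ghz_expectation(pauli_string):
    """Toy stand-in: <GHZ|P|GHZ> for a string made only of 'I' and 'Z'."""
    # Z-strings with an even number of Z's give 1, odd ones give 0.
    return 1.0 if pauli_string.count("Z") % 2 == 0 else 0.0


def batch_expectation(terms):
    """Partial expectation value for one batch of (coefficient, string) terms."""
    return sum(coeff * ghz_expectation(p) for coeff, p in terms)


def parallel_expectation(terms, num_workers):
    # Deal the terms out round-robin, one batch per worker.
    batches = [terms[w::num_workers] for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(batch_expectation, batches))
    # Linearity: the full expectation is the sum of the per-batch results.
    return sum(partial_sums)


terms = [(0.5, "ZZII"), (-1.0, "ZIII"), (2.0, "IIII"), (0.25, "ZZZZ"), (3.0, "IZZZ")]
print(parallel_expectation(terms, num_workers=2))
```

In this sketch the threads stand in for GPUs; the per-batch work is the part that would run on separate devices.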


# Multi-GPU (nvidia-mgpu)

The `nvidia-mgpu` backend is useful for running a single large quantum circuit whose state vector is spread across multiple GPUs.
- An $n$-qubit quantum state has $2^n$ complex amplitudes, each of which requires 8 bytes of memory to store in single precision. Hence the total memory required to store an $n$-qubit quantum state is $8$ bytes $\times 2^n$. For $n=30$ qubits this is roughly $8.6$ GB, but for $n=40$ it grows to roughly $8800$ GB.
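
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: `statevector_bytes` is a hypothetical helper, and the per-GPU figure assumes the state vector is divided evenly across 4 GPUs with no additional workspace overhead.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory needed to hold 2**num_qubits complex amplitudes."""
    return bytes_per_amplitude * 2**num_qubits


for n in (30, 34, 40):
    total = statevector_bytes(n)
    # Assumption: an even split across 4 GPUs, ignoring any extra buffers.
    per_gpu = total / 4
    print(f"n={n}: {total / 1e9:,.1f} GB total, {per_gpu / 1e9:,.1f} GB per GPU")
```

Each additional qubit doubles the requirement, which is why distributing the state across GPUs only buys a few extra qubits per doubling of hardware.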

## Example: GHZ

```python
# Save as ghz.py and run with: mpirun -np 4 python ghz.py --target nvidia-mgpu

import cudaq

cudaq.mpi.initialize()

qubit_count = 30

@cudaq.kernel
def kernel(qubit_num: int):
    # Allocate our qubits.
    qvector = cudaq.qvector(qubit_num)
    # Place the first qubit in the superposition state.
    h(qvector[0])
    # Loop through the allocated qubits and apply controlled-X,
    # or CNOT, operations between them.
    for qubit in range(qubit_num - 1):
        x.ctrl(qvector[qubit], qvector[qubit + 1])
    # Measure the qubits.
    mz(qvector)

# print("Preparing GHZ state for", qubit_count, "qubits.")
counts = cudaq.sample(kernel, qubit_count)

if cudaq.mpi.rank() == 0:
    print(counts)

cudaq.mpi.finalize()
```

In [4]:
!mpirun -np 4 python ghz.py --target nvidia-mgpu

{ 111111111111111111111111111111:498 000000000000000000000000000000:502 }

