# Parallel simulations on multiple GPUs

Many quantum algorithms require running a batch of circuits and observables. For example, computing the expectation value of a Hamiltonian requires evaluating each of its many terms.

In this notebook, you will learn how to use parallelization to tackle these challenges. With CUDA-Q and Braket Hybrid Jobs, the simulation of a batch of observables and circuits can be parallelized over multiple GPUs.

We start with the imports used in the examples below.

In [41]:
import time

import numpy as np

from braket.jobs import hybrid_job
from braket.jobs.config import InstanceConfig

Next, specify the URI of the container image that supports CUDA-Q. If you went through the "0_hello_cudaq_jobs.ipynb" notebook, you can use the same image URI.

In [42]:
image_uri = "<image-uri>"

## Simulation on a single GPU

We start by running a job on a single GPU. This example job prepares a circuit and evaluates the expectation value of a Hamiltonian with many terms. The ml.p3.2xlarge instance type used in the example has one GPU.

In [95]:
@hybrid_job(
    device="local:nvidia/nvidia",
    instance_config=InstanceConfig(instanceType="ml.p3.2xlarge"),
    image_uri=image_uri,
)
def single_gpu_job(n_qubits, n_terms, n_shots):
    import cudaq

    # Define backend
    cudaq.set_target("nvidia")
    print("CUDA-Q backend: ", cudaq.get_target())

    # Define circuit and observables
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(n_qubits)
    kernel.h(qubits[0])
    for i in range(1, n_qubits):
        kernel.cx(qubits[0], qubits[i])

    hamiltonian = cudaq.SpinOperator.random(n_qubits, n_terms, seed=2024)

    # Run circuit simulation
    t0 = time.time()
    result = cudaq.observe(kernel, hamiltonian, shots_count=n_shots)
    t1 = time.time()
    total_time = t1 - t0
    print(f"result: {result.expectation()} | time: {total_time}")

    return {"expectation": result.expectation(), "total_time": total_time}

Skipping python version validation, make sure versions match between local environment and container.


The evaluation of the Hamiltonian terms is done on a single GPU.

In [None]:
n_qubits = 20
n_terms = 4000
n_shots = 1000

single_job = single_gpu_job(n_qubits, n_terms, n_shots)
print("Job ARN: ", single_job.arn)

In [None]:
result = single_job.result()
print(f"result: {result['expectation']} | time: {result['total_time']}")

## Parallelize the simulation of a batch of observables

Let's tackle the same problem again. But this time, we will run the simulation on multiple GPUs across multiple nodes using the [MPI interface](https://nvidia.github.io/cuda-quantum/latest/using/install/data_center_install.html#mpi). To do so, you add the keyword argument `execution=cudaq.parallel.mpi` to the `cudaq.observe()` call. With this keyword argument, CUDA-Q will distribute the simulation over the GPUs available in a job.
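The idea of splitting an observable across workers can be illustrated without CUDA-Q. The sketch below is a conceptual illustration only and does not claim to reproduce how CUDA-Q partitions terms internally. It assumes a Hamiltonian made of `I`/`Z` Pauli strings and uses the fact that the circuit in this notebook (a Hadamard followed by a chain of CNOTs) prepares a GHZ state, for which such a string has expectation 1 when it contains an even number of `Z` operators and 0 otherwise. The helper names `ghz_expectation`, `split_round_robin`, and `partial_sum` are hypothetical.

```python
def ghz_expectation(pauli_string):
    """Expectation of an I/Z Pauli string in the GHZ state (|0...0> + |1...1>)/sqrt(2)."""
    n_z = pauli_string.count("Z")
    # |0...0> contributes +1 and |1...1> contributes (-1)**n_z, each with weight 1/2
    return 1.0 if n_z % 2 == 0 else 0.0


def split_round_robin(terms, n_workers):
    """Assign term k to worker k % n_workers."""
    return [terms[r::n_workers] for r in range(n_workers)]


def partial_sum(terms):
    """Weighted sum of term expectations for one worker's share."""
    return sum(coeff * ghz_expectation(pauli) for coeff, pauli in terms)


# Toy Hamiltonian: (coefficient, Pauli string) pairs, illustrative values only
terms = [(0.5, "ZZI"), (-1.0, "ZII"), (0.25, "III"), (2.0, "IZZ"), (0.75, "ZZZ")]

chunks = split_round_robin(terms, n_workers=2)
partials = [partial_sum(chunk) for chunk in chunks]  # each worker's contribution
distributed_total = sum(partials)                    # reduction step
serial_total = partial_sum(terms)                    # single-worker reference

print(partials, distributed_total, serial_total)
```

Because the expectation value is linear in the terms, the partial sums can be computed independently and added at the end, which is why a Hamiltonian with many terms is a natural candidate for this kind of parallelization.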

For CUDA-Q to distribute the simulation, a few prerequisites must be met. First, the job needs to run on instances that provide multiple GPUs. To achieve this, specify an instance type that has more than one GPU (e.g., ml.p3.8xlarge). If the GPUs on a single instance are not enough, you can extend the parallelization to multiple nodes by setting `instanceCount` to a value larger than 1. Second, add the hyperparameter `sagemaker_mpi_enabled=True` to the job, which initializes the job environment to support parallelization with MPI. Third, select a CUDA-Q backend that supports distribution (e.g., the `nvidia` backend with the `mqpu` option). Finally, initialize the MPI interface in your CUDA-Q code. The code snippet below provides an example of all these steps.
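When choosing the instance type and `instanceCount`, it can help to work out how many GPUs a configuration provides. The following is a small planning sketch, not part of the Braket or CUDA-Q APIs. The GPU counts in `GPUS_PER_INSTANCE` are assumptions taken from the published specifications of the P3 instance family (this notebook only states that ml.p3.2xlarge has one GPU), and the helper names are hypothetical.

```python
import math

# Assumed GPU counts per instance type; verify against the current AWS documentation
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def total_gpus(instance_type, instance_count=1):
    """Total number of GPUs available to a job with this configuration."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def instance_count_for(instance_type, gpus_wanted):
    """Smallest instanceCount that provides at least gpus_wanted GPUs."""
    return math.ceil(gpus_wanted / GPUS_PER_INSTANCE[instance_type])


print(total_gpus("ml.p3.8xlarge", instance_count=1))
print(instance_count_for("ml.p3.8xlarge", gpus_wanted=8))
```

The value returned by `instance_count_for` is what you would pass as `instanceCount` in `InstanceConfig` if a single instance does not provide enough GPUs.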

In [97]:
@hybrid_job(
    device="local:nvidia/nvidia-mqpu",
    instance_config=InstanceConfig(instanceType="ml.p3.8xlarge", instanceCount=1),
    image_uri=image_uri,
)
def parallel_observables_gpu_job(
    n_qubits,
    n_terms,
    n_shots,
    sagemaker_mpi_enabled=True,
):
    import cudaq

    # Define target
    cudaq.set_target("nvidia", option="mqpu")
    print("CUDA-Q backend: ", cudaq.get_target())
    print("num_available_gpus: ", cudaq.num_available_gpus())
    cudaq.set_random_seed(2024)

    # Initialize MPI and view the MPI properties
    cudaq.mpi.initialize()
    num_ranks = cudaq.mpi.num_ranks()
    rank = cudaq.mpi.rank()
    print(f"rank={rank} | MPI is initialized? {cudaq.mpi.is_initialized()}")
    print(f"rank={rank}, num_ranks={num_ranks}")

    # Define circuit and observables
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(n_qubits)
    kernel.h(qubits[0])
    for i in range(1, n_qubits):
        kernel.cx(qubits[0], qubits[i])

    hamiltonian = cudaq.SpinOperator.random(n_qubits, n_terms, seed=2024)

    # Parallelize circuit simulation
    t0 = time.time()
    result = cudaq.observe(
        kernel, hamiltonian, shots_count=n_shots, execution=cudaq.parallel.mpi
    )
    t1 = time.time()
    total_time = t1 - t0
    print(f"rank={rank} | result: {result.expectation()} | time: {total_time}")

    # End the MPI interface
    cudaq.mpi.finalize()

    if rank == 0:
        return {"expectation": result.expectation(), "total_time": total_time}

Skipping python version validation, make sure versions match between local environment and container.


When the `parallel_observables_gpu_job` function is called, it creates a job that distributes the simulation.

In [None]:
parallel_obs_job = parallel_observables_gpu_job(n_qubits=27, n_terms=2000, n_shots=1000)
print("Job ARN: ", parallel_obs_job.arn)

In [None]:
parallel_obs_result = parallel_obs_job.result()
print(
    f"result: {parallel_obs_result['expectation']} | time: {parallel_obs_result['total_time']}"
)

## Parallelize the simulation of a batch of circuits
In this section, we show an example of parallelizing the simulation of a circuit batch over multiple GPUs. First, we import the function `parametric_random_circuit_generator_factory` from "random_circuits.py", which provides a generator of random parametric circuits.

In [27]:
from random_circuits import parametric_random_circuit_generator_factory

In this example, the circuit batch is formed by a single parametric circuit with many different sets of parameters. To assign a parameter set to a particular GPU, use the `qpu_id` keyword in the `cudaq.observe_async()` call. For example, to assign a simulation to the GPU with index 5, set `qpu_id=5`.
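The job in this section cycles through the GPUs with `qpu_id = i % num_gpus`. The plain-Python sketch below shows which circuit indices that rule sends to each GPU. It does not call CUDA-Q, and the helper name `assign_round_robin` is hypothetical.

```python
from collections import defaultdict


def assign_round_robin(n_circuits, num_gpus):
    """Map each qpu_id to the list of circuit indices it receives."""
    assignment = defaultdict(list)
    for i in range(n_circuits):
        qpu_id = i % num_gpus  # same rule as in the job function
        assignment[qpu_id].append(i)
    return dict(assignment)


assignment = assign_round_robin(n_circuits=10, num_gpus=4)
for qpu_id, circuits in sorted(assignment.items()):
    print(f"qpu_id={qpu_id}: circuits {circuits}")
```

With this rule the per-GPU workloads differ by at most one circuit, which keeps the GPUs evenly loaded when all circuits have a similar cost.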

In [28]:
@hybrid_job(
    device="local:nvidia/nvidia-mqpu",
    instance_config=InstanceConfig(instanceType="ml.p3.8xlarge"),
    include_modules="random_circuits",
    image_uri=image_uri,
)
def parallel_batch_gpu_job(
    n_qubits, n_terms, n_shots, n_gates=100, n_circuits=128, sagemaker_mpi_enabled=True
):
    import cudaq

    # Define target
    cudaq.set_target("nvidia", option="mqpu")
    print("CUDA-Q backend: ", cudaq.get_target())

    # Initialize MPI and view the MPI properties
    cudaq.mpi.initialize()
    num_ranks = cudaq.mpi.num_ranks()
    rank = cudaq.mpi.rank()
    print(f"rank={rank} | MPI is initialized? {cudaq.mpi.is_initialized()}")
    print(f"rank={rank}, num_ranks={num_ranks}")

    # Define parametric circuit and observables
    hamiltonian = cudaq.SpinOperator.random(n_qubits, n_terms)
    get_parametric_random_circuit = parametric_random_circuit_generator_factory()
    parametric_circuit, n_params = get_parametric_random_circuit(n_qubits, n_gates)

    # Run parallel execution
    num_gpus = int(num_ranks)
    asyncresults = []
    t0 = time.time()  # start timing before the asynchronous launches
    for i in range(n_circuits):
        qpu_id = i % num_gpus
        params = np.random.uniform(0, np.pi, size=n_params)
        asyncresults.append(
            cudaq.observe_async(
                parametric_circuit,
                hamiltonian,
                params,
                shots_count=n_shots,
                qpu_id=qpu_id,
            )
        )

    # Wait for all asynchronous simulations to finish
    _ = [res.get() for res in asyncresults]
    t1 = time.time()

    # End the MPI interface
    cudaq.mpi.finalize()

    if rank == 0:
        return t1 - t0

Skipping python version validation, make sure versions match between local environment and container.


When the `parallel_batch_gpu_job` function is called, it creates a job that distributes the simulation of the circuit batch.

In [None]:
n_qubits = 15
n_gates = 100
n_terms = 20
n_shots = 500
n_circuits = 128

parallel_batch_job = parallel_batch_gpu_job(
    n_qubits,
    n_terms,
    n_shots,
    n_gates,
    n_circuits,
)
print("Job ARN: ", parallel_batch_job.arn)

Currently, `observe_async()` only supports distribution over GPUs on the same node, so the `qpu_id` needs to be consistent with the number of GPUs of a single instance used in the job. However, if you wish to distribute the circuit batch over multiple nodes, you can manually assign different circuit batches to different nodes with the following MPI logic:
```
ngpu_per_node = ... # number of gpus per node
circuit_batch_0 = ... # circuit batch for node 0
circuit_batch_1 = ... # circuit batch for node 1

# Node index of this rank, assuming ranks are numbered node by node
node_id = cudaq.mpi.rank() // ngpu_per_node
if node_id == 0:
    for i, circuit in enumerate(circuit_batch_0):
        qpu_id = i % ngpu_per_node  # GPU index local to this node
        cudaq.observe_async(circuit, hamiltonian, shots_count=n_shots, qpu_id=qpu_id)
elif node_id == 1:
    for i, circuit in enumerate(circuit_batch_1):
        qpu_id = i % ngpu_per_node  # GPU index local to this node
        cudaq.observe_async(circuit, hamiltonian, shots_count=n_shots, qpu_id=qpu_id)
```
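The integer division `cudaq.mpi.rank() // ngpu_per_node` in the logic above identifies the node a rank belongs to. The sketch below spells out that arithmetic in plain Python. It assumes one MPI rank per GPU and ranks numbered contiguously node by node, which is an assumption about the job layout rather than something guaranteed here. The helper name `locate_rank` is hypothetical.

```python
def locate_rank(rank, ngpu_per_node):
    """Return (node index, local GPU index) for a global MPI rank."""
    node = rank // ngpu_per_node      # which node the rank lives on
    local_gpu = rank % ngpu_per_node  # GPU index within that node
    return node, local_gpu


ngpu_per_node = 4  # illustrative value
layout = {rank: locate_rank(rank, ngpu_per_node) for rank in range(8)}
for rank, (node, local_gpu) in layout.items():
    print(f"rank={rank} -> node={node}, local GPU={local_gpu}")
```

Under these assumptions, ranks 0 to 3 map to node 0 and ranks 4 to 7 map to node 1, and the local GPU index always stays below `ngpu_per_node`, which is the range that `qpu_id` must respect.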

## Summary
This notebook shows you how to parallelize the evaluation of a batch of observables and the simulation of a batch of circuits across multiple GPUs. If you have workloads with a Hamiltonian that has many terms or a large number of circuits to evaluate, you can use parallelization to speed up the simulation of your workloads.