# Distributed state vector simulations on multiple GPUs (advanced)

In the notebook "3_Multiple_GPU_simulations.ipynb", you learned how to use CUDA-Q and Braket Hybrid Jobs to parallelize the simulation of a batch of observables and circuits over multiple GPUs, where each GPU simulates a single QPU. For workloads with larger qubit counts, however, it may be necessary to distribute a single state vector simulation across multiple GPUs so that multiple GPUs together simulate a single QPU.

In this notebook, you will learn how to use CUDA-Q and Braket Hybrid Jobs to distribute a single state vector simulation across multiple GPUs.
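As a rough illustration of why larger qubit counts outgrow a single GPU, the sketch below estimates the memory needed to hold a full state vector of `2**n` complex amplitudes. The helper names are hypothetical, the 8 bytes per amplitude (single-precision complex) is an assumption, and the even split across GPUs is a simplification that ignores any working memory the simulator needs.

```python
def state_vector_bytes(n_qubits: int, bytes_per_amplitude: int = 8) -> int:
    """Bytes needed to store 2**n_qubits complex amplitudes.

    bytes_per_amplitude=8 assumes single-precision complex numbers;
    pass 16 for double precision.
    """
    return (2 ** n_qubits) * bytes_per_amplitude


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2 ** 30


def per_gpu_gib(n_qubits: int, n_gpus: int, bytes_per_amplitude: int = 8) -> float:
    """State vector share per GPU, assuming an even split (a simplification)."""
    return gib(state_vector_bytes(n_qubits, bytes_per_amplitude)) / n_gpus


for n in (25, 28, 30):
    total = gib(state_vector_bytes(n))
    print(f"{n} qubits: {total:.2f} GiB total, {per_gpu_gib(n, 4):.2f} GiB per GPU on 4 GPUs")
```

Each additional qubit doubles the memory requirement, which is why spreading the state vector over several GPUs extends the reachable qubit count by only a few qubits per doubling of GPUs.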

We start with the imports used in the examples below.

In [1]:
from braket.aws import AwsSession
from braket.jobs import hybrid_job
from braket.jobs.config import InstanceConfig
from braket.jobs.image_uris import Framework, retrieve_image

For this example, we use the CUDA-Q hybrid jobs container provided by Braket (retrieved below with `Framework.CUDAQ`), which contains both CUDA-Q and the underlying CUDA support required for distributing our computation across multiple GPUs. Note: this container image is different from the one used in the previous notebooks illustrating more basic CUDA-Q scenarios.

In [2]:
image_uri = retrieve_image(Framework.CUDAQ, AwsSession().region)

## Distributed state vector simulations
Now that we have the container image URI, we are ready to run our workload. The `nvidia` target with the `mgpu` option supports distributing state vector simulations across multiple GPUs. This enables GPU simulations for circuits with higher qubit counts, up to 34 qubits. The example below shows how to submit a job with the `mgpu` option. Note that the `ml.g4dn.12xlarge` instance has 4 GPUs, whereas the other supported `g4dn` instances have only a single GPU.

In [3]:
@hybrid_job(
    device="local:nvidia/nvidia-mgpu",
    instance_config=InstanceConfig(instanceType="ml.g4dn.12xlarge", instanceCount=1),
    image_uri=image_uri,
    distribution="mpi",
)
def distributed_gpu_job(
    n_qubits,
    n_shots,
):
    import os

    import cudaq
    
    # Check environment variables
    print("=== Environment Check ===")
    print(f"OMPI_COMM_WORLD_SIZE: {os.getenv('OMPI_COMM_WORLD_SIZE', 'Not set')}")
    print(f"OMPI_COMM_WORLD_RANK: {os.getenv('OMPI_COMM_WORLD_RANK', 'Not set')}")
    print(f"CUDA_VISIBLE_DEVICES: {os.getenv('CUDA_VISIBLE_DEVICES', 'Not set')}")
    
    cudaq.set_target("nvidia", option="mgpu")
    print("CUDA-Q backend:", cudaq.get_target())
    print("num_available_gpus:", cudaq.num_available_gpus())
    
    # Check MPI status
    print("=== MPI Status ===")
    try:
        print(f"MPI initialized: {cudaq.mpi.is_initialized()}")
        if cudaq.mpi.is_initialized():
            print(f"MPI rank: {cudaq.mpi.rank()}")
            print(f"MPI num_ranks: {cudaq.mpi.num_ranks()}")
        else:
            print("MPI not initialized")
    except Exception as e:
        print(f"MPI check failed: {e}")

    # Initialize MPI and view the MPI properties
    # cudaq.mpi.initialize()
    # rank = cudaq.mpi.rank()

    # Define circuit and observables
    @cudaq.kernel
    def ghz():
        qubits = cudaq.qvector(n_qubits)
        h(qubits[0])
        for q in range(1, n_qubits):
            cx(qubits[0], qubits[q])

    hamiltonian = cudaq.SpinOperator.random(n_qubits, 1)

    # Estimate the expectation value; with the mgpu option, the state vector
    # simulation is distributed across the available GPUs
    result = cudaq.observe(ghz, hamiltonian, shots_count=n_shots)

    # End the MPI interface
    # cudaq.mpi.finalize()

    # if rank == 0:
    return {"expectation": result.expectation()}


n_qubits = 25
n_shots = 1000
distributed_job = distributed_gpu_job(n_qubits, n_shots)
print("Job ARN: ", distributed_job.arn)


Job ARN:  arn:aws:braket:us-east-1:641737106670:job/6a72cfcf-8130-40f7-9db6-25da2ecc467e
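The job function above prints the `OMPI_COMM_WORLD_RANK` environment variable, which Open MPI sets for each process it launches. With `distribution="mpi"`, the diagnostic prints may therefore appear once per process in the job logs. A minimal sketch of restricting log output to the first rank is shown below; the helper names `mpi_rank_from_env` and `log_on_root` are hypothetical, and falling back to rank 0 when the variable is unset is an assumption made so the code also runs outside an MPI launch.

```python
import os


def mpi_rank_from_env(env=None) -> int:
    """Return the Open MPI rank from the environment, or 0 if it is not set."""
    env = os.environ if env is None else env
    return int(env.get("OMPI_COMM_WORLD_RANK", "0"))


def log_on_root(message: str, env=None) -> bool:
    """Print the message only on rank 0; return whether it was printed."""
    if mpi_rank_from_env(env) == 0:
        print(message)
        return True
    return False


# Outside an MPI launch the variable is unset, so this prints once.
log_on_root("CUDA-Q target configured")
```

Inside the job function, the same check could wrap the diagnostic `print` calls so that the logs contain one copy of each message.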


In [4]:
distributed_job_result = distributed_job.result()
print(distributed_job_result)
print(f"result: {distributed_job_result['expectation']}")

{'expectation': 0.054000000000000034}
result: 0.054000000000000034
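Because the Hamiltonian is generated randomly and the expectation value is estimated from a finite number of shots, the number above will differ from run to run. To build intuition for what is being computed, the sketch below evaluates exact expectation values of Z-type Pauli strings on a small GHZ state in plain Python. It is a local illustration only, not the method CUDA-Q uses; the function names are hypothetical, it handles only strings made of Z and identity, and the mapping of bit `q` of the basis index to qubit `q` is an arbitrary convention chosen for this sketch.

```python
import math


def ghz_state(n_qubits: int) -> list:
    """Amplitudes of (|0...0> + |1...1>) / sqrt(2) in the computational basis."""
    amplitudes = [0.0] * (2 ** n_qubits)
    amplitudes[0] = 1 / math.sqrt(2)
    amplitudes[-1] = 1 / math.sqrt(2)
    return amplitudes


def z_string_expectation(amplitudes, z_qubits) -> float:
    """Expectation value of a Pauli string with Z on the qubits in z_qubits."""
    # Bit q of the basis index stands for qubit q (a convention of this sketch).
    mask = sum(1 << q for q in z_qubits)
    total = 0.0
    for index, amp in enumerate(amplitudes):
        # A basis state contributes +1 or -1 depending on the parity of the
        # bits the Z operators act on.
        parity = bin(index & mask).count("1") % 2
        total += (-1) ** parity * abs(amp) ** 2
    return total


state = ghz_state(4)
print(z_string_expectation(state, [0]))     # approximately 0
print(z_string_expectation(state, [0, 1]))  # approximately 1
```

For a GHZ state, a Z string acting on an odd number of qubits averages to zero, while one acting on an even number of qubits gives one, since the two basis states in the superposition contribute with opposite or equal signs respectively.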


## Summary
This notebook showed how to distribute a single state vector simulation across multiple GPUs so that multiple GPUs together simulate a single QPU. If a workload has a qubit count that is too large to simulate on a single GPU, this technique can make it feasible.