# Distributed state vector simulations on multiple GPUs (advanced)

In the notebook "2_parallel_simulations.ipynb", you learned how to use CUDA-Q and Braket Hybrid Jobs to parallelize the simulation of a batch of observables and circuits over multiple GPUs, where each GPU simulates a single QPU. For workloads with larger qubit counts, however, it may be necessary to distribute a single state vector simulation across multiple GPUs, so that multiple GPUs together simulate a single QPU.

In this notebook, you will learn how to use CUDA-Q and Braket Hybrid Jobs to run such distributed state vector simulations.

We start with the imports used in the examples below.

In [None]:
import time

import numpy as np

from braket.jobs import hybrid_job
from braket.jobs.config import InstanceConfig

Next, we need to create and upload a container that includes both CUDA-Q and the underlying CUDA support required for distributing our computation across multiple GPUs. Note: this container image is different from the one used in the previous notebooks, which illustrate more basic CUDA-Q scenarios.

To do this, we run the commands in the cell below. (For more information about what these commands do, see the detailed documentation in "0_hello_cudaq_jobs.ipynb". The difference here is that we specify the Dockerfile `Dockerfile.mgpu` to ensure full support for this advanced scenario.)

In [None]:
!chmod +x container/container_build_and_push.sh
!container/container_build_and_push.sh cudaq-mgpu-job us-west-2 Dockerfile.mgpu

Now we prepare the URI of the container image. Fill in the proper values of `aws_account_id`, `region_name`, and `container_image_name` in the cell below. For example, with the shell command above, `region_name="us-west-2"` and `container_image_name="cudaq-mgpu-job"`. The cell below prints the image URI. Running a job from a container image ensures that your code runs in the same environment every time.

In [None]:
aws_account_id = "<aws-account-id>"
region_name = "<region-name>"
container_image_name = "<container-image-name>"

image_uri = f"{aws_account_id}.dkr.ecr.{region_name}.amazonaws.com/{container_image_name}:latest"
print(image_uri)
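
If you build image URIs for more than one container, a small helper can catch an unfilled placeholder before a job is submitted. The sketch below is illustrative only: `build_image_uri` is a hypothetical helper, not part of the Braket SDK, and its only validation is checking that the account ID consists of 12 digits.

```python
import re


def build_image_uri(aws_account_id, region_name, container_image_name, tag="latest"):
    """Compose an Amazon ECR image URI, rejecting an unfilled account ID."""
    # AWS account IDs are 12-digit numbers; a placeholder such as
    # "<aws-account-id>" fails this check.
    if not re.fullmatch(r"\d{12}", aws_account_id):
        raise ValueError(f"expected a 12-digit AWS account ID, got {aws_account_id!r}")
    return f"{aws_account_id}.dkr.ecr.{region_name}.amazonaws.com/{container_image_name}:{tag}"


# Example with a made-up account ID
print(build_image_uri("123456789012", "us-west-2", "cudaq-mgpu-job"))
```

The returned string has the same form as `image_uri` in the cell above.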

## Distributed state vector simulations
Now that we have the container image URI, we are ready to run our workload. The `nvidia` target with the `mgpu` option supports distributing state vector simulations across multiple GPUs. This enables GPU simulation of circuits with higher qubit counts, up to 34 qubits. The `distributed_gpu_job` function below shows how to submit a job with the `mgpu` option.
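
To see why distribution becomes necessary, it helps to estimate how much memory a state vector needs. An n-qubit state vector holds `2**n` complex amplitudes, so the memory requirement doubles with every added qubit. The sketch below is a rough estimate under stated assumptions: the helper names are hypothetical, the 8 bytes per amplitude assume single-precision complex numbers (double precision would need 16), and the 16 GiB per GPU is an illustrative figure. It also ignores any workspace the simulator needs beyond the state vector itself, so the qubit counts it prints are upper bounds.

```python
GIB = 2**30  # bytes in one gibibyte


def state_vector_bytes(n_qubits, bytes_per_amplitude=8):
    """Memory for the 2**n_qubits amplitudes of a full state vector."""
    return (2**n_qubits) * bytes_per_amplitude


def max_qubits(total_memory_bytes, bytes_per_amplitude=8):
    """Largest qubit count whose state vector fits in the given memory."""
    n_amplitudes = total_memory_bytes // bytes_per_amplitude
    # floor(log2(n_amplitudes)) computed with integer arithmetic
    return n_amplitudes.bit_length() - 1


gpu_memory = 16 * GIB  # assumed memory per GPU (illustrative)
for n_gpus in (1, 2, 4):
    print(n_gpus, "GPU(s):", max_qubits(n_gpus * gpu_memory), "qubits at most")
print("30 qubits need", state_vector_bytes(30) / GIB, "GiB")
```

Each doubling of the pooled GPU memory raises this upper bound by one qubit, which is why spreading one state vector over several GPUs extends the reachable circuit size.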

In [None]:
@hybrid_job(
    device="local:nvidia/nvidia-mgpu",
    instance_config=InstanceConfig(instanceType="ml.p3.8xlarge", instanceCount=1),
    image_uri=image_uri,
)
def distributed_gpu_job(
    n_qubits,
    n_shots,
    sagemaker_mpi_enabled=True,
):
    import cudaq

    # Define target
    cudaq.set_target("nvidia", option="mgpu")
    print("CUDA-Q backend: ", cudaq.get_target())
    print("num_available_gpus: ", cudaq.num_available_gpus())

    # Initialize MPI and get the rank of this process
    cudaq.mpi.initialize()
    rank = cudaq.mpi.rank()

    # Define circuit and observables
    @cudaq.kernel
    def ghz():
        qubits = cudaq.qvector(n_qubits)
        h(qubits[0])
        for q in range(1, n_qubits):
            cx(qubits[0], qubits[q])

    hamiltonian = cudaq.SpinOperator.random(n_qubits, 1)

    # Compute the expectation value; the state vector is distributed across the GPUs
    result = cudaq.observe(ghz, hamiltonian, shots_count=n_shots)

    # End the MPI interface
    cudaq.mpi.finalize()

    # Only the rank-0 process returns the result
    if rank == 0:
        return {"expectation": result.expectation()}


n_qubits = 25
n_shots = 1000
distributed_job = distributed_gpu_job(n_qubits, n_shots)
print("Job ARN: ", distributed_job.arn)

In [None]:
distributed_job_result = distributed_job.result()
print(f"result: {distributed_job_result['expectation']}")
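
Because the job uses a random Hamiltonian, its result cannot be compared against a fixed number. As a separate sanity check of the circuit itself, the GHZ state it prepares can be built directly for a small qubit count and tested against Pauli strings with known expectation values. The sketch below uses plain NumPy rather than CUDA-Q; the helper names `ghz_state` and `pauli_expectation` are hypothetical, and the dense matrices it builds are only practical for a handful of qubits.

```python
from functools import reduce

import numpy as np

# Single-qubit identity and Pauli matrices
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])


def ghz_state(n_qubits):
    """Return (|0...0> + |1...1>) / sqrt(2) as a dense real vector."""
    state = np.zeros(2**n_qubits)
    state[0] = state[-1] = 1 / np.sqrt(2)
    return state


def pauli_expectation(state, paulis):
    """Expectation value of a tensor product of single-qubit operators."""
    operator = reduce(np.kron, paulis)
    # The state is real, so no complex conjugate is needed here
    return float(state @ operator @ state)


state = ghz_state(3)
zz = pauli_expectation(state, [Z, Z, I2])
xxx = pauli_expectation(state, [X, X, X])
z_single = pauli_expectation(state, [Z, I2, I2])
print(zz, xxx, z_single)
```

For an ideal three-qubit GHZ state, the first two values are 1 and the last is 0, up to floating-point rounding.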

## Summary
This notebook shows you how to distribute a single state vector simulation across multiple GPUs, so that multiple GPUs together simulate a single QPU. If you have workloads with a qubit count that is too large to simulate on a single GPU, you can use this technique to make these large workloads feasible.