# Introduction to CUDA-Q

CUDA-Q is the platform for hybrid quantum-classical computing built to address the challenges facing application developers and domain scientists looking to accelerate their existing applications with quantum computing.
 

CUDA-Q is open and QPU agnostic. We’re partnering with a growing list of quantum hardware companies across a broad range of qubit modalities to ensure it provides a unified platform for all hybrid quantum-classical systems, and with quantum algorithm companies and research institutions to ensure it addresses the needs of developers.
 

It includes a kernel-based programming model with both single-source C++ and Python implementations, a compiler toolchain for hybrid systems, and a standard library of quantum algorithmic primitives.


CUDA-Q integrates with today's high-performance applications and is interoperable with leading parallel programming techniques and software. It allows a domain scientist to quickly and easily move between running all or parts of their applications on the best classical computing resources, the best simulated quantum computing resources, and the best real quantum computing resources.
 

With CUDA-Q, NVIDIA is kicking off another revolution in developer accessibility to disruptive compute technologies, allowing domain scientists to seamlessly leverage quantum acceleration tightly coupled with the best of GPU supercomputing.


![alt text](What-is-CUDA-Q.png)



### GHZ Example - CPU and GPU
Let’s start with a simple example, the GHZ state, in which three or more qubits are entangled.
Here is the Python code that implements it with CUDA-Q.

In [1]:
import cudaq
@cudaq.kernel
def ghz_state(N: int):
    qubits = cudaq.qvector(N)
    h(qubits[0])
    for i in range(N - 1):
        x.ctrl(qubits[i], qubits[i + 1])
    mz(qubits)


A CUDA-Q kernel is a function that can be compiled and run on various devices. The kernel here takes the number of qubits as a parameter, applies a Hadamard gate to the first qubit, entangles the qubits with a chain of controlled-X gates, and measures them all using mz.

In [6]:
cudaq.set_target("qpp-cpu")
n = 3
print("Preparing GHZ state for", n, "qubits.")
counts = cudaq.sample(ghz_state,n)
print(counts)

Preparing GHZ state for 3 qubits.
{ 000:483 111:517 }



The kernel is sampled multiple times (1000 shots by default).
For n=3, each shot measures either all qubits as 0 or all qubits as 1.
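
To see why only the all-zeros and all-ones bitstrings show up, here is a minimal plain-Python sketch of the same circuit on a state vector. It does not use CUDA-Q; the helper names (`apply_h`, `apply_cx`, `ghz_probabilities`) and the bit ordering (qubit 0 as the leftmost bit) are assumptions made for this illustration only.

```python
import math
import random
from collections import Counter


def apply_h(state, target, n):
    """Apply a Hadamard gate to qubit `target` of an n-qubit state vector."""
    s = 1 / math.sqrt(2)
    mask = 1 << (n - 1 - target)  # sketch convention: qubit 0 is the leftmost bit
    new = list(state)
    for i in range(len(state)):
        if not i & mask:
            a, b = state[i], state[i | mask]
            new[i] = s * (a + b)
            new[i | mask] = s * (a - b)
    return new


def apply_cx(state, control, target, n):
    """Apply a controlled-X: flip `target` wherever `control` is 1."""
    cmask = 1 << (n - 1 - control)
    tmask = 1 << (n - 1 - target)
    new = list(state)
    for i in range(len(state)):
        if i & cmask:
            new[i] = state[i ^ tmask]
    return new


def ghz_probabilities(n):
    """Return the non-zero measurement probabilities of an n-qubit GHZ state."""
    state = [0.0] * (2 ** n)
    state[0] = 1.0  # start in |00...0>
    state = apply_h(state, 0, n)
    for i in range(n - 1):
        state = apply_cx(state, i, i + 1, n)
    return {
        format(i, f"0{n}b"): abs(a) ** 2
        for i, a in enumerate(state)
        if abs(a) > 1e-12
    }


probs = ghz_probabilities(3)
print(probs)

# Draw 1000 shots from the distribution, mimicking what sampling does.
rng = random.Random(0)
shots = rng.choices(list(probs), weights=list(probs.values()), k=1000)
counts = Counter(shots)
print(dict(counts))
```

Only two of the eight amplitudes are non-zero, each with probability one half, so the sampled counts split roughly evenly between the two bitstrings.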

The previous example runs within seconds on a CPU, but what if we want to use more qubits?
On a CPU, this would result in an out-of-memory error or hours of runtime, but by setting the target to “nvidia” we can run the algorithm on a GPU with up to 29 qubits.

In [7]:
cudaq.set_target("nvidia")
n = 29
print("Preparing GHZ state for", n, "qubits.")
counts = cudaq.sample(ghz_state,n)
print(counts)

Preparing GHZ state for 29 qubits.
{ 00000000000000000000000000000:500 11111111111111111111111111111:500 }



### Scaling Circuit Simulation over Multiple GPUs
The exponential scaling of the state vector requires pooling GPU memory to simulate systems of ~32 qubits or more.
CUDA-Q provides the ‘nvidia-mgpu’ target for this.
The figure below shows how far we were able to scale a GHZ state preparation on A100 GPUs with 40 GB of memory each.

![alt text](Qubits-GPUs.jpg)
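
The exponential memory growth can be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes 8 bytes per amplitude (single-precision complex), counts a 40 GB GPU as 40 × 10^9 bytes, ignores any simulator workspace overhead, and assumes the GPU count is a power of two. These are assumptions for illustration, not measured CUDA-Q limits, and the function names are hypothetical.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory needed to store 2**n complex amplitudes (assumed 8 bytes each)."""
    return (2 ** num_qubits) * bytes_per_amplitude


def min_gpus(num_qubits, gpu_memory_bytes=40 * 10**9, bytes_per_amplitude=8):
    """Smallest power-of-two GPU count whose pooled memory holds the state vector."""
    needed = state_vector_bytes(num_qubits, bytes_per_amplitude)
    gpus = 1
    while gpus * gpu_memory_bytes < needed:
        gpus *= 2
    return gpus


for n in (29, 32, 33, 36, 40):
    gib = state_vector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:,.0f} GiB, at least {min_gpus(n)} GPU(s)")
```

Each additional qubit doubles the memory requirement, so the number of GPUs needed also doubles once a single device is full.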

### Parallelization over Multiple QPUs

![alt text](Hamiltonian.jpg)

A Hamiltonian is a function that describes the total energy of a dynamic system as a sum of terms. It is used in many fields to study how systems evolve over time. While calculating the Hamiltonian terms can be computationally intensive, the terms can be computed in parallel.


### Hamiltonian example
In this example, we create a random Hamiltonian using SpinOperator.random, which takes the number of qubits and the number of terms.


In [14]:
import cudaq

cudaq.set_target("nvidia-mqpu")
qubit_count = 15
term_count = 10000

@cudaq.kernel
def kernel(count: int):
    qubits = cudaq.qvector(count)
    h(qubits[0])
    for i in range(1, count):
        x.ctrl(qubits[0], qubits[i])

hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count)

expectation = cudaq.observe(kernel, hamiltonian, qubit_count, shots_count=1000) # Single node, single GPU.
#expectation = cudaq.observe(kernel, hamiltonian, qubit_count, execution=cudaq.parallel.thread) # Single node, multi-GPU.
#expectation = cudaq.observe(kernel, hamiltonian, qubit_count, execution=cudaq.parallel.mpi) # Multi-node, multi-GPU.
print("Expectation value is: ", expectation.expectation())

Expectation value is:  4.24600000000001


The observe call calculates the expectation value of the Hamiltonian; it batches the terms and distributes them over the available QPUs/GPUs.
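
Batching works because the expectation value is linear: the expectation of a sum of terms is the sum of the per-term expectations, so batches can be evaluated independently and added at the end. The plain-Python sketch below illustrates the idea only; it is not how CUDA-Q implements it. Threads stand in for QPUs, the small Hamiltonian is made up, and the state is fixed to |00...0> so that each Pauli term has a trivial expectation value.

```python
from concurrent.futures import ThreadPoolExecutor


def term_expectation(pauli_word):
    """<00...0| P |00...0> for a Pauli word such as "ZIZ".

    On the all-zeros state, I and Z contribute +1 and X or Y contribute 0.
    """
    return 1.0 if set(pauli_word) <= {"I", "Z"} else 0.0


def batch_expectation(batch):
    """Partial sum over one batch of (coefficient, Pauli word) terms."""
    return sum(coeff * term_expectation(word) for coeff, word in batch)


def split(terms, num_batches):
    """Deal the terms round-robin into `num_batches` batches."""
    return [terms[i::num_batches] for i in range(num_batches)]


# Hypothetical 3-qubit Hamiltonian used only for this illustration.
hamiltonian = [
    (0.5, "ZZI"), (-1.2, "XIX"), (0.3, "IZZ"),
    (2.0, "III"), (0.7, "YZI"), (-0.4, "ZIZ"),
]

serial = batch_expectation(hamiltonian)
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = sum(pool.map(batch_expectation, split(hamiltonian, 3)))

print(serial, parallel)
```

The serial and batched results agree up to floating-point rounding, which is what allows the terms to be spread over several devices.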


When we have multiple QPUs and execution or queue times might be long, we can use the async versions of observe or sample to run the computations asynchronously and fetch the results after they have concluded.


In [15]:
asyncResults = []
asyncResults.append(cudaq.observe_async(kernel, hamiltonian, qubit_count, shots_count=1000)) 
print("Expectation value is: ",asyncResults[0].get().expectation())

Expectation value is:  -3.667999999999989


### Multiple QPUs x Multiple GPUs
- A combination of parallelism (multiple QPUs) and scale (multiple GPUs).
- A flexible solution for optimal GPU utilization.
- Uses the ‘remote-mqpu’ target.
- [CUDA-Q Introduces More Capabilities for Quantum Accelerated Supercomputing | NVIDIA Technical Blog](https://developer.nvidia.com/blog/cuda-quantum-introduces-more-capabilities-for-quantum-accelerated-supercomputing/)

![alt text](Multi-GPUs-QPUs.png)


We can combine both approaches.
For example, if we had 4 H100 GPUs, each with 80 GB of memory, we could define 2 QPUs that run the problem in parallel, cutting execution time in half, while each QPU is scaled over 2 GPUs and can hold a state vector of up to 160 GB.
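
The trade-off in this example can be tabulated with a short sketch. The function name `partition` is hypothetical, and the qubit estimate assumes 8 bytes per amplitude, 1 GB = 10^9 bytes, and no simulator overhead, so the actual limits will be lower.

```python
def partition(total_gpus, gpus_per_qpu, gpu_memory_gb, bytes_per_amplitude=8):
    """Return (number of QPUs, memory per QPU in GB, rough max qubits per QPU)."""
    if total_gpus % gpus_per_qpu:
        raise ValueError("total_gpus must be a multiple of gpus_per_qpu")
    qpus = total_gpus // gpus_per_qpu
    memory_gb = gpus_per_qpu * gpu_memory_gb
    # Largest n such that 2**n amplitudes fit in the pooled memory.
    amplitudes = memory_gb * 10**9 // bytes_per_amplitude
    max_qubits = amplitudes.bit_length() - 1
    return qpus, memory_gb, max_qubits


for gpus_per_qpu in (1, 2, 4):
    qpus, memory_gb, max_qubits = partition(4, gpus_per_qpu, 80)
    print(f"{qpus} QPU(s) x {gpus_per_qpu} GPU(s): "
          f"{memory_gb} GB per QPU, roughly {max_qubits} qubits")
```

More GPUs per QPU buys larger simulations, while more QPUs buys more parallel work; the middle row corresponds to the 2 QPU, 160 GB configuration described above.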

### CUDA-Q Backends
Until now, when we talked about using multiple GPUs to scale the size of the problem and multiple QPUs to run workloads in parallel, we were referring to state vector simulation.
The state vector simulator is just one of the simulator backends available in CUDA-Q.

#### Simulators
State vector
- Uses cuQuantum cuStateVec
- Limited by memory: 2^n complex amplitudes are needed to represent n qubits.
- Targets: ‘nvidia’, ‘nvidia-mgpu’, ‘nvidia-mqpu’, ‘remote-mqpu’

Tensor network
- Uses cuQuantum cuTensorNet
- Can simulate 1000s of qubits
- Works well for sparse, low-entanglement problems
- Can run on multiple GPUs
- Target: ‘tensornet’

Matrix Product State (MPS)
- Approximate tensor network method
- Target: ‘tensornet-mps’

#### QPUs
- Quantinuum
- IonQ
- IQM
- Oxford Quantum Circuits (OQC)
- More coming soon




### Variational Quantum Eigensolver (VQE)

VQE is a quantum algorithm used for quantum chemistry, quantum simulation, and optimization problems.
The goal of VQE is to determine the ground state energy of a physical system, like the H2 molecule in this example.


VQE is an iterative hybrid algorithm – the quantum part computes the expectation value of the Hamiltonian, and the classical optimizer adjusts the parameters of the ansatz.
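
The loop can be illustrated with a toy problem that is small enough to write out in plain Python. This is a sketch under simplifying assumptions, not the CUDA-Q implementation: the system is a single qubit, the Hamiltonian is just Z, the ansatz is a single Ry(theta) rotation, and the classical optimizer is plain gradient descent using the parameter-shift rule. All function names are hypothetical.

```python
import math


def energy(theta):
    """'Quantum' step: <psi|Z|psi> for |psi> = Ry(theta)|0>."""
    amp0, amp1 = math.cos(theta / 2), math.sin(theta / 2)
    return amp0 ** 2 - amp1 ** 2


def gradient(theta):
    """Parameter-shift rule: the gradient from two extra energy evaluations."""
    return 0.5 * (energy(theta + math.pi / 2) - energy(theta - math.pi / 2))


def toy_vqe(theta=0.1, learning_rate=0.4, steps=100):
    """'Classical' step: update theta to lower the energy, then repeat."""
    for _ in range(steps):
        theta -= learning_rate * gradient(theta)
    return energy(theta), theta


best_energy, best_theta = toy_vqe()
print(best_energy, best_theta)
```

The optimizer drives theta toward pi, where the state is |1> and the energy reaches the lowest eigenvalue of Z, which is -1.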


Note that CUDA-Q provides the uccsd and vqe constructs and a number of optimizers.

In [11]:
pip install openfermionpyscf

Note: you may need to restart the kernel to use updated packages.


In [13]:
import cudaq 

hydrogen_count = 2
bond_distance = 0.7474
geometry = [('H', (0, 0, i * bond_distance)) for i in range(hydrogen_count)]

molecule, data = cudaq.chemistry.create_molecular_hamiltonian(geometry, 'sto-3g', 1, 0)
electron_count = data.n_electrons
qubit_count = 2 * data.n_orbitals

@cudaq.kernel
def kernel(thetas: list[float]):
    qubits = cudaq.qvector(qubit_count)
    
    # Prepare the Hartree Fock State.
    for i in range(electron_count):
        x(qubits[i])

    # UCCSD ansatz
    cudaq.kernels.uccsd(qubits, thetas, electron_count, qubit_count)

parameter_count = cudaq.kernels.uccsd_num_parameters(electron_count, qubit_count)
optimizer = cudaq.optimizers.COBYLA()
energy, parameters = cudaq.vqe(kernel, molecule, optimizer, parameter_count=parameter_count)
print(energy)


-1.1371745102369861


### Noise Modeling - bit flip error example
The examples we looked at so far assumed perfect qubits, but the reality is that today’s NISQ quantum computers are noisy. 


Let’s see how to model noise in CUDA-Q.
We will use the density-matrix-cpu target for noisy simulations; a GPU-accelerated noisy simulator will be available soon.



In [17]:
import cudaq

cudaq.set_target("density-matrix-cpu")
noise = cudaq.NoiseModel()
bit_flip = cudaq.BitFlipChannel(1.0)
noise.add_channel('x', [0], bit_flip)

@cudaq.kernel
def kernel():
    qubit = cudaq.qubit()
    x(qubit)
    mz(qubit)

noisy_result = cudaq.sample(kernel, noise_model=noise)
noisy_result.dump()

# For comparison, we can run the simulation again without noise.
noiseless_result = cudaq.sample(kernel)
print(noiseless_result)

{ 0:1000 }
{ 1:1000 }



Here we define a noise model with a bit flip channel with a 1.0 probability, meaning a bit flip will happen every time.


We then define a kernel that performs an X gate on a qubit and then measures it, and we sample the kernel with the noise model attached.
The qubit is initialized to |0> and the X gate flips it to |1>, but the bit flip channel attached to the X gate then flips it back to |0>, so the noisy run always measures 0.


When we sample the kernel without noise, we measure 1 every time.
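
The effect of the bit flip channel can also be checked by hand on a 2×2 density matrix. The sketch below is plain Python rather than CUDA-Q and uses hypothetical helper names; it applies an ideal X gate followed by the standard bit flip channel, which leaves the state unchanged with probability 1 - p and applies X with probability p.

```python
X = [[0, 1], [1, 0]]


def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def bit_flip(rho, p):
    """Bit flip channel: rho -> (1 - p) * rho + p * X rho X."""
    flipped = matmul(matmul(X, rho), X)
    return [[(1 - p) * rho[i][j] + p * flipped[i][j] for j in range(2)]
            for i in range(2)]


def prob_zero_after_noisy_x(p):
    """Probability of measuring 0 after an X gate followed by the channel."""
    rho = [[1, 0], [0, 0]]            # start in |0><0|
    rho = matmul(matmul(X, rho), X)   # ideal X gate gives |1><1|
    rho = bit_flip(rho, p)            # noise attached to the X gate
    return rho[0][0]


for p in (0.0, 0.25, 1.0):
    print(p, prob_zero_after_noisy_x(p))
```

With p = 1.0 the probability of measuring 0 is 1, matching the noisy result above, and with p = 0.0 it is 0, matching the noiseless run.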


### Resources

We went over some basic examples of how to use CUDA-Q to accelerate quantum programs.

Here are some ways to explore further and learn more about CUDA-Q:

#### Links
- CUDA-Q repo for issues and contributions: [NVIDIA/cuda-quantum (github.com)](https://github.com/NVIDIA/cuda-quantum)
- CUDA-Q documentation: [CUDA-Q — NVIDIA CUDA-Q documentation](https://nvidia.github.io/cuda-quantum/latest/index.html)
- Quantum computing technical blogs: [Tag: Quantum Computing | NVIDIA Technical Blog](https://developer.nvidia.com/blog/tag/quantum-computing/)
- CUDA-Q marketing page: [CUDA-Q for Hybrid Quantum-Classical Computing | NVIDIA Developer](https://developer.nvidia.com/cuda-quantum)

#### Documentation references
- [Quick Start](https://nvidia.github.io/cuda-quantum/latest/using/quick_start.html)
- [Multi-GPU Workflows](https://nvidia.github.io/cuda-quantum/latest/examples/python/tutorials/multi_gpu_workflows.html)
- [Simulator backends](https://nvidia.github.io/cuda-quantum/latest/using/simulators.html)
- [Hardware backends](https://nvidia.github.io/cuda-quantum/latest/using/hardware.html)
- [Python code examples](https://nvidia.github.io/cuda-quantum/latest/using/python.html)
- [C++ code examples](https://nvidia.github.io/cuda-quantum/latest/using/cpp.html#introduction)
- [Applications](https://nvidia.github.io/cuda-quantum/latest/using/tutorials.html)


