# Introduction to CUDA-Q

## Agenda

### A- Quantum Circuit Basics

A.1- Qubit allocation

A.2- Quantum gates

A.3- Quantum kernel

A.4- Backends & running CUDA-Q programs

A.5- Examples

### B- Quantum algorithmic primitives

B.1- cudaq.sample()

- Mid-circuit measurement & conditional sampling

B.2- cudaq.observe()

- Spin Hamiltonian operator
- Expectation values

B.3- Asynchronous execution

### C- Exercises

### A- Quantum circuit basics

![img](./basic-circuit.png)

### A.1- Qubit allocation

- cudaq.qubit(): a single quantum bit (2-level) in the discrete quantum memory space. 

```qubit=cudaq.qubit()```

- cudaq.qvector(N): a register of N qubits (a $2^N$-dimensional state space) in the discrete quantum memory space.

```qubits=cudaq.qvector(N)```

- Qubits in a qvector are indexed serially:
```qubits[0], qubits[1], ..., qubits[N-1]```
    
- Each qubit is initialized to the |0> computational basis state.

- A qvector owns its quantum memory, so it cannot be copied or moved (no-cloning theorem). It can be passed by reference (i.e., as a reference to the qubit vector).
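To make the $2^N$ scaling concrete, the following sketch builds the list of amplitudes that N freshly allocated qubits correspond to. It uses plain Python rather than CUDA-Q, and the helper name `zero_state` is hypothetical and exists only for this illustration.

```python
def zero_state(num_qubits: int) -> list[complex]:
    """Amplitudes of |00...0> for `num_qubits` qubits (illustrative helper, not a CUDA-Q API)."""
    dim = 2 ** num_qubits      # one amplitude per computational basis state
    state = [0j] * dim
    state[0] = 1 + 0j          # all probability sits on |00...0>
    return state

for n in (1, 2, 10):
    print(n, "qubit(s) ->", len(zero_state(n)), "amplitudes")
```

The exponential growth of this list is why state vector simulation becomes memory-bound as the number of qubits increases.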

### A.2- Quantum gates


- x: Not gate (Pauli-X gate)

```python
q=cudaq.qubit()
x(q)
```
- h: Hadamard gate

```python
q=cudaq.qvector(2)
h(q[0])
```

- x.ctrl(control, target) or x.ctrl([control_1, control_2], target): controlled-NOT (CNOT) gate; passing a list of controls gives a multi-controlled NOT

```python
q=cudaq.qvector(3)
x.ctrl(q[0],q[1])
```

- rx(angle, qubit): rotation around x-axis
```python
q=cudaq.qubit()
rx(np.pi,q)
```

- adj: adjoint transformation
```python
q=cudaq.qubit()
rx(np.pi,q)
rx.adj(np.pi,q)
```

- mz: measure qubits in the computational basis

```python
q=cudaq.qvector(2)
h(q[0])
x.ctrl(q[0],q[1])
mz(q)
```
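As a rough illustration of what the single-qubit gates above do to a state, the sketch below multiplies the standard 2x2 gate matrices with the amplitudes of |0> in plain Python. It does not use CUDA-Q; the helper names `apply` and `rx` are hypothetical, and the matrices are the textbook definitions of these gates.

```python
import math

def apply(gate, state):
    """Multiply a 2x2 gate matrix with a single-qubit state [a0, a1]."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]

def rx(theta):
    """Textbook matrix of a rotation by `theta` around the x-axis."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]

X = [[0, 1], [1, 0]]
r = 1 / math.sqrt(2)
H = [[r, r], [r, -r]]

zero = [1 + 0j, 0j]
flipped = apply(X, zero)   # X|0> = |1>
plus = apply(H, zero)      # H|0> = (|0> + |1>)/sqrt(2)

# The adjoint of rx(theta) is rx(-theta), so this pair cancels out,
# mirroring the rx / rx.adj snippet above.
restored = apply(rx(-math.pi), apply(rx(math.pi), zero))

print(flipped, plus, restored)
```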


To learn more about the quantum operations available in CUDA-Q, visit [this page](https://nvidia.github.io/cuda-quantum/latest/api/default_ops.html)

### A.3- Quantum kernel

- To differentiate between host and quantum device code, the CUDA-Q programming model defines the concept of a quantum kernel.

- All quantum kernels must be annotated to indicate they are to be compiled for, and executed on, a specified quantum coprocessor. 

- Each language binding uses its own language features for this annotation (e.g. a `@cudaq.kernel()` function decorator in Python and `__qpu__` in C++).

- A quantum kernel can take classical data as input.

``` python
@cudaq.kernel()
def my_first_entry_point_kernel(x : float):
   ... quantum code ... 

@cudaq.kernel()
def my_second_entry_point_kernel(x : float, params : list[float]):
   ... quantum code ... 

```

- CUDA-Q kernels can be passed as input to other quantum kernels and invoked from within the receiving kernel's body.


```python
import typing
import cudaq

@cudaq.kernel()
def StatePrep(qubits : cudaq.qview):
    ... apply state prep operations on qubits ... 

@cudaq.kernel()
def GenericAlgorithm(statePrep : typing.Callable[[cudaq.qview], None]):
    q = cudaq.qvector(10)
    statePrep(q)
    ...

# Pass the state-preparation kernel as an argument to the generic algorithm
GenericAlgorithm(StatePrep)
```

- ```cudaq.qview()```: a non-owning reference to a subset of the discrete quantum memory space. It does not own its elements and can therefore be passed by value or reference. (see [this page](https://nvidia.github.io/cuda-quantum/latest/specification/cudaq/types.html#quantum-containers))

- Lists inside a quantum kernel can only be constructed with a specified size; growing an empty list with `append` is not allowed.

```python
@cudaq.kernel
def kernel(N : int):

   # Not Allowed
   # i = []
   # i.append(1)

   # Allowed
   i = [0 for k in range(5)]
   j = [0 for _ in range(N)]
   i[2] = 3
   f = [1., 2., 3.]
   k = 0
   pi = 3.1415926

```

- To learn more about the CUDA-Q quantum kernel, visit [this page](https://github.com/NVIDIA/cuda-quantum/blob/main/docs/sphinx/specification/cudaq/kernels.rst)

### A.4- Backends & running CUDA-Q programs

There are two ways to select a target:

1. Define the target when running the program:
``` python3 program.py [...] --target <target_name>```

2. Define the target in the application code:
```cudaq.set_target('target_name')```. Then run the program without the target flag: 
```python3 program.py [...]```

What is target_name?

1. State vector simulators:
    - OpenMP CPU-only (default if an NVIDIA GPU and the CUDA runtime libraries are NOT available): ```python3 program.py [...] --target qpp-cpu``` 
    - Single-GPU (default if an NVIDIA GPU and the CUDA runtime libraries are available): ```python3 program.py [...] --target nvidia```
    - Multi-node multi-GPU: ```mpirun -np 2 python3 program.py [...] --target nvidia --target-option mgpu``` 
2. Tensor network simulator:
    - Single-GPU: ```python3 program.py [...] --target tensornet``` 
    - Multi-GPU: ```mpirun -np 2 python3 program.py [...] --target tensornet``` 
3. Matrix product state (MPS) simulator:
    - Only supports single-GPU simulation: ```python3 program.py [...] --target tensornet-mps``` 
4. NVIDIA Quantum Cloud
    - Run any of the above backends using NVIDIA-provided cloud GPUs (early access only). To learn more, visit [this page](https://www.nvidia.com/en-us/solutions/quantum-computing/cloud/).
    - E.g. `cudaq.set_target('nvqc', backend='tensornet')`
5. Quantum hardware backend (to learn more, visit [this page](https://nvidia.github.io/cuda-quantum/latest/using/backends/hardware.html)):
    - ```cudaq.set_target('QPU_name')```. QPU_name could be `ionq`, `quantinuum`, `iqm`, `oqc`, etc.


To learn more about CUDA-Q backends, visit [this page](https://nvidia.github.io/cuda-quantum/latest/using/backends/backends.html)

### A.5- Examples

```python
# Single qubit example

import cudaq

# Set the backend target
cudaq.set_target('nvidia')

# We begin by defining the kernel that our program is built from.
@cudaq.kernel()
def first_kernel():
    '''
    This is our first CUDA-Q kernel.
    '''
    # Next, we can allocate a single qubit to the kernel via `qubit()`.
    qubit = cudaq.qubit()

    # Now we can begin adding instructions to apply to this qubit!
    # Here we apply several non-parameterized
    # single-qubit gates that are supported by CUDA-Q.
    h(qubit)
    x(qubit)
    y(qubit)
    z(qubit)
    s(qubit)
    t(qubit)

    # Next, we add a measurement to the kernel so that we can sample
    # the measurement results on our simulator!
    mz(qubit)

print(cudaq.draw(first_kernel))
```

```
     ╭───╮╭───╮╭───╮╭───╮╭───╮╭───╮
q0 : ┤ h ├┤ x ├┤ y ├┤ z ├┤ s ├┤ t ├
     ╰───╯╰───╯╰───╯╰───╯╰───╯╰───╯
```



```python
# Multi-qubit example

import cudaq

cudaq.set_target('nvidia')

@cudaq.kernel
def second_kernel(N:int):
    qubits=cudaq.qvector(N)

    h(qubits[0])
    
    for i in range(1, N):
        x.ctrl(qubits[0],qubits[i])
        
    z(qubits)

    mz(qubits)

print(cudaq.draw(second_kernel,3))
```

```
     ╭───╮          ╭───╮
q0 : ┤ h ├──●────●──┤ z ├
     ╰───╯╭─┴─╮  │  ├───┤
q1 : ─────┤ x ├──┼──┤ z ├
          ╰───╯╭─┴─╮├───┤
q2 : ──────────┤ x ├┤ z ├
               ╰───╯╰───╯
```



```python
# Multi-control gates example

import cudaq

cudaq.set_target('nvidia')

@cudaq.kernel
def bar(N:int):
    qubits=cudaq.qvector(N)
    
    # front(n) and back() return direct references to the
    # first n qubits and to the last qubit, respectively
    controls = qubits.front(N - 1)
    target = qubits.back()
    
    x.ctrl(controls, target)


print(cudaq.draw(bar,4))
```

```
          
q0 : ──●──
       │  
q1 : ──●──
       │  
q2 : ──●──
     ╭─┴─╮
q3 : ┤ x ├
     ╰───╯
```



### B- Quantum Algorithmic Primitives

### B.1- cudaq.sample()

`cudaq.sample()` samples the state prepared by a given quantum circuit for a specified number of shots (circuit executions).

The function takes as input a quantum kernel followed by the concrete arguments with which the kernel should be invoked.

```python
# Sampling Bell state example

import cudaq

cudaq.set_target('nvidia')

@cudaq.kernel
def bell():
    qubits=cudaq.qvector(2)

    h(qubits[0])
    x.ctrl(qubits[0], qubits[1])

    mz(qubits)

print(cudaq.draw(bell))
# Sample the state generated by bell
# shots_count: the number of kernel executions. Default is 1000
counts = cudaq.sample(bell, shots_count=10000) 

# Print to standard out
print(counts)

# Fine-grained access to the bits and counts 
for bits, count in counts.items():
    print('Observed: {}, {}'.format(bits, count))
```

```
     ╭───╮     
q0 : ┤ h ├──●──
     ╰───╯╭─┴─╮
q1 : ─────┤ x ├
          ╰───╯

{ 00:4969 11:5031 }

Observed: 00, 4969
Observed: 11, 5031
```
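The near 50/50 split in the counts is what one would expect from drawing independent shots from the ideal Bell-state distribution, where `00` and `11` each have probability 1/2. The sketch below mimics that statistical process with the Python standard library only. It is not how CUDA-Q produces samples, and `sample_bell` is a hypothetical helper introduced for this illustration.

```python
import random
from collections import Counter

def sample_bell(shots_count: int, seed: int = 0) -> Counter:
    """Draw `shots_count` bit strings from the ideal Bell-state distribution."""
    rng = random.Random(seed)              # seeded so that reruns give the same counts
    outcomes = ["00", "11"]                # the only outcomes with non-zero probability
    draws = rng.choices(outcomes, weights=[0.5, 0.5], k=shots_count)
    return Counter(draws)

counts = sample_bell(10000)
for bits, count in sorted(counts.items()):
    print(f"Observed: {bits}, {count}")
```

With 10000 shots the statistical fluctuation of each count is on the order of 50, which is consistent with the deviation from 5000 seen in the CUDA-Q output.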


```python
# Another sampling example

import cudaq

cudaq.set_target('nvidia')

@cudaq.kernel
def sampling_example(N:int, theta:list[float]):
    qubit=cudaq.qvector(N)

    h(qubit)

    for i in range(0,N//2):
        ry(theta[i],qubit[i])
    

    x.ctrl([qubit[0],qubit[1]],qubit[2]) #ccx
    x.ctrl([qubit[0],qubit[1],qubit[2]],qubit[3]) #cccx
    x.ctrl(qubit[0:3],qubit[3]) #cccx using Python slicing syntax

    mz(qubit)

params=[0.15,1.5]

print(cudaq.draw(sampling_example, 4, params))

result=cudaq.sample(sampling_example, 4, params, shots_count=5000)

print('Result: ', result)

print('Most probable bit string: ', result.most_probable())   
```

```
     ╭───╮╭──────────╮               
q0 : ┤ h ├┤ ry(0.15) ├──●────●────●──
     ├───┤├─────────┬╯  │    │    │  
q1 : ┤ h ├┤ ry(1.5) ├───●────●────●──
     ├───┤╰─────────╯ ╭─┴─╮  │    │  
q2 : ┤ h ├────────────┤ x ├──●────●──
     ├───┤            ╰───╯╭─┴─╮╭─┴─╮
q3 : ┤ h ├─────────────────┤ x ├┤ x ├
     ╰───╯                 ╰───╯╰───╯

Result:  { 0000:1 1000:3 1101:729 0100:533 1100:749 0110:556 1111:625 1110:720 1001:2 0101:540 0111:542 }

Most probable bit string:  1100
```


#### Mid-circuit measurement & conditional sampling ####

```python
# Mid-circuit measurement example

import cudaq

cudaq.set_target('nvidia')

@cudaq.kernel
def mid_circuit_m(theta:float):
    qubit=cudaq.qvector(2)
    ancilla=cudaq.qubit()

    x(qubit[0])
    
    ry(theta,ancilla)

    aux=mz(ancilla)
    
    if aux:
        x(ancilla)
    else:
        x(qubit[1])
    
    mz(ancilla)
    mz(qubit)

angle=0.5
result=cudaq.sample(mid_circuit_m, angle)
print(result)
```

```
{ 
  __global__ : { 100:84 110:916 }
   aux : { 1:84 0:916 }
}
```



- Here, the mid-circuit measurement of the ancilla qubit is stored in a register named ```aux```, after the variable it was assigned to.

- If any measurements appear in the kernel, then only the measured qubits will appear in the ```__global__``` register, and they will be sorted in qubit allocation order.

- To learn more about cudaq.sample(), visit [this page](https://nvidia.github.io/cuda-quantum/latest/specification/cudaq/algorithmic_primitives.html#cudaq-sample)

### B.2- cudaq.observe()

- A common task in variational algorithms is computing the expected value of a given observable H with respect to the state |ψ⟩ prepared by a quantum circuit: ⟨H⟩ = ⟨ψ|H|ψ⟩.
- The `cudaq.observe()` function is provided to enable one to quickly compute this expectation value via execution of the quantum circuit.
- The `cudaq.observe()` function takes a kernel, any kernel arguments, and a **spin operator** as inputs.


#### Spin Hamiltonian operator: ####

- CUDA-Q defines convenience functions in the `cudaq.spin` namespace that produce the primitive X, Y, and Z Pauli operators on specified qubit indices. These can then be combined in algebraic expressions to build up more complicated Pauli tensor products and their sums.

- For example, to define the spin Hamiltonian $H= 0.5 Z_0 + X_1 + Y_0 + Y_0 Y_1+ X_0 Y_1 Z_2 -2 Z_1 Z_2 - 2 Z_0 Z_1$:


```python
# Spin operator example

from cudaq import spin

hamiltonian = 0.5*spin.z(0) + spin.x(1) + spin.y(0) + spin.y(0) * spin.y(1)+ spin.x(0)*spin.y(1)*spin.z(2)

# add some more terms
for i in range(2):
    hamiltonian += -2.0*spin.z(i)*spin.z(i+1)

print(hamiltonian)

print('Total number of terms in the spin hamiltonian: ',hamiltonian.get_term_count())
```

```
[-2+0j] IZZ
[-2+0j] ZZI
[1+0j] XYZ
[0.5+0j] ZII
[1+0j] YII
[1+0j] IXI
[1+0j] YYI

Total number of terms in the spin hamiltonian:  7
```
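In the printed Hamiltonian, each term is written as a string with one character per qubit, and qubit 0 is the leftmost character (for instance, `ZII` is $Z_0$ and `IXI` is $X_1$). The sketch below reproduces these labels in plain Python to make the convention explicit. The helper `pauli_label` and the `terms` list are hypothetical and are not part of the `cudaq` API.

```python
def pauli_label(term: dict[int, str], num_qubits: int) -> str:
    """Render a Pauli product such as {1: 'Z', 2: 'Z'} as the string 'IZZ'.

    Qubit 0 is the leftmost character; qubits without an operator get 'I'.
    """
    return "".join(term.get(q, "I") for q in range(num_qubits))

# (coefficient, {qubit index: Pauli}) pairs for the Hamiltonian of this section
terms = [
    (0.5, {0: "Z"}),
    (1.0, {1: "X"}),
    (1.0, {0: "Y"}),
    (1.0, {0: "Y", 1: "Y"}),
    (1.0, {0: "X", 1: "Y", 2: "Z"}),
    (-2.0, {0: "Z", 1: "Z"}),
    (-2.0, {1: "Z", 2: "Z"}),
]

labels = [pauli_label(term, 3) for _, term in terms]
for (coeff, _), label in zip(terms, labels):
    print(coeff, label)
```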


#### Expectation Value ####

- The `cudaq.observe()` function returns an `ObserveResult` object. The expectation value can be obtained using the `expectation` method.
- In the example below, the observable is $H= -5.907 \, I + 2.1433 \, X_0X_1 +2.1433\, Y_0 Y_1 - 0.21829 \, Z_0 +6.125\, Z_1$


```python
# Expectation value example

# The example here shows a simple use case for the `cudaq.observe`
# function in computing expected values of provided spin hamiltonian operators.

import cudaq
from cudaq import spin

cudaq.set_target('nvidia')

qubit_num=2

@cudaq.kernel
def init_state(qubits:cudaq.qview):
    n=qubits.size()
    for i in range(n):
        x(qubits[i])

@cudaq.kernel
def observe_example(theta: float):
    qvector = cudaq.qvector(qubit_num)

    init_state(qvector)
    ry(theta, qvector[1])
    x.ctrl(qvector[1], qvector[0])


spin_operator = -5.907 + 2.1433 * spin.x(0) * spin.x(1) + 2.1433 * spin.y(
    0) * spin.y(1) - .21829 * spin.z(0) + 6.125 * spin.z(1)

# Pre-computed angle that minimizes the energy expectation of the `spin_operator`.
angle = 0.59

energy = cudaq.observe(observe_example, spin_operator, angle).expectation()
print(f"Energy is {energy}")
```

```
Energy is -13.562794135947076
```
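As a cross-check of what an expectation value means, the sketch below evaluates ⟨ψ|H|ψ⟩ for the same observable by hand, in plain Python and without CUDA-Q. It rests on two assumptions that are not stated in the original: bit q of a basis index stores the value of qubit q, and the state prepared by `observe_example` is $\cos(\theta/2)\,|q_0{=}0, q_1{=}1\rangle - \sin(\theta/2)\,|q_0{=}1, q_1{=}0\rangle$, which follows from applying the three gate layers to |00> with the textbook `ry` matrix. All helper names are hypothetical.

```python
import math

def apply_pauli(term, state):
    """Apply a Pauli product {qubit: 'X'|'Y'|'Z'} to a state vector."""
    out = [0j] * len(state)
    for index, amp in enumerate(state):
        target, phase = index, 1 + 0j
        for q, pauli in term.items():
            bit = (index >> q) & 1
            if pauli in ("X", "Y"):
                target ^= 1 << q                    # X and Y flip the bit
            if pauli == "Y":
                phase *= 1j if bit == 0 else -1j    # Y|0> = i|1>, Y|1> = -i|0>
            if pauli == "Z":
                phase *= -1 if bit else 1           # Z|1> = -|1>
        out[target] += phase * amp
    return out

def expectation(hamiltonian, state):
    """<psi|H|psi> for H given as (coefficient, Pauli product) pairs."""
    total = 0j
    for coeff, term in hamiltonian:
        h_psi = apply_pauli(term, state)
        total += coeff * sum(a.conjugate() * b for a, b in zip(state, h_psi))
    return total.real

def ansatz_state(theta):
    """Assumed output state of `observe_example` (derived by hand, see lead-in)."""
    state = [0j] * 4
    state[0b10] = math.cos(theta / 2)     # qubit 1 = 1, qubit 0 = 0
    state[0b01] = -math.sin(theta / 2)    # qubit 1 = 0, qubit 0 = 1
    return state

hamiltonian = [
    (-5.907, {}),                         # identity term
    (2.1433, {0: "X", 1: "X"}),
    (2.1433, {0: "Y", 1: "Y"}),
    (-0.21829, {0: "Z"}),
    (6.125, {1: "Z"}),
]

energy = expectation(hamiltonian, ansatz_state(0.59))
print(f"Energy is {energy:.4f}")
```

Under these assumptions the hand calculation agrees with the value returned by `cudaq.observe()` to about four decimal places.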


### B.3- Asynchronous execution

- Executing quantum circuits on actual hardware can involve long queuing times.
- Simulation can also be computationally intensive.
- The algorithmic primitives `cudaq.sample()` and `cudaq.observe()` have asynchronous versions, `cudaq.sample_async()` and `cudaq.observe_async()`.
- Asynchronous primitives immediately return an asynchronous result object. The actual result can be obtained with its `get()` method.
- If the result is not ready yet, `get()` blocks until it is.

```python
# Asynchronous sampling example

import cudaq

cudaq.set_target('nvidia')

@cudaq.kernel
def asynchronous_example(N:int, theta:list[float]):
    qubit=cudaq.qvector(N)

    h(qubit)

    for i in range(0,N//2):
        ry(theta[i],qubit[i])
    

    x.ctrl([qubit[0],qubit[1]],qubit[2]) #ccx
    x.ctrl([qubit[0],qubit[1],qubit[2]],qubit[3]) #cccx
    x.ctrl(qubit[0:3],qubit[3]) #cccx using Python slicing syntax

    mz(qubit)

params=[0.15,1.5]

print(cudaq.draw(asynchronous_example, 4, params))

async_result=cudaq.sample_async(asynchronous_example, 4, params, shots_count=5000)
print("Sampling triggered on the GPU...")

# In the meantime, let us do calculations on the CPU
print("Doing some CPU work...")
total = 0
for i in range(10000):
    total+=i
print("CPU work is done!")

print("Waiting for sampling result...")
# Now let's check the result
result = async_result.get()

print('Result: ', result)

print('Most probable bit string: ', result.most_probable())   
```

```
     ╭───╮╭──────────╮               
q0 : ┤ h ├┤ ry(0.15) ├──●────●────●──
     ├───┤├─────────┬╯  │    │    │  
q1 : ┤ h ├┤ ry(1.5) ├───●────●────●──
     ├───┤╰─────────╯ ╭─┴─╮  │    │  
q2 : ┤ h ├────────────┤ x ├──●────●──
     ├───┤            ╰───╯╭─┴─╮╭─┴─╮
q3 : ┤ h ├─────────────────┤ x ├┤ x ├
     ╰───╯                 ╰───╯╰───╯

Sampling triggered on the GPU...
Doing some CPU work...
CPU work is done!
Waiting for sampling result...
Result:  { 0000:1 0011:1 0100:548 1100:730 0110:512 1111:733 1110:695 0101:545 1101:711 0111:524 }

Most probable bit string:  1111
```
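The submit-then-`get()` pattern is the same one used by futures in ordinary Python. As a loose analogy only (it says nothing about how CUDA-Q implements its asynchronous primitives), the sketch below uses `concurrent.futures` from the standard library. The function `slow_sampling` is a hypothetical stand-in for a long-running job, and `future.result()` plays the role of `get()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_sampling(shots_count: int) -> dict[str, int]:
    """Stand-in for a long-running sampling job (hypothetical, no quantum code)."""
    time.sleep(0.2)                               # pretend to wait in a queue
    half = shots_count // 2
    return {"00": half, "11": shots_count - half}

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_sampling, 1000)     # returns immediately
    host_total = sum(range(10000))                # host work overlaps with the job
    counts = future.result()                      # blocks until the job is done

print(host_total, counts)
```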


### C- Exercises

1. Write a quantum kernel that prepares the [Greenberger–Horne–Zeilinger state (GHZ state)](https://en.wikipedia.org/wiki/Greenberger%E2%80%93Horne%E2%80%93Zeilinger_state) for an arbitrary number $N$ of qubits.\
   $\left| \operatorname{GHZ}\right> = \frac{1}{\sqrt{2}} \left( \left|00\dots0\right> + \left|11\dots1\right> \right)$ \
   Draw the corresponding quantum circuit for 5 qubits.
2. Produce 10000 samples from the above GHZ state, and confirm that bitstrings $00\dots0$ and $11\dots1$ are indeed measured an (almost) equal number of times.
3. Calculate the expectation value of the operators $H_x = X_0 X_1\dots X_{N-1}$ and $H_z = Z_0 Z_1\dots Z_{N-1}$ for the above GHZ state.
4. Repeat the expectation value calculations using the async observe primitive.

Solutions are available in the `solutions` directory, but you are encouraged to try solving these for yourself first!