# Setting Options of the Compilation Backend

We explore the options that can be specified when compiling a symbolic circuit. See the notebook [learning-a-circuit.ipynb](./learning-a-circuit.ipynb) for more details about symbolic circuit representations and their compilation. Currently, symbolic circuits can only be compiled using a PyTorch 2+ backend, which lets you specify a few options: the semiring that defines how sums and products are evaluated, and two flags that control optimizations. Future versions of ``cirkit`` may include compilation backends other than PyTorch, each with its own set of features and compilation options. However, the philosophy of ``cirkit`` is to decouple the design of circuits and their operators from the underlying implementation and deep learning library dependencies. This makes it possible to connect different platforms and compiler toolchains without affecting the rest of the library.

We start by instantiating a symbolic circuit for image data, as shown in the following code. Note that this is completely disentangled from the compilation step and the compilation options we explore next.

In [1]:
from cirkit.templates import circuit_templates

symbolic_circuit = circuit_templates.image_data(
    (1, 28, 28),                # The shape of the image, i.e., (num_channels, image_height, image_width)
    region_graph='quad-graph',  # Select the structure of the circuit to follow the QuadGraph region graph
    input_layer='categorical',  # Use Categorical distributions for the pixel values (0-255) as input layers
    num_input_units=64,         # Each input layer consists of 64 Categorical input units
    sum_product_layer='tucker', # Use Tucker sum-product layers, i.e., alternate dense sum layers and kronecker product layers
    num_sum_units=64,           # Each dense sum layer consists of 64 sum units
    sum_weight_param=circuit_templates.Parameterization(
        activation='softmax',   # Parameterize the sum weights by using a softmax activation
        initialization='normal' # Initialize the sum weights by sampling from a standard normal distribution
    )
)

## The Pipeline Context object

The most important object we introduce in this notebook is the **pipeline context**, which allows you to specify the compilation backend, as well as compilation options. Since we will use the PyTorch backend, we first set some random seeds and the device to use.

In [2]:
# Set random seeds and the torch device
import random
import numpy as np
import torch

# Set some seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the torch device to use
device = torch.device('cuda')

In the next code snippet, we show how to instantiate a pipeline context using the PyTorch backend.

In [3]:
from cirkit.pipeline import PipelineContext

ctx = PipelineContext(
    backend='torch',  # Use the PyTorch backend with default compilation flags
)

By using this pipeline context, we can compile symbolic circuits as shown in the following code.

In [4]:
circuit = ctx.compile(symbolic_circuit)

An alternative way to compile circuits using a pipeline context is by combining the ``with`` statement and the ``compile`` function, as shown below.

In [5]:
from cirkit.pipeline import compile

with ctx:
    circuit = compile(symbolic_circuit)
    # Many circuits can possibly be compiled using the same pipeline context
    ...

The PyTorch backend allows you to specify three compilation options: (1) the **semiring** that specifies how to evaluate sum and product layers, (2) **whether to fold** the circuit computational graph so as to better exploit parallel architectures like GPUs, and (3) **whether to optimize** the layers and their parameters by enabling a number of optimization rules. Below, we discuss each of these compilation options.

## (1) Choosing a Semiring

By default, the semiring used is the usual one defined over the reals (called ``sum-product``), i.e., the semiring $(\mathbb{R},+,\times)$, where $\mathbb{R}$ is the field of real numbers, and $+$ and $\times$ are the usual sum and product over the reals. Another popular semiring is the _log-sum-exp and sum_ semiring (called ``lse-sum``), which ensures numerical stability by performing computations "in log-space". The ``lse-sum`` semiring is defined as $(\mathbb{R},\oplus,\otimes)$, where $\oplus$ is the log-sum-exp operation and $\otimes$ is the sum. By specifying ``lse-sum`` as the semiring, sum layers compute log-sum-exp operations while product layers compute sums, hence avoiding numerical issues such as underflows. A third available semiring is the ``complex-lse-sum`` semiring, which extends the ``lse-sum`` semiring to the complex numbers, i.e., $(\mathbb{C},\oplus,\otimes)$, by making use of the complex extensions of logarithms and exponentials. This semiring is particularly useful to ensure numerical stability in the case of circuits with negative parameters.
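
To make the difference between these semirings concrete, here is a small standalone sketch in plain Python (it does not use ``cirkit``, and the helper names ``lse`` and ``complex_lse`` are our own). It evaluates a tiny mixture of products once in the ``sum-product`` semiring, where the result underflows, and once in log-space, and then shows how a complex logarithm can represent a negative weight.

```python
import cmath
import math


def lse(values):
    """Numerically stable log-sum-exp over real log-values."""
    pivot = max(values)
    return pivot + math.log(sum(math.exp(v - pivot) for v in values))


def complex_lse(values):
    """Log-sum-exp over complex log-values, shifted by the largest real part."""
    pivot = max(v.real for v in values)
    return pivot + cmath.log(sum(cmath.exp(v - pivot) for v in values))


# A product of 400 small probabilities, mixed with weights 0.5 and 0.5
probs = [1e-3] * 400
weights = [0.5, 0.5]

# sum-product semiring: the product underflows to exactly 0.0 in float64
linear_value = sum(w * math.prod(probs) for w in weights)

# lse-sum semiring: products become sums of logs, sums become log-sum-exp
log_product = sum(math.log(p) for p in probs)
log_value = lse([math.log(w) + log_product for w in weights])

# Complex log-space: log(-0.3) = log(0.3) + i*pi, so negative weights are representable
signed_weights = [0.7, -0.3]
inputs = [0.5, 0.5]
log_terms = [cmath.log(w) + cmath.log(x) for w, x in zip(signed_weights, inputs)]
signed_value = cmath.exp(complex_lse(log_terms)).real  # Equals 0.7*0.5 - 0.3*0.5 up to rounding
```

Here ``linear_value`` is zero even though the true value is positive, while ``log_value`` is a perfectly representable negative number.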

In the following code, we instantiate a pipeline context by specifying the ``lse-sum`` semiring.

In [6]:
ctx = PipelineContext(
    backend='torch',      # Use the PyTorch backend
    # Specify the backend compilation flags next
    # ---- Specify how to evaluate sum and product layers ---- #
    semiring='lse-sum',   # Use the numerically stable 'lse-sum' semiring, in which sums are
                          # evaluated as log-sum-exp and products are evaluated as sums.
    # -------------------------------------------------------- #
)

Next, we compile the circuit using this pipeline context.

In [7]:
%%time
circuit = ctx.compile(symbolic_circuit)

CPU times: user 4.54 s, sys: 1.07 s, total: 5.61 s
Wall time: 5.54 s


In [8]:
circuit.to(device);  # Move the compiled circuit parameters to the chosen device

Since we have chosen the ``lse-sum`` semiring, we expect the compiled circuit to output log-probabilities rather than probabilities. We can quickly check this by evaluating the circuit on some input and observing that the outputs are negative (i.e., they are log-likelihoods).

In [9]:
batch = torch.randint(256, size=(1, 1, 784), device=device)
circuit(batch).item()

-4358.77685546875
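
As a rough sanity check on the magnitude of this number, we can compare it with the log-likelihood that a uniform distribution over 256 values per pixel would assign to any image. This is only back-of-the-envelope arithmetic in plain Python, under the assumption that a freshly initialized circuit is not far from uniform.

```python
import math

num_pixels = 28 * 28                  # 784 variables, one per pixel
num_values = 256                      # Each pixel takes a value in 0-255

# Log-likelihood of any image under independent uniform pixels
uniform_ll = num_pixels * math.log(1.0 / num_values)

# Output of the compiled circuit reported above
observed_ll = -4358.77685546875

# Average log-likelihood per pixel, to compare with log(1/256)
observed_per_pixel = observed_ll / num_pixels
uniform_per_pixel = math.log(1.0 / num_values)

print(f'Uniform baseline: {uniform_ll:.2f}')
print(f'Observed:         {observed_ll:.2f}')
print(f'Per pixel:        {observed_per_pixel:.3f} vs {uniform_per_pixel:.3f}')
```

The two values are of the same order, which is consistent with the output being a log-likelihood rather than a probability.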

In the next section of this notebook, we enable a couple of compilation flags that will speed up the feed-forward evaluation of a circuit. However, why would someone disable the optimizations in the first place? The answer is that disabling optimizations is great for debugging purposes. In fact, the PyTorch backend ensures a one-to-one correspondence between the layers in the symbolic circuit representation and the compiled layers, if no optimizations are enabled, thus simplifying debugging operations such as verifying the correctness of inputs and outputs of _each_ layer separately.

Before proceeding to the next section, we benchmark the feed-forward evaluation of the circuit compiled with the default options, as it will serve as a reference when we enable **folding** and **other optimizations**.

In [10]:
%%timeit
batch = torch.randint(256, size=(128, 1, 784), device=device)
circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)

1.24 s ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## (2) Folding your Circuit

Circuits typically have layers that can be evaluated independently of each other. Therefore, we can exploit powerful parallel architectures like GPUs to parallelize the computation of such layers. Enabling folding as a compilation option _fuses_ layers of the same type (e.g., Kronecker product layers) that can be evaluated in parallel. By doing so, we obtain a much more efficient computational graph in PyTorch, with a negligible overhead in terms of compilation speed.
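
The idea of grouping independent layers can be illustrated with a toy example in plain Python. This is a simplified sketch and not the folding algorithm used by the PyTorch backend: it groups layers only by their depth and type, whereas a real implementation would also have to check that the shapes of the fused layers agree. The ``Layer`` tuple and the ``fold`` function are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Layer(NamedTuple):
    name: str
    kind: str
    inputs: Tuple[str, ...]  # Names of the layers feeding this one


def fold(layers):
    """Group layers that share the same depth and kind (layers must be topologically ordered)."""
    depth = {}
    groups = defaultdict(list)
    for layer in layers:
        # Input layers have depth 0; any other layer sits one level above its deepest input
        depth[layer.name] = 1 + max((depth[i] for i in layer.inputs), default=-1)
        groups[(depth[layer.name], layer.kind)].append(layer.name)
    # Each group could be evaluated as a single batched operation
    return [(kind, names) for (_, kind), names in sorted(groups.items())]


layers = [
    Layer('x0', 'categorical', ()), Layer('x1', 'categorical', ()),
    Layer('x2', 'categorical', ()), Layer('x3', 'categorical', ()),
    Layer('k0', 'kronecker', ('x0', 'x1')), Layer('k1', 'kronecker', ('x2', 'x3')),
    Layer('s0', 'sum', ('k0',)), Layer('s1', 'sum', ('k1',)),
    Layer('k2', 'kronecker', ('s0', 's1')),
    Layer('s2', 'sum', ('k2',)),
]

folded = fold(layers)
print(f'Number of layers before folding: {len(layers)}')
print(f'Number of layers after folding:  {len(folded)}')
```

In this toy circuit, the four input layers end up in a single group, in the same spirit as the input layers of the compiled circuit being fused into one folded layer.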

To initialize a pipeline context that enables folding, we simply need to specify ``fold=True``.

In [11]:
ctx = PipelineContext(
    backend='torch',      # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',   # Use the 'lse-sum' semiring
    # --------- Enable circuit folding ---------- #
    fold=True,
    # ------------------------------------------- #
)

Next, we compile the same symbolic circuit and obtain a folded circuit.

In [12]:
%%time
folded_circuit = ctx.compile(symbolic_circuit)

CPU times: user 4.64 s, sys: 875 ms, total: 5.52 s
Wall time: 5.45 s


In [13]:
folded_circuit.to(device);  # Move the compiled circuit parameters to the chosen device

Note that the compilation took a similar amount of time as the compilation with the default options shown above. In addition, we compare the number of layers of the "unfolded" circuit with the number of layers of the "folded" circuit.

In [14]:
print(f'Number of layers (fold=False): {len(circuit.layers)}')
print(f'Number of layers (fold=True):  {len(folded_circuit.layers)}')

Number of layers (fold=False): 4163
Number of layers (fold=True):  26


The "folded" circuit has far fewer layers, since many of them have been fused together. For example, we can check that the first layer of the circuit computing Categorical likelihoods consists of many folds, as many as the number of variables modelling MNIST images.

In [15]:
folded_layer = next(folded_circuit.topological_ordering())
print(f'Type of the input folded layer: {folded_layer.__class__.__name__}')
print(f'Number of folded layers within it: {folded_layer.num_folds}')

Type of the input folded layer: TorchCategoricalLayer
Number of folded layers within it: 784


As we see in the next code snippet, enabling folding provides an approximately **21x speed-up** for feed-forward circuit evaluations (58.9 ms versus 1.24 s per batch).

In [16]:
%%timeit
batch = torch.randint(256, size=(128, 1, 784), device=device)
folded_circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)

58.9 ms ± 20.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## (3) Optimizing the Circuit Layers

Some circuits have layers and parameterizations whose evaluation can be optimized. Enabling optimizations in a pipeline context tells the compiler to try matching a number of optimization patterns defined over the layers of the circuit. If an optimization pattern matches, then the compiler performs a number of operations to optimize the circuit structure.

A simple example of an optimizable circuit structure is one that alternates Kronecker product layers with dense sum layers. The symbolic circuit we have built already has this kind of structure, as we have specified the ``tucker`` sum-product layer. We can verify this by inspecting the types of the layers of the folded circuit we have compiled above.

In [17]:
print([layer.__class__.__name__ for layer in folded_circuit.layers])

['TorchCategoricalLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer']


In this case, we can fuse each Kronecker layer and the dense sum layer that follows it into a single layer, which we call a Tucker layer, that performs the same computation using an efficient ``einsum`` tensorized operation. This optimization is why probabilistic circuit architectures like [EinsumNetworks](https://arxiv.org/abs/2004.06231) are much more efficient. The PyTorch backend currently supports many other optimization rules as well.
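
The equivalence behind this fusion can be checked on a tiny example. The sketch below is written in plain Python in the usual ``sum-product`` semiring (not in log-space), and the function names are hypothetical. It compares materializing the Kronecker product and then applying a dense layer against contracting a three-way weight tensor directly, i.e., computing $\sum_{i,j} W_{oij} x_i y_j$ for each output unit $o$.

```python
def kronecker_then_dense(flat_weights, x, y):
    """Materialize the Kronecker product of x and y, then apply a dense sum layer."""
    z = [xi * yj for xi in x for yj in y]  # Vector of size len(x) * len(y)
    return [sum(w * zk for w, zk in zip(row, z)) for row in flat_weights]


def tucker(weights, x, y):
    """Contract weights[o][i][j] with x[i] and y[j] without building the Kronecker product."""
    return [
        sum(w_o[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y)))
        for w_o in weights
    ]


x = [1, 2]
y = [3, 4]
# Weight tensor with 2 output units, indexed as weights[o][i][j]
weights = [
    [[1, 0], [0, 1]],
    [[2, 1], [1, 0]],
]
# The same weights flattened to shape (2, 4), as a dense layer over the Kronecker product expects
flat_weights = [[w for row in w_o for w in row] for w_o in weights]

two_step = kronecker_then_dense(flat_weights, x, y)
fused = tucker(weights, x, y)
print(two_step, fused)
```

Both routes give the same outputs; the fused form simply avoids storing the intermediate vector whose size grows quadratically with the number of units.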

The next piece of code shows how to enable optimizations in a pipeline context (i.e., specify ``optimize=True``).

In [18]:
ctx = PipelineContext(
    backend='torch',      # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',   # Use the 'lse-sum' semiring
    fold=True,            # Enable circuit folding
    # -------- Enable layer optimizations -------- #
    optimize=True,
    # -------------------------------------------- #
)

Next, we compile the same symbolic circuit and obtain an optimized circuit.

In [19]:
%%time
optimized_circuit = ctx.compile(symbolic_circuit)

CPU times: user 5.06 s, sys: 771 ms, total: 5.83 s
Wall time: 5.76 s


In [20]:
optimized_circuit.to(device);  # Move the compiled circuit parameters to the chosen device

Note that the compilation took only slightly longer than for the folded circuit (5.76 s versus 5.45 s of wall time). Moreover, if we look at the list of layers, we observe that some of them are now Tucker layers, which can be evaluated much more efficiently.

In [21]:
print([layer.__class__.__name__ for layer in optimized_circuit.layers])

['TorchCategoricalLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer']


Finally, we benchmark the optimized circuit compiled in this way.

In [22]:
%%timeit
batch = torch.randint(256, size=(128, 1, 784), device=device)
optimized_circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)

25.7 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Note that we achieved an approximately **2.3x speed-up** compared to the folded circuit compiled above (25.7 ms versus 58.9 ms per batch), and an approximately **48x speed-up** compared to the circuit compiled with no folding and no optimizations (25.7 ms versus 1.24 s per batch).
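
For reference, the speed-up factors can be recomputed directly from the mean timings printed by the three ``%%timeit`` cells above. The snippet below is plain arithmetic on those reported numbers; timings will of course differ on other hardware.

```python
# Mean time per loop reported by the benchmarks above, in seconds
baseline_time = 1.24       # semiring='lse-sum' only
folded_time = 58.9e-3      # fold=True
optimized_time = 25.7e-3   # fold=True, optimize=True

fold_speedup = baseline_time / folded_time
optimize_speedup = folded_time / optimized_time
total_speedup = baseline_time / optimized_time

print(f'Folding vs. baseline:   {fold_speedup:.1f}x')
print(f'Optimized vs. folded:   {optimize_speedup:.1f}x')
print(f'Optimized vs. baseline: {total_speedup:.1f}x')
```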