# Setting Options of the Compilation Backend

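This notebook shows how to configure the compilation backend: how to choose the semiring used to evaluate a circuit, and how to enable folding and layer optimizations to make the compiled circuit faster.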

Let's start by creating a simple symbolic circuit.

In [80]:
from cirkit.templates import circuit_templates

symbolic_circuit = circuit_templates.image_data(
    (1, 28, 28),                # The shape of the image, i.e., (num_channels, image_height, image_width)
    region_graph='quad-graph',  # Select the structure of the circuit to follow the QuadGraph region graph
    input_layer='categorical',  # Use Categorical distributions for the pixel values (0-255) as input layers
    num_input_units=4,          # Each input layer consists of 4 Categorical input units
    sum_product_layer='tucker', # Use Tucker sum-product layers, i.e., alternate dense sum layers and Kronecker product layers
    num_sum_units=4,            # Each dense sum layer consists of 4 sum units
    sum_weight_param='softmax'  # Parameterize the weights of dense sum layers with 'softmax'
)

## The Pipeline Context object

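A pipeline context stores the backend and the compilation flags used to turn a symbolic circuit into an executable one. The flags set here are the semiring, which specifies how sum and product layers are evaluated, and the `fold` and `optimize` switches, which are covered later in this notebook.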

In [81]:
# Set random seeds and the torch device
import random
import numpy as np
import torch

# Set some seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the torch device to use
device = torch.device('cpu')

In [82]:
from cirkit.pipeline import PipelineContext

ctx = PipelineContext(
    backend='torch',      # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',   # Specify how to evaluate sum and product layers
                          # In this case we use the numerically-stable LogSumExp-Sum semiring (R, +, *),
                          # where: + is the log-sum-exp operation, and * is the sum operation.
    fold=False,           # Disable folding (for now)
    optimize=False,       # Disable layer optimizations (for now)
)

Next, we compile our symbolic circuit.

In [83]:
%%time
circuit = ctx.compile(symbolic_circuit)

# Alternative way to compile a circuit using a Pipeline Context:
#
#from cirkit.pipeline import compile
#
#with ctx:
#    circuit = compile(symbolic_circuit)
#

CPU times: user 818 ms, sys: 6.66 ms, total: 824 ms
Wall time: 812 ms


Because the circuit was compiled with the 'lse-sum' semiring, sums are evaluated as log-sum-exp operations and products as sums, so the values the circuit outputs are in log-space.
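The 'lse-sum' semiring can be illustrated with a small sketch that does not depend on the library. It evaluates one sum unit over two product units, first in the linear domain and then in the log domain, where products become sums and sums become log-sum-exp. The helper `logsumexp` and all the numbers below are made up for illustration; this is not how the backend implements the semiring.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical sum unit with two weights, each applied to a product of two inputs
weights = [0.3, 0.7]
inputs = [(0.2, 0.5), (0.4, 0.1)]

# Linear domain: ordinary sums and products
linear_out = sum(w * a * b for w, (a, b) in zip(weights, inputs))

# Log domain: the product becomes a sum, the sum becomes log-sum-exp
log_out = logsumexp([
    math.log(w) + math.log(a) + math.log(b)
    for w, (a, b) in zip(weights, inputs)
])

# Why bother: multiplying many small probabilities underflows to zero,
# whereas adding their logarithms stays finite
tiny = 1e-5
linear_product = 1.0
for _ in range(1000):
    linear_product *= tiny
log_product = 1000 * math.log(tiny)

print(linear_out, math.exp(log_out))
print(linear_product, log_product)
```

Exponentiating `log_out` recovers the linear-domain result, while the last two values show the underflow that working in log-space avoids.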

## Optimizing your Circuit

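The backend offers two ways to speed up a compiled circuit: folding, which fuses layers that can be evaluated in parallel, and layer optimizations, which replace patterns of layers with more efficient ones.

Let's start by benchmarking the unoptimized circuit we compiled above.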

In [84]:
%%timeit
batch = torch.randint(256, size=(256, 1, 784), device=device)
circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)

CPU times: user 2.14 s, sys: 109 ms, total: 2.25 s
Wall time: 608 ms


Why would you disable optimizations? Doing so is useful for debugging: (1) there is a one-to-one correspondence between the layers of the symbolic circuit and those of the compiled PyTorch circuit, and (2) you can easily retrieve the inputs to each layer and inspect them.

In [100]:
print(circuit)

TorchCircuit(
  (0): TorchCategoricalLayer(
    folds: 784  channels: 1  variables: 1  output-units: 4
    input-shape: (784, 1, -1, 1)
    output-shape: (784, -1, 4)
    (probs): TorchParameter(
      shape: (784, 4, 1, 256)
      (0): TorchTensorParameter(output-shape: (784, 4, 1, 256))
      (1): TorchSoftmaxParameter(
        input-shapes: [(784, 4, 1, 256)]
        output-shape: (784, 4, 1, 256)
      )
    )
  )
  (1): TorchTuckerLayer(
    folds: 784  arity: 2  input-units: 4  output-units: 4
    input-shape: (784, 2, -1, 4)
    output-shape: (784, -1, 4)
    (weight): TorchParameter(
      shape: (784, 4, 16)
      (0): TorchTensorParameter(output-shape: (784, 4, 16))
      (1): TorchSoftmaxParameter(
        input-shapes: [(784, 4, 16)]
        output-shape: (784, 4, 16)
      )
    )
  )
  (2): TorchTuckerLayer(
    folds: 392  arity: 2  input-units: 4  output-units: 4
    input-shape: (392, 2, -1, 4)
    output-shape: (392, -1, 4)
    (weight): TorchParameter(
      shape: (

### Folding your Circuit

The circuit has many layers, but many of them can be evaluated in parallel and can therefore be fused together. This is called folding.

In [85]:
print(f'Number of layers: {len(circuit.layers)}')

Number of layers: 4163
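To make the idea of folding concrete, here is a toy sketch that groups layers by a signature and counts how many fused layers would result. It is a simplification under stated assumptions, not the library's folding algorithm: the `ToyLayer` class, the `fold` function, the grouping key (depth, kind, shape), and the layer counts other than 784 are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyLayer:
    kind: str     # e.g. "categorical", "kronecker", "dense"
    shape: tuple  # (num_input_units, num_output_units)
    depth: int    # distance from the inputs in the circuit

def fold(layers):
    """Group layers that share depth, kind, and shape (assumed foldable together)."""
    groups = defaultdict(list)
    for layer in layers:
        groups[(layer.depth, layer.kind, layer.shape)].append(layer)
    return groups

# A made-up circuit fragment: one input layer per pixel, then product and sum layers
layers = [ToyLayer("categorical", (1, 4), 0) for _ in range(784)]
layers += [ToyLayer("kronecker", (4, 16), 1) for _ in range(392)]
layers += [ToyLayer("dense", (16, 4), 2) for _ in range(392)]

folded = fold(layers)
print(f"Layers before folding: {len(layers)}")
print(f"Layers after folding:  {len(folded)}")
for (depth, kind, shape), members in sorted(folded.items()):
    print(f"depth={depth} kind={kind} shape={shape} folds={len(members)}")
```

Each group can then be evaluated with a single batched tensor operation over a leading fold dimension instead of one call per layer.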


Let's now enable folding as follows.

In [86]:
ctx = PipelineContext(
    backend='torch',      # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',   # Use numerically-stable LogSumExp-Sum semiring
    fold=True,            # <---- Enable folding
    optimize=False,       # Disable layer optimizations (for now)
)

In [87]:
%%time
circuit = ctx.compile(symbolic_circuit)

CPU times: user 942 ms, sys: 19.8 ms, total: 961 ms
Wall time: 948 ms


Note that compilation took more time. Let's now check the number of layers.

In [88]:
print(f'Number of layers: {len(circuit.layers)}')

Number of layers: 26


We now have far fewer layers. They have not disappeared; they have been fused together. In fact, the first layer has many folds.

In [89]:
first_folded_layer = next(circuit.topological_ordering())
print(f'Type of the first layer: {first_folded_layer.__class__.__name__}')
print(f'Number of folded layers within the first layer: {first_folded_layer.num_folds}')

Type of the first layer: TorchCategoricalLayer
Number of folded layers within the first layer: 784


All Categorical layers have been fused into a single one, which drastically improves efficiency.

In [90]:
%%timeit
batch = torch.randint(256, size=(256, 1, 784), device=device)
circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)

CPU times: user 421 ms, sys: 284 ms, total: 705 ms
Wall time: 197 ms


Comparing the wall times reported above (608 ms without folding, 197 ms with folding), we have achieved a speedup of roughly 3x.

### Optimizing the Circuit Layers

The layers of the circuit we have built can be optimized further. For instance, it contains Kronecker layers, which are expensive memory-wise.

In [91]:
print([layer.__class__.__name__ for layer in circuit.layers])

['TorchCategoricalLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchMixingLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchMixingLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchMixingLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchMixingLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchKroneckerLayer', 'TorchDenseLayer', 'TorchMixingLayer']


We can optimize this circuit by fusing each Kronecker product layer with the dense layer that follows it (see the EiNets paper). This and other optimizations are performed automatically, regardless of the circuit structure: we just need to enable the `optimize` flag.

In [95]:
ctx = PipelineContext(
    backend='torch',      # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',   # Specify how to evaluate sum and product layers
                          # In this case we use the numerically-stable LogSumExp-Sum semiring (R, +, *),
                          # where: + is the log-sum-exp operation, and * is the sum operation.
    fold=True,            # Enable folding
    optimize=True,        # <---- Enable layer optimizations
)

In [96]:
%%time
circuit = ctx.compile(symbolic_circuit)

CPU times: user 833 ms, sys: 14.8 ms, total: 848 ms
Wall time: 833 ms


Compilation took a comparable amount of time (833 ms of wall time here). If we look at the layers, we find that the Kronecker and dense layers have been replaced by Tucker layers, which are much more efficient.

In [97]:
print([layer.__class__.__name__ for layer in circuit.layers])

['TorchCategoricalLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchMixingLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchMixingLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchMixingLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchMixingLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchMixingLayer']
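The fusion of a Kronecker product layer and a dense layer into a single Tucker layer can be sketched in plain Python. The functions `kronecker`, `dense`, and `tucker` and the example values are hypothetical, and the computation is done in the linear domain for readability. The sketch only shows that the two formulations give the same output, while the fused one never stores the intermediate vector of size 16.

```python
def kronecker(a, b):
    """Kronecker product of two vectors, flattened: entry j * len(b) + k is a[j] * b[k]."""
    return [x * y for x in a for y in b]

def dense(weight, x):
    """Dense sum layer: one weighted sum of x per row of weight."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def tucker(weight, a, b):
    """Fused layer: out[i] = sum over j, k of weight[i][j * len(b) + k] * a[j] * b[k]."""
    n = len(b)
    return [
        sum(row[j * n + k] * a[j] * b[k] for j in range(len(a)) for k in range(n))
        for row in weight
    ]

# Two inputs with 4 units each, and a weight matrix with 4 output units and 4 * 4 inputs
a = [0.1, 0.2, 0.3, 0.4]
b = [0.5, 0.6, 0.7, 0.8]
weight = [[(i + 1) * (c + 1) / 100 for c in range(16)] for i in range(4)]

unfused = dense(weight, kronecker(a, b))  # materializes a vector of 16 entries
fused = tucker(weight, a, b)              # contracts directly with the weights

print(unfused)
print(fused)
```

With batched tensors, the intermediate Kronecker output grows with the batch size and the square of the number of units, which is why avoiding it saves memory.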


Let's now benchmark our circuit.

In [99]:
%%timeit
batch = torch.randint(256, size=(256, 1, 784), device=device)
circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)

159 ms ± 8.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


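Compared with the folded but unoptimized circuit (197 ms of wall time), the optimized circuit takes about 159 ms per loop, a further speedup of roughly 1.2x.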