# ACCL Collectives
In a system with more than one ACCL-enabled FPGA, we can execute MPI-like collectives (scatter, gather, broadcast, reductions, etc.). This notebook illustrates how to initialize ACCL instances and run collectives. Usually, each ACCL instance runs in a separate process on a distinct compute node in a network, but for demonstration purposes we use multithreading in a single process to create and operate multiple ACCL instances.
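
For readers less familiar with MPI-style collectives, the sketch below models three of them on the host with plain NumPy. It is only a reference for the data movement, not ACCL code: the helper names `broadcast`, `scatter` and `gather`, the `world` list, and the choice of `float32` data are all assumptions made for illustration.

```python
import numpy as np

def broadcast(buffers, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [buffers[root].copy() for _ in buffers]

def scatter(buffers, root):
    """The root's buffer is split into equal chunks, one chunk per rank."""
    return np.split(buffers[root], len(buffers))

def gather(buffers):
    """The destination rank sees all buffers concatenated in rank order."""
    return np.concatenate(buffers)

# Four hypothetical ranks; rank r holds [10*r, 10*r + 1, 10*r + 2, 10*r + 3].
world = [np.arange(4, dtype=np.float32) + 10 * rank for rank in range(4)]

print(broadcast(world, root=2)[0])   # what rank 0 holds after a broadcast from rank 2
print(scatter(world, root=1))        # one single-element chunk per rank
print(gather(world))                 # 16 elements in rank order
```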

## Initializing ACCL emulator/simulator instances
We assume that a simulator or emulator session has been started with the appropriate number of ACCL instances (see ACCL documentation). Our application creates ACCL interfaces, each connecting to one ACCL instance in the simulator or emulator.

In [None]:
from pyaccl import accl

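WORLD_SIZE = 4
RXBUF_SIZE = 16*1024
# Set to True to target hardware instead of the simulator/emulator
RUN_ON_HARDWARE = False
XCLBIN = "axis3x.xclbin"

# When running on hardware, WORLD_SIZE must not exceed 3
assert not RUN_ON_HARDWARE or WORLD_SIZE <= 3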

accl_instances = []
for i in range(WORLD_SIZE):
    if RUN_ON_HARDWARE:
        accl_instances.append(accl(WORLD_SIZE, i, bufsize=RXBUF_SIZE, xclbin=XCLBIN, cclo_idx=i))
    else:
        accl_instances.append(accl(WORLD_SIZE, i, bufsize=RXBUF_SIZE, sim_mode=True))

## Creating ACCL buffers
With the ACCL interfaces ready, we can allocate buffers in the memory of each instance. For each instance we allocate one source buffer and one result buffer, and fill the source buffer with floating-point data.

In [None]:
COUNT = 1000

op0_buffers = []  # source buffers, one per ACCL instance
res_buffers = []  # result buffers, one per ACCL instance
for i in range(WORLD_SIZE):
    op0_buffers.append(accl_instances[i].allocate((COUNT,)))
    res_buffers.append(accl_instances[i].allocate((COUNT,)))
    # Fill the source buffer with 0.0, 1.0, ..., COUNT-1
    op0_buffers[i][:] = [1.0*j for j in range(COUNT)]

## Run an all-reduce collective
We are now ready to execute collectives. Since collectives require communication between the ACCL instances, we must start the collective on all instances in parallel, which we do here with one thread per instance. Each thread executes an all-reduce sum collective.

In [None]:
import threading
from pyaccl import ACCLReduceFunctions
import numpy as np

def allreduce(n):
    accl_instances[n].allreduce(op0_buffers[n], res_buffers[n], COUNT, ACCLReduceFunctions.SUM)

threads = []
for i in range(WORLD_SIZE):
    threads.append(threading.Thread(target=allreduce, args=(i,)))
    threads[i].start()
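
To see why every participant needs its own thread, the following minimal sketch imitates a blocking collective with nothing but the standard library. It does not use ACCL: `fake_allreduce`, `WORLD`, and the `threading.Barrier` standing in for inter-instance communication are assumptions made for illustration. Each call waits until all ranks have arrived, so calling the function for one rank after another from a single thread would stall on the first call until the timeout expires.

```python
import threading

WORLD = 4  # hypothetical number of ranks, mirroring WORLD_SIZE
barrier = threading.Barrier(WORLD)
contributions = [None] * WORLD
results = [None] * WORLD

def fake_allreduce(rank, value):
    # Publish this rank's contribution, then wait for all peers to do the same.
    contributions[rank] = value
    barrier.wait(timeout=5)  # raises BrokenBarrierError if a peer never arrives
    # Past the barrier, every contribution is present and can be reduced.
    results[rank] = sum(contributions)

workers = [threading.Thread(target=fake_allreduce, args=(r, r + 1))
           for r in range(WORLD)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(results)  # every rank holds the same sum
```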

## Check results
All-reduce should leave, in every result buffer, the element-wise sum of the source buffers of all ACCL instances involved in the collective. After waiting for each thread to finish, we compare the all-reduce output with the expected output, element by element, to make sure this is the case.

In [None]:
for i in range(WORLD_SIZE):
    threads[i].join()
    assert np.isclose(res_buffers[i], sum(op0_buffers)).all()
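
The expected values can also be worked out on the host without any ACCL instance. The sketch below is a NumPy-only reference under two assumptions: that every source buffer holds 0.0 to COUNT-1 as in the buffer-creation cell above, and that the data is `float32` (the dtype is an assumption made here, not something this notebook states). Because all sources are identical, the all-reduce sum is simply WORLD_SIZE times one source buffer.

```python
import numpy as np

WORLD_SIZE = 4
COUNT = 1000

# Host-side stand-ins for the source buffers (dtype is an assumption).
sources = [np.arange(COUNT, dtype=np.float32) for _ in range(WORLD_SIZE)]

# Reference all-reduce sum: element-wise sum across all ranks.
expected = np.sum(sources, axis=0)

# With identical sources, this equals WORLD_SIZE * [0, 1, ..., COUNT-1].
assert np.allclose(expected, WORLD_SIZE * np.arange(COUNT))
print(expected[:4], expected[-1])
```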

## De-Initialize ACCL instances
The `deinit()` function clears all internal data structures in the ACCL instance.

In [None]:
for i in range(WORLD_SIZE):
    accl_instances[i].deinit()