# ACCL Primitives
ACCL primitives are a set of simple operations that an ACCL instance can execute and assemble into larger operations such as collectives. The primitives are:
* Copy - a simple DMA operation from a local source buffer to a local destination buffer
* Combine - applying a binary elementwise operator to two source buffers and placing the result in the destination buffer
* Send - send data from a local buffer to a remote ACCL instance (equivalent to `MPI_Send`)
* Receive - receive data from a remote ACCL instance into a local buffer (equivalent to `MPI_Recv`)
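
To make the idea of assembling primitives into a collective concrete, here is a minimal host-only sketch in NumPy. It is an illustration under stated assumptions, not how ACCL implements its collectives: every `model_*` function and the `network` dictionary are hypothetical names that are not part of pyaccl, and the "network" is just an in-process dictionary. The sketch builds a two-rank sum reduction out of one send, one receive, and one combine.

```python
import numpy as np

# Hypothetical host-only model; none of these names are part of pyaccl.
network = {}  # maps (src_rank, dst_rank, tag) -> data in flight

def model_copy(src, dst, count):
    # Copy: local source buffer to local destination buffer
    dst[:count] = src[:count]

def model_combine(count, op, a, b, result):
    # Combine: binary elementwise operator applied to two source buffers
    result[:count] = op(a[:count], b[:count])

def model_send(src_rank, buf, count, dst_rank, tag=0):
    # Send: place a snapshot of the local buffer "on the wire"
    network[(src_rank, dst_rank, tag)] = buf[:count].copy()

def model_recv(dst_rank, buf, count, src_rank, tag=0):
    # Receive: take the matching message off the wire into a local buffer
    buf[:count] = network.pop((src_rank, dst_rank, tag))

def model_reduce_to_rank0(buf0, buf1):
    # A two-rank sum reduction assembled from send, recv and combine
    scratch = np.zeros_like(buf0)
    model_send(1, buf1, len(buf1), 0, tag=7)
    model_recv(0, scratch, len(scratch), 1, tag=7)
    result = np.zeros_like(buf0)
    model_combine(len(buf0), np.add, buf0, scratch, result)
    return result

rank0_data = np.arange(10, dtype=np.float32)
rank1_data = np.ones(10, dtype=np.float32)
reduced = model_reduce_to_rank0(rank0_data, rank1_data)
```

The result on rank 0 is the elementwise sum of the two input buffers, which is the behaviour of a sum reduction rooted at rank 0.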

## Initializing ACCL emulator/simulator instances
We first connect to our ACCL instances and configure them. We assume that a simulator or emulator session has been started with at least two ACCL instances connected via TCP (see ACCL documentation). We associate the instances with rank numbers 0 and 1 respectively.

In [None]:
# Set to False to connect to a running emulator/simulator session instead of hardware
RUN_ON_HARDWARE = True
XCLBIN = "axis3x.xclbin"

from pyaccl import accl

if RUN_ON_HARDWARE:
    accl0 = accl(2, 0, xclbin=XCLBIN, cclo_idx=0)
    accl1 = accl(2, 1, xclbin=XCLBIN, cclo_idx=1)
else:
    accl0 = accl(2, 0, sim_mode=True)
    accl1 = accl(2, 1, sim_mode=True)

## Copy data
We are now ready to execute primitives. Let's start with a `copy()` operation using one ACCL instance. We allocate buffers in the memory space of rank 0. The default data type is 32-bit float, and we request 10-element buffers. Initially we write different data to the source and destination buffers. After the copy, we expect both buffers to contain the same data.

In [None]:
src = accl0.allocate((10,))
dst = accl0.allocate((10,))
src[:] = [1.0*i for i in range(10)]
dst[:] = [0.0 for i in range(10)]

accl0.copy(src, dst, 10)

import numpy as np
assert np.isclose(src, dst).all()

## Sum two vectors
ACCL instances can be provided with arithmetic plugins to perform elementwise operations on vectors of data. The simplest and most common of these operations is the elementwise sum, which is the default operator in MPI reduction collectives (reduce, all-reduce, reduce-scatter). We can utilize the arithmetic plugin by calling the `combine()` function of the ACCL interface. We check by comparing with the sum as computed by NumPy.

In [None]:
operand0 = accl0.allocate((10,))
operand1 = accl0.allocate((10,))
result = accl0.allocate((10,))
operand0[:] = [1.0*i for i in range(10)]
operand1[:] = [1.0*i for i in range(10)]
result[:] = [0.0 for i in range(10)]

from pyaccl import ACCLReduceFunctions

accl0.combine(len(operand0), ACCLReduceFunctions.SUM, operand0, operand1, result)

assert np.isclose(result, operand0+operand1).all()

## Exchange data with remote ACCL instances
The `send()` and `recv()` functions initiate direct data exchange between ACCL instances. Each of these functions takes a buffer, an element count, the rank number of the remote ACCL instance, and an arbitrary integer tag number. Tags prevent confusion between send/receive pairs. Note that the `recv()` function blocks until the data has arrived from the remote peer, so in a single-threaded environment the send must always be issued before the matching receive.

In [None]:
src = accl0.allocate((10,))
dst = accl1.allocate((10,))
src[:] = [1.0*i for i in range(10)]
dst[:] = [0.0 for i in range(10)]

accl0.send(src, len(src), 1, tag=0)
accl1.recv(dst, len(dst), 0, tag=0)

assert np.isclose(src, dst).all()
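
The role of tags and the send-before-receive ordering can be sketched with a small pure-Python model. This is an assumption-laden illustration, not pyaccl code: the `Mailbox` class is hypothetical, it assumes a receive is paired with a send carrying the same source, destination, and tag, and it raises an error at the point where a real `recv()` would block.

```python
from collections import defaultdict, deque

class Mailbox:
    """Hypothetical model of tag-matched message delivery (not part of pyaccl)."""

    def __init__(self):
        # One FIFO queue of pending messages per (src, dst, tag) triple
        self.pending = defaultdict(deque)

    def send(self, src, dst, tag, payload):
        self.pending[(src, dst, tag)].append(list(payload))

    def recv(self, src, dst, tag):
        queue = self.pending[(src, dst, tag)]
        if not queue:
            # A real recv() would block here; in a single thread
            # that means the program never reaches the matching send.
            raise RuntimeError("no matching send")
        return queue.popleft()

mailbox = Mailbox()
mailbox.send(0, 1, tag=0, payload=[1.0, 2.0])
mailbox.send(0, 1, tag=1, payload=[9.0])

# Tags keep the two transfers apart even when received out of order
second = mailbox.recv(0, 1, tag=1)
first = mailbox.recv(0, 1, tag=0)

# Receiving again with no outstanding send is the case that would block
try:
    mailbox.recv(0, 1, tag=0)
    would_block = False
except RuntimeError:
    would_block = True
```

In this model the two messages are retrieved by tag regardless of arrival order, and the third receive has nothing to match, which corresponds to the single-threaded deadlock described above.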

## De-Initialize ACCL instances
The `deinit()` function clears all internal data structures in the ACCL instance.

In [None]:
accl0.deinit()
accl1.deinit()