# ACCL Compression Support
In general, ACCL is datatype-agnostic, since most of its functionality involves moving data without interacting with the data values themselves. However, there are two exceptions to this rule:
* Elementwise operations (e.g. SUM) performed by ACCL instances on buffers during reduction-type collectives 
* Datatype conversions when the source and destination of a transfer are of different data types. In this scenario we say that the lower-precision buffer is compressed.

To support these elementwise operations and conversions, ACCL must be configured with a reduction plugin and conversion plugins respectively. Each of these plugins is a free-running Vitis kernel. Reduction plugins take two operand AXI Streams and produce one result AXI Stream. They may implement multiple functions internally, selected by an operation ID provided as side-band to the operands on the TDEST signal of the AXI Stream. Conversion plugins take one operand AXI Stream as input and produce a result AXI Stream by applying an arbitrary conversion function, specified by a function ID on the operand's TDEST.
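
As a rough host-side illustration of the dispatch described above, the following NumPy sketch models a reduction plugin and a conversion plugin as functions that select their behaviour from a TDEST-like ID. The ID values, the `MAX` function, and all names here are hypothetical and chosen only for illustration; the real plugins are hardware kernels operating on AXI Streams, and their actual IDs are defined by the plugin in use.

```python
import numpy as np

# Hypothetical function IDs, for illustration only.
SUM_ID = 0
MAX_ID = 1
COMPRESS_ID = 0    # FP32 -> FP16
DECOMPRESS_ID = 1  # FP16 -> FP32

# One reduction plugin may implement several elementwise functions.
REDUCE_FUNCTIONS = {
    SUM_ID: np.add,
    MAX_ID: np.maximum,
}

# A conversion plugin applies the conversion selected by its function ID.
CONVERT_FUNCTIONS = {
    COMPRESS_ID: lambda x: x.astype(np.float16),
    DECOMPRESS_ID: lambda x: x.astype(np.float32),
}

def reduction_plugin_model(op0, op1, tdest):
    """Combine two operand 'streams' elementwise with the function selected by tdest."""
    return REDUCE_FUNCTIONS[tdest](op0, op1)

def conversion_plugin_model(op, tdest):
    """Convert one operand 'stream' with the function selected by tdest."""
    return CONVERT_FUNCTIONS[tdest](op)

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
y = np.array([3.0, 1.0, 0.5], dtype=np.float32)
print(reduction_plugin_model(x, y, SUM_ID))
print(reduction_plugin_model(x, y, MAX_ID))
print(conversion_plugin_model(x, COMPRESS_ID).dtype)
```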

Example reduction and conversion plugins are provided in the ACCL repo. The example reduction plugin supports five data types: FP16/32/64 and INT32/64. The example conversion plugin converts between single-precision (FP32) and half-precision (FP16) floating point. Together, these plugins enable six datatype configurations - let's see what they are:

In [None]:
from pyaccl.accl import ACCL_DEFAULT_ARITH_CONFIG

for key in ACCL_DEFAULT_ARITH_CONFIG:
    print(f"Uncompressed dtype: {key[0]}\nCompressed dtype: {key[1]}\n{str(ACCL_DEFAULT_ARITH_CONFIG[key])}")

Five of these configurations are homogeneous, i.e. they operate on buffers of identical data types. One is heterogeneous and can operate on combinations of FP32 and FP16 buffers by utilizing the conversion plugin, e.g. the source buffers of a primitive can be FP32 and the results FP16, or vice versa.

The key points of the ACCL datatype configuration are:
* bytes per element for the compressed and uncompressed datatype. In the case of homogeneous configurations, these are the same datatype.
* ratio of uncompressed to compressed elements, i.e. how many uncompressed buffer elements are consumed in the conversion process to produce one compressed element. For elementwise conversion, e.g. FP32 to FP16, this ratio is 1. For block floating-point formats, this ratio could be higher.
* whether arithmetic should be performed on the compressed data - for higher throughput - or uncompressed data - for higher precision. ACCL determines the order of conversions required to meet this specification for each primitive and collective.
* function IDs to be provided to the plugins when performing compression, decompression, and reduction.
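
To make the key points above concrete, here is a minimal sketch of how such a configuration could be represented. The class `ArithConfigSketch`, its field names, and the ID values are hypothetical and are not the pyaccl data structure; they simply mirror the four bullet points.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArithConfigSketch:
    """Hypothetical container mirroring the key points of a datatype configuration."""
    uncompressed_elem_bytes: int  # bytes per uncompressed element
    compressed_elem_bytes: int    # bytes per compressed element
    elem_ratio: int               # uncompressed elements consumed per compressed element
    arith_is_compressed: bool     # True: reduce on compressed data (throughput)
    compressor_id: int            # function ID for compression (illustrative value)
    decompressor_id: int          # function ID for decompression (illustrative value)
    reduce_ids: tuple             # function IDs for reductions (illustrative values)

    def uncompressed_nbytes(self, n_elems):
        return n_elems * self.uncompressed_elem_bytes

    def compressed_nbytes(self, n_elems):
        # Ceiling division: a partially filled block still yields one compressed element.
        n_compressed = -(-n_elems // self.elem_ratio)
        return n_compressed * self.compressed_elem_bytes

# Heterogeneous FP32/FP16 configuration: elementwise conversion, so the ratio is 1.
fp32_fp16 = ArithConfigSketch(4, 2, 1, True, 0, 1, (0,))
# Homogeneous FP32 configuration: both datatypes are the same.
fp32_fp32 = ArithConfigSketch(4, 4, 1, False, 0, 0, (0,))

print(fp32_fp16.uncompressed_nbytes(10), fp32_fp16.compressed_nbytes(10))
```

In this sketch a 10-element FP32 buffer occupies 40 bytes uncompressed and 20 bytes once compressed to FP16.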

Notice that in the ACCL default FP32/FP16 compression configuration, arithmetic is performed on the lower-precision FP16 datatype. Let's initialize two ACCL instances and see how the FP16 compression feature works.

In [None]:
RUN_ON_HARDWARE = True
XCLBIN = "axis3x.xclbin"

from pyaccl import accl

if RUN_ON_HARDWARE:
    accl0 = accl(2, 0, xclbin=XCLBIN, cclo_idx=0)
    accl1 = accl(2, 1, xclbin=XCLBIN, cclo_idx=1)
else:
    accl0 = accl(2, 0, sim_mode=True)
    accl1 = accl(2, 1, sim_mode=True)

## Operating on buffers of different data types

Let's do a reduction between two NumPy FP32 buffers, with the result stored in an FP16 buffer. First we'll allocate these buffers using the `dtype` optional argument to `allocate()`, fill the buffers with high-precision data, then perform the local reduction. Since ACCL by default performs arithmetic in FP16 in this mixed-precision scenario, the sum-combine is equivalent to the following sequence of operations:
1. convert `op0` and `op1` to FP16
2. perform the sum in FP16
3. store the result in `res`

In [None]:
import numpy as np
from pyaccl import ACCLReduceFunctions

op0 = accl0.allocate((10,), dtype=np.float32)
op0[:] = [np.pi*i for i in range(10)]
op1 = accl0.allocate((10,), dtype=np.float32)
op1[:] = [1.1*i for i in range(10)]
res = accl0.allocate((10,), dtype=np.float16)

accl0.combine(10, ACCLReduceFunctions.SUM, op0, op1, res)

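# Reference sum computed by NumPy in FP32
print(op0+op1)
# FP32 sum rounded to FP16 afterwards
print((op0+op1).astype(np.float16))
# Operands converted to FP16 first, then summed in FP16 (mirrors the ACCL default)
print((op0.astype(np.float16)+op1.astype(np.float16)).astype(np.float16))
# Result produced by ACCL
print(res)
# Total absolute error of the ACCL result against the FP32 reference
np.sum(np.abs(np.subtract(op0+op1, res)))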

Notice how the result is slightly different depending on whether we perform the sum in FP32 or FP16. The ACCL result is also slightly different from the NumPy result, due to differences in the underlying floating-point ALUs on the FPGA and CPU respectively.
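
The size of these differences can be estimated on the host without ACCL. The sketch below is a NumPy-only approximation that reuses the same input values and compares both orderings against a float64 reference. It models only CPU rounding behaviour and says nothing about the exact results of the FPGA plugin.

```python
import numpy as np

a = np.array([np.pi * i for i in range(10)], dtype=np.float32)
b = np.array([1.1 * i for i in range(10)], dtype=np.float32)

# Ordering 1: sum in FP32, then round the result to FP16.
sum_then_round = (a + b).astype(np.float16)
# Ordering 2: round the operands to FP16, then sum in FP16.
round_then_sum = a.astype(np.float16) + b.astype(np.float16)

# Float64 reference built from the FP32 inputs.
exact = a.astype(np.float64) + b.astype(np.float64)
err_sum_then_round = np.abs(sum_then_round.astype(np.float64) - exact)
err_round_then_sum = np.abs(round_then_sum.astype(np.float64) - exact)

# FP16 machine epsilon: spacing between 1.0 and the next representable value.
eps16 = float(np.finfo(np.float16).eps)

print("FP16 epsilon:", eps16)
print("max abs error, sum then round:", err_sum_then_round.max())
print("max abs error, round then sum:", err_round_then_sum.max())
```

Rounding the operands first accumulates up to three rounding steps per element instead of one, which is why the two orderings can disagree in the last FP16 digit.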

## Compressing data over the wire

In addition to local conversions, users can specify FP16 compression for traffic across the backend (typically Ethernet) link between ACCL instances, even when all buffers are FP32. This feature reduces network traffic and latency at the cost of some loss of precision during transport. Let's compress data for a simple send-receive pair. We need to utilize the `compress_dtype` optional argument for both `send()` and `recv()`. Please note that the compression settings must match on both sides for the receive operation to identify the received buffer.

In [None]:
src = accl0.allocate((10,))
dst = accl1.allocate((10,))
src[:] = [1.1111*i for i in range(10)]
dst[:] = [0.0 for i in range(10)]

accl0.send(src, len(src), 1, tag=0, compress_dtype=np.dtype('float16'))
accl1.recv(dst, len(dst), 0, tag=0, compress_dtype=np.dtype('float16'))

print(src)
print(dst)
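
The effect of compressing on the wire can be approximated on the host with a plain NumPy round trip. The sketch below assumes the buffers are FP32 and models the transfer as an FP32 to FP16 to FP32 conversion. The names `src_model`, `wire` and `dst_model` are illustrative and are not part of ACCL.

```python
import numpy as np

# Same data as the send buffer, held in an ordinary NumPy array.
src_model = np.array([1.1111 * i for i in range(10)], dtype=np.float32)

# Compression before transmission, decompression after reception.
wire = src_model.astype(np.float16)
dst_model = wire.astype(np.float32)

# Payload size is halved by the FP16 representation.
print("uncompressed bytes:", src_model.nbytes)
print("compressed bytes:  ", wire.nbytes)

# Precision lost in the round trip.
abs_err = np.abs(dst_model.astype(np.float64) - src_model.astype(np.float64))
print("max abs error:", abs_err.max())
```

For values in the normal FP16 range, the round-trip error of each element is bounded by half an FP16 unit in the last place, i.e. a relative error of at most 2^-11.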

## De-Initialize ACCL instances
The `deinit()` function clears all internal data structures in the ACCL instance.

In [None]:
accl0.deinit()
accl1.deinit()