# Xilinx Data Compression Library

This notebook introduces some of the higher-level functions of the Xilinx [data compression library](https://github.com/Xilinx/Vitis_Libraries/tree/b658aa5cd262d080048526ce931d4570cb931a36/data_compression), which is part of the set of [Vitis Libraries](https://github.com/Xilinx/Vitis_Libraries). The bitstream delivered for this example is designed to showcase the core parts of the library, in particular:

 * LZ4 compression and decompression
 * DEFLATE compression

LZ4 is a block-based compression format optimized for speed and parallel access, and DEFLATE is the compression format on which zlib and gzip are built. This notebook focuses on LZ4 because the compression library provides single accelerators for compression and decompression. DEFLATE is a more complex format that is split into three separate processes; it is detailed in the *Zlib Acceleration* notebook alongside this one.

The first thing we need to do, as with any PYNQ program, is load the bitstream:

In [1]:
import pynq

ol = pynq.Overlay('compression.xclbin')

Loading the overlay allows us to inspect its contents:

In [2]:
ol?

Type:            Overlay
String form:     <pynq.overlay.Overlay object at 0x7fbc78e4fd90>
File:            /scratch/pynq-testing/ogden/conda/lib/python3.7/site-packages/pynq/overlay.py
Docstring:
Default documentation for overlay compression.xclbin. The following
attributes are available on this overlay:

IP Blocks
----------
xilHuffmanKernel_1   : pynq.overlay.DefaultIP
xilHuffmanKernel_2   : pynq.overlay.DefaultIP
xilLz4Compress_1     : pynq.overlay.DefaultIP
xilLz4Compress_2     : pynq.overlay.DefaultIP
xilLz77Compress_1    : pynq.overlay.DefaultIP
xilLz77Compress_2    : pynq.overlay.DefaultIP
xilTreegenKernel_1   : pynq.overlay.DefaultIP
xilTreegenKernel_2   : pynq.overlay.DefaultIP

Hierarchies
-----------
None

Interrupts
----------
None

GPIO Outputs
------------
None

Memories
------------
bank0                : Memory
bank1                : Memory
Class docstring:
This class keeps track of a single bitstream's state 

As you can see, there are multiple types of kernel, each with two instances. For this notebook we are only interested in the `xilLz4Compress_1` kernel, which we will use to introduce the concepts behind the library.

In [3]:
compress = ol.xilLz4Compress_1
compress.signature

<Signature (in_r: 'ap_uint<512> const *', out_r: 'ap_uint<512>*', compressd_size: 'unsigned int*', in_block_size: 'unsigned int*', block_size_in_kb: 'unsigned int', input_size: 'unsigned int')>

Before constructing the arguments to pass to the accelerator, we need to understand how the compression library is structured. The compress accelerator is composed of 8 separate internal pipelines, with a splitter at the input and a combiner at the output:

![compression pipeline](img/lzx_comp.png)

With this structure in mind, we can work out the format of each argument:

 * `in_r` - the input data array consisting of 8 buffers of the block size arranged contiguously
 * `out_r` - the output data array consisting of 8 buffers of the block size arranged contiguously
 * `compressd_size` - an array of size 8 that will contain the sizes of the blocks after compression (the name is spelled this way in the kernel signature)
 * `in_block_size` - an array of size 8 that contains the uncompressed sizes of the blocks (as the data may not fill the whole block)
 * `block_size_in_kb` - the block size in kilobytes
 * `input_size` - the total size of the input in bytes
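
As a rough illustration of the argument layout above, the following pure-Python sketch splits an input into at most 8 block-sized chunks and computes the per-block sizes that would go into `in_block_size`. The helper name `split_into_blocks` and the zero-padding of a partially filled block are assumptions made for illustration; they are not part of PYNQ or the compression library.

```python
NUM_LANES = 8  # the accelerator has 8 internal pipelines


def split_into_blocks(data, block_size, num_lanes=NUM_LANES):
    """Split `data` into up to `num_lanes` contiguous blocks.

    Returns (padded, sizes): `padded` is a bytearray of exactly
    num_lanes * block_size bytes, and sizes[i] is the number of
    valid (uncompressed) bytes in block i.
    Hypothetical helper, not part of PYNQ or the Vitis library.
    """
    capacity = num_lanes * block_size
    chunk = data[:capacity]           # data beyond one batch is ignored here
    padded = bytearray(capacity)      # zero-filled backing store
    padded[:len(chunk)] = chunk
    sizes = []
    for lane in range(num_lanes):
        start = lane * block_size
        # Full block, partial block, or empty block once the data runs out
        sizes.append(max(0, min(block_size, len(chunk) - start)))
    return padded, sizes


# 10 bytes of data with a toy block size of 4 bytes
padded, sizes = split_into_blocks(b"abcdefghij", block_size=4)
print(sizes)  # [4, 4, 2, 0, 0, 0, 0, 0]
```

In this notebook's example every block is full, so all eight sizes equal the block size. How the kernel treats blocks of size 0 is not covered here.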
 
The LZ4 format has 4 possible block sizes ranging from 64 KB to 4 MB. For this example we'll set the block size to 1 MB and use that size to create all of the buffers. Note that for this accelerator all of the arrays should be in `bank0`, as we can see from the `mem` attribute of each entry in the `args` dictionary.

In [4]:
compress.args

{'in_r': XrtArgument(name='in_r', index=1, type='ap_uint<512> const *', mem='bank0'),
 'out_r': XrtArgument(name='out_r', index=2, type='ap_uint<512>*', mem='bank0'),
 'compressd_size': XrtArgument(name='compressd_size', index=3, type='unsigned int*', mem='bank0'),
 'in_block_size': XrtArgument(name='in_block_size', index=4, type='unsigned int*', mem='bank0'),
 'block_size_in_kb': XrtArgument(name='block_size_in_kb', index=5, type='unsigned int', mem=None),
 'input_size': XrtArgument(name='input_size', index=6, type='unsigned int', mem=None)}
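
The block-size choice described above can be captured in a small helper before any buffers are allocated. This sketch assumes the four sizes are the standard LZ4 frame block sizes (64 KB, 256 KB, 1 MB and 4 MB). The function name `block_size_in_kb` mirrors the kernel argument, but both helpers are hypothetical, and whether the kernel accepts every one of these sizes is not verified here.

```python
# Maximum block sizes defined by the LZ4 frame format, in bytes
LZ4_BLOCK_SIZES = (64 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024)


def block_size_in_kb(block_size_bytes):
    """Convert a block size in bytes to KB, rejecting non-LZ4 sizes."""
    if block_size_bytes not in LZ4_BLOCK_SIZES:
        raise ValueError(f"unsupported LZ4 block size: {block_size_bytes}")
    return block_size_bytes // 1024


def buffer_bytes(block_size_bytes, num_lanes=8):
    """Bytes needed for one input (or output) buffer spanning all lanes."""
    return num_lanes * block_size_bytes


print(block_size_in_kb(1024 * 1024))  # 1024
print(buffer_bytes(1024 * 1024))      # 8388608
```

For the 1 MB block size used in this notebook, this gives the `1024` passed as `block_size_in_kb` and the 8 MB passed as `input_size`.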

In [5]:
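BLOCK_SIZE = 1024 * 1024  # 1 MB block size, in bytes

# One row per internal pipeline; all arrays are allocated in bank0
in_buffers = pynq.allocate((8, BLOCK_SIZE), 'u1', target=ol.bank0)
out_buffers = pynq.allocate((8, BLOCK_SIZE), 'u1', target=ol.bank0)
# Per-block sizes, passed as `compressd_size` and `in_block_size`
compressed_size = pynq.allocate((8,), 'u4', target=ol.bank0)
uncompressed_size = pynq.allocate((8,), 'u4', target=ol.bank0)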

We need a reasonably sized file to use as a test case. The test data we're using is the compression xclbin itself for the Xilinx u200 board. We can use numpy to reshape the array to make it more convenient. Note that we need to use a *memoryview* of the test data to avoid numpy trying to parse the data as a string.

In [6]:
with open('test_data.bin', 'rb') as f:
    test_data = f.read()

in_buffers.reshape(8*BLOCK_SIZE)[:] = memoryview(test_data)[0:8*BLOCK_SIZE]

The uncompressed size of each block is just 1 MB:

In [7]:
uncompressed_size[:] = BLOCK_SIZE

As a final step, we need to sync the input buffers with the accelerator card:

In [8]:
uncompressed_size.sync_to_device()
in_buffers.sync_to_device()

Everything is now set up to call the accelerator:

In [9]:
compress.call(in_buffers, out_buffers,
              compressed_size, uncompressed_size,
              BLOCK_SIZE // 1024, 8 * BLOCK_SIZE)  # block size in KB, input size in bytes

To get the data back, we first need to sync the buffer containing the sizes of the blocks:

In [10]:
compressed_size.sync_from_device()
compressed_size

PynqBuffer([133515, 158756, 479440, 430620, 895308, 754169, 770095,
            684801], dtype=uint32)

We can now sync the output buffers. Note that we only need to sync the part of each buffer that we know is filled:

In [11]:
for i in range(8):
    out_buffers[i,0:compressed_size[i]].sync_from_device()

To verify that we have the correct result, we pass the compressed data back through a software implementation. The `lz4` package provides a block-level API we can use to decompress our results.

In [12]:
import lz4.block

decompressed_data = b''
for i in range(8):
    # Decompress only the filled part of each output block
    decompressed_data += lz4.block.decompress(out_buffers[i, 0:compressed_size[i]],
                                              uncompressed_size=BLOCK_SIZE)

decompressed_data == test_data[0:8*BLOCK_SIZE]

True
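
To make the check above less of a black box, here is a minimal pure-Python sketch of what decoding a single LZ4 block involves. It follows the published LZ4 block format (a token byte, optional extended lengths, literals, then a 2-byte little-endian match offset) and is meant only as an illustration. The name `lz4_block_decompress` is hypothetical, the function does little input validation and is slow, and `lz4.block.decompress` remains the right tool for real data.

```python
def lz4_block_decompress(src):
    """Decode one raw LZ4 block (no frame header) into bytes."""
    out = bytearray()
    i, n = 0, len(src)
    while i < n:
        token = src[i]
        i += 1

        # High nibble: literal length, extended by extra bytes when it is 15
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = src[i]
                i += 1
                lit_len += extra
                if extra != 255:
                    break
        out += src[i:i + lit_len]
        i += lit_len

        if i >= n:
            break  # the last sequence holds literals only

        # 2-byte little-endian offset back into the already decoded output
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")

        # Low nibble: match length minus 4, extended the same way
        match_len = (token & 0x0F) + 4
        if (token & 0x0F) == 15:
            while True:
                extra = src[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break

        # Copy byte by byte so overlapping matches (offset < length) work
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)


# Hand-built block: 3 literals "abc", a 9-byte match at offset 3,
# then a final literal-only sequence "hello"
block = bytes([0x35]) + b"abc" + bytes([0x03, 0x00]) + bytes([0x50]) + b"hello"
print(lz4_block_decompress(block))  # b'abcabcabcabchello'
```

The hand-built block is only a decoding example; it is not claimed to be what an LZ4 compressor would emit for that input.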

To get an idea of the compression ratio, we can sum the lengths of the compressed blocks and divide by the 8 MB we started with:

In [13]:
sum(compressed_size) / sum(uncompressed_size)

0.5133991241455078
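
As a worked check of the figure above, the sketch below recomputes the ratio from the block sizes printed earlier, using plain Python integers, and also reports a per-block ratio. The variable names are illustrative.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MB, as used throughout this notebook

# Compressed block sizes as reported by the accelerator earlier
compressed = [133515, 158756, 479440, 430620,
              895308, 754169, 770095, 684801]

# Per-block ratio: compressed bytes over uncompressed bytes (lower is better)
per_block = [size / BLOCK_SIZE for size in compressed]

# Overall ratio: total compressed bytes over the 8 MB of input
overall = sum(compressed) / (len(compressed) * BLOCK_SIZE)

for i, ratio in enumerate(per_block):
    print(f"block {i}: {ratio:.3f}")
print(f"overall: {overall:.4f}")  # overall: 0.5134
```

The per-block figures vary widely, which shows that the first part of the test file compresses much better than the rest.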

### Cleaning up

You might want to *shutdown* this notebook at this point to ensure that all of the resources used are freed.

Copyright (C) 2020 Xilinx, Inc