flanglet edited this page May 14, 2018 · 83 revisions

Overview

This project offers Java code for manipulation and compression of data and images. Kanzi contains the most versatile all-purpose data compressor for Java (and the fastest block compressor in Java).

Other utilities include lossless compression codecs (Huffman, Range, LZ4, Snappy, PAQ, Asymmetric Numeral Systems, Context Model), color model transforms, resampling, wavelet, DCT and Hadamard transforms, bit stream manipulation, Burrows-Wheeler (BWT) and its bijective version (BWTS), the Move-To-Front transform, run-length coding, etc.

Package hierarchy

kanzi

  • app
  • bitstream
  • entropy
  • function
  • io
  • test
  • transform
  • util
  • hash
  • sort

Description:

  • kanzi: top level including common classes and interfaces
  • app: contains applications (e.g. the block compressor)
  • bitstream: utilities to manipulate a stream of data at the bit level
  • entropy: implementation of several common entropy codecs (process bits)
  • function: implementation of common functions (input and output sizes differ but process only bytes): RLT, ZRLT, LZ4, Snappy
  • io: implementation of InputStream, OutputStream with block codec
  • test: contains many classes to test the utility classes
  • transform: implementation of common functions (input and output sizes are identical) such as Wavelet, Discrete Cosine, Walsh-Hadamard, Burrows-Wheeler
  • util: utility classes, suffix array algorithms, ...
  • util/hash: implementations of various hash functions (MurMurHash, xxHash32, xxHash64, SipHash_2_4)
  • util/sort: implementation of the most common sorting algorithms: QuickSort, Radix, MergeSort, BucketSort, etc...

There are no static dependencies on other jar files, but jna.jar can be provided if video filters are implemented via JNI calls.

jdeps kanzi.jar (filtered for external dependencies only)

    kanzi (kanzi.jar)
      -> java.lang
      -> java.nio
   kanzi.app (kanzi.jar)
      -> java.io
      -> java.lang
      -> java.util
      -> java.util.concurrent
   kanzi.bitstream (kanzi.jar)
      -> java.io
      -> java.lang
   kanzi.entropy (kanzi.jar)
      -> java.io
      -> java.lang
   kanzi.function (kanzi.jar)
      -> java.io
      -> java.lang
      -> java.util
   kanzi.io (kanzi.jar)
      -> java.io
      -> java.lang
      -> java.util
      -> java.util.concurrent
      -> java.util.concurrent.atomic
      -> java.util.concurrent.locks
   kanzi.test (kanzi.jar)
      -> java.awt
      -> java.awt.event
      -> java.awt.image
      -> java.io
      -> java.lang
      -> java.nio
      -> java.util
      -> java.util.concurrent
      -> java.util.zip
      -> javax.imageio
      -> javax.swing
   kanzi.transform (kanzi.jar)
      -> java.lang
   kanzi.util (kanzi.jar)
      -> java.lang
      -> java.lang.management
      -> java.nio.charset
      -> java.util
   kanzi.util.hash (kanzi.jar)
      -> java.lang
   kanzi.util.sort (kanzi.jar)
      -> java.lang
      -> java.util.concurrent

Java 7 or higher is required.

Block compressor

Usage:

java -jar kanzi.jar --help

Kanzi 1.4 (C) 2018,  Frederic Langlet

   -h, --help
        display this message

   -v, --verbose=<level>
        0=silent, 1=default, 2=display details, 3=display configuration,
        4=display block size and timings, 5=display extra information
        Verbosity is reduced to 1 when files are processed concurrently
        Verbosity is silently reduced to 0 when the output is 'stdout'
        (EG: The source is a directory and the number of jobs > 1).

   -f, --force
        overwrite the output file if it already exists

   -i, --input=<inputName>
        mandatory name of the input file or directory or 'stdin'
        When the source is a directory, all files in it will be processed.
        Provide \. at the end of the directory name to avoid recursion
        (EG: myDir\. => no recursion)

   -o, --output=<outputName>
        optional name of the output file or directory (defaults to
        <inputName.knz>) or 'none' or 'stdout'. 'stdout' is not valid
        when the number of jobs is greater than 1.

   -b, --block=<size>
        size of blocks, multiple of 16 (default 1 MB, max 1 GB, min 1 KB).

   -l, --level=<compression>
        set the compression level [0..6]
        Providing this option forces entropy and transform.
        0=None&None (store), 1=TEXT+LZ4&HUFFMAN, 2=TEXT+ROLZ
        3=BWT+RANK+ZRLT&ANS0, 4=BWT+RANK+ZRLT&FPAQ, 5=BWT&CM
        6=X86+RLT+TEXT&TPAQ

   -e, --entropy=<codec>
        entropy codec [None|Huffman|ANS0|ANS1|Range|PAQ|FPAQ|TPAQ|CM]
        (default is ANS0)

   -t, --transform=<codec>
        transform [None|BWT|BWTS|SNAPPY|LZ4|ROLZ|RLT|ZRLT|MTFT|RANK|TEXT|X86]
        EG: BWT+RANK or BWTS+MTFT (default is BWT+RANK+ZRLT)

   -x, --checksum
        enable block checksum

   -s, --skip
        copy blocks with high entropy instead of compressing them.

   -j, --jobs=<jobs>
        maximum number of jobs the program may start concurrently
        (default is 1, maximum is 32).


EG. Kanzi -c -i foo.txt -o none -b 4m -l 4 -v 3

EG. Kanzi -c -i foo.txt -f -t BWT+MTFT+ZRLT -b 4m -e FPAQ -v 3 -j 4

EG. Kanzi --compress --input=foo.txt --output=foo.knz --force
          --transform=BWT+MTFT+ZRLT --block=4m --entropy=FPAQ --verbose=3 --jobs=4

EG. Kanzi -d -i foo.knz -f -v 2 -j 2

EG. Kanzi --decompress --input=foo.knz --force --verbose=2 --jobs=2

The Block compressor cuts the input file into chunks of 1 MB (or the size provided on the command line with the 'block' option, up to 512 MB). Optionally, a checksum for each chunk of data can be computed and stored in the output.

As a first step, it applies a transform (the default is BWT+RANK+ZRLT) to turn the block into a smaller number of bytes (byte transform). The supported transforms include Snappy, LZ4, BWT and BWTS. If BWT(S) is chosen, a post-processing transform can be provided (e.g. BWTS+RANK or BWT+MTFT). Usually, BWT(S) and the selected GST (Global Structure Transform) are followed by a ZRLT to remove the runs of zeros. Up to 4 transforms (input size = output size) or functions (input size != output size) can be combined in this first stage, e.g. BWT+RANK+ZRLT.
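
The effect of the Move-To-Front stage can be illustrated with a short, generic sketch (a textbook MTF encoder, not Kanzi's tuned implementation): after a BWT, runs of identical bytes become runs of small ranks, mostly zeros, which a ZRLT can then shrink.

```java
// Generic Move-To-Front transform: each byte is replaced by its current
// rank in a recency list, then moved to the front of that list.
public class MTF {
    public static byte[] encode(byte[] input) {
        int[] alphabet = new int[256];
        for (int i = 0; i < 256; i++)
            alphabet[i] = i;
        byte[] output = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            int symbol = input[i] & 0xFF;
            int rank = 0;
            while (alphabet[rank] != symbol)
                rank++;
            output[i] = (byte) rank;
            // Move the symbol to the front of the recency list
            System.arraycopy(alphabet, 0, alphabet, 1, rank);
            alphabet[0] = symbol;
        }
        return output;
    }

    public static void main(String[] args) {
        // Runs of identical symbols collapse to runs of zeros
        byte[] res = encode("aaabbb".getBytes());
        System.out.println(java.util.Arrays.toString(res));
    }
}
```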

As a second step, entropy coding is performed (to turn the block into a smaller number of bits).

Each step can be bypassed based on command line options.

The decompressor extracts all necessary information from the header of the bitstream (input file), such as entropy type, transform type, block size and whether checksums are enabled, before applying the appropriate entropy decoder followed by the inverse transform for each block. Optionally, a checksum is computed and checked against the one stored in the bitstream (computed over the original data).

This two-step process allows either very fast compression/decompression (Snappy/LZ4 with no entropy coding, or Snappy/LZ4 with Huffman) or a high compression ratio (BWT(S) followed by CM, PAQ or TPAQ, with a block size above 1 MB).

Entropy codecs

Several entropy codecs have been implemented (listed in order of increasing compression ratio):

  • Huffman: The codec is a fast canonical Huffman implementation. Both the encoder and the decoder use tables to compute code lengths and values instead of walking a tree, for speed.
  • Range: A fast implementation that uses pre-computed block statistics.
  • ANS: Based on Range Asymmetric Numeral Systems by Jarek Duda (specifically an implementation by Fabian Giesen). Works in a similar fashion to the Range encoder but uses only 1 state (instead of bottom and range) and does encoding in reverse byte order.
  • FPAQ: A binary arithmetic codec based on FPAQ1 by Matt Mahoney. Uses a simple, adaptive order 0 predictor based on frequencies. Fast and compact code.
  • CM: A binary arithmetic codec based on BCM by Ilya Muravyov. Uses context mixing of counters to generate a prediction. Efficient and decently fast.
  • PAQ: A binary arithmetic codec based on an old PAQ version by Matt Mahoney. The number of post-processing steps (Adaptive Probability Maps) has been reduced for speed. Slow.
  • TPAQ: A binary arithmetic codec based on Tangelo 2.4 (itself derived from FPAQ8). Uses context mixing of predictions produced by one-layer neural networks. The initial code has been heavily tuned to improve compression ratio and speed. The slowest, but usually the best compression ratio.
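
The canonical approach used by the Huffman codec can be illustrated with a textbook sketch (Kanzi's table-driven implementation is more elaborate): code values are fully determined by the code lengths, assigned in order of (length, symbol), so only the lengths need to be transmitted.

```java
// Canonical Huffman: rebuild code values from code lengths alone,
// using the standard recurrence (as in DEFLATE, RFC 1951).
public class CanonicalHuffman {
    // Returns the code value for each symbol (lengths[i] = 0 means unused).
    public static int[] assignCodes(int[] lengths) {
        int maxLen = 0;
        for (int len : lengths)
            maxLen = Math.max(maxLen, len);
        // Count how many codes exist at each length
        int[] countPerLen = new int[maxLen + 1];
        for (int len : lengths)
            if (len > 0)
                countPerLen[len]++;
        // First code value at each length
        int[] nextCode = new int[maxLen + 1];
        int code = 0;
        for (int len = 1; len <= maxLen; len++) {
            code = (code + countPerLen[len - 1]) << 1;
            nextCode[len] = code;
        }
        // Assign consecutive values to symbols of equal length
        int[] codes = new int[lengths.length];
        for (int sym = 0; sym < lengths.length; sym++)
            if (lengths[sym] > 0)
                codes[sym] = nextCode[lengths[sym]]++;
        return codes;
    }

    public static void main(String[] args) {
        // Lengths 2,1,3,3 yield the prefix-free codes 10, 0, 110, 111
        int[] codes = assignCodes(new int[]{2, 1, 3, 3});
        System.out.println(java.util.Arrays.toString(codes));
    }
}
```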

See more details about the block transform

https://github.com/flanglet/kanzi/wiki/BWT-Block-Codec

Performance of the Snappy and LZ4 codecs in Kanzi

https://github.com/flanglet/kanzi/wiki/Performance-of-Snappy-and-LZ4-Codecs

Stream and block header formats

Stream Header (12 bytes)
   stream type: 32 bits, value "KANZ"
   stream format version: 7 bits
   checksum present: 1 bit (boolean)
   entropy codec code: 5 bits
   transform codecs: 16 bits (4x4)
   block size: (divided by 8) 26 bits, valid size range is [1024, 512*1024*1024[
   reserved for future extension: 9 bits
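
A round trip through this layout can be sketched as follows. The field widths are taken from the list above; the MSB-first bit packing and the sample field values are assumptions of this sketch, not Kanzi's actual bitstream code.

```java
// Sketch of the 12-byte stream header: 32+7+1+5+16+26+9 = 96 bits.
public class StreamHeaderDemo {
    private final byte[] buf = new byte[12];
    private int wPos = 0, rPos = 0;

    // Append 'count' bits of 'value', most significant bit first (assumed order)
    void writeBits(long value, int count) {
        for (int i = count - 1; i >= 0; i--, wPos++)
            if (((value >> i) & 1) != 0)
                buf[wPos >> 3] |= 1 << (7 - (wPos & 7));
    }

    long readBits(int count) {
        long value = 0;
        for (int i = 0; i < count; i++, rPos++)
            value = (value << 1) | ((buf[rPos >> 3] >> (7 - (rPos & 7))) & 1);
        return value;
    }

    // Write a hypothetical header, then read every field back
    public static long[] roundTrip() {
        StreamHeaderDemo h = new StreamHeaderDemo();
        h.writeBits(0x4B414E5AL, 32);       // stream type "KANZ"
        h.writeBits(4, 7);                  // stream format version
        h.writeBits(1, 1);                  // checksum present
        h.writeBits(3, 5);                  // entropy codec code
        h.writeBits(0x1234, 16);            // 4 transform codes, 4 bits each
        h.writeBits((1024 * 1024) / 8, 26); // block size divided by 8
        h.writeBits(0, 9);                  // reserved
        return new long[]{ h.readBits(32), h.readBits(7), h.readBits(1),
                h.readBits(5), h.readBits(16), h.readBits(26) * 8 };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(roundTrip()));
    }
}
```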

Block Header (1 to 8 bytes) 
   mode: 8 bits  
   if mode & 0x80 != 0
      small block, block size = mode & 0x0F
   else 
      regular block, len(block size) = ((mode & 3) + 1) * 8
      block size = len(block size) bits from bitstream 
      if block size=0, last empty block
      (mode>>2) & 0x0F = skip flags for (up to 4) inverse transforms
   if checksum 
      checksum = 32 bits from bitstream 
   block data = block size * 8 bits from bitstream
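
The mode byte can be decoded directly from these rules; the helper below is a hypothetical illustration of the layout above, not Kanzi's code.

```java
// Interpret the block header mode byte per the layout above.
public class BlockMode {
    public static String describe(int mode) {
        if ((mode & 0x80) != 0)
            // Small block: size is stored directly in the low nibble
            return "small block, size=" + (mode & 0x0F);
        int sizeBits = ((mode & 3) + 1) * 8; // 8..32 bits follow for the size
        int skipFlags = (mode >> 2) & 0x0F;  // one flag per inverse transform
        return "regular block, sizeBits=" + sizeBits + ", skipFlags=" + skipFlags;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x85)); // small block of 5 bytes
        System.out.println(describe(0x02)); // regular block, 24-bit size field
    }
}
```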
   

Compression examples and results

See here: https://github.com/flanglet/kanzi/wiki/Compression-examples

Video filter examples

See here: https://github.com/flanglet/kanzi/wiki/Video-Examples
