fst::compress_fst can't seem to be read back in in another language #223

Closed
xiaodaigh opened this issue Nov 11, 2019 · 11 comments

@xiaodaigh
Contributor

I wonder if there are any additional bytes or settings I need to be aware of when using the fst::compress_fst function.

For example, in R I write the compressed bytes to disk

a = writeBin(1:10, raw(), size = 4)
ca = fst::compress_fst(a, compression = 50)
writeBin(ca, "d:/data/plsdel.io")

and I try to read it back using Julia

using CodecZstd
ca = open("d:/data/plsdel.io", "r") do io
     read(io)
end

transcode(ZstdDecompressor, ca)

gives an error.

I wonder whether the compression used by fst is interoperable with other tools in general?
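One quick sanity check (my own addition, not from the thread): a standard Zstandard frame starts with the magic bytes `28 B5 2F FD` (0xFD2FB528 little-endian), and an LZ4 frame starts with `04 22 4D 18` (0x184D2204). A file that begins with neither cannot be decoded as a plain frame, which is consistent with the error above. A minimal Python sketch (`frame_format` is a hypothetical helper name):

```python
# Hypothetical helper: guess the frame format from the leading magic bytes.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"       # Zstandard frame magic (0xFD2FB528, little-endian)
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # LZ4 frame format magic (0x184D2204, little-endian)

def frame_format(data: bytes) -> str:
    """Return 'zstd', 'lz4-frame', or 'unknown' based on the first four bytes."""
    if data.startswith(ZSTD_MAGIC):
        return "zstd"
    if data.startswith(LZ4_FRAME_MAGIC):
        return "lz4-frame"
    return "unknown"
```

A compress_fst stream starts with its own metadata header rather than one of these magics, so a frame-based decoder rejects it at byte 0.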

@xiaodaigh
Contributor Author

In contrast, the {lz4} package seems to work: https://github.com/bwlewis/lz4

library(lz4)
a = writeBin(1:10, raw(), size = 4)
ca = lzCompress(a)
writeBin(ca, "d:/data/plsdel.io")

works with

using CodecLz4

a = open("/mnt/d/data/plsdel.io", "r") do io
       read(io)
end
da = transcode(LZ4Decompressor, a)
reinterpret(Int32, da) # 1, 2, ..., 10

@MarcusKlik
Collaborator

Hi @xiaodaigh,

you're right, the raw vector returned by fst::compress_fst() is not a default LZ4 (or ZSTD) compression result!

The reason is that fst::compress_fst() divides the input into blocks to allow for parallel compression of each block (increasing performance). So the result is basically a few default LZ4 compression results pasted together, with a small amount of metadata at the start.
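The scheme described here can be sketched generically. The following Python illustration (my own, not from the thread) uses zlib as a stand-in compressor and a made-up minimal header (block size, total length, cumulative block-end offsets) — it is not fst's actual layout, just a demonstration of blocked, parallelisable compression:

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocked(data: bytes, block_size: int = 16_384) -> bytes:
    # split the input into fixed-size blocks
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # compress each block independently -- this is what makes the scheme parallel
    with ThreadPoolExecutor() as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    # hypothetical metadata: block size, total length, then cumulative end offsets
    header = struct.pack("<QQ", block_size, len(data))
    offsets, pos = [], 0
    for chunk in compressed:
        pos += len(chunk)
        offsets.append(pos)
    header += struct.pack("<%dQ" % len(offsets), *offsets)
    return header + b"".join(compressed)

def decompress_blocked(stream: bytes) -> bytes:
    block_size, total = struct.unpack_from("<QQ", stream, 0)
    nr_blocks = 1 + (total - 1) // block_size
    offsets = struct.unpack_from("<%dQ" % nr_blocks, stream, 16)
    body = stream[16 + 8 * nr_blocks:]
    out, lo = b"", 0
    for hi in offsets:  # each block is an independent compressed unit
        out += zlib.decompress(body[lo:hi])
        lo = hi
    return out
```

Because the offsets are stored up front, a reader could also decompress the blocks in parallel, which is exactly the multithreaded-read idea raised below.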

Would it be interesting to have an option to force a compatible result (e.g. compatibility_mode = TRUE)?

@xiaodaigh
Contributor Author

Or is it possible to know the cut points and the length of the metadata, so the reader can be multithreaded too? A compat mode would be nice as well.

@MarcusKlik
Collaborator

Hi @xiaodaigh, yes certainly!

the block sizes are calculated in this code and the metadata format is specified here.

For a reader application to read the compressed raw vector, the metadata should be interpreted first. For example, the actual compressor used can be found in the metadata.

So it's not too difficult to write code to read a stream compressed with compress_fst(). The question is, of course, whether it would be interesting enough to users, as there are alternative formats that allow for a large number of compressors (think zip or 7z 😸)

@xiaodaigh
Contributor Author

If I follow those formats, can I theoretically write a fst reader/writer in pure Julia?

@MarcusKlik
Collaborator

MarcusKlik commented Nov 13, 2019

Hi @xiaodaigh, yes, that would be relatively simple if you have the LZ4 and ZSTD compressors available.

I've created a function in pure R to decompress a file with compressed data. By following the same strategy in Julia, you can create a (de-)compressor for this type of file!

library(dplyr)
library(bit64)

# helper to convert an unsigned int to a bit64
uint_to_long <- function(unsigned_int) {
  high_bit <- bitwShiftR(unsigned_int, 31)
  low_bits <- bitwAnd(unsigned_int, 2147483647)
  
  bit64::as.integer64(1073741824L) * 2 * high_bit + bit64::as.integer64(low_bits)
}

# helper to convert two integers to a bit64
ints_to_long <- function(low_int, high_int) {
  bit64::as.integer64(1073741824L) * 4 * uint_to_long(high_int) +
    uint_to_long(low_int)
}

decompress_cfst <- function(path) {

  # open binary connection
  cfst_file <- file(path, "rb")

  # first few header bytes
  params <- readBin(cfst_file, "integer", 4)
  
  block_size <- uint_to_long(params[2])
  version <- uint_to_long(params[3])
  compression_algorithm <- bitwAnd(params[4], 2147483647)
  
  # read some more header
  params <- readBin(cfst_file, "integer", 4)
  
  vec_length <- ints_to_long(params[1], params[2])

  # calculate the number of blocks
  nr_of_blocks <- as.integer(1 + (vec_length - 1) %/% block_size)

  # read block offsets
  params <- readBin(cfst_file, "integer", 2 + nr_of_blocks * 2)

  block_offsets <- integer64(nr_of_blocks + 1)
  
  for (block_nr in 1:(nr_of_blocks + 1)) {
    block_offsets[block_nr] <- ints_to_long(params[block_nr * 2 - 1], params[block_nr * 2])
  }

  # read chunks
  res <- lapply(1:nr_of_blocks, function(block_nr) {
    compressed_chunk <- readBin(cfst_file, "raw", as.numeric(block_offsets[block_nr + 1] - block_offsets[block_nr]))
    
    if (compression_algorithm == 1) {
      decompressed_chunk <- qs::lz4_decompress_raw(compressed_chunk)
    } else {
      decompressed_chunk <- qs::zstd_decompress_raw(compressed_chunk)
    }
    
    decompressed_chunk
  })
  
  close(cfst_file)
  
  unlist(res)
}


# some compressible sample data
x <- as.raw(sample(10:20, 1e6, replace = TRUE))

path <- "1.cfst"

# compress and write to file
fst::compress_fst(x, "ZSTD", 100, hash = TRUE) %>%
  writeBin(path)

y <- decompress_cfst(path)

# in equals out
testthat::expect_equal(x, y)

# compression ratio
as.numeric(object.size(x)) / file.size(path)
#> [1] 2.244948

So this is a (slow 😸) pure R method to deserialize a binary file with the output of compress_fst(). For completeness, the header information is stored in the following format:

| Size | Type | Description |
|------|------|-------------|
| 4 | unsigned int | header hash |
| 4 | unsigned int | blockSize |
| 4 | unsigned int | version |
| 4 | unsigned int | COMPRESSION_ALGORITHM, with isHashed in the upper bit |
| 8 | unsigned long long | vecLength |
| 8 | unsigned long long | block hash result |
| 8 * (nrOfBlocks + 1) | unsigned long long | block offsets |
| x | unsigned char | compressed data |
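This table maps directly onto a small header parser. A Python sketch (my own illustration — `parse_cfst_header` is a hypothetical name, and little-endian byte order is assumed, matching the R reader above):

```python
import struct

def parse_cfst_header(stream: bytes) -> dict:
    # fixed 32-byte prefix: four unsigned ints, then two unsigned long longs
    header_hash, block_size, version, algo_field = struct.unpack_from("<4I", stream, 0)
    vec_length, block_hash = struct.unpack_from("<2Q", stream, 16)
    # same block count formula as the R code: ceil(vec_length / block_size)
    nr_of_blocks = 1 + (vec_length - 1) // block_size
    # nrOfBlocks + 1 offsets delimit the compressed blocks that follow
    block_offsets = struct.unpack_from("<%dQ" % (nr_of_blocks + 1), stream, 32)
    return {
        "header_hash": header_hash,
        "block_size": block_size,
        "version": version,
        "compression_algorithm": algo_field & 0x7FFFFFFF,  # lower 31 bits
        "is_hashed": bool(algo_field >> 31),               # upper bit
        "vec_length": vec_length,
        "block_hash": block_hash,
        "block_offsets": block_offsets,
    }
```

With the offsets in hand, each compressed block can then be sliced out and handed to an LZ4 or ZSTD decoder, as in the R and Julia readers in this thread.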

If you want to try and write a Julia reader / writer, please let me know if I can help!

thanks

@MarcusKlik
Collaborator

MarcusKlik commented Nov 13, 2019

Note that I use the qs decompression methods (qs::lz4_decompress_raw() and qs::zstd_decompress_raw()) to show that we are really decompressing LZ4- or ZSTD-compatible memory blocks.

Also, I've omitted the hash checks for simplicity. These checks can be added to verify the data and avoid problems with decompression of corrupted memory blocks...

@xiaodaigh
Contributor Author

Nice. This should cover double and int, so what's left is factors and strings. A native Julia reader would be awesome.

@xiaodaigh
Contributor Author

xiaodaigh commented Nov 13, 2019

Only ZSTD is working, not LZ4. Interesting. Here is the Julia code:

uncompress_cfst(path) = begin
	cfst_file = open(path, "r")

	# first few header bytes
	header_hash = read(cfst_file, UInt32)
	block_size = Int(read(cfst_file, UInt32))
	version = Int(read(cfst_file, UInt32))
	compression_algorithm = read(cfst_file, UInt32) & 2147483647

	vec_length = Int(read(cfst_file, UInt64))

	# calculate the number of blocks
	nr_of_blocks = 1 + div(vec_length - 1, block_size)

	block_hash = read(cfst_file, UInt64)

	# read the nr_of_blocks + 1 block offsets
	block_offsets = Vector{UInt64}(undef, nr_of_blocks + 1)
	read!(cfst_file, block_offsets)
	block_offsets_int = Int.(block_offsets)

	# decompress each block and concatenate the results
	fnl_res = UInt8[]
	for (lo, hi) in zip(@view(block_offsets_int[1:end-1]), @view(block_offsets_int[2:end]))
		compressed_chunk = read(cfst_file, hi - lo)
		if compression_algorithm == 1
			append!(fnl_res, transcode(LZ4Decompressor, compressed_chunk))
		else
			append!(fnl_res, transcode(ZstdDecompressor, compressed_chunk))
		end
	end

	close(cfst_file)
	fnl_res
end

using CodecLz4
using CodecZstd


path = "c:/data/1.cfst"

include("c:/scratch/fst.jl")
res = uncompress_cfst(path)

reinterpret(Int32, res)

@xiaodaigh
Contributor Author

Hey @MarcusKlik, so if you describe how factor and string columns are compressed and what the fst format looks like in a document, then theoretically I can write a 100%-Julia reader of fst. That would be really nice. Are you able to give me some pointers as to the format of fst files? Or do you want me to read the C++ code you wrote? The issue with C++ is that I don't know any of it.

Then if we have a Python reader too, I can use fst everywhere.

@MarcusKlik
Collaborator

Hi @xiaodaigh,

thanks for posting the code, interesting to see the equivalent code in Julia!

The format that compress_fst() uses to compress a raw vector is different from the fst format, apologies for not making that clear!

The compress_fst() format was designed as the simplest format that could accommodate multithreaded compression using various (possibly single-threaded) compressors. So the header is used to store the block-sizes, the selected compressor and some hashing information, the rest of the stream consists of compressed data blocks.

The fst format is a bit more complicated so that it can accommodate random access, more data types, dynamic selection of compressors and many other features, so it will be harder to implement natively. More so because it uses custom bit-shifters (C++) and other algorithms to enhance compression speed or provide efficient access (you need to run the C++ code for it to work).

It's still a great idea to create fst bindings for Julia. But the best way would probably be to use the fstlib library directly by calling the C++ code from Julia (same for Python).

thanks and great to see that you published your disk.frame package to CRAN :-)
