fst::compress_fst can't seem to be read back in in another language #223

Closed
xiaodaigh opened this issue Nov 11, 2019 · 11 comments

@xiaodaigh
Contributor

I wonder if there are any additional bytes or settings I need to be aware of when using the fst::compress_fst function.

For example, in R I write the compressed bytes to disk

a = writeBin(1:10, raw(), size = 4)
ca = fst::compress_fst(a, compression = 50)
writeBin(ca, "d:/data/plsdel.io")

and I try to read it back using Julia

using CodecZstd
ca = open("d:/data/plsdel.io", "r") do io
     read(io)
end

transcode(ZstdDecompressor, ca)

gives an error.

I wonder whether the compression used by fst is interoperable with other tools in general?
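One quick sanity check (my own addition, not from the thread): a standard Zstandard frame starts with the magic bytes `28 B5 2F FD` (0xFD2FB528 little-endian), and an LZ4 frame starts with `04 22 4D 18` (0x184D2204). A file that begins with neither cannot be decoded as a plain frame, which is consistent with the error above. A minimal Python sketch (`frame_format` is a hypothetical helper name):

```python
# Hypothetical helper: guess the frame format from the leading magic bytes.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"       # Zstandard frame magic (0xFD2FB528, little-endian)
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # LZ4 frame format magic (0x184D2204, little-endian)

def frame_format(data: bytes) -> str:
    """Return 'zstd', 'lz4-frame', or 'unknown' based on the first four bytes."""
    if data.startswith(ZSTD_MAGIC):
        return "zstd"
    if data.startswith(LZ4_FRAME_MAGIC):
        return "lz4-frame"
    return "unknown"
```

A compress_fst stream starts with its own metadata header rather than one of these magics, so a frame-based decoder rejects it at byte 0.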

@xiaodaigh
Contributor Author

In contrast, the {lz4} package seems to work: https://github.com/bwlewis/lz4

library(lz4)
a = writeBin(1:10, raw(), size = 4)
ca = lzCompress(a)
writeBin(ca, "d:/data/plsdel.io")

works with

using CodecLz4

a = open("/mnt/d/data/plsdel.io", "r") do io
       read(io)
end
da = transcode(LZ4Decompressor, a)
reinterpret(Int32, da) # 1, 2, ..., 10

@MarcusKlik
Collaborator

Hi @xiaodaigh,

you're right, the raw vector returned by fst::compress_fst() is not a default LZ4 (or ZSTD) compression result!

The reason is that fst::compress_fst() divides the input into blocks to allow for parallel compression of each block (increasing performance). So the result is basically a few default LZ4 compression results pasted together, with a small amount of metadata at the start.
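The scheme described here can be sketched generically. The following Python illustration (my own, not from the thread) uses zlib as a stand-in compressor and a made-up minimal header (block size, total length, cumulative block-end offsets) — it is not fst's actual layout, just a demonstration of blocked, parallelisable compression:

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocked(data: bytes, block_size: int = 16_384) -> bytes:
    # split the input into fixed-size blocks
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # compress each block independently -- this is what makes the scheme parallel
    with ThreadPoolExecutor() as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    # hypothetical metadata: block size, total length, then cumulative end offsets
    header = struct.pack("<QQ", block_size, len(data))
    offsets, pos = [], 0
    for chunk in compressed:
        pos += len(chunk)
        offsets.append(pos)
    header += struct.pack("<%dQ" % len(offsets), *offsets)
    return header + b"".join(compressed)

def decompress_blocked(stream: bytes) -> bytes:
    block_size, total = struct.unpack_from("<QQ", stream, 0)
    nr_blocks = 1 + (total - 1) // block_size
    offsets = struct.unpack_from("<%dQ" % nr_blocks, stream, 16)
    body = stream[16 + 8 * nr_blocks:]
    out, lo = b"", 0
    for hi in offsets:  # each block is an independent compressed unit
        out += zlib.decompress(body[lo:hi])
        lo = hi
    return out
```

Because the offsets are stored up front, a reader could also decompress the blocks in parallel, which is exactly the multithreaded-read idea raised below.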

Would it be interesting to have an option to force a compatible result (e.g. compatibility_mode = TRUE)?

@xiaodaigh
Contributor Author

Or is it possible to know the cut points and the length of the metadata, so the reader can be multithreaded too? A compat mode would be nice as well.

@MarcusKlik
Collaborator

Hi @xiaodaigh, yes certainly!

the block sizes are calculated in this code and the metadata format is specified here.

For a reader application to read the compressed raw vector, the metadata should be interpreted first. For example, the actual compressor used can be found in the metadata.

So it's not too difficult to write code to read a stream compressed with compress_fst(). The question is, of course, whether it would be interesting enough to users, as there are alternative formats that allow for a large number of compressors (think zip or 7z 😸)

@xiaodaigh
Contributor Author

If I follow those formats, can I theoretically write a fst reader/writer in pure Julia?

@MarcusKlik
Collaborator

MarcusKlik commented Nov 13, 2019

Hi @xiaodaigh, yes, that would be relatively simple if you have the LZ4 and ZSTD compressors available.

I've created a function in pure R to decompress a file with compressed data. By following the same strategy in Julia, you can create a (de-)compressor for this type of file!

library(dplyr)
library(bit64)

# helper to convert an unsigned int to a bit64
uint_to_long <- function(unsigned_int) {
  high_bit <- bitwShiftR(unsigned_int, 31)
  low_bits <- bitwAnd(unsigned_int, 2147483647)
  
  bit64::as.integer64(1073741824L) * 2 * high_bit + bit64::as.integer64(low_bits)
}

# helper to convert two integers to a bit64
ints_to_long <- function(low_int, high_int) {
  bit64::as.integer64(1073741824L) * 4 * uint_to_long(high_int) +
    uint_to_long(low_int)
}

decompress_cfst <- function(path) {

  # open binary connection
  cfst_file <- file(path, "rb")

  # first few header bytes
  params <- readBin(cfst_file, "integer", 4)
  
  block_size <- uint_to_long(params[2])
  version <- uint_to_long(params[3])
  compression_algorithm <- bitwAnd(params[4], 2147483647)
  
  # read some more header
  params <- readBin(cfst_file, "integer", 4)
  
  vec_length <- ints_to_long(params[1], params[2])

  # calculate the number of blocks
  nr_of_blocks <- as.integer(1 + (vec_length - 1) %/% block_size)

  # read block offsets
  params <- readBin(cfst_file, "integer", 2 + nr_of_blocks * 2)

  block_offsets <- integer64(nr_of_blocks + 1)
  
  for (block_nr in 1:(nr_of_blocks + 1)) {
    block_offsets[block_nr] <- ints_to_long(params[block_nr * 2 - 1], params[block_nr * 2])
  }

  # read chunks
  res <- lapply(1:nr_of_blocks, function(block_nr) {
    compressed_chunk <- readBin(cfst_file, "raw", as.numeric(block_offsets[block_nr + 1] - block_offsets[block_nr]))
    
    if (compression_algorithm == 1) {
      decompressed_chunk <- qs::lz4_decompress_raw(compressed_chunk)
    } else {
      decompressed_chunk <- qs::zstd_decompress_raw(compressed_chunk)
    }
    
    decompressed_chunk
  })
  
  close(cfst_file)
  
  unlist(res)
}


# some compressible sample data
x <- as.raw(sample(10:20, 1e6, replace = TRUE))

path <- "1.cfst"

# compress and write to file
fst::compress_fst(x, "ZSTD", 100, hash = TRUE) %>%
  writeBin(path)

y <- decompress_cfst(path)

# in equals out
testthat::expect_equal(x, y)

# compression ratio
as.numeric(object.size(x)) / file.size(path)
#> [1] 2.244948

So this is a (slow 😸) pure R method to deserialize a binary file with the output of compress_fst(). For completeness, the header information is stored in the following format:

| Size | Type | Description |
|------|------|-------------|
| 4 | unsigned int | header hash |
| 4 | unsigned int | blockSize |
| 4 | unsigned int | version |
| 4 | unsigned int | COMPRESSION_ALGORITHM, with isHashed in the upper bit |
| 8 | unsigned long long | vecLength |
| 8 | unsigned long long | block hash result |
| 8 * (nrOfBlocks + 1) | unsigned long long | block offsets |
| x | unsigned char | compressed data |
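This table maps directly onto a small header parser. A Python sketch (my own illustration — `parse_cfst_header` is a hypothetical name, and little-endian byte order is assumed, matching the R reader above):

```python
import struct

def parse_cfst_header(stream: bytes) -> dict:
    # fixed 32-byte prefix: four unsigned ints, then two unsigned long longs
    header_hash, block_size, version, algo_field = struct.unpack_from("<4I", stream, 0)
    vec_length, block_hash = struct.unpack_from("<2Q", stream, 16)
    # same block count formula as the R code: ceil(vec_length / block_size)
    nr_of_blocks = 1 + (vec_length - 1) // block_size
    # nrOfBlocks + 1 offsets delimit the compressed blocks that follow
    block_offsets = struct.unpack_from("<%dQ" % (nr_of_blocks + 1), stream, 32)
    return {
        "header_hash": header_hash,
        "block_size": block_size,
        "version": version,
        "compression_algorithm": algo_field & 0x7FFFFFFF,  # lower 31 bits
        "is_hashed": bool(algo_field >> 31),               # upper bit
        "vec_length": vec_length,
        "block_hash": block_hash,
        "block_offsets": block_offsets,
    }
```

With the offsets in hand, each compressed block can then be sliced out and handed to an LZ4 or ZSTD decoder, as in the R and Julia readers in this thread.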

If you want to try and write a Julia reader / writer, please let me know if I can help!

thanks

@MarcusKlik
Collaborator

MarcusKlik commented Nov 13, 2019

Note that I use the qs decompression methods (qs::lz4_decompress_raw() and qs::zstd_decompress_raw()) to show that we are really decompressing LZ4- or ZSTD-compatible memory blocks.

Also, I've omitted the hash checks for simplicity. These checks can be added to verify the data and avoid problems with decompression of corrupted memory blocks...

@xiaodaigh
Contributor Author

Nice. This should cover double and int, so what's left is factors and strings. A native Julia reader would be awesome.

@xiaodaigh
Contributor Author

xiaodaigh commented Nov 13, 2019

Only ZSTD is working, not LZ4. Interesting. Here is the Julia code:

uncompress_cfst(path) = begin
	cfst_file = open(path, "r")

	# first few header bytes
	header_hash = read(cfst_file, UInt32)
	block_size = Int(read(cfst_file, UInt32))
	version = Int(read(cfst_file, UInt32))
	compression_algorithm = read(cfst_file, UInt32) & 2147483647

	vec_length = Int(read(cfst_file, UInt64))

	# calculate the number of blocks
	nr_of_blocks = 1 + div(vec_length - 1, block_size)

	block_hash = read(cfst_file, UInt64)

	# read the nr_of_blocks + 1 block offsets
	block_offsets = Vector{UInt64}(undef, nr_of_blocks + 1)
	read!(cfst_file, block_offsets)
	block_offsets_int = Int.(block_offsets)

	# decompress each block and concatenate the results
	fnl_res = UInt8[]
	for (lo, hi) in zip(@view(block_offsets_int[1:end-1]), @view(block_offsets_int[2:end]))
		compressed_chunk = read(cfst_file, hi - lo)
		if compression_algorithm == 1
			append!(fnl_res, transcode(LZ4Decompressor, compressed_chunk))
		else
			append!(fnl_res, transcode(ZstdDecompressor, compressed_chunk))
		end
	end

	close(cfst_file)
	fnl_res
end

using CodecLz4
using CodecZstd


path = "c:/data/1.cfst"

include("c:/scratch/fst.jl")
res = uncompress_cfst(path)

reinterpret(Int32, res)

@xiaodaigh
Contributor Author

Hey @MarcusKlik, so if you describe how factor and string columns are compressed and what the fst format looks like in a document, then theoretically I can write a 100%-Julia reader of fst. That would be really nice. Are you able to give me some pointers as to the format of fst files? Or do you want me to read the C++ code you wrote? The issue with C++ is that I don't know any of it.

Then if we have a Python reader too, I can use fst everywhere.

@MarcusKlik
Collaborator

Hi @xiaodaigh,

thanks for posting the code, interesting to see the equivalent code in Julia!

The format that compress_fst() uses to compress a raw vector is different from the fst format, apologies for not making that clear!

The compress_fst() format was designed as the simplest format that could accommodate multithreaded compression using various (possibly single-threaded) compressors. So the header is used to store the block-sizes, the selected compressor and some hashing information, the rest of the stream consists of compressed data blocks.

The fst format is a bit more complicated so that it can accommodate random access, more data types, dynamic selection of compressors and many other features, so it will be harder to implement natively. More so because it uses custom bit-shifters (C++) and other algorithms to enhance compression speed or provide efficient access (you need to run the C++ code for it to work).

It's still a great idea to create fst bindings for Julia. But the best way would probably be to use the fstlib library directly by calling the C++ code from Julia (same for Python).

thanks and great to see that you published your disk.frame package to CRAN :-)
