-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fst::compress_fst
can't seem to be read back in in another language
#223
Comments
In contrast the library(lz4)
a = writeBin(1:10, raw(), size = 4)
ca = lzCompress(a)
writeBin(ca, "d:/data/plsdel.io") works with using CodecLz4
a = open("/mnt/d/data/plsdel.io", "r") do io
read(io)
end
transcode(LZ4Decompressor, a)
reinterpret(Int32, a) # 1, 2, ..., 10 |
Hi @xiaodaigh, you're right, the raw vector result from The reason is that Would it be interesting to have an option to force a compatible result (e.g. compatibility_mode = TRUE) ? |
Or is it possible to know the cuts points and the length of meta data, so the reader can be multithreaded in the read too? Compat mode would be nice too. |
Hi @xiaodaigh, yes certainly! the block sizes are calculated in this code and the metadata format is specified here. For a reader application to read the compressed raw vector, the metadata should be interpreted first. For example, the actual compressor used can be found in the metadata. So, it's not to difficult to write code to read a stream compressed with |
If I follow those formats, can I theoretically write a fst reader/writer in pure Julia? |
Hi @xiaodaigh, yes, that would be relatively simple if you have the LZ4 and ZSTD compressors available. I've created a function in pure library(dplyr)
library(bit64)
# helper to convert an unsinged int to a bit64
uint_to_long <- function(unigned_int) {
high_bit <- bitwShiftR(unigned_int, 31)
low_bits <- bitwAnd(unigned_int, 2147483647)
bit64::as.integer64(1073741824L) * 2 * high_bit + bit64::as.integer64(low_bits)
}
# helper to convert two integers to a bit64
ints_to_long <- function(low_int, high_int) {
bit64::as.integer64(1073741824L) * 4 * uint_to_long(high_int) +
uint_to_long(low_int)
}
decompress_cfst <- function(path) {
# open binary connection
cfst_file <- file(path, "rb")
# first few header bytes
params <- readBin(cfst_file, "integer", 4)
block_size <- uint_to_long(params[2])
version <- uint_to_long(params[3])
compression_algorithm <- bitwAnd(params[4], 2147483647)
# read some more header
params <- readBin(cfst_file, "integer", 4)
vec_length <- ints_to_long(params[1], params[2])
# calculate the number of blocks
nr_of_blocks <- as.integer(1 + (vec_length - 1) %/% block_size)
# read block offsets
params <- readBin(cfst_file, "integer", 2 + nr_of_blocks * 2)
block_offsets <- integer64(nr_of_blocks + 1)
for (block_nr in 1:(nr_of_blocks + 1)) {
block_offsets[block_nr] <- ints_to_long(params[block_nr * 2 - 1], params[block_nr * 2])
}
# read chunks
res <- lapply(1:nr_of_blocks, function(block_nr) {
compressed_chunk <- readBin(cfst_file, "raw", as.numeric(block_offsets[block_nr + 1] - block_offsets[block_nr]))
if (compression_algorithm == 1) {
res <- decompressed_chunk <- qs::lz4_decompress_raw(compressed_chunk)
} else {
res <- decompressed_chunk <- qs::zstd_decompress_raw(compressed_chunk)
}
res
})
close(cfst_file)
unlist(res)
}
# some compressible sample data
x <- as.raw(sample(10:20, 1e6, replace = TRUE))
path <- "1.cfst"
# compress and write to file
fst::compress_fst(x, "ZSTD", 100, hash = TRUE) %>%
writeBin(path)
y <- decompress_cfst(path)
# in equals out
testthat::expect_equal(x, y)
# compression ratio
as.numeric(object.size(x)) / file.size(path)
#> [1] 2.244948 So this is a (slow 😸) pure
If you want to try and write a Julia reader / writer, please let me know if I can help! thanks |
Note that I use the Also, I've omitted the hash checks for simplicity. These checks can be added to verify the data and avoid problems with decompression of corrupted memory blocks... |
Nice. This should cover double and int so what's left is factors and strings. A native Julia reader would be awesome. |
Only ZSTD is working but not LZ4. Interesting. Here is the Julia code uncompress_cfst(path) = begin
cfst_file = open(path, "r")
# first few header bytes
headerhash = read(cfst_file, UInt32)
block_size = Int(read(cfst_file, UInt32))
version = read(cfst_file, UInt32) |> Int
compression_algorithm = read(cfst_file, UInt32) & 2147483647
vec_length = read(cfst_file, UInt64)
vec_length |> Int
# calculate the number of blocks
nr_of_blocks = Int(1 + div(vec_length - 1, block_size))
block_hash = read(cfst_file, UInt64)
block_offsets = Vector{UInt64}(undef, 1 + nr_of_blocks)
# read block offsets
read!(cfst_file, block_offsets)
# the first number is the hash
block_offsets_int = Int.(block_offsets)
lo = block_offsets_int[1]
hi = block_offsets_int[2]
compressed_chunk = read(cfst_file, hi-lo)
if compression_algorithm == 1
fnl_res = transcode(LZ4Decompressor, compressed_chunk)
else
fnl_res = transcode(ZstdDecompressor, compressed_chunk)
end
for (lo, hi) in zip(@view(block_offsets_int[2:end-1]), @view(block_offsets_int[3:end]))
compressed_chunk = read(cfst_file, hi-lo)
if compression_algorithm == 1
fnl_res = vcat(fnl_res, transcode(LZ4Decompressor, compressed_chunk))
else
fnl_res = vcat(fnl_res, transcode(ZstdDecompressor, compressed_chunk))
end
end
close(cfst_file)
fnl_res
end
using CodecLz4
using CodecZstd
path = "c:/data/1.cfst"
include("c:/scratch/fst.jl")
res = uncompress_cfst(path)
reinterpret(Int32, res) |
Hey @MarcusKlik, so if you describe how factors and strings column are compressed and how the fst's format is like in a document then theoretically I can write a 100%-Julia reader of fst. That would be really nice. Are you able to give me some pointers as to the format of fst files? Or you want me to the C++ code you wrote? The issue with C++ is that I don't know any of it. Then if we have a PYthon reader too then I can use fst everywhere. |
Hi @xiaodaigh, thanks for posting the code, interesting to see the equivalent code in The format that The The fst format is a bit more complicated so that it can accommodate random access, more data types, dynamic selection of compressors and many other features, so will be harder to implement natively. More so because it uses custom bit-shifters (C++) and other algorithms to enhance compression speed or provide efficient access (you need to run the C++ code for it to work). It's still a great idea to create thanks and great to see that you published your |
I wonder if there are any additional bytes or any settings I need to be aware of about the
fst::compress_fst
function.For example, in R I write the compressed bytes to disk
and I try to read it back using Julia
gives error.
I wonder if the compression in fst are interopable with elsewhere in general?
The text was updated successfully, but these errors were encountered: