
Is zstd splittable in hadoop/spark/etc? #395

Open · trixpan opened this issue Sep 28, 2016 · 57 comments

@trixpan (Contributor) commented Sep 28, 2016

I know the 4mc container adds splittability to zstd, but is the standard by itself able to do so, without the help of 3rd-party implementations?

Note: I am aware zstd is still not officially implemented in Hadoop, so this is more of a theoretical question.

@Cyan4973 (Contributor)

We plan to re-use and extend the skip format based on the skippable frames implemented in pzstd for multi-threaded decompression. 4mc's @carlomedas is aware of this context. We may spend some time in the future to properly define it and make it part of the official format.

@advancedxy

Hi @trixpan, I talked to 4mc's author. It would be possible to write metadata in zstd skippable frames and make the whole file splittable by Hadoop (it would need a special InputFormat).

But the idea would be essentially the same as 4mc's.

@trixpan (Contributor, Author) commented Feb 10, 2017

@advancedxy I saw 4mc previously.

I think people's concern about 4mc is the limited community (although I must confess compression algorithms don't strike me as pieces of code with hundreds of contributors)

@ghost commented Feb 27, 2017

Is the format in which pzstd uses skippable frames official? If it is, then I can update the Hadoop project to utilize anything compressed with the zstd CLI and then copied into Hadoop. Have you decided on or finalized the format of pzstd files? Thank you.

@Cyan4973 (Contributor)

We don't yet have a splittable format.

This is an active topic; we are working on it, and we expect to make progress on this objective throughout March.
Btw, it's probably the best time to get involved in this process, as we will work on this specification with several projects interested in a splittable format. So don't hesitate if you want to be part of it.

The final format will likely differ from the current pzstd one.
Not that there is any problem with the pzstd format; it's just that we can do better.
The pzstd format will likely be discontinued, though that's not a concern for existing compressed data: the pzstd format was designed to be compatible with the zstd decoder, and it is perfectly decodable by any zstd-compliant decoder. That will remain true.
We are merely going to use different methods to generate splittable and multi-threaded data.

@ghost commented Feb 27, 2017

@Cyan4973 thanks for the update. I think having zstd splittable in Hadoop would be very attractive to a lot of users. I would like to be involved in discussing the proposed splittable format, if possible. Thank you.

@Cyan4973 (Contributor)

cc @iburinoc

@carlomedas

I definitely agree with this approach: 4mc is nice, but we need this as mainstream in zstd.
A splittable zstd, as a matter of fact, just leverages the already fully scalable compression format to make sure it can be processed/emitted in blocks with the related indexes.

@advancedxy

Agreed. Please involve me in the discussion and implementation if possible.

@sean-purcell (Contributor)

Here's an overview of our current thoughts on how to implement this; any feedback is appreciated!

Splittable/Seekable Format

A splittable and/or seekable Zstandard format could be useful in a variety
of scenarios, by allowing decompression/processing to be parallelized,
or by allowing small subsections to be decompressed without having
to decompress the entire preceding file.

The proposed general format of a splittable Zstandard file would consist of a
sequence of Zstandard compressed frames, with a skippable frame containing
a "jump table". This jump table would contain the information allowing
decompressors to seek directly to the frame they want to decompress,
rather than having to process the entire file leading up to it.
This design is based on the format used in @carlomedas's 4mc project,
which currently adds splittability to LZ4 and Zstd for use in Hadoop.
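
To visualize the proposal (drawn here with the jump table at the end of the file; its position is exactly the first open question below):

/* Hypothetical on-disk layout; illustrative only, not a finalized spec:
 *
 *   [Zstd frame 1][Zstd frame 2] ... [Zstd frame N][skippable frame: jump table]
 *
 * Each Zstd frame is independently decompressible; the jump table records
 * where each frame lives, so a reader can seek straight to the one it needs.
 */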

Given this outline, there are a few choices left to be made:

Jump Table Position

Where in the file should the jump table be placed?
There are two obvious choices here:

  • Beginning of the file:
    • Pros:
      • Decompressors don't need to seek anywhere to find the jump table,
        they can read it immediately
    • Cons:
      • We need to know how many chunks we're going to write before we start
        anything
      • We need to be able to seek back afterwards and write the jump table
        • Very difficult to use with streaming
      • Can't append new data
  • End of file:
    • Pros:
      • Can write the jump table after processing all data, so it works with streaming
      • Can append more data easily, just overwrite the jump table (or leave it as a skipped frame),
        and write a new one at the new end
    • Cons:
      • More complicated for the decompressor, they must seek to the end of the file
        to determine where to go
  • Multiple jump tables:
    This style could be similar to the beginning-of-file one, except with an
    additional field that indicates the jump offset to find the next jump table.
    This works with appending data; however, it still requires either seeking back to write,
    or compressing chunks in memory and writing them out later on, so it's still
    challenging to use in streaming mode, and it's very complex for the decompressor.

The end-of-file approach seems best, as I'm not aware of any scenarios
where the seek-to-end-of-file requirement, for reads, is an issue,
but if there are such scenarios then this topic is open.

Jump Table Format

The jump table would be a Zstandard skippable frame,
with a header (or footer, depending on the position of the jump table)
containing information such as the number of chunks and some flags,
and then an array containing the data of the jump table itself.

Potentially useful data to have in the table is, for each frame:

  • compressed size
  • uncompressed size
  • a checksum for the frame
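
As a purely illustrative sketch of what such a table could look like in C (all field names and widths here are assumptions, not a finalized spec):

#include <stdint.h>

/* Hypothetical per-frame entry in the jump table. */
typedef struct {
    uint32_t compressedSize;    /* or an offset; see the discussion below */
    uint32_t decompressedSize;
    uint32_t checksum;          /* optional, e.g. low 4 bytes of XXH64 */
} JumpTableEntry_t;

/* Hypothetical footer placed at the very end of the skippable frame,
 * so that a reader can locate the table by seeking to end-of-file. */
typedef struct {
    uint32_t numberOfFrames;
    uint8_t  flags;             /* e.g. checksum present, constant frame size */
    uint32_t magicNumber;       /* identifies this skippable frame as a jump table */
} JumpTableFooter_t;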

Compressed Size

There are two main options for storing the compressed size:
do we store an offset, or do we store the size?

  • Offset:
    • Pros:
      • For seekability, we don't need to process every preceding frame in the table
      • This method could also be used to avoid having entries in the jump table for skippable frames,
        although then a method like ZSTD_findCompressedSize would be needed to get the actual
        compressed size of a frame
    • Cons:
      • To find the actual compressed size, we also need to read the next entry in the table
      • Size constraints: if we store the offset, for a large file (> 4 GB) we need more than 4 bytes
        for each entry (likely 8 bytes in this case). Do we then always use 8 bytes? Or perhaps make that a flag?
  • Size:
    • Pros:
      • We can limit frame size to 4 GB while still having files larger than that, and so we only need 4 bytes per field
        • Although we could still leave this open as an option, if larger chunks are desired
    • Cons:
      • Finding the offset to seek to requires processing all previous entries in the jump table

Thoughts: we could leave this as an option to be chosen by a flag,
although I personally prefer the offset option, as it provides good behaviour in both cases,
and the size issue only comes into play when dealing with files that are already large.

Uncompressed Size

Here we have the same offset/size problem as with the compressed size, so that section more or less applies to both.
One additional point to add is that for seeking to the middle of a file,
the offset method allows binary search to be used,
while the size method requires a linear scan.
It's unlikely that the jump table would be large enough for this to be an issue, however.

Another point with uncompressed size is that it's likely that, in many cases,
each frame will have the same uncompressed size.
If this case does occur, then it could be worth adding a flag/field
that allows the jump table to set a single uncompressed size for all frames,
allowing a decompressor to immediately determine which frames it needs
to decompress based on the offset and frame size.

I think the optimal solution for uncompressed size would be to resolve the
size/offset issue with the same choice as for the compressed size,
for consistency, and then allow a flag in the header to indicate that
all frames have the same uncompressed size instead of storing it with each entry.
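
To make the binary-search point concrete, here is a sketch of locating the frame that contains a given decompressed position, assuming a hypothetical in-memory array of cumulative decompressed offsets (one per frame, plus a total-size sentinel); with the size method, that array would first have to be built by a linear prefix-sum pass over the table:

#include <stddef.h>
#include <stdint.h>

/* Returns the index of the frame containing decompressed position `pos`.
 * `offsets[i]` is the decompressed offset of the start of frame i, and
 * `offsets[numFrames]` is the total decompressed size (sentinel).
 * Requires offsets[0] <= pos < offsets[numFrames]. Illustrative only. */
static size_t frameContaining(const uint64_t* offsets, size_t numFrames,
                              uint64_t pos)
{
    size_t lo = 0, hi = numFrames;  /* invariant: offsets[lo] <= pos < offsets[hi] */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (offsets[mid] <= pos) lo = mid;
        else                     hi = mid;
    }
    return lo;
}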

Checksum

It might be desirable to have a checksum accompany each frame so we can
verify frames.
Zstandard already has a per-frame checksum over uncompressed data;
however, the layout of this format thus far would allow it
to be trivially extended to frames compressed with other algorithms,
e.g. gz or xz.

So the questions to be answered here are:

Do we include a checksum?
I think this should be a configurable option in the format.

Which algorithm should be used?
For consistency with Zstandard, the least significant 4 bytes of XXH64 makes sense.

Should it be over the compressed or the uncompressed data?
Uncompressed data: allows us to verify both the compressed data and the decompression process.
Compressed data: allows us to check the data before we attempt decompression.

I think uncompressed data makes more sense, especially if we're mixing various
compression algorithms; and since we have to read the compressed data anyway,
we might as well decompress it while we're at it.
This is definitely worth discussing more, however.
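
For reference, taking the least significant 4 bytes of XXH64 (the same convention Zstandard's own frame checksum uses, with seed 0) is a one-liner with the xxhash library:

#include <stdint.h>
#include "xxhash.h"

/* Low 32 bits of XXH64 over a frame's data, seed 0, matching the
 * convention of Zstandard's built-in frame checksum. */
static uint32_t frameChecksum(const void* data, size_t size)
{
    return (uint32_t)XXH64(data, size, 0);
}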

Dictionary Compression

One way to recoup the compression-ratio losses of splitting the input
is to use dictionaries.
While training dictionaries on data is currently quite slow,
and so might not be suited to doing inline,
a possible future improvement is to train a dictionary
on the frames, use it with each one,
and include this dictionary somewhere in the compressed file.

Some obvious impacts this usage could have are:

  • If we have a special frame containing the dictionary at the start of the file,
    the jump table can't make the assumption that the first frame starts at offset 0,
    which complicates the size method.
  • If dictionaries aren't necessarily stored at the beginning of the file but simply
    stored inline, we need to be able to determine whether a frame needs a certain
    dictionary for decoding, and if so, where to find it.
    A possible solution here is to include some sort of dictionary field in the jump table,
    e.g. the offset at which to find the dictionary for decompression (these could then be
    cached and used as DDicts for efficient decompression of multiple frames).

We should make sure to put some thought into these and other possible related situations
to ensure that the format doesn't hamper future expansion.
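
For what it's worth, the caching idea maps directly onto the existing public dictionary API; a minimal sketch, where how `dictBuf` was located (e.g. via a hypothetical dictionary-offset field in the jump table) is the assumed part:

#include <zstd.h>

/* Cache the inline dictionary once:
 *   ZSTD_DDict* ddict = ZSTD_createDDict(dictBuf, dictSize);
 * then reuse it for every frame that references it. */
static size_t decompressFrameWithDict(ZSTD_DCtx* dctx,
                                      void* dst, size_t dstCapacity,
                                      const void* frame, size_t frameSize,
                                      const ZSTD_DDict* ddict)
{
    return ZSTD_decompress_usingDDict(dctx, dst, dstCapacity,
                                      frame, frameSize, ddict);
}

/* When done: ZSTD_freeDDict(ddict); */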

@carlomedas

I definitely like this.

Here are my two comments/preferences:

  • Jump Table Position / End of File: one more pro is that Hadoop/Spark/other big-data compression codecs will be able to work with this flawlessly, since HDFS output streams only work in append mode.

  • Jump Table Format / Compressed Size / By Size: I would even remove the up-to-4 GB frame limit from the 'Cons', as that limit is already very generous: when you want the file to be splittable, even 64 MB is too big a frame size, IMHO.

@advancedxy

My 2 cents:

  • Jump Table Position: End of File. It works well with append-only files, and I really like the idea of skipping the old jump table and adding a new one.
  • Jump Table Format:
    • Compressed Size/Uncompressed Size: Size. The meta (footer or header) should be negligible to process. I'd prefer no flags, for simplicity.
    • Checksum: XXH64 of uncompressed data.

@sean-purcell (Contributor)

I'm not sure no flags is the best option; it leaves very little room to expand.

Potential flags I can think of currently:

  • Ability to have a constant block decompressed size. This could be useful in seekable files, as they wouldn't need variable-size blocks and could immediately determine the block number they need. Admittedly, this doesn't work very well with skippable frames, unless the offset method is used.
  • Ability to turn off the checksum. This isn't as useful with large files, but with smaller files it could be preferable. Also, since the Zstd frame format already has a checksum of the uncompressed data, we'd probably want to be able to turn one of them off. If you can turn off the one in the splittable header, then a regular (non-splittable-aware) decompressor can still verify the checksums.
  • Some sort of dict-ID specification/jump table: a possible use case would be to encode a few dictionaries inline at the start of the file (or elsewhere), and then have the compressor decide on the fly which one to use for each chunk. We'd need some way for the decompressor to find these, possibly by having the option to add a dictionary jump table.

Since the implementation will be in the library itself, we can handle some complexity there, and we want to make sure we have the flexibility to handle all use-cases that could benefit from this.
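
To illustrate, the flags could be a small bitfield along these lines (names and bit assignments are hypothetical):

/* Hypothetical flag bits for the jump table header/footer;
 * purely illustrative, not part of any finalized spec. */
enum {
    JT_FLAG_CHECKSUM        = 1 << 0,  /* per-frame checksums present      */
    JT_FLAG_CONST_BLOCKSIZE = 1 << 1,  /* constant block decompressed size */
    JT_FLAG_DICT_TABLE      = 1 << 2,  /* dictionary jump table present    */
};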

@advancedxy

Since the implementation will be in the library itself, we can handle some complexity there, and we want to make sure we have the flexibility to handle all use-cases that could benefit from this.

On second thought, it's indeed better to use flags to handle various use cases.

@Senemu commented Mar 30, 2017

I'm not aware of any scenarios
where the seek-to-end-of-file requirement, for reads, is an issue,
but if there are such scenarios then this topic is open.

One scenario where seeking is not possible is when reading from a pipe. Granted, in that case no frame can be skipped either. But a performance improvement can still be achieved by not processing frames one is not interested in.

As there are distinct advantages and limitations to having index frames prepended or appended, it might be useful to provide both.

It is already possible to make a distributed, prepended index by using pzstd skippable frames and Zstandard frame headers' Frame_Content_Size.

For appending, I think that a similar simple format would be best.
Other extensions such as checksums and dictionaries would then be better implemented in separate frames.

@sean-purcell (Contributor) commented Apr 14, 2017

I've created a first-round implementation.

The format description is here; comments/feedback are appreciated.

As for the API, the planned compression API looks like this:

/*===== Seekable compressor management =====*/
ZSTD_seekable_CStream* ZSTD_seekable_createCStream(void);
size_t ZSTD_seekable_freeCStream(ZSTD_seekable_CStream* zcs);

/*===== Seekable compression functions =====*/
size_t ZSTD_seekable_initCStream(ZSTD_seekable_CStream* zcs, int compressionLevel, int checksumFlag, unsigned maxFrameSize);
size_t ZSTD_seekable_compressStream(ZSTD_seekable_CStream* zcs, ZSTD_outBuffer* output, ZSTD_inBuffer* input);
size_t ZSTD_seekable_endFrame(ZSTD_seekable_CStream* zcs, ZSTD_outBuffer* output);
size_t ZSTD_seekable_endStream(ZSTD_seekable_CStream* zcs, ZSTD_outBuffer* output);

which looks similar to the regular zstd streaming API.
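
Under that planned API, a compression loop might look like the following sketch (the header name, the 64 KB staging buffer, and the omitted error checks are assumptions; the flush loop mirrors the regular ZSTD_endStream convention, where a non-zero return means more output remains):

#include <stdio.h>
#include "zstd_seekable.h"  /* assumed header location for the API above */

static void compressSeekable(FILE* outFile, const void* src, size_t srcSize)
{
    char dst[1 << 16];
    ZSTD_seekable_CStream* zcs = ZSTD_seekable_createCStream();
    ZSTD_seekable_initCStream(zcs, 3 /*compressionLevel*/, 1 /*checksumFlag*/,
                              1 << 20 /*maxFrameSize: 1 MiB chunks*/);

    ZSTD_inBuffer in = { src, srcSize, 0 };
    while (in.pos < in.size) {
        ZSTD_outBuffer out = { dst, sizeof(dst), 0 };
        ZSTD_seekable_compressStream(zcs, &out, &in);
        fwrite(dst, 1, out.pos, outFile);
    }
    size_t remaining;
    do {  /* flush the last frame, then append the seek table */
        ZSTD_outBuffer out = { dst, sizeof(dst), 0 };
        remaining = ZSTD_seekable_endStream(zcs, &out);
        fwrite(dst, 1, out.pos, outFile);
    } while (remaining != 0);
    ZSTD_seekable_freeCStream(zcs);
}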

On the decompression side it's a bit more complex, since a decompressor needs to be able to seek.

There are a few cases I can think of here that we could provide an API for:

  • File is entirely in memory, similar to the ZSTD_decompress API:
    ZSTD_seekable_decompressRange(void* dst, size_t dstSize, void* src, size_t srcSize, size_t offset, size_t length);
    
  • File is on disk, in which case we could provide a FILE*-oriented API
    ZSTD_seekable_init(ZSTD_seekable* zds, FILE* src);
    ZSTD_seekable_decompress(ZSTD_seekable* zds, void* dst, size_t dstSize, unsigned long long offset); /*< There's a bit of a naming clash here with the in-memory API above */
    
    This API could also be extended to allow more advanced usage when the user can't provide a FILE* object, for example with a callback-based API like this:
    typedef void(*ZSTD_seekable_read)(void* opaque, void* buffer, size_t n);
    typedef void(*ZSTD_seekable_seek)(void* opaque, unsigned long long offset); /* Careful with files > 4 GiB and fseek() */
    typedef struct {
      void* opaque;
      ZSTD_seekable_read read;
      ZSTD_seekable_seek seek;
    } ZSTD_seekable_customFile;
    ZSTD_seekable_init_advanced(ZSTD_seekable* zds, ZSTD_seekable_customFile src);
    
    One concern here is that this model may be hard for users who are creating zstd bindings for other languages. Would this API be usable, or are C callbacks too hard to interact with from other languages? Feedback here is appreciated. An alternative way to provide similar functionality would be a special return code from ZSTD_decompressStream that indicates the client needs to seek to a different position, like what's currently in the proposal. (See the sketch just below.)
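
For what it's worth, wrapping a FILE* with those callbacks would be short; a sketch against the signatures above (fseeko is POSIX; short-read and error handling are deliberately omitted):

#include <stdio.h>

static void fileRead(void* opaque, void* buffer, size_t n)
{
    fread(buffer, 1, n, (FILE*)opaque);
}

static void fileSeek(void* opaque, unsigned long long offset)
{
    fseeko((FILE*)opaque, (off_t)offset, SEEK_SET);  /* 64-bit-safe seek */
}

/* Usage sketch:
 *   ZSTD_seekable_customFile src = { fp, fileRead, fileSeek };
 *   ZSTD_seekable_init_advanced(zds, src);
 */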

Other relevant notes here:

  • It could be useful to allow clients direct access to the contents of the seek table, such as chunk offsets etc, so that they can break up and distribute the frames directly.
    ZSTD_seekable_getFrameCompressedOffset(ZSTD_seekable* zds, unsigned chunkIdx);
    ZSTD_seekable_getFrameDecompressedOffset(ZSTD_seekable* zds, unsigned chunkIdx);
    ZSTD_seekable_convertPosToChunk(ZSTD_seekable* zds, unsigned long long decompressedPosition);
    
  • For range decompression operations, is it easier for the client to provide the input in (offset, length) form, or in (rangeStart, rangeEnd) form?

@trixpan (Contributor, Author) commented Sep 7, 2017

Folks, would you mind clarifying the current status of this?

The reason I ask is that I got quite confused trying to use the git commits to understand what is going on... 😃

e.g. despite multi-threading now being part of the main tool, pzstd is still included as part of the contrib source directory, and while pzstd refers to the Skippable format, the RFC related to this ticket suggests the Seekable format.

I am confused. 😃

@Cyan4973 (Contributor)

The proposed format is implemented in this demo:
https://github.com/iburinoc/zstd/tree/splittable/contrib/seekable_format
So far, it's the only code which implements it.

pzstd is an unrelated project, focused on multithreading for faster processing speed.
pzstd was an interesting experiment with multiple appended frames, which is one mechanism used in the specification. But that's all they have in common, which is to say very little.

@adamkennedy

@iburinoc Direct lookup of the seek table is going to be critical for the Hadoop InputFormat case, correct? After all, some byte range spanning one or more chunks is going to be distributed as an InputSplit to different hosts.

@sean-purcell (Contributor)

@adamkennedy The seek table itself won't be split up, though, and it's not hard to traverse the seek table and convert it into a direct lookup to determine what data lives where in the source file.

@scherepanov commented Nov 1, 2017

I solved a similar problem (breaking large zstd/lz4/lzma files up for parallel processing) by extending the zstd/lz4 framing format. I break the input data into large chunks (1 MB is recommended), with each record residing fully inside a chunk (records are not broken across chunks). Each chunk is compressed independently (as an lz4 frame, a zstd frame, or lzma with no frame), prepended with a transport skippable frame, and written sequentially. The transport frame contains the size of the compressed data, the size of the uncompressed data, the type of compression (I support lz4, zstd, and lzma, and compression types can be mixed within a file), and custom data tags. The transport frame is protected by a CRC.
I can do random binary search in the resulting format by file offset from the beginning, or by custom tags (as long as the tags are monotonically increasing, for example timestamps). It works fine on sub-1TB files.
A file is split into even parts (one part per processing thread), and the next transport frame is then found by searching. To make sure I have found a genuine transport frame, I check a couple more frames that follow.
I have a multithreaded utility that compresses/decompresses/changes compression, a C++ library, Java bindings, etc.
Works like a charm.

@chrchang

It would be great if the seekable format played well with the tabix index format (https://samtools.github.io/hts-specs/tabix.pdf), which is widely used in bioinformatics; this would speed up a bunch of workflows by up to ~2-3x over BGZF. (I'd be happy to write some tabix indexing and seeking code once the basic library functions are ready.)

@Cyan4973 (Contributor)

@chrchang: Would you mind testing the experimental splittable format API for the tabix use case?

The goal would be to know whether this API is suitable for such usage, and if not (or only partially), how to make it evolve to better fit the need.

@scherepanov

The experimental splittable API was not enough for me; I implemented my own extended version of a splittable API. Here are the reasons and some details:

  1. My data comes in many formats, but all of them are record-oriented. A record must be processed as a single unit. Record sizes are much smaller than a compression block (my records are 40 bytes - 2 KB; blocks are 128/256/512 KB or 1 MB). Many records fit into a single compression block.
  2. Splitting a record between two compression blocks makes processing much harder than it needs to be. I require that each record reside entirely inside a compression block. That greatly simplifies everything.
  3. Pure searching by arbitrary offset makes little sense when you have records: there is no guarantee that some record will not be partial. I need to get data from record boundaries only. If data starts in the middle of a record, it is invalid.
  4. I do provide search by offset, but the data returned starts at the next record beginning. That guarantees the data is valid.
  5. I support a record attribute, an 8-byte integer that has to be monotonically increasing. In my case it is time. I support searching on this attribute. Again, I return data from a record beginning, which is always valid.

The experimental splittable API is of no use to me. I search for transport frames, read a block, and scan the data inside the block. That guarantees I always have valid record boundaries. The splittable API returns data from a rather arbitrary offset; you have to guess the record boundaries.

As an additional feature, my API transparently supports zstd, lz4, and lzma compression, as well as uncompressed data. A file can contain several compression formats; each compression block has its own compression format, and the API hides the compression type (or its absence) from the client.
The API supports searching compressed formats by offset and by attribute, and uncompressed data by attribute (and by offset, returning data from record boundaries).

I implemented the API with a lot of multithreading, and I do not hide the parallel nature of the compressor/decompressor. That allows for more direct integration with multithreaded clients. I communicate with clients in blocks of data corresponding to the compression blocks in the file.
The splittable API produces a single data stream; to parallelize it you have to make some effort.
I do not want to say that the splittable API is bad. I am just saying that it did not work for me, and explaining why.

It was interesting to hear about the tabix format. I think I implemented a small subset of it.
Full support would probably be time-consuming.

@chrchang commented Dec 15, 2017

Okay, after taking a look at the current seekable API, it appears to be adequate for a basic single-threaded tabix implementation, but it's missing parallel-compression and decompression-stream support, which would really make seekable .zst files a joy to work with.

It looks like parallel compression wouldn't take much work to add: just create a version of ZSTD_seekable_compressStream that calls ZSTD_compress_generic?

Possible decompression stream interface (which I've previously written for BGZF files): during decompressor initialization, you pass an additional uint64_t* containing a sequence of uint64 [decompressed start pos, decompressed end pos) intervals (not necessarily in increasing-position order, and not necessarily disjoint), as well as the number of such intervals. (So if there are N intervals, the uint64_t* points to a length-2N array.) A compile-time constant can be defined to denote the decompressed file size, so N=1 with a [0, DECOMPRESSED_FILE_SIZE) interval would request a normal decompression stream. Ideally, initialization succeeds on a non-seekable zstd file when the intervals are disjoint and in increasing order. A nice property of this interface is that it supports parallel decompression, though that can be left out of the initial implementation.
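
Spelled out as a hypothetical C prototype (the function name and sentinel value are invented here for illustration, not an actual zstd API):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical interval-restricted init, following the proposal above.
 * `intervals` points to 2*n uint64_t values: n pairs of
 * [decompressed start, decompressed end) positions. */
#define DECOMPRESSED_FILE_SIZE UINT64_MAX  /* sentinel: "to end of file" */
size_t ZSTD_seekable_initFile_withIntervals(ZSTD_seekable* zds, FILE* src,
                                            const uint64_t* intervals, size_t n);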

@P-E-Meunier commented Jul 7, 2019

Trying to lay an egg here.

I'm writing Rust bindings to this seekable API, in order to use it as the new patch format for Pijul (https://pijul.org).

I can read from files fine (using ZSTD_seekable_initFile), but I could not figure out how to read from buffers (ZSTD_seekable_initBuff). The error I get is "An I/O error occurred when reading/seeking". However, if I write that same buffer to a file and read it back, everything seems to work.

@Cyan4973 (Contributor)

@P-E-Meunier, could you describe a test case? It could be a bug, or an incorrect API usage, suggesting incomplete documentation.

@P-E-Meunier commented Jul 20, 2019

My binding is at https://nest.pijul.com/pmeunier/zstd-seekable. The test is:

#[test]
fn test() {
    let mut cstream = SeekableCStream::new(10, 256).unwrap();
    let input = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut diam ante, sollicitudin a dolor et, volutpat elementum nulla. Etiam nec efficitur nibh, quis rutrum risus. Maecenas quis lorem malesuada, aliquet mi vel, viverra nunc. Donec et nulla sed velit sagittis varius. Suspendisse vestibulum, neque lobortis ornare vestibulum, orci turpis vulputate nisi, ut sodales neque purus eget magna. Nunc condimentum, diam eu consequat venenatis, est nisl semper lorem, et lobortis velit justo sed nulla. Nunc sit amet tempor nunc, vel posuere ipsum. Cras erat tortor, pulvinar ac pretium eu, auctor ac nibh. Duis iaculis porta magna, eu lobortis elit. Duis vitae massa eros. Nulla non magna accumsan, egestas quam sit amet, laoreet lectus.";
    let mut input_pos = 0;
    let mut output = vec![0; input.len()];
    let mut output_pos = 0;
    while input_pos < input.len() {
        let (a, b) = cstream.compress(&mut output[output_pos..], &input[input_pos..]).unwrap();
        output_pos += a;
        input_pos += b;
    }
    while let Ok(n) = cstream.end_stream(&mut output[output_pos..]) {
        if n == 0 {
            break
        }
        output_pos += n;
    }
    output.truncate(output_pos);
    {
        use std::io::Write;
        let mut file = std::fs::File::create("test").unwrap();
        file.write_all(&output).unwrap();
    }
    println!("input len = {:?}, pos = {:?}", input.len(), output_pos);

    let mut decomp = Vec::new();
    let mut s = {
        Seekable::init_buf(&output).unwrap()
    };
    for frame in 0..s.get_num_frames() {
        let size = s.get_frame_decompressed_size(frame);
        println!("{:?} {:?}", size, s.get_frame_decompressed_offset(frame));
        let n = decomp.len();
        decomp.extend(std::iter::repeat(0).take(size));
        s.decompress_frame(&mut decomp[n..], frame);
    }
    println!("{:?}", std::str::from_utf8(&decomp).unwrap())
}

If I replace Seekable::init_buf(&output).unwrap() with Seekable::init_file("test").unwrap(), everything works.

@P-E-Meunier commented Jul 20, 2019

Also, here is the code for the two functions, calling the actual library:

    pub fn init_buf(input: &'a [u8]) -> Result<Self, Error> {
        unsafe {
            let p = ZSTD_seekable_create();
            if p.is_null() {
                return Err(Error::Null)
            }
            let result = ZSTD_seekable_initBuff(p, input.as_ptr() as *const c_void, input.len());
            if ZSTD_isError(result) != 0 {
                return Err(Error::ZSTD(result))
            }
            Ok(Seekable {
                p,
                f: std::ptr::null_mut(),
                marker: std::marker::PhantomData,
            })
         }
    }
    /// Initialise a decompressor with a file. This method opens the file, and dropping the resulting `Seekable` closes the file.
    pub fn init_file(name: &str) -> Result<Self, Error> {
        unsafe {
            let name = CString::new(name).unwrap();
            let f: *mut libc::FILE = fopen(name.as_ptr(), "rb\0".as_ptr() as *const i8);
            assert!(!f.is_null());
            let p = ZSTD_seekable_create();
            if p.is_null() {
                return Err(Error::Null)
            }
            let result = ZSTD_seekable_initFile(p, f as *mut _IO_FILE);
            if ZSTD_isError(result) != 0 {
                return Err(Error::ZSTD(result))
            }
            Ok(Seekable {
                p,
                f,
                marker: std::marker::PhantomData,
            })
        }
    }

@sean-purcell (Contributor)

Looks like it was my fault, that's a little embarrassing. Fixed in #1695.

@rwmjones commented Sep 27, 2019

We have had a request to implement a zstd decompression filter in nbdkit. We already have a filter which supports xz. If implemented, it would work like this:

(1) The compressed disk image is prepared by one user and uploaded to a website:

zstd [some flag to tell it to split into seekable frames] disk.img -o disk.img.zst
cp disk.img.zst /var/www/html/

(2) Another user can serve this file uncompressed to their local network using:

nbdkit --filter=cow --filter=zstd curl https://example.com/disk.img.zst
qemu -hda nbd:localhost:10809

(The equivalent already works for xz; note the COW filter makes the image writable, but we don't require writing to the underlying zstd file.) For xz we can seek to a block boundary and uncompress only a single block, which makes it relatively efficient as long as a smallish block size was chosen by the person who originally compressed the file.

As I understand it from the preceding comments, there is no support yet for the seekable format in the zstd tool (the -B option was mentioned, but reading the code it seems this is unrelated). There is an experimental directory, but the tool there is not widely packaged, and the API exists but is likewise neither packaged in distros nor stable. Is my understanding correct?

@mlin commented May 7, 2020

Hello all, I've just used the contrib seekable code to put together a SQLite extension enabling it to read from a database file compressed in the proposed format. It was very easy to hook up and seems to work well.

Since this application entails frame sizes of just a few KB, I'm next interested in following through on the allusion to adding a dictionary inline. I can come up with a proposal, but I'm wondering whether @iburinoc, @Cyan4973, or anyone else has specific ideas in mind about how that should work.

@saurik commented Dec 17, 2020

Seeing as xz is a compression container format that is specifically designed to provide seekability to underlying compression algorithms (which it abstracts over, similar to a zip file), instead of adding this support to zstd--which would then have its own framing mechanism, some weird detection on its input, and trade-offs in its output--why not just add support to xz for zstd by defining a zstd filter? ("Filter" is the xz terminology: note that earlier in this thread the word "filter" is used near "xz" in an unrelated sense.)

@rwmjones

"Filter" in that comment referred to nbdkit filters.

@saurik commented Dec 18, 2020

"Filter" in that comment referred to nbdkit filters.

@rwmjones Right. I 100% understood that, which is why I felt a need to make explicit that I was using a different, "unrelated" definition of "filter", lest someone get confused. (Ironically, I see someone still got confused, but I honestly am not sure how I could have been even more explicit in order to ensure no one got confused.)

@kaey commented Dec 18, 2020

As related work, there is the RAC (Random Access Compression) format: https://github.com/google/wuffs/blob/master/doc/spec/rac-spec.md

@nigeltao commented Mar 8, 2021

As a related work there is RAC(random access compression) format

I am the author of the RAC format. Happy to discuss that here if you want.

It is not "battle-proven" yet. I started the experiment in 2019, but like many other people, I found 2020 disruptive in terms of getting work done.

A comparison with XZ is briefly discussed in the RAC related work doc.

For actual numbers (not just a spec), here's one highlight from the "Extended Example" of the ractool docs:

$ time unzstd --stdout                 enwik8.zst        > /dev/null
real    0m0.203s
$ time ractool -decode                 zstd.withdict.rac > /dev/null
real    0m0.037s

@amerlyq commented Mar 14, 2021

why not just add support to xz for zstd by defining a zstd filter

After looking at an analysis of xz's corruption-prone spec, I'm not sure it's a good idea to even start extending xz.
Xz format inadequate for long-term archiving: https://www.nongnu.org/lzip/xz_inadequate.html

@saurik, in the 2016 ycombinator discussion you were seemingly in the liberate-me-from-xz camp (:D); has something changed since then?
Xz format inadequate for long-term archiving | Hacker News: https://news.ycombinator.com/item?id=12768425

@saurik commented Mar 15, 2021

@amerlyq So, the idea that "camps" exist in software development is part of why it is so hard to have reasonable discussions :(. The fact that xz is often used for things it is ineffective at (I personally believe it should probably never have been supported in dpkg and it was a devastating mistake for them to have removed lzma support) or that people like to praise it in incorrect contexts due to misunderstandings about why it even exists (such as claiming it is a compression algorithm instead of a file format, as many did on that HN thread) does not imply that it is without any value or that it is never a choice to consider... it certainly doesn't mean that someone who has argued against the format (such as myself) must switch "camps" in order to suggest its use :(.

I brought up xz because it seems like a fast path to get seekability of zstd without zstd having to define yet another new file format, as the only reason xz even exists is to be a seekable compression container that has extensible algorithm choice... aka, exactly what is being looked for here today. You do not have to like xz to appreciate that it has become pretty widely used, and you don't have to believe it is an optimal file format under every possible criteria to use it... you don't even need to make it the only format people use: it just already happens to exist. I can simultaneously argue against xz in one context and for it in another. Someone here mentioned xz as a thing they were supporting for something else, and so there is value I think in noting that it is already an extensible format and so arguably already is the solution... if only there were a zstd filter.

But like, seriously: this is an issue that has been open since 2016 and that frankly seems unlikely to me to ever come to consensus on a new format soon (and like, that's fine: no rush from me! I am not complaining, I am just being practical). It seems like it would take a lot less time to come to consensus on what xz filter number to assign to zstd--assuming the xz people would be amenable to that, which I admit they probably wouldn't be :/--and then the people who are waiting on this feature could have pretty widespread and inter-compatible support for streaming zstd within a few days. The best part: doing that wouldn't in any way prevent zstd from coming up with their own file format for this that was infinitely better when confronted with random bit flips or massively parallel CPUs or high-cost random access storage.

So like, do what you want: I don't even remember why I cared at this point ;P I imagine I first subscribed to this issue in mid-2019 as that was when I was working on a project that used zstd... but I ended up working on something else instead ;P... but please just don't assume the world is always divided into "camps" and that something must have "changed" in order to cause someone who argues against something in one context to argue for it in another. Some of us are honestly just trying to accomplish tasks ;P.

@amerlyq commented Mar 15, 2021

this is an issue that has been open since 2016 ... and then the people who are waiting on this feature could have pretty widespread and inter-compatible support for streaming zstd within a few days

Yep, both hands up! Having .xz is better than having nothing.

Meanwhile, in a parallel universe... to preview (partially extract) .zstd archives in FUSE, a 15th standard was hacked together :D

@Artoria2e5 (Contributor)

A common use case not mentioned here is indexing of compressed tar archives. The lzip project tangentially mentioned earlier has a tarlz program for that. The "pixz" variant of xz also indexes tars. tar integration would be a good way both to test the feature and to give end-users a reason to use it.

@vasi commented Nov 21, 2021

I'm the author of pixz. Just chiming in to say that the way to break the chicken-and-egg problem for xz was for someone to ship seekable-xz to a lot of users, which is what pixz + its distro packagers did. Once it became accessible, the approach became tested and well-liked, and then upstream was more easily convinced.

So someone ought to arrange to actually ship seekable zstd to end-users, so they can try it out. Either:

  • Ship third-party tools like t2sz. You'd probably want to at least support non-tar files, and then get this packaged in major distros like Debian/Arch.
  • Or, ship the seekable_format work in contrib. It'd need a frontend that's a lot more usable than the seekable_compression binary, and then to be packaged in distros.

@dralley commented Apr 11, 2024

@Cyan4973 When you say it needs to be battle-tested, production-ready, etc., what does that look like? Is there some kind of threshold you have in mind, or is it an "I'll know it when I see it" kind of thing?

I guess this question also applies to pzstd. At what point do we say "ok, seems to work well, time to add it to the zstd frontend"?

I bring this up because it was mentioned in some of the discussions around XZ that this is one of the few features which zstd cannot currently provide, and that this handicaps it as a replacement for XZ.

@Cyan4973 (Contributor)

pzstd, the C++ code base, is planned for deprecation.

If you rather mean "the format", then I guess the seekable_compression format would form a better basis, because it's a superset.

When you say it needs to be battle-tested, production-ready, etc, what does that look like?

Yeah, that's a difficult one.
The easiest threshold would be that there is a sizable application within our company that depends on it for production.
Since that's not the case (yet), we have to rely on third party reports.
It's true that I'm aware of a few applications that are using it at a reasonable scale.
So, in some ways, it's getting there.

The real issue now is that, in order to make this format "official", we would have to guarantee its maintenance moving forward. And that's a tall order. Unfortunately, I feel we are a bit stretched at the moment to carry that weight.
