
Multiple calls to Write() has unexpected overheads? #22

Closed
jimsmart opened this issue Jan 15, 2018 · 10 comments
Labels: blocker (Something that needs to be fixed before next release), bug

Comments

@jimsmart commented Jan 15, 2018

If I make multiple calls to Write, the resulting compressed data stream is much longer than when I buffer the data and make a single call to Write.

Is this expected behaviour? I ask because, when this writer is chained with other writers, one cannot make any guarantees about how they parcel up their data.

Code (error handling omitted):

package main

import (
	"bytes"
	"fmt"

	"github.com/DataDog/zstd"
)

// CompressionLevel was presumably defined elsewhere in the original
// program; any level demonstrates the effect.
const CompressionLevel = zstd.DefaultCompression

func main() {
	b := &bytes.Buffer{}
	for i := 0; i < 500; i++ {
		b.Write([]byte("Hello World! "))
	}
	data1 := b.Bytes()
	fmt.Println("data len", len(data1))

	// Compress 1: one Write of the whole payload.
	buffer1 := &bytes.Buffer{}
	w1 := zstd.NewWriterLevel(buffer1, CompressionLevel)
	w1.Write(data1)
	w1.Close()

	fmt.Println("buffer1 len", buffer1.Len())

	// Compress 2: 500 small Writes of the same payload.
	buffer2 := &bytes.Buffer{}
	w2 := zstd.NewWriterLevel(buffer2, CompressionLevel)
	for i := 0; i < 500; i++ {
		w2.Write([]byte("Hello World! "))
	}
	w2.Close()

	fmt.Println("buffer2 len", buffer2.Len())
}

Output:

data len 6500
buffer1 len 33
buffer2 len 5014

Regards

@jimsmart changed the title from "Multiple calls to Write() not compressing as continuous stream?" to "Multiple calls to Write() has great overheads?" on Jan 15, 2018
@jimsmart (Author) commented:

Possibly related?
facebook/zstd#206

@jimsmart (Author) commented Jan 15, 2018

OK: if one cannot ensure good-sized calls to Write, the best thing to do is to interpose a bufio.Writer, like this:

package main

import (
	"bufio"
	"bytes"
	"fmt"

	"github.com/DataDog/zstd"
)

// CompressionLevel was presumably defined elsewhere in the original
// program; any level demonstrates the effect.
const CompressionLevel = zstd.DefaultCompression

func main() {
	b := &bytes.Buffer{}
	for i := 0; i < 500; i++ {
		b.Write([]byte("Hello World! "))
	}
	data1 := b.Bytes()
	fmt.Println("data len", len(data1))

	// Compress 1: one Write of the whole payload.
	buffer1 := &bytes.Buffer{}
	w1 := zstd.NewWriterLevel(buffer1, CompressionLevel)
	w1.Write(data1)
	w1.Close()

	fmt.Println("Buffer1 len", buffer1.Len())

	// Compress 2: buffer the small writes before they reach the zstd writer.
	buffer2 := &bytes.Buffer{}
	w2 := zstd.NewWriterLevel(buffer2, CompressionLevel)
	bw := bufio.NewWriter(w2) // default buffer size = 4k
	// bw := bufio.NewWriterSize(w2, 8192) // buffer size = 8k
	for i := 0; i < 500; i++ {
		bw.Write([]byte("Hello World! "))
	}
	bw.Flush()
	w2.Close()

	fmt.Println("Buffer2 len", buffer2.Len())
}

Output:

data len 6500
Buffer1 len 33
Buffer2 len 44

It's not so elegant, but ¯\_(ツ)_/¯

Hope that helps someone!
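
The two-writer dance can also be hidden behind a single io.WriteCloser whose Close flushes the buffer automatically. A minimal sketch, assuming the same zstd API as above; the bufferedWriter type is a hypothetical helper, not part of this library:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"

	"github.com/DataDog/zstd"
)

// bufferedWriter (hypothetical helper) hides the bufio layer so
// callers see one WriteCloser and cannot forget the Flush.
type bufferedWriter struct {
	bw *bufio.Writer
	zw io.WriteCloser
}

func newBufferedWriter(w io.Writer, level int) *bufferedWriter {
	zw := zstd.NewWriterLevel(w, level)
	return &bufferedWriter{bw: bufio.NewWriter(zw), zw: zw}
}

func (w *bufferedWriter) Write(p []byte) (int, error) {
	return w.bw.Write(p)
}

// Close flushes the buffer, then closes the zstd writer.
func (w *bufferedWriter) Close() error {
	if err := w.bw.Flush(); err != nil {
		w.zw.Close()
		return err
	}
	return w.zw.Close()
}

func main() {
	out := &bytes.Buffer{}
	w := newBufferedWriter(out, zstd.DefaultCompression)
	for i := 0; i < 500; i++ {
		w.Write([]byte("Hello World! ")) // error handling omitted
	}
	w.Close()
	fmt.Println("compressed len", out.Len())
}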

@jimsmart changed the title from "Multiple calls to Write() has great overheads?" to "Multiple calls to Write() has unexpected overheads?" on Jan 15, 2018
@valyala (Contributor) commented Mar 31, 2018

@jimsmart, try gozstd.Writer. It uses a different underlying zstd API, which should have lower overhead.
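
For reference, a minimal sketch of that approach, assuming gozstd's NewWriter/Write/Close API (gozstd also provides Release for pooling, omitted here):

package main

import (
	"bytes"
	"fmt"

	"github.com/valyala/gozstd"
)

func main() {
	out := &bytes.Buffer{}
	zw := gozstd.NewWriter(out)
	for i := 0; i < 500; i++ {
		zw.Write([]byte("Hello World! ")) // error handling omitted
	}
	// Close finishes the frame; the C side handles the buffering.
	zw.Close()
	fmt.Println("compressed len", out.Len())
}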

@dmoklaf commented Dec 3, 2018

The zstd bug referenced above (facebook/zstd#206) has been closed. Is this issue still ongoing? If so, the need to wrap the writer with a buffer should be documented; this is pretty subtle usage advice.

@Viq111 (Collaborator) commented Dec 4, 2018

Hi @rgeronimi,

I checked the previous results again, and indeed you'd currently get the same numbers.
zstd (this lib) and gozstd (from @valyala above) use two slightly different C zstd APIs with slightly different design goals.

This zstd library uses ZSTD_compressContinue, which is buffer-less zstd streaming compression: we have complete control over memory, at the expense of having to manage buffers on the Go side if you want to optimize compressed size for small inputs.

gozstd uses ZSTD_compressStream, which abstracts that buffer logic into the C code (at the cost of less control over memory consumption in C land).

Hope this helps!
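
To see the per-Write overhead concretely, one can vary the write granularity and compare compressed sizes. A rough harness in the spirit of the snippets above (zstd.DefaultCompression is this library's constant; exact numbers depend on the zstd version and level):

package main

import (
	"bytes"
	"fmt"

	"github.com/DataDog/zstd"
)

// compressedSize writes the same payload as n equal chunks and
// returns the resulting compressed length.
func compressedSize(chunk []byte, n int) int {
	out := &bytes.Buffer{}
	w := zstd.NewWriterLevel(out, zstd.DefaultCompression)
	for i := 0; i < n; i++ {
		w.Write(chunk) // error handling omitted
	}
	w.Close()
	return out.Len()
}

func main() {
	msg := []byte("Hello World! ")
	fmt.Println("500 writes of 13 B:", compressedSize(msg, 500))
	fmt.Println("1 write of 6500 B: ", compressedSize(bytes.Repeat(msg, 500), 1))
}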

@Viq111 added the question label on Dec 4, 2018
@valyala (Contributor) commented Dec 5, 2018

The following limitations for ZSTD_compressContinue look scary:

  • ZSTD_compressContinue() presumes prior input is still accessible and unmodified (up to the maximum distance size, see WindowLog). It remembers all previous contiguous blocks, plus one separate memory segment (which can itself consist of multiple contiguous blocks).
  • ZSTD_compressContinue() detects that prior input has been overwritten when the src buffer overlaps it. In that case, it will "discard" the relevant memory section from its history.

As I understand it, these mean two things:

  • zstd could read garbage as dictionary data from the addresses of buffers passed to previous ZSTD_compressContinue calls, if that memory has since been modified. This may even lead to a segmentation fault if the underlying memory of a previous buffer has been unmapped from the process address space.
  • zstd may get a poor compression ratio, since it discards dictionary data from previously compressed blocks whenever the buffer passed to the function is reused; the sketch below shows how common that reuse is in Go.
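
To make the second point concrete for Go: io.Copy, and much streaming code in general, reuses a single scratch buffer for every Write call, so a ZSTD_compressContinue-based writer keeps seeing the same address with different contents. A minimal sketch of that ubiquitous pattern (plain Go, nothing specific to this library):

package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	// io.Copy does the moral equivalent of this loop internally:
	// one 32 KiB scratch buffer, refilled and re-passed to Write
	// on every iteration. dst stands in for a compressing writer.
	src := strings.NewReader(strings.Repeat("Hello World! ", 500))
	dst := &bytes.Buffer{}
	buf := make([]byte, 32*1024)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			dst.Write(buf[:n]) // same backing array, new contents, each call
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
	}
	fmt.Println("copied", dst.Len(), "bytes")
}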

cc'ing @Cyan4973 for further clarification.

@Cyan4973 commented Dec 5, 2018

@Viq111's explanations are correct.

ZSTD_compressContinue() is a fairly low-level function, designed for systems which need absolute control over memory allocation. It requires fairly tight control over buffer content and lifetime. To be fair, it's more targeted at embedded environments than managed languages, but I'm not qualified to say whether it's a good fit for Go.

When in doubt, prefer ZSTD_compressStream(). It's safer to use and abstracts away all the machinery, at the cost of also managing its own internal buffers.

@dmoklaf commented Dec 5, 2018

ZSTD_compressContinue() is a fairly low-level function, designed for systems which need absolute control over memory allocation. It requires fairly tight control over buffer content and lifetime. To be fair, it's more targeted at embedded environments than managed languages, but I'm not qualified to say whether it's a good fit for Go.

This depends on what the Go wrapper code does. I just checked: it passes the user-provided buffer directly to C as a pointer. If I understand what @valyala wrote correctly, this could be a critical bug, since the zstd C code expects that buffer to remain accessible after the function returns. If true, it has the potential for data corruption, process crashes, and hard-to-reproduce failures.

When in doubt, prefer ZSTD_compressStream(). It's safer to use and abstracts away all the machinery, at the cost of also managing its own internal buffers.

@Viq111 (Collaborator) commented Dec 5, 2018

Reading back through the code: we started implementing the Go wrapper at zstd v0.5, which indeed only had the ZBUFF_decompressContinue methods: https://github.com/facebook/zstd/blob/201433a7f713af056cc7ea32624eddefb55e10c8/lib/zstd_buffered.h#L79

This may actually also be the cause of #39.

If anyone could put up a PR migrating to ZSTD_compressStream, we are accepting all contributions!

Otherwise I can also look into it, as it seems this could bite a couple of people using the streaming interface.

@Viq111 added the bug label and removed the question label on Dec 5, 2018
@dmoklaf commented Dec 20, 2018

We don't have the skillset to dig into this soon. For storage tasks (e.g., blob storage in a DB or compressed custom backups), this bug is unfortunately a showstopper.

@Viq111 added the blocker (Something that needs to be fixed before next release) label on Dec 28, 2018
@Viq111 mentioned this issue on Dec 28, 2018
rasky added a commit to rasky/zstd that referenced this issue on Nov 20, 2019:

Until DataDog#22 is fixed, this library is using zstd in a way that can cause data corruption, as confirmed by the zstd maintainer himself. I think this is critical enough that it should be mentioned at the top of the README, and the Go community should be alerted until the bug is fixed.

Viq111 added a commit that referenced this issue on Dec 3, 2019
@Viq111 closed this as completed in ca147a0 on Mar 19, 2020