LUCENE-8739: Add ZSTD support. #174

jpountz · 2021-06-07T12:53:12Z

This is a new stored fields format that leverages ZSTD via JNA.

JNA doesn't have a NOTICE file on their repository, so I copied the bottom of the README that seems to cover what the NOTICE file usually does.

This is a new stored fields format that leverages ZSTD via JNA.

rmuir · 2021-06-07T22:48:28Z

What is the improvement? Especially interested in performance versus a "good" zlib: e.g. your comment here: https://issues.apache.org/jira/browse/LUCENE-8739?focusedCommentId=17322015&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322015

I'm just curious because it definitely adds complexity by adding a third-party dependency/native code to lucene-codecs. JNA is complicated (e.g. .so hell with various "system versions" of it, doesn't work out of box with non-executable /tmp, doesn't support all architectures, etc)

If the improvement can be done in an easier way (e.g. instructions to luser on how to swap in better zlib), that seems like a better way to go?

jpountz · 2021-06-08T17:40:34Z

Right, zstd definitely outperforms zlib on large files, but when I tested with Lucene stored fields, the difference wasn't very significant, which makes me wonder that most of the benefit of zstd comes from its very large window. One could increase the block size for stored fields to try to leverage this, but it would slow down document retrieval.

I wasn't aware of these limitations with JNA. I guess I'll close the PR, it should already make it easier for folks who are curious about zstd to test it with Lucene.

rmuir · 2021-06-08T18:46:35Z

@jpountz another alternative would be to build against the new JDK foreign api? If we need to call out to some native code, maybe we should look at the best available mechanism. From what I looked at before, I feel like it would have less hassles than JNA or JNI and give safer native code, but never tried it out.

uschindler · 2021-06-08T20:33:07Z

With foreign API you can directly wrap arrays to compress and wrap as MemorySegment and pass to foreign function (native MethodHandle). Jextract parses header file and creates MethodHandlers out of it.
It is also safer than JNI, because user has to opt in with a JVM parameter and you can't crush your jvm without explicitly allowing it. By default this codec won't work.

So yes, that's the way to go in future...

See also MmapDirectory v2 #177 for JDK 17.

believezzd · 2024-02-22T09:13:13Z

@jpountz

Description

There are two place code that i can't understand in this commit, could you please give me help.

First one

To ZstdStoredFieldsFormat#ZstdDecompressor.decompress(124)
why it has to skip blocks. As I know, in the compress, the compressed sub_block appends to output one by one. it it just decompress from offset to offset+length. Just As DeflateDecompressor.decompress

Second one

To ZstdStoredFieldsFormat#ZstdDecompressor.decompress(145)
why it reset bytes.offset and bytes.length to other value in the end. As I can see, in the end, bytes.offset should be zero and bytes.length should be originalLength. Same as DeflateDecompressor.decompress in the end.

Thanks.

LUCENE-8739: Add ZSTD support.

958add3

This is a new stored fields format that leverages ZSTD via JNA.

jpountz closed this Jun 8, 2021

asfimport mentioned this pull request Apr 19, 2022

ZSTD Compressor support in Lucene [LUCENE-8739] #9784

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-8739: Add ZSTD support. #174

LUCENE-8739: Add ZSTD support. #174

jpountz commented Jun 7, 2021

rmuir commented Jun 7, 2021

jpountz commented Jun 8, 2021

rmuir commented Jun 8, 2021

uschindler commented Jun 8, 2021

believezzd commented Feb 22, 2024

LUCENE-8739: Add ZSTD support. #174

LUCENE-8739: Add ZSTD support. #174

Conversation

jpountz commented Jun 7, 2021

rmuir commented Jun 7, 2021

jpountz commented Jun 8, 2021

rmuir commented Jun 8, 2021

uschindler commented Jun 8, 2021

believezzd commented Feb 22, 2024

Description

First one

Second one