Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8739: Add ZSTD support. #174

Closed
wants to merge 1 commit into from
Closed

LUCENE-8739: Add ZSTD support. #174

wants to merge 1 commit into from

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jun 7, 2021

This is a new stored fields format that leverages ZSTD via JNA.

JNA doesn't have a NOTICE file on their repository, so I copied the bottom of the README that seems to cover what the NOTICE file usually does.

This is a new stored fields format that leverages ZSTD via JNA.
@rmuir
Copy link
Member

rmuir commented Jun 7, 2021

What is the improvement? Especially interested in performance versus a "good" zlib: e.g. your comment here: https://issues.apache.org/jira/browse/LUCENE-8739?focusedCommentId=17322015&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322015

I'm just curious because it definitely adds complexity by adding a third-party dependency/native code to lucene-codecs. JNA is complicated (e.g. .so hell with various "system versions" of it, doesn't work out of box with non-executable /tmp, doesn't support all architectures, etc)

If the improvement can be done in an easier way (e.g. instructions to luser on how to swap in better zlib), that seems like a better way to go?

@jpountz
Copy link
Contributor Author

jpountz commented Jun 8, 2021

Right, zstd definitely outperforms zlib on large files, but when I tested with Lucene stored fields, the difference wasn't very significant, which makes me wonder that most of the benefit of zstd comes from its very large window. One could increase the block size for stored fields to try to leverage this, but it would slow down document retrieval.

I wasn't aware of these limitations with JNA. I guess I'll close the PR, it should already make it easier for folks who are curious about zstd to test it with Lucene.

@jpountz jpountz closed this Jun 8, 2021
@rmuir
Copy link
Member

rmuir commented Jun 8, 2021

@jpountz another alternative would be to build against the new JDK foreign api? If we need to call out to some native code, maybe we should look at the best available mechanism. From what I looked at before, I feel like it would have less hassles than JNA or JNI and give safer native code, but never tried it out.

@uschindler
Copy link
Contributor

With foreign API you can directly wrap arrays to compress and wrap as MemorySegment and pass to foreign function (native MethodHandle). Jextract parses header file and creates MethodHandlers out of it.
It is also safer than JNI, because user has to opt in with a JVM parameter and you can't crush your jvm without explicitly allowing it. By default this codec won't work.

So yes, that's the way to go in future...

See also MmapDirectory v2 #177 for JDK 17.

@believezzd
Copy link

@jpountz

Description

  • There are two place code that i can't understand in this commit, could you please give me help.

First one

  • To ZstdStoredFieldsFormat#ZstdDecompressor.decompress(124)
  • why it has to skip blocks. As I know, in the compress, the compressed sub_block appends to output one by one. it it just decompress from offset to offset+length. Just As DeflateDecompressor.decompress

Second one

  • To ZstdStoredFieldsFormat#ZstdDecompressor.decompress(145)
  • why it reset bytes.offset and bytes.length to other value in the end. As I can see, in the end, bytes.offset should be zero and bytes.length should be originalLength. Same as DeflateDecompressor.decompress in the end.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants