Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zstd to native access #105715

Merged
merged 53 commits into from
Mar 13, 2024
Merged

Add zstd to native access #105715

merged 53 commits into from
Mar 13, 2024

Conversation

rjernst
Copy link
Member

@rjernst rjernst commented Feb 22, 2024

This commit makes zstd compression available to Elasticsearch. The library is pulled in through maven in jar files for each platform, then bundled in a new platform directory under lib. Access to the zstd compression/decompression is through NativeAccess.

This commit makes zstd compression available to Elasticsearch. The
library is pulled in through maven in jar files for each platform, then
bundled in a new platform directory under lib. Access to the zstd
compression/decompression is through NativeAccess.
@rjernst rjernst added the :Core/Infra/Core Core issues without another label label Feb 22, 2024
@rjernst rjernst mentioned this pull request Feb 22, 2024
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the build/distribution-related bits that I'm not too familiar with but this looks fantastic!

@Override
public Zstd getZstd() {
logger.warn("cannot compress with zstd because native access is not available");
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ZSTD support becomes a requirement in the future, I imagine we'd throw here instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally there would be no no-op implementation, we would throw at load time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we aren't just doing that now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's contentious, so I want to save that for an isolated discussion/PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like leniency, which I presumed we had all agreed is abhorrent 😉

@Override
public long compress(ByteBuffer dst, int dstLen, ByteBuffer src, int srcLen, int compressionLevel) {
try (Arena arena = Arena.ofConfined()) {
var nativeDst = arena.allocate(dstLen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes, dst may be over-allocated. I wonder if we should allocate nativeDst with size min(dstLen, compressBound(srcLen))`. Hopefully the overhead of the additional native call would be negligible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should do that in Zstd.compress so it applies to both impls?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe I'm misunderstanding your suggestion, but it felt to me like this should happen next to where we're allocating the native memory, so it should be in the JDK/JNA impls?

@Override
public long decompress(ByteBuffer dst, int dstLen, ByteBuffer src, int srcLen) {
try (Arena arena = Arena.ofConfined()) {
var nativeDst = arena.allocate(dstLen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here for decompression, maybe we should take advantage of the ZSTD_getFrameContentSize API to avoid allocating more native memory than necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't getFrameContentsSize only find the size of the first frame, yet we could decompress multiple frames?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compress method that is exposed right now is guaranteed to only create a single frame: https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L150 so that should always be the case, but you have a point that we may want to decompress buffers compressed by 3rd parties in the future.

We could check if the source length is equal to ZSTD_findFrameCompressedSize to check if there is a single frame. But at this point this is becoming complicated enough that it's probably worth deferring to a follow-up PR.

assert srcLen != 0;
try (Memory nativeDst = new Memory(dstLen);
Memory nativeSrc = new Memory(srcLen)) {
nativeSrc.write(0, src.array(), src.position(), srcLen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have asserts that the buffer is not direct, but since you're using array() here, I guess that the buffer needs to not be read-only either. I wonder how much work that would be to make it work transparently across direct, heap and read-only byte buffers. (could be a follow-up)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can make it work, it just needs the correct ByteBuffer calls to be backing agnostic.

elasticsearchmachine pushed a commit that referenced this pull request Feb 22, 2024
This commit upgrades jna to 5.12.1, which supports better control over
releasing native memory.

relates #105715
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Feb 28, 2024
The build handles platform specific code which may be for arm or x86.
Yet there are multiple ways to describe 64bit x86, and the build
converts between the two in several places. This commit consolidates on
the x64 nomenclature in most places, except where necessary (eg ML still
uses x86_64).

relates elastic#105715
The build handles platform specific code which may be for arm or x86.
Yet there are multiple ways to describe 64bit x86, and the build
converts between the two in several places. This commit consolidates on
the x64 nomenclature in most places, except where necessary (eg ML still
uses x86_64).

relates elastic#105715
elasticsearchmachine pushed a commit that referenced this pull request Feb 28, 2024
The build handles platform specific code which may be for arm or x86.
Yet there are multiple ways to describe 64bit x86, and the build
converts between the two in several places. This commit consolidates on
the x64 nomenclature in most places, except where necessary (eg ML still
uses x86_64).

relates #105715
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Feb 28, 2024
…105842)

The build handles platform specific code which may be for arm or x86.
Yet there are multiple ways to describe 64bit x86, and the build
converts between the two in several places. This commit consolidates on
the x64 nomenclature in most places, except where necessary (eg ML still
uses x86_64).

relates elastic#105715
elasticsearchmachine pushed a commit that referenced this pull request Feb 28, 2024
…#105846)

The build handles platform specific code which may be for arm or x86.
Yet there are multiple ways to describe 64bit x86, and the build
converts between the two in several places. This commit consolidates on
the x64 nomenclature in most places, except where necessary (eg ML still
uses x86_64).

relates #105715
@mark-vieira
Copy link
Contributor

mark-vieira commented Mar 8, 2024

@rjernst I'm wondering if we need some packaging tests here. As it is, there's nothing that's actually exercising zstd in a packaged distribution. Presumably, we'll eventually add functionality that relies on zstd that will be executed via REST tests and maybe that'll be good enough?

fang-xing-esql pushed a commit to fang-xing-esql/Elasticsearch that referenced this pull request Mar 8, 2024
…105842)

The build handles platform specific code which may be for arm or x86.
Yet there are multiple ways to describe 64bit x86, and the build
converts between the two in several places. This commit consolidates on
the x64 nomenclature in most places, except where necessary (eg ML still
uses x86_64).

relates elastic#105715
@rjernst
Copy link
Member Author

rjernst commented Mar 8, 2024

Presumably, we'll eventually add functionality that relies on zstd that will be executed via REST tests and maybe that'll be good enough?

Yes, that's what I've been expecting.

private final ByteBuffer bufferView;

JdkCloseableByteBuffer(int len) {
this.arena = Arena.ofShared();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with arenas, what would be the downsides of using a shared arena if e.g. a confined arena would have worked as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A confined arena restricts usage to the current thread, so the buffer could not be used in a different thread.

public int compress(ByteBuffer dst, ByteBuffer src, int level) {
Objects.requireNonNull(dst, "Null destination buffer");
Objects.requireNonNull(src, "Null source buffer");
long ret = zstdLib.compress(dst, src, level);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that both buffers are direct non-read-only buffers and fail otherwise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have some of that checked in the impls, but here would be better.

@rjernst rjernst added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 12, 2024
@rjernst rjernst removed the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 12, 2024
@rjernst rjernst merged commit 405b88b into elastic:main Mar 13, 2024
14 checks passed
@rjernst rjernst deleted the native/zstd branch March 13, 2024 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >refactoring Team:Core/Infra Meta label for core/infra team v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants