
ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression #13076

Draft · kylebarron wants to merge 8 commits into main

Conversation

@kylebarron (Contributor)

I haven't quite gotten this to work yet (I've been attempting it with a sample IPC file from PyArrow), but I'm putting it up for visibility.

github-actions bot commented May 6, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

    // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65
    body = codec.decode(body);
} else {
    throw new Error('Record batch is compressed but codec not found');


Should this error include the codec ID?
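For example (a hypothetical sketch; the CompressionType enum mirrors format/Message.fbs, and how the codec id reaches this point is assumed for illustration):

// Sketch only: surface the codec id from the BodyCompression metadata in the error.
enum CompressionType { LZ4_FRAME = 0, ZSTD = 1 }

function throwMissingCodec(codecId: CompressionType): never {
    throw new Error(
        `Record batch is compressed with ${CompressionType[codecId]} (codec id ${codecId}), ` +
        `but no matching codec is registered`
    );
}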

if (codec?.decode) {
    // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
    // written in the first 8 bytes:
    // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65


It's confusing that the Uint8Array here is called the "body", but it might actually correspond to the "buffer" in that spec. I'm not sure offhand which one is really passed in here.

@@ -99,6 +99,7 @@
     "jest": "27.5.1",
     "jest-silent-reporter": "0.5.0",
     "lerna": "4.0.0",
+    "lz4js": "0.2.0",


Is this intended to be registered only in tests, with users declaring the dependency and registering the codec in their own applications if they want it?

We should include code snippets and instructions for how to do that for each codec needed to support the full Arrow spec. The error message could even link to the docs on exactly how to enable a given compression codec.
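For example, the per-codec instructions might look something like this (the compressionRegistry export, the key, and the codec shape are assumptions for illustration, not necessarily the API this PR ends up with):

// Hypothetical registration snippet for application code; lz4js is the
// dependency this PR adds for tests. The apache-arrow exports used here are
// assumed, not an existing public API.
import * as lz4js from 'lz4js';
import { compressionRegistry, CompressionType } from 'apache-arrow'; // assumed exports

compressionRegistry.set(CompressionType.LZ4_FRAME, {
    encode: (data: Uint8Array) => new Uint8Array(lz4js.compress(data)),
    decode: (data: Uint8Array) => new Uint8Array(lz4js.decompress(data)),
});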

if (codec?.decode) {
    // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
    // written in the first 8 bytes:
    // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65


I started digging into this, although being unfamiliar with both Arrow and ZSTD, it has been a bit of a struggle.

The first 8 bytes are not part of the compressed section. You can see this accounted for in other places, like here: https://github.com/apache/arrow/pull/9137/files#diff-38c3da9d6c7d521df83f59ee7d8cfced00951e8dcd38f39c7c36d26b6feee3d6R75

If you encode something with ZSTD you'll see it starts with [40, 181, 47, 253, 32], which comes after those 8 bytes. For example, just encoding:

// (compress here comes from a ZSTD binding; the specific library isn't named in this comment.)
const testVal = compress(Buffer.from(new Uint8Array([1, 0, 0, 0, 0, 0, 0, 0])));

Uint8Array(17) [
  40, 181, 47, 253, 32, 8, 65,
   0,  0,  1,   0,  0, 0,  0,
   0,  0,  0
]

Within the record batch body, this looks like:

const compressedMessage = new Uint8Array([
  8, 0, 0, 0,
  0, 0, 0, 0,
  40, 181, 47,
  253, 32, 8, 65,
  0, 0, 1, 0,
  0, 0, 0, 0,
  0, 0, 0, 0,
  0, 0, 0, 0,
  0
]);

Now there's another issue: there appears to be padding at the end and between the fields, and I'm very uncertain how to detect it other than brute-forcing in reverse between the compression frame header and the end.

console.log(decompress(compressedMessage.slice(8, 17 + 8)));

Uint8Array([
  40, 181, 47, 253, 
  32, 8, 65, 0,
  0, 1, 0, 0,
  0, 0, 0, 0,
  0
]);
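For what it's worth, the padding shouldn't need to be brute-forced: per the BodyCompression docs in format/Message.fbs, compression is applied to each constituent buffer separately, each buffer's extent within the body comes from the Buffer {offset, length} entries in the RecordBatch metadata, and the 8-byte little-endian int64 prefix is the uncompressed length (with -1 meaning that buffer was written uncompressed). A sketch under those assumptions, where decode stands in for whatever codec is registered:

// Sketch: decompress one body buffer, given its {offset, length} from the
// RecordBatch metadata and a frame decoder (e.g. lz4js.decompress or a ZSTD binding).
function decompressBuffer(
    body: Uint8Array,
    offset: number,
    length: number,
    decode: (data: Uint8Array) => Uint8Array
): Uint8Array {
    const slice = body.subarray(offset, offset + length);
    // First 8 bytes: uncompressed length as a little-endian int64.
    const view = new DataView(slice.buffer, slice.byteOffset, slice.byteLength);
    const uncompressedLength = view.getBigInt64(0, true);
    const data = slice.subarray(8);
    // -1 means this buffer was left uncompressed by the writer.
    return uncompressedLength === -1n ? data : decode(data);
}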

@kylebarron (Contributor, Author)


Thanks for doing some digging! This PR isn't currently at the top of my list of things to work on, but I would love for this support to be added to Arrow JS!

@metalmatze (Contributor)

Hey @kylebarron, super nice work on this PR so far!
We are very interested in getting this support into Arrow JS.
Do you plan on continuing this in the next weeks?

If you're busy with other things, could you briefly elaborate on what's left to do, in your opinion, so that we can consider moving it forward from our side?

Thanks!

@kylebarron (Contributor, Author)

Do you plan on continuing this in the next weeks?

No, I have no imminent plans to work on this, but I'd love to see support for this added.

what's left to do in your opinion

  • I never got this actually working at runtime; you probably need to consult the IPC spec more closely to see whether I'm decompressing the wrong byte range, as suggested in the comments above.

    I don't have any specific suggestions on this because I haven't looked at it lately.

  • There probably needs to be some consensus on what the codec API should look like (a rough sketch of one possible shape is below).
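For illustration only, a minimal shape might be (names are assumptions, not a settled design):

// Sketch of one possible codec interface: both methods optional, so a
// read-only application can register just decode.
interface BodyCompressionCodec {
    encode?(data: Uint8Array): Uint8Array;
    decode?(data: Uint8Array): Uint8Array;
}

// Registry keyed by the CompressionType value from format/Message.fbs
// (0 = LZ4_FRAME, 1 = ZSTD).
const codecRegistry = new Map<number, BodyCompressionCodec>();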
