-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression #13076
base: main
Are you sure you want to change the base?
Conversation
|
// https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65 | ||
body = codec.decode(body); | ||
} else { | ||
throw new Error('Record batch is compressed but codec not found'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this error include the codec ID?
if (codec?.decode) { | ||
// TODO: does this need to be offset by 8 bytes? Since the uncompressed length is | ||
// written in the first 8 bytes: | ||
// https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confusing that here the uint8array is the "body" but it might correspond to the "buffer" in that spec. I'm not sure off hand which is actually here?
@@ -99,6 +99,7 @@ | |||
"jest": "27.5.1", | |||
"jest-silent-reporter": "0.5.0", | |||
"lerna": "4.0.0", | |||
"lz4js": "0.2.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intended to only be registered in tests, and users need to declare the dependency and register in their own applications if they want?
We should include code snippets and instructions for how to do that for each codec needed to support the full Arrow spec. The error message could even link to the docs on exactly how to enable a given compression codec.
if (codec?.decode) { | ||
// TODO: does this need to be offset by 8 bytes? Since the uncompressed length is | ||
// written in the first 8 bytes: | ||
// https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started digging into this although being unfamiliar to arrow and zstd both it has been a bit of a struggle.
The first 8 bytes are not part of the compressed section. You can see it corrected for in other places like here: https://github.com/apache/arrow/pull/9137/files#diff-38c3da9d6c7d521df83f59ee7d8cfced00951e8dcd38f39c7c36d26b6feee3d6R75
If you encode something with ZSTD you'll see it starts with [40, 181, 47, 253, 32]
which comes after those 8 bytes. I'e just encoding:
const testVal = compress(Buffer.from(new Uint8Array([1, 0, 0, 0, 0, 0, 0, 0])));
Uint8Array(17) [
40, 181, 47, 253, 32, 8, 65,
0, 0, 1, 0, 0, 0, 0,
0, 0, 0
]
This within the body here is:
const compressedMessage = new Uint8Array([
8, 0, 0, 0,
0, 0, 0, 0,
40, 181, 47,
253, 32, 8, 65,
0, 0, 1, 0,
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0,
0
]);
Now there's another issue where which is there appears to be padding at the end and between the fields which I'm very uncertain on how to detect besides brute forcing in reverse between the LZ4 header and the end.
console.log(decompress(compressedMessage.slice(8, 17 + 8)));
Uint8Array([
40, 181, 47, 253,
32, 8, 65, 0,
0, 1, 0, 0,
0, 0, 0, 0,
0
]);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing some digging! This PR isn't currently at the top of my list of things to work on, but would love for this support to be added to arrow JS!
Hey @kylebarron, super nice work on this PR so far! If you're busy with other things, can you briefly elaborate on what's left to do in your opinion such that we may consider moving it forward from our side. Thanks! |
No I have no imminent plans to work on this, but I'd love to see support for this added.
|
Haven't quite gotten this to work yet, attempting with a sample IPC file from PyArrow, but putting up for visibility.