
GH-15060: [JS] Add LargeUtf8 type #35780

Merged: 38 commits into apache:main, Dec 16, 2023

Conversation

domoritz (Member) commented May 26, 2023

This pull request adds support for the LargeUtf8 type in Arrow. Now we can create, decode, and encode these vectors. However, while the offset vectors support 64-bit integers, the value buffers are still limited to 32-bit lengths, meaning that LargeUtf8 vectors cannot yet hold more data than Utf8 vectors. We will see how we can address this limitation in a follow-up pull request; the issue is that JS typed arrays can be at most 2**31-1 elements long (implementation defined). This pull request also fixes a bug in a rounding method that prevented us from supporting large vectors, so it's already a big step forward.

Fixes #15060.
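As a quick orientation, here is a minimal usage sketch of the new type. It assumes the published apache-arrow package and the same entry points exercised in the debug scripts further down (vectorFromArray, Table, tableToIPC, tableFromIPC); it is illustrative, not code from this PR.

```ts
import { vectorFromArray, LargeUtf8, Table, tableToIPC, tableFromIPC } from 'apache-arrow';

// Build a LargeUtf8 vector; offsets are stored as 64-bit integers,
// but the value buffer itself is still subject to the 32-bit length limit noted above.
const strings = vectorFromArray(['foo', 'bar', 'baz'], new LargeUtf8());

// Round-trip through the IPC format to exercise both encoding and decoding.
const table = new Table({ strings });
const roundTripped = tableFromIPC(tableToIPC(table));
console.log(roundTripped.getChild('strings')?.toArray()); // ['foo', 'bar', 'baz']
```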

github-actions (bot) commented: ⚠️ GitHub issue #15060 has been automatically assigned in GitHub to the PR creator.

pitrou (Member) commented Jun 1, 2023

Why not add LargeBinary at the same time?

domoritz (Member, Author) commented Jun 1, 2023

> Why not add LargeBinary at the same time?

I will look at it when I get this working. I just need some help wrapping it up.

```diff
@@ -289,7 +290,7 @@ export abstract class Builder<T extends DataType = any, TNull = any> {
            valueOffsets = _offsets?.flush(length);
        } else if (valueOffsets = _offsets?.flush(length)) { // Variable-width primitives (Binary, Utf8), and Lists
            // Binary, Utf8
-           data = _values?.flush(_offsets.last());
+           data = _values?.flush(Number(_offsets.last()));
```
domoritz (Member, Author): I think this will fail for very large offsets that are larger than Number allows, but I think that's inevitable.

domoritz (Member, Author): Inevitable because JS supports indexing with numbers only.
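For illustration only (plain JS behavior, not code from this PR): once an offset exceeds Number.MAX_SAFE_INTEGER, the Number() conversion above can no longer represent it exactly, which is why the limitation is unavoidable while typed arrays are indexed with numbers.

```ts
// BigInt offsets beyond Number.MAX_SAFE_INTEGER (2 ** 53 - 1) silently lose
// precision when converted with Number().
const offset = 2n ** 53n + 1n;
console.log(Number(offset));                        // 9007199254740992 (rounded)
console.log(Number(offset) === 2 ** 53);            // true: the +1 was lost
console.log(Number.isSafeInteger(Number(offset)));  // false
```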

js/src/enum.ts (outdated diff)
```diff
@@ -201,6 +201,7 @@ export enum Type {
     SparseUnion = -24,
     IntervalDayTime = -25,
     IntervalYearMonth = -26,
+    LargeUtf8 = -27,
```
domoritz (Member, Author): Is this right?

Contributor:

> Note: Only enum values 0-17 (NONE through Map) are written to an Arrow IPC payload.

Seems like LargeUtf8 would be given a positive number then.

Contributor: I think it should be 20, according to the flatbuffers file that you get when you build the package (also, the comment about 0-17 is probably outdated).
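For reference, a short excerpt of the Arrow format's Type enum around the values being discussed; this is reconstructed from memory, not copied from the generated flatbuffers file.

```ts
export enum Type {
    // ...
    Duration = 18,
    LargeBinary = 19, // defined by the format, not yet implemented in Arrow JS at this point
    LargeUtf8 = 20,
    // ...
}
```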

domoritz (Member, Author): Fixed it, but we should probably do #39048.

domoritz marked this pull request as ready for review on June 12, 2023 16:53
domoritz requested a review from trxcllnt as a code owner on June 12, 2023 16:53
js/src/type.ts (outdated diff, resolved)
pitrou (Member) commented Dec 6, 2023

> Maybe you're right that this is effectively the same as what pyarrow is doing. It's hard to tell if those batches are views on the existing buffers.

What do "those batches" denote here? I might be able to answer if you phrase the question more precisely :-)

domoritz (Member, Author) commented Dec 7, 2023

It looks like what happens is that when you put differently chunked vectors into a table, the record batch boundaries are created at the chunk gaps from all vectors. So if you have two vectors with chunk lengths of 2,3 and 3,2, you will get record batches of lengths 2,1,2. I would hope that the data buffers are views on the existing buffers (I think that's the case since we use subarray on the buffers). The code for doing that in JS is in:

function uniformlyDistributeChunksAcrossRecordBatches<T extends TypeMap = any>(schema: Schema<T>, cols: Data<T[keyof T]>[][]): [Schema<T>, RecordBatch<T>[]] {

The only issue I can see with this approach is that in the case of variable width vectors (like utf8)---where we have a data buffer where we might reference the same position multiple times from the offset buffer---we would need to copy the data buffer in serialized record batches.
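A hypothetical sketch of the boundary merging described above (the function below is illustrative, not Arrow's actual implementation): chunk lengths 2,3 and 3,2 produce record batches of lengths 2,1,2.

```ts
// Illustrative only: merge per-column chunk boundaries into record batch lengths.
function mergedBatchLengths(chunkLengths: number[][]): number[] {
    // Collect the cumulative end positions of every chunk in every column.
    const boundaries = new Set<number>();
    for (const lengths of chunkLengths) {
        let pos = 0;
        for (const len of lengths) {
            pos += len;
            boundaries.add(pos);
        }
    }
    // Sorted boundaries are the record batch edges; their differences are the lengths.
    const sorted = [...boundaries].sort((a, b) => a - b);
    return sorted.map((end, i) => end - (i === 0 ? 0 : sorted[i - 1]));
}

console.log(mergedBatchLengths([[2, 3], [3, 2]])); // [2, 1, 2]
```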

pitrou (Member) commented Dec 7, 2023

> It looks like what happens is that when you put differently chunked vectors into a table, the record batch boundaries are created at the chunk gaps from all vectors.

Yes, indeed.

> I would hope that the data buffers are views on the existing buffers.

In C++ they are.

pitrou (Member) commented Dec 7, 2023

> The only issue I can see with this approach is that in the case of variable width vectors (like utf8)---where we have a data buffer where we might reference the same position multiple times from the offset buffer

This is impossible given the layout of the binary types.
It is possible for the binary view types, though.

domoritz (Member, Author) commented Dec 8, 2023

@kylebarron @bmschmidt @FrNecas This pull request is ready for review. I will look into actually supporting large value buffers in a follow-up pull request. For now, people can chunk vectors as we discussed, and at least we have support for the new type. @trxcllnt is currently out, so it would be great if you could review this pull request so I can merge it. I will work on LargeBinary in a follow-up pull request, but it will be easier to have this merged first.

<details>
<summary>Test scripts I played with to debug this code (just to have it saved somewhere).</summary>

```ts
import { constants } from 'node:buffer';
import { vectorFromArray, LargeUtf8, Utf8, makeBuilder, Table, tableToIPC, RecordBatchJSONWriter, tableFromIPC, Vector, Data, Int32, makeTable } from './src/Arrow.dom.ts';

console.log(`buffer.constants.MAX_LENGTH: ${constants.MAX_LENGTH.toLocaleString()}`)
console.log(`buffer.constants.MAX_STRING_LENGTH: ${constants.MAX_STRING_LENGTH.toLocaleString()}`)
console.log(`Number.MAX_SAFE_INTEGER: ${Number.MAX_SAFE_INTEGER.toLocaleString()}`)
console.log(`2**32: ${(2 ** 32).toLocaleString()}`)

// const roundLengthUpToNearest64Bytes_1 = (len: number, BPE: number) => ((((len * BPE) + 63) & ~63) || 64) / BPE;
// function roundLengthUpToNearest64Bytes_2(len: number, BPE: number) {
// const bytesMinus1 = len * BPE - 1;
// return ((bytesMinus1 - bytesMinus1 % 64 + 64) || 64) / BPE;
// }

// const roundLengthUpToNearest64Bytes = (len: number, BPE: number) => (round64(len * BPE) || 64) / BPE;
// const roundLengthUpToNearest64Bytes = (len: number, BPE: number) => len + 64 - 1 - (len + 63) % 64;

// const round64_1 = (num: number) => Math.ceil(num / 64) * 64
// const round64_2 = (num: number) => ((num) + 63) & ~63
// const round64_3 = (num: number) => num + 63 - (num + 63) % 64;
// const round64_4 = (num: number) => num - 1 - (num - 1) % 64 + 64;
// const round64_5 = (num: number) => ((num + 63) >> 6) << 6;

// {
// for (const f of [round64_1, round64_2, round64_3, round64_4, round64_5]) {
// console.log(f)
// console.log(f(-1))
// console.log(f(0))
// console.log(f(1))
// console.log(f(2))
// console.log(f(63))
// console.log(f(64))
// console.log(f(65))
// console.log(f(2 ** 16))
// console.log(f(2 ** 32))
// console.log(f(2 ** 42))
// console.log(f(Number.MAX_SAFE_INTEGER / 100))
// console.log(f(Number.MAX_SAFE_INTEGER))
// console.log(f(Number.MAX_SAFE_INTEGER * 10))
// }
// }

// {
// const compare = (v, b) => roundLengthUpToNearest64Bytes_1(v, b) === roundLengthUpToNearest64Bytes_2(v, b);

// for (const b of [1, 2]) {
// console.log(`b: ${b}`);
// console.log(compare(-1, b));
// console.log(compare(0, b));
// console.log(compare(1, b));
// console.log(compare(2, b));
// console.log(compare(63, b));
// console.log(compare(64, b));
// console.log(compare(65, b));
// console.log(compare(128, b));
// console.log(compare(672396, b));
// console.log(compare(2 ** 20, b));
// // console.log(compare(Number.MAX_SAFE_INTEGER / 100, b));
// }
// }

// {
// for (const f of [roundLengthUpToNearest64Bytes_1, roundLengthUpToNearest64Bytes_2]) {
// console.log(f)

// console.log("1 byte");
// console.log(f(0, 1));
// console.log(f(1, 1));
// console.log(f(2, 1));
// console.log(f(63, 1));
// console.log(f(64, 1));
// console.log(f(65, 1));
// console.log(f(2 ** 20, 1));
// console.log(f(Number.MAX_SAFE_INTEGER / 100, 1));

// console.log("2 bytes");
// console.log(f(0, 2));
// console.log(f(1, 2));
// console.log(f(2, 2));
// console.log(f(63, 2));
// console.log(f(64, 2));
// console.log(f(65, 2));
// console.log(f(2 ** 20, 2));
// console.log(f(Number.MAX_SAFE_INTEGER / 100, 2));
// }
// }

// {
// const utf8Vector = vectorFromArray(["foo", "bar", "baz"], new Utf8);
// const largeUtf8Vector = vectorFromArray(["foo", "bar", "baz"], new LargeUtf8);

// console.log(largeUtf8Vector);

// console.log(largeUtf8Vector.toArray());
// console.log(utf8Vector.toArray());

// const table = new Table({ utf8Vector, largeUtf8Vector });
// const writer = RecordBatchJSONWriter.writeAll(table);
// const string = await writer.toString();

// // JSON serialization
// // console.log(string);

// const table2 = tableFromIPC(JSON.parse(string))
// console.log(table2.toString())

// const table3 = tableFromIPC(tableToIPC(table))
// console.log(table3.toString())
// }

// {
// // try putting a lot of strings in a map (works in node but not bun)
// const N = 1e4;

// const map = new Map();
// for (let i = 0; i < N; i++) {
// const longString = "a".repeat(constants.MAX_STRING_LENGTH);
// map.set(i, longString);
// }

// console.log(`Total characters in map: ${(map.size * constants.MAX_STRING_LENGTH).toLocaleString()}`);
// }

// {
// const builder = makeBuilder({ type: new LargeUtf8 });

// const string = "hello world";

// builder.append(string);

// console.log(builder.finish().toVector());
// }

{
    // table with two columns with different chunk lengths

    const b1 = makeBuilder({ type: new Int32 });
    const b2 = makeBuilder({ type: new Int32 });

    const d1 = [] as Data[];
    const d2 = [] as Data[];

    b1.append(1);
    b1.append(2);
    d1.push(b1.flush());
    b1.append(3);
    b2.append(4);
    d1.push(b1.flush());

    b2.append(5);
    d2.push(b2.flush());
    b2.append(6);
    b2.append(7);
    b2.append(8);
    d2.push(b2.flush());

    const v1 = new Vector(d1);
    const v2 = new Vector(d2);

    const table = new Table({ v1, v2 });

    console.log(table.batches.map((x) => x.toArray()));

    // console.log(table.toArray());
    // console.log(table.data);
}

// {
// const longString = "a".repeat(constants.MAX_STRING_LENGTH);

// const builder = makeBuilder({ type: new LargeUtf8 });

// const data = [] as Data[];
// // const vectors = [] as Vector[];

// builder.append(longString);
// builder.append(longString);
// builder.append(longString);
// // vectors.push(builder.toVector());
// data.push(builder.flush());

// builder.append(longString);
// builder.append(longString);
// builder.append(longString);
// // vectors.push(builder.toVector());
// data.push(builder.flush());

// console.log(new Vector(data));
// }

{
    // make the longest possible string (longer will crash with RangeError: Invalid string length)
    const longString = "a".repeat(constants.MAX_STRING_LENGTH);

    const builder = makeBuilder({ type: new LargeUtf8 });

    // length of vector
    const N = 8;

    console.log(`Building vector with total string length (potential need for offsets): ${(constants.MAX_STRING_LENGTH * N).toLocaleString()} or <= 2^${Math.ceil(Math.log2(constants.MAX_STRING_LENGTH * N))}`);

    for (let i = 0; i < N; i++) {
        builder.append(longString);
    }
    // add string to force another offset
    builder.append("");
    const vector = builder.finish().toVector();

    console.log(vector.length);
    console.log(vector.byteLength.toLocaleString());

    // const table = new Table({ strings: vector });
    // console.log(table);

    // const ipc = tableToIPC(table);
    // console.log(ipc);

    // console.log(vector.toArray());
}

// {
// const builder = makeBuilder({ type: new LargeUtf8 });

// const string = "a".repeat(1e5);
// const N = 1e7;

// for (let i = 0; i < N; i++) {
// builder.append(string);
// }
// const vector = builder.toVector();

// console.log(vector);
// }

```

</details>

kylebarron (Contributor) left a comment:

lgtm!

```diff
@@ -175,6 +174,7 @@ export enum Type {
     FixedSizeList = 16, /** Fixed-size list. Each value occupies the same number of bytes */
     Map = 17, /** Map of named logical types */
     Duration = 18, /** Measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds. */
+    LargeUtf8 = 20, /** Large variable-length string as List<Char> */
```
Contributor: Did you skip 19 on purpose?

domoritz (Member, Author): We haven't implemented it, so I'm skipping it for now.

```diff
     // If we have a non-zero offset, create a new offsets array with the values
     // shifted by the start offset, such that the new start offset is 0
-    if (offset !== 0) {
+    if (offset != 0) {
```
Contributor: Is this necessary to compare the bigint to a number?

domoritz (Member, Author): Yes. This works for both number and bigint.
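A quick illustration of the behavior relied on here (plain JS semantics, not code from this PR): loose equality coerces between BigInt and Number, while strict equality never considers them equal.

```ts
// `offset != 0` works whether offset is the number 0 or the bigint 0n,
// because loose (in)equality compares BigInt and Number by value.
console.log(0n == 0);   // true  -> `0n != 0` is false
console.log(0n === 0);  // false -> `0n !== 0` would be true even for a zero offset
console.log(5n != 0);   // true
```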

domoritz merged commit 49fde23 into apache:main on Dec 16, 2023
10 checks passed
domoritz deleted the dom/large-utf8 branch on December 16, 2023 16:10

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 49fde23.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

geclos commented Dec 18, 2023

Thanks for your work guys!

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Successfully merging this pull request may close these issues: [JS] Missing support for LargeUtf8 datatype (#15060)

6 participants