
GH-15060: [JS] Add LargeUtf8 type #35780

Merged: 38 commits into apache:main, Dec 16, 2023

Conversation

domoritz (Member) commented May 26, 2023

This pull request adds support for the LargeUtf8 type in Arrow. Now we can create, decode, and encode these vectors. However, while the offset vectors support 64-bit integers, the value buffers are still limited to 32-bit lengths, meaning that LargeUtf8 vectors cannot yet hold more data than Utf8 vectors. We will see how we can address this limitation in a follow-up pull request; the issue is that JS typed arrays can be at most 2**31-1 elements long (implementation defined). This pull request also fixes a bug in a rounding method that prevented us from supporting large vectors, so it's already a big step forward.

Fixes #15060.
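As a quick orientation, here is a minimal usage sketch of the new type. It assumes the published apache-arrow package and the same entry points exercised in the debug scripts further down (vectorFromArray, Table, tableToIPC, tableFromIPC); it is illustrative, not code from this PR.

```ts
import { vectorFromArray, LargeUtf8, Table, tableToIPC, tableFromIPC } from 'apache-arrow';

// Build a LargeUtf8 vector; offsets are stored as 64-bit integers,
// but the value buffer itself is still subject to the 32-bit length limit noted above.
const strings = vectorFromArray(['foo', 'bar', 'baz'], new LargeUtf8());

// Round-trip through the IPC format to exercise both encoding and decoding.
const table = new Table({ strings });
const roundTripped = tableFromIPC(tableToIPC(table));
console.log(roundTripped.getChild('strings')?.toArray()); // ['foo', 'bar', 'baz']
```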

github-actions (bot) commented: ⚠️ GitHub issue #15060 has been automatically assigned in GitHub to the PR creator.

pitrou (Member) commented Jun 1, 2023

Why not add LargeBinary at the same time?

domoritz (Member, Author) commented Jun 1, 2023

> Why not add LargeBinary at the same time?

I will look at it when I get this working. I just need some help wrapping it up.

```diff
@@ -289,7 +290,7 @@ export abstract class Builder<T extends DataType = any, TNull = any> {
            valueOffsets = _offsets?.flush(length);
        } else if (valueOffsets = _offsets?.flush(length)) { // Variable-width primitives (Binary, Utf8), and Lists
            // Binary, Utf8
-           data = _values?.flush(_offsets.last());
+           data = _values?.flush(Number(_offsets.last()));
```
domoritz (Member, Author): I think this will fail for very large offsets that are larger than Number allows, but I think that's inevitable.

domoritz (Member, Author): Inevitable because JS supports indexing with numbers only.
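For illustration only (plain JS behavior, not code from this PR): once an offset exceeds Number.MAX_SAFE_INTEGER, the Number() conversion above can no longer represent it exactly, which is why the limitation is unavoidable while typed arrays are indexed with numbers.

```ts
// BigInt offsets beyond Number.MAX_SAFE_INTEGER (2 ** 53 - 1) silently lose
// precision when converted with Number().
const offset = 2n ** 53n + 1n;
console.log(Number(offset));                        // 9007199254740992 (rounded)
console.log(Number(offset) === 2 ** 53);            // true: the +1 was lost
console.log(Number.isSafeInteger(Number(offset)));  // false
```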

js/src/enum.ts (outdated diff)
```diff
@@ -201,6 +201,7 @@ export enum Type {
     SparseUnion = -24,
     IntervalDayTime = -25,
     IntervalYearMonth = -26,
+    LargeUtf8 = -27,
```
domoritz (Member, Author): Is this right?

Contributor:

> Note: Only enum values 0-17 (NONE through Map) are written to an Arrow IPC payload.

Seems like LargeUtf8 would be given a positive number then.

Contributor: I think it should be 20, according to the flatbuffers file that you get when you build the package (also, the comment about 0-17 is probably outdated).
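For reference, a short excerpt of the Arrow format's Type enum around the values being discussed; this is reconstructed from memory, not copied from the generated flatbuffers file.

```ts
export enum Type {
    // ...
    Duration = 18,
    LargeBinary = 19, // defined by the format, not yet implemented in Arrow JS at this point
    LargeUtf8 = 20,
    // ...
}
```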

domoritz (Member, Author): Fixed it, but we should probably do #39048.

domoritz marked this pull request as ready for review on June 12, 2023 16:53
domoritz requested a review from trxcllnt as a code owner on June 12, 2023 16:53
js/src/type.ts (outdated diff, resolved)
pitrou (Member) commented Dec 6, 2023

> Maybe you're right that this is effectively the same as what pyarrow is doing. It's hard to tell if those batches are views on the existing buffers.

What do "those batches" denote here? I might be able to answer if you phrase the question more precisely :-)

domoritz (Member, Author) commented Dec 7, 2023

It looks like what happens is that when you put differently chunked vectors into a table, the record batch boundaries are created at the chunk gaps from all vectors. So if you have two vectors with chunk lengths of 2,3 and 3,2, you will get record batches of lengths 2,1,2. I would hope that the data buffers are views on the existing buffers (I think that's the case since we use subarray on the buffers). The code for doing that in JS is in:

function uniformlyDistributeChunksAcrossRecordBatches<T extends TypeMap = any>(schema: Schema<T>, cols: Data<T[keyof T]>[][]): [Schema<T>, RecordBatch<T>[]] {

The only issue I can see with this approach is that in the case of variable width vectors (like utf8)---where we have a data buffer where we might reference the same position multiple times from the offset buffer---we would need to copy the data buffer in serialized record batches.
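A hypothetical sketch of the boundary merging described above (the function below is illustrative, not Arrow's actual implementation): chunk lengths 2,3 and 3,2 produce record batches of lengths 2,1,2.

```ts
// Illustrative only: merge per-column chunk boundaries into record batch lengths.
function mergedBatchLengths(chunkLengths: number[][]): number[] {
    // Collect the cumulative end positions of every chunk in every column.
    const boundaries = new Set<number>();
    for (const lengths of chunkLengths) {
        let pos = 0;
        for (const len of lengths) {
            pos += len;
            boundaries.add(pos);
        }
    }
    // Sorted boundaries are the record batch edges; their differences are the lengths.
    const sorted = [...boundaries].sort((a, b) => a - b);
    return sorted.map((end, i) => end - (i === 0 ? 0 : sorted[i - 1]));
}

console.log(mergedBatchLengths([[2, 3], [3, 2]])); // [2, 1, 2]
```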

pitrou (Member) commented Dec 7, 2023

> It looks like what happens is that when you put differently chunked vectors into a table, the record batch boundaries are created at the chunk gaps from all vectors.

Yes, indeed.

> I would hope that the data buffers are views on the existing buffers.

In C++ they are.

pitrou (Member) commented Dec 7, 2023

> The only issue I can see with this approach is that in the case of variable width vectors (like utf8)---where we have a data buffer where we might reference the same position multiple times from the offset buffer

This is impossible given the layout of the binary types.
It is possible for the binary view types, though.

domoritz (Member, Author) commented Dec 8, 2023

@kylebarron @bmschmidt @FrNecas This pull request is ready for review. I will look into actually supporting large value buffers in a follow-up pull request. For now, people can chunk vectors as we discussed, and at least we have support for the new type. @trxcllnt is currently out, so it would be great if you could review this pull request so I can merge it. I will work on LargeBinary in a follow-up pull request, but it will be easier to have this merged first.

<details>
<summary>Test scripts I played with to debug this code (just to have it saved somewhere).</summary>

```ts
import { constants } from 'node:buffer';
import { vectorFromArray, LargeUtf8, Utf8, makeBuilder, Table, tableToIPC, RecordBatchJSONWriter, tableFromIPC, Vector, Data, Int32, makeTable } from './src/Arrow.dom.ts';

console.log(`buffer.constants.MAX_LENGTH: ${constants.MAX_LENGTH.toLocaleString()}`)
console.log(`buffer.constants.MAX_STRING_LENGTH: ${constants.MAX_STRING_LENGTH.toLocaleString()}`)
console.log(`Number.MAX_SAFE_INTEGER: ${Number.MAX_SAFE_INTEGER.toLocaleString()}`)
console.log(`2**32: ${(2 ** 32).toLocaleString()}`)

// const roundLengthUpToNearest64Bytes_1 = (len: number, BPE: number) => ((((len * BPE) + 63) & ~63) || 64) / BPE;
// function roundLengthUpToNearest64Bytes_2(len: number, BPE: number) {
// const bytesMinus1 = len * BPE - 1;
// return ((bytesMinus1 - bytesMinus1 % 64 + 64) || 64) / BPE;
// }

// const roundLengthUpToNearest64Bytes = (len: number, BPE: number) => (round64(len * BPE) || 64) / BPE;
// const roundLengthUpToNearest64Bytes = (len: number, BPE: number) => len + 64 - 1 - (len + 63) % 64;

// const round64_1 = (num: number) => Math.ceil(num / 64) * 64
// const round64_2 = (num: number) => ((num) + 63) & ~63
// const round64_3 = (num: number) => num + 63 - (num + 63) % 64;
// const round64_4 = (num: number) => num - 1 - (num - 1) % 64 + 64;
// const round64_5 = (num: number) => ((num + 63) >> 6) << 6;

// {
// for (const f of [round64_1, round64_2, round64_3, round64_4, round64_5]) {
// console.log(f)
// console.log(f(-1))
// console.log(f(0))
// console.log(f(1))
// console.log(f(2))
// console.log(f(63))
// console.log(f(64))
// console.log(f(65))
// console.log(f(2 ** 16))
// console.log(f(2 ** 32))
// console.log(f(2 ** 42))
// console.log(f(Number.MAX_SAFE_INTEGER / 100))
// console.log(f(Number.MAX_SAFE_INTEGER))
// console.log(f(Number.MAX_SAFE_INTEGER * 10))
// }
// }

// {
// const compare = (v, b) => roundLengthUpToNearest64Bytes_1(v, b) === roundLengthUpToNearest64Bytes_2(v, b);

// for (const b of [1, 2]) {
// console.log(`b: ${b}`);
// console.log(compare(-1, b));
// console.log(compare(0, b));
// console.log(compare(1, b));
// console.log(compare(2, b));
// console.log(compare(63, b));
// console.log(compare(64, b));
// console.log(compare(65, b));
// console.log(compare(128, b));
// console.log(compare(672396, b));
// console.log(compare(2 ** 20, b));
// // console.log(compare(Number.MAX_SAFE_INTEGER / 100, b));
// }
// }

// {
// for (const f of [roundLengthUpToNearest64Bytes_1, roundLengthUpToNearest64Bytes_2]) {
// console.log(f)

// console.log("1 byte");
// console.log(f(0, 1));
// console.log(f(1, 1));
// console.log(f(2, 1));
// console.log(f(63, 1));
// console.log(f(64, 1));
// console.log(f(65, 1));
// console.log(f(2 ** 20, 1));
// console.log(f(Number.MAX_SAFE_INTEGER / 100, 1));

// console.log("2 bytes");
// console.log(f(0, 2));
// console.log(f(1, 2));
// console.log(f(2, 2));
// console.log(f(63, 2));
// console.log(f(64, 2));
// console.log(f(65, 2));
// console.log(f(2 ** 20, 2));
// console.log(f(Number.MAX_SAFE_INTEGER / 100, 2));
// }
// }

// {
// const utf8Vector = vectorFromArray(["foo", "bar", "baz"], new Utf8);
// const largeUtf8Vector = vectorFromArray(["foo", "bar", "baz"], new LargeUtf8);

// console.log(largeUtf8Vector);

// console.log(largeUtf8Vector.toArray());
// console.log(utf8Vector.toArray());

// const table = new Table({ utf8Vector, largeUtf8Vector });
// const writer = RecordBatchJSONWriter.writeAll(table);
// const string = await writer.toString();

// // JSON serialization
// // console.log(string);

// const table2 = tableFromIPC(JSON.parse(string))
// console.log(table2.toString())

// const table3 = tableFromIPC(tableToIPC(table))
// console.log(table3.toString())
// }

// {
// // try putting a lot of strings in a map (works in node but not bun)
// const N = 1e4;

// const map = new Map();
// for (let i = 0; i < N; i++) {
// const longString = "a".repeat(constants.MAX_STRING_LENGTH);
// map.set(i, longString);
// }

// console.log(`Total characters in map: ${(map.size * constants.MAX_STRING_LENGTH).toLocaleString()}`);
// }

// {
// const builder = makeBuilder({ type: new LargeUtf8 });

// const string = "hello world";

// builder.append(string);

// console.log(builder.finish().toVector());
// }

{
    // table with two columns with different chunk lengths

    const b1 = makeBuilder({ type: new Int32 });
    const b2 = makeBuilder({ type: new Int32 });

    const d1 = [] as Data[];
    const d2 = [] as Data[];

    b1.append(1);
    b1.append(2);
    d1.push(b1.flush());
    b1.append(3);
    b2.append(4);
    d1.push(b1.flush());

    b2.append(5);
    d2.push(b2.flush());
    b2.append(6);
    b2.append(7);
    b2.append(8);
    d2.push(b2.flush());

    const v1 = new Vector(d1);
    const v2 = new Vector(d2);

    const table = new Table({ v1, v2 });

    console.log(table.batches.map((x) => x.toArray()));

    // console.log(table.toArray());
    // console.log(table.data);
}

// {
// const longString = "a".repeat(constants.MAX_STRING_LENGTH);

// const builder = makeBuilder({ type: new LargeUtf8 });

// const data = [] as Data[];
// // const vectors = [] as Vector[];

// builder.append(longString);
// builder.append(longString);
// builder.append(longString);
// // vectors.push(builder.toVector());
// data.push(builder.flush());

// builder.append(longString);
// builder.append(longString);
// builder.append(longString);
// // vectors.push(builder.toVector());
// data.push(builder.flush());

// console.log(new Vector(data));
// }

{
    // make the longest possible string (longer will crash with RangeError: Invalid string length)
    const longString = "a".repeat(constants.MAX_STRING_LENGTH);

    const builder = makeBuilder({ type: new LargeUtf8 });

    // length of vector
    const N = 8;

    console.log(`Building vector with total string length (potential need for offsets): ${(constants.MAX_STRING_LENGTH * N).toLocaleString()} or <= 2^${Math.ceil(Math.log2(constants.MAX_STRING_LENGTH * N))}`);

    for (let i = 0; i < N; i++) {
        builder.append(longString);
    }
    // add string to force another offset
    builder.append("");
    const vector = builder.finish().toVector();

    console.log(vector.length);
    console.log(vector.byteLength.toLocaleString());

    // const table = new Table({ strings: vector });
    // console.log(table);

    // const ipc = tableToIPC(table);
    // console.log(ipc);

    // console.log(vector.toArray());
}

// {
// const builder = makeBuilder({ type: new LargeUtf8 });

// const string = "a".repeat(1e5);
// const N = 1e7;

// for (let i = 0; i < N; i++) {
// builder.append(string);
// }
// const vector = builder.toVector();

// console.log(vector);
// }

```

</details>

kylebarron (Contributor) left a comment:

lgtm!

```diff
@@ -175,6 +174,7 @@ export enum Type {
     FixedSizeList = 16, /** Fixed-size list. Each value occupies the same number of bytes */
     Map = 17, /** Map of named logical types */
     Duration = 18, /** Measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds. */
+    LargeUtf8 = 20, /** Large variable-length string as List<Char> */
```
Contributor: Did you skip 19 on purpose?

domoritz (Member, Author): We haven't implemented it, so I'm skipping it for now.

```diff
     // If we have a non-zero offset, create a new offsets array with the values
     // shifted by the start offset, such that the new start offset is 0
-    if (offset !== 0) {
+    if (offset != 0) {
```
Contributor: Is this necessary to compare the bigint to a number?

domoritz (Member, Author): Yes. This works for both number and bigint.
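A quick illustration of the behavior relied on here (plain JS semantics, not code from this PR): loose equality coerces between BigInt and Number, while strict equality never considers them equal.

```ts
// `offset != 0` works whether offset is the number 0 or the bigint 0n,
// because loose (in)equality compares BigInt and Number by value.
console.log(0n == 0);   // true  -> `0n != 0` is false
console.log(0n === 0);  // false -> `0n !== 0` would be true even for a zero offset
console.log(5n != 0);   // true
```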

domoritz merged commit 49fde23 into apache:main on Dec 16, 2023
10 checks passed
domoritz deleted the dom/large-utf8 branch on December 16, 2023 16:10

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 49fde23.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

geclos commented Dec 18, 2023

Thanks for your work guys!

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Successfully merging this pull request may close these issues: [JS] Missing support for LargeUtf8 datatype (#15060)

6 participants