ARROW-12549: [JS] Table and RecordBatch should not extend Vector, make JS lib smaller #10371

Closed
wants to merge 172 commits into from

@trxcllnt (Contributor) commented May 21, 2021

This pull request addresses a number of issues that require a more substantial refactor.

The main goals are:

  1. Eliminate cruft by dropping support for outdated browsers/environments.
  2. Reduce total surface area by eliminating unnecessary Vector, Chunked, and Column classes.
  3. Reduce the amount of the library pulled in when Table, RecordBatch, or Vector classes are imported.

In this pull request, we have eliminated the type-specific Vector classes. There is now a single Vector class that holds a data instance, and type-specific behavior is implemented by visitors. RecordBatch no longer inherits from Vector, and neither does Table. Columns are gone. Vectors and tables are now created through separate functions that can easily be tree-shaken.
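As a rough illustration of the "one Vector plus type-specific visitors" design described above, here is a minimal sketch. All names (`SketchData`, `SketchVector`, `getVisitor`) are hypothetical stand-ins, and the bit-packed boolean layout is an assumption; this is not the library's actual implementation.

```typescript
// Hypothetical, simplified stand-ins; these are not Arrow's real classes.
type SketchTypeId = 'int32' | 'bool';

interface SketchData {
  typeId: SketchTypeId;
  length: number;
  values: Int32Array | Uint8Array;
}

// Type-specific behavior lives in a visitor table keyed by type id,
// instead of in one Vector subclass per type.
const getVisitor: Record<SketchTypeId, (data: SketchData, index: number) => number | boolean> = {
  int32: (data, index) => data.values[index],
  // Assumes bit-packed booleans, least-significant bit first.
  bool: (data, index) => (data.values[index >> 3] & (1 << (index & 7))) !== 0,
};

// One vector class for every type: it only holds data and delegates.
class SketchVector {
  constructor(public readonly data: SketchData) {}
  get length(): number { return this.data.length; }
  get(index: number): number | boolean {
    return getVisitor[this.data.typeId](this.data, index);
  }
}

const ints = new SketchVector({ typeId: 'int32', length: 3, values: new Int32Array([10, 20, 30]) });
// 5 is binary 101: true, false, true.
const bools = new SketchVector({ typeId: 'bool', length: 3, values: new Uint8Array([5]) });
```

Because the per-type logic sits in a lookup table rather than in subclasses, code that only needs `get` does not have to pull in every other operation for every type, which is the kind of separation that tends to help tree shaking.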

We also added tests for the bundles, fixed some bundling issues with webpack, and updated dependencies (including TypeScript and FlatBuffers). Finally, we added memoization to dictionary vectors to reduce the overhead of decoding UTF-8 to strings.
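The memoization idea can be sketched as follows. This is only an assumption about the general shape, not the code in this PR: `MemoizedDictionarySketch` and `decodeAscii` are hypothetical, the decoder is ASCII-only for brevity, and the second dictionary entry is made up.

```typescript
// Hypothetical stand-in for an expensive UTF-8 decode; counts its calls.
let decodeCalls = 0;
function decodeAscii(bytes: Uint8Array): string {
  decodeCalls++;
  let out = '';
  // ASCII-only for brevity; real code would use a proper UTF-8 decoder.
  for (let i = 0; i < bytes.length; i++) out += String.fromCharCode(bytes[i]);
  return out;
}

class MemoizedDictionarySketch {
  // Decoded strings indexed by dictionary key; undefined means "not decoded yet".
  private cache: (string | undefined)[] = [];
  constructor(private entries: Uint8Array[]) {}

  get(key: number): string {
    let value = this.cache[key];
    if (value === undefined) {
      value = decodeAscii(this.entries[key]);
      this.cache[key] = value;
    }
    return value;
  }
}

const dictionarySketch = new MemoizedDictionarySketch([
  new Uint8Array([83, 101, 97, 116, 116, 108, 101]), // "Seattle"
  new Uint8Array([66, 111, 115, 116, 111, 110]),     // "Boston" (made-up second entry)
]);
// Five lookups over two distinct keys trigger only two decodes.
const decodedValues = [0, 1, 0, 0, 1].map((key) => dictionarySketch.get(key));
```

With a low-cardinality dictionary and a million rows, each distinct string is decoded once rather than once per row.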

A quick overview of Arrow with the new API: https://observablehq.com/d/9480eccb30a21010.

Also addresses:

Performance comparison:

Master:

Prepare Data: 502.401ms
Running "Parse" suite...
dataset: tracks, function: Table.from 15,578 ops/s ±0.67%, 0.064 ms, 94 samples
dataset: tracks, function: readBatches 15,853 ops/s ±0.59%, 0.063 ms, 97 samples
dataset: tracks, function: serialize 969 ops/s ±1.8%, 1 ms, 93 samples
Running "Get values by index" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32 78 ops/s ±0.090%, 13 ms, 82 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32 79 ops/s ±0.090%, 13 ms, 70 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8> 1.59 ops/s ±25%, 563 ms, 9 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8> 1.74 ops/s ±3.2%, 576 ms, 9 samples
Running "Iterate vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32 85 ops/s ±0.14%, 12 ms, 74 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32 85 ops/s ±0.11%, 12 ms, 75 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8> 1.51 ops/s ±3.1%, 657 ms, 8 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8> 1.49 ops/s ±4.0%, 666 ms, 8 samples
Running "Slice toArray vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32 2,588 ops/s ±3.0%, 0.4 ms, 74 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32 2,345 ops/s ±1.7%, 0.43 ms, 73 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8> 1.29 ops/s ±5.3%, 760 ms, 8 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8> 1.28 ops/s ±4.1%, 784 ms, 8 samples
Running "Slice vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32 4,212,193 ops/s ±0.23%, 0 ms, 100 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32 4,400,234 ops/s ±0.80%, 0 ms, 92 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8> 4,764,651 ops/s ±0.13%, 0 ms, 101 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8> 4,763,581 ops/s ±0.050%, 0 ms, 98 samples
Running "DataFrame Iterate" suite...
dataset: tracks, length: 1,000,000 23.1 ops/s ±2.1%, 43 ms, 43 samples
Running "DataFrame Count By" suite...
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8> 535 ops/s ±0.050%, 1.9 ms, 99 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8> 535 ops/s ±0.040%, 1.9 ms, 96 samples
Running "DataFrame Filter-Scan Count" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32, test: gt, value: 0 57 ops/s ±0.090%, 18 ms, 75 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32, test: gt, value: 0 57 ops/s ±0.050%, 18 ms, 74 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle 99 ops/s ±0.060%, 10 ms, 86 samples
Running "DataFrame Filter-Iterate" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32, test: gt, value: 0 37 ops/s ±0.12%, 27 ms, 66 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32, test: gt, value: 0 37 ops/s ±0.14%, 27 ms, 66 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle 70 ops/s ±0.45%, 14 ms, 73 samples
Running "DataFrame Direct Count" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32, test: gt, value: 0 160 ops/s ±0.040%, 6.3 ms, 83 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32, test: gt, value: 0 162 ops/s ±0.12%, 6.1 ms, 85 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle 1.51 ops/s ±5.6%, 664 ms, 8 samples

This branch:

Running "vectorFromArray" suite...
from: numbers                  106 ops/s ±1.1%,   9.3 ms, 79 samples
from: booleans                 101 ops/s ±1.4%,   9.8 ms, 76 samples
from: dictionary               105 ops/s ±4.1%,     9 ms, 78 samples
Running "Iterate Vector" suite...
from: uint8Array               896 ops/s ±0.21%,  1.1 ms, 94 samples
from: uint16Array              896 ops/s ±0.82%,  1.1 ms, 94 samples
from: uint32Array              884 ops/s ±0.39%,  1.1 ms, 95 samples
from: uint64Array              285 ops/s ±0.19%,  3.5 ms, 92 samples
from: int8Array                882 ops/s ±0.65%,  1.1 ms, 95 samples
from: int16Array               899 ops/s ±0.37%,  1.1 ms, 95 samples
from: int32Array               887 ops/s ±0.46%,  1.1 ms, 92 samples
from: int64Array               280 ops/s ±0.60%,  3.5 ms, 91 samples
from: float32Array             805 ops/s ±0.86%,  1.2 ms, 90 samples
from: float64Array             814 ops/s ±0.44%,  1.2 ms, 92 samples
from: numbers                  812 ops/s ±0.39%,  1.2 ms, 91 samples
from: booleans                 284 ops/s ±0.14%,  3.5 ms, 92 samples
from: dictionary               298 ops/s ±0.44%,  3.3 ms, 91 samples
from: string                  16.2 ops/s ±3.9%,    59 ms, 45 samples
Running "Spread Vector" suite...
from: uint8Array               360 ops/s ±1.2%,   2.7 ms, 93 samples
from: uint16Array              374 ops/s ±0.55%,  2.6 ms, 92 samples
from: uint32Array              372 ops/s ±1.1%,   2.6 ms, 91 samples
from: uint64Array              164 ops/s ±0.66%,    6 ms, 78 samples
from: int8Array                372 ops/s ±0.64%,  2.7 ms, 96 samples
from: int16Array               380 ops/s ±0.42%,  2.6 ms, 94 samples
from: int32Array               375 ops/s ±0.87%,  2.6 ms, 92 samples
from: int64Array               164 ops/s ±0.64%,  6.1 ms, 86 samples
from: float32Array             327 ops/s ±0.62%,    3 ms, 85 samples
from: float64Array             318 ops/s ±1.1%,   3.1 ms, 91 samples
from: numbers                  326 ops/s ±0.74%,    3 ms, 89 samples
from: booleans                 178 ops/s ±0.92%,  5.6 ms, 84 samples
from: dictionary               189 ops/s ±0.51%,  5.2 ms, 89 samples
from: string                  14.8 ops/s ±3.7%,    65 ms, 41 samples
Running "toArray Vector" suite...
from: uint8Array        28,488,216 ops/s ±0.22%,    0 ms, 101 samples
from: uint16Array       28,777,482 ops/s ±0.41%,    0 ms, 98 samples
from: uint32Array       28,387,333 ops/s ±0.25%,    0 ms, 97 samples
from: uint64Array       23,412,763 ops/s ±0.68%,    0 ms, 97 samples
from: int8Array         21,497,600 ops/s ±0.22%,    0 ms, 94 samples
from: int16Array        21,990,137 ops/s ±0.16%,    0 ms, 101 samples
from: int32Array        21,809,196 ops/s ±0.68%,    0 ms, 96 samples
from: int64Array        20,084,822 ops/s ±0.68%,    0 ms, 93 samples
from: float32Array      18,452,580 ops/s ±0.83%,    0 ms, 96 samples
from: float64Array      18,527,057 ops/s ±0.54%,    0 ms, 92 samples
from: numbers           18,555,045 ops/s ±0.52%,    0 ms, 99 samples
from: booleans                 178 ops/s ±0.43%,  5.6 ms, 84 samples
from: dictionary               189 ops/s ±0.61%,  5.3 ms, 89 samples
from: string                  15.8 ops/s ±0.76%,   63 ms, 43 samples
Running "get Vector" suite...
from: uint8Array               441 ops/s ±1.1%,   2.2 ms, 95 samples
from: uint16Array              441 ops/s ±0.48%,  2.2 ms, 95 samples
from: uint32Array              443 ops/s ±0.23%,  2.2 ms, 96 samples
from: uint64Array              414 ops/s ±0.68%,  2.4 ms, 93 samples
from: int8Array                439 ops/s ±0.30%,  2.3 ms, 95 samples
from: int16Array               447 ops/s ±0.35%,  2.2 ms, 96 samples
from: int32Array               439 ops/s ±0.48%,  2.3 ms, 94 samples
from: int64Array               415 ops/s ±0.17%,  2.4 ms, 97 samples
from: float32Array             472 ops/s ±0.49%,  2.1 ms, 94 samples
from: float64Array             471 ops/s ±0.26%,  2.1 ms, 97 samples
from: numbers                  473 ops/s ±0.22%,  2.1 ms, 98 samples
from: booleans                 429 ops/s ±0.25%,  2.3 ms, 97 samples
from: dictionary               464 ops/s ±0.23%,  2.1 ms, 96 samples
from: string                  17.8 ops/s ±1.3%,    56 ms, 48 samples
Running "Parse" suite...
dataset: tracks, function: read recordBatches
       12,047 ops/s ±0.77%, 0.082 ms, 100 samples
dataset: tracks, function: write recordBatches
        1,028 ops/s ±0.72%, 0.96 ms, 96 samples
Running "Get values by index" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32
           46 ops/s ±0.12%,   22 ms, 61 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32
           46 ops/s ±0.15%,   22 ms, 61 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
         25.3 ops/s ±0.37%,   39 ms, 46 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
         25.1 ops/s ±0.76%,   39 ms, 46 samples
Running "Iterate vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32
           84 ops/s ±0.20%,   12 ms, 73 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32
           82 ops/s ±0.65%,   12 ms, 72 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
           30 ops/s ±0.94%,   33 ms, 54 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
           30 ops/s ±0.41%,   33 ms, 54 samples
Running "Slice toArray vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32
        2,911 ops/s ±3.3%,  0.33 ms, 86 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32
        2,765 ops/s ±3.2%,  0.35 ms, 77 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
           18 ops/s ±1.2%,    55 ms, 49 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
         18.2 ops/s ±0.73%,   54 ms, 50 samples
Running "Slice vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32
    4,338,570 ops/s ±0.52%,    0 ms, 94 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32
    4,341,418 ops/s ±0.41%,    0 ms, 97 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    3,656,243 ops/s ±0.45%,    0 ms, 101 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    3,598,448 ops/s ±1.0%,     0 ms, 97 samples
Running "Spread vectors" suite...
dataset: tracks, column: lat, length: 1,000,000, type: Float32
           16 ops/s ±4.3%,    59 ms, 44 samples
dataset: tracks, column: lng, length: 1,000,000, type: Float32
         16.1 ops/s ±4.2%,    60 ms, 45 samples
dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
         17.8 ops/s ±1.5%,    55 ms, 49 samples
dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
         17.6 ops/s ±1.7%,    55 ms, 48 samples
Running "Table" suite...
Iterate, dataset: tracks, numRows: 1,000,000
           27 ops/s ±0.28%,   37 ms, 49 samples
Spread, dataset: tracks, numRows: 1,000,000
         8.73 ops/s ±3.7%,   111 ms, 25 samples
toArray, dataset: tracks, numRows: 1,000,000
         8.15 ops/s ±4.9%,   115 ms, 26 samples
get, dataset: tracks, numRows: 1,000,000
         17.2 ops/s ±0.31%,   58 ms, 47 samples
Running "Table Direct Count" suite...
dataset: tracks, column: lat, numRows: 1,000,000, type: Float32, test: gt, value: 0
           74 ops/s ±0.16%,   14 ms, 77 samples
dataset: tracks, column: lng, numRows: 1,000,000, type: Float32, test: gt, value: 0
           74 ops/s ±0.20%,   14 ms, 77 samples
dataset: tracks, column: origin, numRows: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle
          80 ops/s ±0.060%,   12 ms, 71 samples
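
To read the two runs side by side, the ratio of ops/s on this branch to ops/s on master can be computed for suites that appear under the same name in both. The sketch below copies three rows from the output above; the `Comparison` type and `speedup` helper are illustrative only.

```typescript
// Throughput (ops/s) copied from the two benchmark runs above.
interface Comparison { suite: string; column: string; master: number; branch: number; }

const comparisons: Comparison[] = [
  { suite: 'Get values by index', column: 'lat', master: 78, branch: 46 },
  { suite: 'Get values by index', column: 'origin', master: 1.59, branch: 25.3 },
  { suite: 'Iterate vectors', column: 'origin', master: 1.51, branch: 30 },
];

// Ratio > 1 means this branch is faster; < 1 means it is slower.
function speedup(c: Comparison): number {
  return c.branch / c.master;
}

for (const c of comparisons) {
  console.log(`${c.suite} / ${c.column}: ${speedup(c).toFixed(2)}x`);
}
```

On these figures, indexed access and iteration over the dictionary-encoded `origin` column are roughly 16x and 20x faster, while indexed access on the Float32 `lat` column drops to about 0.59x of master. These are single runs, so the ratios are indicative only.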

[Symbol.iterator]() {
    return new MapRowIterator(this[kKeys], this[kVals]);
}
public toArray() { return Object.values(this.toJSON()); }
Member

Wouldn't it be more efficient not to go through toJSON here?

Contributor Author

Absolutely would, yeah 😄
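
A minimal sketch of the point raised in this thread, using a hypothetical `RowSketch` class with parallel key and value arrays (not the actual `MapRow` from this PR): going through `toJSON()` builds a keyed object only to throw the keys away, whereas a direct copy skips that step.

```typescript
// Hypothetical row holding parallel key/value arrays; not Arrow's MapRow.
class RowSketch {
  constructor(private keys: string[], private vals: number[]) {}

  toJSON(): { [key: string]: number } {
    const json: { [key: string]: number } = {};
    for (let i = 0; i < this.keys.length; i++) json[this.keys[i]] = this.vals[i];
    return json;
  }

  // Same shape as the reviewed line: build a keyed object, then drop the keys.
  toArrayViaJSON(): number[] {
    const json = this.toJSON();
    return Object.keys(json).map((key) => json[key]);
  }

  // Direct alternative: copy the values without the intermediate object.
  toArrayDirect(): number[] {
    return this.vals.slice();
  }
}

const rowSketch = new RowSketch(['lat', 'lng'], [47.6, -122.3]);
```

The two versions agree here, but they can differ when keys repeat or look like integers, since object properties deduplicate and reorder such keys; that is worth checking before swapping one for the other.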

js/src/table.ts Outdated
return this.data.reduce((ary, data) =>
    ary.concat(toArrayVisitor.visit(data)),
    new Array<Struct<T>['TValue']>()
);
Member

How does this perform compared to return [...this]?

Contributor Author

dunno, can't run benchmarks yet 😛
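
Once benchmarks can run, the two approaches could be compared with something like the sketch below. `ChunkedSketch` and `timeMs` are hypothetical stand-ins (plain number chunks instead of Arrow data), so any timings would only hint at the relative cost of per-chunk `concat` versus driving the iterator with spread.

```typescript
// Hypothetical chunked container standing in for Table; not Arrow's class.
class ChunkedSketch {
  constructor(private chunks: number[][]) {}

  *[Symbol.iterator](): IterableIterator<number> {
    for (const chunk of this.chunks) yield* chunk;
  }

  // Shape of the reviewed code: one concat (and one new array) per chunk.
  toArrayConcat(): number[] {
    return this.chunks.reduce((ary, chunk) => ary.concat(chunk), new Array<number>());
  }

  // Reviewer's alternative: drive the iterator once per element.
  toArraySpread(): number[] {
    return [...this];
  }
}

// Average wall-clock milliseconds per call over a few runs (coarse, Date-based).
function timeMs(fn: () => unknown, runs = 5): number {
  const start = Date.now();
  for (let i = 0; i < runs; i++) fn();
  return (Date.now() - start) / runs;
}

const chunkedSketch = new ChunkedSketch([[1, 2], [3], [4, 5, 6]]);
```

For example, `timeMs(() => chunkedSketch.toArrayConcat())` and `timeMs(() => chunkedSketch.toArraySpread())` can be compared on realistically sized chunks; both methods return the same elements in the same order.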

js/src/table.ts Outdated

/** @ignore */
export interface Table<T extends { [key: string]: DataType } = any> {
Member

Do we want to support table.map(...)?

@domoritz
Member

This fixes https://issues.apache.org/jira/browse/ARROW-10794 as well, right?

@trxcllnt
Contributor Author

@domoritz yep, looks like it

js/package.json Outdated
@@ -96,10 +96,9 @@
 "ts-jest": "27.0.0",
 "ts-node": "10.0.0",
 "typedoc": "0.20.36",
-"typescript": "4.0.2",
+"typescript": "4.3.3",
Member

🎉

Contributor Author

Will need to figure out a solution to this before proceeding: microsoft/TypeScript-DOM-lib-generator#890 (comment)

@domoritz domoritz marked this pull request as ready for review January 11, 2022 23:39
@domoritz
Member

@ursabot please benchmark lang=JavaScript

ursabot commented Jan 16, 2022

Benchmark runs are scheduled for baseline = 7029f90 and contender = 6619579. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@domoritz domoritz closed this in 20b66c2 Jan 16, 2022
ursabot commented Jan 16, 2022

Benchmark runs are scheduled for baseline = 7029f90 and contender = 20b66c2. 20b66c2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.3%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
