ARROW-2116: [JS] implement IPC writers
https://issues.apache.org/jira/browse/ARROW-2116
https://issues.apache.org/jira/browse/ARROW-2115

This PR represents a first pass at implementing the IPC writers for binary stream and file formats in JS.

I've also added scripts to do the `json-to-arrow`, `file-to-stream`, and `stream-to-file` steps of the integration tests. These scripts rely on a new feature in Node 10 (the next LTS version), so please update your local Node installation. My attempts to use a library to stay backwards-compatible with Node 9 were unsuccessful.
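The Node 10 feature in question is presumably the (still experimental) async iteration over readable streams, which would also explain the `NODE_NO_WARNINGS` addition to the Travis environment below. A minimal, purely illustrative sketch of that pattern:

```js
// Illustrative only (not code from this PR): in Node 10, readable streams
// implement Symbol.asyncIterator, so stdin can be consumed with `for await`.
(async () => {
  const chunks = [];
  for await (const chunk of process.stdin) {
    chunks.push(chunk);
  }
  process.stderr.write(`read ${Buffer.concat(chunks).length} bytes\n`);
})();
```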

So far I've only implemented the APIs for serializing a preexisting Table to the stream or file format. We will want to refactor this soon to support end-to-end streaming.
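As a rough sketch of the Table-level API this adds (the `serialize` method comes from this PR per the commit list below; the `apache-arrow` entry point and the sample variable names are assumptions for illustration):

```js
const { Table } = require('apache-arrow');

// `arrowBytes` is a hypothetical Uint8Array of Arrow data read from elsewhere;
// Table.from parses it into an in-memory Table.
const table = Table.from(arrowBytes);

// New in this PR: serialize the Table back to Arrow binary (stream format).
const serialized = table.serialize();

// The serialized bytes round-trip back through the reader.
const roundTripped = Table.from(serialized);
```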

Edit: Figured out why the integration tests weren't passing, fixed now 🥇

Author: ptaylor <paul.e.taylor@me.com>
Author: Paul Taylor <paul.e.taylor@me.com>
Author: lsb <leebutterman@gmail.com>

Closes #2035 from trxcllnt/js-buffer-writer and squashes the following commits:

261a864 <ptaylor> Merge branch 'master' into js-buffer-writer
917c2fc <ptaylor> test the ES5/UMD bundle in the integration tests
7a346dc <ptaylor> add a handy script for printing the alignment of buffers in a table
4594fe3 <ptaylor> align to 8-byte boundaries only
1a9864c <ptaylor> read message bodyLength from flatbuffer object
e34afaa <ptaylor> export the RecordBatchSerializer
b765b12 <ptaylor> speed up integration_test.py by only testing the JS source, not every compilation target
4ed6554 <ptaylor> Merge branch 'master' of https://github.com/apache/arrow into js-buffer-writer
f497f7a <ptaylor> measure maxColumnWidths across all recordBatches when printing a table
14e6b38 <ptaylor> cleanup: remove dead code
df43bc5 <ptaylor> make arrow2csv support streaming files from stdin, add rowsToString() method to RecordBatch
7924e67 <ptaylor> rename readNodeStream -> readStream, fromNodeStream -> fromReadableStream, add support for reading File format
efc7225 <ptaylor> fix perf tests
a06180b <ptaylor> don't run JS integration tests in src-only mode when --debug=true
ed85572 <ptaylor> fix instanceof ArrayBuffer in jest/node 10
2df1a4a <ptaylor> update google-closure-compiler, remove gcc-specific workarounds in the build
a6a7ab9 <ptaylor> put test tables into hoisted functions so it's easier to set breakpoints
a79334d <ptaylor> fix typo again after rebase
081fefc <ptaylor> remove bin from ts package.json
ccaf489 <ptaylor> remove stream-to-iterator
c0b88c2 <ptaylor> always write flatbuffer vectors
0be6de3 <ptaylor> use node v10.1.0 in travis
d4b8637 <ptaylor> add license headers
b52af25 <ptaylor> cleanup
3187732 <ptaylor> set bitmap alignment to 8 bytes if < 64 values
af9f4a8 <ptaylor> run integration tests in node 10.1
de81ac1 <ptaylor> Update JSTester to be an Arrow producer now too
832cc30 <ptaylor> add more js integration scripts for creating/converting arrow formats
263d06d <ptaylor> clean up js integration script
78cba38 <ptaylor> arrow2csv: support reading arrow streams from stdin
e75da13 <ptaylor> add support for reading streaming format via node streams
4e80851 <ptaylor> write correct recordBatch length
73a2fa9 <ptaylor> fix stream -> file, file -> stream, add tests
304e75d <ptaylor> fix magic string alignment in file reader, add file reader tests
402187e <ptaylor> add apache license headers
db02c1c <ptaylor> Add an integration test for binary writer
a242da8 <ptaylor> Add `Table.prototype.serialize` method to make ArrayBuffers from Tables
da0f457 <ptaylor> first pass at a working binary writer, only arrow stream format tested so far
508f4f8 <ptaylor> add getChildAt(n) methods to List and FixedSizeList Vectors to be more consistent with the other nested Vectors, make it easier to do the writer
a9d773d <ptaylor> move ValidityView into its own module, like ChunkedView is
85eb7ee <ptaylor> fix erroneous footer length check in reader
4333e54 <ptaylor> FileBlock constructor should accept Long | number, have public number fields
7fff99e <ptaylor> move IPC magic into its own module
d98e178 <ptaylor> add option to run gulp cmds with `-t src` to run jest against the `src` folder direct
aaec76b <ptaylor> fix @std/esm options for node10
18b9dd2 <lsb> Fix a typo
efb840f <Paul Taylor> fix typo
ae1f481 <Paul Taylor> align to 64-byte boundaries
c8ba1fe <Paul Taylor> don't write an empty buffer for NullVectors
43c671f <Paul Taylor>  add Binary writer
6522cb0 <Paul Taylor> fix Data generics for FixedSizeList
ef1acc7 <Paul Taylor> read union buffers in the correct order
dc92b83 <Paul Taylor> fix typo
trxcllnt authored and Brian Hulette committed May 23, 2018
1 parent 1d9d893 commit fc7a382
Showing 52 changed files with 2,121 additions and 4,101 deletions.
2 changes: 2 additions & 0 deletions ci/travis_env_common.sh
@@ -17,6 +17,8 @@
 # specific language governing permissions and limitations
 # under the License.
 
+# hide nodejs experimental-feature warnings
+export NODE_NO_WARNINGS=1
 export MINICONDA=$HOME/miniconda
 export PATH="$MINICONDA/bin:$PATH"
 export CONDA_PKGS_DIRS=$HOME/.conda_packages
33 changes: 25 additions & 8 deletions integration/integration_test.py
@@ -1092,35 +1092,52 @@ def file_to_stream(self, file_path, stream_path):
         os.system(cmd)
 
 class JSTester(Tester):
-    PRODUCER = False
+    PRODUCER = True
     CONSUMER = True
 
-    INTEGRATION_EXE = os.path.join(ARROW_HOME, 'js/bin/integration.js')
+    EXE_PATH = os.path.join(ARROW_HOME, 'js/bin')
+    VALIDATE = os.path.join(EXE_PATH, 'integration.js')
+    JSON_TO_ARROW = os.path.join(EXE_PATH, 'json-to-arrow.js')
+    STREAM_TO_FILE = os.path.join(EXE_PATH, 'stream-to-file.js')
+    FILE_TO_STREAM = os.path.join(EXE_PATH, 'file-to-stream.js')
 
     name = 'JS'
 
-    def _run(self, arrow_path=None, json_path=None, command='VALIDATE'):
-        cmd = [self.INTEGRATION_EXE]
+    def _run(self, exe_cmd, arrow_path=None, json_path=None, command='VALIDATE'):
+        cmd = [exe_cmd]
 
         if arrow_path is not None:
             cmd.extend(['-a', arrow_path])
 
         if json_path is not None:
             cmd.extend(['-j', json_path])
 
-        cmd.extend(['--mode', command])
+        cmd.extend(['--mode', command, '-t', 'es5', '-m', 'umd'])
 
         if self.debug:
             print(' '.join(cmd))
 
         run_cmd(cmd)
 
     def validate(self, json_path, arrow_path):
-        return self._run(arrow_path, json_path, 'VALIDATE')
+        return self._run(self.VALIDATE, arrow_path, json_path, 'VALIDATE')
+
+    def json_to_file(self, json_path, arrow_path):
+        cmd = ['node', self.JSON_TO_ARROW, '-a', arrow_path, '-j', json_path]
+        cmd = ' '.join(cmd)
+        if self.debug:
+            print(cmd)
+        os.system(cmd)
 
     def stream_to_file(self, stream_path, file_path):
-        # Just copy stream to file, we can read the stream directly
-        cmd = ['cp', stream_path, file_path]
+        cmd = ['cat', stream_path, '|', 'node', self.STREAM_TO_FILE, '>', file_path]
         cmd = ' '.join(cmd)
         if self.debug:
             print(cmd)
+        os.system(cmd)
+
+    def file_to_stream(self, file_path, stream_path):
+        cmd = ['cat', file_path, '|', 'node', self.FILE_TO_STREAM, '>', stream_path]
+        cmd = ' '.join(cmd)
+        if self.debug:
+            print(cmd)
196 changes: 2 additions & 194 deletions js/DEVELOP.md
@@ -64,13 +64,11 @@ This argument configuration also applies to `clean` and `test` scripts.

* `npm run deploy`

Uses [learna](https://github.com/lerna/lerna) to publish each build target to npm with [conventional](https://conventionalcommits.org/) [changelogs](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/conventional-changelog-cli).
Uses [lerna](https://github.com/lerna/lerna) to publish each build target to npm with [conventional](https://conventionalcommits.org/) [changelogs](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/conventional-changelog-cli).

# Updating the Arrow format flatbuffers generated code

Once generated, the flatbuffers format code needs to be adjusted for our TS and JS build environments.

## TypeScript
Once generated, the flatbuffers format code needs to be adjusted for our build scripts.

1. Generate the flatbuffers TypeScript source from the Arrow project root directory:
```sh
@@ -101,193 +99,3 @@ Once generated, the flatbuffers format code needs to be adjusted for our TS and JS build environments.
```
1. Add `/* tslint:disable:class-name */` to the top of `Schema.ts`
1. Execute `npm run lint` to fix all the linting errors

## JavaScript (for Google Closure Compiler builds)

1. Generate the flatbuffers JS source from the Arrow project root directory
```sh
cd $ARROW_HOME

flatc --js --no-js-exports -o ./js/src/format ./format/*.fbs

cd ./js/src/format

# Delete Tensor_generated.js (skip this when we support Tensors)
rm Tensor_generated.js

# append an ES6 export to Schema_generated.js
echo "$(cat Schema_generated.js)
export { org };
" > Schema_generated.js

# import Schema's "org" namespace and
# append an ES6 export to File_generated.js
echo "import { org } from './Schema';
$(cat File_generated.js)
export { org };
" > File_generated.js

# import Schema's "org" namespace and
# append an ES6 export to Message_generated.js
echo "import { org } from './Schema';
$(cat Message_generated.js)
export { org };
" > Message_generated.js
```
1. Fixup the generated JS enums with the reverse value-to-key mappings to match TypeScript
`Message_generated.js`
```js
// Replace this
org.apache.arrow.flatbuf.MessageHeader = {
NONE: 0,
Schema: 1,
DictionaryBatch: 2,
RecordBatch: 3,
Tensor: 4
};
// With this
org.apache.arrow.flatbuf.MessageHeader = {
NONE: 0, 0: 'NONE',
Schema: 1, 1: 'Schema',
DictionaryBatch: 2, 2: 'DictionaryBatch',
RecordBatch: 3, 3: 'RecordBatch',
Tensor: 4, 4: 'Tensor'
};
```
`Schema_generated.js`
```js
/**
* @enum
*/
org.apache.arrow.flatbuf.MetadataVersion = {
/**
* 0.1.0
*/
V1: 0, 0: 'V1',

/**
* 0.2.0
*/
V2: 1, 1: 'V2',

/**
* 0.3.0 -> 0.7.1
*/
V3: 2, 2: 'V3',

/**
* >= 0.8.0
*/
V4: 3, 3: 'V4'
};

/**
* @enum
*/
org.apache.arrow.flatbuf.UnionMode = {
Sparse: 0, 0: 'Sparse',
Dense: 1, 1: 'Dense',
};

/**
* @enum
*/
org.apache.arrow.flatbuf.Precision = {
HALF: 0, 0: 'HALF',
SINGLE: 1, 1: 'SINGLE',
DOUBLE: 2, 2: 'DOUBLE',
};

/**
* @enum
*/
org.apache.arrow.flatbuf.DateUnit = {
DAY: 0, 0: 'DAY',
MILLISECOND: 1, 1: 'MILLISECOND',
};

/**
* @enum
*/
org.apache.arrow.flatbuf.TimeUnit = {
SECOND: 0, 0: 'SECOND',
MILLISECOND: 1, 1: 'MILLISECOND',
MICROSECOND: 2, 2: 'MICROSECOND',
NANOSECOND: 3, 3: 'NANOSECOND',
};

/**
* @enum
*/
org.apache.arrow.flatbuf.IntervalUnit = {
YEAR_MONTH: 0, 0: 'YEAR_MONTH',
DAY_TIME: 1, 1: 'DAY_TIME',
};

/**
* ----------------------------------------------------------------------
* Top-level Type value, enabling extensible type-specific metadata. We can
* add new logical types to Type without breaking backwards compatibility
*
* @enum
*/
org.apache.arrow.flatbuf.Type = {
NONE: 0, 0: 'NONE',
Null: 1, 1: 'Null',
Int: 2, 2: 'Int',
FloatingPoint: 3, 3: 'FloatingPoint',
Binary: 4, 4: 'Binary',
Utf8: 5, 5: 'Utf8',
Bool: 6, 6: 'Bool',
Decimal: 7, 7: 'Decimal',
Date: 8, 8: 'Date',
Time: 9, 9: 'Time',
Timestamp: 10, 10: 'Timestamp',
Interval: 11, 11: 'Interval',
List: 12, 12: 'List',
Struct_: 13, 13: 'Struct_',
Union: 14, 14: 'Union',
FixedSizeBinary: 15, 15: 'FixedSizeBinary',
FixedSizeList: 16, 16: 'FixedSizeList',
Map: 17, 17: 'Map'
};

/**
* ----------------------------------------------------------------------
* The possible types of a vector
*
* @enum
*/
org.apache.arrow.flatbuf.VectorType = {
/**
* used in List type, Dense Union and variable length primitive types (String, Binary)
*/
OFFSET: 0, 0: 'OFFSET',

/**
* actual data, either fixed width primitive types in slots or variable width delimited by an OFFSET vector
*/
DATA: 1, 1: 'DATA',

/**
* Bit vector indicating if each value is null
*/
VALIDITY: 2, 2: 'VALIDITY',

/**
* Type vector used in Union type
*/
TYPE: 3, 3: 'TYPE'
};

/**
* ----------------------------------------------------------------------
* Endianness of the platform producing the data
*
* @enum
*/
org.apache.arrow.flatbuf.Endianness = {
Little: 0, 0: 'Little',
Big: 1, 1: 'Big',
};
```
37 changes: 37 additions & 0 deletions js/bin/file-to-stream.js
@@ -0,0 +1,37 @@
#! /usr/bin/env node

// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

const fs = require('fs');
const path = require('path');

const encoding = 'binary';
const ext = process.env.ARROW_JS_DEBUG === 'src' ? '.ts' : '';
const { util: { PipeIterator } } = require(`../index${ext}`);
const { Table, serializeStream, fromReadableStream } = require(`../index${ext}`);

(async () => {
    // Todo (ptaylor): implement `serializeStreamAsync` that accepts an
    // AsyncIterable<Buffer>, rather than aggregating into a Table first
    const in_ = process.argv.length < 3
        ? process.stdin : fs.createReadStream(path.resolve(process.argv[2]));
    const out = process.argv.length < 4
        ? process.stdout : fs.createWriteStream(path.resolve(process.argv[3]));
    new PipeIterator(serializeStream(await Table.fromAsync(fromReadableStream(in_))), encoding).pipe(out);

})().catch((e) => { console.error(e); process.exit(1); });
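As wired up in `integration_test.py` above, this script is intended to run in a shell pipeline along the lines of `cat file.arrow | node js/bin/file-to-stream.js > stream.arrow` (file names illustrative); the input and output can also be given as the two positional arguments instead of stdin/stdout.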