ARROW-5563: [Format] Update integration test JSON format documentation
This fills in details about all types (AFAIK) and adds a couple of examples.

Closes #6377 from nealrichardson/integration-test-docs and squashes the following commits:

6708223 <Neal Richardson> Address comments from Wes
aba596a <Neal Richardson> Apply suggestions from code review
f3514cc <Neal Richardson> Fill in rest of types, esp. nested
2f28e6f <Neal Richardson> Some progress
9152a91 <Neal Richardson> Start moving/adding to integration test docs

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
nealrichardson authored and wesm committed Mar 6, 2020
1 parent 24cfd5f commit 81176c2
Showing 3 changed files with 258 additions and 122 deletions.
303 changes: 253 additions & 50 deletions docs/source/format/Integration.rst
Integration Testing
===================

Our strategy for integration testing between Arrow implementations is:

* Test datasets are specified in a custom human-readable, JSON-based format
  designed exclusively for Arrow's integration tests
* Each implementation provides a testing executable capable of converting
  between the JSON and the binary Arrow file representation
* The test executable is also capable of validating the contents of a binary
  file against a corresponding JSON file

Running integration tests
-------------------------

The integration test data generator and runner use ``archery``, a Python
utility that requires Python 3.6 or higher. You can create a standalone Python
distribution and environment for running the tests by using
`miniconda <https://conda.io/miniconda.html>`_. On Linux this is:

.. code-block:: shell

   MINICONDA_URL=https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
   wget -O miniconda.sh $MINICONDA_URL
   bash miniconda.sh -b -p miniconda
   export PATH=`pwd`/miniconda/bin:$PATH
   conda create -n arrow-integration python=3.6 nomkl numpy six
   conda activate arrow-integration

If you are on macOS, instead use the URL:

.. code-block:: shell

   MINICONDA_URL=https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

Once you have Python, you can install archery:

.. code-block:: shell

   pip install -e dev/archery

The integration tests are run using the ``archery integration`` command.

.. code-block:: shell

   archery integration --help

In order to run integration tests, you'll first need to build each component
you want to include. See the respective developer docs for C++, Java, etc.
for instructions on building those.

Some languages may require additional build options to enable integration
testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON``
to your cmake command.

Depending on which components you have built, you can enable and add them to
the archery test run. For example, if you only have the C++ project built, run:

.. code-block:: shell

   archery integration --with-cpp=1

For Java, it may look like:

.. code-block:: shell

   VERSION=0.11.0-SNAPSHOT
   export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
   archery integration --with-cpp=1 --with-java=1

To run all tests, including Flight integration tests, do:

.. code-block:: shell

   archery integration --with-all --run-flight

Note that we run these tests in continuous integration, and the CI job uses
docker-compose. You may also run the docker-compose job locally, or at least
refer to it if you have questions about how to build other languages or enable
certain tests.

See :ref:`integration` for more information about the project's
``docker-compose`` configuration.

JSON test data format
---------------------

A JSON representation of Arrow columnar data is provided for
cross-language integration testing purposes.
This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_
but it provides a human-readable way of verifying language implementations.

See `here <https://github.com/apache/arrow/tree/master/integration/data>`_
for some examples of this JSON data.

.. can we check in more examples, e.g. from the generated_*.json test files?
The high level structure of a JSON integration test file is as follows: ::

    {
      "schema": /*Schema*/,
      "batches": [ /*RecordBatch*/ ],
      "dictionaries": [ /*DictionaryBatch*/ ],
    }

All files contain ``schema`` and ``batches``, while ``dictionaries`` is only
present if there are dictionary type fields in the schema.
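
The optional ``dictionaries`` member can be read with a default. A minimal
sketch in Python, using a hypothetical inline document:

```python
import json

# Read an integration-test JSON document and inspect its top-level members.
# The inline document here is a made-up example, not a real test file.
doc = json.loads("""
{
  "schema": {"fields": []},
  "batches": []
}
""")

# "schema" and "batches" are always present.
assert "schema" in doc and "batches" in doc

# "dictionaries" is only present when the schema has dictionary-encoded
# fields, so fall back to an empty list when it is absent.
dictionaries = doc.get("dictionaries", [])
assert dictionaries == []
```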

**Schema** ::

    {
      "fields" : [
        /* Field */
      ]
    }

**Field**: ::

    {
      "name" : "name_of_the_field",
      "nullable" : /* boolean */,
      "type" : /* Type */,
      "children" : [ /* Field */ ]
    }

If the Field corresponds to a dictionary type, the "type" attribute
describes the dictionary values, and the Field includes an additional
"dictionary" member, whose "id" maps onto a column in the
``DictionaryBatch``: ::

    "dictionary": {
      "id": /* integer */,
      "indexType": /* Type */,
      "isOrdered": /* boolean */
    }
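
Putting these pieces together, a dictionary-encoded Field might look like the
following sketch (the field name, index width, and values are hypothetical):

```python
import json

# Hypothetical dictionary-encoded Field: "type" describes the dictionary
# *values* (utf8 strings here), while "dictionary" carries the id that maps
# onto a DictionaryBatch, plus the integer index type used in the columns.
field = json.loads("""
{
  "name": "color",
  "nullable": true,
  "type": {"name": "utf8"},
  "children": [],
  "dictionary": {
    "id": 0,
    "indexType": {"name": "int", "isSigned": true, "bitWidth": 8},
    "isOrdered": false
  }
}
""")

assert field["type"]["name"] == "utf8"
assert field["dictionary"]["indexType"]["bitWidth"] == 8
```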

For primitive types, "children" is an empty array.
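
For instance, a primitive Field could be written as follows (a hypothetical
``int32`` field, sketched in Python):

```python
import json

# A primitive (non-nested) Field: "children" is present but empty.
field = json.loads("""
{
  "name": "f0",
  "nullable": false,
  "type": {"name": "int", "isSigned": true, "bitWidth": 32},
  "children": []
}
""")

assert field["children"] == []
assert field["type"]["bitWidth"] == 32
```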

**Type**: ::

    {
      "name" : "null|struct|list|largelist|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map"
    }

A ``Type`` will have other fields as defined in
`Schema.fbs <https://github.com/apache/arrow/tree/master/format/Schema.fbs>`_
depending on its name.

Int: ::

    {
      "name" : "int",
      "bitWidth" : /* integer */,
      "isSigned" : /* boolean */
    }

Timestamp: ::

    {
      "name" : "timestamp",
      "unit" : "$TIME_UNIT",
      "timezone": "$timezone"
    }

``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"``

"timezone" is an optional string.

Duration: ::

    {
      "name" : "duration",
      "unit" : "$TIME_UNIT"
    }

Interval: ::

    {
      "name" : "interval",
      "unit" : "YEAR_MONTH|DAY_TIME"
    }

Union: ::

    {
      "name" : "union",
      "mode" : "Sparse|Dense",
      "typeIds" : [ /* integer */ ]
    }

The ``typeIds`` field in ``Union`` gives the codes used to denote which member
of the union is active in each array slot. Note that, in general, these
discriminants are not identical to the indices of the corresponding child
arrays.
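
Because the discriminants need not start at 0 or match child positions, a
reader builds a lookup from the declared ``typeIds`` to child indices. A
small sketch with hypothetical values:

```python
# "typeIds" as declared in the union Type: code 5 means child 0, code 7
# means child 1.
type_ids = [5, 7]

# TYPE buffer from the union's FieldData: one discriminant per slot.
type_buffer = [5, 7, 7, 5]

# Map each declared code to the position of its child array, then resolve
# every slot's discriminant to a child index.
id_to_child = {tid: child for child, tid in enumerate(type_ids)}
child_indices = [id_to_child[code] for code in type_buffer]

assert child_indices == [0, 1, 1, 0]
```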

List: ::

    {
      "name": "list"
    }

The type that the list is a "list of" appears as a single ``Field`` in the
list ``Field``'s "children" member. For example, for a list of
``int32``: ::

    {
      "name": "list_nullable",
      "type": {
        "name": "list"
      },
      "nullable": true,
      "children": [
        {
          "name": "item",
          "type": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 32
          },
          "nullable": true,
          "children": []
        }
      ]
    }

FixedSizeList: ::

    {
      "name": "fixedsizelist",
      "listSize": /* integer */
    }

This type likewise comes with a length-1 "children" array.

Struct: ::

    {
      "name": "struct"
    }

The ``Field``'s "children" contains an array of ``Fields`` with meaningful
names and types.

Map: ::

    {
      "name": "map",
      "keysSorted": /* boolean */
    }

The ``Field``'s "children" contains a single ``struct`` field, which itself
contains 2 children, named "key" and "value".

Null: ::

    {
      "name": "null"
    }

**RecordBatch**::

    {
      "count": /* integer number of rows */,
      "columns": [ /* FieldData */ ]
    }

**DictionaryBatch**::

    {
      "id": /* integer */,
      "data": [ /* RecordBatch */ ]
    }

**FieldData**::

    {
      "name": "field_name",
      "count": "field_length",
      "$BUFFER_TYPE": /* BufferData */
      ...
      "$BUFFER_TYPE": /* BufferData */
      "children": [ /* FieldData */ ]
    }

The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name"
of a ``FieldData`` contained in the "columns" of a ``RecordBatch``.
For nested types (list, struct, etc.), ``Field``'s "children" each have a
"name" that corresponds to the "name" of a ``FieldData`` inside the
"children" of that ``FieldData``.
For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not
correspond to anything.
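
The name correspondence can be sketched as a small pairing step (the schema
and batch below are hypothetical, trimmed to the relevant keys):

```python
# Schema Fields and RecordBatch columns correspond by "name".
schema = {"fields": [{"name": "a"}, {"name": "b"}]}
batch = {"count": 2, "columns": [{"name": "a", "count": 2},
                                 {"name": "b", "count": 2}]}

# Index the columns by name, then pair each schema Field with its FieldData.
columns_by_name = {col["name"]: col for col in batch["columns"]}
paired = [(f["name"], columns_by_name[f["name"]]) for f in schema["fields"]]

# Each top-level column's count matches the batch's row count.
assert all(col["count"] == batch["count"] for _, col in paired)
assert [name for name, _ in paired] == ["a", "b"]
```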

Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for
variable-length types, such as strings and lists), ``TYPE`` (for unions),
or ``DATA``.

``BufferData`` is encoded based on the type of buffer:

* ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for a
  non-nullable ``Field`` still has a ``VALIDITY`` array, even though all
  values are 1.
* ``OFFSET``: a JSON array of integers for 32-bit offsets or
  string-formatted integers for 64-bit offsets
* ``TYPE``: a JSON array of integers
* ``DATA``: a JSON array of encoded values

The value encoding for ``DATA`` is different depending on the logical
type:

* For boolean type: an array of 1 (true) and 0 (false)
* For integer-based types (including timestamps): an array of integers
* For 64-bit integers: an array of integers formatted as JSON strings
  to avoid loss of precision
* For floating point types: as is. Values are limited to 3 decimal places
  to avoid loss of precision
* For binary types: a hex string encoding each variable- or fixed-size
  binary value
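
These ``DATA`` encodings can be sketched as a small helper (a hypothetical
function for illustration, not part of archery):

```python
def encode_value(value, logical_type):
    """Sketch of the per-type DATA value encodings described above."""
    if logical_type == "bool":
        # Booleans become 1 (true) and 0 (false).
        return 1 if value else 0
    if logical_type == "int32":
        # 32-bit integers are written as plain JSON numbers.
        return value
    if logical_type == "int64":
        # 64-bit integers are written as strings to avoid loss of precision.
        return str(value)
    if logical_type == "binary":
        # Binary values are written as hex strings.
        return value.hex().upper()
    raise NotImplementedError(logical_type)

assert encode_value(True, "bool") == 1
assert encode_value(123, "int32") == 123
assert encode_value(2**62, "int64") == "4611686018427387904"
assert encode_value(b"\x0f\xa0", "binary") == "0FA0"
```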

For "list" type, ``BufferData`` has ``VALIDITY`` and ``OFFSET``, and the
rest of the data is inside "children". These child ``FieldData`` contain all
of the same attributes as non-child data, so in the example of a list of
``int32``, the child data has ``VALIDITY`` and ``DATA``.
For "fixedsizelist", there is no ``OFFSET`` member because the offsets are
implied by the field's "listSize".
Note that the "count" for these child data may not match the parent "count".
For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList``
of ``listSize`` 4, then the data inside the "children" of that ``FieldData``
will have count 28.
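
The child-count arithmetic in that example is simply:

```python
# For a FixedSizeList column, child count = parent count * listSize.
batch_rows = 7   # "count" of the RecordBatch (and the parent FieldData)
list_size = 4    # "listSize" from the fixedsizelist Type

child_count = batch_rows * list_size
assert child_count == 28
```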

For "null" type, ``BufferData`` does not contain any buffers.
15 changes: 4 additions & 11 deletions docs/source/index.rst

.. _toc.columnar:

.. toctree::
   :maxdepth: 2
   :caption: Arrow Specifications and Protocols

.. toctree::
   :maxdepth: 2
   :caption: Arrow Libraries

   cpp/index
   python/index
   Java <https://arrow.apache.org/docs/java/>
   C GLib <https://arrow.apache.org/docs/c_glib/>
   JavaScript <https://arrow.apache.org/docs/js/>
   R <https://arrow.apache.org/docs/r/>

.. _toc.development:

