ARROW-5563: [Format] Update integration test JSON format documentation
This fills in details about all types (AFAIK) and adds a couple of examples.

Closes #6377 from nealrichardson/integration-test-docs and squashes the following commits:

6708223 <Neal Richardson> Address comments from Wes
aba596a <Neal Richardson> Apply suggestions from code review
f3514cc <Neal Richardson> Fill in rest of types, esp. nested
2f28e6f <Neal Richardson> Some progress
9152a91 <Neal Richardson> Start moving/adding to integration test docs

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
nealrichardson authored and wesm committed Mar 6, 2020
1 parent 24cfd5f commit 81176c2
Showing 3 changed files with 258 additions and 122 deletions.
303 changes: 253 additions & 50 deletions docs/source/format/Integration.rst
Integration Testing
===================

Our strategy for integration testing between Arrow implementations is:

* Test datasets are specified in a custom human-readable, JSON-based format
  designed exclusively for Arrow's integration tests
* Each implementation provides a testing executable capable of converting
  between the JSON and the binary Arrow file representation
* The test executable is also capable of validating the contents of a binary
  file against a corresponding JSON file

Running integration tests
-------------------------

The integration test data generator and runner use ``archery``, a Python
utility that requires Python 3.6 or higher. You can create a standalone Python
distribution and environment for running the tests by using
`miniconda <https://conda.io/miniconda.html>`_. On Linux this is:

.. code-block:: shell

   MINICONDA_URL=https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
   wget -O miniconda.sh $MINICONDA_URL
   bash miniconda.sh -b -p miniconda
   export PATH=`pwd`/miniconda/bin:$PATH
   conda create -n arrow-integration python=3.6 nomkl numpy six
   conda activate arrow-integration

If you are on macOS, instead use the URL:

.. code-block:: shell

   MINICONDA_URL=https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

Once you have Python, you can install archery:

.. code-block:: shell

   pip install -e dev/archery

The integration tests are run using the ``archery integration`` command.

.. code-block:: shell

   archery integration --help

In order to run integration tests, you'll first need to build each component
you want to include. See the respective developer docs for C++, Java, etc.
for instructions on building those.

Some languages may require additional build options to enable integration
testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON``
to your cmake command.

Depending on which components you have built, you can enable and add them to
the archery test run. For example, if you only have the C++ project built, run:

.. code-block:: shell

   archery integration --with-cpp=1

For Java, it may look like:

.. code-block:: shell

   VERSION=0.11.0-SNAPSHOT
   export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
   archery integration --with-cpp=1 --with-java=1

To run all tests, including Flight integration tests, do:

.. code-block:: shell

   archery integration --with-all --run-flight

Note that we run these tests in continuous integration, and the CI job uses
docker-compose. You may also run the docker-compose job locally, or at least
refer to it if you have questions about how to build other languages or enable
certain tests.

See :ref:`integration` for more information about the project's
``docker-compose`` configuration.

JSON test data format
---------------------

A JSON representation of Arrow columnar data is provided for
cross-language integration testing purposes.
This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_
but it provides a human-readable way of verifying language implementations.

See `here <https://github.com/apache/arrow/tree/master/integration/data>`_
for some examples of this JSON data.

.. can we check in more examples, e.g. from the generated_*.json test files?
The high level structure of a JSON integration test file is as follows: ::

    {
      "schema": /*Schema*/,
      "batches": [ /*RecordBatch*/ ],
      "dictionaries": [ /*DictionaryBatch*/ ],
    }

All files contain ``schema`` and ``batches``, while ``dictionaries`` is only
present if there are dictionary type fields in the schema.
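
The optional ``dictionaries`` member can be read with a default. A minimal
sketch in Python, using a hypothetical inline document:

```python
import json

# Read an integration-test JSON document and inspect its top-level members.
# The inline document here is a made-up example, not a real test file.
doc = json.loads("""
{
  "schema": {"fields": []},
  "batches": []
}
""")

# "schema" and "batches" are always present.
assert "schema" in doc and "batches" in doc

# "dictionaries" is only present when the schema has dictionary-encoded
# fields, so fall back to an empty list when it is absent.
dictionaries = doc.get("dictionaries", [])
assert dictionaries == []
```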

**Schema** ::

    {
      "fields" : [
        /* Field */
      ]
    }

**Field**: ::

    {
      "name" : "name_of_the_field",
      "nullable" : /* boolean */,
      "type" : /* Type */,
      "children" : [ /* Field */ ]
    }

If the Field corresponds to a dictionary type, the "type" attribute
describes the dictionary values, and the Field includes an additional
"dictionary" member, whose "id" maps onto a column in the
``DictionaryBatch``: ::

    "dictionary": {
      "id": /* integer */,
      "indexType": /* Type */,
      "isOrdered": /* boolean */
    }
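
Putting these pieces together, a dictionary-encoded Field might look like the
following sketch (the field name, index width, and values are hypothetical):

```python
import json

# Hypothetical dictionary-encoded Field: "type" describes the dictionary
# *values* (utf8 strings here), while "dictionary" carries the id that maps
# onto a DictionaryBatch, plus the integer index type used in the columns.
field = json.loads("""
{
  "name": "color",
  "nullable": true,
  "type": {"name": "utf8"},
  "children": [],
  "dictionary": {
    "id": 0,
    "indexType": {"name": "int", "isSigned": true, "bitWidth": 8},
    "isOrdered": false
  }
}
""")

assert field["type"]["name"] == "utf8"
assert field["dictionary"]["indexType"]["bitWidth"] == 8
```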

For primitive types, "children" is an empty array.
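
For instance, a primitive Field could be written as follows (a hypothetical
``int32`` field, sketched in Python):

```python
import json

# A primitive (non-nested) Field: "children" is present but empty.
field = json.loads("""
{
  "name": "f0",
  "nullable": false,
  "type": {"name": "int", "isSigned": true, "bitWidth": 32},
  "children": []
}
""")

assert field["children"] == []
assert field["type"]["bitWidth"] == 32
```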

**Type**: ::

    {
      "name" : "null|struct|list|largelist|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map"
    }

A ``Type`` will have other fields as defined in
`Schema.fbs <https://github.com/apache/arrow/tree/master/format/Schema.fbs>`_
depending on its name.

Int: ::

    {
      "name" : "int",
      "bitWidth" : /* integer */,
      "isSigned" : /* boolean */
    }

Timestamp: ::

    {
      "name" : "timestamp",
      "unit" : "$TIME_UNIT",
      "timezone": "$timezone"
    }

``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"``

"timezone" is an optional string.

Duration: ::

    {
      "name" : "duration",
      "unit" : "$TIME_UNIT"
    }

Interval: ::

    {
      "name" : "interval",
      "unit" : "YEAR_MONTH|DAY_TIME"
    }

Union: ::

    {
      "name" : "union",
      "mode" : "Sparse|Dense",
      "typeIds" : [ /* integer */ ]
    }

The ``typeIds`` field in ``Union`` gives the codes used to denote which member
of the union is active in each array slot. Note that, in general, these
discriminants are not identical to the indices of the corresponding child
arrays.
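
Because the discriminants need not start at 0 or match child positions, a
reader builds a lookup from the declared ``typeIds`` to child indices. A
small sketch with hypothetical values:

```python
# "typeIds" as declared in the union Type: code 5 means child 0, code 7
# means child 1.
type_ids = [5, 7]

# TYPE buffer from the union's FieldData: one discriminant per slot.
type_buffer = [5, 7, 7, 5]

# Map each declared code to the position of its child array, then resolve
# every slot's discriminant to a child index.
id_to_child = {tid: child for child, tid in enumerate(type_ids)}
child_indices = [id_to_child[code] for code in type_buffer]

assert child_indices == [0, 1, 1, 0]
```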

List: ::

    {
      "name": "list"
    }

The type that the list is a "list of" appears as a single ``Field`` in the
list ``Field``'s "children" member. For example, for a list of
``int32``: ::

    {
      "name": "list_nullable",
      "type": {
        "name": "list"
      },
      "nullable": true,
      "children": [
        {
          "name": "item",
          "type": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 32
          },
          "nullable": true,
          "children": []
        }
      ]
    }

FixedSizeList: ::

    {
      "name": "fixedsizelist",
      "listSize": /* integer */
    }

This type likewise comes with a length-1 "children" array.

Struct: ::

    {
      "name": "struct"
    }

The ``Field``'s "children" contains an array of ``Fields`` with meaningful
names and types.

Map: ::

    {
      "name": "map",
      "keysSorted": /* boolean */
    }

The ``Field``'s "children" contains a single ``struct`` field, which itself
contains 2 children, named "key" and "value".

Null: ::

    {
      "name": "null"
    }

**RecordBatch**::

    {
      "count": /* integer number of rows */,
      "columns": [ /* FieldData */ ]
    }

**DictionaryBatch**::

    {
      "id": /* integer */,
      "data": [ /* RecordBatch */ ]
    }

**FieldData**::

    {
      "name": "field_name",
      "count": "field_length",
      "$BUFFER_TYPE": /* BufferData */
      ...
      "$BUFFER_TYPE": /* BufferData */
      "children": [ /* FieldData */ ]
    }

The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name"
of a ``FieldData`` contained in the "columns" of a ``RecordBatch``.
For nested types (list, struct, etc.), ``Field``'s "children" each have a
"name" that corresponds to the "name" of a ``FieldData`` inside the
"children" of that ``FieldData``.
For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not
correspond to anything.
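
The name correspondence can be sketched as a small pairing step (the schema
and batch below are hypothetical, trimmed to the relevant keys):

```python
# Schema Fields and RecordBatch columns correspond by "name".
schema = {"fields": [{"name": "a"}, {"name": "b"}]}
batch = {"count": 2, "columns": [{"name": "a", "count": 2},
                                 {"name": "b", "count": 2}]}

# Index the columns by name, then pair each schema Field with its FieldData.
columns_by_name = {col["name"]: col for col in batch["columns"]}
paired = [(f["name"], columns_by_name[f["name"]]) for f in schema["fields"]]

# Each top-level column's count matches the batch's row count.
assert all(col["count"] == batch["count"] for _, col in paired)
assert [name for name, _ in paired] == ["a", "b"]
```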

Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for
variable-length types, such as strings and lists), ``TYPE`` (for unions),
or ``DATA``.

``BufferData`` is encoded based on the type of buffer:

* ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for a
  non-nullable ``Field`` still has a ``VALIDITY`` array, even though all
  values are 1.
* ``OFFSET``: a JSON array of integers for 32-bit offsets or
  string-formatted integers for 64-bit offsets
* ``TYPE``: a JSON array of integers
* ``DATA``: a JSON array of encoded values

The value encoding for ``DATA`` is different depending on the logical
type:

* For boolean type: an array of 1 (true) and 0 (false)
* For integer-based types (including timestamps): an array of integers
* For 64-bit integers: an array of integers formatted as JSON strings
  to avoid loss of precision
* For floating point types: as is. Values are limited to 3 decimal places
  to avoid loss of precision
* For binary types: a hex string encoding each variable- or fixed-size
  binary value
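
These ``DATA`` encodings can be sketched as a small helper (a hypothetical
function for illustration, not part of archery):

```python
def encode_value(value, logical_type):
    """Sketch of the per-type DATA value encodings described above."""
    if logical_type == "bool":
        # Booleans become 1 (true) and 0 (false).
        return 1 if value else 0
    if logical_type == "int32":
        # 32-bit integers are written as plain JSON numbers.
        return value
    if logical_type == "int64":
        # 64-bit integers are written as strings to avoid loss of precision.
        return str(value)
    if logical_type == "binary":
        # Binary values are written as hex strings.
        return value.hex().upper()
    raise NotImplementedError(logical_type)

assert encode_value(True, "bool") == 1
assert encode_value(123, "int32") == 123
assert encode_value(2**62, "int64") == "4611686018427387904"
assert encode_value(b"\x0f\xa0", "binary") == "0FA0"
```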

For "list" type, ``BufferData`` has ``VALIDITY`` and ``OFFSET``, and the
rest of the data is inside "children". These child ``FieldData`` contain all
of the same attributes as non-child data, so in the example of a list of
``int32``, the child data has ``VALIDITY`` and ``DATA``.
For "fixedsizelist", there is no ``OFFSET`` member because the offsets are
implied by the field's "listSize".
Note that the "count" for these child data may not match the parent "count".
For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList``
of ``listSize`` 4, then the data inside the "children" of that ``FieldData``
will have count 28.
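
The child-count arithmetic in that example is simply:

```python
# For a FixedSizeList column, child count = parent count * listSize.
batch_rows = 7   # "count" of the RecordBatch (and the parent FieldData)
list_size = 4    # "listSize" from the fixedsizelist Type

child_count = batch_rows * list_size
assert child_count == 28
```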

For "null" type, ``BufferData`` does not contain any buffers.
15 changes: 4 additions & 11 deletions docs/source/index.rst

.. _toc.columnar:

.. toctree::
   :maxdepth: 2
   :caption: Arrow Specifications and Protocols

.. toctree::
   :maxdepth: 2
   :caption: Arrow Libraries

   cpp/index
   python/index
   Java <https://arrow.apache.org/docs/java/>
   C GLib <https://arrow.apache.org/docs/c_glib/>
   JavaScript <https://arrow.apache.org/docs/js/>
   R <https://arrow.apache.org/docs/r/>

.. _toc.development:

