GH-41611: [Docs][CI] Enable most sphinx-lint rules for documentation #41612

Merged 5 commits on May 16, 2024
10 changes: 8 additions & 2 deletions .pre-commit-config.yaml
@@ -136,5 +136,11 @@ repos:
rev: v0.9.1
hooks:
- id: sphinx-lint
files: ^docs/
args: ['--disable', 'all', '--enable', 'trailing-whitespace,missing-final-newline', 'docs']
files: ^docs/source
exclude: ^docs/source/python/generated
args: [
'--enable',
'all',
'--disable',
'dangling-hyphen,line-too-long',
]
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -535,7 +535,7 @@
#
# latex_appendices = []

# It false, will not define \strong, \code, itleref, \crossref ... but only
# It false, will not define \strong, \code, \titleref, \crossref ... but only
# \sphinxstrong, ..., \sphinxtitleref, ... To help avoid clash with user added
# packages.
#
10 changes: 5 additions & 5 deletions docs/source/cpp/acero/developer_guide.rst
@@ -327,8 +327,8 @@ An engine could choose to create a thread task for every execution of a node. However,
this leads to problems with cache locality. For example, let's assume we have a basic plan consisting of three
exec nodes, scan, project, and then filter (this is a very common use case). Now let's assume there are 100 batches.
In a task-per-operator model we would have tasks like "Scan Batch 5", "Project Batch 5", and "Filter Batch 5". Each
of those tasks is potentially going to access the same data. For example, maybe the `project` and `filter` nodes need
to read the same column. A column which is intially created in a decode phase of the `scan` node. To maximize cache
of those tasks is potentially going to access the same data. For example, maybe the ``project`` and ``filter`` nodes need
to read the same column. A column which is intially created in a decode phase of the ``scan`` node. To maximize cache
utilization we would need to carefully schedule our tasks to ensure that all three of those tasks are run consecutively
and assigned to the same CPU core.

@@ -412,7 +412,7 @@ Ordered Execution
=================

Some nodes either establish an ordering to their outgoing batches or they need to be able to process batches in order.
Acero handles ordering using the `batch_index` property on an ExecBatch. If a node has a deterministic output order
Acero handles ordering using the ``batch_index`` property on an ExecBatch. If a node has a deterministic output order
then it should apply a batch index on batches that it emits. For example, the OrderByNode applies a new ordering to
batches (regardless of the incoming ordering). The scan node is able to attach an implicit ordering to batches which
reflects the order of the rows in the files being scanned.
@@ -461,8 +461,8 @@ Acero's tracing is currently half-implemented and there are major gaps in profil
effort at tracing with open telemetry and most of the necessary pieces are in place. The main thing currently lacking is
some kind of effective visualization of the tracing results.

In order to use the tracing that is present today you will need to build with Arrow with `ARROW_WITH_OPENTELEMETRY=ON`.
Then you will need to set the environment variable `ARROW_TRACING_BACKEND=otlp_http`. This will configure open telemetry
In order to use the tracing that is present today you will need to build with Arrow with ``ARROW_WITH_OPENTELEMETRY=ON``.
Then you will need to set the environment variable ``ARROW_TRACING_BACKEND=otlp_http``. This will configure open telemetry
to export trace results (as OTLP) to the HTTP endpoint http://localhost:4318/v1/traces. You will need to configure an
open telemetry collector to collect results on that endpoint and you will need to configure a trace viewer of some kind
such as Jaeger: https://www.jaegertracing.io/docs/1.21/opentelemetry/
26 changes: 13 additions & 13 deletions docs/source/cpp/acero/overview.rst
@@ -209,16 +209,16 @@ must have the same length. There are a few key differences from ExecBatch:

Both the record batch and the exec batch have strong ownership of the arrays & buffers

* An `ExecBatch` does not have a schema. This is because an `ExecBatch` is assumed to be
* An ``ExecBatch`` does not have a schema. This is because an ``ExecBatch`` is assumed to be
part of a stream of batches and the stream is assumed to have a consistent schema. So
the schema for an `ExecBatch` is typically stored in the ExecNode.
* Columns in an `ExecBatch` are either an `Array` or a `Scalar`. When a column is a `Scalar`
this means that the column has a single value for every row in the batch. An `ExecBatch`
the schema for an ``ExecBatch`` is typically stored in the ExecNode.
* Columns in an ``ExecBatch`` are either an ``Array`` or a ``Scalar``. When a column is a ``Scalar``
this means that the column has a single value for every row in the batch. An ``ExecBatch``
also has a length property which describes how many rows are in a batch. So another way to
view a `Scalar` is a constant array with `length` elements.
* An `ExecBatch` contains additional information used by the exec plan. For example, an
`index` can be used to describe a batch's position in an ordered stream. We expect
that `ExecBatch` will also evolve to contain additional fields such as a selection vector.
view a ``Scalar`` is a constant array with ``length`` elements.
* An ``ExecBatch`` contains additional information used by the exec plan. For example, an
``index`` can be used to describe a batch's position in an ordered stream. We expect
that ``ExecBatch`` will also evolve to contain additional fields such as a selection vector.

.. figure:: scalar_vs_array.svg

@@ -231,8 +231,8 @@ only zero copy if there are no scalars in the exec batch.

.. note::
Both Acero and the compute module have "lightweight" versions of batches and arrays.
In the compute module these are called `BatchSpan`, `ArraySpan`, and `BufferSpan`. In
Acero the concept is called `KeyColumnArray`. These types were developed concurrently
In the compute module these are called ``BatchSpan``, ``ArraySpan``, and ``BufferSpan``. In
Acero the concept is called ``KeyColumnArray``. These types were developed concurrently
and serve the same purpose. They aim to provide an array container that can be completely
stack allocated (provided the data type is non-nested) in order to avoid heap allocation
overhead. Ideally these two concepts will be merged someday.
@@ -247,9 +247,9 @@ execution of the nodes. Both ExecPlan and ExecNode are tied to the lifecycle of
They have state and are not expected to be restartable.

.. warning::
The structures within Acero, including `ExecBatch`, are still experimental. The `ExecBatch`
class should not be used outside of Acero. Instead, an `ExecBatch` should be converted to
a more standard structure such as a `RecordBatch`.
The structures within Acero, including ``ExecBatch``, are still experimental. The ``ExecBatch``
class should not be used outside of Acero. Instead, an ``ExecBatch`` should be converted to
a more standard structure such as a ``RecordBatch``.
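
For context, a minimal sketch of the conversion the warning above recommends — building an ``ExecBatch`` by hand and turning it into a ``RecordBatch``. It assumes ``ExecBatch`` is available from ``arrow/compute/exec.h`` and exposes a ``ToRecordBatch(schema)`` method; treat it as illustration rather than a canonical recipe.

.. code-block:: cpp

   #include <iostream>
   #include <arrow/api.h>
   #include <arrow/compute/exec.h>  // arrow::compute::ExecBatch (assumed location)

   arrow::Status RunExample() {
     // One ordinary array column ...
     arrow::Int64Builder builder;
     ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3}));
     ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> ids, builder.Finish());

     // ... and one "constant" column represented as a Scalar.
     std::shared_ptr<arrow::Scalar> label = arrow::MakeScalar(std::string("x"));

     // An ExecBatch is just values plus a row count; it carries no schema.
     arrow::compute::ExecBatch batch({ids, label}, /*length=*/3);

     // Converting to a RecordBatch requires supplying the schema explicitly;
     // scalar columns are broadcast to full-length arrays.
     auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                                  arrow::field("label", arrow::utf8())});
     ARROW_ASSIGN_OR_RAISE(auto record_batch, batch.ToRecordBatch(schema));
     std::cout << record_batch->ToString() << std::endl;
     return arrow::Status::OK();
   }

   int main() {
     arrow::Status st = RunExample();
     if (!st.ok()) {
       std::cerr << st.ToString() << std::endl;
       return 1;
     }
     return 0;
   }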

Similarly, an ExecPlan is an internal concept. Users creating plans should be using Declaration
objects. APIs for consuming and executing plans should abstract away the details of the underlying
8 changes: 4 additions & 4 deletions docs/source/cpp/acero/user_guide.rst
@@ -455,8 +455,8 @@ can be selected from :ref:`this list of aggregation functions
will be added which should alleviate this constraint.

The aggregation can provide results as a group or scalar. For instances,
an operation like `hash_count` provides the counts per each unique record
as a grouped result while an operation like `sum` provides a single record.
an operation like ``hash_count`` provides the counts per each unique record
as a grouped result while an operation like ``sum`` provides a single record.

Scalar Aggregation example:

@@ -490,7 +490,7 @@ caller will repeatedly call this function until the generator function is exhaus
will accumulate in memory. An execution plan should only have one
"terminal" node (one sink node). An :class:`ExecPlan` can terminate early due to cancellation or
an error, before the output is fully consumed. However, the plan can be safely destroyed independently
of the sink, which will hold the unconsumed batches by `exec_plan->finished()`.
of the sink, which will hold the unconsumed batches by ``exec_plan->finished()``.

As a part of the Source Example, the Sink operation is also included;

@@ -515,7 +515,7 @@ The consuming function may be called before a previous invocation has completed.
function does not run quickly enough then many concurrent executions could pile up, blocking the
CPU thread pool. The execution plan will not be marked finished until all consuming function callbacks
have been completed.
Once all batches have been delivered the execution plan will wait for the `finish` future to complete
Once all batches have been delivered the execution plan will wait for the ``finish`` future to complete
before marking the execution plan finished. This allows for workflows where the consumption function
converts batches into async tasks (this is currently done internally for the dataset write node).

2 changes: 1 addition & 1 deletion docs/source/cpp/build_system.rst
@@ -167,7 +167,7 @@ file into an executable linked with the Arrow C++ shared library:
.. code-block:: makefile

my_example: my_example.cc
$(CXX) -o $@ $(CXXFLAGS) $< $$(pkg-config --cflags --libs arrow)
$(CXX) -o $@ $(CXXFLAGS) $< $$(pkg-config --cflags --libs arrow)
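
For context, a minimal ``my_example.cc`` that the makefile rule above could compile — any small program including ``arrow/api.h`` works; printing ``arrow::GetBuildInfo().version_string`` is just one way (assumed to be exported through that header) to confirm the pkg-config flags resolved correctly.

.. code-block:: cpp

   // my_example.cc — hypothetical counterpart to the makefile rule above.
   #include <iostream>
   #include <arrow/api.h>

   int main() {
     // Print the Arrow version we linked against, which confirms that
     // pkg-config supplied usable compile and link flags.
     std::cout << "Using Apache Arrow " << arrow::GetBuildInfo().version_string
               << std::endl;
     return 0;
   }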

Many build systems support pkg-config. For example:

18 changes: 9 additions & 9 deletions docs/source/cpp/compute.rst
@@ -514,8 +514,8 @@ Mixed time resolution temporal inputs will be cast to finest input resolution.
+------------+---------------------------------------------+

It's compatible with Redshift's decimal promotion rules. All decimal digits
are preserved for `add`, `subtract` and `multiply` operations. The result
precision of `divide` is at least the sum of precisions of both operands with
are preserved for ``add``, ``subtract`` and ``multiply`` operations. The result
precision of ``divide`` is at least the sum of precisions of both operands with
enough scale kept. Error is returned if the result precision is beyond the
decimal value range.

@@ -1029,7 +1029,7 @@ These functions trim off characters on both sides (trim), or the left (ltrim) or
+--------------------------+------------+-------------------------+---------------------+----------------------------------------+---------+

* \(1) Only characters specified in :member:`TrimOptions::characters` will be
trimmed off. Both the input string and the `characters` argument are
trimmed off. Both the input string and the ``characters`` argument are
interpreted as ASCII characters.

* \(2) Only trim off ASCII whitespace characters (``'\t'``, ``'\n'``, ``'\v'``,
@@ -1570,7 +1570,7 @@ is the same, even though the UTC years would be different.
Timezone handling
~~~~~~~~~~~~~~~~~

`assume_timezone` function is meant to be used when an external system produces
``assume_timezone`` function is meant to be used when an external system produces
"timezone-naive" timestamps which need to be converted to "timezone-aware"
timestamps (see for example the `definition
<https://docs.python.org/3/library/datetime.html#aware-and-naive-objects>`__
@@ -1581,11 +1581,11 @@ Input timestamps are assumed to be relative to the timezone given in
UTC-relative timestamps with the timezone metadata set to the above value.
An error is returned if the timestamps already have the timezone metadata set.

`local_timestamp` function converts UTC-relative timestamps to local "timezone-naive"
``local_timestamp`` function converts UTC-relative timestamps to local "timezone-naive"
timestamps. The timezone is taken from the timezone metadata of the input
timestamps. This function is the inverse of `assume_timezone`. Please note:
timestamps. This function is the inverse of ``assume_timezone``. Please note:
**all temporal functions already operate on timestamps as if they were in local
time of the metadata provided timezone**. Using `local_timestamp` is only meant to be
time of the metadata provided timezone**. Using ``local_timestamp`` is only meant to be
used when an external system expects local timestamps.
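
For context, a minimal sketch of the round trip described above, calling both functions by name through ``CallFunction``. It assumes ``AssumeTimezoneOptions`` takes the timezone string as its first argument and that a timezone database is available at runtime; treat it as illustration only.

.. code-block:: cpp

   #include <iostream>
   #include <arrow/api.h>
   #include <arrow/compute/api.h>

   arrow::Status RunExample() {
     // "Timezone-naive" timestamps: the type carries no timezone metadata.
     arrow::TimestampBuilder builder(arrow::timestamp(arrow::TimeUnit::SECOND),
                                     arrow::default_memory_pool());
     ARROW_RETURN_NOT_OK(builder.AppendValues({0, 3600, 7200}));
     ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> naive, builder.Finish());

     // Interpret the values as local wall-clock time in Europe/Paris and get
     // back UTC-relative, "timezone-aware" timestamps.
     arrow::compute::AssumeTimezoneOptions assume_options("Europe/Paris");
     ARROW_ASSIGN_OR_RAISE(
         arrow::Datum aware,
         arrow::compute::CallFunction("assume_timezone", {naive}, &assume_options));

     // local_timestamp is the inverse: back to naive local timestamps.
     ARROW_ASSIGN_OR_RAISE(arrow::Datum local,
                           arrow::compute::CallFunction("local_timestamp", {aware}));

     std::cout << aware.make_array()->ToString() << std::endl;
     std::cout << local.make_array()->ToString() << std::endl;
     return arrow::Status::OK();
   }

   int main() {
     arrow::Status st = RunExample();
     if (!st.ok()) {
       std::cerr << st.ToString() << std::endl;
       return 1;
     }
     return 0;
   }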

+-----------------+-------+-------------+---------------+---------------------------------+-------+
@@ -1649,8 +1649,8 @@ overflow is detected.

* \(1) CumulativeOptions has two optional parameters. The first parameter
:member:`CumulativeOptions::start` is a starting value for the running
accumulation. It has a default value of 0 for `sum`, 1 for `prod`, min of
input type for `max`, and max of input type for `min`. Specified values of
accumulation. It has a default value of 0 for ``sum``, 1 for ``prod``, min of
input type for ``max``, and max of input type for ``min``. Specified values of
``start`` must be castable to the input type. The second parameter
:member:`CumulativeOptions::skip_nulls` is a boolean. When set to
false (the default), the first encountered null is propagated. When set to
2 changes: 1 addition & 1 deletion docs/source/developers/cpp/building.rst
@@ -312,7 +312,7 @@ depends on ``python`` being available).

On some Linux distributions, running the test suite might require setting an
explicit locale. If you see any locale-related errors, try setting the
environment variable (which requires the `locales` package or equivalent):
environment variable (which requires the ``locales`` package or equivalent):

.. code-block::

2 changes: 1 addition & 1 deletion docs/source/developers/documentation.rst
@@ -259,7 +259,7 @@ Build the docs in the target directory:
sphinx-build ./source/developers ./source/developers/_build -c ./source -D master_doc=temp_index

This builds everything in the target directory to a folder inside of it
called ``_build`` using the config file in the `source` directory.
called ``_build`` using the config file in the ``source`` directory.

Once you have verified the HTML documents, you can remove temporary index file:

4 changes: 2 additions & 2 deletions docs/source/developers/guide/step_by_step/arrow_codebase.rst
@@ -99,8 +99,8 @@ can be called from a function in another language. After a function is defined
C++ we must create the binding manually to use it in that implementation.

.. note::
There is much you can learn by checking **Pull Requests**
and **unit tests** for similar issues.
There is much you can learn by checking **Pull Requests**
and **unit tests** for similar issues.

.. tab-set::

8 changes: 4 additions & 4 deletions docs/source/developers/guide/step_by_step/set_up.rst
@@ -118,10 +118,10 @@ Should give you a result similar to this:

.. code:: console

origin https://github.com/<your username>/arrow.git (fetch)
origin https://github.com/<your username>/arrow.git (push)
upstream https://github.com/apache/arrow (fetch)
upstream https://github.com/apache/arrow (push)
origin https://github.com/<your username>/arrow.git (fetch)
origin https://github.com/<your username>/arrow.git (push)
upstream https://github.com/apache/arrow (fetch)
upstream https://github.com/apache/arrow (push)

If you did everything correctly, you should now have a copy of the code
in the ``arrow`` directory and two remotes that refer to your own GitHub
2 changes: 1 addition & 1 deletion docs/source/developers/java/development.rst
@@ -118,7 +118,7 @@ This checks the code style of all source code under the current directory or fro

$ mvn checkstyle:check

Maven `pom.xml` style is enforced with Spotless using `Apache Maven pom.xml guidelines`_
Maven ``pom.xml`` style is enforced with Spotless using `Apache Maven pom.xml guidelines`_
You can also just check the style without building the project.
This checks the style of all pom.xml files under the current directory or from within an individual module.

4 changes: 2 additions & 2 deletions docs/source/developers/release.rst
@@ -106,7 +106,7 @@ If there is consensus and there is a Release Manager willing to take the effort
the release a patch release can be created.

Committers can tag issues that should be included on the next patch release using the
`backport-candidate` label. Is the responsability of the author or the committer to add the
``backport-candidate`` label. Is the responsability of the author or the committer to add the
label to the issue to help the Release Manager identify the issues that should be backported.

If a specific issue is identified as the reason to create a patch release the Release Manager
Expand All @@ -117,7 +117,7 @@ Be sure to go through on the following checklist:
#. Create milestone
#. Create maintenance branch
#. Include issue that was requested as requiring new patch release
#. Add new milestone to issues with `backport-candidate` label
#. Add new milestone to issues with ``backport-candidate`` label
#. cherry-pick issues into maintenance branch

Creating a Release Candidate
4 changes: 2 additions & 2 deletions docs/source/format/CanonicalExtensions.rst
@@ -77,7 +77,7 @@ Official List
Fixed shape tensor
==================

* Extension name: `arrow.fixed_shape_tensor`.
* Extension name: ``arrow.fixed_shape_tensor``.

* The storage type of the extension: ``FixedSizeList`` where:

@@ -153,7 +153,7 @@ Fixed shape tensor
Variable shape tensor
=====================

* Extension name: `arrow.variable_shape_tensor`.
* Extension name: ``arrow.variable_shape_tensor``.

* The storage type of the extension is: ``StructArray`` where struct
is composed of **data** and **shape** fields describing a single
6 changes: 3 additions & 3 deletions docs/source/format/Columnar.rst
@@ -312,7 +312,7 @@ Each value in this layout consists of 0 or more bytes. While primitive
arrays have a single values buffer, variable-size binary have an
**offsets** buffer and **data** buffer.

The offsets buffer contains `length + 1` signed integers (either
The offsets buffer contains ``length + 1`` signed integers (either
32-bit or 64-bit, depending on the logical type), which encode the
start position of each slot in the data buffer. The length of the
value in each slot is computed using the difference between the offset
@@ -374,7 +374,7 @@ locations are indicated using a **views** buffer, which may point to one
of potentially several **data** buffers or may contain the characters
inline.

The views buffer contains `length` view structures with the following layout:
The views buffer contains ``length`` view structures with the following layout:

::

@@ -394,7 +394,7 @@ should be interpreted.

In the short string case the string's bytes are inlined — stored inside the
view itself, in the twelve bytes which follow the length. Any remaining bytes
after the string itself are padded with `0`.
after the string itself are padded with ``0``.

In the long string case, a buffer index indicates which data buffer
stores the data bytes and an offset indicates where in that buffer the
2 changes: 1 addition & 1 deletion docs/source/format/FlightSql.rst
@@ -196,7 +196,7 @@ in the ``app_metadata`` field of the Flight RPC ``PutResult`` returned.

When used with DoPut: load the stream of Arrow record batches into
the specified target table and return the number of rows ingested
via a `DoPutUpdateResult` message.
via a ``DoPutUpdateResult`` message.

Flight Server Session Management
--------------------------------
2 changes: 1 addition & 1 deletion docs/source/format/Integration.rst
@@ -501,7 +501,7 @@ integration testing actually tests.

There are two types of integration test cases: the ones populated on the fly
by the data generator in the Archery utility, and *gold* files that exist
in the `arrow-testing <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration>`
in the `arrow-testing <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration>`_
repository.

Data Generator Tests
2 changes: 1 addition & 1 deletion docs/source/java/algorithm.rst
@@ -82,7 +82,7 @@ for fixed width and variable width vectors, respectively. Both algorithms run in

3. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer
vector, which correspond to indices of vector elements in sorted order. With the index vector, one can
easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k``th
easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k`` th
smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``,
which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type.

2 changes: 1 addition & 1 deletion docs/source/java/flight_sql_jdbc_driver.rst
@@ -162,7 +162,7 @@ the Flight SQL service as gRPC headers. For example, the following URI ::

This will connect without authentication or encryption, to a Flight
SQL service running on ``localhost`` on port 12345. Each request will
also include a `database=mydb` gRPC header.
also include a ``database=mydb`` gRPC header.

Connection parameters may also be supplied using the Properties object
when using the JDBC Driver Manager to connect. When supplying using