[SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching #56334

acruise · 2026-06-09T17:00:59Z

Presumably when this returns false for any reason, we fallback to the default cache driver, that should be made clear in the docs if it isn't already

e.g. the doc says supports all Spark SQL data types but this implementation would seem to falsify that claim ;)

Well, I guess the claim may be true, and the fallthrough at the end might be defensive... In which case maybe we'd want to log a surprisingly unsupported type :)

Good catch -- "supports all Spark SQL data types" did overstate it, and the type list was also incomplete (it omitted Time, intervals, Geometry/Geography, Variant, Null, and UDTs, which are all supported). Fixed the doc to list the actually-supported set and to describe the unsupported-type behavior.

One clarification on the mechanism, re: your fallback question: isSupportedByArrow here only gates supportsColumnarInput, which the cache framework uses to choose the columnar-vs-row input path into this same serializer -- it isn't a fallback to the default cache driver (there's no such per-type fallback). A truly unsupported type isn't silently dropped either: toArrowSchema throws UNSUPPORTED_DATATYPE when the cache is materialized. The docs now state this explicitly.

uros-b · 2026-06-24T14:36:26Z

cc @MaxGekk to take a look ^^

sunchao · 2026-06-27T19:33:32Z

[P2] Please reconcile this capability whitelist with current master before merging. master now maps and reads/writes TimestampNTZNanosType and TimestampLTZNanosType through Arrow, but this predicate still rejects them. The serializer then takes the row path, falls through to ObjectColumnStats (whose ColumnType has no nanos-timestamp case), and fails materialization with UNSUPPORTED_DATATYPE; row output also lacks a typed reader or fallback. The synthetic merge is conflict-free but still has this behavioral incompatibility.

Good catch that master now maps these through Arrow. I looked at the cache paths, though, and the physical value for TimestampNTZNanosType/TimestampLTZNanosType is a TimestampNanosVal, not a plain Long, so the cache's stats collector and fast columnar reader can't treat them as long-backed without a dedicated, precision-aware path -- that's net-new support rather than a predicate tweak. Importantly there's no parity regression: the default cache serializer doesn't support these types either. I verified on current master that df.cache() of a TIMESTAMP_NTZ(9) column throws not support type: TimestampNTZNanosType(9) with the default serializer (its ColumnBuilder has no case for them). The Arrow serializer now rejects them with a clear checkSupportedSchema error at materialization. I'd rather add real support as a focused follow-up than land a half-correct fast path here -- let me know if you'd prefer it gated differently in the meantime.

Follow-up filed: SPARK-57735 / #56842 adds nanosecond-timestamp support to the default in-memory cache (DefaultCachedBatchSerializer), which is the prerequisite -- once that lands, the Arrow cache can route these types through the same statistics machinery rather than rejecting them here.

[P2] The prerequisite cited here has now landed in this exact tree: DefaultCachedBatchSerializer has nanosecond timestamp builders in ColumnBuilder.scala, matching ColumnType cases and TimestampNanosColumnStats. This predicate still rejects both TimestampNTZNanosType and TimestampLTZNanosType, so the Arrow cache is now less capable than the default cache even though the guide says it covers every default-supported type. Since #56842 was the stated gate, please reconcile this before merging by completing Arrow-cache support or narrowing the compatibility claim.

[ 🤖 posted by Codex on behalf of sunchao using the code-review-for-me skill 🤖 ]

sunchao · 2026-06-27T19:33:32Z

[P2] Please do not advertise CalendarIntervalType as fully supported unless the cache representation is lossless over Spark's full value domain. Arrow's IntervalMonthDayNanoWriter multiplies CalendarInterval.microseconds by 1000 with Math.multiplyExact, but Spark permits the full Long range. I reproduced this with new CalendarInterval(0, 0, Long.MaxValue / 1000L + 1L): the default cache returns COUNT=1, while the Arrow cache aborts materialization with ArithmeticException: long overflow. Please use a lossless representation or reject and document this type rather than claiming parity with the default serializer.

Fixed with a clear diagnostic. Caching a CalendarInterval whose microseconds exceed +/-(Long.MaxValue / 1000) now throws an explanatory error (naming the type and the nanosecond-conversion limit) instead of an opaque ArithmeticException: long overflow. The check is installed only when the schema actually contains a CalendarInterval column, so there is no per-row cost for other schemas. Arrow's IntervalMonthDayNano is nanosecond-based and cannot losslessly hold the full Long microsecond domain, so I documented the value-range limit rather than changing the shared Arrow writer; the default serializer's lack of this restriction is noted in the docs.

[P2] The diagnostic fix still covers only top-level intervals. hasCalendarInterval checks attr.dataType == CalendarIntervalType, while isSupportedByArrow accepts CalendarInterval recursively inside arrays, structs, maps, and UDTs, and their recursive Arrow writers still reach Math.multiplyExact(microseconds, 1000L). For example, an array<interval> containing Long.MaxValue / 1000 + 1 still escapes withIntervalOverflowTranslation and fails with raw ArithmeticException: long overflow, contrary to the guide and this reply. Please recurse through complex types and UserDefinedType.sqlType, with nested overflow coverage.

[ 🤖 posted by Codex on behalf of sunchao using the code-review-for-me skill 🤖 ]

sunchao · 2026-06-25T00:02:38Z

[P2] This capability check accepts any UDT whose sqlType is supported, but the cache statistics path does not unwrap UDTs. A valid UDT backed by VariantType passes here, then createColumnStats constructs ObjectColumnStats; ColumnType(udt.sqlType) has no VariantType case and throws UNSUPPORTED_DATATYPE during materialization. The default cache serializer does unwrap the same UDT and supports it, so this also contradicts the documented claim that Arrow covers every default-supported type. Please unwrap UDTs when selecting statistics collectors, or narrow this predicate and the documentation.

Fixed. createColumnStats now unwraps UDTs (case udt: UserDefinedType[_] => createColumnStats(udt.sqlType)), so a Variant- or Geometry-backed UDT gets the right collector instead of falling through to ObjectColumnStats and throwing UNSUPPORTED_DATATYPE. This keeps the capability check and the statistics path in agreement.

-Original file line number
+Diff line change
@@ Expand Up / @@ -620,6 +620,8 @@ private[serializer] object KryoSerializer { @@
           "org.apache.spark.sql.columnar.CachedBatchSerializer",
           "org.apache.spark.sql.columnar.SimpleMetricsCachedBatchSerializer",
           "org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer",
+          "org.apache.spark.sql.execution.columnar.ArrowCachedBatch",
+          "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer",
           "org.apache.spark.ml.attribute.Attribute",
           "org.apache.spark.ml.attribute.AttributeGroup",
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -80,6 +80,8 @@ @@
       subitems:
         - text: Caching Data
           url: sql-performance-tuning.html#caching-data
+        - text: Arrow Cache Format
+          url: sql-arrow-cache-format.html
         - text: Tuning Partitions
           url: sql-performance-tuning.html#tuning-partitions
         - text: Leveraging Statistics
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up @@
     To list relations cached with an explicit name, use `spark.catalog.listCachedTables()`. Entries cached only via `Dataset.cache()` without a name are not included.
+    Spark supports two cache formats:
+    - **Default cache format**: The standard in-memory columnar cache (used by default).
+    - **Arrow cache format**: An Apache Arrow-based cache that can improve read performance for columnar workloads and enables Arrow ecosystem interoperability. See [Arrow Cache Format documentation](sql-arrow-cache-format.html) for details and configuration.
     Configuration of in-memory caching can be done via `spark.conf.set` or by running
     `SET key=value` commands using SQL.
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -38,6 +38,50 @@ private[sql] object ArrowUtils { @@
       // todo: support more types.
+      /**
+       * Check if a Spark DataType is supported by Arrow. This recursively checks complex types
+       * (Array, Struct, Map).
+       *
+       * Note: This checks compatibility with toArrowField(), not toArrowType(). Types like
+       * GeometryType, GeographyType, and VariantType are not supported by toArrowType() (which only
+       * handles primitive Arrow types), but ARE supported by toArrowField() which converts them to
+       * Arrow Struct representations with metadata. Since Arrow cache uses toArrowField() via
+       * toArrowSchema() to create the schema, these types are supported.
+       */
+      def isSupportedByArrow(dt: DataType): Boolean = {
+        dt match {
+          // Primitive types
+          case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
+              _: StringType | BinaryType | NullType =>
+            true
+          // Decimal
+          case _: DecimalType => true
+          // Temporal types
+          case DateType | TimestampType | TimestampNTZType | _: TimeType => true
+          // Interval types
+          case _: YearMonthIntervalType | _: DayTimeIntervalType | CalendarIntervalType => true
+          // Complex types - recursively check element types
+          case ArrayType(elementType, _) => isSupportedByArrow(elementType)
+          case StructType(fields) => fields.forall(f => isSupportedByArrow(f.dataType))
+          case MapType(keyType, valueType, _) =>
+            isSupportedByArrow(keyType) && isSupportedByArrow(valueType)
+          // Special types
+          // Note: These are not in toArrowType(), but are handled by toArrowField()
+          case udt: UserDefinedType[_] => isSupportedByArrow(udt.sqlType)
+          case _: GeometryType => true // Converted to Struct with srid + wkb fields
+          case _: GeographyType => true // Converted to Struct with srid + wkb fields
+          case _: VariantType => true // Converted to Struct with value + metadata fields
+          // Unsupported types
+          case _ => false
+        }
+      }
       /** Maps data type from Spark to Arrow. NOTE: timeZoneId required for TimestampTypes */
       def toArrowType(dt: DataType, timeZoneId: String, largeVarTypes: Boolean = false): ArrowType =
         TypeApiOps(dt)
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -4767,6 +4767,18 @@ object SQLConf { @@
           .booleanConf
           .createWithDefault(true)
+      val ARROW_CACHE_PREFETCH_ENABLED =
+        buildConf("spark.sql.execution.arrow.cache.prefetch.enabled")
+          .doc("When true, Arrow cache read path prefetches and decompresses the next batch " +
+            "in a background thread while the current batch is being consumed. This can " +
+            "significantly improve read performance for compressed Arrow caches (e.g., ZSTD) " +
+            "by overlapping decompression with consumption. Increases memory usage by up to " +
+            "one additional batch worth of Arrow vectors.")
+          .version("4.3.0")
+          .withBindingPolicy(ConfigBindingPolicy.SESSION)
+          .booleanConf
+          .createWithDefault(false)
       val ARROW_TRANSFORM_WITH_STATE_IN_PYSPARK_MAX_STATE_RECORDS_PER_BATCH =
         buildConf("spark.sql.execution.arrow.transformWithStateInPySpark.maxStateRecordsPerBatch")
           .doc("When using TransformWithState in PySpark (both Python Row and Pandas), limit " +
@@ Expand Down Expand Up @@
       def arrowPySparkUDFColumnarInputEnabled: Boolean =
         getConf(ARROW_PYSPARK_UDF_COLUMNAR_INPUT_ENABLED)
+      def arrowCachePrefetchEnabled: Boolean = getConf(ARROW_CACHE_PREFETCH_ENABLED)
       def arrowTransformWithStateInPySparkMaxStateRecordsPerBatch: Int =
         getConf(ARROW_TRANSFORM_WITH_STATE_IN_PYSPARK_MAX_STATE_RECORDS_PER_BATCH)
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching #56334

Uh oh!

Diff view

Diff view

There are no files selected for viewing

Uh oh!

acruise Jun 9, 2026

Uh oh!

acruise Jun 9, 2026

Uh oh!

acruise Jun 9, 2026

Uh oh!

viirya Jun 24, 2026

Uh oh!

uros-b Jun 24, 2026

Uh oh!

sunchao Jun 27, 2026

Uh oh!

viirya Jun 28, 2026

Uh oh!

viirya Jun 28, 2026

Uh oh!

sunchao Jul 1, 2026

Uh oh!

sunchao Jun 27, 2026

Uh oh!

viirya Jun 28, 2026

Uh oh!

sunchao Jul 1, 2026

Uh oh!

sunchao Jun 25, 2026

Uh oh!

viirya Jun 25, 2026

Uh oh!

Uh oh!

-Original file line number
+Diff line change
@@ Expand Up / @@ -171,7 +171,10 @@ object StaticSQLConf { @@
           "org.apache.spark.sql.columnar.CachedBatchSerializer. It will be used to " +
           "translate SQL data into a format that can more efficiently be cached. The underlying " +
           "API is subject to change so use with caution. Multiple classes cannot be specified. " +
-          "The class must have a no-arg constructor.")
+          "The class must have a no-arg constructor. Available implementations include: " +
+          "org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer (default) and " +
+          "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer (Arrow format with " +
+          "zero-copy columnar reads and better Arrow ecosystem interoperability).")
         .version("3.1.0")
         .stringConf
         .createWithDefault("org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer")
@@ Expand Down @@

-Original file line number
+Diff line change
@@ -0,0 +1,85 @@
+    ================================================================================================
+    Arrow Cache vs Default Cache
+    ================================================================================================
+    ================================================================================================
+    Cache primitive types
+    ================================================================================================
+    OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
+    AMD EPYC 7763 64-Core Processor
+    Cache 5M rows with primitives:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+    ---------------------------------------------------------------------------------------------------------------------------
+    Default cache - write + read                          1854           1922          97          2.7         370.8       1.0X
+    Default cache - write + read (uncompressed)           1159           1165           8          4.3         231.8       1.6X
+    Arrow cache - write + read                            1300           1315          21          3.8         260.0       1.4X
+    Arrow cache - write + read (zstd level -1)            1808           1811           4          2.8         361.6       1.0X
+    Arrow cache - write + read (zstd level 1)             1814           1830          23          2.8         362.8       1.0X
+    Arrow cache - write + read (zstd level 3)             1902           1929          39          2.6         380.4       1.0X
+    ================================================================================================
+    Cache then filter
+    ================================================================================================
+    OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
+    AMD EPYC 7763 64-Core Processor
+    Cache 5M rows, then filter:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------------------------------------------------
+    Default cache - filter                             1662           1683          29          3.0         332.5       1.0X
+    Default cache - filter (uncompressed)              1312           1312           0          3.8         262.4       1.3X
+    Arrow cache - filter                               1447           1462          21          3.5         289.4       1.1X
+    Arrow cache - filter (zstd level -1)               1729           1757          40          2.9         345.8       1.0X
+    Arrow cache - filter (zstd level 1)                1787           1799          17          2.8         357.3       0.9X
+    Arrow cache - filter (zstd level 3)                1951           1955           5          2.6         390.3       0.9X
+    ================================================================================================
+    Cache columnar input (Parquet)
+    ================================================================================================
+    OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
+    AMD EPYC 7763 64-Core Processor
+    Cache 2M rows from Parquet:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+    -----------------------------------------------------------------------------------------------------------------------------
+    Default cache - columnar input                          1545           1619         104          1.3         772.7       1.0X
+    Default cache - columnar input (uncompressed)           1313           1336          33          1.5         656.4       1.2X
+    Arrow cache - columnar input                            1353           1378          35          1.5         676.7       1.1X
+    Arrow cache - columnar input (zstd level -1)            1535           1573          54          1.3         767.6       1.0X
+    Arrow cache - columnar input (zstd level 1)             1619           1622           5          1.2         809.6       1.0X
+    Arrow cache - columnar input (zstd level 3)             1708           1709           2          1.2         853.8       0.9X
+    ================================================================================================
+    Re-cache Arrow cached data (zero-copy test)
+    ================================================================================================
+    OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
+    AMD EPYC 7763 64-Core Processor
+    Re-cache 2M rows (zero-copy):                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+    --------------------------------------------------------------------------------------------------------------------------------
+    Default cache - cache a cached DF                           411            428          20          4.9         205.7       1.0X
+    Default cache - cache a cached DF (uncompressed)            191            210          26         10.5          95.7       2.2X
+    Arrow cache - cache a cached DF (zero-copy)                 137            156          24         14.6          68.4       3.0X
+    Arrow cache - cache a cached DF (zstd level -1)             327            343          18          6.1         163.3       1.3X
+    Arrow cache - cache a cached DF (zstd level 1)              338            341           3          5.9         168.8       1.2X
+    Arrow cache - cache a cached DF (zstd level 3)              352            357           3          5.7         176.2       1.2X
+    ================================================================================================
+    Cache with column pruning (select 1 of 20 columns)
+    ================================================================================================
+    OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
+    AMD EPYC 7763 64-Core Processor
+    Cache 5M rows, select 1 column:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+    -----------------------------------------------------------------------------------------------------------------------------
+    Default cache - select 1 of 20 columns                 10855          11142         406          0.5        2171.0       1.0X
+    Default cache - select 1 of 20 (uncompressed)           4135           4149          20          1.2         827.0       2.6X
+    Arrow cache - select 1 of 20                            5179           5280         144          1.0        1035.8       2.1X
+    Arrow cache - select 1 of 20 (zstd level -1)            9258           9283          35          0.5        1851.7       1.2X
+    Arrow cache - select 1 of 20 (zstd level 1)             9437           9603         234          0.5        1887.4       1.2X
+    Arrow cache - select 1 of 20 (zstd level 3)             9778           9794          23          0.5        1955.5       1.1X

Uh oh!

[SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching #56334

Are you sure you want to change the base?

Uh oh!

[SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching #56334

Uh oh!

Uh oh!

Diff view

Diff view

There are no files selected for viewing

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!