Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
8606c70
Add Apache Arrow as native cache format for Spark in-memory Dataset c…
viirya Jan 17, 2026
6826e7f
Implement min/max statistics collection for Arrow cache Phase 2
viirya Jan 17, 2026
97de2e0
Implement zero-copy optimization for ArrowColumnVector input
viirya Jan 17, 2026
03459e4
Add comprehensive performance benchmarks for Arrow cache format
viirya Jan 17, 2026
7800501
Update Arrow cache benchmark with working implementation
viirya Jan 17, 2026
8e1d1bd
Add comprehensive Phase 3 documentation for Arrow cache format
viirya Jan 17, 2026
4504494
Fix zero-copy benchmark and correct documentation about Parquet/ORC
viirya Jan 18, 2026
7a98664
Add Arrow cache benchmark with performance results
viirya Jan 18, 2026
d4046e0
Update Arrow cache documentation with accurate benchmark results
viirya Jan 18, 2026
3498767
Implement proper type checking for Arrow cache columnar input support
viirya Jan 18, 2026
da11431
Add clarifying comments to ArrowUtils.isSupportedByArrow method
viirya Jan 18, 2026
11b1444
Remove unsupported claims about when default cache performs better
viirya Jan 18, 2026
107d66f
Add test to verify Arrow cache serializer is actually used
viirya Jan 29, 2026
ab59047
Optimize Arrow cache performance with three major improvements
viirya Jan 30, 2026
54a8ca8
[SPARK-XXXXX] Comment out LZ4 compression benchmarks in ArrowCacheBen…
viirya Jan 31, 2026
7748eba
[SPARK-XXXXX] Add ZSTD level -1 (fastest) compression benchmarks
viirya Jan 31, 2026
fc17e99
[SPARK-XXXXX] Update ArrowCacheBenchmark results with ZSTD level -1
viirya Jan 31, 2026
d16320c
[ARROW-CACHE] Add column pruning benchmark (select 1 of 20 columns)
viirya Feb 6, 2026
0a23d0c
[ARROW-CACHE] Deduplicate methods in ArrowCachedBatchSerializer
viirya Feb 10, 2026
7b20580
[ARROW-CACHE] Add comprehensive tests for complex types from columnar…
viirya Feb 11, 2026
22779f1
[SPARK-XXXXX] Make ObjectColumnStats handle columnar complex types
viirya Feb 11, 2026
8cf5b7b
[ARROW-CACHE] Optimize Arrow cache columnar-to-row read path
viirya Apr 7, 2026
2ee60d7
[ARROW-CACHE] Add TPC-DS cache benchmark for Arrow vs Default comparison
viirya Apr 7, 2026
99f7e08
[ARROW-CACHE] Add columnar read and column pruning benchmarks to TPCD…
viirya Apr 7, 2026
417ff6c
[ARROW-CACHE] Inline statistics collection in columnar-to-Arrow write…
viirya Apr 8, 2026
7e5d12f
[ARROW-CACHE] Fix collated strings, Kryo registration, NaN handling, …
viirya Apr 8, 2026
4407eb1
[ARROW-CACHE] Add background prefetch for Arrow cache read path
viirya Apr 15, 2026
0a6f911
[ARROW-CACHE] Port minor improvements from merged PR review
viirya Apr 18, 2026
f730d08
[ARROW-CACHE] Link Arrow cache docs into SQL menu and soften benchmar…
viirya Jun 4, 2026
645d0c6
[ARROW-CACHE] Remove Arrow cache tuning guide doc
viirya Jun 4, 2026
e32290e
[ARROW-CACHE] Remove Arrow cache migration guide doc
viirya Jun 4, 2026
a3c4688
[ARROW-CACHE] Complete the min/max statistics type list in docs
viirya Jun 4, 2026
e93a179
[ARROW-CACHE] Remove TPCDSCacheBenchmark
viirya Jun 4, 2026
2d66fef
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 4, 2026
b657624
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 4, 2026
fa28424
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 4, 2026
23f88e4
[ARROW-CACHE] Fix CI failures: binding policy, test isolation, scalafmt
viirya Jun 5, 2026
4b4700d
[ARROW-CACHE] Support top-level Geometry/Geography columns in the cache
viirya Jun 7, 2026
ded601e
[ARROW-CACHE] Fix nonpositive maxRecordsPerBatch, empty-batch row rea…
viirya Jun 7, 2026
3960b27
[ARROW-CACHE] Fix use-after-free in columnar cache read prefetch
viirya Jun 7, 2026
432fa9d
[ARROW-CACHE] Wrap an over-length line in the geospatial roundtrip test
viirya Jun 7, 2026
ae35da8
[ARROW-CACHE] Cover all isSupportedByArrow branches in the cache type…
viirya Jun 8, 2026
6d333de
[ARROW-CACHE] Do not override sizeInBytes with compressed Arrow IPC size
viirya Jun 8, 2026
298e09a
[ARROW-CACHE] Honor the configured zstd compression level
viirya Jun 11, 2026
4f3c3a9
[ARROW-CACHE] Add a cache roundtrip test for duplicated column names
viirya Jun 11, 2026
5390882
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 11, 2026
5838182
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 11, 2026
ece94cd
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 11, 2026
19323dd
[ARROW-CACHE] Correct the supported-types doc and document unsupporte…
viirya Jun 24, 2026
8f75393
[ARROW-CACHE] Address review findings: resource safety, stats correct…
viirya Jun 25, 2026
9ae0b7e
[ARROW-CACHE] Address second review round: capability gate, resource …
viirya Jun 28, 2026
b632172
[ARROW-CACHE] Assert geometry/geography sizeInBytes in the e2e roundt…
viirya Jun 28, 2026
7717b05
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 28, 2026
f4db462
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 28, 2026
7c63918
Benchmark results for org.apache.spark.sql.execution.benchmark.ArrowC…
viirya Jun 28, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -620,6 +620,8 @@ private[serializer] object KryoSerializer {
"org.apache.spark.sql.columnar.CachedBatchSerializer",
"org.apache.spark.sql.columnar.SimpleMetricsCachedBatchSerializer",
"org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer",
"org.apache.spark.sql.execution.columnar.ArrowCachedBatch",
"org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer",

"org.apache.spark.ml.attribute.Attribute",
"org.apache.spark.ml.attribute.AttributeGroup",
Expand Down
2 changes: 2 additions & 0 deletions docs/_data/menu-sql.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,8 @@
subitems:
- text: Caching Data
url: sql-performance-tuning.html#caching-data
- text: Arrow Cache Format
url: sql-arrow-cache-format.html
- text: Tuning Partitions
url: sql-performance-tuning.html#tuning-partitions
- text: Leveraging Statistics
Expand Down
378 changes: 378 additions & 0 deletions docs/sql-arrow-cache-format.md

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/sql-performance-tuning.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,10 @@ memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableNam

To list relations cached with an explicit name, use `spark.catalog.listCachedTables()`. Entries cached only via `Dataset.cache()` without a name are not included.

Spark supports two cache formats:
- **Default cache format**: The standard in-memory columnar cache (used by default).
- **Arrow cache format**: An Apache Arrow-based cache that can improve read performance for columnar workloads and enables Arrow ecosystem interoperability. See [Arrow Cache Format documentation](sql-arrow-cache-format.html) for details and configuration.

Configuration of in-memory caching can be done via `spark.conf.set` or by running
`SET key=value` commands using SQL.

Expand Down
44 changes: 44 additions & 0 deletions sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,50 @@ private[sql] object ArrowUtils {

// todo: support more types.

/**
* Check if a Spark DataType is supported by Arrow. This recursively checks complex types
* (Array, Struct, Map).
*
* Note: This checks compatibility with toArrowField(), not toArrowType(). Types like
* GeometryType, GeographyType, and VariantType are not supported by toArrowType() (which only
* handles primitive Arrow types), but ARE supported by toArrowField() which converts them to
* Arrow Struct representations with metadata. Since Arrow cache uses toArrowField() via
* toArrowSchema() to create the schema, these types are supported.
*/
def isSupportedByArrow(dt: DataType): Boolean = {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably when this returns false for any reason, we fallback to the default cache driver, that should be made clear in the docs if it isn't already

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. the doc says supports all Spark SQL data types but this implementation would seem to falsify that claim ;)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I guess the claim may be true, and the fallthrough at the end might be defensive... In which case maybe we'd want to log a surprisingly unsupported type :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- "supports all Spark SQL data types" did overstate it, and the type list was also incomplete (it omitted Time, intervals, Geometry/Geography, Variant, Null, and UDTs, which are all supported). Fixed the doc to list the actually-supported set and to describe the unsupported-type behavior.

One clarification on the mechanism, re: your fallback question: isSupportedByArrow here only gates supportsColumnarInput, which the cache framework uses to choose the columnar-vs-row input path into this same serializer -- it isn't a fallback to the default cache driver (there's no such per-type fallback). A truly unsupported type isn't silently dropped either: toArrowSchema throws UNSUPPORTED_DATATYPE when the cache is materialized. The docs now state this explicitly.

dt match {
// Primitive types
case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
_: StringType | BinaryType | NullType =>
true

// Decimal
case _: DecimalType => true

// Temporal types
case DateType | TimestampType | TimestampNTZType | _: TimeType => true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @MaxGekk to take a look ^^

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Please reconcile this capability whitelist with current master before merging. master now maps and reads/writes TimestampNTZNanosType and TimestampLTZNanosType through Arrow, but this predicate still rejects them. The serializer then takes the row path, falls through to ObjectColumnStats (whose ColumnType has no nanos-timestamp case), and fails materialization with UNSUPPORTED_DATATYPE; row output also lacks a typed reader or fallback. The synthetic merge is conflict-free but still has this behavioral incompatibility.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch that master now maps these through Arrow. I looked at the cache paths, though, and the physical value for TimestampNTZNanosType/TimestampLTZNanosType is a TimestampNanosVal, not a plain Long, so the cache's stats collector and fast columnar reader can't treat them as long-backed without a dedicated, precision-aware path -- that's net-new support rather than a predicate tweak. Importantly there's no parity regression: the default cache serializer doesn't support these types either. I verified on current master that df.cache() of a TIMESTAMP_NTZ(9) column throws not support type: TimestampNTZNanosType(9) with the default serializer (its ColumnBuilder has no case for them). The Arrow serializer now rejects them with a clear checkSupportedSchema error at materialization. I'd rather add real support as a focused follow-up than land a half-correct fast path here -- let me know if you'd prefer it gated differently in the meantime.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up filed: SPARK-57735 / #56842 adds nanosecond-timestamp support to the default in-memory cache (DefaultCachedBatchSerializer), which is the prerequisite -- once that lands, the Arrow cache can route these types through the same statistics machinery rather than rejecting them here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] The prerequisite cited here has now landed in this exact tree: DefaultCachedBatchSerializer has nanosecond timestamp builders in ColumnBuilder.scala, matching ColumnType cases and TimestampNanosColumnStats. This predicate still rejects both TimestampNTZNanosType and TimestampLTZNanosType, so the Arrow cache is now less capable than the default cache even though the guide says it covers every default-supported type. Since #56842 was the stated gate, please reconcile this before merging by completing Arrow-cache support or narrowing the compatibility claim.

[ 🤖 posted by Codex on behalf of sunchao using the code-review-for-me skill 🤖 ]


// Interval types
case _: YearMonthIntervalType | _: DayTimeIntervalType | CalendarIntervalType => true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Please do not advertise CalendarIntervalType as fully supported unless the cache representation is lossless over Spark's full value domain. Arrow's IntervalMonthDayNanoWriter multiplies CalendarInterval.microseconds by 1000 with Math.multiplyExact, but Spark permits the full Long range. I reproduced this with new CalendarInterval(0, 0, Long.MaxValue / 1000L + 1L): the default cache returns COUNT=1, while the Arrow cache aborts materialization with ArithmeticException: long overflow. Please use a lossless representation or reject and document this type rather than claiming parity with the default serializer.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with a clear diagnostic. Caching a CalendarInterval whose microseconds exceed +/-(Long.MaxValue / 1000) now throws an explanatory error (naming the type and the nanosecond-conversion limit) instead of an opaque ArithmeticException: long overflow. The check is installed only when the schema actually contains a CalendarInterval column, so there is no per-row cost for other schemas. Arrow's IntervalMonthDayNano is nanosecond-based and cannot losslessly hold the full Long microsecond domain, so I documented the value-range limit rather than changing the shared Arrow writer; the default serializer's lack of this restriction is noted in the docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] The diagnostic fix still covers only top-level intervals. hasCalendarInterval checks attr.dataType == CalendarIntervalType, while isSupportedByArrow accepts CalendarInterval recursively inside arrays, structs, maps, and UDTs, and their recursive Arrow writers still reach Math.multiplyExact(microseconds, 1000L). For example, an array<interval> containing Long.MaxValue / 1000 + 1 still escapes withIntervalOverflowTranslation and fails with raw ArithmeticException: long overflow, contrary to the guide and this reply. Please recurse through complex types and UserDefinedType.sqlType, with nested overflow coverage.

[ 🤖 posted by Codex on behalf of sunchao using the code-review-for-me skill 🤖 ]


// Complex types - recursively check element types
case ArrayType(elementType, _) => isSupportedByArrow(elementType)
case StructType(fields) => fields.forall(f => isSupportedByArrow(f.dataType))
case MapType(keyType, valueType, _) =>
isSupportedByArrow(keyType) && isSupportedByArrow(valueType)

// Special types
// Note: These are not in toArrowType(), but are handled by toArrowField()
case udt: UserDefinedType[_] => isSupportedByArrow(udt.sqlType)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] This capability check accepts any UDT whose sqlType is supported, but the cache statistics path does not unwrap UDTs. A valid UDT backed by VariantType passes here, then createColumnStats constructs ObjectColumnStats; ColumnType(udt.sqlType) has no VariantType case and throws UNSUPPORTED_DATATYPE during materialization. The default cache serializer does unwrap the same UDT and supports it, so this also contradicts the documented claim that Arrow covers every default-supported type. Please unwrap UDTs when selecting statistics collectors, or narrow this predicate and the documentation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. createColumnStats now unwraps UDTs (case udt: UserDefinedType[_] => createColumnStats(udt.sqlType)), so a Variant- or Geometry-backed UDT gets the right collector instead of falling through to ObjectColumnStats and throwing UNSUPPORTED_DATATYPE. This keeps the capability check and the statistics path in agreement.

case _: GeometryType => true // Converted to Struct with srid + wkb fields
case _: GeographyType => true // Converted to Struct with srid + wkb fields
case _: VariantType => true // Converted to Struct with value + metadata fields

// Unsupported types
case _ => false
}
}

/** Maps data type from Spark to Arrow. NOTE: timeZoneId required for TimestampTypes */
def toArrowType(dt: DataType, timeZoneId: String, largeVarTypes: Boolean = false): ArrowType =
TypeApiOps(dt)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4767,6 +4767,18 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val ARROW_CACHE_PREFETCH_ENABLED =
buildConf("spark.sql.execution.arrow.cache.prefetch.enabled")
.doc("When true, Arrow cache read path prefetches and decompresses the next batch " +
"in a background thread while the current batch is being consumed. This can " +
"significantly improve read performance for compressed Arrow caches (e.g., ZSTD) " +
"by overlapping decompression with consumption. Increases memory usage by up to " +
"one additional batch worth of Arrow vectors.")
.version("4.3.0")
.withBindingPolicy(ConfigBindingPolicy.SESSION)
.booleanConf
.createWithDefault(false)

val ARROW_TRANSFORM_WITH_STATE_IN_PYSPARK_MAX_STATE_RECORDS_PER_BATCH =
buildConf("spark.sql.execution.arrow.transformWithStateInPySpark.maxStateRecordsPerBatch")
.doc("When using TransformWithState in PySpark (both Python Row and Pandas), limit " +
Expand Down Expand Up @@ -8521,6 +8533,8 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
def arrowPySparkUDFColumnarInputEnabled: Boolean =
getConf(ARROW_PYSPARK_UDF_COLUMNAR_INPUT_ENABLED)

def arrowCachePrefetchEnabled: Boolean = getConf(ARROW_CACHE_PREFETCH_ENABLED)

def arrowTransformWithStateInPySparkMaxStateRecordsPerBatch: Int =
getConf(ARROW_TRANSFORM_WITH_STATE_IN_PYSPARK_MAX_STATE_RECORDS_PER_BATCH)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,10 @@ object StaticSQLConf {
"org.apache.spark.sql.columnar.CachedBatchSerializer. It will be used to " +
"translate SQL data into a format that can more efficiently be cached. The underlying " +
"API is subject to change so use with caution. Multiple classes cannot be specified. " +
"The class must have a no-arg constructor.")
"The class must have a no-arg constructor. Available implementations include: " +
"org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer (default) and " +
"org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer (Arrow format with " +
"zero-copy columnar reads and better Arrow ecosystem interoperability).")
.version("3.1.0")
.stringConf
.createWithDefault("org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer")
Expand Down
85 changes: 85 additions & 0 deletions sql/core/benchmarks/ArrowCacheBenchmark-jdk21-results.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
================================================================================================
Arrow Cache vs Default Cache
================================================================================================

================================================================================================
Cache primitive types
================================================================================================

OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 5M rows with primitives: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
Default cache - write + read 1854 1922 97 2.7 370.8 1.0X
Default cache - write + read (uncompressed) 1159 1165 8 4.3 231.8 1.6X
Arrow cache - write + read 1300 1315 21 3.8 260.0 1.4X
Arrow cache - write + read (zstd level -1) 1808 1811 4 2.8 361.6 1.0X
Arrow cache - write + read (zstd level 1) 1814 1830 23 2.8 362.8 1.0X
Arrow cache - write + read (zstd level 3) 1902 1929 39 2.6 380.4 1.0X


================================================================================================
Cache then filter
================================================================================================

OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 5M rows, then filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Default cache - filter 1662 1683 29 3.0 332.5 1.0X
Default cache - filter (uncompressed) 1312 1312 0 3.8 262.4 1.3X
Arrow cache - filter 1447 1462 21 3.5 289.4 1.1X
Arrow cache - filter (zstd level -1) 1729 1757 40 2.9 345.8 1.0X
Arrow cache - filter (zstd level 1) 1787 1799 17 2.8 357.3 0.9X
Arrow cache - filter (zstd level 3) 1951 1955 5 2.6 390.3 0.9X


================================================================================================
Cache columnar input (Parquet)
================================================================================================

OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 2M rows from Parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------
Default cache - columnar input 1545 1619 104 1.3 772.7 1.0X
Default cache - columnar input (uncompressed) 1313 1336 33 1.5 656.4 1.2X
Arrow cache - columnar input 1353 1378 35 1.5 676.7 1.1X
Arrow cache - columnar input (zstd level -1) 1535 1573 54 1.3 767.6 1.0X
Arrow cache - columnar input (zstd level 1) 1619 1622 5 1.2 809.6 1.0X
Arrow cache - columnar input (zstd level 3) 1708 1709 2 1.2 853.8 0.9X


================================================================================================
Re-cache Arrow cached data (zero-copy test)
================================================================================================

OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Re-cache 2M rows (zero-copy): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------
Default cache - cache a cached DF 411 428 20 4.9 205.7 1.0X
Default cache - cache a cached DF (uncompressed) 191 210 26 10.5 95.7 2.2X
Arrow cache - cache a cached DF (zero-copy) 137 156 24 14.6 68.4 3.0X
Arrow cache - cache a cached DF (zstd level -1) 327 343 18 6.1 163.3 1.3X
Arrow cache - cache a cached DF (zstd level 1) 338 341 3 5.9 168.8 1.2X
Arrow cache - cache a cached DF (zstd level 3) 352 357 3 5.7 176.2 1.2X


================================================================================================
Cache with column pruning (select 1 of 20 columns)
================================================================================================

OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 5M rows, select 1 column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------
Default cache - select 1 of 20 columns 10855 11142 406 0.5 2171.0 1.0X
Default cache - select 1 of 20 (uncompressed) 4135 4149 20 1.2 827.0 2.6X
Arrow cache - select 1 of 20 5179 5280 144 1.0 1035.8 2.1X
Arrow cache - select 1 of 20 (zstd level -1) 9258 9283 35 0.5 1851.7 1.2X
Arrow cache - select 1 of 20 (zstd level 1) 9437 9603 234 0.5 1887.4 1.2X
Arrow cache - select 1 of 20 (zstd level 3) 9778 9794 23 0.5 1955.5 1.1X



85 changes: 85 additions & 0 deletions sql/core/benchmarks/ArrowCacheBenchmark-jdk25-results.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
================================================================================================
Arrow Cache vs Default Cache
================================================================================================

================================================================================================
Cache primitive types
================================================================================================

OpenJDK 64-Bit Server VM 25.0.3+9-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 5M rows with primitives: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
Default cache - write + read 1686 1723 53 3.0 337.2 1.0X
Default cache - write + read (uncompressed) 1045 1065 27 4.8 209.1 1.6X
Arrow cache - write + read 1268 1305 53 3.9 253.6 1.3X
Arrow cache - write + read (zstd level -1) 1724 1725 1 2.9 344.8 1.0X
Arrow cache - write + read (zstd level 1) 1770 1794 34 2.8 354.0 1.0X
Arrow cache - write + read (zstd level 3) 1857 1893 50 2.7 371.4 0.9X


================================================================================================
Cache then filter
================================================================================================

OpenJDK 64-Bit Server VM 25.0.3+9-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 5M rows, then filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Default cache - filter 1426 1432 8 3.5 285.3 1.0X
Default cache - filter (uncompressed) 1252 1274 31 4.0 250.4 1.1X
Arrow cache - filter 1289 1295 8 3.9 257.8 1.1X
Arrow cache - filter (zstd level -1) 1712 1716 7 2.9 342.4 0.8X
Arrow cache - filter (zstd level 1) 1747 1759 16 2.9 349.5 0.8X
Arrow cache - filter (zstd level 3) 1812 1848 50 2.8 362.4 0.8X


================================================================================================
Cache columnar input (Parquet)
================================================================================================

OpenJDK 64-Bit Server VM 25.0.3+9-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 2M rows from Parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------
Default cache - columnar input 1461 1486 35 1.4 730.6 1.0X
Default cache - columnar input (uncompressed) 1219 1227 12 1.6 609.3 1.2X
Arrow cache - columnar input 1253 1273 27 1.6 626.7 1.2X
Arrow cache - columnar input (zstd level -1) 1448 1460 17 1.4 723.8 1.0X
Arrow cache - columnar input (zstd level 1) 1504 1504 0 1.3 752.0 1.0X
Arrow cache - columnar input (zstd level 3) 1578 1587 13 1.3 788.9 0.9X


================================================================================================
Re-cache Arrow cached data (zero-copy test)
================================================================================================

OpenJDK 64-Bit Server VM 25.0.3+9-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Re-cache 2M rows (zero-copy): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------
Default cache - cache a cached DF 386 409 28 5.2 193.1 1.0X
Default cache - cache a cached DF (uncompressed) 194 217 26 10.3 96.8 2.0X
Arrow cache - cache a cached DF (zero-copy) 132 144 10 15.2 65.9 2.9X
Arrow cache - cache a cached DF (zstd level -1) 321 324 7 6.2 160.3 1.2X
Arrow cache - cache a cached DF (zstd level 1) 333 341 7 6.0 166.7 1.2X
Arrow cache - cache a cached DF (zstd level 3) 350 356 12 5.7 174.8 1.1X


================================================================================================
Cache with column pruning (select 1 of 20 columns)
================================================================================================

OpenJDK 64-Bit Server VM 25.0.3+9-LTS on Linux 6.17.0-1018-azure
AMD EPYC 7763 64-Core Processor
Cache 5M rows, select 1 column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------
Default cache - select 1 of 20 columns 9310 9426 164 0.5 1862.0 1.0X
Default cache - select 1 of 20 (uncompressed) 3929 3994 92 1.3 785.7 2.4X
Arrow cache - select 1 of 20 5150 5225 106 1.0 1030.0 1.8X
Arrow cache - select 1 of 20 (zstd level -1) 9265 9376 156 0.5 1853.1 1.0X
Arrow cache - select 1 of 20 (zstd level 1) 9296 9351 78 0.5 1859.3 1.0X
Arrow cache - select 1 of 20 (zstd level 3) 9970 9982 18 0.5 1994.0 0.9X



Loading