Draft
22 commits
80fc5fe
[core] v4: Add TrackedFileAdapters: bridge TrackedFile to DataFile/De…
anoopj Apr 22, 2026
cb7133d
Clean up tests
anoopj Apr 24, 2026
2f9c29f
Change design such that a DV adapted to DeleteFile
anoopj Apr 28, 2026
1a33130
Make copy safe
anoopj Apr 28, 2026
4ed70e1
Reorder
anoopj Apr 28, 2026
f9b5437
Core, Parquet: Allow for Writing Parquet/Avro Manifests in V4 - Parqu…
RussellSpitzer Mar 15, 2026
14231dd
Core, Spark: Fix V4 Parquet manifest reading issues
RussellSpitzer Mar 25, 2026
38199c5
Core, Parquet: Clean up Parquet manifest code and tests
RussellSpitzer Mar 26, 2026
9e0d9d7
Core: Use instanceof pattern matching in ManifestWriter
RussellSpitzer Mar 27, 2026
2e2aa47
Core: Remove duplicate validateSnapshot overload in TestBase
RussellSpitzer Apr 3, 2026
a6fe885
Address PR review: Parquet reuse, BaseFile copy, V4Metadata builder, …
RussellSpitzer Apr 20, 2026
0276a96
Parquet: Whitelist mutable JDK collections for Parquet list/map scrat…
RussellSpitzer Apr 20, 2026
dec20aa
Checkpoint
anoopj Apr 27, 2026
a5cffb1
Core: Replace manifest list with root manifest for v4
anoopj Apr 27, 2026
1f11a88
Store relative paths in v4 metadata JSON and root manifests
anoopj Apr 27, 2026
8277e44
Core: Relativize all location fields in v4 metadata
anoopj Apr 28, 2026
395365a
Apply MDVs
anoopj Apr 28, 2026
f32caec
Fix bug
anoopj Apr 28, 2026
c475d63
Add testing guide
anoopj Apr 29, 2026
4824888
more fixes
anoopj Apr 30, 2026
412fbb9
Add Spark 3.5 instructions and tests
anoopj May 7, 2026
925d310
fix: carry forward data file entries from flat root manifests
anoopj May 13, 2026
124 changes: 124 additions & 0 deletions V4_Testing_Guide.md
@@ -0,0 +1,124 @@
# Testing V4 Iceberg with Spark

## Build the Iceberg Spark runtime jar

### Spark 4.1

```bash
git checkout v4-amt
./gradlew :iceberg-spark:iceberg-spark-runtime-4.1_2.13:shadowJar
```

The jar is at:
```
spark/v4.1/spark-runtime/build/libs/iceberg-spark-runtime-4.1_2.13-1.11.0-SNAPSHOT.jar
```

### Spark 3.5

```bash
git checkout v4-amt
./gradlew -DsparkVersions=3.5 :iceberg-spark:iceberg-spark-runtime-3.5_2.12:shadowJar
```

The jar is at:
```
spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.11.0-SNAPSHOT.jar
```

## Download Spark

### Spark 4.1.1

```bash
curl -L -o spark-4.1.1-bin-hadoop3.tgz \
https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
tar xzf spark-4.1.1-bin-hadoop3.tgz
```

### Spark 3.5.8

```bash
curl -L -o spark-3.5.8-bin-hadoop3.tgz \
https://archive.apache.org/dist/spark/spark-3.5.8/spark-3.5.8-bin-hadoop3.tgz
tar xzf spark-3.5.8-bin-hadoop3.tgz
```

## Start spark-sql

### Spark 4.1

```bash
spark-4.1.1-bin-hadoop3/bin/spark-sql \
--jars /path/to/iceberg-spark-runtime-4.1_2.13-1.11.0-SNAPSHOT.jar \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=file:///tmp/iceberg-warehouse
```

### Spark 3.5

```bash
spark-3.5.8-bin-hadoop3/bin/spark-sql \
--jars /path/to/iceberg-spark-runtime-3.5_2.12-1.11.0-SNAPSHOT.jar \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=file:///tmp/iceberg-warehouse
```
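The two invocations differ only in the Spark and Scala versions, so they can be generated from one place. The following is a minimal sketch, not part of the branch: the helper names `runtime_jar` and `spark_sql_command` are hypothetical, and the jar layout and the `1.11.0-SNAPSHOT` version are copied from the paths shown in this guide and will need adjusting if the build version changes.

```python
from pathlib import Path

# Scala binary version paired with each Spark line, taken from the jar names in this guide.
SCALA_FOR_SPARK = {"4.1": "2.13", "3.5": "2.12"}


def runtime_jar(repo_root: str, spark: str, iceberg_version: str = "1.11.0-SNAPSHOT") -> Path:
    """Path where the shadowJar task is expected to write the runtime jar (per this guide)."""
    scala = SCALA_FOR_SPARK[spark]
    name = f"iceberg-spark-runtime-{spark}_{scala}-{iceberg_version}.jar"
    return Path(repo_root) / "spark" / f"v{spark}" / "spark-runtime" / "build" / "libs" / name


def spark_sql_command(
    spark_home: str, jar: Path, warehouse: str = "file:///tmp/iceberg-warehouse"
) -> list[str]:
    """Argument list equivalent to the spark-sql invocations shown above."""
    catalog = "spark.sql.catalog.local"
    return [
        str(Path(spark_home) / "bin" / "spark-sql"),
        "--jars", str(jar),
        "--conf", f"{catalog}=org.apache.iceberg.spark.SparkCatalog",
        "--conf", f"{catalog}.type=hadoop",
        "--conf", f"{catalog}.warehouse={warehouse}",
    ]


if __name__ == "__main__":
    # Print the Spark 4.1 command line, assuming the script runs from the repo root.
    jar = runtime_jar(".", "4.1")
    print(" ".join(spark_sql_command("spark-4.1.1-bin-hadoop3", jar)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems when the warehouse path contains spaces.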

## Create a v4 table and query it

```sql
CREATE TABLE local.default.test (id bigint, data string)
USING iceberg TBLPROPERTIES ('format-version' = '4');

INSERT INTO local.default.test VALUES (1, 'a'), (2, 'b'), (3, 'c');

SELECT * FROM local.default.test ORDER BY id;
```

## Inspect the metadata

All paths in v4 metadata are stored as relative paths:

```bash
# metadata JSON -- manifest-list and metadata-log use relative paths
python3 -m json.tool /tmp/iceberg-warehouse/default/test/metadata/v2.metadata.json
```

The root manifest and leaf manifests are Parquet, so read them from inside spark-sql. The globs below match the UUID-based filenames; substitute an exact filename to read a single manifest:

```sql
SELECT * FROM parquet.`file:///tmp/iceberg-warehouse/default/test/metadata/*-root-*.parquet`;
SELECT * FROM parquet.`file:///tmp/iceberg-warehouse/default/test/metadata/*-m0.parquet`;
```
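Reading the pretty-printed JSON by eye makes it easy to miss a stray absolute path. The sketch below walks a parsed metadata file and reports every string that looks absolute. It is a heuristic, not the branch's validation logic: the guide does not list which fields are relativized, so the script checks all string values, and treating a URI scheme or a leading slash as "absolute" is an assumption. A hit is a prompt to look closer, not proof of a bug.

```python
import json
import re
import sys

# Assumption: a URI scheme followed by ":/" (file:/, s3://, hdfs://) marks an absolute location.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:/")


def looks_absolute(value: str) -> bool:
    """Heuristic test for an absolute location: URI scheme or leading slash."""
    return bool(_SCHEME.match(value)) or value.startswith("/")


def find_absolute_paths(node, where="$"):
    """Walk parsed JSON and yield (json_path, value) for every absolute-looking string."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from find_absolute_paths(child, f"{where}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_absolute_paths(child, f"{where}[{index}]")
    elif isinstance(node, str) and looks_absolute(node):
        yield where, node


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 check_paths.py /path/to/vN.metadata.json  (script name is hypothetical)
    with open(sys.argv[1]) as handle:
        for json_path, value in find_absolute_paths(json.load(handle)):
            print(f"{json_path}: {value}")
```

Run against `v2.metadata.json` from the step above, the script prints one line per suspicious value and prints nothing when every string looks relative.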

## Run automated V4 tests

### Spark 4.1

```bash
./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:test \
--tests "org.apache.iceberg.spark.source.TestV4ReadEndToEnd"
```

### Spark 3.5

```bash
./gradlew -DsparkVersions=3.5 :iceberg-spark:iceberg-spark-3.5_2.12:test \
--tests "org.apache.iceberg.spark.source.TestV4ReadEndToEnd"
```

## What's implemented

- V4 Adaptive Metadata Tree: root manifest (Parquet) replaces Avro manifest list
- Relative paths at all levels: metadata JSON, root manifest, leaf manifests
- Metadata deletion vectors (inline bitmaps on tracking struct)
- V4 scan path through ManifestExpander (bypasses ManifestGroup)
- FastAppend write path (INSERT INTO)
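To make the relative-path item concrete, here is a minimal sketch of how a stored path could be relativized on write and resolved on read. It is an illustration only: the function names are hypothetical, and resolving against the table location is an assumption, since this guide does not state which base the branch uses.

```python
import re

# Assumption: a URI scheme followed by ":/" (file:/, s3://) marks an absolute location.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:/")


def relativize(table_location: str, path: str) -> str:
    """Strip the table location prefix; paths outside the table are returned unchanged."""
    base = table_location.rstrip("/") + "/"
    return path[len(base):] if path.startswith(base) else path


def resolve(table_location: str, path: str) -> str:
    """Turn a stored path back into a full location; absolute paths pass through."""
    if _SCHEME.match(path) or path.startswith("/"):
        return path
    return table_location.rstrip("/") + "/" + path


if __name__ == "__main__":
    table = "file:///tmp/iceberg-warehouse/default/test"
    stored = relativize(table, table + "/metadata/v2.metadata.json")
    print(stored)
    print(resolve(table, stored))
```

Letting absolute paths pass through both functions keeps the round trip lossless for files that live outside the table location.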

## Limitations

- Only FastAppend is wired for v4 (INSERT INTO). Overwrites, deletes, and
compaction still use the v2/v3 path.
- Data deletion vectors (colocated with data files) are not yet implemented.
- Metadata deletion vectors are applied on read, but no write path
produces them yet.