feat: add variant type support (Iceberg v3) - non shredding #932

Open
nssalian wants to merge 5 commits into apache:main from nssalian:variant-impl

@nssalian nssalian commented Apr 22, 2026

Part of #589 and #929

Changes

Add first-class support for the Iceberg v3 variant semi-structured type. The binary encoding is handled by Arrow-Go v18.5.2 (already in go.mod) via parquet/variant and arrow/extensions.VariantType - no new dependencies.

SchemaVisitorPerPrimitiveType[T] gains VisitVariant() T (same pattern as VisitTimestampNs/VisitUnknown additions in #594, #605).
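As a rough illustration of the per-type visitor pattern described above, here is a minimal, self-contained sketch. Every name in it (Type, PerTypeVisitor, Dispatch, parquetName) is hypothetical and is not iceberg-go's actual API; it only shows how adding one method to a generic visitor interface forces every implementation to handle the new type.

```go
package main

import "fmt"

// Type is a stand-in for a schema type. It is NOT iceberg-go's Type interface.
type Type interface{ String() string }

type Int64Type struct{}
type StringType struct{}
type VariantType struct{}

func (Int64Type) String() string   { return "long" }
func (StringType) String() string  { return "string" }
func (VariantType) String() string { return "variant" }

// PerTypeVisitor is a hypothetical, trimmed-down analogue of a per-type
// visitor: one method per type, each returning the same result type T.
type PerTypeVisitor[T any] interface {
	VisitInt64() T
	VisitString() T
	VisitVariant() T // the newly added method in this sketch
}

// Dispatch routes a type to the matching visitor method.
func Dispatch[T any](t Type, v PerTypeVisitor[T]) (T, error) {
	switch t.(type) {
	case Int64Type:
		return v.VisitInt64(), nil
	case StringType:
		return v.VisitString(), nil
	case VariantType:
		return v.VisitVariant(), nil
	}
	var zero T
	return zero, fmt.Errorf("unsupported type %s", t)
}

// parquetName is an example visitor that describes a physical layout.
// The strings are illustrative labels only.
type parquetName struct{}

func (parquetName) VisitInt64() string   { return "INT64" }
func (parquetName) VisitString() string  { return "BYTE_ARRAY (UTF8)" }
func (parquetName) VisitVariant() string { return "group{metadata, value}" }

func main() {
	for _, t := range []Type{Int64Type{}, StringType{}, VariantType{}} {
		name, err := Dispatch[string](t, parquetName{})
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", t, name)
	}
}
```

Because the interface grows by one method, any visitor that does not implement VisitVariant stops compiling, which is what makes this kind of change easy to audit.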

Non-shredded path only; shredding is a follow-up.

Testing

  • Ran the build and test locally along with lint
  • Since Spark 3.5.x is the image in the docker-compose, I added a Spark 4 image locally and modified the docker-compose to help test. I'll add a separate PR to add Spark 4.0 support and then update this test. Currently, the integration test doesn't run Spark SQL commands like the other tests do.
docker compose -f internal/recipe/docker-compose.yml exec -T spark-iceberg spark-sql -e "DESCRIBE default.go_variant_events"

yielded

Spark master: local[*], Application Id: local-1776835497117
ts                      bigint                                      
event                   string                                      
payload                 variant    

and reading

docker compose -f internal/recipe/docker-compose.yml exec -T spark-iceberg spark-sql -e "SELECT * FROM default.go_variant_events"

yielded

Spark master: local[*], Application Id: local-1776835403757
1713700000      click   {"target":"button-submit","x":320,"y":480}
1713700005      metric  98.6
1713700010      flag    true
1713700015      NULL    NULL
1713700020      tags    ["prod","us-west-2",7]

@nssalian nssalian marked this pull request as ready for review April 22, 2026 16:17
@nssalian nssalian requested a review from zeroshade as a code owner April 22, 2026 16:17
Contributor

@laskoviymishka laskoviymishka left a comment

Thanks! Solid contribution: pattern matches #594 and #605 cleanly, scope aligns with #929 (non-shredding only), integration test round-trips 5 variant shapes.

Mergeable after three small things:

  1. Fix unknownTypeValidator — error messages say "unknown type field" for Variant. Split the validator or generalize.
  2. Verify against the v3 spec whether Variant must be optional. Unknown must be optional (it means "no value"), but Variant holds real data, and PyIceberg allows required. Confirm and link the spec section.
  3. boundRef[[]byte] for Variant needs a comment explaining the invariant: it only works because ordered/equality predicates reject Variant upstream in createBoundLiteralPredicate / createBoundSetPredicate, so eval() is never called with a real variant value.

LGTM with those.
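To make the invariant in item 3 concrete, here is a minimal sketch of rejecting comparison predicates on a variant field at bind time, so that evaluation code never sees a real variant value. All names (Op, FieldType, bindPredicate, ErrUnsupportedPredicate) are hypothetical stand-ins, not the functions in exprs.go; which operations are allowed on Variant is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Op is a hypothetical predicate operation.
type Op int

const (
	OpIsNull Op = iota
	OpNotNull
	OpEqual
	OpLessThan
)

// FieldType is a hypothetical, two-valued stand-in for a schema type.
type FieldType int

const (
	TypeLong FieldType = iota
	TypeVariant
)

// ErrUnsupportedPredicate marks predicates that cannot be bound.
var ErrUnsupportedPredicate = errors.New("predicate not supported for variant")

// bindPredicate refuses ordered and equality predicates on variant fields.
// Assumption: only null checks are meaningful for a variant column here.
func bindPredicate(field string, typ FieldType, op Op) (string, error) {
	if typ == TypeVariant && op != OpIsNull && op != OpNotNull {
		return "", fmt.Errorf("%w: field %q", ErrUnsupportedPredicate, field)
	}
	return fmt.Sprintf("bound(%s, op=%d)", field, op), nil
}

func main() {
	if _, err := bindPredicate("payload", TypeVariant, OpEqual); err != nil {
		// The comparison is rejected before any evaluation could happen.
		fmt.Println("rejected:", errors.Is(err, ErrUnsupportedPredicate))
	}
	bound, err := bindPredicate("payload", TypeVariant, OpIsNull)
	fmt.Println(bound, err)
}
```

If the rejection lives only in the binding step, a comment at the reference type is what keeps a future caller from bypassing it by accident.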

Comment thread table/metadata_schema_compatibility.go Outdated
Comment thread exprs.go Outdated
Comment thread table/internal/parquet_files.go
@nssalian
Author

Thanks for taking the time to review @laskoviymishka. I'll wait for @zeroshade to have a look and then address the comments together.

Member

@zeroshade zeroshade left a comment

Thanks for this. In general this looks good, just a bunch of nitpicks.

Comment thread table/internal/parquet_files.go Outdated
Comment thread table/internal/parquet_files.go
Comment thread table/internal/parquet_files.go Outdated
Comment thread table/substrait/substrait.go
Comment thread table/metadata_schema_compatibility.go Outdated
Comment thread exprs.go Outdated
Comment thread exprs.go Outdated
Comment thread types.go Outdated
@nssalian
Author

nssalian commented May 2, 2026

Addressed all comments. Variant is no longer a PrimitiveType. I added Variant() on all visitor interfaces, replaced []byte with dedicated boundVariantRef/boundVariantUnaryPred, and split the validator so variant can be required per spec. The colMapping skip now checks that the parent is actually a variant field. Extracted projectVariant. Added tests for unary predicates, nested types (struct/list/map), multi-variant schemas, and projection exclusion.
I tested with Spark 4.x locally and it gave the same results as in the PR description. I'll follow up with a PR to add Spark 4 support in the integration tests and update this integration test to match the others; I had to make a few changes to get it to work locally.
@zeroshade @laskoviymishka PTAL
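For readers following the validator split, a minimal sketch of the idea: Unknown and Variant both need format version 3, but only Unknown is forced to be optional. The field struct, checkField, and the error wording are all hypothetical and are not the code in table/metadata_schema_compatibility.go.

```go
package main

import "fmt"

// field is a hypothetical, flattened view of a schema field.
type field struct {
	name     string
	typ      string // "unknown", "variant", or any other type name
	required bool
}

// checkField applies separate rules per type so each error names the
// right type, instead of reusing an "unknown type field" message.
func checkField(f field, formatVersion int) error {
	switch f.typ {
	case "unknown":
		if formatVersion < 3 {
			return fmt.Errorf("unknown type field %q requires format version 3, got %d", f.name, formatVersion)
		}
		if f.required {
			return fmt.Errorf("unknown type field %q must be optional", f.name)
		}
	case "variant":
		if formatVersion < 3 {
			return fmt.Errorf("variant type field %q requires format version 3, got %d", f.name, formatVersion)
		}
		// No optionality check: a required variant is allowed.
	}
	return nil
}

func main() {
	fields := []field{
		{name: "payload", typ: "variant", required: true},
		{name: "placeholder", typ: "unknown", required: true},
	}
	for _, f := range fields {
		fmt.Printf("%s: %v\n", f.name, checkField(f, 3))
	}
}
```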

Contributor

@laskoviymishka laskoviymishka left a comment

All three comments look addressed now, thanks!

The Variant/Unknown default error is clearer, the boundRef[[]byte] workaround is gone in favor of dedicated Variant refs, and the Parquet stats skip is now limited to Variant sub-columns with a spec comment.

LGTM from my side; I would like to get @zeroshade's review before merging this one.

Only tiny optional nit: maybe add a short spec link/comment for why required Variant is allowed, but I don’t think that should block this.
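The narrowed stats skip mentioned above can be pictured with a small sketch. It assumes a non-shredded variant is stored as a Parquet group with metadata and value leaves, and that column paths are dot-separated; shouldSkipStats and the variantFields set are hypothetical names, not the helpers in table/internal/parquet_files.go.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSkipStats reports whether a leaf column is one of the physical
// sub-columns of a variant field. A leaf merely named "value" or "metadata"
// is not skipped unless its parent is a known variant field.
func shouldSkipStats(leafPath string, variantFields map[string]bool) bool {
	idx := strings.LastIndex(leafPath, ".")
	if idx < 0 {
		return false // top-level leaf, cannot be a variant sub-column
	}
	parent, leaf := leafPath[:idx], leafPath[idx+1:]
	if leaf != "metadata" && leaf != "value" {
		return false
	}
	return variantFields[parent]
}

func main() {
	variants := map[string]bool{"payload": true}
	for _, p := range []string{"ts", "payload.metadata", "payload.value", "attrs.value"} {
		fmt.Printf("%-18s skip=%v\n", p, shouldSkipStats(p, variants))
	}
}
```

Checking the parent rather than only the leaf name is what prevents an ordinary struct with a field called value from silently losing its statistics.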

Comment thread table/internal/parquet_files.go Outdated
Comment thread table/internal/utils.go Outdated
Comment thread exprs.go Outdated
Comment thread exprs.go Outdated
Comment thread exprs.go Outdated
Comment thread literals_test.go

3 participants