[WIP] CORE|SPARK: POC Read Restrictions E2E #16082
Draft
singhpk234 wants to merge 42 commits into apache:main from
Conversation
Propose a new structure
Flatten Projection into Action with an "action" discriminator property, matching the BaseUpdate pattern. Each Action subtype now uses allOf to extend Action with a const action value, making them distinguishable in JSON.
…lyExpression
- Replace 4 original actions (mask-hash-sha256, replace-with-null, mask-alphanumeric, apply-transform) with 10 predefined action types: mask-alphanum, mask-to-default, replace-with-null, show-first-4, show-last-4, truncate-to-year, truncate-to-month, sha-256-global, sha-256-query-local, apply-expression
- Add discriminator mapping for all action subtypes
- Add detailed descriptions for each action, including type applicability, encoding rules for the SHA-256 variants, and type-specific defaults
- Replace the Term-based ApplyTransform with an Expression-based ApplyExpression
- Update required-column-projections rules for the action model
- Regenerate Python from the updated YAML spec
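A minimal sketch of how a client might dispatch on that discriminator, assuming Jackson (which Iceberg's parsers use) and the ten action names listed above; this is illustrative, not ActionParser's actual structure:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Set;

// Hypothetical sketch: read the "action" discriminator and route to a
// per-type parser, the way the flattened allOf/const schema implies.
public class ActionDispatchSketch {
  private static final Set<String> KNOWN_ACTIONS =
      Set.of(
          "mask-alphanum", "mask-to-default", "replace-with-null",
          "show-first-4", "show-last-4", "truncate-to-year",
          "truncate-to-month", "sha-256-global", "sha-256-query-local",
          "apply-expression");

  public static String dispatch(String json) throws Exception {
    JsonNode node = new ObjectMapper().readTree(json);
    String action = node.get("action").asText(); // the discriminator property
    return KNOWN_ACTIONS.contains(action)
        ? "parse as " + action
        : "unknown action: " + action; // later commits preserve these losslessly
  }

  public static void main(String[] args) throws Exception {
    System.out.println(dispatch("{\"action\": \"mask-alphanum\"}"));
  }
}
```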
- Mark ApplyExpression.expression as required
- Fix the SHA-256 binary output type from binary(32) to binary
- Add a NULL-in/NULL-out global rule for all actions
- Specify Unicode code point semantics for masking actions
- Add signed two's-complement encoding for SHA-256 int/long output
- Add a UTC truncation rule for timestamptz types
- Clarify empty-ReadRestrictions equivalence and server type validation
- Specify a minimum 16-byte salt for sha-256-query-local
- Clarify that unrecognized action types trigger a MUST-fail
- Add "Applicable to: all data types" for ApplyExpression
- Regenerate Python from the updated YAML spec
A SHA-256 digest truncated to 16 bytes does not produce valid RFC 4122 UUIDs (the version and variant bits are not set), so uuid is removed from the input encoding, the output encoding, and the applicable types for both the sha-256-global and sha-256-query-local actions.
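To see why, check the two RFC 4122 marker fields on a truncated digest; this standalone check is illustrative, not project code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// A SHA-256 digest truncated to 16 bytes carries effectively random bits
// where RFC 4122 pins the version and variant fields.
public class TruncatedDigestUuidCheck {
  public static void main(String[] args) throws Exception {
    byte[] digest =
        MessageDigest.getInstance("SHA-256")
            .digest("example".getBytes(StandardCharsets.UTF_8));
    int version = (digest[6] >> 4) & 0x0F; // must be a defined version (e.g. 4)
    int variant = (digest[8] >> 6) & 0x03; // must be binary 10 for RFC 4122
    System.out.printf("version nibble=%d, variant bits=%d%n", version, variant);
    // With probability ~15/16 the version is wrong and ~1/2 the variant is
    // wrong, so truncated digests are almost never well-formed UUIDs.
  }
}
```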
Collapse 10 Spark Catalyst expression classes into a single generic IcebergRestricted wrapper + an Actions.bind(Action, Type) factory that returns SerializableFunctions. Engine integrations (Spark, Flink, Trino) can now reuse the spec-defined masking semantics instead of reimplementing them per-engine. Follows the existing Transforms/Expressions bind pattern. Also fixes row-filter evaluation order so filters run on original values before masks are applied, matching the spec.
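A minimal sketch of the bind-factory shape, assuming it switches on the action and returns a serializable per-value function; a stand-in interface is defined here so the sketch compiles alone (Iceberg's real SerializableFunction lives in core), and the masking semantics shown are assumptions:

```java
import java.io.Serializable;
import java.util.function.Function;

public class ActionsBindSketch {
  // stand-in for Iceberg's SerializableFunction
  interface SerializableFunction<S, T> extends Function<S, T>, Serializable {}

  // Hypothetical factory: the real Actions.bind takes (Action, Type) and
  // validates type applicability before choosing an implementation.
  static SerializableFunction<Object, Object> bind(String actionType) {
    switch (actionType) {
      case "replace-with-null":
        return value -> null;
      case "mask-alphanum":
        // assumed masking semantics, for illustration only
        return value ->
            value == null ? null : ((String) value).replaceAll("[A-Za-z0-9]", "*");
      default:
        throw new UnsupportedOperationException("Cannot bind action: " + actionType);
    }
  }

  public static void main(String[] args) {
    System.out.println(bind("mask-alphanum").apply("abc-123")); // ***-***
  }
}
```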
The spec mandated 9999-12-31T00:00:00 as the mask-to-default sentinel for timestamp_ns and timestamptz_ns, but that value overflows a 64-bit signed integer as nanoseconds from epoch (max representable is ~2262-04-11). Engines MUST use exactly the listed default, so specifying an unrepresentable value broke cross-engine consistency. Use 2261-12-31T00:00:00 as the closest clean far-future sentinel that fits. Add a note explaining the overflow so implementers don't mistake it for a typo.
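The overflow is easy to verify with java.time; this check is illustrative:

```java
import java.time.Instant;

// The largest instant representable as signed 64-bit nanoseconds from the
// Unix epoch, i.e. the ceiling for timestamp_ns / timestamptz_ns values.
public class TimestampNsOverflowCheck {
  public static void main(String[] args) {
    Instant maxNanos = Instant.EPOCH.plusNanos(Long.MAX_VALUE);
    System.out.println(maxNanos); // 2262-04-11T23:47:16.854775807Z
    // 9999-12-31T00:00:00 lies far beyond this ceiling, so it cannot be a
    // timestamp_ns sentinel; 2261-12-31T00:00:00 fits with margin.
  }
}
```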
Introduces the ReadRestrictions data model (Action, ReadRestrictions, ReadRestrictionsAware), JSON parsers (ActionParser, ReadRestrictionsParser), and the LoadTable response plumbing that threads server-provided restrictions through BaseTable and RESTTable into the Spark SparkTable. Pairs with the Actions factory and Spark IcebergRestricted expression already on this branch to deliver end-to-end enforcement.
…lter
Drop CodegenFallback from both expressions and implement doGenCode:
- IcebergRestricted: emit inline Java for child eval, Spark-to-Iceberg type conversion, mask.apply via addReferenceObj, and Iceberg-to-Spark type conversion. Primitive cases collapse into one branch via CodeGenerator.boxedType + primitiveTypeName. ByteBuffer and Decimal conversions route through a companion IcebergRestrictedCodegenSupport object to sidestep Janino's limited covariant-return handling.
- IcebergRowFilterExpr: delegate doGenCode to the child's genCode so the filter participates in the whole-stage loop, while prettyName, sql, simpleString, and toString keep EXPLAIN output opaque.

Add seven codegen tests exercising the compiled path (asserting the UnsafeProjection is not the interpreted fallback) across String, Integer, Long, Binary, Decimal, null input, and RowFilter.
The rule currently rewrites only top-level column projections; an action with a fieldId pointing to a nested field (or to a fieldId absent from the current schema) would otherwise be silently dropped, letting the unmasked column through. Validate up front and throw IllegalStateException. The spec permits actions on any fieldId; this is a temporary rule-level limitation, not a spec restriction, and the comment notes to lift it when nested struct-path rewriting is implemented.
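A hedged sketch of that fail-closed check, assuming Iceberg's Schema API (findField by id, columns() for top-level fields); the actual validation lives inside the Catalyst rule, not a helper like this:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class TopLevelFieldCheckSketch {
  // Hypothetical helper: reject restrictions the rule cannot yet enforce
  // instead of silently dropping them.
  static void validateTopLevel(Schema schema, int fieldId) {
    Types.NestedField field = schema.findField(fieldId);
    if (field == null) {
      throw new IllegalStateException("Restricted field id not in schema: " + fieldId);
    }

    boolean topLevel =
        schema.columns().stream().anyMatch(col -> col.fieldId() == fieldId);
    if (!topLevel) {
      // temporary rule-level limitation; lift once nested struct-path
      // rewriting is implemented
      throw new IllegalStateException(
          "Cannot enforce restriction on nested field: " + field.name());
    }
  }
}
```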
First step in migrating read-restriction masking from custom IcebergRestricted expressions (hand-rolled doGenCode) toward Spark's native ScalarFunction + ApplyFunctionExpression path, which gets whole-stage codegen for free. Mirrors the precedent in ReplaceStaticInvoke, which constructs ApplyFunctionExpression programmatically inside a Catalyst rule.

Adds MaskAlphanumFunction following the BucketFunction / TruncateFunction shape: UnboundFunction.bind(StructType) returns a BoundMaskAlphanum with a static invoke(UTF8String) magic method for codegen, delegating to Actions.bind(MaskAlphanum, STRING) for the canonical masking semantics, so the eval and codegen paths share one implementation and Flink/Trino can reuse the core factory unchanged. Registers the function as system.iceberg_mask_alphanum in SparkFunctions.

ApplyReadRestrictions now routes mask-alphanum actions through ApplyFunctionExpression; the other 9 action types fall through to the existing IcebergRestricted path unchanged, so this ships as a non-breaking incremental step. The remaining actions will migrate in follow-up commits (Sha256, MaskToDefault, and Truncate each need per-type bound classes).
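A hedged sketch of the UnboundFunction/ScalarFunction shape this describes; the masking body below is a stand-in, since the real BoundMaskAlphanum delegates to Actions.bind(MaskAlphanum, STRING):

```java
import org.apache.spark.sql.connector.catalog.functions.BoundFunction;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.unsafe.types.UTF8String;

public class MaskAlphanumSketch implements UnboundFunction {
  @Override
  public BoundFunction bind(StructType inputType) {
    return new BoundMaskAlphanum();
  }

  @Override
  public String description() {
    return "iceberg_mask_alphanum(str): masks alphanumeric code points";
  }

  @Override
  public String name() {
    return "iceberg_mask_alphanum";
  }

  public static class BoundMaskAlphanum implements ScalarFunction<UTF8String> {
    // static "magic method": Spark's codegen calls this directly, bypassing
    // produceResult, when the signature matches the bound input types
    public static UTF8String invoke(UTF8String value) {
      if (value == null) {
        return null; // spec: NULL in, NULL out
      }
      // stand-in masking; the real implementation uses Actions.bind
      return UTF8String.fromString(value.toString().replaceAll("[A-Za-z0-9]", "*"));
    }

    @Override
    public DataType[] inputTypes() {
      return new DataType[] {DataTypes.StringType};
    }

    @Override
    public DataType resultType() {
      return DataTypes.StringType;
    }

    @Override
    public String name() {
      return "mask_alphanum";
    }
  }
}
```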
Capture the three reasons on the MaskAlphanumFunction class javadoc so future contributors have the rationale in one place as more masks migrate off IcebergRestricted: native whole-stage codegen via the static invoke magic method, forward compatibility with a future Spark table API that could publish masking expressions as V2 function calls (the same shape used for CHECK constraints today), and cross-engine reuse of Actions.bind in core.
…pat parser
Three of five reviewer-flagged items from the Anton/Ryan synthesis:

1. Salt out of the Catalyst resolution inner loop. ApplyReadRestrictions previously generated a new SecureRandom salt inside the per-attribute match inside resolveOperators, so fixed-point re-entry could regenerate the salt mid-planning and break determinism within a query. Generate the salt once per apply() call and thread it through buildMaskExpression so all Sha256QueryLocal actions in the rewrite share a stable salt.

2. Rename ReadRestrictionsAware -> SupportsReadRestrictions. Match the capability-marker naming used elsewhere in Iceberg (SupportsDistributedScanPlanning, SupportsReplaceView, etc.). Centralize the sniff in TableUtil.readRestrictions(Table) following the TableUtil.formatVersion precedent (apache#11620), so engines have one decision locus and future Flink/Trino integrations don't re-invent the check.

3. Forward compatibility for unknown action discriminators. ActionParser used to throw IllegalArgumentException on an unrecognized action type, blocking older clients from interoperating with newer servers. Introduce Action.Unknown, which preserves (fieldId, rawTypeString) so parsing is lossless; enforcement (Actions.bind, rule binding) fails closed when it encounters Unknown, so silent bypass of unmasked data is impossible (see the sketch after this list). Mirrors ImmutableUnknownViewRepresentation in the view spec. The existing 'unknownActionTypeRejected' test is renamed and updated to assert preservation instead of rejection.

Deferred to follow-up PRs:
- Bind the row filter at parse time (needs schema plumbing in LoadTableResponseParser)
- TestApplyReadRestrictions + TestReadRestrictedQuery (needs a SparkSession fixture)
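A minimal sketch of the preserve-then-fail-closed split from item 3, with assumed field names:

```java
// Hypothetical shapes for illustration; the real Action.Unknown keeps the
// raw discriminator string so re-serialization is lossless.
public class UnknownActionSketch {
  static final class Unknown {
    final int fieldId;
    final String rawType; // preserved verbatim from the wire

    Unknown(int fieldId, String rawType) {
      this.fieldId = fieldId;
      this.rawType = rawType;
    }
  }

  // Parsing never throws on an unrecognized discriminator...
  static Object parseAction(int fieldId, String actionType) {
    return new Unknown(fieldId, actionType); // known-type branches elided
  }

  // ...but enforcement fails closed, so unmasked data can never slip through.
  static void bind(Object action) {
    if (action instanceof Unknown) {
      throw new UnsupportedOperationException(
          "Cannot enforce unknown action: " + ((Unknown) action).rawType);
    }
  }
}
```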
Spec (ReadRestrictions description in open-api/rest-catalog-open-api.yaml:3500):
For all actions, if the input column value is NULL, the output MUST be NULL.
MaskToDefaultFunction was returning the type-specific default for null input
(e.g., 999999999 for int, the empty byte buffer for binary) which violates
that clause. Guard the null case to return null first, matching what every
other action implementation does.
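A minimal sketch of the guarded shape, assuming the MaskToDefaultFunction structure described above:

```java
import java.io.Serializable;
import java.util.function.Function;

// Illustrative: the per-type sentinel is chosen at bind time; the guard
// runs first so NULL input yields NULL output, per the spec clause.
class MaskToDefaultSketch implements Function<Object, Object>, Serializable {
  private final Object typeDefault; // e.g. 999999999 for int

  MaskToDefaultSketch(Object typeDefault) {
    this.typeDefault = typeDefault;
  }

  @Override
  public Object apply(Object value) {
    if (value == null) {
      return null; // the fix: NULL in, NULL out
    }
    return typeDefault;
  }
}
```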
Also update the stale TestActions expectation encoded as
maskToDefaultNullReturnsDefault, renaming to maskToDefaultNullReturnsNull
and referencing the spec clause in a comment.
Fix the class javadoc on ApplyReadRestrictions that described the rewrite shape inside-out (it had Filter-above-Project; the actual code is Project-above-Filter, which is what the spec requires: row filters see original values, projections apply after).
No behavior change for any non-null input. No wire-format change.
…me salt vars
Three cleanups from a /simplify sweep:

1. Extract a NullSafeFunction base class in Actions (sketched below). 11 of the 13 inner function classes started with an identical 'if (value == null) return null;' preamble enforcing the spec clause that requires null pass-through. Lift this into an abstract base that declares applyNonNull(Object); each function implements the non-null path, and the base adds the guard once. ReplaceWithNullFunction (always returns null) and ApplyExpressionFunction (always throws) stay as direct SerializableFunction impls since their semantics make the guard irrelevant.

2. Lock in the cross-file invariant with a bind test. Actions.bind(Action.Unknown, type) must throw — this is what keeps forward-compat parser preservation safe. Previously this was only documented in a comment; now a test in TestActions prevents regressions.

3. Rename salt variables for clarity. In ApplyReadRestrictions, the outer query-scoped salt is now querySalt and the per-action narrowing is actionSalt. This avoids the subtle shadow where salt (query-wide) was threaded through and then a local salt (for one action) was assigned from it.

No behavior change; 28 core tests and 18 spark-extensions tests green.
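A sketch of the extracted base from item 1, using a plain Function stand-in for Iceberg's SerializableFunction:

```java
import java.io.Serializable;
import java.util.function.Function;

// The one shared null guard; each concrete mask implements only the
// non-null path.
abstract class NullSafeFunctionSketch implements Function<Object, Object>, Serializable {
  @Override
  public final Object apply(Object value) {
    if (value == null) {
      return null; // spec: NULL in, NULL out, for every action
    }
    return applyNonNull(value);
  }

  protected abstract Object applyNonNull(Object value);
}
```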
Three blockers on PR apache#16082 CI:

1. api/.../SupportsReadRestrictions.java:30 javadoc error. @link pointed at org.apache.iceberg.TableUtil#readRestrictions, but iceberg-api does not depend on iceberg-core, so the reference can't resolve. Inline the name as @code text instead.

2. spark-extensions/.../ApplyReadRestrictions.scala:138. Line exceeded 120 chars in the buildMaskExpression javadoc, where a fully-qualified @link to Spark's ScalarFunction was inline. Break the link onto its own line.

3. spark-extensions/.../ApplyReadRestrictions.scala:96. scalastyle's 'If block needs braces' on the if-else inside throw new IllegalStateException(...). Pull the condition out and throw from both branches explicitly.

Also fix scalastyle import order in IcebergRestrictionExpressions.scala (CodegenContext swapped with CodeGenerator alphabetically).
ReadRestrictions are currently REST-sourced exclusively. The types were
spread across api/restrictions/ (data + marker) and core/restrictions/
(parsers + bind factory), and BaseTable advertised the capability
interface on every table — even Hive/Hadoop-loaded ones that can never
produce restrictions.
Mirror the credentials precedent (Credential, CredentialParser live
entirely in core/rest/credentials/) and collapse everything into a
single core/rest/restrictions/ package:
api/restrictions/ -> core/rest/restrictions/
  Action.java
  ReadRestrictions.java
  SupportsReadRestrictions.java
core/restrictions/ -> core/rest/restrictions/
  Actions.java
  ActionParser.java
  ReadRestrictionsParser.java
core/src/test/.../restrictions/ -> core/src/test/.../rest/restrictions/
  TestActions.java
  TestReadRestrictionsParser.java
Narrow the capability:
BaseTable keeps the ReadRestrictions field (REST callers pass it in
via the 4-arg constructor) but no longer implements
SupportsReadRestrictions. Only RESTTable advertises the interface,
inheriting the readRestrictions() method from BaseTable to satisfy
it. Instanceof checks now correctly discriminate REST-loaded tables
from non-REST BaseTable subclasses — closing Anton's "capability
spread too widely" review finding.
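A hedged sketch of the single decision locus this gives engines; ReadRestrictions and SupportsReadRestrictions are this PR's own types, and the Optional-returning signature is inferred from the consumer snippet later in this thread:

```java
import java.util.Optional;
import org.apache.iceberg.Table;

public class TableUtilSketch {
  // Approximation of TableUtil.readRestrictions(Table); the real method may
  // differ in naming and null handling.
  static Optional<ReadRestrictions> readRestrictions(Table table) {
    if (table instanceof SupportsReadRestrictions) {
      return ((SupportsReadRestrictions) table).readRestrictions();
    }
    return Optional.empty(); // non-REST tables never carry restrictions
  }
}
```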
No wire-format change, no test logic change — only import paths in
tests moved to match the new package. All existing tests pass (core
restrictions tests, spark-extensions restriction expression tests,
checkstyle, scalastyle, spotless, api javadoc).
…core root
The marker is a table-capability advertisement: 'this loaded Table instance
carries per-principal restrictions.' Living under core/rest/restrictions/
implies REST-only applicability and buries it where engine authors wouldn't
look. Move it to org.apache.iceberg (core root), alongside how
SupportsDistributedScanPlanning lives at api root — both are Table-shape
capabilities, not transport details.
BaseTable deliberately does NOT implement the marker even though it carries
the field: that's the asymmetry from the earlier 'capability spread too
widely' review — every Hive/Hadoop-loaded BaseTable would otherwise
advertise a capability it can never fulfill. Keep the data on BaseTable as
opaque state so RESTSessionCatalog's LOCAL-mode fallback still threads
restrictions through, but only RESTTable advertises the marker via
'implements SupportsReadRestrictions'. The inherited public
readRestrictions() method on BaseTable satisfies the interface contract
when RESTTable (or future REST-loaded subclasses) declare the marker.
Consumer code now reads:
```java
if (table instanceof SupportsReadRestrictions r) {
  r.readRestrictions().ifPresent(restrictions -> ...);
}
```
The data types (ReadRestrictions, Action, Actions factory, parsers) stay
in core/rest/restrictions/ — they remain REST-authored, mirroring how
Credential and CredentialParser live entirely in core/rest/credentials/.
The previous shape had a semantic lie in the type system: BaseTable carried
a ReadRestrictions field (to support the LOCAL scan-planning mode fallback
in RESTSessionCatalog) but deliberately did not implement
SupportsReadRestrictions. Only RESTTable — constructed in SERVER
scan-planning mode — advertised the marker. The default LOCAL mode
returned BaseTable, so 'instanceof SupportsReadRestrictions' quietly
returned false and restrictions were dropped.
Fix by introducing BaseRESTTable as an intermediate class in core/rest/:
BaseTable (core, generic; no restriction concept at all)
  └─ BaseRESTTable (core/rest; field + implements marker)
       └─ RESTTable (core/rest; adds scan planning)
BaseTable goes back to its pre-restrictions shape — removing the 4-arg
constructor, the field, and the readRestrictions() method. Non-REST
catalogs (Hadoop, Hive, Glue, JDBC, Nessie) that construct BaseTable
no longer advertise a capability they have no pathway to produce.
BaseRESTTable carries the field and implements SupportsReadRestrictions
directly. RESTSessionCatalog's three fallback sites (line 598 for LOCAL
loadTable, line 747 for commit, line 1020 for another commit path) now
construct BaseRESTTable instead of BaseTable, so restrictions flow
correctly in all modes.
RESTTable extends BaseRESTTable and inherits both the field and marker —
drops the redundant 'implements SupportsReadRestrictions' declaration.
Consumer code is unchanged:
```java
if (table instanceof SupportsReadRestrictions r) {
  r.readRestrictions().ifPresent(...);
}
```
Now returns true only for REST-loaded tables, never for Hive/Hadoop/etc.
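A hedged sketch of the intermediate class; the constructor shape and Optional normalization are assumptions consistent with the description above, and BaseTable / ReadRestrictions are the PR's own types:

```java
import java.util.Optional;

class BaseRESTTableSketch /* extends BaseTable */ implements SupportsReadRestrictions {
  private final ReadRestrictions readRestrictions; // null when unrestricted

  BaseRESTTableSketch(ReadRestrictions readRestrictions) {
    this.readRestrictions = readRestrictions;
  }

  @Override
  public Optional<ReadRestrictions> readRestrictions() {
    return Optional.ofNullable(readRestrictions);
  }
}
```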
/simplify review flagged two things:

1. BaseRESTTable is instantiated only by RESTSessionCatalog and extended only by RESTTable — both in the same org.apache.iceberg.rest package. Public visibility leaks an implementation detail and enlarges the public API surface without reason. Matches RESTTable, RESTTableOperations, and RESTTableCache — all package-private. Narrow both the class and its constructor to package-private.

2. The SupportsReadRestrictions javadoc referenced 'TableUtil.readRestrictions(Table) (also in iceberg-core)' — the parenthetical is redundant since any reader is already in iceberg-core. Drop it.

Other flagged items were investigated and skipped: the Optional allocation and filter lambda in readRestrictions() are query-planning-time, not per-row, so the allocation cost is negligible; the nullable readRestrictions parameter is intentional and handled correctly by Optional.ofNullable; no existing utility duplicates were found.
'class MaskAlphanum extends Base' tells you nothing at the call site — Base of what? Iceberg's convention across the codebase is Base<Noun> (BaseTable, BaseMetastoreTableOperations, BaseMetadataTable, BaseSessionCatalog, BaseUpdate, BaseReadOnlyTable, and the just-added BaseRESTTable). Follow that convention: 'extends BaseAction' reads as a proper Iceberg base class, and matches how every similarly-abstract shared-state holder is named in the project. No external references to Action.Base — the only references were the 10 subclasses inside Action.java itself.
- Defensive-copy the salt array in Sha256Function to prevent caller mutation
- Return ByteBuffer.duplicate() per row in MaskToDefaultFunction to prevent shared-position corruption across rows
- Validate field-id > 0 at the parse boundary in ActionParser
- Fix IcebergRowFilterExpr.nullable to delegate to child.nullable
- Guard ApplyReadRestrictions with a TreeNodeTag to prevent double-rewrite on fixed-point iterations and ensure salt stability per query
- Single-pass code-point iteration in ShowFirst4/ShowLast4 (drop the toCodePoints helper)
- Use DateTimeUtil for the date truncation functions
- Use MessageDigest.update(ByteBuffer) directly for BINARY sha-256
- Cache maskedFieldIds in the ReadRestrictions constructor
- Normalize BaseRESTTable.readRestrictions at construction time
- Remove the redundant isEmpty guard in LoadTableResponseParser
- Remove .claude/plan.md from tracked files
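A sketch of the first two items, showing the defensive-copy and per-row duplicate() patterns; the names are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BufferSafetySketch {
  private final byte[] salt;           // owned copy, immune to caller mutation
  private final ByteBuffer defaultBuf; // shared content, never shared position

  BufferSafetySketch(byte[] salt, ByteBuffer defaultBuf) {
    this.salt = Arrays.copyOf(salt, salt.length); // defensive copy on the way in
    this.defaultBuf = defaultBuf;
  }

  byte[] saltCopy() {
    return Arrays.copyOf(salt, salt.length); // and on the way out
  }

  ByteBuffer defaultForRow() {
    // duplicate() gives each row an independent position/limit over the same
    // bytes, so one consumer advancing the buffer cannot corrupt the next
    return defaultBuf.duplicate();
  }
}
```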
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request on Apr 24, 2026
Dropping the Spark-side Catalyst rule, ScalarFunction, and opaque mask expressions. This PR now contains only the engine-agnostic core-level enforcement via ReadRestrictionsApplier in the data module. The Spark-specific work remains on feature/load-table-return-policy-wip (PR apache#16082) and can be layered on top once this lands.
Merges the separate Action (data model) and Actions.bind() (behavior factory) hierarchies, following the Transform<S, T> pattern in org.apache.iceberg.transforms (see the sketch below):
- Action<S, T> is now generic, with bind(Type), bind(Type, byte[]), and canBind(Type) default methods
- Each concrete action subclass implements its own bind() and carries type-specific inner SerializableFunction classes
- Six TruncateTo* classes reduced to two outer actions with three shared abstract bases (TruncateDateFn / TruncateTimestampFn / TruncateTimestampNanoFn)
- Sha256Function split into four type-specialized subclasses (Sha256String / Sha256Integer / Sha256Long / Sha256Binary) sharing a Base that holds the salt and digest ThreadLocal — one switch at bind time instead of two per-row switches
- MaskToDefault bind-time dispatch: ByteBuffer-valued types return a function that duplicates per call; non-ByteBuffer types return a simple constant
- LocalTime.MIDNIGHT replaces the chained withHour/withMinute/withSecond/withNano
- Actions.java shrinks from ~460 lines to ~60 lines of shared helpers (the NullSafeFunction base + mapCodePoint)
- ReadRestrictions.columnProjections() now returns List<Action<?, ?>>

Callers migrate from Actions.bind(action, type, salt) to action.bind(type, salt). TestActions, TestReadRestrictionsParser, TestIcebergRestrictionExpressions, MaskAlphanumFunction, and ApplyReadRestrictions are updated accordingly.
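A sketch of the merged interface shape, with Object standing in for org.apache.iceberg.types.Type so it compiles alone; the real signatures follow Transform<S, T>:

```java
import java.io.Serializable;
import java.util.function.Function;

public interface ActionSketch<S, T> extends Serializable {
  interface SerializableFn<S, T> extends Function<S, T>, Serializable {}

  // each concrete action supplies its own type-specialized function
  SerializableFn<S, T> bind(Object type);

  // salted variant; salt-free actions ignore the salt by default
  default SerializableFn<S, T> bind(Object type, byte[] salt) {
    return bind(type);
  }

  // concrete actions narrow this to their applicable types
  default boolean canBind(Object type) {
    return true;
  }
}
```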
… date util
- ApplyExpression.canBind now returns true to match bind() behavior (bind succeeds, apply throws; canBind was contradicting this)
- Sha256String.update casts to String instead of calling toString() (the value is guaranteed to be a String at this point; this documents the invariant)
- TestActions maskToDefaultDate uses DateTimeUtil.daysFromDate for consistency with the production code's DATE_DEFAULT computation
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request on Apr 27, 2026
This is a POC for the Read Restrictions spec
#13879
per the latest discussion: #13879 (comment)