[WIP] CORE|SPARK: POC Read Restrictions E2E #16082
Draft
singhpk234 wants to merge 42 commits into apache:main from
Conversation
Propose a new structure
Flatten Projection into Action with an "action" discriminator property, matching the BaseUpdate pattern. Each Action subtype now uses allOf to extend Action with a const action value, making them distinguishable in JSON.
…lyExpression
- Replace 4 original actions (mask-hash-sha256, replace-with-null, mask-alphanumeric, apply-transform) with 10 predefined action types: mask-alphanum, mask-to-default, replace-with-null, show-first-4, show-last-4, truncate-to-year, truncate-to-month, sha-256-global, sha-256-query-local, apply-expression
- Add discriminator mapping for all action subtypes
- Add detailed descriptions for each action, including type applicability, encoding rules for the SHA-256 variants, and type-specific defaults
- Replace the Term-based ApplyTransform with an Expression-based ApplyExpression
- Update required-column-projections rules for the action model
- Regenerate Python from the updated YAML spec
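A minimal sketch of how a client might dispatch on that discriminator, assuming Jackson (which Iceberg's parsers use) and the ten action names listed above; this is illustrative, not ActionParser's actual structure:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Set;

// Hypothetical sketch: read the "action" discriminator and route to a
// per-type parser, the way the flattened allOf/const schema implies.
public class ActionDispatchSketch {
  private static final Set<String> KNOWN_ACTIONS =
      Set.of(
          "mask-alphanum", "mask-to-default", "replace-with-null",
          "show-first-4", "show-last-4", "truncate-to-year",
          "truncate-to-month", "sha-256-global", "sha-256-query-local",
          "apply-expression");

  public static String dispatch(String json) throws Exception {
    JsonNode node = new ObjectMapper().readTree(json);
    String action = node.get("action").asText(); // the discriminator property
    return KNOWN_ACTIONS.contains(action)
        ? "parse as " + action
        : "unknown action: " + action; // later commits preserve these losslessly
  }

  public static void main(String[] args) throws Exception {
    System.out.println(dispatch("{\"action\": \"mask-alphanum\"}"));
  }
}
```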
- Mark ApplyExpression.expression as required
- Fix the SHA-256 binary output type from binary(32) to binary
- Add a NULL-in/NULL-out global rule for all actions
- Specify Unicode code point semantics for masking actions
- Add signed two's-complement encoding for SHA-256 int/long output
- Add a UTC truncation rule for timestamptz types
- Clarify empty-ReadRestrictions equivalence and server type validation
- Specify a minimum 16-byte salt for sha-256-query-local
- Clarify that unrecognized action types trigger a MUST-fail
- Add "Applicable to: all data types" for ApplyExpression
- Regenerate Python from the updated YAML spec
A SHA-256 digest truncated to 16 bytes does not produce valid RFC 4122 UUIDs (the version and variant bits are not set), so uuid is removed from the input encoding, the output encoding, and the applicable types for both the sha-256-global and sha-256-query-local actions.
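To see why, check the two RFC 4122 marker fields on a truncated digest; this standalone check is illustrative, not project code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// A SHA-256 digest truncated to 16 bytes carries effectively random bits
// where RFC 4122 pins the version and variant fields.
public class TruncatedDigestUuidCheck {
  public static void main(String[] args) throws Exception {
    byte[] digest =
        MessageDigest.getInstance("SHA-256")
            .digest("example".getBytes(StandardCharsets.UTF_8));
    int version = (digest[6] >> 4) & 0x0F; // must be a defined version (e.g. 4)
    int variant = (digest[8] >> 6) & 0x03; // must be binary 10 for RFC 4122
    System.out.printf("version nibble=%d, variant bits=%d%n", version, variant);
    // With probability ~15/16 the version is wrong and ~1/2 the variant is
    // wrong, so truncated digests are almost never well-formed UUIDs.
  }
}
```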
Collapse 10 Spark Catalyst expression classes into a single generic IcebergRestricted wrapper + an Actions.bind(Action, Type) factory that returns SerializableFunctions. Engine integrations (Spark, Flink, Trino) can now reuse the spec-defined masking semantics instead of reimplementing them per-engine. Follows the existing Transforms/Expressions bind pattern. Also fixes row-filter evaluation order so filters run on original values before masks are applied, matching the spec.
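A minimal sketch of the bind-factory shape, assuming it switches on the action and returns a serializable per-value function; a stand-in interface is defined here so the sketch compiles alone (Iceberg's real SerializableFunction lives in core), and the masking semantics shown are assumptions:

```java
import java.io.Serializable;
import java.util.function.Function;

public class ActionsBindSketch {
  // stand-in for Iceberg's SerializableFunction
  interface SerializableFunction<S, T> extends Function<S, T>, Serializable {}

  // Hypothetical factory: the real Actions.bind takes (Action, Type) and
  // validates type applicability before choosing an implementation.
  static SerializableFunction<Object, Object> bind(String actionType) {
    switch (actionType) {
      case "replace-with-null":
        return value -> null;
      case "mask-alphanum":
        // assumed masking semantics, for illustration only
        return value ->
            value == null ? null : ((String) value).replaceAll("[A-Za-z0-9]", "*");
      default:
        throw new UnsupportedOperationException("Cannot bind action: " + actionType);
    }
  }

  public static void main(String[] args) {
    System.out.println(bind("mask-alphanum").apply("abc-123")); // ***-***
  }
}
```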
The spec mandated 9999-12-31T00:00:00 as the mask-to-default sentinel for timestamp_ns and timestamptz_ns, but that value overflows a 64-bit signed integer as nanoseconds from epoch (max representable is ~2262-04-11). Engines MUST use exactly the listed default, so specifying an unrepresentable value broke cross-engine consistency. Use 2261-12-31T00:00:00 as the closest clean far-future sentinel that fits. Add a note explaining the overflow so implementers don't mistake it for a typo.
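The overflow is easy to verify with java.time; this check is illustrative:

```java
import java.time.Instant;

// The largest instant representable as signed 64-bit nanoseconds from the
// Unix epoch, i.e. the ceiling for timestamp_ns / timestamptz_ns values.
public class TimestampNsOverflowCheck {
  public static void main(String[] args) {
    Instant maxNanos = Instant.EPOCH.plusNanos(Long.MAX_VALUE);
    System.out.println(maxNanos); // 2262-04-11T23:47:16.854775807Z
    // 9999-12-31T00:00:00 lies far beyond this ceiling, so it cannot be a
    // timestamp_ns sentinel; 2261-12-31T00:00:00 fits with margin.
  }
}
```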
Introduces the ReadRestrictions data model (Action, ReadRestrictions, ReadRestrictionsAware), JSON parsers (ActionParser, ReadRestrictionsParser), and the LoadTable response plumbing that threads server-provided restrictions through BaseTable and RESTTable into the Spark SparkTable. Pairs with the Actions factory and Spark IcebergRestricted expression already on this branch to deliver end-to-end enforcement.
…lter
Drop CodegenFallback from both expressions and implement doGenCode:
- IcebergRestricted: emit inline Java for child eval, Spark-to-Iceberg type conversion, mask.apply via addReferenceObj, and Iceberg-to-Spark type conversion. Primitive cases collapse into one branch via CodeGenerator.boxedType + primitiveTypeName. ByteBuffer and Decimal conversions route through a companion IcebergRestrictedCodegenSupport object to sidestep Janino's limited covariant-return handling.
- IcebergRowFilterExpr: delegate doGenCode to the child's genCode so the filter participates in the whole-stage loop, while prettyName, sql, simpleString, and toString keep EXPLAIN output opaque.

Add seven codegen tests exercising the compiled path (asserting the UnsafeProjection is not the interpreted fallback) across String, Integer, Long, Binary, Decimal, null input, and RowFilter.
The rule currently rewrites only top-level column projections; an action with a fieldId pointing to a nested field (or to a fieldId absent from the current schema) would otherwise be silently dropped, letting the unmasked column through. Validate up front and throw IllegalStateException. The spec permits actions on any fieldId; this is a temporary rule-level limitation, not a spec restriction, and the comment notes to lift it when nested struct-path rewriting is implemented.
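A hedged sketch of that fail-closed check, assuming Iceberg's Schema API (findField by id, columns() for top-level fields); the actual validation lives inside the Catalyst rule, not a helper like this:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class TopLevelFieldCheckSketch {
  // Hypothetical helper: reject restrictions the rule cannot yet enforce
  // instead of silently dropping them.
  static void validateTopLevel(Schema schema, int fieldId) {
    Types.NestedField field = schema.findField(fieldId);
    if (field == null) {
      throw new IllegalStateException("Restricted field id not in schema: " + fieldId);
    }

    boolean topLevel =
        schema.columns().stream().anyMatch(col -> col.fieldId() == fieldId);
    if (!topLevel) {
      // temporary rule-level limitation; lift once nested struct-path
      // rewriting is implemented
      throw new IllegalStateException(
          "Cannot enforce restriction on nested field: " + field.name());
    }
  }
}
```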
First step in migrating read-restriction masking from custom IcebergRestricted expressions (hand-rolled doGenCode) toward Spark's native ScalarFunction + ApplyFunctionExpression path, which gets whole-stage codegen for free. Mirrors the precedent in ReplaceStaticInvoke, which constructs ApplyFunctionExpression programmatically inside a Catalyst rule.

Adds MaskAlphanumFunction following the BucketFunction / TruncateFunction shape: UnboundFunction.bind(StructType) returns a BoundMaskAlphanum with a static invoke(UTF8String) magic method for codegen, delegating to Actions.bind(MaskAlphanum, STRING) for the canonical masking semantics, so the eval and codegen paths share one implementation and Flink/Trino can reuse the core factory unchanged. Registers the function as system.iceberg_mask_alphanum in SparkFunctions.

ApplyReadRestrictions now routes mask-alphanum actions through ApplyFunctionExpression; the other 9 action types fall through to the existing IcebergRestricted path unchanged, so this ships as a non-breaking incremental step. The remaining actions will migrate in follow-up commits (Sha256, MaskToDefault, and Truncate each need per-type bound classes).
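A hedged sketch of the UnboundFunction/ScalarFunction shape this describes; the masking body below is a stand-in, since the real BoundMaskAlphanum delegates to Actions.bind(MaskAlphanum, STRING):

```java
import org.apache.spark.sql.connector.catalog.functions.BoundFunction;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.unsafe.types.UTF8String;

public class MaskAlphanumSketch implements UnboundFunction {
  @Override
  public BoundFunction bind(StructType inputType) {
    return new BoundMaskAlphanum();
  }

  @Override
  public String description() {
    return "iceberg_mask_alphanum(str): masks alphanumeric code points";
  }

  @Override
  public String name() {
    return "iceberg_mask_alphanum";
  }

  public static class BoundMaskAlphanum implements ScalarFunction<UTF8String> {
    // static "magic method": Spark's codegen calls this directly, bypassing
    // produceResult, when the signature matches the bound input types
    public static UTF8String invoke(UTF8String value) {
      if (value == null) {
        return null; // spec: NULL in, NULL out
      }
      // stand-in masking; the real implementation uses Actions.bind
      return UTF8String.fromString(value.toString().replaceAll("[A-Za-z0-9]", "*"));
    }

    @Override
    public DataType[] inputTypes() {
      return new DataType[] {DataTypes.StringType};
    }

    @Override
    public DataType resultType() {
      return DataTypes.StringType;
    }

    @Override
    public String name() {
      return "mask_alphanum";
    }
  }
}
```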
Capture the three reasons on the MaskAlphanumFunction class javadoc so future contributors have the rationale in one place as more masks migrate off IcebergRestricted: native whole-stage codegen via the static invoke magic method, forward compatibility with a future Spark table API that could publish masking expressions as V2 function calls (the same shape used for CHECK constraints today), and cross-engine reuse of Actions.bind in core.
…pat parser
Three of five reviewer-flagged items from the Anton/Ryan synthesis:

1. Salt out of the Catalyst resolution inner loop. ApplyReadRestrictions previously generated a new SecureRandom salt inside the per-attribute match inside resolveOperators, so fixed-point re-entry could regenerate the salt mid-planning and break determinism within a query. Generate the salt once per apply() call and thread it through buildMaskExpression so all Sha256QueryLocal actions in the rewrite share a stable salt.

2. Rename ReadRestrictionsAware -> SupportsReadRestrictions. Match the capability-marker naming used elsewhere in Iceberg (SupportsDistributedScanPlanning, SupportsReplaceView, etc.). Centralize the sniff in TableUtil.readRestrictions(Table) following the TableUtil.formatVersion precedent (apache#11620), so engines have one decision locus and future Flink/Trino integrations don't re-invent the check.

3. Forward compatibility for unknown action discriminators. ActionParser used to throw IllegalArgumentException on an unrecognized action type, blocking older clients from interoperating with newer servers. Introduce Action.Unknown, which preserves (fieldId, rawTypeString) so parsing is lossless; enforcement (Actions.bind, rule binding) fails closed when it encounters Unknown, so silent bypass of unmasked data is impossible (see the sketch after this list). Mirrors ImmutableUnknownViewRepresentation in the view spec. The existing 'unknownActionTypeRejected' test is renamed and updated to assert preservation instead of rejection.

Deferred to follow-up PRs:
- Bind the row filter at parse time (needs schema plumbing in LoadTableResponseParser)
- TestApplyReadRestrictions + TestReadRestrictedQuery (needs a SparkSession fixture)
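A minimal sketch of the preserve-then-fail-closed split from item 3, with assumed field names:

```java
// Hypothetical shapes for illustration; the real Action.Unknown keeps the
// raw discriminator string so re-serialization is lossless.
public class UnknownActionSketch {
  static final class Unknown {
    final int fieldId;
    final String rawType; // preserved verbatim from the wire

    Unknown(int fieldId, String rawType) {
      this.fieldId = fieldId;
      this.rawType = rawType;
    }
  }

  // Parsing never throws on an unrecognized discriminator...
  static Object parseAction(int fieldId, String actionType) {
    return new Unknown(fieldId, actionType); // known-type branches elided
  }

  // ...but enforcement fails closed, so unmasked data can never slip through.
  static void bind(Object action) {
    if (action instanceof Unknown) {
      throw new UnsupportedOperationException(
          "Cannot enforce unknown action: " + ((Unknown) action).rawType);
    }
  }
}
```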
Spec (ReadRestrictions description in open-api/rest-catalog-open-api.yaml:3500):
For all actions, if the input column value is NULL, the output MUST be NULL.
MaskToDefaultFunction was returning the type-specific default for null input
(e.g., 999999999 for int, the empty byte buffer for binary) which violates
that clause. Guard the null case to return null first, matching what every
other action implementation does.
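A minimal sketch of the guarded shape, assuming the MaskToDefaultFunction structure described above:

```java
import java.io.Serializable;
import java.util.function.Function;

// Illustrative: the per-type sentinel is chosen at bind time; the guard
// runs first so NULL input yields NULL output, per the spec clause.
class MaskToDefaultSketch implements Function<Object, Object>, Serializable {
  private final Object typeDefault; // e.g. 999999999 for int

  MaskToDefaultSketch(Object typeDefault) {
    this.typeDefault = typeDefault;
  }

  @Override
  public Object apply(Object value) {
    if (value == null) {
      return null; // the fix: NULL in, NULL out
    }
    return typeDefault;
  }
}
```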
Also update the stale TestActions expectation encoded as
maskToDefaultNullReturnsDefault, renaming to maskToDefaultNullReturnsNull
and referencing the spec clause in a comment.
Fix the class javadoc on ApplyReadRestrictions that described the rewrite shape inside-out (it had Filter-above-Project; the actual code is Project-above-Filter, which is what the spec requires: row filters see original values, projections apply after).
No behavior change for any non-null input. No wire-format change.
…me salt vars
Three cleanups from a /simplify sweep:

1. Extract a NullSafeFunction base class in Actions (sketched below). 11 of the 13 inner function classes started with an identical 'if (value == null) return null;' preamble enforcing the spec clause that requires null pass-through. Lift this into an abstract base that declares applyNonNull(Object); each function implements the non-null path, and the base adds the guard once. ReplaceWithNullFunction (always returns null) and ApplyExpressionFunction (always throws) stay as direct SerializableFunction impls since their semantics make the guard irrelevant.

2. Lock in the cross-file invariant with a bind test. Actions.bind(Action.Unknown, type) must throw — this is what keeps forward-compat parser preservation safe. Previously this was only documented in a comment; now a test in TestActions prevents regressions.

3. Rename salt variables for clarity. In ApplyReadRestrictions, the outer query-scoped salt is now querySalt and the per-action narrowing is actionSalt. This avoids the subtle shadow where salt (query-wide) was threaded through and then a local salt (for one action) was assigned from it.

No behavior change; 28 core tests and 18 spark-extensions tests green.
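A sketch of the extracted base from item 1, using a plain Function stand-in for Iceberg's SerializableFunction:

```java
import java.io.Serializable;
import java.util.function.Function;

// The one shared null guard; each concrete mask implements only the
// non-null path.
abstract class NullSafeFunctionSketch implements Function<Object, Object>, Serializable {
  @Override
  public final Object apply(Object value) {
    if (value == null) {
      return null; // spec: NULL in, NULL out, for every action
    }
    return applyNonNull(value);
  }

  protected abstract Object applyNonNull(Object value);
}
```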
Three blockers on PR apache#16082 CI:

1. api/.../SupportsReadRestrictions.java:30 javadoc error. @link pointed at org.apache.iceberg.TableUtil#readRestrictions, but iceberg-api does not depend on iceberg-core, so the reference can't resolve. Inline the name as @code text instead.

2. spark-extensions/.../ApplyReadRestrictions.scala:138. Line exceeded 120 chars in the buildMaskExpression javadoc, where a fully-qualified @link to Spark's ScalarFunction was inline. Break the link onto its own line.

3. spark-extensions/.../ApplyReadRestrictions.scala:96. scalastyle's 'If block needs braces' on the if-else inside throw new IllegalStateException(...). Pull the condition out and throw from both branches explicitly.

Also fix scalastyle import order in IcebergRestrictionExpressions.scala (CodegenContext swapped with CodeGenerator alphabetically).
ReadRestrictions are currently REST-sourced exclusively. The types were
spread across api/restrictions/ (data + marker) and core/restrictions/
(parsers + bind factory), and BaseTable advertised the capability
interface on every table — even Hive/Hadoop-loaded ones that can never
produce restrictions.
Mirror the credentials precedent (Credential, CredentialParser live
entirely in core/rest/credentials/) and collapse everything into a
single core/rest/restrictions/ package:
api/restrictions/ -> core/rest/restrictions/
  Action.java
  ReadRestrictions.java
  SupportsReadRestrictions.java
core/restrictions/ -> core/rest/restrictions/
  Actions.java
  ActionParser.java
  ReadRestrictionsParser.java
core/src/test/.../restrictions/ -> core/src/test/.../rest/restrictions/
  TestActions.java
  TestReadRestrictionsParser.java
Narrow the capability:
BaseTable keeps the ReadRestrictions field (REST callers pass it in
via the 4-arg constructor) but no longer implements
SupportsReadRestrictions. Only RESTTable advertises the interface,
inheriting the readRestrictions() method from BaseTable to satisfy
it. Instanceof checks now correctly discriminate REST-loaded tables
from non-REST BaseTable subclasses — closing Anton's "capability
spread too widely" review finding.
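A hedged sketch of the single decision locus this gives engines; ReadRestrictions and SupportsReadRestrictions are this PR's own types, and the Optional-returning signature is inferred from the consumer snippet later in this thread:

```java
import java.util.Optional;
import org.apache.iceberg.Table;

public class TableUtilSketch {
  // Approximation of TableUtil.readRestrictions(Table); the real method may
  // differ in naming and null handling.
  static Optional<ReadRestrictions> readRestrictions(Table table) {
    if (table instanceof SupportsReadRestrictions) {
      return ((SupportsReadRestrictions) table).readRestrictions();
    }
    return Optional.empty(); // non-REST tables never carry restrictions
  }
}
```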
No wire-format change, no test logic change — only import paths in
tests moved to match the new package. All existing tests pass (core
restrictions tests, spark-extensions restriction expression tests,
checkstyle, scalastyle, spotless, api javadoc).
…core root
The marker is a table-capability advertisement: 'this loaded Table instance
carries per-principal restrictions.' Living under core/rest/restrictions/
implies REST-only applicability and buries it where engine authors wouldn't
look. Move it to org.apache.iceberg (core root), alongside how
SupportsDistributedScanPlanning lives at api root — both are Table-shape
capabilities, not transport details.
BaseTable deliberately does NOT implement the marker even though it carries
the field: that's the asymmetry from the earlier 'capability spread too
widely' review — every Hive/Hadoop-loaded BaseTable would otherwise
advertise a capability it can never fulfill. Keep the data on BaseTable as
opaque state so RESTSessionCatalog's LOCAL-mode fallback still threads
restrictions through, but only RESTTable advertises the marker via
'implements SupportsReadRestrictions'. The inherited public
readRestrictions() method on BaseTable satisfies the interface contract
when RESTTable (or future REST-loaded subclasses) declare the marker.
Consumer code now reads:
```java
if (table instanceof SupportsReadRestrictions r) {
  r.readRestrictions().ifPresent(restrictions -> ...);
}
```
The data types (ReadRestrictions, Action, Actions factory, parsers) stay
in core/rest/restrictions/ — they remain REST-authored, mirroring how
Credential and CredentialParser live entirely in core/rest/credentials/.
The previous shape had a semantic lie in the type system: BaseTable carried
a ReadRestrictions field (to support the LOCAL scan-planning mode fallback
in RESTSessionCatalog) but deliberately did not implement
SupportsReadRestrictions. Only RESTTable — constructed in SERVER
scan-planning mode — advertised the marker. The default LOCAL mode
returned BaseTable, so 'instanceof SupportsReadRestrictions' quietly
returned false and restrictions were dropped.
Fix by introducing BaseRESTTable as an intermediate class in core/rest/:
BaseTable (core, generic; no restriction concept at all)
  └─ BaseRESTTable (core/rest; field + implements marker)
       └─ RESTTable (core/rest; adds scan planning)
BaseTable goes back to its pre-restrictions shape — removing the 4-arg
constructor, the field, and the readRestrictions() method. Non-REST
catalogs (Hadoop, Hive, Glue, JDBC, Nessie) that construct BaseTable
no longer advertise a capability they have no pathway to produce.
BaseRESTTable carries the field and implements SupportsReadRestrictions
directly. RESTSessionCatalog's three fallback sites (line 598 for LOCAL
loadTable, line 747 for commit, line 1020 for another commit path) now
construct BaseRESTTable instead of BaseTable, so restrictions flow
correctly in all modes.
RESTTable extends BaseRESTTable and inherits both the field and marker —
drops the redundant 'implements SupportsReadRestrictions' declaration.
Consumer code is unchanged:
```java
if (table instanceof SupportsReadRestrictions r) {
  r.readRestrictions().ifPresent(...);
}
```
Now returns true only for REST-loaded tables, never for Hive/Hadoop/etc.
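A hedged sketch of the intermediate class; the constructor shape and Optional normalization are assumptions consistent with the description above, and BaseTable / ReadRestrictions are the PR's own types:

```java
import java.util.Optional;

class BaseRESTTableSketch /* extends BaseTable */ implements SupportsReadRestrictions {
  private final ReadRestrictions readRestrictions; // null when unrestricted

  BaseRESTTableSketch(ReadRestrictions readRestrictions) {
    this.readRestrictions = readRestrictions;
  }

  @Override
  public Optional<ReadRestrictions> readRestrictions() {
    return Optional.ofNullable(readRestrictions);
  }
}
```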
/simplify review flagged two things:

1. BaseRESTTable is instantiated only by RESTSessionCatalog and extended only by RESTTable — both in the same org.apache.iceberg.rest package. Public visibility leaks an implementation detail and enlarges the public API surface without reason. Matches RESTTable, RESTTableOperations, and RESTTableCache — all package-private. Narrow both the class and its constructor to package-private.

2. The SupportsReadRestrictions javadoc referenced 'TableUtil.readRestrictions(Table) (also in iceberg-core)' — the parenthetical is redundant since any reader is already in iceberg-core. Drop it.

Other flagged items were investigated and skipped: the Optional allocation and filter lambda in readRestrictions() are query-planning-time, not per-row, so the allocation cost is negligible; the nullable readRestrictions parameter is intentional and handled correctly by Optional.ofNullable; no existing utility duplicates were found.
'class MaskAlphanum extends Base' tells you nothing at the call site — Base of what? Iceberg's convention across the codebase is Base<Noun> (BaseTable, BaseMetastoreTableOperations, BaseMetadataTable, BaseSessionCatalog, BaseUpdate, BaseReadOnlyTable, and the just-added BaseRESTTable). Follow that convention: 'extends BaseAction' reads as a proper Iceberg base class, and matches how every similarly-abstract shared-state holder is named in the project. No external references to Action.Base — the only references were the 10 subclasses inside Action.java itself.
- Defensive-copy the salt array in Sha256Function to prevent caller mutation
- Return ByteBuffer.duplicate() per row in MaskToDefaultFunction to prevent shared-position corruption across rows
- Validate field-id > 0 at the parse boundary in ActionParser
- Fix IcebergRowFilterExpr.nullable to delegate to child.nullable
- Guard ApplyReadRestrictions with a TreeNodeTag to prevent double-rewrite on fixed-point iterations and ensure salt stability per query
- Single-pass code-point iteration in ShowFirst4/ShowLast4 (drop the toCodePoints helper)
- Use DateTimeUtil for the date truncation functions
- Use MessageDigest.update(ByteBuffer) directly for BINARY sha-256
- Cache maskedFieldIds in the ReadRestrictions constructor
- Normalize BaseRESTTable.readRestrictions at construction time
- Remove the redundant isEmpty guard in LoadTableResponseParser
- Remove .claude/plan.md from tracked files
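A sketch of the first two items, showing the defensive-copy and per-row duplicate() patterns; the names are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BufferSafetySketch {
  private final byte[] salt;           // owned copy, immune to caller mutation
  private final ByteBuffer defaultBuf; // shared content, never shared position

  BufferSafetySketch(byte[] salt, ByteBuffer defaultBuf) {
    this.salt = Arrays.copyOf(salt, salt.length); // defensive copy on the way in
    this.defaultBuf = defaultBuf;
  }

  byte[] saltCopy() {
    return Arrays.copyOf(salt, salt.length); // and on the way out
  }

  ByteBuffer defaultForRow() {
    // duplicate() gives each row an independent position/limit over the same
    // bytes, so one consumer advancing the buffer cannot corrupt the next
    return defaultBuf.duplicate();
  }
}
```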
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request on Apr 24, 2026
Dropping the Spark-side Catalyst rule, ScalarFunction, and opaque mask expressions. This PR now contains only the engine-agnostic core-level enforcement via ReadRestrictionsApplier in the data module. The Spark-specific work remains on feature/load-table-return-policy-wip (PR apache#16082) and can be layered on top once this lands.
Merges the separate Action (data model) and Actions.bind() (behavior factory) hierarchies, following the Transform<S, T> pattern in org.apache.iceberg.transforms (see the sketch below):
- Action<S, T> is now generic, with bind(Type), bind(Type, byte[]), and canBind(Type) default methods
- Each concrete action subclass implements its own bind() and carries type-specific inner SerializableFunction classes
- Six TruncateTo* classes reduced to two outer actions with three shared abstract bases (TruncateDateFn / TruncateTimestampFn / TruncateTimestampNanoFn)
- Sha256Function split into four type-specialized subclasses (Sha256String / Sha256Integer / Sha256Long / Sha256Binary) sharing a Base that holds the salt and digest ThreadLocal — one switch at bind time instead of two per-row switches
- MaskToDefault bind-time dispatch: ByteBuffer-valued types return a function that duplicates per call; non-ByteBuffer types return a simple constant
- LocalTime.MIDNIGHT replaces the chained withHour/withMinute/withSecond/withNano
- Actions.java shrinks from ~460 lines to ~60 lines of shared helpers (the NullSafeFunction base + mapCodePoint)
- ReadRestrictions.columnProjections() now returns List<Action<?, ?>>

Callers migrate from Actions.bind(action, type, salt) to action.bind(type, salt). TestActions, TestReadRestrictionsParser, TestIcebergRestrictionExpressions, MaskAlphanumFunction, and ApplyReadRestrictions are updated accordingly.
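A sketch of the merged interface shape, with Object standing in for org.apache.iceberg.types.Type so it compiles alone; the real signatures follow Transform<S, T>:

```java
import java.io.Serializable;
import java.util.function.Function;

public interface ActionSketch<S, T> extends Serializable {
  interface SerializableFn<S, T> extends Function<S, T>, Serializable {}

  // each concrete action supplies its own type-specialized function
  SerializableFn<S, T> bind(Object type);

  // salted variant; salt-free actions ignore the salt by default
  default SerializableFn<S, T> bind(Object type, byte[] salt) {
    return bind(type);
  }

  // concrete actions narrow this to their applicable types
  default boolean canBind(Object type) {
    return true;
  }
}
```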
… date util
- ApplyExpression.canBind now returns true to match bind() behavior (bind succeeds, apply throws; canBind was contradicting this)
- Sha256String.update casts to String instead of calling toString() (the value is guaranteed to be a String at this point; this documents the invariant)
- TestActions maskToDefaultDate uses DateTimeUtil.daysFromDate for consistency with the production code's DATE_DEFAULT computation
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request on Apr 27, 2026
This is a POC for the Read Restrictions spec
#13879
per the latest discussion: #13879 (comment)