feat(glue-alpha): support Apache Iceberg tables #37988

Open
ksco92 wants to merge 1 commit into aws:main from ksco92:feat/glue-alpha-iceberg-table
@ksco92 commented May 24, 2026

Issue # (if applicable)

Closes #29660.

Reason for this change

CloudFormation AWS::Glue::Table supports Apache Iceberg via OpenTableFormatInput, but the shape that survives UpdateTable is not the one most documentation shows. Placing columns under tableInput.storageDescriptor.columns alongside openTableFormatInput.icebergInput creates an Iceberg table on CREATE, but the first UPDATE silently strips table_type=ICEBERG and metadata_location from the Glue parameters. Athena queries against the table then fail with HIVE_UNSUPPORTED_FORMAT.

The working shape places schema, partition spec, sort order, and properties under openTableFormatInput.icebergInput.icebergTableInput and omits tableInput entirely. There is no L2 in @aws-cdk/aws-glue-alpha that emits this shape today; users either reach into the L1 escape hatch or end up with the broken shape.
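To make the difference between the two shapes concrete, here is a minimal sketch in plain TypeScript. It does not use the CDK; `SketchTableProps`, `checkIcebergShape`, `brokenShape`, and `safeShape` are hypothetical names for illustration, and the contents of `icebergTableInput` are placeholders rather than the real field layout.

```typescript
// Simplified, hypothetical view of the table props discussed above.
// Only the keys named in this description are modeled; the fields inside
// icebergTableInput are left opaque on purpose.
interface SketchTableProps {
  tableInput?: {
    storageDescriptor?: { columns?: Array<{ name: string; type: string }> };
  };
  openTableFormatInput?: {
    icebergInput?: {
      metadataOperation?: string;
      version?: string;
      icebergTableInput?: Record<string, unknown>;
    };
  };
}

// Returns a list of problems; an empty list means the props use the safe shape.
function checkIcebergShape(props: SketchTableProps): string[] {
  const iceberg = props.openTableFormatInput?.icebergInput;
  if (iceberg === undefined) {
    return ['openTableFormatInput.icebergInput is missing'];
  }
  const problems: string[] = [];
  if (props.tableInput !== undefined) {
    problems.push('tableInput must be omitted next to openTableFormatInput');
  }
  if (iceberg.icebergTableInput === undefined) {
    problems.push('schema must live under icebergInput.icebergTableInput');
  }
  return problems;
}

// The shape that loses table_type=ICEBERG on the first UPDATE.
const brokenShape: SketchTableProps = {
  tableInput: { storageDescriptor: { columns: [{ name: 'order_id', type: 'bigint' }] } },
  openTableFormatInput: { icebergInput: { metadataOperation: 'CREATE', version: '2' } },
};

// The shape this PR emits: no tableInput, everything under icebergTableInput.
const safeShape: SketchTableProps = {
  openTableFormatInput: {
    icebergInput: {
      metadataOperation: 'CREATE',
      version: '2',
      icebergTableInput: { schema: 'placeholder', properties: {} },
    },
  },
};
```

Here `checkIcebergShape(brokenShape)` reports two problems and `checkIcebergShape(safeShape)` reports none.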

Description of changes

Adds an IcebergTable L2 construct in @aws-cdk/aws-glue-alpha plus the supporting types (IcebergType, IcebergPartitionTransform, IcebergDataFormat, IcebergFormatVersion, IcebergSortDirection, IcebergNullOrder, IcebergColumn, IcebergPartitionField, IcebergSortField, IIcebergTable).

The construct:

  • emits only the safe openTableFormatInput.icebergInput.icebergTableInput shape; never publishes a tableInput sibling
  • validates partition transforms against source-column types at synth time (day / month / year require date/timestamp; hour requires timestamp; bucket and truncate require their respective source-type whitelists)
  • validates tableProperties against the codec / write-mode / format-version matrix (rejects e.g. merge-on-read on v1, bzip2 on parquet, non-positive numeric values)
  • resolves identifierFieldNames to ids, refusing floating-point columns per the Iceberg spec
  • threads a single id counter through nested types so every id in the schema is globally unique
  • surfaces grantRead / grantWrite / grantReadWrite as four separate IAM statements: Glue actions on the Glue table ARN, s3:ListBucket on the bucket ARN with an s3:prefix condition limiting the grantee to the table's own prefix, s3:GetBucketLocation / s3:ListBucketMultipartUploads on the bucket ARN unconditionally (these actions do not support s3:prefix and would be silently denied if conditioned), and S3 object-level actions on bucket/prefix*
  • honors optional per-column id pinning so users can add, remove, and reorder columns across deploys without breaking the Iceberg "ids unique per table schema" invariant
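As an illustration of the synth-time partition check listed above, the following sketch covers only the time-based transforms (day / month / year / hour). It is not the construct's code: `SketchSourceType`, `SketchTimeTransform`, `ALLOWED_SOURCES`, and `partitionTransformError` are hypothetical names, and the source-type list is deliberately reduced.

```typescript
// Reduced, hypothetical set of column types; the real schema has more.
type SketchSourceType = 'int' | 'long' | 'string' | 'date' | 'timestamp';
type SketchTimeTransform = 'year' | 'month' | 'day' | 'hour';

// day / month / year accept date or timestamp; hour accepts timestamp only.
const ALLOWED_SOURCES: Record<SketchTimeTransform, ReadonlyArray<SketchSourceType>> = {
  year: ['date', 'timestamp'],
  month: ['date', 'timestamp'],
  day: ['date', 'timestamp'],
  hour: ['timestamp'],
};

// Returns an error message for an invalid combination, or undefined if valid.
function partitionTransformError(
  transform: SketchTimeTransform,
  column: string,
  sourceType: SketchSourceType,
): string | undefined {
  const allowed = ALLOWED_SOURCES[transform];
  if (allowed.indexOf(sourceType) !== -1) {
    return undefined;
  }
  return `Partition transform '${transform}' on column '${column}' requires one of ` +
    `[${allowed.join(', ')}], got '${sourceType}'`;
}
```

A `day` transform on a `date` column passes, while an `hour` transform on the same column yields an error message that a construct could raise at synth time instead of letting the deploy fail later.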

The types model is intentionally jsii-friendly — IcebergType and IcebergPartitionTransform are single concrete classes discriminated by a public kind enum, no private constructors, no function-typed fields.
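A rough sketch of what such a types model could look like, combined with the single id counter threaded through nested types. All names here (`SketchKind`, `SketchType`, `SketchField`, `assignSchemaIds`) are hypothetical and the traversal order is an assumption; the actual `IcebergType` in this PR may differ.

```typescript
// One concrete class discriminated by a public enum, public constructor,
// and no function-typed fields.
enum SketchKind { PRIMITIVE = 'primitive', LIST = 'list', STRUCT = 'struct' }

interface SketchField { readonly name: string; readonly type: SketchType; }

class SketchType {
  public static primitive(name: string): SketchType {
    return new SketchType(SketchKind.PRIMITIVE, name);
  }
  public static list(element: SketchType): SketchType {
    return new SketchType(SketchKind.LIST, undefined, element);
  }
  public static struct(fields: SketchField[]): SketchType {
    return new SketchType(SketchKind.STRUCT, undefined, undefined, fields);
  }

  public readonly kind: SketchKind;
  public readonly primitiveName?: string;
  public readonly element?: SketchType;
  public readonly fields?: SketchField[];

  public constructor(kind: SketchKind, primitiveName?: string, element?: SketchType, fields?: SketchField[]) {
    this.kind = kind;
    this.primitiveName = primitiveName;
    this.element = element;
    this.fields = fields;
  }
}

interface AssignedId { readonly path: string; readonly id: number; }
interface IdCounter { next: number; }

// Assigns ids to list elements and struct fields nested inside a type.
function assignNestedIds(type: SketchType, path: string, counter: IdCounter, out: AssignedId[]): void {
  if (type.kind === SketchKind.LIST && type.element !== undefined) {
    out.push({ path: `${path}.element`, id: counter.next++ });
    assignNestedIds(type.element, `${path}.element`, counter, out);
  } else if (type.kind === SketchKind.STRUCT && type.fields !== undefined) {
    assignFieldIds(type.fields, path, counter, out);
  }
}

// Assigns an id to each field, then recurses using the same shared counter.
function assignFieldIds(fields: SketchField[], parentPath: string, counter: IdCounter, out: AssignedId[]): void {
  for (const field of fields) {
    const path = parentPath === '' ? field.name : `${parentPath}.${field.name}`;
    out.push({ path, id: counter.next++ });
    assignNestedIds(field.type, path, counter, out);
  }
}

function assignSchemaIds(columns: SketchField[]): AssignedId[] {
  const out: AssignedId[] = [];
  assignFieldIds(columns, '', { next: 1 }, out);
  return out;
}

const sketchIds = assignSchemaIds([
  { name: 'order_id', type: SketchType.primitive('long') },
  { name: 'tags', type: SketchType.list(SketchType.primitive('string')) },
  { name: 'address', type: SketchType.struct([
    { name: 'city', type: SketchType.primitive('string') },
    { name: 'zips', type: SketchType.list(SketchType.primitive('int')) },
  ]) },
]);
```

Because every call shares one `IdCounter`, the top-level columns, the struct fields, and the list elements in `sketchIds` all receive distinct ids.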

Description of how you validated changes

  • test/iceberg-table.test.ts — 27 unit tests exercising happy paths, defaults, partition / sort rendering, identifier resolution, pinned column ids, every validation failure path, and the four-statement grant shape.
  • test/integ.iceberg-table.ts + committed .snapshot/ — provisions a Glue database, a warehouse bucket, and two Iceberg tables (orders with day + bucket partitions, sort order, identifier-field-ids, and merge-on-read; events with hourly partitioning). The integ-runner deployed the stack and verified it end-to-end in us-east-1 in 81s.
  • The README in @aws-cdk/aws-glue-alpha gained an "Iceberg Tables" section with rosetta-compilable snippets and an explicit limitations callout (CFN's metadataOperation: CREATE-only restriction, the cross-deploy field-id reuse that the construct cannot detect, and the Iceberg void intermediate that CFN cannot express when dropping a partition-source column).

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

Adds `IcebergTable`, an L2 construct that creates Apache Iceberg tables
in the AWS Glue Data Catalog via the working
`AWS::Glue::Table.OpenTableFormatInput.IcebergInput.IcebergTableInput`
shape. The construct supports CREATE / UPDATE / DELETE through
`cdk deploy` like any other resource.

The motivating issue documents that the obvious shape
(`tableInput.storageDescriptor.columns` + `openTableFormatInput.icebergInput`)
silently strips `table_type=ICEBERG` from the Glue parameters on the
first UPDATE, making the table unqueryable in Athena. `IcebergTable`
emits only the safe shape and refuses to publish `tableInput` next to
`openTableFormatInput`.

Surface:
- `IcebergType` primitives + `decimal(p, s)` / `fixed(L)` / `list` / `map` / `struct` factories. Nested types thread a single id counter through the schema so every field/element/key/value gets a globally unique id per the Iceberg spec.
- `IcebergPartitionTransform` with source-type validation: `IDENTITY`, `YEAR`, `MONTH`, `DAY`, `HOUR`, `VOID`, `bucket(N)`, `truncate(W)`.
- Sort orders with `IcebergSortDirection` (asc/desc) and `IcebergNullOrder` (nulls-first/nulls-last).
- Identifier-field-ids resolved from column names; floating-point columns rejected per the spec.
- Optional per-column `id` pinning for safe schema evolution across deploys.
- `IcebergDataFormat` (parquet/orc/avro — default parquet) and `IcebergFormatVersion` (v1/v2 — default v2).
- `tableProperties` validator: rejects bad codec/format/write-mode combinations, `merge-on-read` on a v1 table, non-positive numeric values, non-boolean booleans, at synth time.
- `grantRead` / `grantWrite` / `grantReadWrite` and `fromIcebergTableAttributes` import shim. Grants are split across four IAM statements so `s3:ListBucket` is scoped to the table's prefix while `s3:GetBucketLocation` and `s3:ListBucketMultipartUploads` (which do not support the `s3:prefix` condition) are granted on the bucket ARN unconditionally.
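
The four-statement split for a read grant can be pictured with the sketch below. It builds plain IAM statement objects without the CDK; `SketchStatement`, `SketchGrantTarget`, and `readStatements` are hypothetical names, and the specific Glue and S3 object actions listed are assumptions, not the action set the construct actually grants.

```typescript
interface SketchStatement {
  Effect: 'Allow';
  Action: string[];
  Resource: string;
  Condition?: { StringLike: { 's3:prefix': string[] } };
}

interface SketchGrantTarget {
  tableArn: string;
  bucketArn: string;
  prefix: string; // table location inside the bucket, e.g. 'sales/orders/'
}

function readStatements(target: SketchGrantTarget): SketchStatement[] {
  return [
    // 1. Glue actions on the Glue table ARN (action list is an assumption).
    { Effect: 'Allow', Action: ['glue:GetTable', 'glue:GetTableVersions'], Resource: target.tableArn },
    // 2. s3:ListBucket on the bucket, limited to the table's own prefix.
    {
      Effect: 'Allow',
      Action: ['s3:ListBucket'],
      Resource: target.bucketArn,
      Condition: { StringLike: { 's3:prefix': [`${target.prefix}*`] } },
    },
    // 3. Actions that do not support s3:prefix, granted unconditionally.
    {
      Effect: 'Allow',
      Action: ['s3:GetBucketLocation', 's3:ListBucketMultipartUploads'],
      Resource: target.bucketArn,
    },
    // 4. Object-level actions on bucket/prefix* (action list is an assumption).
    { Effect: 'Allow', Action: ['s3:GetObject'], Resource: `${target.bucketArn}/${target.prefix}*` },
  ];
}

const sketchStatements = readStatements({
  tableArn: 'arn:aws:glue:us-east-1:111122223333:table/sales/orders',
  bucketArn: 'arn:aws:s3:::example-warehouse',
  prefix: 'sales/orders/',
});
```

Keeping statement 3 free of a condition is the point of the split: attaching `s3:prefix` to actions that do not support it would cause them to be silently denied.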

Tests:
- 27 unit tests in `test/iceberg-table.test.ts`.
- `test/integ.iceberg-table.ts` provisions a database, a warehouse bucket, and two Iceberg tables (one with partitions + sort + identifier ids + merge-on-read; one with hourly partitioning). Snapshot committed under `test/integ.iceberg-table.js.snapshot/`. The integ-runner deploy ran end-to-end in us-east-1 (81s) and verified the resulting Glue / S3 state.

Documented limitations:
- `OpenTableFormatInput.IcebergInput.metadataOperation` only accepts `CREATE` in CFN; subsequent deploys flow through Glue's `UpdateTable` path.
- The construct does not detect cross-deploy field-id reuse. Pin column ids explicitly on tables you intend to evolve and treat dropped ids as retired forever.
- Dropping a partition-source column requires an Iceberg `void` transform intermediate that CFN cannot express. The construct accepts the change but Athena queries against the result will fail — drop the partition first, then the column in a subsequent deploy.

fixes aws#29660
github-actions bot added the beginning-contributor, effort/medium, feature-request, and p2 labels on May 24, 2026
Development

Successfully merging this pull request may close these issues.

(glue): Iceberg Table Support on S3Table construct
