feat(glue-alpha): support Apache Iceberg tables #37988
Open
ksco92 wants to merge 1 commit into
Adds `IcebergTable`, an L2 construct that creates Apache Iceberg tables in the AWS Glue Data Catalog via the working `AWS::Glue::Table.OpenTableFormatInput.IcebergInput.IcebergTableInput` shape. The construct supports CREATE / UPDATE / DELETE through `cdk deploy` like any other resource.

The motivating issue documents that the obvious shape (`tableInput.storageDescriptor.columns` + `openTableFormatInput.icebergInput`) silently strips `table_type=ICEBERG` from the Glue parameters on the first UPDATE, making the table unqueryable in Athena. `IcebergTable` emits only the safe shape and refuses to publish `tableInput` next to `openTableFormatInput`.

Surface:

- `IcebergType` primitives + `decimal(p, s)` / `fixed(L)` / `list` / `map` / `struct` factories. Nested types thread a single id counter through the schema so every field/element/key/value gets a globally unique id per the Iceberg spec.
- `IcebergPartitionTransform` with source-type validation: `IDENTITY`, `YEAR`, `MONTH`, `DAY`, `HOUR`, `VOID`, `bucket(N)`, `truncate(W)`.
- Sort orders with `IcebergSortDirection` (asc/desc) and `IcebergNullOrder` (nulls-first/nulls-last).
- Identifier-field-ids resolved from column names; floating-point columns rejected per the spec.
- Optional per-column `id` pinning for safe schema evolution across deploys.
- `IcebergDataFormat` (parquet/orc/avro, default parquet) and `IcebergFormatVersion` (v1/v2, default v2).
- `tableProperties` validator: at synth time, rejects bad codec/format/write-mode combinations, `merge-on-read` on a v1 table, non-positive numeric values, and non-boolean booleans.
- `grantRead` / `grantWrite` / `grantReadWrite` and `fromIcebergTableAttributes` import shim. Grants are split across four IAM statements so `s3:ListBucket` is scoped to the table's prefix while `s3:GetBucketLocation` and `s3:ListBucketMultipartUploads` (which do not support the `s3:prefix` condition) are granted on the bucket ARN unconditionally.
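To make the four-statement grant split concrete, here is a minimal sketch in plain TypeScript that builds a read grant as policy-document objects. It is not the construct's implementation: `readStatements` is a hypothetical helper, the ARNs are placeholders, and the specific Glue action, the S3 object action, and the `StringLike` operator are assumptions chosen for illustration.

```typescript
// Plain-object sketch of an IAM policy statement (not the aws-cdk-lib PolicyStatement class).
interface Statement {
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
  Condition?: Record<string, Record<string, string[]>>;
}

// Hypothetical helper: builds the four read statements for a single table.
function readStatements(tableArn: string, bucketArn: string, prefix: string): Statement[] {
  return [
    // 1. Glue metadata access on the table ARN (action list is an assumption).
    { Effect: 'Allow', Action: ['glue:GetTable'], Resource: [tableArn] },
    // 2. Listing on the bucket ARN, limited to keys under the table's own prefix.
    {
      Effect: 'Allow',
      Action: ['s3:ListBucket'],
      Resource: [bucketArn],
      Condition: { StringLike: { 's3:prefix': [`${prefix}*`] } },
    },
    // 3. Bucket-level actions that the description says do not support s3:prefix,
    //    so they carry no condition.
    {
      Effect: 'Allow',
      Action: ['s3:GetBucketLocation', 's3:ListBucketMultipartUploads'],
      Resource: [bucketArn],
    },
    // 4. Object-level reads under the table prefix (action list is an assumption).
    { Effect: 'Allow', Action: ['s3:GetObject'], Resource: [`${bucketArn}/${prefix}*`] },
  ];
}

// Placeholder ARNs and prefix, for illustration only.
const statements = readStatements(
  'arn:aws:glue:us-east-1:111122223333:table/sales/orders',
  'arn:aws:s3:::warehouse-bucket',
  'sales/orders/',
);
```

Keeping the conditioned and unconditioned bucket-level actions in separate statements is what lets the prefix restriction apply to `s3:ListBucket` alone.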
Tests:

- 27 unit tests in `test/iceberg-table.test.ts`.
- `test/integ.iceberg-table.ts` provisions a database, a warehouse bucket, and two Iceberg tables (one with partitions + sort + identifier ids + merge-on-read; one with hourly partitioning). Snapshot committed under `test/integ.iceberg-table.js.snapshot/`. The integ-runner deploy ran end-to-end in us-east-1 (81s) and verified the resulting Glue / S3 state.

Documented limitations:

- `OpenTableFormatInput.IcebergInput.metadataOperation` only accepts `CREATE` in CFN; subsequent deploys flow through Glue's `UpdateTable` path.
- The construct does not detect cross-deploy field-id reuse. Pin column ids explicitly on tables you intend to evolve and treat dropped ids as retired forever.
- Dropping a partition-source column requires an Iceberg `void` transform intermediate that CFN cannot express. The construct accepts the change but Athena queries against the result will fail; drop the partition first, then the column in a subsequent deploy.

fixes aws#29660
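Because the construct does not detect cross-deploy field-id reuse, any such check has to live outside it. The sketch below shows one hypothetical way to track retired ids across schema versions. `PinnedColumn`, `updateRetired`, and `findReuse` are illustrative names and not part of `@aws-cdk/aws-glue-alpha`, and how the retired set is persisted between deploys (for example, in a committed file) is an assumption left to the user.

```typescript
// Hypothetical shape: a column name paired with its pinned Iceberg field id.
interface PinnedColumn {
  name: string;
  id: number;
}

// Ids present in `previous` but absent from `next` were dropped, so they join the retired set.
function updateRetired(
  retired: ReadonlySet<number>,
  previous: PinnedColumn[],
  next: PinnedColumn[],
): Set<number> {
  const nextIds = new Set(next.map((c) => c.id));
  const result = new Set(retired);
  for (const col of previous) {
    if (!nextIds.has(col.id)) result.add(col.id);
  }
  return result;
}

// Columns in `next` whose id was retired by an earlier schema change.
function findReuse(retired: ReadonlySet<number>, next: PinnedColumn[]): PinnedColumn[] {
  return next.filter((c) => retired.has(c.id));
}

// Three made-up schema versions of one table.
const v1 = [{ name: 'order_id', id: 1 }, { name: 'legacy_flag', id: 2 }];
const v2 = [{ name: 'order_id', id: 1 }]; // drops id 2
const v3 = [{ name: 'order_id', id: 1 }, { name: 'region', id: 2 }]; // reuses id 2

const retiredAfterV2 = updateRetired(new Set<number>(), v1, v2);
const reused = findReuse(retiredAfterV2, v3); // the `region` column, which reuses id 2
```

Tracking ids rather than names means a renamed column that keeps its id is not flagged, while a new column that picks up a dropped id is.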
Issue # (if applicable)
Closes #29660.
Reason for this change
CloudFormation `AWS::Glue::Table` supports Apache Iceberg via `OpenTableFormatInput`, but the shape that survives `UpdateTable` is not the one most documentation shows. Placing columns under `tableInput.storageDescriptor.columns` and `openTableFormatInput.icebergInput` together creates an Iceberg table on `CREATE`, then silently strips `table_type=ICEBERG` and `metadata_location` from the Glue parameters on the first `UPDATE`. Athena queries after that fail with `HIVE_UNSUPPORTED_FORMAT`.

The working shape places schema, partition spec, sort order, and properties under `openTableFormatInput.icebergInput.icebergTableInput` and omits `tableInput` entirely. There is no L2 in `@aws-cdk/aws-glue-alpha` that emits this shape today; users either reach into the L1 escape hatch or end up with the broken shape.

Description of changes
Adds an `IcebergTable` L2 construct in `@aws-cdk/aws-glue-alpha` plus the supporting types (`IcebergType`, `IcebergPartitionTransform`, `IcebergDataFormat`, `IcebergFormatVersion`, `IcebergSortDirection`, `IcebergNullOrder`, `IcebergColumn`, `IcebergPartitionField`, `IcebergSortField`, `IIcebergTable`).

The construct:

- Emits only the `openTableFormatInput.icebergInput.icebergTableInput` shape; never publishes a `tableInput` sibling
- Validates partition transforms against their source column types (`day` / `month` / `year` require date/timestamp; `hour` requires timestamp; `bucket` and `truncate` require their respective source-type whitelists)
- Validates `tableProperties` against the codec / write-mode / format-version matrix (rejects e.g. `merge-on-read` on v1, `bzip2` on parquet, non-positive numeric values)
- Resolves `identifierFieldNames` to ids, refusing floating-point columns per the Iceberg spec
- Emits `grantRead` / `grantWrite` / `grantReadWrite` as four separate IAM statements: Glue actions on the Glue table ARN, `s3:ListBucket` on the bucket ARN with an `s3:prefix` condition limiting the grantee to the table's own prefix, `s3:GetBucketLocation` / `s3:ListBucketMultipartUploads` on the bucket ARN unconditionally (these actions do not support `s3:prefix` and would be silently denied if conditioned), and S3 object-level actions on `bucket/prefix*`
- Supports optional per-column `id` pinning so users can add, remove, and reorder columns across deploys without breaking the Iceberg "ids unique per table schema" invariant

The types model is intentionally jsii-friendly: `IcebergType` and `IcebergPartitionTransform` are single concrete classes discriminated by a public `kind` enum, no private constructors, no function-typed fields.

Description of how you validated changes
- `test/iceberg-table.test.ts`: 27 unit tests exercising happy paths, defaults, partition / sort rendering, identifier resolution, pinned column ids, every validation failure path, and the four-statement grant shape.
- `test/integ.iceberg-table.ts` + committed `.snapshot/`: provisions a Glue database, a warehouse bucket, and two Iceberg tables (orders with day + bucket partitions, sort order, identifier-field-ids, and merge-on-read; events with hourly partitioning). The integ-runner deployed and verified end-to-end in us-east-1 in 81s.
- The `@aws-cdk/aws-glue-alpha` documentation gained an "Iceberg Tables" section with rosetta-compilable snippets and an explicit limitations callout (CFN's `metadataOperation: CREATE`-only restriction, the cross-deploy field-id reuse that the construct cannot detect, and the Iceberg `void` intermediate that CFN cannot express when dropping a partition-source column).

Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license