Managed Transform protos & translation; Iceberg SchemaTransforms & translation #30910

ahmedabu98 · 2024-04-09T20:57:52Z

Creating a SchemaTransform for the recently added Iceberg sink (Initial Iceberg Sink #30797)
Creating a SchemaTransform for the recently added Iceberg source (Re-add iceberg bounded source; test splitting #30805)
Adding transform payload translation for IcebergIO
Adding transform payload translation for Managed transforms
Adding utils for Row and Schema (sorted(), toSnakeCase(), toCamelCase())
Establishing snake_case as the convention for SchemaTransform configuration field names (fixes "[Task]: TypedSchemaTransformProvider should generate Schema field names with lower_snake_case convention" #31061)
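
To illustrate the field name convention described above, here is a minimal sketch of a snake_case <-> camelCase conversion on plain strings. It is not the code added in this PR: the class name `FieldNameCase` and the example field name `catalogConfig` are hypothetical, and the real utilities operate on Beam Row and Schema objects rather than bare strings.

```java
/** Hypothetical helper for illustration only; not Beam's actual implementation. */
final class FieldNameCase {
  private FieldNameCase() {}

  /**
   * Converts a lowerCamelCase name to lower_snake_case.
   * Each uppercase letter starts a new word, so acronyms such as "URN"
   * would be split letter by letter in this simple sketch.
   */
  static String toSnakeCase(String camel) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < camel.length(); i++) {
      char c = camel.charAt(i);
      if (Character.isUpperCase(c)) {
        if (i > 0) {
          out.append('_');
        }
        out.append(Character.toLowerCase(c));
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  /** Converts a lower_snake_case name to lowerCamelCase by dropping underscores. */
  static String toCamelCase(String snake) {
    StringBuilder out = new StringBuilder();
    boolean upperNext = false;
    for (int i = 0; i < snake.length(); i++) {
      char c = snake.charAt(i);
      if (c == '_') {
        upperNext = true;
      } else if (upperNext) {
        out.append(Character.toUpperCase(c));
        upperNext = false;
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(toSnakeCase("catalogConfig"));
    System.out.println(toCamelCase("catalog_config"));
  }
}
```

The two directions round-trip only for simple names; a name containing consecutive capitals or a leading underscore would not survive the round trip in this sketch.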

github-actions · 2024-04-09T21:34:32Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`.

chamikaramj

Thanks!

...io/iceberg/src/main/java/org/apache/beam/io/iceberg/IcebergWriteSchemaTransformProvider.java

chamikaramj

LGTM. Thanks.

I think this includes what we want in the release. But this won't work end-to-end for upgrading till we update the ExpansionService logic as I mentioned in a comment.

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/FileWriteResult.java

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergIO.java

...ceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformCatalogConfig.java

.../iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformTranslation.java

chamikaramj

Thanks. LGTM.

robertwb

Thanks for your patience with all my comments, both on the CL and out of band.

There are a huge number of separable changes going on in this PR, but given the time constraint it probably isn't worth separating them into separate PRs/commits at this point, so we can get this in as is.

ahmedabu98 · 2024-04-22T23:34:47Z

Thank you all for the valuable feedback. Merging this now.

* iceberg write schematransform and test
* cleanup
* IcebergIO translation and tests
* add sanity check for building with Row; add documentation about output schema; add iceberg to IO expansion service
* spotless
* spotless
* permitUnusedDeclared iceberg
* Change ManagedSchemaTransformProvider to take a Row config instead of a Yaml string
* don't auto generate external wrapper for this just yet
* spotless
* spotless
* Read schematransform and tests
* pulling in IcebergIO changes; spotless
* icebergio translation; managed translation; protos
* spotless
* spotless; use underscore instead of camel case field names when translating managed transform config
* add grpc dependency
* updated proto description; fix gen xlang command
* ManagedTransform explicit input/output types; move iceberg package to org.apache.beam.sdk.io.iceberg
* externalizable IcebergCatalogConfig
* externalizable IcebergCatalogConfig supports all properties; address some comments
* unify iceberg urns and identifiers; update some comments
* one source for all supported managed transform identifiers
* add documentation
* custom serialization for OneTableDynamicDestinations
* add iceberg via managed API tests; update proto doc
* rename config; change test schematransform location
* spotless
* add missing package-info file
* spotless
* replace icebergIO translation with iceberg schematransform translation; fix Schema::sorted to do recursive sorting
* remove ExternalizableIcebergCatalogConfig (no longer needed)
* pull identifiers from generated proto
* remove unused hadoop dependency
* update generate sequence wrapper after Schema sorting
* managed transform translation uses default schema
* yaml returns null row; cleanup
* spotless
* remove SchemaAwareTransformPayload and use SchemaTransformPayload instead; rename StandardSchemaAwareTransforms -> ManagedSchemaAwareTransforms
* create a beam-schema-compatible class for Snapshot info
* removed new proto file and moved Managed URNs to beam_runner_api.proto; we now use SchemaTransformPayload for all schematransforms, including Managed; adding a version number to FileWriteResult encoding so that we can use it to fork in the future when needed
* Row and Schema snake_case <-> camelCase conversion logic
* Row sorted() util
* use Row::sorted to fetch Managed & Iceberg row configs
* use snake_case convention when translating transforms to spec; remove Managed and Iceberg urns from proto and use SCHEMA_TRANSFORM URN
* spotless
* cleanup
* DefaultSchemaProvider can now provide the underlying SchemaProvider
* perform snake_case <-> camelCase conversions directly in TypedSchemaTransformProvider
* update icebergIO and managed translations to reflect field name convention changes
* sorted SnapshotInfo
* update manual Python wrappers to use snake_case convention; remove case conversion step from Python auto-xlang; spotless
* Row utils allow nullable
* add FileWriteResult test for version number; fix existing Java and YAML tests
* add schema-aware transform urn to transform annotations during translation
* add comments why we sort and snake_case configuration schemas
* add SchemaTransformTranslation abstraction. when encountering a SCHEMA_TRANSFORM urn, fetch underlying identifier
* add documentation
* prioritize registered providers; remove snake_case <-> camelCase conversions from python side
* cleanup
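
The commit message above mentions adding a version number to the FileWriteResult encoding so that decoding can fork on it later. The general pattern can be sketched as follows. This is an assumption-laden illustration using only `java.io` streams, not the actual FileWriteResult coder: the class `VersionedResultCodec` and the fields `tableId` and `recordCount` are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec showing a version-prefixed encoding; not Beam's FileWriteResult coder. */
final class VersionedResultCodec {
  static final int VERSION = 1;

  private VersionedResultCodec() {}

  /** Writes the version first, then the (invented) payload fields. */
  static byte[] encode(String tableId, long recordCount) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(VERSION);
      out.writeUTF(tableId);
      out.writeLong(recordCount);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads the version first and forks on it; unknown versions are rejected. */
  static String decode(byte[] encoded) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
      int version = in.readInt();
      switch (version) {
        case 1:
          String tableId = in.readUTF();
          long recordCount = in.readLong();
          return tableId + ":" + recordCount;
        default:
          // A future layout would get its own case here.
          throw new IllegalStateException("Unsupported encoding version: " + version);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(decode(encode("db.table", 42L)));
  }
}
```

Putting the version at the front means a decoder can pick the right layout before reading any payload bytes, which is what makes a later format change possible without breaking readers of older data.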

ahmedabu98 added 2 commits April 9, 2024 16:50

iceberg write schematransform and test

42611e0

cleanup

16e6235

github-actions bot added java io labels Apr 9, 2024

chamikaramj reviewed Apr 11, 2024

ahmedabu98 added 9 commits April 11, 2024 12:04

IcebergIO translation and tests

ed72898

add sanity check for building with Row; add documentation about outpu…

1738345

…t schema; add iceberg to IO expansion service

spotless

364ebbe

spotless

79d2c94

permitUnusedDeclared iceberg

30de265

Change ManagedSchemaTransformProvider to take a Row config instead of…

905d590

… a Yaml string

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

553281f

…erg_translation

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

2c733ec

…erg_write_schematransform

don't auto generate external wrapper for this just yet

1067d84

github-actions bot added the build label Apr 11, 2024

ahmedabu98 added 2 commits April 11, 2024 18:53

spotless

6db699a

spotless

301e388

ahmedabu98 added this to the 2.56.0 Release milestone Apr 12, 2024

ahmedabu98 added 5 commits April 14, 2024 19:28

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

f1576e3

…erg_write_schematransform

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

04cc2db

…erg_translation Pulling Read connector and making a translation for that too.

Read schematransform and tests

27e5fb0

Merge branch 'iceberg_translation' of https://github.com/ahmedabu98/beam

e2cb93b

into iceberg_write_schematransform

pulling in IcebergIO changes; spotless

aa8b1ed

github-actions bot added the gcp label Apr 15, 2024

Merge branch 'managed_row_config' of https://github.com/ahmedabu98/beam…

6823524

… into iceberg_write_schematransform

ahmedabu98 changed the title ~~Iceberg Write SchemaTransform~~ Iceberg SchemaTransforms and Translation Apr 15, 2024

icebergio translation; managed translation; protos

0674069

github-actions bot added the model label Apr 16, 2024

spotless

9034cee

ahmedabu98 added 2 commits April 19, 2024 19:57

use snake_case convention when translating transforms to spec; remove…

2992192

… Managed and Iceberg urns from proto and use SCHEMA_TRANSFORM URN

spotless

b311068

chamikaramj reviewed Apr 20, 2024

.../iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformTranslation.java

ahmedabu98 added 6 commits April 19, 2024 23:17

cleanup

2461b44

DefaultSchemaProvider can now provide the underlying SchemaProvider

ecb4dbb

perform snake_case <-> camelCase conversions directly in TypedSchemaT…

68895a7

…ransformProvider

update icebergIO and managed translations to reflect field name conve…

d2135b8

…ntion changes

sorted SnapshotInfo

64863ce

update manual Python wrappers to use snake_case convention; remove ca…

5afb633

…se conversion step from Python auto-xlang; spotless

github-actions bot added the bigtable label Apr 20, 2024

Row utils allow nullable

2ddd5bb

chamikaramj reviewed Apr 22, 2024

add FileWriteResult test for version number; fix existing Java and YA…

d5a4d66

…ML tests

github-actions bot added kafka yaml labels Apr 22, 2024

ahmedabu98 added 5 commits April 22, 2024 09:16

add schema-aware transform urn to transform annotations during transl…

3b74f77

…ation

add comments why we sort and snake_case configuration schemas

af65032

add SchemaTransformTranslation abstraction. when encountering a SCHEM…

7130e56

…A_TRANSFORM urn, fetch underlying identifier

add documentation

de81e60

prioritize registered providers; remove snake_case <-> camelCase conv…

34dc371

…ersions from python side

chamikaramj approved these changes Apr 22, 2024

cleanup

82b481d

robertwb approved these changes Apr 22, 2024

ahmedabu98 merged commit 37609ba into apache:master Apr 22, 2024
108 of 109 checks passed

chamikaramj mentioned this pull request Apr 23, 2024

[Feature Request]: Add Java API for using and upgrading Iceberg via the Managed transforms API #30892

Closed

16 tasks

ahmedabu98 mentioned this pull request Apr 23, 2024

Cherrypicking #30910 into release-2.56.0 #31076

Merged

3 tasks

ahmedabu98 mentioned this pull request Apr 25, 2024

Revert global snake_case convention for SchemaTransforms #31109

Merged

ahmedabu98 mentioned this pull request Jun 10, 2024

Default translation for SchemaTransforms #31558

Open
