Managed Transform protos & translation; Iceberg SchemaTransforms & translation #30910

Merged: 66 commits, Apr 22, 2024

@ahmedabu98 ahmedabu98 commented Apr 9, 2024

github-actions bot commented Apr 9, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@chamikaramj chamikaramj left a comment

Thanks!

@github-actions github-actions bot added the build label Apr 11, 2024
@ahmedabu98 ahmedabu98 added this to the 2.56.0 Release milestone Apr 12, 2024
@github-actions github-actions bot added the gcp label Apr 15, 2024
@ahmedabu98 ahmedabu98 changed the title from "Iceberg Write SchemaTransform" to "Iceberg SchemaTransforms and Translation" Apr 15, 2024
@github-actions github-actions bot added the model label Apr 16, 2024
… Managed and Iceberg urns from proto and use SCHEMA_TRANSFORM URN

@chamikaramj chamikaramj left a comment

LGTM. Thanks.

I think this includes what we want in the release. But this won't work end-to-end for upgrading till we update the ExpansionService logic as I mentioned in a comment.

@chamikaramj chamikaramj left a comment

Thanks. LGTM.

@robertwb robertwb left a comment

Thanks for your patience with all my comments, both on the CL and out of band.

There are a huge number of separable changes going on in this PR, but given the time constraint it probably isn't worth separating them into separate PRs/commits at this point, so we can get this in as is.

@ahmedabu98

Thank you all for the valuable feedback. Merging this now

@ahmedabu98 ahmedabu98 merged commit 37609ba into apache:master Apr 22, 2024
108 of 109 checks passed
damccorm pushed a commit that referenced this pull request Apr 23, 2024
* iceberg write schematransform and test

* cleanup

* IcebergIO translation and tests

* add sanity check for building with Row; add documentation about output schema; add iceberg to IO expansion service

* spotless

* spotless

* permitUnusedDeclared iceberg

* Change ManagedSchemaTransformProvider to take a Row config instead of a Yaml string

* don't auto generate external wrapper for this just yet

* spotless

* spotless

* Read schematransform and tests

* pulling in IcebergIO changes; spotless

* icebergio translation; managed translation; protos

* spotless

* spotless; use underscore instead of camel case field names when translating managed transform config

* add grpc dependency

* updated proto description; fix gen xlang command

* ManagedTransform explicit input/output types; move iceberg package to org.apache.beam.sdk.io.iceberg

* externalizable IcebergCatalogConfig

* externalizable IcebergCatalogConfig supports all properties; address some comments

* unify iceberg urns and identifiers; update some comments

* one source for all supported managed transform identifiers

* add documentation

* custom serialization for OneTableDynamicDestinations

* add iceberg via managed API tests; update proto doc

* rename config; change test schematransform location

* spotless

* add missing package-info file

* spotless

* replace icebergIO translation with iceberg schematransform translation; fix Schema::sorted to do recursive sorting

* remove ExternalizableIcebergCatalogConfig (no longer needed)

* pull identifiers from generated proto

* remove unused hadoop dependency

* update generate sequence wrapper after Schema sorting

* managed transform translation uses default schema

* yaml returns null row; cleanup

* spotless

* remove SchemaAwareTransformPayload and use SchemaTransformPayload instead; rename StandardSchemaAwareTransforms -> ManagedSchemaAwareTransforms

* create a beam-schema-compatible class for Snapshot info

* removed new proto file and moved Managed URNs to beam_runner_api.proto; we now use SchemaTransformPayload for all schematransforms, including Managed; adding a version number to FileWriteResult encoding so that we can use it to fork in the future when needed

* Row and Schema snake_case <-> camelCase conversion logic

* Row sorted() util

* use Row::sorted to fetch Managed & Iceberg row configs

* use snake_case convention when translating transforms to spec; remove Managed and Iceberg urns from proto and use SCHEMA_TRANSFORM URN

* spotless

* cleanup

* DefaultSchemaProvider can now provide the underlying SchemaProvider

* perform snake_case <-> camelCase conversions directly in TypedSchemaTransformProvider

* update icebergIO and managed translations to reflect field name convention changes

* sorted SnapshotInfo

* update manual Python wrappers to use snake_case convention; remove case conversion step from Python auto-xlang; spotless

* Row utils allow nullable

* add FileWriteResult test for version number; fix existing Java and YAML tests

* add schema-aware transform urn to transform annotations during translation

* add comments why we sort and snake_case configuration schemas

* add SchemaTransformTranslation abstraction. when encountering a SCHEMA_TRANSFORM urn, fetch underlying identifier

* add documentation

* prioritize registered providers; remove snake_case <-> camelCase conversions from python side

* cleanup
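
Several of the commits above deal with two conventions for configuration schemas: converting field names between snake_case and camelCase, and sorting fields recursively. The following is a minimal sketch of both ideas, not Beam's actual implementation. It uses plain Python dicts as a stand-in for Beam `Row`/`Schema` objects, and the helper names (`to_snake_case`, `to_camel_case`, `convert_keys`, `sorted_config`) and the example field names are hypothetical.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase name to snake_case (hypothetical helper)."""
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def to_camel_case(name: str) -> str:
    """Convert a snake_case name to camelCase (hypothetical helper)."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def convert_keys(config, convert):
    """Apply `convert` to every key, descending into nested dicts and lists."""
    if isinstance(config, dict):
        return {convert(k): convert_keys(v, convert) for k, v in config.items()}
    if isinstance(config, list):
        return [convert_keys(v, convert) for v in config]
    return config


def sorted_config(config):
    """Return a copy with keys sorted alphabetically at every nesting level."""
    if isinstance(config, dict):
        return {k: sorted_config(config[k]) for k in sorted(config)}
    if isinstance(config, list):
        return [sorted_config(v) for v in config]
    return config


# Assumed example config; these field names are illustrative only.
java_style = {
    "table": "db.tbl",
    "catalogConfig": {"warehouseLocation": "/tmp/wh", "catalogName": "demo"},
}

snake = sorted_config(convert_keys(java_style, to_snake_case))
round_trip = convert_keys(snake, to_camel_case)
```

With this sketch, two configs that differ only in key order or naming convention normalize to the same nested structure, and converting back to camelCase recovers the original field names.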
Successfully merging this pull request may close these issues.

[Task]: TypedSchemaTransformProvider should generate Schema field names with lower_snake_case convention
3 participants