Add support for serialization of iceberg tables #35456

Merged
merged 1 commit into from Nov 5, 2023

@bolkedebruin bolkedebruin commented Nov 5, 2023

This adds lightweight support for serialization of Apache Iceberg tables. This means that references
are captured and tables are reinstantiated with their catalog information.

Lightweight means that catalogs are not reinstantiated from connection IDs or through a more generic mechanism. I think that should find its home in airflow.catalog, but that is up for discussion.

cc @Fokko

This allows you to do the following:

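import polars as pl
from airflow.decorators import task
from pyiceberg.catalog import load_catalog
from pyiceberg.table import Table

@task
def a_task() -> Table:
    # Load the table through the catalog named "airflow".
    catalog = load_catalog("airflow")
    table = catalog.load_table("schema.table")
    return table

@task
def b_task(table: Table) -> pl.LazyFrame:
    # scan_iceberg builds a lazy query, so the result is a LazyFrame.
    df = pl.scan_iceberg(table)
    return df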
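
To make "references are captured and tables are reinstantiated with their catalog information" concrete, here is a minimal sketch of the idea. It is not the code from this PR: `StubCatalog` and `StubTable` are hypothetical stand-ins for the pyiceberg classes, and the payload keys are assumptions chosen for illustration. Only the reference (catalog name, catalog properties, table identifier) is stored, never the table data.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StubCatalog:
    """Hypothetical stand-in for a pyiceberg catalog."""
    name: str
    properties: dict = field(default_factory=dict)

    def load_table(self, identifier: tuple) -> "StubTable":
        return StubTable(identifier=identifier, catalog=self)


@dataclass
class StubTable:
    """Hypothetical stand-in for pyiceberg.table.Table."""
    identifier: tuple
    catalog: StubCatalog


def serialize(table: StubTable) -> dict:
    # Capture only what is needed to find the table again.
    return {
        "catalog_name": table.catalog.name,
        "catalog_properties": dict(table.catalog.properties),
        "identifier": list(table.identifier),
    }


def deserialize(data: dict) -> StubTable:
    # Rebuild the catalog from the stored properties, then reload the table.
    catalog = StubCatalog(data["catalog_name"], dict(data["catalog_properties"]))
    return catalog.load_table(tuple(data["identifier"]))


original = StubCatalog("airflow", {"uri": "http://localhost:8181"}).load_table(
    ("schema", "table")
)
# Round-trip through JSON to show the payload is plain data.
payload = json.loads(json.dumps(serialize(original)))
restored = deserialize(payload)
```

Because the payload carries the catalog properties, anything sensitive in them would need protecting before it is stored; this sketch leaves that out.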

@hussein-awala hussein-awala left a comment

Looks good!

I wonder if we should move some of the serializers to the providers (Kubernetes, this one, ...) and load them from there:

for _, name, _ in iter_namespace(airflow.serialization.serializers):

But until we do that, LGTM.
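
For readers unfamiliar with the loop quoted above, the following is a rough sketch of namespace-based discovery. It assumes `iter_namespace` is a thin wrapper around `pkgutil.iter_modules` (Airflow's own helper may differ), and it builds a throwaway package named `demo_serializers` so that it runs without Airflow installed. The package name, module names, and the class-name strings are placeholders.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def iter_namespace(ns):
    """Yield (finder, name, ispkg) for each module directly under package ns."""
    return pkgutil.iter_modules(ns.__path__, ns.__name__ + ".")


# Create a temporary package with two serializer modules (hypothetical names).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_serializers"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "iceberg.py").write_text('serializers = ["pyiceberg.table.Table"]\n')
(pkg_dir / "kubernetes.py").write_text(
    'serializers = ["kubernetes.client.models.v1_pod.V1Pod"]\n'
)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Import every module in the package and map each declared class name
# to the module that claims to serialize it.
package = importlib.import_module("demo_serializers")
registry = {}
for _, name, _ in iter_namespace(package):
    module = importlib.import_module(name)
    for qualname in module.serializers:
        registry[qualname] = name
```

Moving serializers into providers, as suggested, would amount to running the same kind of loop over provider packages instead of a single core package.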

@bolkedebruin

I think so.

@bolkedebruin bolkedebruin merged commit 9a921c2 into apache:main Nov 5, 2023
70 checks passed
@bolkedebruin bolkedebruin deleted the iceberg branch November 5, 2023 19:38
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Nov 20, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
@ephraimbuddy ephraimbuddy added the AIP-58 and changelog:skip labels and removed the type:new-feature label Nov 27, 2023
Labels
AIP-58, area:serialization, changelog:skip
4 participants