
Basic Integration with Datafusion #324

Merged
37 commits merged into apache:main on May 2, 2024

Conversation

marvinlanhenke (Contributor)

I had some fun working with DataFusion and I want to share the (rough) design with you.

As outlined in #242 I went ahead and:

  • created a new crate integrations
  • added a subfolder datafusion with its respective modules (catalog.rs, schema.rs, etc.)
  • implemented some of the traits, e.g. CatalogProvider
  • added basic test infra for integration tests with HiveMetastore

The overall structure should be extensible, meaning if we want to support an integration with e.g. polars, we simply create a new subfolder and implement the traits.
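For orientation, the resulting layout looks roughly like this (reconstructed from the file paths referenced in the review comments below, after the crate-structure rename during review, so treat it as approximate):

```
crates/integrations/datafusion/
├── Cargo.toml
├── README.md
└── src/
    ├── catalog.rs
    ├── schema.rs
    ├── table.rs
    └── physical_plan/
        └── scan.rs
```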

The test infra is supposed to be used by every integration.

The IcebergCatalogProvider, for example, should be usable with any Catalog implementation (since it holds an Arc<dyn Catalog> client, i.e. dynamic dispatch).
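As a minimal sketch of that dynamic-dispatch idea (field and method names are illustrative and may not match the final code in this PR; it only assumes the `iceberg::Catalog` trait object):

```rust
use std::sync::Arc;

use iceberg::Catalog;

/// Sketch: the provider depends only on the `Catalog` trait object, so any
/// catalog implementation (REST, HiveMetastore, ...) can be plugged in.
pub struct IcebergCatalogProvider {
    /// Dynamically dispatched catalog client.
    client: Arc<dyn Catalog>,
}

impl IcebergCatalogProvider {
    /// Construct the provider from any catalog implementation.
    pub fn new(client: Arc<dyn Catalog>) -> Self {
        Self { client }
    }
}
```

(The constructor in this PR ended up as an async `try_new` instead, per the commit list below, but the dependency on `Arc<dyn Catalog>` is the point here.)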

I really would appreciate your thoughts on this approach - so we can discuss how to proceed, what's missing upstream (like #277), etc.

marvinlanhenke marked this pull request as draft on April 6, 2024, 04:28
@marvinlanhenke (Contributor, Author)

@liurenjie1024 @ZENOTME @Fokko
PTAL and let me know what you think.

@liurenjie1024 (Collaborator)

Thanks @marvinlanhenke! This is amazing, I'll review it later!

@tshauck left a comment

This is really cool! I've been hoping I wouldn't have to try something like this myself 😅... I left a comment about package structure, imagining how I would use this.

Also, if you all do want to move forward with this, I'd be happy to help fill in some of the DataFusion trait implementations.

Cargo.toml (outdated, resolved)
crates/integrations/src/datafusion/schema.rs (outdated, resolved)
@ZENOTME (Contributor)

ZENOTME commented Apr 22, 2024

Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke

marvinlanhenke marked this pull request as ready for review on April 22, 2024, 13:10
@marvinlanhenke (Contributor, Author)

Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke

Thanks for the feedback - I made some minor changes based on the suggestions and marked this PR as ready for review. If we agree on the basic design here, we could merge and split the actual implementation into multiple issues.

@liurenjie1024 @ZENOTME @Fokko @Xuanwo @sdd @tshauck PTAL - especially the part about "feature flags"; I have no idea what's best practice in 🦀

marvinlanhenke changed the title from "[WIP] Integration with Datafusion" to "Basic Integration with Datafusion" on April 22, 2024
@liurenjie1024 (Collaborator) left a comment

Thanks @marvinlanhenke for this PR! This is really exciting for me, since it will provide the foundation of a SQL interface for iceberg-rust! It's really amazing! I've left some comments for improvement, but it looks great! I'll invite the DataFusion community to help review.

crates/integrations/datafusion/Cargo.toml (outdated, resolved)
crates/integrations/datafusion/Cargo.toml (outdated, resolved)
crates/integrations/datafusion/README.md (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/schema.rs (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/catalog.rs (resolved)
@marvinlanhenke (Contributor, Author)

I've left some comment to improve, but it looks great! I'll invite datafusion community to help review.

@liurenjie1024
Thanks for the review.
I fixed most of the basic issues regarding structure, naming, async-trait, etc.

I'll work on the missing implementations over the next couple of days and see how far we can get with the current state of iceberg-rust.

@marvinlanhenke (Contributor, Author)

@liurenjie1024 @ZENOTME @viirya @simonvandel
...I just went ahead and pushed the recent updates; PTAL

unresolved / todo:

  • proper integration tests for table scan (requires us to set up a table with actual snapshots / data)
  • a cache for IcebergCatalogProvider and IcebergSchemaProvider / so data does not become stale
  • improve impl ExecutionPlan for IcebergTableScan (once we support filter pushdown, etc.)

@liurenjie1024 (Collaborator)

@liurenjie1024 @ZENOTME @viirya @simonvandel ...I just went ahead and pushed the recent updates; PTAL

unresolved / todo:

  • proper integration tests for table scan (requires us to set up a table with actual snapshots / data)
  • a cache for IcebergCatalogProvider and IcebergSchemaProvider / so data does not become stale
  • improve impl ExecutionPlan for IcebergTableScan (once we support filter pushdown, etc.)

Hi @marvinlanhenke, how about opening a tracking issue for everything related to the DataFusion integration? I guess we will have more things to do, and we can keep editing that tracking issue.

@liurenjie1024 (Collaborator) left a comment

Thanks @marvinlanhenke for this PR!

@liurenjie1024 (Collaborator)

Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL

@marvinlanhenke (Contributor, Author)

Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL

If it turns out that the PR is okay (for now), let me update the missing docs before we merge - that should make the follow-up work a lot easier in a couple of days/weeks.

@Xuanwo (Member) left a comment

Thanks for raising this PR. Mostly LGTM, just some minor nits.

I'm looking forward to working with you to integrate FileIO into it.

crates/integrations/datafusion/src/physical_plan/scan.rs (outdated, resolved)
crates/integrations/datafusion/src/schema.rs (outdated, resolved)
crates/integrations/datafusion/src/schema.rs (outdated, resolved)
@Fokko (Contributor) left a comment

LGTM, thanks a bunch for working on this @marvinlanhenke. I don't have much to comment on here since it is more on the DataFusion side rather than the Iceberg side 👍

Comment on lines +48 to +49
// Schemas and providers should be cached and evicted based on time
// As of right now, schemas might become stale.
Contributor

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. - Leon Bambrick

Is there no way to leave this up to the user?

Contributor (Author)

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. - Leon Bambrick

😄 the classic.

I think leaving it up to the user leads us to the issue of blocking on an async call in a sync trait function? If we have an idea how to handle that, we can better reason about if, when, and where to cache.

from the docs:

To implement CatalogProvider and SchemaProvider for remote catalogs, you need to provide an in memory snapshot of the required metadata. Most systems typically either already have this information cached locally or can batch access to the remote catalog to retrieve multiple schemas and tables in a single network call.
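To make the trade-off concrete, here is a self-contained sketch (not the approach this PR takes) of blocking on an async call from inside a sync provider method using tokio's block_in_place; RemoteCatalog is a hypothetical stand-in for an `Arc<dyn Catalog>`, and this pattern requires a multi-threaded tokio runtime:

```rust
use std::sync::Arc;

// Hypothetical stand-in for an async catalog client (e.g. an `Arc<dyn Catalog>`).
struct RemoteCatalog;

impl RemoteCatalog {
    async fn list_table_names(&self) -> Vec<String> {
        // Imagine a network round-trip to the catalog here.
        vec!["lineitem".to_string(), "orders".to_string()]
    }
}

struct SchemaProviderSketch {
    catalog: Arc<RemoteCatalog>,
}

impl SchemaProviderSketch {
    // Sync method (DataFusion's `SchemaProvider::table_names` is sync) that
    // blocks on the async catalog call. `block_in_place` only works on a
    // multi-threaded tokio runtime.
    fn table_names(&self) -> Vec<String> {
        tokio::task::block_in_place(|| {
            tokio::runtime::Handle::current().block_on(self.catalog.list_table_names())
        })
    }
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let provider = SchemaProviderSketch {
        catalog: Arc::new(RemoteCatalog),
    };
    println!("{:?}", provider.table_names());
}
```

The providers in this PR instead take the in-memory snapshot route the DataFusion docs describe: the async `try_new` constructors fetch the metadata up front, which is exactly why the staleness/caching question comes up here.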

fn schema_names(&self) -> Vec<String> {
    self.schemas.keys().cloned().collect()
}
Contributor

We should be hesitant with caching, especially avoiding upfront optimizations. For Iceberg in general, consistency is king.

For this PR, we should not cache them in the providers, e.g. CatalogProvider, SchemaProvider, etc., but call the Catalog API directly.

This makes sense, but I see the issue with blocking on async calls. For now, I would accept the price of waiting on those blocking calls. Even though it is still a remote call, the ones to the REST catalog should be lightning-fast (that's where the caching happens).

crates/integrations/datafusion/src/physical_plan/scan.rs (outdated, resolved)
@marvinlanhenke (Contributor, Author)

I've addressed the issues raised by @Xuanwo - thanks for the review.

I think we can merge this PR for now and continue working on #357, especially the caching issues and proper integration testing with data.

@Fokko @liurenjie1024
I haven't checked just now, but from memory I believe pyiceberg uses pyspark to set up tables with actual data for proper integration testing? Would it make sense to tackle such an issue in the next few weeks, in order to provide proper integration testing? Or should we wait until #349 lands and we have a dedicated crate for e2e testing?

@viirya (Member)

viirya commented Apr 30, 2024

Thanks for working on this @marvinlanhenke. Looks good to me. There are some more integrations with DataFusion I'm interested in working on, like improving ExecutionPlan for IcebergTableScan. Let's move this forward.

@liurenjie1024 (Collaborator)

I haven't checked now, but from memory I believe py-iceberg uses pyspark to setup tables with actual data for proper integration testing?

I think so. I think DataFusion also has Python bindings, so we could use Python for all tests?

I haven't reviewed #349 yet, but for DataFusion we can go on without a separate crate for integration tests?

liurenjie1024 merged commit bbd042d into apache:main on May 2, 2024
7 checks passed
@liurenjie1024 (Collaborator)

Thanks @marvinlanhenke for this PR, and @Fokko @Xuanwo @viirya @simonvandel @tshauck for the review!

c-thiel pushed a commit to c-thiel/iceberg-rust that referenced this pull request May 13, 2024
* chore: basic structure

* feat: add IcebergCatalogProvider

* feat: add IcebergSchemaProvider

* feat: add IcebergTableProvider

* chore: add integration test infr

* fix: remove old test

* chore: update crate structure

* fix: remove workspace dep

* refactor: use try_join_all

* chore: remove feature flag

* chore: rename package

* chore: update readme

* feat: add TableType

* fix: import + async_trait

* fix: imports + async_trait

* chore: remove feature flag

* fix: cargo sort

* refactor: CatalogProvider `fn try_new`

* refactor: SchemaProvider `fn try_new`

* chore: update docs

* chore: update docs

* chore: update doc

* feat: impl `fn schema` on TableProvider

* chore: rename ArrowSchema

* refactor: remove DashMap

* feat: add basic IcebergTableScan

* chore: fix docs

* chore: add comments

* fix: clippy

* fix: typo

* fix: license

* chore: update docs

* chore: move derive stmt

* fix: collect into hashmap

* chore: use DFResult

* Update crates/integrations/datafusion/README.md

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>