Basic Integration with Datafusion #324
Conversation
@liurenjie1024 @ZENOTME @Fokko
Thanks @marvinlanhenke! This is amazing, I'll review it later!
This is really cool! I'd been hoping I wouldn't have to try something like this myself 😅... I left a comment about package structure, imagining how I'd use it.
Also, if y'all do want to move forward with this, I'd be happy to help fill in some of the DataFusion trait implementations.
Thanks! Sorry for the late reply. I think this is a good start for the integration work, and I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke
Thanks for the feedback - I made some minor changes based on the suggestions and converted this PR to ready for review. If we agree on the basic design here, we could merge and split the actual implementation into multiple issues. @liurenjie1024 @ZENOTME @Fokko @Xuanwo @sdd @tshauck PTAL - especially the part about "feature flags"; I have no idea what's best practice in 🦀
Thanks @marvinlanhenke for this PR! This is really exciting for me, since it will provide the foundation of a SQL interface for iceberg-rust. It's really amazing! I've left some comments with suggested improvements, but it looks great. I'll invite the DataFusion community to help review.
@liurenjie1024 I'll work on the missing impls over the next couple of days - and see how far we can get with the current state of iceberg-rust.
@liurenjie1024 @ZENOTME @viirya @simonvandel unresolved / todo:
Hi @marvinlanhenke, how about opening a tracking issue for the work related to the DataFusion integration? I guess we will have more things to do, and we can keep editing that tracking issue.
Thanks @marvinlanhenke for this PR!
Thanks for raising this PR. Mostly LGTM, just some minor nits.
I'm looking forward to working with you to integrate FileIO into it.
LGTM, thanks a bunch for working on this @marvinlanhenke. I don't have much to comment on here, since it is more on the DataFusion side rather than the Iceberg side 👍
// Schemas and providers should be cached and evicted based on time.
// As of right now, schemas might become stale.
There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. - Leon Bambrick
Is there no way to leave this up to the user?
There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. - Leon Bambrick
😄 the classic.
I think leaving it up to the user leads us to the issue of blocking an async call in a sync trait function. If we have an idea how to handle this, we can better reason about if, when, and where to cache.
from the docs:
To implement CatalogProvider and SchemaProvider for remote catalogs, you need to provide an in memory snapshot of the required metadata. Most systems typically either already have this information cached locally or can batch access to the remote catalog to retrieve multiple schemas and tables in a single network call.
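The snapshot approach from the docs can be sketched in a few lines. This is a minimal, std-only illustration, not the actual iceberg-rust or DataFusion API: the `Catalog` trait, `SnapshotSchemaProvider`, and `FakeCatalog` here are hypothetical stand-ins (the real `Catalog` trait is async and far richer). The point is just that all remote access happens once in `try_new`, so the sync accessors only ever read local state.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a remote catalog client; the real
// iceberg `Catalog` trait is async and has many more methods.
trait Catalog {
    fn list_tables(&self, namespace: &str) -> Vec<String>;
}

// Snapshot-style provider: metadata is fetched once at construction,
// so the sync accessors never hit the network (mirroring the
// "in memory snapshot" approach from the DataFusion docs).
struct SnapshotSchemaProvider {
    // table name -> placeholder for table metadata
    tables: HashMap<String, ()>,
}

impl SnapshotSchemaProvider {
    // Batch the remote access up front.
    fn try_new(catalog: &dyn Catalog, namespace: &str) -> Result<Self, String> {
        let tables = catalog
            .list_tables(namespace)
            .into_iter()
            .map(|name| (name, ()))
            .collect();
        Ok(Self { tables })
    }

    // Sync accessor: cheap, but can serve stale names if the remote
    // catalog changed after the snapshot was taken.
    fn table_names(&self) -> Vec<String> {
        self.tables.keys().cloned().collect()
    }
}

// In-memory fake to make the sketch runnable.
struct FakeCatalog;
impl Catalog for FakeCatalog {
    fn list_tables(&self, _namespace: &str) -> Vec<String> {
        vec!["orders".to_string(), "customers".to_string()]
    }
}

fn main() {
    let provider = SnapshotSchemaProvider::try_new(&FakeCatalog, "default").unwrap();
    let mut names = provider.table_names();
    names.sort();
    println!("{:?}", names); // ["customers", "orders"]
}
```

The trade-off being debated in this thread is exactly the one visible here: the snapshot makes the sync trait methods trivial, at the cost of staleness.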
fn schema_names(&self) -> Vec<String> {
    self.schemas.keys().cloned().collect()
}
We should be hesitant about caching, and especially avoid premature optimization. For Iceberg in general, consistency is king.
For this PR, we should not cache them in the providers (e.g. CatalogProvider, SchemaProvider, etc.), but call the Catalog API directly.
This makes sense, but I see the issue with blocking on async calls. At first, I would accept the price of waiting on the blocking calls. Even if it is still a remote call, the ones to the REST catalog should be lightning-fast (that's where the caching happens).
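For context on "blocking on async calls": bridging an async catalog call into a sync trait method is mechanically simple, and a tiny park-based executor shows the cost involved. This is a hedged, std-only sketch; `list_schemas_remote` and `Provider` are hypothetical names, and in real code you would use your async runtime's blocking facilities rather than hand-rolling `block_on` (and blocking like this inside an async context can deadlock).

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Minimal executor: park the current thread until the future wakes it.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // The thread sits idle here for the full round-trip latency.
            Poll::Pending => thread::park(),
        }
    }
}

// Hypothetical async catalog call standing in for a real remote lookup...
async fn list_schemas_remote() -> Vec<String> {
    vec!["default".to_string()]
}

// ...bridged into a sync method, the shape DataFusion's provider traits require.
struct Provider;

impl Provider {
    fn schema_names(&self) -> Vec<String> {
        block_on(list_schemas_remote())
    }
}

fn main() {
    println!("{:?}", Provider.schema_names());
}
```

Every `schema_names` call pays the full remote latency, which is why the thread argues it is only acceptable if the catalog (e.g. REST) answers fast.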
I've addressed the issues raised by @Xuanwo - thanks for the review. I think we can merge this PR for now and continue working on #357. @Fokko @liurenjie1024
Thanks for working on this @marvinlanhenke. Looks good to me. There are some more integrations with DataFusion I'm interested in working on, like improving the ExecutionPlan for IcebergTableScan. Let's move this forward.
I think so. I think DataFusion also has Python bindings, so we can use Python for all the tests? I've not reviewed #349 yet, but for DataFusion we can go on without a separate crate for integration tests?
Thanks @marvinlanhenke for this PR, and @Fokko @Xuanwo @viirya @simonvandel @tshauck for the review!
* chore: basic structure
* feat: add IcebergCatalogProvider
* feat: add IcebergSchemaProvider
* feat: add IcebergTableProvider
* chore: add integration test infr
* fix: remove old test
* chore: update crate structure
* fix: remove workspace dep
* refactor: use try_join_all
* chore: remove feature flag
* chore: rename package
* chore: update readme
* feat: add TableType
* fix: import + async_trait
* fix: imports + async_trait
* chore: remove feature flag
* fix: cargo sort
* refactor: CatalogProvider `fn try_new`
* refactor: SchemaProvider `fn try_new`
* chore: update docs
* chore: update docs
* chore: update doc
* feat: impl `fn schema` on TableProvider
* chore: rename ArrowSchema
* refactor: remove DashMap
* feat: add basic IcebergTableScan
* chore: fix docs
* chore: add comments
* fix: clippy
* fix: typo
* fix: license
* chore: update docs
* chore: move derive stmt
* fix: collect into hashmap
* chore: use DFResult
* Update crates/integrations/datafusion/README.md

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
I had some fun working with DataFusion and I want to share the (rough) design with you.
As outlined in #242 I went ahead and created a new `integrations` folder with a `datafusion` subfolder and its respective modules (catalog.rs, schema.rs, etc.), starting with the CatalogProvider.
The overall structure should be extensible, meaning if we want to support an integration with e.g. polars, we simply create a new subfolder and implement the traits. The test infra is supposed to be used by every integration.
The IcebergCatalogProvider, for example, should be usable with any Catalog, since it holds an Arc for the client (dynamic dispatch).
I really would appreciate your thoughts on this approach - so we can discuss how to proceed, what's missing upstream (like #277), etc.
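The "any Catalog via Arc (dynamic dispatch)" idea can be sketched like this. It is a minimal, std-only illustration, not the real iceberg-rust code: the `Catalog` trait here is a hypothetical one-method stand-in, and `MemoryCatalog` is a made-up implementation. What it shows is that the provider depends only on the trait object, so any catalog backend can be plugged in.

```rust
use std::sync::Arc;

// Hypothetical, minimal version of the catalog abstraction; the real
// iceberg `Catalog` trait is async and far larger.
trait Catalog: Send + Sync {
    fn name(&self) -> String;
}

// The provider holds an Arc<dyn Catalog>, so any implementation
// (REST, Glue, in-memory, ...) works via dynamic dispatch.
struct IcebergCatalogProvider {
    client: Arc<dyn Catalog>,
}

impl IcebergCatalogProvider {
    fn try_new(client: Arc<dyn Catalog>) -> Result<Self, String> {
        Ok(Self { client })
    }

    fn catalog_name(&self) -> String {
        self.client.name()
    }
}

// A made-up backend to exercise the provider.
struct MemoryCatalog;

impl Catalog for MemoryCatalog {
    fn name(&self) -> String {
        "memory".to_string()
    }
}

fn main() {
    let provider = IcebergCatalogProvider::try_new(Arc::new(MemoryCatalog)).unwrap();
    println!("{}", provider.catalog_name()); // memory
}
```

A new integration (polars, etc.) would follow the same shape: its own subfolder, its own provider structs, all programmed against the shared catalog trait.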