
Basic Integration with Datafusion #324

Merged
37 commits merged into apache:main on May 2, 2024

Conversation

marvinlanhenke (Contributor)

I had some fun working with DataFusion and I want to share the (rough) design with you.

As outlined in #242 I went ahead and:

  • created a new crate integrations
  • added a subfolder datafusion with its respective modules (catalog.rs, schema.rs, etc.)
  • implemented some of the traits, e.g. CatalogProvider
  • added basic test infra for integration tests with HiveMetastore

The overall structure should be extensible, meaning if we want to support an integration with e.g. polars, we simply create a new subfolder and implement the traits.
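For orientation, the resulting layout looks roughly like this (reconstructed from the file paths referenced in the review comments below, after the crate-structure rename during review, so treat it as approximate):

```
crates/integrations/datafusion/
├── Cargo.toml
├── README.md
└── src/
    ├── catalog.rs
    ├── schema.rs
    ├── table.rs
    └── physical_plan/
        └── scan.rs
```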

The test infra is supposed to be used by every integration.

The IcebergCatalogProvider, for example, should be usable with any Catalog implementation (since it holds an Arc<dyn Catalog> client, i.e. dynamic dispatch).
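As a minimal sketch of that dynamic-dispatch idea (field and method names are illustrative and may not match the final code in this PR; it only assumes the `iceberg::Catalog` trait object):

```rust
use std::sync::Arc;

use iceberg::Catalog;

/// Sketch: the provider depends only on the `Catalog` trait object, so any
/// catalog implementation (REST, HiveMetastore, ...) can be plugged in.
pub struct IcebergCatalogProvider {
    /// Dynamically dispatched catalog client.
    client: Arc<dyn Catalog>,
}

impl IcebergCatalogProvider {
    /// Construct the provider from any catalog implementation.
    pub fn new(client: Arc<dyn Catalog>) -> Self {
        Self { client }
    }
}
```

(The constructor in this PR ended up as an async `try_new` instead, per the commit list below, but the dependency on `Arc<dyn Catalog>` is the point here.)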

I really would appreciate your thoughts on this approach - so we can discuss how to proceed, what's missing upstream (like #277), etc.

marvinlanhenke marked this pull request as draft on April 6, 2024, 04:28
@marvinlanhenke (Contributor, Author)

@liurenjie1024 @ZENOTME @Fokko
PTAL and let me know what you think.

@liurenjie1024 (Collaborator)

Thanks @marvinlanhenke! This is amazing, I'll review it later!

@tshauck left a comment

This is really cool! I've been hoping I wouldn't have to try something like this myself 😅... I left a comment about package structure, imagining how I would use this.

Also, if you all do want to move forward with this, I'd be happy to help fill in some of the DataFusion trait implementations.

Cargo.toml (outdated, resolved)
crates/integrations/src/datafusion/schema.rs (outdated, resolved)
@ZENOTME (Contributor)

ZENOTME commented Apr 22, 2024

Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke

marvinlanhenke marked this pull request as ready for review on April 22, 2024, 13:10
@marvinlanhenke (Contributor, Author)

Thanks! Sorry for replying late. I think this is a good start for the integration work. And I have completed #277. Maybe we can convert this PR to ready for review now. @marvinlanhenke

Thanks for the feedback - I made some minor changes based on the suggestions and marked this PR as ready for review. If we agree on the basic design here, we could merge and split the actual implementation into multiple issues.

@liurenjie1024 @ZENOTME @Fokko @Xuanwo @sdd @tshauck PTAL - especially the part about "feature flags"; I have no idea what's best practice in 🦀

marvinlanhenke changed the title from "[WIP] Integration with Datafusion" to "Basic Integration with Datafusion" on April 22, 2024
@liurenjie1024 (Collaborator) left a comment

Thanks @marvinlanhenke for this PR! This is really exciting for me, since it will provide the foundation of a SQL interface for iceberg-rust! It's really amazing! I've left some comments for improvement, but it looks great! I'll invite the DataFusion community to help review.

crates/integrations/datafusion/Cargo.toml (outdated, resolved)
crates/integrations/datafusion/Cargo.toml (outdated, resolved)
crates/integrations/datafusion/README.md (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/schema.rs (outdated, resolved)
crates/integrations/datafusion/src/table.rs (outdated, resolved)
crates/integrations/datafusion/src/catalog.rs (resolved)
@marvinlanhenke (Contributor, Author)

I've left some comment to improve, but it looks great! I'll invite datafusion community to help review.

@liurenjie1024
Thanks for the review.
I fixed most of the basic issues regarding structure, naming, async-trait, etc.

I'll work on the missing implementations over the next couple of days and see how far we can get with the current state of iceberg-rust.

@marvinlanhenke (Contributor, Author)

@liurenjie1024 @ZENOTME @viirya @simonvandel
...I just went ahead and pushed the recent updates; PTAL

unresolved / todo:

  • proper integration tests for table scan (requires us to set up a table with actual snapshots / data)
  • a cache for IcebergCatalogProvider and IcebergSchemaProvider / so data does not become stale
  • improve impl ExecutionPlan for IcebergTableScan (once we support filter pushdown, etc.)

@liurenjie1024 (Collaborator)

@liurenjie1024 @ZENOTME @viirya @simonvandel ...I just went ahead and pushed the recent updates; PTAL

unresolved / todo:

  • proper integration tests for table scan (requires us to set up a table with actual snapshots / data)
  • a cache for IcebergCatalogProvider and IcebergSchemaProvider / so data does not become stale
  • improve impl ExecutionPlan for IcebergTableScan (once we support filter pushdown, etc.)

Hi @marvinlanhenke, how about opening a tracking issue for everything related to the DataFusion integration? I guess we will have more things to do, and we can keep editing that tracking issue.

@liurenjie1024 (Collaborator) left a comment

Thanks @marvinlanhenke for this PR!

@liurenjie1024 (Collaborator)

Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL

@marvinlanhenke (Contributor, Author)

Let's wait a moment to see what others think about the catalog snapshot problem. cc @Xuanwo @Fokko @viirya @sdd PTAL

If it turns out that the PR is okay (for now), let me update the missing docs before we merge - that should make the follow-up work a lot easier in a couple of days/weeks.

@Xuanwo (Member) left a comment

Thanks for raising this PR. Mostly LGTM, just some minor nits.

I'm looking forward to working with you to integrate FileIO into it.

crates/integrations/datafusion/src/physical_plan/scan.rs (outdated, resolved)
crates/integrations/datafusion/src/schema.rs (outdated, resolved)
crates/integrations/datafusion/src/schema.rs (outdated, resolved)
@Fokko (Contributor) left a comment

LGTM, thanks a bunch for working on this @marvinlanhenke. I don't have much to comment on here since it is more on the DataFusion side rather than the Iceberg side 👍

Comment on lines +48 to +49
// Schemas and providers should be cached and evicted based on time
// As of right now, schemas might become stale.
Contributor

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. - Leon Bambrick

Is there no way to leave this up to the user?

Contributor (Author)

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. - Leon Bambrick

😄 the classic.

I think leaving it up to the user leads us to the issue of blocking on an async call in a sync trait function? If we have an idea how to handle that, we can better reason about if, when, and where to cache.

from the docs:

To implement CatalogProvider and SchemaProvider for remote catalogs, you need to provide an in memory snapshot of the required metadata. Most systems typically either already have this information cached locally or can batch access to the remote catalog to retrieve multiple schemas and tables in a single network call.
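To make the trade-off concrete, here is a self-contained sketch (not the approach this PR takes) of blocking on an async call from inside a sync provider method using tokio's block_in_place; RemoteCatalog is a hypothetical stand-in for an `Arc<dyn Catalog>`, and this pattern requires a multi-threaded tokio runtime:

```rust
use std::sync::Arc;

// Hypothetical stand-in for an async catalog client (e.g. an `Arc<dyn Catalog>`).
struct RemoteCatalog;

impl RemoteCatalog {
    async fn list_table_names(&self) -> Vec<String> {
        // Imagine a network round-trip to the catalog here.
        vec!["lineitem".to_string(), "orders".to_string()]
    }
}

struct SchemaProviderSketch {
    catalog: Arc<RemoteCatalog>,
}

impl SchemaProviderSketch {
    // Sync method (DataFusion's `SchemaProvider::table_names` is sync) that
    // blocks on the async catalog call. `block_in_place` only works on a
    // multi-threaded tokio runtime.
    fn table_names(&self) -> Vec<String> {
        tokio::task::block_in_place(|| {
            tokio::runtime::Handle::current().block_on(self.catalog.list_table_names())
        })
    }
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let provider = SchemaProviderSketch {
        catalog: Arc::new(RemoteCatalog),
    };
    println!("{:?}", provider.table_names());
}
```

The providers in this PR instead take the in-memory snapshot route the DataFusion docs describe: the async `try_new` constructors fetch the metadata up front, which is exactly why the staleness/caching question comes up here.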

fn schema_names(&self) -> Vec<String> {
    self.schemas.keys().cloned().collect()
}
Contributor

We should be hesitant with caching, especially avoiding upfront optimizations. For Iceberg in general, consistency is king.

For this PR, we should not cache them in the providers, e.g. CatalogProvider, SchemaProvider, etc., but call the Catalog API directly.

This makes sense, but I see the issue with blocking on async calls. For now, I would accept the price of waiting on those blocking calls. Even though it is still a remote call, the ones to the REST catalog should be lightning-fast (that's where the caching happens).

crates/integrations/datafusion/src/physical_plan/scan.rs (outdated, resolved)
@marvinlanhenke (Contributor, Author)

I've addressed the issues raised by @Xuanwo - thanks for the review.

I think we can merge this PR for now and continue working on #357, especially the caching issues and proper integration testing with data.

@Fokko @liurenjie1024
I haven't checked just now, but from memory I believe pyiceberg uses pyspark to set up tables with actual data for proper integration testing? Would it make sense to tackle such an issue in the next few weeks, in order to provide proper integration testing? Or should we wait until #349 lands and we have a dedicated crate for e2e testing?

@viirya (Member)

viirya commented Apr 30, 2024

Thanks for working on this @marvinlanhenke. Looks good to me. There are some more integrations with DataFusion I'm interested in working on, like improving ExecutionPlan for IcebergTableScan. Let's move this forward.

@liurenjie1024 (Collaborator)

I haven't checked now, but from memory I believe py-iceberg uses pyspark to setup tables with actual data for proper integration testing?

I think so. I think DataFusion also has Python bindings, so we could use Python for all tests?

I haven't reviewed #349 yet, but for DataFusion we can go on without a separate crate for integration tests?

liurenjie1024 merged commit bbd042d into apache:main on May 2, 2024
7 checks passed
@liurenjie1024 (Collaborator)

Thanks @marvinlanhenke for this PR, and @Fokko @Xuanwo @viirya @simonvandel @tshauck for the review!

c-thiel pushed a commit to c-thiel/iceberg-rust that referenced this pull request May 13, 2024
* chore: basic structure

* feat: add IcebergCatalogProvider

* feat: add IcebergSchemaProvider

* feat: add IcebergTableProvider

* chore: add integration test infr

* fix: remove old test

* chore: update crate structure

* fix: remove workspace dep

* refactor: use try_join_all

* chore: remove feature flag

* chore: rename package

* chore: update readme

* feat: add TableType

* fix: import + async_trait

* fix: imports + async_trait

* chore: remove feature flag

* fix: cargo sort

* refactor: CatalogProvider `fn try_new`

* refactor: SchemaProvider `fn try_new`

* chore: update docs

* chore: update docs

* chore: update doc

* feat: impl `fn schema` on TableProvider

* chore: rename ArrowSchema

* refactor: remove DashMap

* feat: add basic IcebergTableScan

* chore: fix docs

* chore: add comments

* fix: clippy

* fix: typo

* fix: license

* chore: update docs

* chore: move derive stmt

* fix: collect into hashmap

* chore: use DFResult

* Update crates/integrations/datafusion/README.md

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>