Basic Integration with Datafusion #324
Is there no way to leave this up to the user?
😄the classic.
I think leaving it up to the user leads us to the issue of blocking on an async call in a sync trait function. If we have an idea of how to handle this, we can better reason about if, when, and where to cache.
from the docs:
I think this may be incorrect, since other processes may create new namespaces after the `try_new` method has run. We should fetch the result of `list_namespaces` each time. For performance, we could create something like `CachingCatalog` in Java that implements the `Catalog` trait. What do you think?
Yes, you're right, this is a simple but naive implementation, based on the docs and the reference from delta-rs.
I think having a caching wrapper would be a better solution.
However, since we have to provide a vec of all schema_names, we cannot check whether a single schema is in the cache and fetch it again if it is not. So I'm guessing we can only implement it with a time-based eviction policy, i.e. cache all schemas for 1 min to avoid multiple network calls in a short amount of time (same concept as debouncing)?
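To make the time-based eviction idea concrete, here is a minimal sketch using only the standard library. The type `TtlSchemaNames` and its `fetch` closure are hypothetical names for illustration; the closure stands in for the real remote listing call and nothing here is the PR's actual code.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Caches the full list of schema names for a fixed time-to-live.
/// `F` is a hypothetical stand-in for the real (remote) listing call.
pub struct TtlSchemaNames<F: Fn() -> Vec<String>> {
    fetch: F,
    ttl: Duration,
    // Time of the last fetch together with the names it returned.
    cached: Mutex<Option<(Instant, Vec<String>)>>,
}

impl<F: Fn() -> Vec<String>> TtlSchemaNames<F> {
    pub fn new(fetch: F, ttl: Duration) -> Self {
        Self { fetch, ttl, cached: Mutex::new(None) }
    }

    /// Returns the cached names while they are younger than `ttl`;
    /// otherwise refetches the whole list and restarts the clock.
    pub fn schema_names(&self) -> Vec<String> {
        let mut guard = self.cached.lock().unwrap();
        if let Some((fetched_at, names)) = guard.as_ref() {
            if fetched_at.elapsed() < self.ttl {
                return names.clone();
            }
        }
        let names = (self.fetch)();
        *guard = Some((Instant::now(), names.clone()));
        names
    }
}
```

With a TTL of one minute, repeated calls inside that window hit the remote catalog only once. The trade-off is that a namespace created by another process can stay invisible for up to one TTL.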
Yes, when adding a cache, we need a design for the trade-off between performance and consistency. But I think we could leave that to the actual cache implementation? For this PR, we should not cache them in the providers, e.g. `CatalogProvider`, `SchemaProvider`, etc., but call the `Catalog` API directly.
Sure, we can do it like that for this PR and implement the more sophisticated solution later.
@liurenjie1024
...I took another look, and I'm not sure we can call the Catalog API directly each time a function on e.g. `trait CatalogProvider` is invoked. The trait is required to be synchronous; however, our catalog API calls are async.
So perhaps we leave it as is for this PR and work on the cache instead? However, I guess we're facing the same problem there as well? We could use `rt.block_on(...)` to solve this, but maybe I'm just missing something and we can do this without needing to block?
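To illustrate the sync/async mismatch, here is a minimal sketch of what blocking inside a synchronous trait method looks like. All names are hypothetical stand-ins (`FakeCatalog`, `SyncProvider`, `Provider`), and `block_on` is a tiny hand-rolled executor used only so the example needs nothing beyond the standard library. It is not the runtime's `rt.block_on(...)` and not the PR's code.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread by unparking it.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future executor: polls the future and parks the
/// current thread until the future signals it can make progress.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async catalog; the real call would go over the network.
struct FakeCatalog {
    namespaces: Vec<String>,
}

impl FakeCatalog {
    async fn list_namespaces(&self) -> Vec<String> {
        self.namespaces.clone()
    }
}

/// Hypothetical synchronous provider trait, standing in for a trait
/// whose methods cannot be declared `async`.
trait SyncProvider {
    fn schema_names(&self) -> Vec<String>;
}

struct Provider {
    catalog: FakeCatalog,
}

impl SyncProvider for Provider {
    fn schema_names(&self) -> Vec<String> {
        // The calling thread is blocked until the async call completes.
        block_on(self.catalog.list_namespaces())
    }
}
```

The sketch shows the cost being discussed: every call to `schema_names` holds the calling thread for the full duration of the catalog request.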
It is possible, but we should avoid using `rt.block_on()` in our lib.
I second this suggestion. We can move on and figure out how to improve this part in the future.
There is a tracking issue to add a runtime API: #124
But I agree with @Xuanwo that we should avoid `rt.block_on` as much as possible.
We should be hesitant about caching, and especially avoid upfront optimizations. For Iceberg in general, consistency is king.
This makes sense, but I see the issue with blocking on async calls. At first, I would accept the cost of waiting for the blocking calls. Even though they are still remote calls, the ones to the REST catalog should be lightning-fast (that's where the caching happens).
I think blocking + timeout would be a reasonable solution before we implement sophisticated caching. For this PR, we can leave it as it is now, since it's not even caching, it's a snapshot. WDYT?
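As a rough illustration of "blocking + timeout", the sketch below runs a call on a worker thread and waits for the result with a deadline. The helper name `call_with_timeout` is hypothetical, and the closure stands in for the catalog request. It uses plain threads from the standard library rather than an async runtime, so it shows the idea, not how the library would implement it.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `call` on a worker thread and waits at most `timeout` for its result.
/// Returns `None` if the deadline passes or the worker panics. The worker is
/// not cancelled on timeout; its late result is simply dropped.
pub fn call_with_timeout<T, F>(call: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // After a timeout the receiver is gone, so a send error is expected
        // and safe to ignore.
        let _ = tx.send(call());
    });
    rx.recv_timeout(timeout).ok()
}
```

A caller would treat `None` as a failed listing and surface an error, which keeps a slow catalog from blocking a synchronous provider method indefinitely.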
+1 for "moving on" and creating or tracking those issues in #357 before this PR gets too big.