any plan for Iceberg Table on S3? #1468
Comments
You can use it with S3 using only the Hadoop client libraries; you don't actually need a Hadoop cluster or HDFS. |
Supporting S3 requires Hive because of S3's eventual consistency. I see the open-source version of Delta Lake solved it in a different way, but one that is pretty limited: it assumes concurrent writes to S3 only happen in a single Spark driver (https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/storage/S3SingleDriverLogStore.scala). |
Iceberg works reliably with S3 even if the same table is accessed via multiple clusters and query engines. Using Iceberg requires a catalog that can atomically swap a pointer to the metadata file. This can be done using a compare-and-swap or a lock/unlock API. Iceberg contains a built-in implementation that uses the Hive metastore (lock/unlock) to work with S3 reliably. Anyone could easily build an integration for any catalog. For example, one may have a Cassandra-based catalog and use compare-and-swap to commit new table versions. That will be enough to work with S3 reliably. |
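To illustrate the atomic pointer swap described in the comment above, here is a minimal sketch in Python. It is not Iceberg's API: `InMemoryCatalog`, `compare_and_swap`, and `commit` are hypothetical names, and an in-process lock stands in for whatever atomic primitive a real catalog store would provide.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file location."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Set the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


def commit(catalog, table, write_metadata, max_retries=5):
    """Optimistic commit: produce a new metadata location, then try to swap the pointer."""
    for _ in range(max_retries):
        base = catalog.current(table)
        # write_metadata is an assumed callback, e.g. one that uploads a new metadata file to S3
        new_location = write_metadata(base)
        if catalog.compare_and_swap(table, base, new_location):
            return new_location
    raise RuntimeError("commit failed after retries")
```

Because the data and metadata files are written under new names and only the pointer changes, a writer that loses the race simply re-reads the current pointer and retries.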
We've been working on a non-Hive way to provide this functionality and plan on contributing it to the project within the next two weeks. |
That will be awesome, can't wait to see it. Thank you - jacques-n! |
Thanks for all the responses as alternatives. All answers are great! |
That sounds great! I assume it still needs to do CAS against some external store (I'd be really curious if it doesn't rely on one). Which store is that? Is it one of the AWS services? If so, even better, since there would be no dependency outside of AWS, and using S3 already means being locked in to AWS. |
We're doing something pluggable but the default implementation is on top of DynamoDB. |
Is it possible to write a JDBC-based catalog? That could unlock many catalog options. |
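One way a SQL-backed catalog could provide the atomic swap is a conditional `UPDATE` that only succeeds when the stored location still matches the expected one. The sketch below uses Python's built-in `sqlite3` in place of a JDBC database; the `tables` schema and the function names are assumptions for illustration, not an actual Iceberg catalog schema.

```python
import sqlite3


def create_catalog(conn):
    # Hypothetical schema: one row per table, pointing at its current metadata file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tables ("
        " name TEXT PRIMARY KEY,"
        " metadata_location TEXT NOT NULL)"
    )
    conn.commit()


def register_table(conn, name, location):
    conn.execute(
        "INSERT INTO tables (name, metadata_location) VALUES (?, ?)",
        (name, location),
    )
    conn.commit()


def swap_location(conn, name, expected, new):
    """Compare-and-swap: update the pointer only if it still equals `expected`."""
    cur = conn.execute(
        "UPDATE tables SET metadata_location = ? "
        "WHERE name = ? AND metadata_location = ?",
        (new, name, expected),
    )
    conn.commit()
    # Exactly one updated row means this writer won; zero means a concurrent commit got there first
    return cur.rowcount == 1
```

A caller that gets `False` back would re-read the current location, rebuild its metadata on top of it, and try again.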
That's a good idea. I know that AWS Glue is backed by DynamoDB, so if you can make a catalog using DynamoDB, then possibly the AWS team can implement the atomic swap in Glue. If I'm not mistaken, you'd need to use either read/write consistency or possibly a DynamoDB versioned object. Looking forward to seeing the DynamoDB catalog, as I assume many companies looking to write to S3 are also likely using DynamoDB. I know that my company uses DynamoDB a ton, so this would be a great workaround until there is Glue Catalog support (which I've been giving some thought to myself). |
Hi @jacques-n this is Jack from AWS. We are planning to introduce a new |
Hey guys, we just posted more information on the new stuff we've been building for Iceberg + DynamoDB. You can check it out here: https://projectnessie.org/ We'll have a PR up against Iceberg shortly to contribute the Iceberg integrations: https://github.com/projectnessie/nessie/tree/main/clients/iceberg |
Very cool! |
I just sent out a PR for AWS Glue support. With this update you can use |
For anyone new to this issue, I think we have summarized all information in https://iceberg.apache.org/aws/, and we can close this issue. @Lindayangyy |
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' |
We are new to Apache Iceberg and are looking for an Iceberg table or warehouse (catalog) implementation on S3. Is this possible without any dependency on Hive or HDFS (Hadoop)? The current implementation seems tightly coupled with Hive and Hadoop.