any plan for Iceberg Table on S3? #1468

Closed
Lindayangyy opened this issue Sep 16, 2020 · 17 comments


@Lindayangyy

We are new to Apache Iceberg and are looking for an Iceberg table or warehouse (catalog) implementation on S3. Is that possible without any dependency on Hive and HDFS (Hadoop)? The current implementation seems tightly coupled with Hive and Hadoop.

@RussellSpitzer
Member

You can use it with S3 using only the Hadoop client libraries; you don't actually need a Hadoop cluster or HDFS.

@HeartSaVioR
Contributor

Supporting S3 requires Hive because of S3's eventual consistency. I see the OSS version of Delta Lake solved it in a different way, but that approach is quite limited: it assumes concurrent writes to S3 only happen in a single Spark driver. (https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/storage/S3SingleDriverLogStore.scala)

@aokolnychyi
Contributor

Iceberg works reliably with S3 even if the same table is accessed via multiple clusters and query engines. Using Iceberg requires a catalog that can atomically swap a pointer to the metadata file. This can be done using a compare-and-swap or a lock/unlock API. Iceberg contains a built-in implementation that uses the Hive Metastore (lock/unlock) to work with S3 reliably. Anyone could easily build an integration for any catalog. For example, one may have a Cassandra-based catalog and use compare-and-swap to commit new table versions. That will be enough to work with S3 reliably.
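
To make the compare-and-swap commit described above concrete, here is a minimal sketch. It assumes a hypothetical in-memory catalog; `InMemoryCatalog`, `compare_and_swap`, `commit`, and `write_metadata` are illustrative names, not Iceberg's API, and a real catalog would keep the pointer in an external store rather than in process memory.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()  # makes the check-and-set below atomic

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Point `table` at `new` only if it still points at `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


def commit(catalog, table, write_metadata, max_retries=3):
    """Optimistic commit: write a new metadata file, then try to swap the pointer."""
    for _ in range(max_retries):
        base = catalog.current(table)
        # Assumption: write_metadata stores a new metadata file (e.g. on S3)
        # derived from `base` and returns its location.
        new_location = write_metadata(base)
        if catalog.compare_and_swap(table, base, new_location):
            return new_location
    raise RuntimeError("commit failed after retries")
```

The point of the sketch is that the data and metadata files themselves can live on S3; only the single pointer swap needs an atomic primitive.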

@jacques-n

We've been working on a non-Hive way to provide this functionality and plan on contributing it to the project within the next two weeks.

@Lindayangyy
Author

That will be awesome, can't wait to see it. Thank you - jacques-n!

@Lindayangyy
Author

Thanks for all the responses as alternatives. All answers are great!

@HeartSaVioR
Contributor

That sounds great! Assuming it still needs to do CAS against external storage (I'd be really curious if it doesn't rely on external storage), which storage is that? Is it one of the AWS services? If so, even better, as there would be no external dependency outside of AWS, and since we assume S3 is in use, we are already locked in to AWS.

@jacques-n

We're doing something pluggable but the default implementation is on top of DynamoDB.

@ismailsimsek
Contributor

Is it possible to write a JDBC-based catalog? That could unlock many catalog options.
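
As a rough sketch of how a relational database could provide the atomic pointer swap, the example below uses Python's built-in `sqlite3` as a stand-in for a JDBC-accessible database. The `table_pointers` schema and the function names are assumptions made for illustration, not an existing Iceberg catalog.

```python
import sqlite3


def create_catalog(conn):
    # Hypothetical schema: one row per table, holding the current metadata location.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_pointers ("
        " table_name TEXT PRIMARY KEY,"
        " metadata_location TEXT NOT NULL)"
    )
    conn.commit()


def create_table(conn, table_name, location):
    # The primary key rejects a second row for the same table name.
    conn.execute("INSERT INTO table_pointers VALUES (?, ?)", (table_name, location))
    conn.commit()


def swap_pointer(conn, table_name, expected, new):
    """Compare-and-swap: the UPDATE matches a row only if the pointer is unchanged."""
    cur = conn.execute(
        "UPDATE table_pointers SET metadata_location = ?"
        " WHERE table_name = ? AND metadata_location = ?",
        (new, table_name, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means another writer won
```

The conditional `UPDATE ... WHERE metadata_location = ?` acts as the compare-and-swap: a writer holding a stale pointer updates zero rows and has to retry.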

@kbendick
Contributor

> We're doing something pluggable but the default implementation is on top of DynamoDB.

That's a good idea. I know that AWS Glue is backed by DynamoDB, so if you can make a catalog using Dynamo, then possibly the AWS team can implement the atomic swap in Glue. If I'm not mistaken, you'd need to use either read / write consistency or possibly a DynamoDB versioned object.

Looking forward to seeing the DynamoDB catalog, as I assume many companies looking to write to S3 are also likely using DynamoDB. I know that my company uses DynamoDB a ton, so this would be a great workaround until there is Glue Catalog support (which I've been giving some thought to myself).
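
One way the atomic swap could be expressed in DynamoDB is a conditional write. The sketch below only builds the keyword arguments for boto3's `Table.update_item`; the item layout (`table_name` key, `metadata_location` attribute) and the helper name are assumptions for illustration and do not describe the implementation mentioned above.

```python
def build_swap_request(table_name, expected_location, new_location):
    """Build keyword arguments for boto3's DynamoDB Table.update_item.

    The attribute names are hypothetical. The ConditionExpression makes the
    write succeed only if the stored pointer still equals expected_location.
    """
    return {
        "Key": {"table_name": table_name},
        "UpdateExpression": "SET metadata_location = :new",
        "ConditionExpression": "metadata_location = :expected",
        "ExpressionAttributeValues": {
            ":new": new_location,
            ":expected": expected_location,
        },
    }
```

With a boto3 `Table` resource this would be called as `table.update_item(**build_swap_request(...))`. If the condition does not hold, DynamoDB rejects the write with a `ConditionalCheckFailedException`, which the writer can treat as a lost commit race and retry.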

@jackye1995
Contributor

Hi @jacques-n, this is Jack from AWS. We are planning to introduce a new iceberg-aws module, and we do plan to offer a Glue + DynamoDB implementation for Catalog and TableOperations. Since you say you already have something working, let's sync after you have a PR and see what the best way is to ship this all together 😃

@jacques-n

Hey guys, we just posted more information on the new stuff we've been building for Iceberg + DynamoDB. You can check it out here: https://projectnessie.org/

We'll have a PR up against Iceberg shortly to contribute the Iceberg integrations:
https://github.com/projectnessie/nessie/tree/main/clients/iceberg

@RussellSpitzer
Member

RussellSpitzer commented Oct 1, 2020 via email

@jackye1995
Contributor

I just sent out a PR for AWS Glue support. With this update you can use HiveCatalog without the need to set up any Hive infrastructure and build your data lake on top of S3. #1608

@jackye1995
Contributor

For anyone new to this issue, I think we have summarized all information in https://iceberg.apache.org/aws/, and we can close this issue. @Lindayangyy


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Feb 25, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Mar 11, 2024
8 participants