
Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables #6428

Merged (23 commits) on Jan 14, 2023

Conversation

dennishuo (Contributor)

This read-only implementation of the Catalog interface, initially built on top of the Snowflake JDBC driver for the connection layer, enables engines like Spark that use the Iceberg Java SDK to consume Snowflake-managed Iceberg tables via the Iceberg Catalog interfaces.

Example, assuming a Snowflake account with a database iot_data containing a schema public and a managed Iceberg table sensor_test_results:

spark-shell --conf spark.sql.catalog.snowlog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.snowlog.catalog-impl=org.apache.iceberg.snowflake.SnowflakeCatalog \
    --conf spark.sql.catalog.snowlog.uri="jdbc:snowflake://$ACCOUNT.snowflakecomputing.com" \
    ....
scala> spark.sessionState.catalogManager.setCurrentCatalog("snowlog");
scala> spark.sql("show namespaces in iot_data").show(false);
scala> spark.sql("select * from iot_data.public.sensor_test_results limit 10").show(false);

Note that the involvement of a JDBC driver is only incidental, and the functionality differs from the JdbcCatalog: here, Snowflake itself manages the manifest/metadata files and table/snapshot metadata, while this catalog layer coordinates metadata-file locations and discovers the latest table snapshot versions without resorting to file listing or "directory-name" listing (for listTables or listNamespaces) the way the HadoopCatalog does.
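For illustration, here is a minimal sketch of equivalent usage through the Catalog API directly, outside Spark. This is not code from the PR; the "uri" property key mirrors the spark-shell config above, and the account placeholder is hypothetical:

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.snowflake.SnowflakeCatalog;

public class SnowflakeCatalogExample {
  public static void main(String[] args) {
    // "uri" mirrors spark.sql.catalog.snowlog.uri above; auth options for the
    // underlying Snowflake JDBC driver would be supplied as properties the same way.
    SnowflakeCatalog catalog = new SnowflakeCatalog();
    catalog.initialize(
        "snowlog", Map.of("uri", "jdbc:snowflake://<account>.snowflakecomputing.com"));

    // Read-only access to a Snowflake-managed Iceberg table.
    Table table =
        catalog.loadTable(TableIdentifier.of("iot_data", "public", "sensor_test_results"));
    System.out.println(table.currentSnapshot());
  }
}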

sfc-gh-dhuo and others added 3 commits December 12, 2022 21:50
…#1)

Initial read-only Snowflake Catalog implementation built on top of the Snowflake JDBC driver,
providing support for basic listing of namespaces, listing of tables, and loading/reads of tables.

Auth options are passed through to the JDBC driver.

Co-authored-by: Maninder Parmar <maninder.parmar@snowflake.com>
Co-authored-by: Maninder Parmar <maninder.parmar+oss@snowflake.com>
Co-authored-by: Dennis Huo <dennis.huo+oss@snowflake.com>
Add JdbcSnowflakeClientTest using mocks; provides full coverage of JdbcSnowflakeClient
and entities' ResultSetHandler logic.

Also update the target Spark runtime versions to be included.
@danielcweeks danielcweeks self-requested a review December 15, 2022 17:44
consistency and future interoperability with inheriting from abstract unit-test base classes.
@nastra (Contributor) left a comment

Thanks for the PR @dennishuo, great to have this. I did a more thorough review and my comments are inline.

- Convert unittests to all use assertj/Assertions for "fluent assertions"
- Refactor test injection into overloaded initialize() method
- Add test cases for close() propagation
- Use CloseableGroup.
SnowflakeTableOperations class itself, add test case.
SnowflakeClient/JdbcSnowflakeClient layers and merge SnowflakeTable
and SnowflakeSchema into a single SnowflakeIdentifier that also
encompasses ROOT and DATABASE level identifiers.

A SnowflakeIdentifier thus functions like a type-checked/constrained
Iceberg TableIdentifier, and eliminates any tight coupling between
a SnowflakeClient and Catalog business logic.

Parsing of Namespace numerical levels into a SnowflakeIdentifier
is now fully encapsulated in NamespaceHelpers so that callsites
don't duplicate namespace-handling/validation logic.
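For illustration, a hedged sketch of the namespace-level mapping described above; the class and method names here are hypothetical simplifications, and the real logic lives in NamespaceHelpers and SnowflakeIdentifier:

import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical sketch: map Iceberg Namespace levels onto the ROOT/DATABASE/SCHEMA/TABLE
// identifier types described above.
final class NamespaceMappingSketch {
  enum IdentifierType { ROOT, DATABASE, SCHEMA, TABLE }

  static IdentifierType typeOf(Namespace namespace) {
    switch (namespace.levels().length) {
      case 0:
        return IdentifierType.ROOT; // empty namespace -> account root
      case 1:
        return IdentifierType.DATABASE; // e.g. ["iot_data"]
      case 2:
        return IdentifierType.SCHEMA; // e.g. ["iot_data", "public"]
      default:
        throw new IllegalArgumentException(
            "Snowflake namespaces are at most database.schema: " + namespace);
    }
  }

  static IdentifierType typeOf(TableIdentifier identifier) {
    // A table identifier must sit directly under a database.schema namespace.
    if (typeOf(identifier.namespace()) != IdentifierType.SCHEMA) {
      throw new IllegalArgumentException(
          "Snowflake tables must live under database.schema: " + identifier);
    }
    return IdentifierType.TABLE;
  }

  private NamespaceMappingSketch() {}
}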
@dennishuo (Contributor, Author)

@nastra Thanks for the thorough review and suggestions! Finished applying all your suggestions, including fully converting to assertj/Assertions and refactoring out the Namespace<->SnowflakeIdentifier parsing to better encapsulate all the argument-checking/parsing into one place.

@dennishuo dennishuo requested review from nastra and removed request for danielcweeks December 17, 2022 03:55
@danielcweeks danielcweeks self-requested a review December 17, 2022 19:43
@dennishuo (Contributor, Author)

Interesting, not sure why it removed @danielcweeks when I re-requested review from @nastra (definitely wasn't intentional -- as far as I can tell my repository permissions don't even provide the ability to remove reviewer requests). I wonder if this is possibly another manifestation of community/community#8939

@dennishuo dennishuo requested review from danielcweeks and removed request for rdblue January 7, 2023 18:09
@danielcweeks (Contributor)

@dennishuo #6538 is merged, so you might want to rebase on top of that. I'm good, but it would be great to have @nastra sign off as well since there are a couple of comments still open.

@dennishuo (Contributor, Author)

Thanks for the note! Successfully merged main. I'll await @nastra's review.

@dennishuo dennishuo requested review from nastra and removed request for danielcweeks January 9, 2023 23:39
@nastra (Contributor) left a comment

Thanks @dennishuo for working on this. I've commented on a few nits but overall this looks good to me.
It would also be good to have @rdblue review/approve this.

@nastra nastra requested a review from rdblue January 10, 2023 15:25
@jackye1995 jackye1995 self-requested a review January 12, 2023 23:54
* snowflakeLocation is a known non-compatible path syntax but fails to match the expected path
* components for a successful translation.
*/
public static String snowflakeLocationToIcebergLocation(String snowflakeLocation) {
Contributor

I see Azure and GCP paths converted, what about the S3 ones? Are the variants like s3a, s3n, etc. all natively compatible in Snowflake?

Contributor Author

Right, in situations where Snowflake handles paths coming from externally produced sources, it tracks a canonical form of them and indeed handles s3a and s3n.

Here, the conversion is for outbound paths produced by Snowflake, where the "s3://" prefix is already used natively, so only Azure and GCS need translation to the "standard" schemes accepted by e.g. HadoopFileSystem.
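For illustration, a hedged sketch of that outbound translation; the scheme mappings shown are assumptions, not copied from the PR:

// Hypothetical, simplified sketch: s3:// passes through, gcs:// is rewritten to gs://,
// and Azure locations would similarly be rewritten (e.g. into a wasbs://container@account/... form).
final class LocationTranslationSketch {
  static String toIcebergLocation(String snowflakeLocation) {
    if (snowflakeLocation.startsWith("s3://")) {
      return snowflakeLocation; // S3 locations already use the standard scheme
    }
    if (snowflakeLocation.startsWith("gcs://")) {
      // Rewrite to the gs:// scheme expected by common FileSystem/FileIO implementations
      return "gs://" + snowflakeLocation.substring("gcs://".length());
    }
    throw new IllegalArgumentException("Unsupported Snowflake location: " + snowflakeLocation);
  }

  private LocationTranslationSketch() {}
}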

This does remind me though - I've been meaning to open a discussion to see if anyone has thought about maybe adding some ease-of-use hooks for last-mile automatic basic path translations right before they go to FileIO resolution. For example, someone might want s3://somebucket paths to be rewritten to a viewfs:// base path before letting HadoopFileIO automatically delegate to the right ViewFs impl. Or it could come up in cases where manifest files hold s3:// paths but someone wants everything to go through a corresponding dbfs:// mount point. Do you know of any prior discussions along those lines, and would it be worth opening an issue for broader input?

Contributor

Thanks for the clarification. I approved the changes; very excited to see this from Snowflake.

> I've been meaning to open a discussion to see if anyone has thought about maybe adding some ease-of-use hooks for last-mile automatic basic path translations right before they go to FileIO resolution.

For HadoopFileIO, I think this is not really needed because we typically see users have already configured HDFS settings to map schemes to whatever file system implementations they would like to use.

I think ResolvingFileIO already does this kind of translation to some extent; maybe we can extend its functionality on that front.


static class FileIOFactory {
  public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
    return CatalogUtil.loadFileIO(impl, properties, hadoopConf);
  }
}
Contributor

This factory seems odd since it has no state that is passed through to the loadFileIO method. Couldn't this just call loadFileIO directly?

Contributor Author

This was just extracted to preserve the flow of unit-test setup: it lets the test return the pre-configured InMemoryFileIO fake instance rather than having dynamic classloading produce an empty one. The alternative would have been to introduce some static global state in InMemoryFileIO, but that seems prone to causing cross-test-case problems.

I could add a comment or annotation here to explain its existence if that would make it cleaner.
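For illustration, a hedged sketch (class name hypothetical, not the actual test code) of how that seam lets a test hand back a pre-configured fake instead of a reflectively loaded FileIO:

import java.util.Map;
import org.apache.iceberg.io.FileIO;

// Hypothetical test-only factory: overrides the FileIOFactory shown above so the catalog
// receives a pre-configured fake (e.g. an in-memory FileIO) instead of one created via
// dynamic classloading.
class FakeFileIOFactory extends FileIOFactory {
  private final FileIO preconfigured;

  FakeFileIOFactory(FileIO preconfigured) {
    this.preconfigured = preconfigured;
  }

  @Override
  public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
    return preconfigured; // ignore impl/properties/conf; always return the shared fake
  }
}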

Contributor Author

Added code comment to clarify

@jackye1995 (Contributor)

Looks like some CI tests are failing? Could you check? Maybe need to rebase.

@dennishuo (Contributor, Author)

@jackye1995 Thanks for the heads up! Looks like merging to head fixed it.
