
Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables #6428

Merged (23 commits) on Jan 14, 2023

Conversation

dennishuo (Contributor)

This read-only implementation of the Catalog interface, initially built on top of the Snowflake JDBC driver for the connection layer, enables engines like Spark that use the Iceberg Java SDK to consume Snowflake-managed Iceberg tables via the Iceberg Catalog interfaces.

Example, assuming a Snowflake account with a database iot_data containing a schema public and a managed Iceberg table sensor_test_results:

spark-shell --conf spark.sql.catalog.snowlog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.snowlog.catalog-impl=org.apache.iceberg.snowflake.SnowflakeCatalog \
    --conf spark.sql.catalog.snowlog.uri="jdbc:snowflake://$ACCOUNT.snowflakecomputing.com" \
    ....
scala> spark.sessionState.catalogManager.setCurrentCatalog("snowlog");
scala> spark.sql("show namespaces in iot_data").show(false);
scala> spark.sql("select * from iot_data.public.sensor_test_results limit 10").show(false);

Note that the involvement of a JDBC driver is only incidental, and the functionality differs from the JdbcCatalog: here, Snowflake itself manages the manifest/metadata files and table/snapshot metadata, while this catalog layer coordinates metadata-file locations and discovers the latest table snapshot versions without resorting to file listing or "directory-name" listing (for listTables or listNamespaces) the way the HadoopCatalog does.
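For illustration, here is a minimal sketch of equivalent usage through the Catalog API directly, outside Spark. This is not code from the PR; the "uri" property key mirrors the spark-shell config above, and the account placeholder is hypothetical:

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.snowflake.SnowflakeCatalog;

public class SnowflakeCatalogExample {
  public static void main(String[] args) {
    // "uri" mirrors spark.sql.catalog.snowlog.uri above; auth options for the
    // underlying Snowflake JDBC driver would be supplied as properties the same way.
    SnowflakeCatalog catalog = new SnowflakeCatalog();
    catalog.initialize(
        "snowlog", Map.of("uri", "jdbc:snowflake://<account>.snowflakecomputing.com"));

    // Read-only access to a Snowflake-managed Iceberg table.
    Table table =
        catalog.loadTable(TableIdentifier.of("iot_data", "public", "sensor_test_results"));
    System.out.println(table.currentSnapshot());
  }
}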

sfc-gh-dhuo and others added 3 commits December 12, 2022 21:50
…#1)

Initial read-only Snowflake Catalog implementation built on top of the Snowflake JDBC driver,
providing support for basic listing of namespaces, listing of tables, and loading/reads of tables.

Auth options are passed through to the JDBC driver.

Co-authored-by: Maninder Parmar <maninder.parmar@snowflake.com>
Co-authored-by: Maninder Parmar <maninder.parmar+oss@snowflake.com>
Co-authored-by: Dennis Huo <dennis.huo+oss@snowflake.com>
Add JdbcSnowflakeClientTest using mocks; provides full coverage of JdbcSnowflakeClient
and entities' ResultSetHandler logic.

Also update the target Spark runtime versions to be included.
@danielcweeks danielcweeks self-requested a review December 15, 2022 17:44
consistency and future interoperability with inheriting from abstract unit-test base classes.
@nastra (Contributor) left a comment

Thanks for the PR @dennishuo, great to have this. I did a more thorough review and my comments are inline.

- Convert unittests to all use assertj/Assertions for "fluent assertions"
- Refactor test injection into overloaded initialize() method
- Add test cases for close() propagation
- Use CloseableGroup.
SnowflakeTableOperations class itself, add test case.
SnowflakeClient/JdbcSnowflakeClient layers and merge SnowflakeTable
and SnowflakeSchema into a single SnowflakeIdentifier that also
encompasses ROOT and DATABASE level identifiers.

A SnowflakeIdentifier thus functions like a type-checked/constrained
Iceberg TableIdentifier, and eliminates any tight coupling between
a SnowflakeClient and Catalog business logic.

Parsing of Namespace numerical levels into a SnowflakeIdentifier
is now fully encapsulated in NamespaceHelpers so that callsites
don't duplicate namespace-handling/validation logic.
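For illustration, a hedged sketch of the namespace-level mapping described above; the class and method names here are hypothetical simplifications, and the real logic lives in NamespaceHelpers and SnowflakeIdentifier:

import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical sketch: map Iceberg Namespace levels onto the ROOT/DATABASE/SCHEMA/TABLE
// identifier types described above.
final class NamespaceMappingSketch {
  enum IdentifierType { ROOT, DATABASE, SCHEMA, TABLE }

  static IdentifierType typeOf(Namespace namespace) {
    switch (namespace.levels().length) {
      case 0:
        return IdentifierType.ROOT; // empty namespace -> account root
      case 1:
        return IdentifierType.DATABASE; // e.g. ["iot_data"]
      case 2:
        return IdentifierType.SCHEMA; // e.g. ["iot_data", "public"]
      default:
        throw new IllegalArgumentException(
            "Snowflake namespaces are at most database.schema: " + namespace);
    }
  }

  static IdentifierType typeOf(TableIdentifier identifier) {
    // A table identifier must sit directly under a database.schema namespace.
    if (typeOf(identifier.namespace()) != IdentifierType.SCHEMA) {
      throw new IllegalArgumentException(
          "Snowflake tables must live under database.schema: " + identifier);
    }
    return IdentifierType.TABLE;
  }

  private NamespaceMappingSketch() {}
}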
@dennishuo (Contributor, Author)

@nastra Thanks for the thorough review and suggestions! Finished applying all your suggestions, including fully converting to assertj/Assertions and refactoring out the Namespace<->SnowflakeIdentifier parsing to better encapsulate all the argument-checking/parsing into one place.

@dennishuo dennishuo requested review from nastra and removed request for danielcweeks December 17, 2022 03:55
@danielcweeks danielcweeks self-requested a review December 17, 2022 19:43
@dennishuo (Contributor, Author)

Interesting, not sure why it removed @danielcweeks when I re-requested review from @nastra (definitely wasn't intentional -- as far as I can tell my repository permissions don't even provide the ability to remove reviewer requests). I wonder if this is possibly another manifestation of community/community#8939

@dennishuo dennishuo requested review from danielcweeks and removed request for rdblue January 7, 2023 18:09
@danielcweeks (Contributor)

@dennishuo #6538 is merged, so you might want to rebase on top of that. I'm good, but it would be great to have @nastra sign off as well since there are a couple of comments still open.

@dennishuo (Contributor, Author)

Thanks for the note! Successfully merged main. I'll await @nastra's review.

@dennishuo dennishuo requested review from nastra and removed request for danielcweeks January 9, 2023 23:39
@nastra (Contributor) left a comment

Thanks @dennishuo for working on this. I've commented on a few nits but overall this looks good to me.
It would also be good to have @rdblue review/approve this.

@nastra nastra requested a review from rdblue January 10, 2023 15:25
@jackye1995 jackye1995 self-requested a review January 12, 2023 23:54
* snowflakeLocation is a known non-compatible path syntax but fails to match the expected path
* components for a successful translation.
*/
public static String snowflakeLocationToIcebergLocation(String snowflakeLocation) {
Contributor

I see Azure and GCP paths converted, what about the S3 ones? Are the variants like s3a, s3n, etc. all natively compatible in Snowflake?

Contributor Author

Right, in situations where Snowflake handles paths coming from externally produced sources, it tracks a canonical form of them and indeed handles s3a and s3n.

Here, the conversion is for outbound paths produced by Snowflake, where the "s3://" prefix is already used natively, so only Azure and GCS need translation to the "standard" schemes accepted by e.g. HadoopFileSystem.
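For illustration, a hedged sketch of that outbound translation; the scheme mappings shown are assumptions, not copied from the PR:

// Hypothetical, simplified sketch: s3:// passes through, gcs:// is rewritten to gs://,
// and Azure locations would similarly be rewritten (e.g. into a wasbs://container@account/... form).
final class LocationTranslationSketch {
  static String toIcebergLocation(String snowflakeLocation) {
    if (snowflakeLocation.startsWith("s3://")) {
      return snowflakeLocation; // S3 locations already use the standard scheme
    }
    if (snowflakeLocation.startsWith("gcs://")) {
      // Rewrite to the gs:// scheme expected by common FileSystem/FileIO implementations
      return "gs://" + snowflakeLocation.substring("gcs://".length());
    }
    throw new IllegalArgumentException("Unsupported Snowflake location: " + snowflakeLocation);
  }

  private LocationTranslationSketch() {}
}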

This does remind me though - I've been meaning to open a discussion to see if anyone has thought about maybe adding some ease-of-use hooks for last-mile automatic basic path translations right before they go to FileIO resolution. For example, someone might want s3://somebucket paths to be rewritten to a viewfs:// base path before letting HadoopFileIO automatically delegate to the right ViewFs impl. Or it could come up in cases where manifest files hold s3:// paths but someone wants everything to go through a corresponding dbfs:// mount point. Do you know of any prior discussions along those lines, and would it be worth opening an issue for broader input?

Contributor

Thanks for the clarification. I approved the changes; very excited to see this from Snowflake.

> I've been meaning to open a discussion to see if anyone has thought about maybe adding some ease-of-use hooks for last-mile automatic basic path translations right before they go to FileIO resolution.

For HadoopFileIO, I think this is not really needed because we typically see users have already configured HDFS settings to map schemes to whatever file system implementations they would like to use.

I think ResolvingFileIO already does this kind of translation to some extent; maybe we can extend its functionality on that front.


static class FileIOFactory {
  public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
    return CatalogUtil.loadFileIO(impl, properties, hadoopConf);
  }
}
Contributor

This factory seems odd since it has no state that is passed through to the loadFileIO method. Couldn't this just call loadFileIO directly?

Contributor Author

This was just extracted to preserve the flow of unit-test setup: it lets the test return the pre-configured InMemoryFileIO fake instance rather than having dynamic classloading produce an empty one. The alternative would have been to introduce some static global state in InMemoryFileIO, but that seems prone to causing cross-test-case problems.

I could add a comment or annotation here to explain its existence if that would make it cleaner.
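For illustration, a hedged sketch (class name hypothetical, not the actual test code) of how that seam lets a test hand back a pre-configured fake instead of a reflectively loaded FileIO:

import java.util.Map;
import org.apache.iceberg.io.FileIO;

// Hypothetical test-only factory: overrides the FileIOFactory shown above so the catalog
// receives a pre-configured fake (e.g. an in-memory FileIO) instead of one created via
// dynamic classloading.
class FakeFileIOFactory extends FileIOFactory {
  private final FileIO preconfigured;

  FakeFileIOFactory(FileIO preconfigured) {
    this.preconfigured = preconfigured;
  }

  @Override
  public FileIO newFileIO(String impl, Map<String, String> properties, Object hadoopConf) {
    return preconfigured; // ignore impl/properties/conf; always return the shared fake
  }
}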

Contributor Author

Added code comment to clarify

@jackye1995 (Contributor)

Looks like some CI tests are failing? Could you check? Maybe need to rebase.

@dennishuo (Contributor, Author)

@jackye1995 Thanks for the heads up! Looks like merging to head fixed it.
