[HUDI-4137] SnowflakeSyncTool MVP implementation to integrate with Snowflake#5659
[HUDI-4137] SnowflakeSyncTool MVP implementation to integrate with Snowflake#5659vingov wants to merge 3 commits intoapache:masterfrom
Conversation
223068c to
518decf
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
Really good addition to Hudi. Keep up the good contributions :)
hudi-snowflake/src/main/java/org/apache/hudi/snowflake/sync/SnowflakeSyncTool.java
Outdated
Show resolved
Hide resolved
hudi-snowflake/src/main/java/org/apache/hudi/snowflake/sync/SnowflakeSyncConfig.java
Outdated
Show resolved
Hide resolved
hudi-snowflake/src/main/java/org/apache/hudi/snowflake/sync/SnowflakeSyncConfig.java
Outdated
Show resolved
Hide resolved
hudi-snowflake/src/main/java/org/apache/hudi/snowflake/sync/SnowflakeSyncTool.java
Show resolved
Hide resolved
hudi-snowflake/src/main/java/org/apache/hudi/snowflake/sync/SnowflakeSyncTool.java
Show resolved
Hide resolved
hudi-snowflake/src/main/java/org/apache/hudi/snowflake/sync/SnowflakeSyncTool.java
Show resolved
Hide resolved
|
@vinothchandar : You may want to review this sometime. |
|
@prasannarajaperumal : Can you review this when you can. |
|
@nsivabalan - all the requested changes have been done, please review it again, thanks! |
699d684 to
5b8c6c5
Compare
prasannarajaperumal
left a comment
There was a problem hiding this comment.
Left comments on some changes. Also overall I think the naming of the methods should match the underlying snowflake terminology.
| <project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0" | ||
| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> | ||
| <parent> | ||
| <artifactId>hudi</artifactId> |
There was a problem hiding this comment.
Why is this a top-level module? Shouldn't this be under hudi-sync?
| <!-- Snowflake --> | ||
| <dependency> | ||
| <groupId>com.snowflake</groupId> | ||
| <artifactId>snowpark</artifactId> |
There was a problem hiding this comment.
I dont understand why we need snowpark here?
| <excludes> | ||
| <exclude>junit:junit</exclude> | ||
| <exclude>com.google.code.findbugs:*</exclude> | ||
| <exclude>org.apache.hbase:*</exclude> |
There was a problem hiding this comment.
why exclude these packages?
| } | ||
| } | ||
|
|
||
| private void syncCoWTable(HoodieSnowflakeSyncClient snowSyncClient) { |
There was a problem hiding this comment.
Can you please add some java docs on the steps to sync a CoW table? It will be easy to follow.
| .noDefaultValue() | ||
| .withDocumentation("Name of the target table in Snowflake"); | ||
|
|
||
| public static final ConfigProperty<String> SNOWFLAKE_SYNC_SYNC_BASE_PATH = ConfigProperty |
There was a problem hiding this comment.
Did you intend to have SYNC_SYNC in the name? Can it just be SNOWFLAKE_SYNC_BASE_PATH?
| query += " WITH LOCATION = @" + stageName | ||
| + " FILE_FORMAT = (TYPE = " + config.getString(SNOWFLAKE_SYNC_SYNC_BASE_FILE_FORMAT) + ")" | ||
| + " PATTERN = '.*[.]" + config.getString(SNOWFLAKE_SYNC_SYNC_BASE_FILE_FORMAT).toLowerCase() + "'" | ||
| + " AUTO_REFRESH = false"; |
There was a problem hiding this comment.
Make auto-refresh option configurable?
| } | ||
| } | ||
|
|
||
| public void createVersionsTable(String stageName, String tableName, String partitionFields, String partitionExtractExpr) { |
There was a problem hiding this comment.
I think we are imposing the naming convention of Gcs here on to snowflake? I dont understand why this external table is called versions table?
| @Override | ||
| public Option<String> getLastCommitTimeSynced(final String tableName) { | ||
| // snowflake doesn't support tblproperties, so do nothing. | ||
| throw new UnsupportedOperationException("Not support getLastCommitTimeSynced yet."); |
There was a problem hiding this comment.
Can we try storing this as tags on the external table we create?
https://docs.snowflake.com/en/sql-reference/sql/create-tag.html
|
|
||
| public void createManifestTable(String stageName, String tableName) { | ||
| try { | ||
| String query = "CREATE OR REPLACE EXTERNAL TABLE " + tableName + " (" |
There was a problem hiding this comment.
It is better for the list of files to be an internal snowflake table (not an external table). This will unlock compiler optimizations in snowflake for the sub-query on the snapshot view. Plus, This should be a fairly simple to do this using copy into statement in snowflake.
| + " file_format => '" + fileFormatName + "'" | ||
| + " )" | ||
| + ")"; | ||
| Optional<Row> row = snowflakeSession.sql(query).first(); |
There was a problem hiding this comment.
Use JDBC to get the value. Including snowpark for this seems excessive.
|
@vingov : may I know when are you planning to address the comments. |
|
Hey everyone. How is this amazing work going? We are planing to migrate our all historical data into Lakehouse using Hudi. Looking forward for your replay! Thanks! |
|
Apache XTable™ enables existing Hudi tables to seamlessly work with Snowflake. Closing this one. |
What is the purpose of the pull request
This pull request adds the Snowflake Sync feature, this is a requirement to read Hudi tables on the Snowflake warehouse.
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.